Computational Statistics & Data Analysis 3 (1986) 241-249 North-Holland
A simple test for the Behrens-Fisher problem
Andrzej MATUSZEWSKI and David SOTRES
Colegio de Postgraduados, CEC, Chapingo, Mexico 56230
Received 3 February 1983
Revised 8 December 1985
Abstract: A simple test for the Behrens-Fisher problem is introduced. It is based on comparing the bounds of 80% confidence intervals for the population means. The test is compared with the standard tests for the Behrens-Fisher problem and behaves remarkably well. The proposed test is also practical and easy to interpret.
Keywords: Confidence interval, Behrens-Fisher problem, Robustness, Monte-Carlo study.
1. Introduction
It is desired to test the null hypothesis $H_0: \mu_1 = \mu_2$ against the alternative $H_a: \mu_1 < \mu_2$. For this, we are given $\{x_{11}, x_{12}, \ldots, x_{1n_1}\}$ and $\{x_{21}, x_{22}, \ldots, x_{2n_2}\}$, which are the observed values of two random samples independently drawn from the normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ respectively. This is known as the Behrens-Fisher (BF) problem. Various tests for this problem have been developed by Fisher [3,4], Welch [13], Aspin [1], Cochran and Cox [2], Wald [12], Pagurova [9] and Lee and Gurland [8]. Following the standard notation (see e.g. [8]), the critical regions corresponding to these tests have the general form

$$v > V(C), \tag{1.1}$$

where

$$v = \frac{\bar{x}_2 - \bar{x}_1}{\left(s_1^2/n_1 + s_2^2/n_2\right)^{1/2}}, \qquad C = \frac{s_1^2/n_1}{s_1^2/n_1 + s_2^2/n_2},$$
$\bar{x}_1$ and $\bar{x}_2$ are the sample means and $s_1^2$, $s_2^2$ are the usual sample variances. The function $V(C)$ in fact also depends on $n_1$, $n_2$ and the nominal significance level $\alpha$, but as usual we omit explicit mention of them in the notation.

The purpose of this paper is to present a simple test for the BF problem (see Section 2), based on 80% confidence intervals (CI) for the population means $\mu_1$ and $\mu_2$. In Section 3, the power of this test is approximated and compared with that of the Welch-Aspin test. For computing the size and power, a technique proposed by Lee and Gurland (1975) [8] is used. This technique yields a high degree of accuracy and is applicable to any test of the form (1.1). Finally, the results of a Monte-Carlo study on the actual size of the Welch-Aspin test and the proposed test under different departures from the normality assumption are presented.

0167-9473/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)
2. Proposed test
The proposed test will be motivated by two particular cases of the BF problem for which there exist, respectively, the uniformly most powerful (UMP) test and the UMP unbiased test for testing $H_0$ against $H_a$.

Problem A. Consider the BF problem with the assumption that $\sigma_1^2$ and $\sigma_2^2$ are known. In this situation, it is well known that the UMP test for testing $H_0$ against $H_a$ at the level $\alpha$ is to reject $H_0$ if

$$\bar{x}_2 - \bar{x}_1 > C_\alpha, \tag{2.1}$$

where $C_\alpha$ is a given constant. It is not difficult to verify that this test is equivalent to the test which rejects when the upper bound of the $100\gamma\%$ CI for $\mu_1$, $(\bar{x}_1 \pm \sigma_1 n_1^{-1/2} z_{(1+\gamma)/2})$, is smaller than the lower bound of the $100\gamma\%$ CI for $\mu_2$, $(\bar{x}_2 \pm \sigma_2 n_2^{-1/2} z_{(1+\gamma)/2})$, choosing the value of

$$\gamma = 2\Phi\left[z_{1-\alpha}\sqrt{1+R^2}/(1+R)\right] - 1,$$

where $\Phi$ is the cumulative distribution function of the standard normal random variable, $z_\beta = \Phi^{-1}(\beta)$ and $R = \sigma_1 n_1^{-1/2}/\sigma_2 n_2^{-1/2}$. That is, the UMP test for testing $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 < \mu_2$ (where $\sigma_1$ and $\sigma_2$ are known) rejects $H_0$ when the usual CI's for $\mu_1$ and $\mu_2$ do not overlap. Also note that

$$\alpha = f(\gamma, R) = 1 - \Phi\left[z_{(1+\gamma)/2}(1+R)/\sqrt{1+R^2}\right].$$
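The relation between the confidence level and the size in Problem A can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function names `alpha_from_gamma` and `gamma_from_alpha` are illustrative choices, not notation from the paper.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # cdf plays the role of Phi, inv_cdf of Phi^{-1}


def alpha_from_gamma(gamma: float, R: float) -> float:
    """Size of the non-overlap test of Problem A for confidence level gamma.

    R is the ratio of the standard errors of the two sample means,
    (sigma1 / sqrt(n1)) / (sigma2 / sqrt(n2)).
    """
    z = _STD_NORMAL.inv_cdf((1.0 + gamma) / 2.0)
    return 1.0 - _STD_NORMAL.cdf(z * (1.0 + R) / sqrt(1.0 + R * R))


def gamma_from_alpha(alpha: float, R: float) -> float:
    """Confidence level for which the non-overlap test has size alpha."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return 2.0 * _STD_NORMAL.cdf(z * sqrt(1.0 + R * R) / (1.0 + R)) - 1.0


# Sizes obtained with 80% intervals for a few ratios R
for R in (1, 2, 3, 5):
    print(R, round(alpha_from_gamma(0.80, R), 3))
```

With $\gamma = 0.80$ the computed sizes stay between roughly 0.035 (at $R = 1$) and 0.10 (as $R$ tends to 0 or infinity), which is the slight dependence on $R$ discussed in the text.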
This means that, for the above test, the significance level $\alpha$ can be specified by the confidence level $\gamma$. It seems that for some practical problems this can be more convenient than the usual approach of fixing $\alpha$. Now, since $\alpha$ is not only a function of $\gamma$ but also depends on $R$, it seems relevant to examine the influence of $R$ on $\alpha$ for a fixed $\gamma$, because a big influence would be inconvenient. Fixing $\gamma = 80\%$, the results in Table 1 were obtained for different values of $R$. From these results it is clear that the influence of $R$ on $\alpha$ is slight. (In fact, it can be shown that the minimum $\alpha = 0.035$ is attained at $R = 1$ and the maximum $\alpha = 0.1$ is attained at $R = 0$ and $R = \infty$.) But, more important, the examination of these values shows that for $\gamma = 80\%$ the size of the test $\alpha$ varies around 5%, which is the most frequently used significance level. This suggests the main idea of the paper.

Table 1
Significance level for proposed test applied to problem A

R = \sigma_1 n_1^{-1/2} / \sigma_2 n_2^{-1/2}    Significance level
1                                                0.035
2                                                0.043
3                                                0.053
5                                                0.066

The proposed procedure (p-test) for the BF problem is to construct the standard 80% CI's for $\mu_1$ and $\mu_2$, i.e. $\bar{x}_1 \pm t_{0.9}(n_1 - 1) s_1 n_1^{-1/2}$ and $\bar{x}_2 \pm t_{0.9}(n_2 - 1) s_2 n_2^{-1/2}$ respectively, and reject $H_0$ with nominal significance level of 5% if and only if

$$\bar{x}_1 + t_{0.9}(n_1 - 1) s_1 n_1^{-1/2} < \bar{x}_2 - t_{0.9}(n_2 - 1) s_2 n_2^{-1/2}, \tag{2.2}$$
where $t_\beta(n)$ is the $100\beta$-percentile of a t-distribution with $n$ degrees of freedom. That is, the proposed procedure rejects $H_0$ when the standard CI's for $\mu_1$ and $\mu_2$ do not overlap.

To examine the behavior of the actual significance level of the p-test for the BF problem, it was calculated for several combinations of the sample sizes $n_1$, $n_2$ and for different values of the ratio $Q = \sigma_1/\sigma_2$. The results given in Table 2 show that the actual significance level of the p-test varies around 5%, which is the nominal significance level of the p-test. Only for very unusual values of $Q$ ($Q = 100$ or $Q = 1/100$) is the actual size close to 10%. It may be checked that the critical region (2.2) can be rewritten in the form of (1.1) with

$$V(C) = \sqrt{C}\, t_{0.9}(n_1 - 1) + \sqrt{1 - C}\, t_{0.9}(n_2 - 1).$$
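The decision rule of the p-test, and its equivalent statement in the general form $v > V(C)$, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the t-percentile is obtained with a simple Simpson-rule and bisection routine so that only the Python standard library is needed, and the names `t_cdf`, `t_quantile`, `p_test` and `p_test_vc` as well as the sample data are made up for the example.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import mean, variance


def t_cdf(x: float, df: int, steps: int = 1000) -> float:
    """Student-t CDF for x >= 0, by Simpson's rule on the density over [0, x]."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)

    def pdf(u: float) -> float:
        return exp(log_c - (df + 1) / 2 * log(1 + u * u / df))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_quantile(p: float, df: int) -> float:
    """100p-percentile of the t-distribution (0.5 < p < 1), by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def p_test(x1, x2, level=0.9):
    """Return True (reject H0) when the two 80% CIs do not overlap, as in (2.2)."""
    n1, n2 = len(x1), len(x2)
    upper1 = mean(x1) + t_quantile(level, n1 - 1) * sqrt(variance(x1) / n1)
    lower2 = mean(x2) - t_quantile(level, n2 - 1) * sqrt(variance(x2) / n2)
    return upper1 < lower2


def p_test_vc(x1, x2, level=0.9):
    """The same decision written as v > V(C), the general form (1.1)."""
    n1, n2 = len(x1), len(x2)
    a, b = variance(x1) / n1, variance(x2) / n2
    v = (mean(x2) - mean(x1)) / sqrt(a + b)
    c = a / (a + b)
    v_crit = sqrt(c) * t_quantile(level, n1 - 1) + sqrt(1 - c) * t_quantile(level, n2 - 1)
    return v > v_crit


sample1 = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4]  # made-up data, for illustration only
sample2 = [6.9, 7.4, 6.1, 7.8, 7.0, 6.6]
print(p_test(sample1, sample2), p_test_vc(sample1, sample2))
```

The two functions should always agree, since dividing both sides of the non-overlap inequality by $(s_1^2/n_1 + s_2^2/n_2)^{1/2}$ turns one form into the other.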
This form is convenient for computing the size and power of the p-test. Another motivation for the proposed procedure is the following.
Table 2
Actual significance level for the proposed test applied to the Behrens-Fisher problem (nominal significance level = 0.05)

n1  n2   Q = \sigma_1/\sigma_2
         1       2       3       5       1/2     1/3     1/5     100     1/100
 6   6   0.0348  0.0425  0.0522  0.0655  0.0425  0.0522  0.0655  0.0978  0.0978
 9   9   0.0349  0.0427  0.0524  0.0657  0.0427  0.0524  0.0657  0.0978  0.0978
 6  12   0.0375  0.0521  0.0627  0.0746  0.0364  0.0442  0.0554  0.0985  0.0967
 9  18   0.0373  0.0516  0.0623  0.0743  0.0367  0.0434  0.0560  0.0985  0.0968
 6  18   0.0410  0.0578  0.0681  0.0788  0.0350  0.0389  0.0498  0.0988  0.0959
 9  27   0.0406  0.0572  0.0676  0.0784  0.0351  0.0393  0.0505  0.0988  0.0960
Problem B. Consider the BF problem with the assumption that the variances are unknown but $\sigma_1^2 = \sigma_2^2$. It can be checked, in this situation, that there exists a confidence level $\gamma$ such that the UMP unbiased test for testing $H_0$ against $H_1$ at the level $\alpha$ rejects $H_0$ if and only if

$$\bar{x}_1 + t_\gamma(n_1 + n_2 - 2)\, s_p n_1^{-1/2} < \bar{x}_2 - t_\gamma(n_1 + n_2 - 2)\, s_p n_2^{-1/2},$$

where

$$s_p^2 = [(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2]/(n_1 + n_2 - 2).$$

That is, the UMP unbiased test for testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$ ($\sigma_1 = \sigma_2$) rejects $H_0$ when the CI's for $\mu_1$ and $\mu_2$ do not overlap.
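The existence of such a confidence level can be checked with a short calculation. The sketch below takes the UMP unbiased test to be the usual one-sided pooled-variance t-test and writes $m = n_1 + n_2 - 2$; the symbol $m$ is introduced here only for brevity. The pooled t-test rejects when

$$\frac{\bar{x}_2 - \bar{x}_1}{s_p\,(1/n_1 + 1/n_2)^{1/2}} > t_{1-\alpha}(m),$$

while the non-overlap condition can be rearranged as

$$\bar{x}_2 - \bar{x}_1 > t_\gamma(m)\, s_p\,(n_1^{-1/2} + n_2^{-1/2}).$$

The two rejection regions coincide when the multiplier satisfies

$$t_\gamma(m) = t_{1-\alpha}(m)\, \frac{(1/n_1 + 1/n_2)^{1/2}}{n_1^{-1/2} + n_2^{-1/2}}.$$

For example, with equal sample sizes $n_1 = n_2$ the ratio on the right equals $1/\sqrt{2}$, so the intervals must be built with the multiplier $t_{1-\alpha}(m)/\sqrt{2}$.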
3. Power comparison

Several authors have shown that the Welch-Aspin (WA) test behaves remarkably well. Recent references are: Golhar [5], Pfanzagl [11] and Lee and Gurland [8]. We also considered the UMP test (see e.g. [8]) for testing $H_0$ against $H_1$. This test assumes $Q = \sigma_1/\sigma_2$ to be known, which in most cases is not realistic. We have performed a computational study comparing the power of the above two tests with that of the proposed test. The main results are shown in Table 3. In order to explain the meaning of the entries in Table 3, it must be taken into account that the actual size of the p-test is not exactly 0.05 but only around this value, as we have seen in Section 2. To describe how we obtained the same size for the WA test as for the p-test, we must recall that the form of the WA test depends on a parameter $\xi$ (see [1]) which is determined by the nominal size of this test.
Table 3
Ratios of the power of the Welch-Aspin (WA) test and of the proposed (P) test to the power of the UMP test for different sample sizes and different $Q = \sigma_1/\sigma_2$. In all cases $\delta = 1$. Actual sizes of the WA and UMP tests were matched to that of the p-test

            Sample sizes: n1, n2
            7,7             7,11            7,21            11,11
log Q       WA     P        WA     P        WA     P        WA     P
-0.5                        0.987  0.998    0.999  0.996
-0.4                        0.990  0.999    0.997  0.991
-0.3                        0.997  0.999    0.988  0.985
-0.2                        0.999  0.997    0.971  0.978
-0.1                        0.994  0.995    0.952  0.970
 0.0        0.999  0.999    0.980  0.989    0.936  0.964    1.000  0.999
 0.1        0.993  0.997    0.962  0.986    0.925  0.962    0.995  0.999
 0.2        0.980  0.995    0.949  0.976    0.918  0.953    0.989  0.997
 0.3        0.963  0.991    0.941  0.972    0.919  0.951    0.981  0.994
 0.4        0.957  0.987    0.938  0.970    0.922  0.950    0.975  0.992
 0.5        0.953  0.984    0.938  0.965    0.925  0.948    0.972  0.989
Mean        0.974  0.992    0.970  0.986    0.950  0.968    0.985  0.995
Std. dev.   0.019  0.006    0.025  0.013    0.033  0.017    0.011  0.004
That nominal size is not equal to the actual one. Therefore, to match the actual size of the WA test to that of the p-test, we performed a bisection procedure based on $\xi$. It was easier to match the actual size of the UMP test to that of the p-test because the nominal size of the former is exactly the actual size. The entries in the table are the ratios of the powers of the WA test and the p-test, respectively, to that of the UMP test. Of course, these ratios depend on $Q$, $n_1$, $n_2$ and the noncentrality parameter

$$\delta = (\mu_2 - \mu_1)/(\sigma_1^2/n_1 + \sigma_2^2/n_2)^{1/2}.$$
We think that the chosen values of the parameters are representative of practical situations. We did not perform computations for more combinations of these parameters because the cost involved would be prohibitive. The values of $\log Q$, $n_1$, $n_2$ are as shown in Table 3, and $\delta = 1$ throughout the table. We also considered the case $\delta = 2$, but the results are very similar.

Table 3 shows an important superiority in the power of the p-test over the WA test. First, for each combination of sample sizes the average power of the p-test is much closer to the power of the UMP test than to the power of the WA test. Second, the dispersion of the ratios for the proposed test is much smaller than for the WA test. For the few values of $\log Q$ at which the power of the WA test is higher than that of the p-test, one can notice that the standard deviations of the two sample means are very close. This shows a special feature of the WA test, which is 'oriented' toward this specific situation. Also note that the size for each combination of ($\log Q$, $n_1$, $n_2$) is different. The smallest (0.0348) was for the combination (0, 7, 7) and the biggest (0.0691) was for (1, 7, 21). Nevertheless, the behavior of the ratios is not affected very much by this.

In what follows, the computational aspects of the above power comparison are discussed. To compute the power of tests of the form (1.1), we used formula (2.5) of Lee and Gurland [8]. To compute the size, the same formula was used with some convenient reduction. As proposed in that paper, we approximated the integral appearing in the formula by Romberg's method. To do this we used the double precision subroutine DQART (see [6]). For the parameters EPS and NDIM required by this subroutine, we used the values $10^{-7}$ and 50, respectively. This allowed us to obtain an accuracy of $10^{-7}$ for each integral used to compute the entries of Table 3.
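For readers without access to the subroutine library cited above, Romberg's method itself is easy to sketch. The following Python version is only an illustration of the technique (trapezoidal rule refined by Richardson extrapolation); the function name `romberg`, the stopping rule and the default of 20 rows are assumptions of this sketch and do not reproduce the cited routine.

```python
from math import exp, pi, sin, sqrt


def romberg(f, a, b, eps=1e-7, ndim=20):
    """Integrate f over [a, b] by Romberg's method.

    eps and ndim play roles similar to the EPS and NDIM parameters
    mentioned in the text (target accuracy and maximum number of rows).
    The stopping rule used here, comparing successive diagonal entries,
    is a simple assumption of this sketch.
    """
    h = b - a
    prev = [0.5 * h * (f(a) + f(b))]  # one-panel trapezoidal rule
    panels = 1
    for _ in range(1, ndim):
        # Halve the step: reuse the old sum and add the new midpoints
        h /= 2
        mids = sum(f(a + (2 * k + 1) * h) for k in range(panels))
        row = [0.5 * prev[0] + h * mids]
        panels *= 2
        # Richardson extrapolation along the row
        for j in range(1, len(prev) + 1):
            row.append(row[j - 1] + (row[j - 1] - prev[j - 1]) / (4 ** j - 1))
        if abs(row[-1] - prev[-1]) < eps:
            return row[-1]
        prev = row
    return prev[-1]


# Example: area under the standard normal density between 0 and z_{0.9}
normal_pdf = lambda x: exp(-x * x / 2) / sqrt(2 * pi)
print(romberg(normal_pdf, 0.0, 1.2815515655446004))
```

For smooth integrands the extrapolated diagonal converges quickly, so only a handful of rows are typically needed to reach a tolerance of this order.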
It must be noted, however, that Lee and Gurland's formula includes in the integrand a truncated series, which leads to another computational problem. We used different numbers of terms of that series, and the T-transform suggested by Lee and Gurland, to obtain sufficiently good approximations. As an additional check, we compared the computed powers with those computed (using a different formula) by Golhar [5].

Recall that for comparing power it was necessary to match the actual size of the WA test to that of the proposed test. As noted earlier in this section, a bisection procedure was used for this problem. We stopped this procedure after obtaining a difference in sizes of less than $10^{-7}$.
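A size-matching bisection of this kind can be sketched as follows. The example does not implement the WA test; as a stand-in it uses the simpler known-variance setting of Problem A, where the size is a decreasing function of the critical multiplier and a closed form is available for checking. The names `match_size` and `size_fn` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def match_size(size_fn, target, lo, hi, tol=1e-7, max_iter=200):
    """Bisection for a parameter value whose actual size equals `target`.

    Assumes size_fn is continuous and decreasing on [lo, hi], with
    size_fn(lo) >= target >= size_fn(hi).
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        size = size_fn(mid)
        if abs(size - target) < tol:
            return mid
        if size > target:
            lo = mid  # size too large: move to a stricter critical value
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Stand-in example (Problem A, known variances): find the multiplier z of
# the two confidence intervals so that the non-overlap test has size 0.05.
R = 2.0
size_fn = lambda z: 1.0 - _NORMAL.cdf(z * (1.0 + R) / sqrt(1.0 + R * R))
z_found = match_size(size_fn, target=0.05, lo=0.0, hi=5.0)
z_exact = _NORMAL.inv_cdf(0.95) * sqrt(1.0 + R * R) / (1.0 + R)
print(z_found, z_exact)
```

The same loop applies to any test whose actual size can be computed as a monotone function of a single tuning parameter, with the stopping tolerance set on the difference in sizes as described above.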
Table 4
Actual significance level of Welch-Aspin test for $n_1 = n_2 = 7$, nominal size = 0.05, and $C = \sigma_1^2/(\sigma_1^2 + \sigma_2^2)$

C             0.1      0.2      0.3      0.4      0.5
Actual size   0.05003  0.05010  0.05006  0.04998  0.04994
4. Robustness study on the actual size
It is well known (see e.g. [8] and [5]) that the WA test performs very well in keeping the actual size close to the nominal one. Table 4 is taken from [5]. Comparing Tables 2 and 4, which give the actual significance level of the p-test and the WA test respectively, it is clear that, in controlling the actual size, the WA test is much better than the p-test when the underlying normality assumption is fulfilled. However, in practice, the normality assumption is almost never exactly satisfied, so that the excellent behavior of the actual size of the WA test under normality may not be very important. We have performed a Monte-Carlo study to evaluate the robustness of the actual size of the two tests under typical departures from the normality assumption of the parent populations. The results of the study, given in Tables 5 and 6 below, show that the robustness of the actual size of both tests is similar under certain non-normal distributions.

In what follows, the details of the Monte-Carlo study mentioned above are described. Two models for departures from the normality assumption were considered. The first, called kurtosis, models flat symmetrical distributions.
Table 5
Estimated actual size of the proposed test (P) and the Welch-Aspin test (WA) for symmetrical non-normal distributions with nominal size = 5% and $n_1 = 6$, $n_2 = 12$

             R = \sigma_1/\sigma_2
             1               3               5               1/3             1/5
\beta_2      P       WA      P       WA      P       WA      P       WA      P       WA
 1.8         0.0419  0.0552  0.0675  0.0563  0.0788  0.0564  0.0409  0.0520  0.0533  0.0508
 2.2         0.0359  0.0527  0.0601  0.0523  0.0704  0.0523  0.0422  0.0486  0.0539  0.0478
 2.6         0.0411  0.0477  0.0675  0.0467  0.0787  0.0476  0.0429  0.0494  0.0539  0.0518
 3.0         0.0345  0.0495  0.0601  0.0499  0.0732  0.0504  0.0416  0.0512  0.0546  0.0503
10.0         0.0355  0.0424  0.0539  0.0380  0.0659  0.0377  0.0456  0.0432  0.0581  0.0468
20.0         0.0332  0.0375  0.0419  0.0331  0.0534  0.0360  0.0423  0.0427  0.0540  0.0425
30.0         0.0297  0.0346  0.0434  0.0299  0.0512  0.0264  0.0407  0.0398  0.0520  0.0380
Mean         0.0360  0.0457  0.0563  0.0437  0.0674  0.0438  0.0423  0.0467  0.0543  0.0469
Mean diff.   0.0140  0.0066  0.0105  0.0087  0.0174  0.0089  0.0077  0.0042  0.0043  0.0040
Table 6
Estimated actual size of the proposed test (P) and the Welch-Aspin test (WA) for non-symmetrical non-normal distributions with nominal size = 5% (α̂ is the estimated actual size)

                                               n1 = n2 = 15                            n1 = 12, n2 = 24
                                               P                  WA                   P                  WA
β1      β2      Distribution                   α̂       |0.05-α̂|   α̂       |0.05-α̂|    α̂       |0.05-α̂|   α̂       |0.05-α̂|
 0.65     2.13  S_B; γ = 0.533, δ = 0.5        0.0331  0.0169     0.0530  0.0030      0.0474  0.0026     0.0605  0.0105
 0.08     2.36  Beta(3, 2)                     0.0378  0.0122     0.0508  0.0008      0.0358  0.0142     0.0454  0.0046
 0.32     2.40  Beta(2, 1)                     0.0349  0.0151     0.0499  0.0001      0.0329  0.0161     0.0417  0.0083
 1.56     3.20  LC; p = 0.2, μ = 7             0.0402  0.0098     0.0513  0.0013      0.0595  0.0095     0.0781  0.0281
 0.40     3.25  Weibull; K = 2                 0.0342  0.0138     0.0514  0.0014      0.0428  0.0062     0.0604  0.0104
 0.46     4.35  LC; p = 0.05, μ = 3            0.0383  0.0117     0.0488  0.0012      0.0387  0.0113     0.0581  0.0081
 0.76     5.59  S_U; γ = 1, δ = 2              0.0371  0.0129     0.0481  0.0019      0.0329  0.0161     0.0385  0.0115
 1.99     6.00  χ²; ν = 4                      0.0366  0.0144     0.0495  0.0005      0.0503  0.0003     0.0647  0.0147
 2.72     7.44  LC; p = 0.05, μ = 5            0.0341  0.0159     0.0499  0.0001      0.0508  0.0008     0.0679  0.0179
 5.86    10.36  LC; p = 0.05, μ = 7            0.0399  0.0101     0.0455  0.0045      0.0623  0.0123     0.0783  0.0283
 8.05    15.00  χ²; ν = 1                      0.0406  0.0094     0.0442  0.0058      0.0664  0.0164     0.0806  0.0306
43.80    87.72  Weibull; K = 0.5               0.0427  0.0073     0.0391  0.0109      0.0879  0.0379     0.0830  0.0330
38.20   113.94  Log-normal; μ = 0, σ = 1       0.0440  0.0060     0.0390  0.0110      0.0653  0.0153     0.0734  0.0234
                Mean                           0.0380  0.0120     0.0477  0.0033      0.0518  0.0122     0.0639  0.0176
The second, called skewness, models asymmetric distributions.

To consider a wide variety of symmetric distributions, the family of distributions proposed by Johnson, Tietjen and Beckman [7] was used. This family offers the following advantages:
(1) It includes an infinite number of symmetrical distributions with arbitrary mean and variance and kurtosis in the range $(1.8, +\infty)$.
(2) It includes the exponential power distribution and, hence, the uniform, normal and Laplace distributions as special cases.
(3) Random-variate generation is straightforward.
(4) Results of Monte-Carlo simulation studies can be conveniently organized.

The density function of this family is

$$f(x) = \frac{A}{2\sigma\Gamma(\alpha)} \int_B^{+\infty} w^{\alpha-\tau-1} \exp(-w)\, dw, \tag{4.1}$$

where $-\infty < x, \mu < +\infty$; $\alpha, \tau, \sigma > 0$; and

$$A = [\Gamma(\alpha + 2\tau)/3\Gamma(\alpha)]^{1/2}, \qquad B = [(A/\sigma)|x - \mu|]^{1/\tau}.$$
The parameters $\mu$ and $\sigma$ are location and scale parameters, respectively; $\alpha$ and $\tau$ are shape parameters. This distribution is unimodal and symmetric about $\mu$. A random variable $X$ with density (4.1) has mean $\mu$ and variance $\sigma^2$. Moreover, if $\mu = 0$ and $\sigma = 1$, then the coefficient of kurtosis $\beta_2$ is

$$\beta_2 = \frac{9\,\Gamma(\alpha + 4\tau)\,\Gamma(\alpha)}{5\,\Gamma^2(\alpha + 2\tau)}. \tag{4.2}$$
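The kurtosis formula and a variate generator for this family can be sketched as below. The generator rests on one representation whose density works out to (4.1): $X = \mu + \sigma W^{\tau} U / A$, with $W$ a standard gamma variate of shape $\alpha$ and $U$ uniform on $(-1, 1)$. Whether this is the generator the authors used is not stated, so it is offered as an assumption; the function names are illustrative.

```python
import random
from math import exp, lgamma, sqrt


def kurtosis(alpha: float, tau: float) -> float:
    """Coefficient of kurtosis beta_2 given by (4.2), computed on the log scale."""
    return 1.8 * exp(lgamma(alpha + 4 * tau) + lgamma(alpha) - 2 * lgamma(alpha + 2 * tau))


def tau_for_kurtosis(beta2: float, alpha: float = 1.5) -> float:
    """Solve kurtosis(alpha, tau) = beta2 for tau by bisection (beta2 > 1.8).

    Relies on beta_2 being increasing in tau for fixed alpha.
    """
    lo, hi = 1e-9, 5.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kurtosis(alpha, mid) < beta2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def draw(alpha: float, tau: float, mu: float = 0.0, sigma: float = 1.0, rng=random) -> float:
    """One variate with mean mu and standard deviation sigma (assumed representation)."""
    a_const = sqrt(exp(lgamma(alpha + 2 * tau) - lgamma(alpha)) / 3.0)
    w = rng.gammavariate(alpha, 1.0)   # gamma variate with shape alpha, scale 1
    u = rng.uniform(-1.0, 1.0)
    return mu + sigma * (w ** tau) * u / a_const


print(kurtosis(1.5, 0.5))      # alpha = 1.5, tau = 0.5 is the normal case
print(tau_for_kurtosis(10.0))  # tau giving beta_2 = 10 when alpha = 1.5
```

With $\alpha = 1.5$ the value $\tau = 0.5$ gives $\beta_2 = 3$, and letting $\tau$ tend to 0 gives the lower limit 1.8 of the kurtosis range.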
The study was based on two samples from populations with distributions as in (4.1) with both means equal to 0, $\alpha_1 = \alpha_2 = 1.5$ and standard deviations $\sigma_1 = 1$ and $\sigma_2 = 1/R$ ($R = 1, 3, 5, 1/3$ and $1/5$) respectively. The values of $\tau$ were chosen so that $\beta_2$ equals 1.8, 2.2, 2.6, 3, 10, 20 and 30. The actual significance levels of the p-test and the WA test were estimated using 10,000 pairs of samples with sizes $n_1 = 6$, $n_2 = 12$ respectively from the distributions described above. The results are given in Table 5.

For the second part of the study we used the non-symmetrical distributions considered in the classical Monte-Carlo experiment of Pearson, D'Agostino and Bowman [10]. These distributions are listed in Table 6, and the exact formulas of the corresponding density functions, as well as the values of $\beta_2$ and $\beta_1$ (coefficient of asymmetry), can be obtained from that paper. Here, the same distribution was used for both populations in the BF problem. The values of $\beta_1$ and $\beta_2$ considered are those listed in Table 6.
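A Monte-Carlo estimate of the actual size of the p-test along these lines can be sketched as follows. This is not the authors' program: the samplers, the seed and the helper names are assumptions made for illustration, the WA test is not implemented, and the t-percentile is computed with a small standard-library routine.

```python
import random
from math import exp, lgamma, log, pi, sqrt


def t_quantile(p, df, steps=400):
    """t percentile (0.5 < p < 1) by bisection on a Simpson-rule CDF."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    pdf = lambda u: exp(log_c - (df + 1) / 2 * log(1 + u * u / df))

    def cdf(x):
        h = x / steps
        s = pdf(0.0) + pdf(x) + sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, steps))
        return 0.5 + s * h / 3

    lo, hi = 0.0, 20.0
    for _ in range(50):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2


def mean_and_se(x):
    """Sample mean and estimated standard error of the mean."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / (n - 1)
    return m, sqrt(s2 / n)


def estimate_size(draw1, draw2, n1, n2, reps=10_000, seed=0):
    """Rejection rate of the p-test when both populations have the same mean."""
    rng = random.Random(seed)
    t1, t2 = t_quantile(0.9, n1 - 1), t_quantile(0.9, n2 - 1)
    rejections = 0
    for _ in range(reps):
        m1, se1 = mean_and_se([draw1(rng) for _ in range(n1)])
        m2, se2 = mean_and_se([draw2(rng) for _ in range(n2)])
        rejections += (m1 + t1 * se1) < (m2 - t2 * se2)  # 80% CIs do not overlap
    return rejections / reps


# Illustrative parents with mean 0 and unit variance
normal = lambda rng: rng.gauss(0.0, 1.0)
uniform = lambda rng: rng.uniform(-sqrt(3.0), sqrt(3.0))  # kurtosis 1.8

print(estimate_size(normal, normal, 6, 12))
print(estimate_size(uniform, uniform, 6, 12))
```

A different ratio $R = \sigma_1/\sigma_2$ can be studied by dividing the draws for the second population by $R$, and other parent distributions by passing a different sampler.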
5. Final remarks
The proposed procedure gives another interpretation of the problem of testing $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 < \mu_2$. Namely, the null hypothesis is rejected when there is no overlap between the intervals which contain the respective parameters being compared with a rather high level of probability. It is important from a practical point of view that this level is always the same and is equal to 80%. Also, the proposed procedure makes possible a graphical presentation of the comparison of the parameters. Another interpretation of the proposed procedure is that it rejects when the 90% one-sided confidence intervals for $\mu_1$ and $\mu_2$, i.e. $(-\infty, \bar{x}_1 + t_{0.9}(n_1 - 1) s_1 n_1^{-1/2})$ and $(\bar{x}_2 - t_{0.9}(n_2 - 1) s_2 n_2^{-1/2}, +\infty)$, respectively, do not overlap.
References
[1] A.A. Aspin, An examination and further development of a formula arising in the problem of comparing two mean values, Biometrika 35 (1948) 88-96.
[2] W.G. Cochran and G.M. Cox, Experimental Designs (John Wiley and Sons, New York, 1950).
[3] R.A. Fisher, The fiducial argument in statistical inference, Ann. of Eugenics 6 (1935) 395-398.
[4] R.A. Fisher, The asymptotic approach to Behrens' integral with further tables for the d test of significance, Ann. of Eugenics 11 (2) (1941) 141-172.
[5] M.B. Golhar, The errors of first and second kinds in Welch-Aspin's solution of the Behrens-Fisher problem, J. Statist. Comput. Simulation 1 (1972) 209-224.
[6] IBM Application, System/360 Scientific Subroutine Package (360A-CM-03X) Version III, Programmer's Manual Order No. H20-0205-3 (1968).
[7] M.E. Johnson, G.L. Tietjen and R.J. Beckman, A new family of probability distributions with applications to Monte Carlo studies, J. Amer. Statist. Assoc. 75 (370) (1980) 276-279.
[8] A.F.S. Lee and J. Gurland, Size and power of test for equality of means of two normal populations with unequal variances, J. Amer. Statist. Assoc. 70 (1975) 933-941.
[9] V.I. Pagurova, On a comparison of means of two normal samples, Theory Probab. Appl. 13 (3) (1968) 527-534.
[10] E.S. Pearson, R.B. D'Agostino and K.O. Bowman, Tests for departure from normality: Comparison of powers, Biometrika 64 (2) (1977) 231-246.
[11] J. Pfanzagl, On the Behrens-Fisher problem, Biometrika 61 (1974) 39-46.
[12] A. Wald, Testing the difference between the means of two normal populations with unknown standard deviations, in: T.W. Anderson et al., Eds., Selected Papers in Statistics and Probability by Abraham Wald (McGraw-Hill, New York, 1955) 669-695.
[13] B.L. Welch, The generalization of Student's problem when several different population variances are involved, Biometrika 34 (1947) 28-35.