Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720 www.elsevier.com/locate/jspi
How robust are tests for two independent samples? Dieter Rascha , Friedrich Teuscherb , Volker Guiardb,∗ a Insitut für angewandte Statistik und EDV Universität für bodenkultur Wien, Österreich b Research Institute for the Biology of Farm Animals Dummerstorf, Research Unit Genetics & Biometry, Germany
Received 30 December 2005; accepted 14 April 2006 Available online 14 January 2007
Abstract Non-parametric procedures are sometimes in use even in cases where the corresponding parametric procedure is preferable. This is mainly due to the fact that in practical applications of statistical methods too much attention is paid to any violation of the normality assumption–normal distribution is, however, primarily supposed in order to easily derive the exact distribution of the statistic used within parametric approaches. As concerns the case of two independent samples and the comparison of (the expectations of) two populations the t-test and its non-parametric counterpart, the Wilcoxon (Mann–Whitney) test are of particular interest. These both serve as the illustrative example, given here. The Wilcoxon test compares the two distributions, indeed, but may in cases where we are interested in comparing expectations lead to significance even if the expectations are equal; this because any higher moments in the two populations may differ. On the other hand, the t-test is so robust against non-normality that there is nearly no need to use the Wilcoxon test in comparing expectations. Most results for continuous distributions have been obtained in a research group in Dummerstorf–Rostock some years ago and have been published by Herrendörfer [1980. Robustheit I, Arbeitsmaterial zum Forschungsthema Robustheit, Probleme der angewandten Statistik, vol. 4. Forschungszentrum Dummerstorf-Rostock, Heft], Herrendörfer et al. [1983. Robustness of statistical methods. II, Methods for the one-sample problem. Biometrical J. 25, 327–343], Rasch [1995. Mathematische Statistik. Johann Ambrosius Barth, Leipzig-Heidelberg], Rasch and Tiku [1984. Robustness of Statistical Methods and Nonparametric Statistics. VEB Deutscher Verlag der Wissenschaften, Berlin (D, Reidel Publ. Co., Dordrecht, Lancaster, Boston, Tokyo, 1985)], and Rasch and Guiard [2004. The robustness of parametric statistical methods. Psychol. Sci. 46(2), 175–208]. Most of the results are based on extensive simulation experiments with 10 000 runs each. In the present paper we discuss the two-sample problem firstly by summarising the results for continuous distributions from some preprints in German language; however, secondly we offer new results. All above we investigate the t-test’s and the Wilcoxon test’s robustness in cases of the distribution not being continuous. All the results are that in most practical cases the two-sample t-test is so robust that it can be recommended in nearly all applications. © 2007 Published by Elsevier B.V. Keywords: t-test; Wilcoxon test; Ordered categorical data; Robustness; Simulation
∗ Corresponding author. (F. Teuscher)
E-mail address:
[email protected] (V. Guiard). 0378-3758/$ - see front matter © 2007 Published by Elsevier B.V. doi:10.1016/j.jspi.2006.04.011
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2707
1. Robustness of a test Several robustness concepts like those of Box and Tiao (1964), Huber (1964), Zielinsky (1977) and Rasch and Guiard (2004) have been discussed. The following concept of -robustness is used in this paper. Definition 1. Let d 1 be a confidence estimation based on an experimental design Vn of size n concerning a parameter ϑ of a class G of distributions with (nominal) confidence coefficient 1 − (0 < < 1) in G. For an element h ∈ H ⊃ G of a class H of distributions which contains G we denote by 1 − (Vn , h) the actual confidence coefficient of d . Then we call d -robust in H if Max | − (Vn , h)|. h∈H
(1)
If we are interested only in not too large (Vn , h)-values (conservative procedures are not considered as non-robust) we call d ∗ -robust in H if Max (Vn , h) − . h∈H
(1*)
In an analogue way the - and ∗ -robustness of tests or selection procedures can be defined. Due to the fact that a test for testing a null hypothesis H0 : ϑ = ϑ0 against HA : ϑ = ϑ0 with significance level can be performed by accepting H0 if ϑ0 is inside the (1 − ) confidence interval and rejecting it otherwise, Definition 1 includes the robustness of a test concerning the significance level . In this paper we use for G the family of univariate normal (N(, 2 )) distributions and an -robustness of 0.2 thus a 20% robustness. 2. Robustness against non-normal continuous distributions If we assume non-normal but continuous distributions we use besides G as mentioned above for H the Fleishman system of distributions and/or truncated normal distributions. Definition 2. A distribution belongs to the Fleishman (1978) system if it is the distribution of the transform y = a + bx + cx2 + dx3 ,
(2)
where x is a standard normal random variable. By a proper choice of the coefficients a, b, c and d the random variable y will have any quadruple of first four moments or equivalently (, 2 , 1 , 2 ). By 1 and 2 we denote the skewness and the kurtosis of any distribution, respectively. For instance any normal distribution (i.e. any element of G) with expectation and variance 2 can be represented as a member of the Fleishman system by choosing a = , b = and c = d = 0. This shows that we really have H ⊃ G as required in Definition 1. It is known that all probability distributions with existing 1 and 2 and all empirical distributions (with estimated skewness g1 and kurtosis g2 , respectively) fulfil the inequality 2 21 − 2,
(g2 g12 − 2).
(3)
The equality sign in (3) defines a parabola in the (1 , 2 )-plane {(g1 , g2 )-plane}. In Fig. 1 the position of the (g1 , g2 )-values calculated for 144 characters in that parabola are shown. Seven (1 , 2 )-values in that parabola have been selected for the robustness investigations reported. These values together with the coefficients a, b, c and d of the elements in the Fleishman system for the case = 0 (a = −c) and = 1 (b2 + 6bd + 2c2 + 15d 2 = 1) are given in Table 1. They are marked in Fig. 1 by a . 1 Random variables are in bold print.
2708
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
γ2;g2 8
7
6
5 4
3
γ2 = γ12 -2
γ1;g1 -3
-2
-1
1
2
3
-1
-2 Fig. 1. Values of empirical skewness g1 and kurtosis g2 of 144 characters in a (1 , 2 ) (also (g1 , g2 )) plane, by the parameters (1 , 2 ) of 7 distributions of Table 1 are denoted.
Table 1 (1 , 2 )-values and the corresponding coefficients in (2) No. of distribution
1
2
c = −a
b
1 2 3 4 5 6 7
0 0 0 0 1 1.5 2
0 1.5 3.75 7 1.5 3.75 7
0 0 0 0 0.163 194 276 264 0.221 027 621 012 0.260 022 598 940
1 0.866 0.748 0.630 0.953 0.865 0.761
d
993 020 446 076 882 585
269 807 727 897 603 274
415 992 840 706 523 860
0 0.042 0.077 0.110 0.006 0.027 0.053
522 872 696 597 220 072
484 716 742 369 699 273
238 101 040 744 158 491
The values of b, c and d in Table 1 are taken from Table A3 in Nürnberg (1982). Comparing the expectations of two distributions may be one of the most frequent applications of statistics in many research fields. We discuss here the independent samples case, this means, the samples of size n1 and n2 have been drawn independently from two populations with parameters 1 , 21 , 11 , 12 and 2 , 22 , 21 , 22 , respectively. Our purpose therefore is to take two independent random samples (y11 , . . . , y1n1 ) and (y21 , . . . , y2n2 ) of sizes n1 and n2 from the two populations in order to test the null hypothesis: H0 : 1 = 2 ,
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2709
against one of the following one- or two-sided alternative hypotheses: (a) HA : 1 > 2 , (b) HA : 1 < 2 , (c) HA : 1 = 2 . The following tests were investigated. Two sample t-test: Calculate n1 n2 2 2 i=1 (y1i − y 1 ) + i=1 (y2i − y 2 ) 2 sp = n1 + n 2 − 2
(4)
and the test statistic y − y2 n1 n2 t= 1 sp n1 + n 2 and reject H0 : in case (a) if t > t (n1 + n2 − 2; 1 − ), in case (b) if t < − t (n1 + n2 − 2; 1 − ) and in case (c) if |t| > t (n1 + n2 − 2; 1 − (/2)) with t (f ; P ) as the P -quantile of the central t-distribution with f d.f. and otherwise accept H0 . Welch test: Welch (1947) proposed the following test statistic: y − y2 tW = 1 , s12 s22 + n1 n2
(5)
where s12 and s22 are the two sample variances. Taking f ∗ as 2 s22 s12 + n1 n2 ∗ , f = s14 s24 + (n1 − 1)n21 (n2 − 1)n22
(6)
reject H0 : in case (a), if tW > t (f ∗ ; 1 − ), in case (b) if tW < − t (f ∗ ; 1 − ) and in case (c), if |tW | > t (f ∗ ; 1 − /2) and accept it otherwise (the three cases correspond to those in the two-sample t-test). Wilcoxon (Mann–Whitney) test: Wilcoxon (1945) and later Mann and Whitney (1947) proposed a two-sample test based on the ranks of the observations. This test is not based on the normal assumption, in its exact form only two continuous distributions with all moments existing are assumed, we call it the Wilcoxon test. As can be seen from the title of the second paper, this test is testing whether one of the underlying random variables is stochastically larger than the other. It can be used for testing the equality of expectations under additional assumptions: The null hypothesis tested by the Wilcoxon test corresponds with our null hypothesis H0 : 1 = 2 if and only if all higher moments of the two populations exist and are equal. Otherwise rejecting that one of the underlying random variables is stochastically larger than the other says few about the rejection of H0 : 1 = 2 . The procedure runs as follows: Calculate: 1 if yij y2j , dij = (7) 0 otherwise and define 1 2 n1 (n1 + 1) + dij . 2
n
W = W12 =
n
i=1 j =1
(8)
2710
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
Reject H0 in case (a), if W > W (n1 , n2 ; 1 − ), in case (b) if W < W (n1 , n2 ; ) and in case (c), if either W < W (n1 , n2 ; /2) or W > W (n1 , n2 ; 1 − (/2)) and accept it otherwise (the three cases correspond to those in the two-sample t-test). Extensive tables of the quantiles W (n1 , n2 ; ) are given by Verdooren (1963). A normal approximation is shown in Chapter 3. Range test of Lord: Lord (1947) proposed the following test for small samples: Calculate the ranges (maximum minus minimum sample value) w1 and w2 of the two samples and then TLord =
y1 − y2 w1 + w 2
and reject H0 in case (a), if TLord > (n1 , n2 ; 1 − ) in case (b) if TLord < (n1 , n2 ; ), and in case (c), if either TLord < (n1 , n2 ; /2) or TLord > (n1 , n2 ; 1 − (/2)) and accept it otherwise (the three cases correspond to those in the two-sample t-test). Tables of the quantiles (n1 , n2 ; P ) for small values of n1 = n2 are given in Lord’s paper. Tiku’s T-test: Tiku (1980) proposed a test based on modified maximum-likelihood-estimators after symmetrically censoring the samples by ri (i = 1, 2 here and in the sequel) elements. Modification of the likelihood equations led to a better (or at all to a) convergence than the usual ones. The estimates of i and i are with the jth order statistics yi(j ) in sample i: ⎛ ⎞ ni −ri 1 ⎝ ˆ i = yi(j ) + ri i (yi(ri+1 ) + yi(ni −ri ) )⎠ (9) mi j =ri +1
and
ˆ i =
Bi + Bi2 + 4Ai Ci . √ 2 Ai (Ai − 1)
(10)
The following notations are used in the estimates with int[x] = x if x is integer or int[x] = largest integer smaller x otherwise: ri = int[0.5 + 0.1ni ], Ai = ni − 2ri , mi = ni + 2ri (i − 1), Bi = ri i (yi(ni −ri ) − yi(ri +1) ), Ci =
n i −ri j =ri +1
2 2 2 yi(j ˆ 2i , ) + rii i yi(ri +1) + yi(ni −ri ) − mi
f (ti ) f (ti ) ti − qi i = − , qi f (ti ) − i ti , qi √ ti 2 with ti from qi = 1 − −∞ f (z) dz and f (z) = 1/ 2 · e−z /2 . i =
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2711
The test statistic is now with ˆ 2 = 1/(A1 + A2 − 2)[(A1 − 1)ˆ 21 + (A2 − 1)ˆ 22 ] |ˆ − ˆ 2 | T = 1 . 1 1 ˆ + m1 m2
(11)
The two-sided null hypothesis is rejected if T > t (A1 + A2 − 2; 1 − (/2)), analogously cases (a) and (b) have to be treated. Tiku’s TW -test: Tiku’s (1980) TW -test is a Welch type modification of Tiku’s T-test. With the notations above its test statistic is |ˆ − ˆ 2 | . (12) TW = 1 ˆ 21 ˆ 22 + m1 m2 The null hypothesis is rejected if TW > t (f ∗∗ ; 1 − (/2)) with f ∗∗ analogously defined to f ∗ in (6), analogously cases (a) and (b) have to be treated. The t-test and the Wilcoxon test was one of the few already investigated cases with a sufficiently large number of runs (40 000 for the Wilcoxon test (Posten, 1982) and 100 000 for the t-test (Posten, 1978)) before our robustness research started. Posten used the family of Pearson distributions as the class H of distributions in Definition 1. He simulated distributions in the Pearson system in a grid of (1 , 2 )-values with 1 = 0(0.4)2 and 2 = −1.6(0.4)4.8 and equal sample sizes of 5(5)30 in both samples. Posten stated in his 1978 paper: “It would seem, therefore, that further studies of the effects of variance heterogeneity on the two tests would be needed over an extensive practical family of non-normal distributions before a single procedure might be specified as a somewhat general choice for the two-sample location problem” This was the reason that Tuchscherer (1985) started a further investigation (Tuchscherer and Pierer, 1985 but see also Posten, 1985). They used the seven Fleishman distributions of Table 1, type I risks of 0.01; 0.05 and 0.1; the tests mentioned above and seven further tests and a ratio of the two variances of 0.5; 1; and 1.5 and a sample size of 5 in population 1 and of 5 and 10 in population 2. In Guiard and Rasch (1987) further results can be found for truncated normal distributions with (1 , 2 )-values: (1.0057; 0.5915); (0.348; −0.3488) and (1.505; 3.75). The results of the simulations described above led to a huge data set and therefore we give in Fig. 2 a summary with the results of Posten (1978, 1985) and Tuchscherer and Pierer (1985). Continuous distributions
Normal
Symmetric but non normal
Design n1 = n2 and take the t-test. For unplanned experiments with n1 ≠ n2 take the Welch test
Proceed as for the normal or use for equal variances Tiku´s T-test and for unequal variances Tiku´s TW-test
No information
Design n1 = n2 and take the t-test for ni>5 or for ni<10 Lord´s test. For unplanned experiments with n1 ≠ n2 take the Welch test
Fig. 2. Proposals for the comparison of two expectations by independent samples from continuous distributions.
2712
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
The Wilcoxon test in the case of equal variances and equal sample sizes is also quite good but there is no real need to use it even with small samples. As at any rate, the sample sizes have to be determined before a study is performed, in order to fulfil some precision requirement (see Rasch, 2003), it is recommended to use samples of equal size not only because the total size N =n1 +n2 is a minimum for a given precision. But also because the robustness is better in this case and the Welch-test must not be applied. 3. Robustness against ordinal distributions In many applications—especially in the social sciences—ordered categorical data are observed. It is obvious that the assignment of natural numbers to the categories is purely arbitrary and any monotone transformation of given scores is equally justified but may lead to different results using parametric tests (see for instance Ivanova and Berger, 2001). In textbooks on psychological statistics and in sociological research practice for ordered categorical data the Wilcoxon test is recommended and used, respectively. Though, most of the recommendations are not aware of the fact, that this test is based on the assumption of a continuous distribution. There are developments to use rank tests and permutation tests (Conover, 1973; Streitberg and Roehmel, 1986; Brunner and Puri, 1996) instead of the Wilcoxon test, but these tests are seldom used in practice up to now. Brunner and Puri (2001) summarise recent developments of the use of nonparametric methods in simple and high-order factorial experiments for independent observations as well as for repeated measurements. The Wilcoxon test comes out as a very special case in a simple design. Because many of the results are asymptotic, the practical behaviour of nonparametric tests has been investigated for small samples. For the paired sample case see Munzel (1999) and for more general repeated measurement designs in factorial experiments (Singer et al., 2004). Both compare parametric and nonparametric tests for repeated measurements. We now concentrate on the one-sided independent two-sample problem and discuss in the following mainly the special case that there is some underlying hidden continuous variable which generates ordered categorical data. We investigate the properties of the likelihood ratio (LR) test for categorical data generated in this way and compare it with the properties of the Wilcoxon test and the two-sample t-test, if integers are used for k categories, that is the natural order of numbers 0, 1, . . . , k − 1 or 1, 2, . . . , k—but suppose, as indicated—any plausible underlying hidden continuous variable. We consider distributions of k = 5 ordered categories. 3.1. Ordinal distributions of the categories We assume that we have to compare the distributions of two populations over the same k ordered categories C1 ≺ C2 ≺ · · · ≺ Ck , where the symbol ≺ stands for “inferior to”. In psychological research and in practical life, for instance school scores, often numerical scores xi are given to these categories which are ordered in the opposite way (or like in The Netherlands in case of school scores in the same way) as the categories; now, of course, we order the numbers by the <-sign in place of ≺. Here we consider only the case, that an underlying (hidden) continuous quantitative random variable z is assumed. We use here for z the logistic distribution with the distribution function F (z) =
1 . 1 + e−z
(13)
The k categories are represented by k intervals [ j −1 , j ]; j = 1, . . . , k with 0 = −∞ and k = ∞. The category Cj occurs if z ∈ [ j −1 , j ]; j = 1, . . . , k. The ordinal distribution of the k categories is completely characterised by the probabilities pi = P (Ci ) = F ( i ) − F ( i−1 ) = We write such a distribution as C 1 C2 · · · C k , Dk = p 1 p2 · · · p k
1 1 − , −
i 1+e 1 + e− i−1
k 2,
k i=1
pi = 1.
i = 1, . . . , k.
(14)
(15)
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2713
Due to the condition ki=1 pi = 1 an ordinal categorical distribution has k − 1 parameters free of choice. These parameters are p1 , . . . , pk−1 , or, equivalently, 1 , . . . , k−1 . The choice of the logistic distribution for the hidden continuous variable is only one of many options. By means of a pilot study it can be checked, whether this distribution is sensible. For other distributions further simulation studies for the LR-test are necessary. Let us consider a special scoring system for the k categories that is to use the labels x1 , x2 , . . . , xk ; k 2 with x1 < x2 < · · · < xk . Then the distribution of x ∈ {x1 , x2 , . . . , xk } corresponding to the distribution of the categories is given by pi = P (x = xi ) = P (Ci ) with P (Ci ) from (14). For this discrete distribution the expected value is defined as =
k
p j xj .
j =1
3.2. Samples from ordinal distributions A random sample of size n from an ordinal distribution is characterised by the frequencies hi of category Ci , i = 1, . . . , k, in the sample. The vector of these frequencies follows a (k − 1)-dimensional multinomial distribution with probability function n! ph1 , . . . , pkhk . h1 ! . . . hk ! 1 Therefore, we have kj =1 hj = n and as before kj =1 pj = 1. P (h1 = h1 , . . . , hk−1 = hk−1 ) =
(16)
3.3. The tests used We consider the one-sided two-sample problem with the index l = 1, 2 for the two samples, respectively. Here l = 1 corresponds in experiments to the control group and l = 2 to the treatment group. In the sequel we use pli instead of pi and nl instead of n. Written in terms of the logistic function (13) we have F1 = F (x) for l = 1 and F2 = F (x − ) for l = 2 and > 0 (two-sided and the left-sided alternatives can be handled analogously). With respect to the null hypothesis H0 : = 0 and the alternative hypothesis HA : > 0 we compared the power function of the following statistical 0.05-tests: • • • • • •
normal approximation for the Wilcoxon test without correction for ties; normal approximation for the Wilcoxon test with correction for ties; chi-square test for 2 × k-tables; logistic procedure with the Wald test; logistic procedure with the LR test; and t-test using the values 1, 2, . . . for the categories.
The use of values 1, 2, . . . for the categories is a tool for the applicability of the t-test but nevertheless the hypotheses about have to be tested. If the values 1, 2, . . . are considered as original variables, the t-test is related to test the hypothesis H0 : 1 = 2 against the alternative hypothesis HA : 2 − 1 = > 0. (Note that there is a difference between the shift parameter of the logistic distribution and 2 − 1 = .)
2714
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
According to the Wilcoxon test we approximate W in (8) by n1 n2 (n1 + n2 + 1) n2 (n1 + n2 + 1) u = R2 − 2 12 and H0 will be rejected if u is greater than u1− , where u1− denotes the (1 − )-quantile of the standardised normal distribution. R2 denotes the sum of the ranks of the treatment group. In case of ties, mid-ranks are used.
For the Wilcoxon test with ties correction the test statistics u defined above must be divided by 1 − j (Tj3 − Tj )/(N 3 − N ) (Kruskal and Wallis, 1952). with N = n1 + n2 . Tj denotes the number of observations within a group of tied observations. Both logistic procedures are based on the maximum likelihood estimate ˆ of . Within the first logistic procedure the Wald test or the so-called delta method is applied. In the second logistic procedure the LR test is used. Both tests can be applied by using standard procedures in program packages like CADMOD of the package SAS/STAT (SAS-Institute, 1999). The Wald test yielded sometimes non-increasing power functions for increasing deviations from the null hypothesis. This disappointing fact is caused by non-convergence of the likelihood method. Non-convergence appears in those cases, when one sample consists of elements of one extreme category C1 or Ck only. The Wald test can be represented ˆ But in case of non-convergence we using the Wald confidence interval of . This interval is symmetrically around . ˆ but the standard error is also very large. Thus, in most have ˆ = ∞. The calculation stops at a very large value of , cases the lower limit of the confidence interval will be negative and therefore H0 will not be rejected. This is the reason why we also used the LR test. This one-sided test corresponds to a two-sided (1 − 2)-confidence interval, which is not always symmetrical. Especially in case of non-convergence it is open in the direction of ˆ = ∞ or −∞, respectively. Therefore in most cases this interval does not contain = 0. 3.4. Simulations and results Each test was simulated for 10 000 independent pairs of samples from ordinal distributions with five categories with samples sizes equal to 6, 10, 15, 25, 50 and 100 in each of the samples. Sometimes also unequal sample sizes are used. Such distributions violate the assumption for the Wilcoxon test (continuity of the distribution) and of course also for the t-test. The distributions used are under H0 : 1 2 3 4 5 (1) • D5 = , 0.05 0.25 0.4 0.25 0.05 1 2 3 4 5 (2) • D5 = and 0.2 0.2 0.2 0.2 0.2 1 2 3 4 5 (3) • D5 = . 0.3 0.15 0.1 0.15 0.3 By means of such a distribution it is possible to calculate the thresholds j using the inverse of the logistic distribution function F (x). Now, for any alternative value of the shift parameter the distribution under HA can be calculated by means of (14) using the given j and F2 (x) = F (x − ) instead of F (x). For a nominal risk of the first kind nominal = 0.05 the values actual can be found in Table 2. For nominal = 0.05 a 20% robustness is given as long as 0.04 actual 0.06. The corresponding results are bold printed in Table 2. The corrected Wilcoxon test and the t-test control the first kind risk best for all sample sizes. The Wilcoxon test without correction of ties is of course very conservative for small sample sizes. But taking into consideration that we have only five categories and therefore a large amount of ties, the conservativeness is surprisingly small. It is known that the chi-square test is asymptotically exact. But for the small sample sizes shown within Table 2 the values of actual are mostly large. The chi-square test is very sensitive to the underlying distribution, while the other tests are not.
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2715
Table 2 The values of actual for several tests and distributions and samples of size 6, 10, 15, 25, 50 and 100, nominal = 0.05 Test
Wilcoxon test Corrected Wilcoxon test Chi-square test Logistic pr. with Wald Logistic pr. with LR t-test
n=6
n = 10 (2) D5
(3) D5
(1) D5
(2) D5
(3) D5
D5
D5
D5
0.009 0.052 0.038 0.041 0.063 0.047
0.011 0.048 0.085 0.063 0.063 0.048
0.010 0.049 0.065 0.060 0.064 0.053
0.012 0.051 0.056 0.052 0.057 0.048
0.016 0.054 0.106 0.058 0.057 0.050
0.013 0.047 0.091 0.054 0.057 0.053
0.017 0.055 0.057 0.051 0.054 0.049
0.017 0.049 0.092 0.056 0.055 0.050
0.018 0.048 0.089 0.054 0.054 0.048
(1) D5
(2) D5
(3) D5
(1) D5
(2) D5
(3) D5
D5
D5
D5
0.038 0.052 0.061 0.051 0.052 0.050
0.046 0.049 0.066 0.050 0.052 0.048
0.036 0.049 0.079 0.054 0.055 0.050
0.039 0.049 0.073 0.048 0.049 0.053
0.046 0.046 0.053 0.053 0.053 0.051
0.041 0.050 0.058 0.052 0.052 0.050
0.042 0.051 0.057 0.044 0.044 0.049
0.048 0.050 0.053 0.045 0.046 0.050
0.044 0.049 0.054 0.046 0.046 0.048
n = 25
Wilcoxon test Corrected Wilcoxon Chi-square test Logistic pr. with Wald Logistic pr. with LR t-test
n = 15
(1) D5
n = 50
(1)
(2)
(3)
n = 100 (1)
(2)
(3)
20% robust values of actual between 0.04 and 0.06 in bold print.
1.0
Power
0.8 0.6 Chi-square-test Logistic Wald Logistic LR
0.4 0.2 0.0 0.0
0.5
1.0
1.5
2.0
2.5
3.0
(1)
Fig. 3. Simulated power functions for D5 and n = 25.
The logistic procedures with the Wald test and with the LR test hold the first kind risk rather well. The differences between the versions are mostly due to simulation. Fig. 3 shows the results of the chi-square test and of the logistic procedures with the Wald test and with the LR test. Here the decreasing part of the power mentioned before can be seen. Therefore, in the following we use the logistic procedure only in connection with LR test. The chi-square test is loosing much power because it is based on contingency tables without taken the known order relation into account. Therefore also this test will not be applied furthermore. The Wilcoxon test without correction of ties yields a smaller power than that one with correction (Fig. 4), but the difference is surprisingly small. In the following we will use only the Wilcoxon test with correction of ties. As shown in Fig. 5, in case of equal sample sizes the corrected Wilcoxon test, the t-test and the logistic procedure LR have almost the same power. Within Fig. 6 the power functions of the t-test for n1 = n2 and n1 = n2 , n1 + n2 = N = 50, are demonstrated. In case of equal residual variances 2 the variance of the mean difference is given as 2diff = 2 (1/n1 + (1/n2 )) = 2 ((n1 + n2 )/n1 n2 ) = 2 (N/n1 n2 ). Therefore, in case of unequal sample sizes the power must be smaller. But in our situation
2716
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
1.0 n=100 n=50
Power
0.8
n=25
0.6
uncorrected Wilcoxon corrected Wilcoxon
0.4 0.2 0.0 0.0
0.5
1.0
1.5
2.0
2.5
3.0
(1)
Fig. 4. Simulated power functions of the corrected and uncorrected Wilcoxon test for D5 and n = 25, 50 and 100.
1.0 n=100 0.8
n=50 Wilcoxon t-test Logistic
Power
n=25 0.6 0.4 0.2 0.0 0.0
0.5
1.0
0.5
2.0
2.5
3.0
(1)
Fig. 5. Simulated power functions of the corrected Wilcoxon test, the t-test and the logistic procedure LR for D5 and n=25, 50 and 100, respectively.
t-test 1.0
Power
0.8 0.6 n1=5 n1=25 n1=45
0.4 0.2 0.0 0.0
0.5
1.0
1.5 (1)
2.0
2.5
3.0
Fig. 6. Simulated power functions of the t-test for D5 and (n1 , n2 ) = (5, 45), (25, 25), (45, 5).
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2717
Table 3 True and assumed values of the variance of mean differences
True value of 2diff =
˜ 2diff =
21 n1
+
22 n2
21 (n1 − 1) + 22 (n2 − 1)
N −2 Used in the dominator of the t-value
N n1 n2
(n1 , n2 ) = (5, 45)
(n1 , n2 ) = (45, 5)
0.1960
0.1640
0.1633
0.1967
Logistic LR 1.0
Power
0.8 0.6 n1=5 n1=25 n1=45
0.4 0.2 0.0 0.0
0.5
1.0
1.5
2.0
2.5
3.0
(1)
Fig. 7. Simulated power functions of the logistic procedure LR test for D5 and (n1 , n2 ) = (5, 45), (25, 25), (45, 5).
we observe a further effect. The region of scores is limited between 1 and 5. For the control group the expectation of the scores lies in the middle of this region. Within the treatment group the expectation of the scores is shifted to the right, but the upper limit of these scores does not change. Therefore the variance of these scores will be smaller. By means of simulation we get 21 ≈ 0.9 and for = 2, 22 ≈ 0.72. Therefore, for the cases (n1 , n2 ) = (45, 5) and (n1 , n2 ) = (5, 45) we get different power values. In order to explain this we use the values of Table 3. If the larger sample size is combined with the larger variance then the variance used in the denominator of the t-value is greater than the true variance of the mean difference. Therefore in this case the power is smaller as in the reciprocal case. Note that for H0 : = 0 we have 21 = 22 . Therefore the effect demonstrated above has no influence on actual . Fig. 7 shows the same cases as in Fig. 4 but for the logistic procedure with the LR test. It is interesting that the power of both of the cases (n1 , n2 ) = (45, 5) and (n1 , n2 ) = (5, 45) are almost identical with the better case (n1 , n2 ) = (5, 45) of the t-test. Also for the corrected Wilcoxon test (results not shown) the power functions for the cases (n1 , n2 )=(45, 5) and (n1 , n2 ) = (5, 45) are almost identical, but they correspond with the worse case, (n1 , n2 ) = (45, 5), of the t-test. (1) Until now, for H0 we considered only the distribution D5 . Therefore it would be interesting to study the influence of the different distributions defined in Section 3.4. If we consider the power function of the Wilcoxon test as a function of the shift parameter, , of the hidden logistic distribution, then the power functions are almost identical for all three distributions (Fig. 8a). But representing the power as a function of the difference, = 2 − 1 , of the score means then there are large differences between the power functions (Fig. 8b). 3.5. Conclusions From the results above we can conclude: Either the logistic procedure with the LR test or the corrected Wilcoxon test but also the t-test or for unequal sample sizes the Welch test (results for the Welch test are not shown here) should be used but not the chi-square test. In all the cases considered, these tests are 20% robust.
2718
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
b 1.0
1.0
0.8
0.8
0.6
Power
Power
a
D5(1) D5(2) D5(3)
0.4 0.2
0.6
D5(1) D5(2) D5(3)
0.4 0.2 0.0
0.0 0.0
0.5
1.0
1.5 θ
2.0
2.5
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 δ
3.0
(1)
(2)
Fig. 8. Simulated power functions of the corrected Wilcoxon test for D5 , D5 (b) in dependence of the difference = 2 − 1 .
(3)
and D5
and n = 25: (a) in dependence of the shift parameter ,
Distributions of ordinal variables
General case
Assign the numbers 1,2,3,... to the categories as a new variate. Design n1 = n2 and take the t-test otherwise the Welch test
Ordinal variables assumed to be generated by a hidden logistic variate
corrected Wilcoxon test.
logistic procedure
Fig. 9. Proposals for the comparison of two independent samples from ordinal distributions.
In Fig. 9 the possible robust tests for different situations are shown. Fig. 9 corresponds with Fig. 2 in the continuous case. If the ordinal variable is assumed to be generated by a hidden logistic variate then the logistic procedure with LR can be applied. For unequal sample sizes in our simulations the power of the logistic procedure was larger as that one of the other tests, but the difference is very small. Note that it is possible, that the properties of the logistic procedure depend on the correctness of the assumption of a hidden logistic variate. This means, that if this assumption is false, the power could be worse. For this problem further simulation studies are necessary. For unequal sample sizes the power of the t-test depends on the relation between the sample sizes and the residual variances of the groups. Therefore for unequal sample sizes we recommend to apply the Welch-test instead of the t-test. The power of the corrected Wilcoxon test and of the t/Welch-test are very similar. For unequal sample sizes in our simulation the power of the Wilcoxon test was a little bit smaller than that one of the other tests.
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
2719
4. Final remarks We investigated ordinal random variables amongst them also those which can be thought to be the result of an underlying “hidden” logistic distribution. By ordinal variables as well the normality assumption of the t-test (after replacing the k categories by 1, 2, . . . , k) as also the continuity assumption of the Wilcoxon test is violated. For the hidden variable we investigated the logistic distribution because it is used as a model in many applications (for instance in drug screening and for several psychological characters) but other distributions are still open to discussion. We further limited our investigation on k = 5 categories. This because less than five categories relatively seldom occur in applications. If we have more than 5 we expect that the results will become better because we expect that with larger k the number of ties will become smaller which have to be corrected for the Wilcoxon test. But also in this direction further investigations are needed. By this paper we do not recommend to use the numbers 1, 2, . . . , k for the categories. We only show that if such a scoring is used then the t-test is applicable. Acknowledgements We thank both reviewers for their helpful and detailed comments. Further we thank Karsten Schlettwein from the Research Unit Genetics & Biometry of the Research Institute for the Biology of Farm Animals Dummerstorf for drawing most of the figures. References Box, G.E.P., Tiao, G.C., 1964. A note on criterion robustness and inference robustness. Biometrika 51, 1–34. Brunner, E., Puri, M.L., 1996. Nonparametric methods in design and analysis of experiments. In: Ghosh, S., Rao, C.R. (Eds.), Handbook of Statistics, Design and Analysis of Experiments, vol. 13. Elsevier Science BV, Amsterdam, pp. 631–703. Brunner, E., Puri, M.L., 2001. Nonparametric methods in factorial designs. Statist. Papers 42, 1–52. Conover, W.J., 1973. Rank tests for one sample, two samples and k samples without the assumption of continuous functions. Ann. Statist. 1, 1105–1125. Fleishman, A.J., 1978. A method for simulating non-normal distributions. Psychometrika 43, 521–532. Guiard, V., Rasch, D. (Eds.), 1987. Robustheit Statistischer Verfahren, Probleme der angewandten Statistik, vol. 20. Forschungszentrum DummerstorfRostock. Herrendörfer, G. (Ed.), 1980. Robustheit I, Arbeitsmaterial zum Forschungsthema Robustheit, Probleme der angewandten Statistik, vol. 4. Forschungszentrum Dummerstorf-Rostock. Herrendörfer, G., Rasch, D., Feige, K.-D., 1983. Robustness of statistical methods. II, Methods for the one-sample problem. Biometrical J. 25, 327–343. Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101. Ivanova, A., Berger, V.W., 2001. Drawbacks to integer scoring for ordered categorical data. Biometrics 57, 567–570. Kruskal, W.H., Wallis, W.A., 1952. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 583–621. Lord, E., 1947. The use of the range in place of the standard deviation in the t-test. Biometrika 34, 41–67. Mann, H.B., Whitney, D.R., 1947. On a test whether one of two random variables is stochastically larger than the other. Ann. Math. Statist. 18, 50–60. Munzel, U., 1999. Nonparametric methods for paired samples. Statist. Neerlandica 53, 277–286. Nürnberg, G., 1982. Beiträge zur Versuchsplanung für die Schätzung von Varianzkomponenten und Robustheitsuntersuchungen zum Vergleich zweier Varianzen, Probleme der angewandten Statistik, vol. 6. Forschungszentrum Dummerstorf-Rostock. Posten, H.O., 1978. The robustness of the two-sample t-test over the Pearson system. J. Statist. Comput. Simulation 6, 295–311. Posten, H.O., 1982. Two-sample Wilcoxon power over the Pearson system. J. Statist. Comput. Simulation 16, 1–18. Posten, H.O., 1985. In: Rasch, D., Tiku, M.L. (Eds.), Robustness of the Two-Sample t-Test. pp. 92–99. Rasch, D., 1995. Mathematische Statistik. Johann Ambrosius Barth, Leipzig-Heidelberg. Rasch, D., 2003. Determining the optimal size of experiments and surveys in empirical research. Psychol. Sci. 45, 3–48. Rasch, D., Guiard, V., 2004. The robustness of parametric statistical methods. Psychol. Sci. 46 (2), 175–208. Rasch, D., Tiku, M.L. (Eds.), 1984. Robustness of Statistical Methods and Nonparametric Statistics. VEB Deutscher Verlag der Wissenschaften, Berlin (D, Reidel Publ. Co., Dordrecht, Lancaster, Boston, Tokyo, 1985). SAS Institute Inc., 1999. SAS Campu Drive, Cary, North Carolina 27513. Singer, J.M., Poleto, F.Z., Rosa, P., 2004. Parametric and nonparametric analyses of repeated ordinal categorical data. Biometrical J. 46, 460–473. Streitberg, B., Roehmel, J., 1986. Exact distribution for permutation and rank tests: an introduction to some recently published algorithms. Statist. Software Newslett. 12, 10–17.
2720
D. Rasch et al. / Journal of Statistical Planning and Inference 137 (2007) 2706 – 2720
Tiku, M.L., 1980. Robustness of MML estimators based on censored samples and robust test statistics. J. Statist. Plann. Inference 4, 123–143. Tuchscherer, A., 1985. In: Rasch, D., Tiku, M.L. (Eds.), The robustness of some procedures for the two-sample location problem—a simulation study (concept). Tuchscherer, A., Pierer, H., 1985. Simulationsuntersuchungen zur Robustheit verschiedener Verfahren zum Mittelwertvergleich im Zweistichprobenproblem (Simulationsergebnisse). In: Rudolph, P.E. (Ed.), Robustheit V, Probleme der angewandten Statistik, vol. 15, pp. 1–42. Forschungszentrum Dummerstorf-Rostock. Verdooren, L.R., 1963. Extended tables of critical values for Wilcoxon’s test statistic. Biometrika 50, 177–186 (Correction, (1964), Biometrika 51, 527). Welch, B.L., 1947. The generalization of student’s problem when several different population variances are involved. Biometrika 34, 28–35. Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometrics 1, 80–82. Zielinsky, R., 1977. Robustness: a quantitative approach. Bull. L’Acad. Polonaise Sci. Ser. Math. Astr. Phys. XXV (12), 1281–1286.