Computer Methods and Programs in Biomedicine 63 (2000) 21–28 www.elsevier.com/locate/cmpb
Mann–Whitney/Wilcoxon's nonparametric cumulative probability distribution

Herman P. Wijnand a,*, Robbert van de Velde b

a Sibeliuspark 754, 5343 BS, Oss, The Netherlands
b Dr. Van der Steenlaan 18, 5348 JL, Oss, The Netherlands

* Corresponding author. Tel./fax: +31-412-626693. E-mail address: [email protected] (R. van de Velde).
Received 7 January 1998; received in revised form 20 May 1999; accepted 16 December 1999
Abstract

It is demonstrated how complete nonparametric cumulative probability distributions for Mann–Whitney/Wilcoxon's nonparametric two-sample test can be constructed, based on algorithms published earlier in this journal. These procedures provide cumulative probabilities for all possible rank sums in the nonparametric two-sample test. Separate programs for MS(PC)-DOS and Macintosh computers are offered. © 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Mann–Whitney/Wilcoxon; Nonparametrics; Two-sample test; Cumulative probability distribution
1. Introduction

For deciding whether two samples can reasonably be considered to originate from the same distribution (the null hypothesis of 'there is no systematic difference'), a number of statistical tests can be used. If such a test results in a fairly large probability that the samples come from one and the same distribution, the null hypothesis is not rejected. If the test results in a small probability, usually 5% or smaller, the null hypothesis is rejected and the alternative hypothesis ('the samples originate from different distributions') is accepted. The difference is then said to be statistically significant at the 5% level.
For samples whose underlying distribution is normal, the two-sample t-test, a parametric test, is generally used. If, however, the underlying distribution is neither normal nor can be made normal by some suitable transformation, a nonparametric two-sample test may offer advantages such as a higher relative efficiency. In such a nonparametric procedure the original elements are replaced by rank numbers. The only assumptions to be made in such cases are that both samples are random samples from their respective populations, and that there is independence within each sample and mutual independence between the samples.
In a nonparametric two-sample test the two samples are combined into a single ordered sample and rank numbers are assigned to the combined sample values from the smallest to the largest value, irrespective of the population each value came from. The relevant test statistic then is the sum of the rank numbers of one sample. A very small (or very large) rank sum may indicate that the values from that population tend to be smaller (or larger) than the values from the other population. Over 50 years ago, Wilcoxon [1] introduced such a nonparametric two-sample test for samples of equal size. Seven years after Wilcoxon's publication, White [2] and Van der Reyden [3] extended the test to two samples of unequal size. A test similar to Wilcoxon's was published by Festinger [4]. Two years after Wilcoxon's first publication, however, Mann and Whitney [5] were the first to explore the case of unequal sample sizes and to provide tables of critical rank sums for relatively small sample sizes. It is predominantly their publication which has been the basis for the widespread use of the nonparametric two-sample test. Conover [6] remarked that because the test is attributed to various authors, it is the user's prerogative as to which name to call it by. Nowadays every textbook on statistics contains tables of critical rank sums for this test. Only one example is given here: Geigy's Scientific Tables [7] contain extensive tables with two-sided probabilities of 0.100, 0.050, 0.020, 0.010, 0.005, 0.002 and 0.001, for sample sizes up to 25 + 50. Conover [6] explored methods to assess nonparametric confidence intervals for the mean difference between two populations. In addition to the assumptions of randomness and independence, the two population distribution functions are assumed to be identical except for a possible difference in location parameters. Another approach to obtain nonparametric confidence intervals was developed independently by Hauschke et al. [8,9] for the treatment effect in bioequivalence studies carried out in the form of two-treatment two-period cross-over studies. For this type of study, Hauschke et al.'s approach was extended by Wijnand [10,11] to period and sequence effects by using nonparametric cumulative probability distributions. The purpose of the present paper is to demonstrate that the procedures published earlier [10,11]
can also be used to obtain complete nonparametric cumulative probability distributions, providing probabilities for all possible rank sums of Mann–Whitney/Wilcoxon's two-sample test, and to make available PC programs to compute such probability distributions. Separate programs will be offered for MS(PC)-DOS and Macintosh computers.
2. Theory

Suppose there are two samples with sample sizes n_1 and n_2 such that n_1 ≤ n_2. The n_1 + n_2 elements are combined into a single ordered sample and rank numbers from 1 to n_1 + n_2 are assigned to the combined sample values from the smallest to the largest value, irrespective of the population each value came from. The rank sum R_group1 is calculated for the smaller group (with n_1 elements). Similarly, the rank sum R_group2 is obtained for the larger group (with n_2 elements). The sum of the two rank sums equals the sum of all rank numbers, so that the equation

    R_group1 + R_group2 = (n_1 + n_2)(n_1 + n_2 + 1)/2

can be used as a quick check that all values have been included in ranking. The ultimate test is to assess the cumulative probability associated with the empirical rank sums R_group1 or R_group2. This means that we are not satisfied by concluding that R_group1 or R_group2 is equal to or smaller or larger than certain tabulated critical rank sums: we want to assess the exact cumulative probability associated with the empirical values R_group1 or R_group2. To achieve this, all possible rank sums and the associated probabilities have to be calculated. The possible values which can be taken by the rank sums follow from the fact that two samples with sizes n_1 and n_2 provide a total number of n_1 n_2 + 1 rank sums, which in the smaller group can take the values:

    R_{1,1} = n_1(n_1 + 1)/2                          (1)
    R_{1,2} = n_1(n_1 + 1)/2 + 1                      (2)
    ...
    R_{1,n_1 n_2 + 1} = n_1(n_1 + 1)/2 + n_1 n_2      (3)
with indices i = 1, 2, ..., n_1 n_2 + 1, respectively. In the larger group the rank sums can take the values:

    R_{2,1} = n_2(n_2 + 1)/2                          (4)
    R_{2,2} = n_2(n_2 + 1)/2 + 1                      (5)
    ...
    R_{2,n_1 n_2 + 1} = n_2(n_2 + 1)/2 + n_1 n_2      (6)

with the same indices i = 1, 2, ..., n_1 n_2 + 1, respectively. It follows that if n_1 = n_2, then R_{1,i} = R_{2,i} for all values of i. Let N_comb be the total number of possible rankings of all combinations of n_1 or n_2 elements out of n_1 + n_2 elements:

    N_comb = (n_1 + n_2)!/(n_1! · n_2!)               (7)
Next, let F_i be the number of cases in which each R_{1,i} or R_{2,i} occurs among the N_comb rankings. The F_i are integer numbers, symmetrically distributed around the rank sum n_1(n_1 + n_2 + 1)/2 for the smaller group and around n_2(n_1 + n_2 + 1)/2 for the larger group, and they reach their maximum at or around this value. How to obtain F_i will be discussed in the next section. The symmetry leads to the following equations for the probabilities P_i associated with all possible rank sums R_{1,i} and R_{2,i}:

    P_i = P{R_group1 = R_{1,i}} = P{R_group2 = R_{2,n_1 n_2 − i + 2}} = F_i / N_comb    (8)
and the ultimately desired cumulative probability Σ P_j represents the probability that the rank sum is smaller than or equal to R_{1,i}, or larger than or equal to R_{2,n_1 n_2 − i + 2}:

    Σ_{j=1}^{i} P_j = Σ_{j=1}^{i} F_j / N_comb = P{R_group1 ≤ R_{1,i}} = P{R_group2 ≥ R_{2,n_1 n_2 − i + 2}}    (9)
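As a concrete illustration of Eqs. (1)–(9), the following Python sketch (ours, for illustration only; it is not one of the programs described in Section 4, and the sample values are hypothetical) enumerates all N_comb rankings for small samples and derives the F_i and the cumulative probabilities directly from the definitions.

    from itertools import combinations
    from math import comb

    def exact_distribution(n1, n2):
        # Brute-force distribution of the rank sum of the smaller group:
        # enumerate all C(n1+n2, n1) ways to choose the ranks of group 1,
        # so this is feasible only for small n1 + n2.
        N = n1 + n2
        lo = n1 * (n1 + 1) // 2                # Eq. (1): smallest rank sum
        counts = {}                            # F_i, keyed by rank sum
        for ranks in combinations(range(1, N + 1), n1):
            s = sum(ranks)
            counts[s] = counts.get(s, 0) + 1
        n_comb = comb(N, n1)                   # Eq. (7)
        cum, table = 0, {}
        for r in range(lo, lo + n1 * n2 + 1):  # Eqs. (1)-(3)
            cum += counts.get(r, 0)
            table[r] = cum / n_comb            # Eq. (9)
        return table

    # Rank two small tie-free samples, apply the quick ranking check, and
    # look up the cumulative probability of the observed rank sum.
    a = [1.2, 3.4, 2.2]                        # hypothetical data, no ties
    b = [4.1, 0.9, 5.6, 2.8]
    pooled = sorted(a + b)
    r1 = sum(pooled.index(v) + 1 for v in a)   # rank sum, smaller group
    r2 = sum(pooled.index(v) + 1 for v in b)
    n = len(a) + len(b)
    assert r1 + r2 == n * (n + 1) // 2         # quick check from above
    print(r1, exact_distribution(len(a), len(b))[r1])

The next section describes two algorithms that obtain the same F_i far more efficiently than this exhaustive enumeration.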
3. Algorithms to obtain F_i
3.1. Algorithms used

Two algorithms to calculate the values of F_i have been described earlier in this journal: Meineke and De Mey's algorithm [12] and Dromey's recursive algorithm [13], both explored for the nonparametric analysis of bioequivalence studies [11]. Essentially these algorithms compute the number F_i of the rankings producing the same rank sum when all possible rankings (combinations) of n_1 or n_2 out of n_1 + n_2 elements are written out. In the recursive algorithm the F_i are obtained with a one-dimensional array, and in Meineke and De Mey's algorithm with a three-dimensional array.
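Neither published implementation is reproduced here, but the counting idea common to both algorithms can be sketched in a few lines of Python (our sketch, not the authors' code): a dynamic program that feeds in the ranks 1, ..., n_1 + n_2 one at a time and counts the n_1-element subsets attaining each possible sum.

    from math import comb

    def rank_sum_frequencies(n1, n2):
        # dp[m][s] = number of m-element subsets of the ranks fed in so
        # far whose sum is s; after all n1+n2 ranks, row dp[n1] holds the
        # frequencies F_i for rank sums lo, lo+1, ..., lo + n1*n2.
        N = n1 + n2
        lo = n1 * (n1 + 1) // 2
        hi = lo + n1 * n2
        dp = [[0] * (hi + 1) for _ in range(n1 + 1)]
        dp[0][0] = 1
        for t in range(1, N + 1):              # add rank t to the pool
            for m in range(min(t, n1), 0, -1): # descend: each rank used once
                for s in range(hi, t - 1, -1):
                    dp[m][s] += dp[m - 1][s - t]
        return [dp[n1][s] for s in range(lo, hi + 1)]

    # Cumulative probabilities via Eq. (9); for n1 = n2 = 7 this reproduces
    # Table 1 (first line: rank sum 28, probability 1/3432 = 0.000291).
    freqs = rank_sum_frequencies(7, 7)
    n_comb = comb(14, 7)                       # Eq. (7)
    cum = 0
    for i, f in enumerate(freqs):
        cum += f
        print(28 + i, round(cum / n_comb, 6))

This sketch stores a two-dimensional table and is therefore identical to neither published program; it merely shows the kind of recurrence on which such frequency counts rest.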
3.2. Implementation of algorithms

The arrays for obtaining F_i can be declared as normal integers (two bytes per number) for n_1 + n_2 < 12. However, to prevent 'integer overflow', these arrays must be declared as long integers (four bytes per number) for 12 ≤ n_1 + n_2 ≤ 40, and as doubles (eight bytes per number) for n_1 + n_2 > 40. Similarly, N_comb should be declared as a double. Internal memory requirements for the two algorithms are quite different. For n_1 = n_2 = 30, which already exceeds many published tables of critical rank sums, the F_i array in the recursive algorithm occupies only 11 kbytes. Meineke and De Mey's algorithm, however, when implemented in a FORTRAN program, requires about 20 Mbytes for the same group sizes. With long integers and n_1 = n_2 = 14, a compiled and linked MS-DOS FORTRAN program (to be described in Section 4) requires about 563 kbytes of free internal memory. A program of this size can still be run on most older-type MS-DOS personal computers. Memory requirements for the PASCAL program are comparable, but only for the heap size during run time. The computing speed of the algorithms is very different. A few examples are given. With long integers, the recursive algorithm in a TurboBASIC program needs 34 s for 12 + 12 elements on a Pentium (120 MHz) personal computer. Each additional element approximately doubles the computing time. Meineke and De Mey's algorithm, however, implemented in a FORTRAN program, gives almost instantaneous results for 14 + 14 elements on the same computer. On a type 5.3-1 VAX VMS computer with DEC FORTRAN and doubles, only a few seconds are required for 30 + 30 elements.
Essentially the same results were obtained with these group sizes and a PASCAL program on a Power Macintosh 8600 (604e, 250 MHz) computer. For the type of problem outlined here, the conclusions are the same as reported earlier in this journal in the framework of bioequivalence studies [11]: the advantages and disadvantages of the two algorithms are complementary. Meineke and De Mey's algorithm is very fast but requires much internal memory for relatively large groups, while the recursive algorithm requires very little internal memory but may be time-consuming for relatively large study sizes. The numerical results from the two algorithms are identical.
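The integer-size thresholds quoted above can be corroborated with the rank_sum_frequencies sketch of Section 3.1: the array entries are the F_i, so it is their maximum, not N_comb itself, that must fit the chosen integer type.

    from math import comb

    for n1, n2 in ((6, 6), (20, 20)):
        max_f = max(rank_sum_frequencies(n1, n2))
        n_comb = comb(n1 + n2, n1)
        # Do the largest frequency and N_comb fit a signed 32-bit long?
        print(n1 + n2, max_f < 2**31, n_comb < 2**31)

For n_1 + n_2 = 40 the largest F_i still fits a four-byte long integer, whereas N_comb = C(40, 20) ≈ 1.4 × 10^11 does not, which is why N_comb must be declared as a double even when the arrays hold long integers.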
4. Computer programs

Based on the theory outlined in Section 2 and the algorithms discussed in Section 3, a number of PC programs have been developed for different types of computer. All programs give exactly the same numerical results; only the layout of the results may differ slightly between programs.

1. The program MANNWHIT is based on Dromey's recursive algorithm [13], earlier also explored for the nonparametric analysis of the treatment effect of two-period crossover bioequivalence studies [10]. It has been written in TurboBASIC, compiled and linked to a program that requires only 50 kbytes of free internal memory on a PC of the 80x86 family. As exemplified in Section 3.2, it combines the advantage of requiring very little internal memory with the disadvantage of long computing times for relatively large study sizes. This program can only be recommended for computers with fast Pentium processors. Data input consists of the two group sizes and the name or path of the user-defined disk(ette) file.

2. The program WILCOCUM employs a FORTRAN-77 version of Meineke and De Mey's algorithm [12], earlier also explored for the nonparametric analogue of analysis of variance on two-treatment two-period crossover studies [11]. This program, compiled and linked with Ryan and McFarlane's FORTRAN, requires 563 kbytes of free internal memory for group sizes of up to 14 + 14 elements. A program of this size can still be handled by most older-type computers of the 80x86 family. As exemplified in Section 3.2, its execution speed is extremely fast. Data input of this program is the same as for the program discussed in the previous paragraph: the two group sizes and the name or path of the user-defined disk(ette) file. The FORTRAN-77 source code, including hints for programmers, is available to those programmers who wish to extend the program to group sizes larger than 14 + 14. This, however, is more suitable for workstations and mainframe computers, since the conventional internal memory of most MS(PC)-DOS computers would very soon be a limiting factor.

3. The most versatile implementation of Meineke and De Mey's algorithm [12] is provided by the PASCAL programs ProbMac and ProbPas. ProbMac, a Macintosh program which uses the Macintosh Toolbox, runs on both 68K and PowerPC systems; MacOS versions 7.5 up to and including 8.1 have been tested. ProbPas is an ANS (American National Standard) Pascal program which can be compiled with most Pascal compilers after a few adaptations of the source code. The programs were compiled with the Metrowerks CodeWarrior (version 2.0) Pascal compiler on a Macintosh 8600 (250 MHz) with 128 Mbytes of RAM, in 68K emulation mode. Internal memory requirements are a function of the maximum group sizes to be handled, ranging for example from 225 kbytes for 10 + 10 elements to 111 Mbytes for 55 + 55 elements. Group sizes of 65 + 65 are the largest that can be used with the ARRAYs of REALs (4 bytes, maximum value 3.4 × 10^38) used to store program data; the largest possible number is the total number of combinations in these groups, which is about 9.5 × 10^37. Larger group sizes require the data type DOUBLE (8 bytes, maximum value 1.7 × 10^308), which means more than twice the amount of internal memory needed for REALs. Run times also depend on group sizes: only a few seconds are required for 30 + 30 elements, and about 2.5 min for 55 + 55 elements.
Group sizes larger than 55 + 55 elements on the above Macintosh make use of virtual memory, which causes very long run times; for example, 60 + 60 elements (requiring 157 Mbytes of internal memory) showed a run time of about 14 h. In such cases an alternative may be found in the normal approximation, to be discussed in Section 6. This approximation, however, is not included in our programs, since these were intended to provide exact probabilities. The only data input of the PASCAL programs consists of the two group sizes; the results files on disk(ette) are generated automatically. For those interested, the PASCAL source codes are accompanied by a README file with conversion instructions for other programming environments.
5. Sample runs

The programs discussed in the previous section have in common that they produce the same results and that all results are saved to a user-defined disk(ette) file. An example of the results for equal group sizes (7 and 7) is given in Table 1, and for unequal group sizes (7 and 8) in Table 2.
Table 1
Sample run with equal group sizes

Cumulative probability distribution for Mann–Whitney/Wilcoxon's two-sample test
Disk file name: C:\WILCOXON\FOR0707.RES
Group sizes: 7 and 7

No.  Rank sum  Cumulative probability    No.  Rank sum  Cumulative probability
 1     28      0.000291                  26     53      0.549242
 2     29      0.000583                  27     54      0.597611
 3     30      0.001166                  28     55      0.644814
 4     31      0.002040                  29     56      0.689977
 5     32      0.003497                  30     57      0.732517
 6     33      0.005536                  31     58      0.772145
 7     34      0.008741                  32     59      0.808566
 8     35      0.013112                  33     60      0.841200
 9     36      0.018939                  34     61      0.870338
10     37      0.026515                  35     62      0.895688
11     38      0.036422                  36     63      0.917541
12     39      0.048660                  37     64      0.935897
13     40      0.064103                  38     65      0.951340
14     41      0.082459                  39     66      0.963578
15     42      0.104312                  40     67      0.973485
16     43      0.129662                  41     68      0.981061
17     44      0.158800                  42     69      0.986888
18     45      0.191434                  43     70      0.991259
19     46      0.227855                  44     71      0.994464
20     47      0.267483                  45     72      0.996503
21     48      0.310023                  46     73      0.997960
22     49      0.355186                  47     74      0.998834
23     50      0.402389                  48     75      0.999417
24     51      0.450758                  49     76      0.999709
25     52      0.500000                  50     77      1.000000
Table 2
Sample run with unequal group sizes

Cumulative probability distribution for Mann–Whitney/Wilcoxon's two-sample test
Disk file name: C:\WILCOXON\FOR0708.RES
Group sizes: 7 and 8

     Rank sums        Cumulative          Rank sums        Cumulative
No.  Smaller  Larger  probability    No.  Smaller  Larger  probability
     group    group                       group    group
 1     28      36     0.000155       30     57      65     0.566744
 2     29      37     0.000311       31     58      66     0.610567
 3     30      38     0.000622       32     59      67     0.652836
 4     31      39     0.001088       33     60      68     0.693706
 5     32      40     0.001865       34     61      69     0.732090
 6     33      41     0.002953       35     62      70     0.768298
 7     34      42     0.004662       36     63      71     0.801554
 8     35      43     0.006993       37     64      72     0.832168
 9     36      44     0.010256       38     65      73     0.859518
10     37      45     0.014452       39     66      74     0.884071
11     38      46     0.020047       40     67      75     0.905361
12     39      47     0.027040       41     68      76     0.924009
13     40      48     0.036053       42     69      77     0.939705
14     41      49     0.046931       43     70      78     0.953069
15     42      50     0.060295       44     71      79     0.963947
16     43      51     0.075991       45     72      80     0.972960
17     44      52     0.094639       46     73      81     0.979953
18     45      53     0.115929       47     74      82     0.985548
19     46      54     0.140482       48     75      83     0.989744
20     47      55     0.167832       49     76      84     0.993007
21     48      56     0.198446       50     77      85     0.995338
22     49      57     0.231702       51     78      86     0.997047
23     50      58     0.267910       52     79      87     0.998135
24     51      59     0.306294       53     80      88     0.998912
25     52      60     0.347164       54     81      89     0.999378
26     53      61     0.389433       55     82      90     0.999689
27     54      62     0.433256       56     83      91     0.999845
28     55      63     0.477545       57     84      92     1.000000
29     56      64     0.522455
6. Discussion

Mann–Whitney/Wilcoxon's nonparametric cumulative probability distribution provides exact probabilities for two empirical samples without ties (equal values). If ties are present, mid-ranks (average ranks) are assigned to each element of a subset of tied elements, irrespective of the group the tied element belongs to. If the number of ties is relatively small, Mann–Whitney/Wilcoxon's probabilities are good approximations. In cases with no or relatively few ties, good approximations can also be obtained from a normal distribution without any corrections. As discussed in almost every textbook on statistics, this is based on the fact that if either group size is at least 20, the rank sums follow an almost perfect normal distribution with mean n_1(n_1 + n_2 + 1)/2 for the smaller group and n_2(n_1 + n_2 + 1)/2 for the larger group, and variance n_1 n_2 (n_1 + n_2 + 1)/12. For group sizes smaller than 20 the normal approximation is still good or acceptable in many cases. For samples with relatively many ties, however, the exact nonparametric probability distribution differs from data set to data set, because it depends not only on the group sizes but also on the number and the distribution of ties over the two groups. The calculation of such exact probability distributions would be extremely cumbersome. In such cases a good solution is found in the normal approximation extended with appropriate corrections for ties.
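A minimal Python sketch of this approach (ours, for illustration; the helper name is arbitrary) assigns mid-ranks, applies the tie-corrected variance in its standard textbook form [6,14,15], and includes the continuity correction discussed below:

    from math import sqrt
    from statistics import NormalDist

    def normal_approx_p(x, y, continuity=True):
        # Lower-tail normal approximation P{rank sum of sample x <= observed}.
        n1, n2 = len(x), len(y)
        N = n1 + n2
        pooled = sorted(x + y)
        # Mid-rank of a value = average of the ranks of its tied block.
        first, count = {}, {}
        for idx, v in enumerate(pooled, start=1):
            first.setdefault(v, idx)
            count[v] = count.get(v, 0) + 1
        rank = {v: first[v] + (count[v] - 1) / 2 for v in count}
        r1 = sum(rank[v] for v in x)
        mean = n1 * (N + 1) / 2
        ties = sum(t**3 - t for t in count.values())   # tie correction term
        var = n1 * n2 / 12 * ((N + 1) - ties / (N * (N - 1)))
        cc = 0.5 if continuity else 0.0                # continuity correction
        return NormalDist().cdf((r1 - mean + cc) / sqrt(var))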
The equations can be found in almost every textbook covering nonparametric statistics [6,14,15]. It should be noted that the normal approximation can be improved, especially for relatively small group sizes, by a continuity correction (De Jonge [15]). This is because rank sums are discretely distributed while the normal distribution is a continuous function.

Tabulated critical rank sums of Mann–Whitney/Wilcoxon's two-sample test can be seen as a subset of the complete set of rank sums and their cumulative probabilities as obtained with our programs. For example, in Geigy's Scientific Tables [7] the critical rank sums for 7 + 7 elements and 2α = 0.100 are tabulated as 39 and 66. As shown in our Table 1, the lower rank sum of 39 is associated with the cumulative probability 0.04866, which is the largest value ≤ 0.050. As can also be seen from Table 1, the tabulated upper rank sum of 66, however, is not associated with a cumulative probability of 0.95134, which is the smallest value ≥ 0.950, but with the larger probability of 0.96358. It should be noted that this is not due to a lack of agreement between the two methods. This follows from the fact that Prob{R_i ≥ 66} = 1 − Prob{R_i < 66} = 1 − Prob{R_i ≤ 65}. The probability of R_i ≤ 65 is read from Table 1 as 0.95134, so that Prob{R_i ≥ 66} = 1 − 0.95134 = 0.04866. Therefore, the probability that the rank sum is 66 or higher equals the probability that the rank sum is 39 or lower. Its value of 0.04866 is the largest probability ≤ 0.050, which also illustrates the symmetry of the nonparametric probability distribution. The same conclusion can be obtained by applying Eq. (9). This numerical example is easily generalized as follows:

1. The probability that a rank sum is equal to or smaller than a certain value with index i is the cumulative probability with index i.
2. The probability that a rank sum is equal to or larger than a certain value with index i is 1 minus the cumulative probability with index i − 1.

Conversely, the lower and upper critical rank sums at the α-level can be obtained from the cumulative probability distributions as follows:
1. The lower critical rank sum at the α-level, R_α, is the largest rank sum with cumulative probability ≤ α; its index is designated i_α.
2. The upper critical rank sum at the α-level, R_{1−α}, is not the smallest rank sum associated with a cumulative probability ≥ 1 − α, but the next higher rank sum, which has the index i_{1−α} = n_1 n_2 − i_α + 2.

7. Validation

In order to validate our programs, we extracted lower and upper critical rank sums from the cumulative probabilities by using procedures 1 and 2 discussed in the previous section (a sketch of this extraction is given below). These values were compared with the critical rank sums of the seven tables of the nonparametric two-sample test in the Eighth Edition of Geigy's Scientific Tables [7]. In this context it should be noted that earlier editions of Geigy's tables should not be used for such comparisons, since the Eighth Edition is the result of a critical revision of earlier versions. No differences in critical rank sums were found between the results from our programs and those in Geigy's Eighth Edition.
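A sketch of this extraction in Python (again ours, building on the rank_sum_frequencies sketch of Section 3; the function name is illustrative, not the published code) is:

    from math import comb

    def critical_rank_sums(n1, n2, alpha):
        freqs = rank_sum_frequencies(n1, n2)
        n_comb = comb(n1 + n2, n1)
        cum, cum_probs = 0, []
        for f in freqs:
            cum += f
            cum_probs.append(cum / n_comb)
        lo = n1 * (n1 + 1) // 2
        # Procedure 1: largest rank sum with cumulative probability <= alpha
        # (assumes at least one cumulative probability <= alpha exists).
        i_alpha = max(i for i, p in enumerate(cum_probs, start=1) if p <= alpha)
        # Procedure 2: upper index i_{1-alpha} = n1*n2 - i_alpha + 2 (1-based).
        return lo + i_alpha - 1, lo + n1 * n2 - i_alpha + 1

    print(critical_rank_sums(7, 7, 0.05))   # (39, 66), as in Geigy [7]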
8. Availability of programs

Free copies of all programs are available on request by e-mail or by sending a formatted blank 90 mm (3.5 in.), 1.4 Mbytes diskette to one of the authors. Also on request, free copies of the FORTRAN-77 source code (including hints for programmers) and of the PASCAL source code (including a detailed README file) are available for those who wish to adapt the programs to specific requirements of their hardware or programming environment.
Acknowledgements The authors wish to thank a reviewer for useful comments which considerably improved an earlier version of this paper.
References

[1] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1 (1945) 80–83.
[2] C. White, The use of ranks in a test of significance for comparing two treatments, Biometrics 8 (1952) 33–41.
[3] D. van der Reyden, A simple statistical significance test, Rhodesia Agri. J. 49 (1952) 96–104.
[4] L. Festinger, The significance of difference between means without reference to the frequency distribution function, Psychometrika 11 (1946) 97–105.
[5] H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, Ann. Math. Stat. 18 (1947) 50–60.
[6] W.J. Conover, Practical Nonparametric Statistics, second ed., John Wiley and Sons, 1980, pp. 223–225.
[7] Wissenschaftliche Tabellen Geigy, Teilband Statistik, Eighth ed., Basle, 1980, pp. 156–162.
[8] D. Hauschke, V.W. Steinijans, E. Diletti, A distribution-free procedure for the statistical analysis of bioequivalence studies, Int. J. Pharmacol. Ther. Tox. 28 (1990) 72–78.
[9] D. Hauschke, V.W. Steinijans, Ein verteilungsfreies Verfahren zur statistischen Auswertung von Bioäquivalenzstudien, in: J. Vollmar (Ed.), Biometrie in der chemisch-pharmazeutischen Industrie, Band 5: Statistische Beurteilung der Bioäquivalenz sofort freisetzender Arzneiformen, Gustav Fischer, Stuttgart, 1991, pp. 69–89.
[10] H.P. Wijnand, On the assessment of bioequivalence in a two-period cross-over design, Comput. Meth. Progr. Biomed. 37 (1992) 151–157.
[11] H.P. Wijnand, Bioequivalence revisited: nonparametric analysis of two-period cross-over studies, Comput. Meth. Progr. Biomed. 40 (1993) 249–259.
[12] I. Meineke, C. de Mey, The assessment of bioequivalence in a two-period cross-over design: development of a simple BASIC program, Comput. Meth. Progr. Biomed. 35 (1991) 151–153.
[13] R.R. Dromey, Problemen oplossen met de computer, Academic Service, Schoonhoven, The Netherlands, 1987. Translated from: How to Solve It by Computer, Prentice-Hall International, London, UK, 1982.
[14] E.L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks, McGraw-Hill, 1975.
[15] H. de Jonge, Inleiding tot de Medische Statistiek, Nederlands Instituut voor Praeventieve Geneeskunde, Leiden, The Netherlands, 1963, p. 299.