20 Nonparametric analysis of experiments

20 Nonparametric analysis of experiments

S. Ghosh and C. R. Rao, eds., Handbook of Statistics, Vol. 13 © 1996 Elsevier Science B.V. All rights reserved. '~N ~.alkl Nonparametric Analysis o...

3MB Sizes 8 Downloads 77 Views

S. Ghosh and C. R. Rao, eds., Handbook of Statistics, Vol. 13 © 1996 Elsevier Science B.V. All rights reserved.

'~N

~.alkl

Nonparametric Analysis of Experiments

A. M. Dean and D. A. Wolfe

1. Overview

1.1. Introduction Traditional analysis of a factorial experiment is based on the assumption of an underlying normal distribution for the error variables. Data transformation procedures may help to correct for deviations from normality, but possibly at the expense of equality of variances of the error variables. Similarly, data transformation procedures commonly used for equalizing the error variances may result in the violation of the normality assumption. A nonparametric analysis avoids the necessity of specifying any particular error distribution, although most such procedures do require equality of error variances. Nonparametric procedures are generally easy to apply and, even when the normality assumption is valid, they are reasonably powerful. In this article, we illustrate some simple nonparametric procedures that can be used to analyse factorial experiments. Some of these techniques are available in texts such as that by Hollander and Wolfe (1973), while others have appeared more recently in the literature. Most of the procedures that we shall discuss are hypothesis testing or decision procedures. Where confidence intervals are available, we shall discuss those as well. For simplicity, we have chosen to concentrate mostly on procedures applied to ranks of the data values. For procedures applied to other types of scores (functions of ranks), the reader is referred to Randles and Wolfe (1979) or Hettmansperger (1984), and for other rank-based methods, see Draper (1988). For all of the situations we shall consider, the model is of the form Model:

Assumptions:

Yi = ~-~j=l P xiJ~/J + Ei, i = 1 , . . . , N, where Yi is the ith observation, 7 1 , . . . , ~ p are unknown parameters, xij is the known design variable, Ei is the random error associated with the ith observation, and where, in general, 1. El, E2,. •., EN are mutually independent. 2. The distributions of El, E2,. •., EN are the same (in particular, the variances are equal). 3. Ei has a continuous distribution, i = 1 , . . . , N. 705

706

A. M. Dean and D. A. Wolfe

In particular, the assumption of continuous error distributions ensures that the relevant permutations of ranks are equally likely. However, this uniform distribution on the permutations is also implied, to a reasonable approximation, by most discrete distributions with adequate support (that is, sufficiently large number of possible values for the errors). Therefore, the procedures we discuss are also approximately valid for most discrete data. The material in this article is organized around a series of experiments which are described in the following subsections. The reader is invited to look at these descriptions first, and then to follow the references to the sections containing the details of the various nonparametric procedures. Section 2 is devoted to the analysis of the one-way layout; that is, experiments with responses which can be modelled by the effect of a single factor, called the treatment factor. In the case of factorial experiments, the one-way layout analysis can be used to compare the effects of different treatment combinations. We shall use the data from Experiments 1.2 and 1.3, described in the following subsections, to illustrate various nonparametric procedures applicable to the one-way layout. Experiment 1.2 is used to illustrate a factorial experiment with three factors and a control treatment combination. Experiment 1.3 involves a single quantitative treatment factor with unequally spaced levels. Experiments 1.4, 1.5 and 1.6 have two treatment factors each and are analysed as two-way layouts in Section 3. The first two of these experiments have two observations per treatment combination (i.e., per cell), while the third has only one. The procedures illustrated for two treatment factors are equally relevant when one factor is a block factor, that is, for complete block designs. Experiment 1.7 is used to illustrate the balanced incomplete block design setting. Experiment 1.2 is used again in Section 4 to illustrate analysis procedures for three or more treatment factors. Fewer nonparametric procedures are currently available for three-or-more-factor settings than for the one- or two-factor settings. However, we will discuss a rank-based general linear model procedure which is quite flexible.

1.2. E x p e r i m e n t - one-way layout; general alternatives

The following experiment was described by Raghu Kackar and Anne Shoemaker in the book edited by Dehnad (1989). The experiment was set up to investigate the robustness of the settings of eight "design" factors ( A - H ) to the thickness of deposition of an epitaxial layer on a polished silicon wafer. Thickness measurements were taken on wafers in 14 different positions in the bell jar in which the deposition took place and at 5 places on each wafer (see the original reference for details). The position variables were treated as noise factors and the mean thickness and the log sample variance of the thickness were measured over the 70 position combinations (noise levels) for each combination of the eight design factors. An analysis of the thickness variability was done to investigate which settings of the design factors were more sensitive to the noise factors, and an analysis of the mean thickness was done to investigate which factors could be adjusted to produce the target thickness.

Nonparametric analysis of experiments

707

Table 1 Thickness of epitaxial layer, Dehnad (1989) Mean thickness

Log variance of thickness

14.821 14.888 14.037 13.880 14.165 13.860 14.757 14.921 13.972 14.032 14.843 14.415 14.878 14.932 13.907 13.914

--0.443 -1.199 -1.431 -0.651 -1.423 -0.497 -0.327 -0.627 -0.347 -0.856 -0.437 --0.313 -0.615 --0.229 -0.119 -0.863

A

/3

C

D

E

F

G

H

-1 -1 -1 -1 -1 -1 -I -1 1 1 1 1 1 1 1 1

-1 -1 -1 -1 1 1 1 1 -1 -1 -I -1 1 1 1 1

-1 -1 1 1 -1 -1 1 1 -1 -1 1 1 -1 -1 1 1

-1 -1 1 1 1 1 -1 -1 1 1 -1 -1 -1 -1 1 l

-1 1 -1 1 -1 1 -1 1 -1 1 -1 1 --1 1 --1 1

-1 1 -1 1 1 -I 1 -1 1 -1 1 -1 -1 1 -i 1

--1 1 1 -1 -1 I 1 -1 1 -1 -I 1 1 -1 -1 1

--1 1 1 -1 I -1 --1 1 -1 1 l --1 1 -1 --1 1

The design was a fractional factorial experiment with 16 observations in total. Contrast estimates for m a i n effects were calculated in the original article, but no formal testing was done. The design and response variables are reproduced in Table 1. The codes - 1 and 1 m e a n that the factor is at its first and second levels, respectively. To illustrate techniques for the o n e - w a y layout, in Section 2.2.6 we consider just factors A and H - a total of 4 treatment combinations, each observed four times, and we use log variance thickness as the response variable. Factor A was rotation m e t h o d (continuous, oscillating) and factor H was nozzle position (2, 6). If we represent the effect on the response of the j t h treatment c o m b i n a t i o n as Tj ( j = 1,2, 3, 4) then a starting point in the analysis of this set of data might be to test the null hypothesis that the four treatment c o m b i n a t i o n s give rise to the same log variance thickness against a general alternative hypothesis; that is, Ho: [~-1 = ~-2 = r3 = r4]

versus

H i : [T1,T2, T3, r4 are not all equal].

The w e l l - k n o w n Kruskall-Wallis procedure can be used to test H0 against H1, and this is described in Section 2.2.1. The Kruskall-Wallis procedure requires equality of variances of the error variables. The R u s t - F l i g n e r procedure for the B e h r e n s - F i s h e r p r o b l e m of testing H0 against H1 when the error variances differ is presented in Section 2.2.2. A follow-up (or alternative) to the Kruskall-Wallis test is the simultaneous investigation of the equality or non-equality of each pair of treatment effects zv, ~'~. A m u l t i p l e comparison decision procedure which controls the experimentwise error rate (the probability of m a k i n g at least one incorrect decision) is given in Section 2.2.3. S i m u l t a n e o u s confidence intervals for the pairwise comparisons ~-v - 7~ are discussed

708

A. M. Dean and D. A. Wolfe

Table 2 Number of revertant colonies for Acid Red 114, TA98, hamster liver activation, Simpson and Margolin (1986) Dose (~g/ml) 0

100

333

1000

3333

10000

22 23 35

60 59 54

98 78 50

60 82 59

22 44 33

23 21 25

m

in Section 2.2.5. General contrasts ~jm__1 CjTj, Y]j=I Cj = 0, are often of interest, and these include the pairwise comparisons as a subset. Point estimators for these contrasts are available and, for these, the reader is referred to SpjCtvoll (1968) or Hollander and Wolfe (1973, Section 6.4). Prior to this particular experiment, factors A and H had been set at the levels "oscillating and 4", respectively. Let us suppose that "oscillating, 6", which is the treatment combination (1, 1), is the combination which requires least change to the original manufacturing process. We may then regard this combination as a control, or standard, treatment. In this case, comparisons of the other three treatments with the control treatment would be of particular interest. This is called the treatment versus control multiple comparison problem and is discussed in Section 2.2.4. All of these procedures are illustrated for the wafer experiment in Section 2.2.6.

1.3. E x p e r i m e n t - one-way layout, ordered alternatives

Simpson and Margolin (1986) present data from three replicated Ames tests in which plates containing Salmonella bacteria of strain TA98 were exposed to m = 6 different doses of Acid Red 114. The response variable is the number of visible revertant colonies observed on each plate. The data from the first replicate are reproduced in Table 2. If we represent the effect on the response of the jth dose of Acid Red 114 as Tj, then a starting point in the analysis of these data might be to test the null hypothesis

H0: [7, ..... against the general alternative hypothesis that the different doses do not give the same average number of visible revertant colonies. An appropriate test for this hypothesis is discussed in Section 2.2.1. On the other hand, one may expect a priori that the number of colonies will increase as the dose increases, at least over the lower dose range. Thus, if we just consider the part of the experiment up to and including 1000 #g/ml, we may prefer to test the null hypothesis against the specific ordered alternative hypothesis //2:[~-1 ~< ~-a N Z~ ~< T4, with at least one strict inequality].

Nonparametric analysis of experiments

709

The Terpstra-Jonckheere test of H0 against//2 is presented in Section 2.3.1. A followup to such a test is to investigate simultaneously the possible equality versus preconceived ordering of each pair of treatment effects 7-j, 7"q. A multiple comparison decision procedure which controls the experimentwise error rate in this ordered setting is the Hayter-Stone procedure discussed in Section 2.3.2. The corresponding procedure for obtaining simultaneous confidence bounds for pairwise comparisons 7"v -7"~(1 ~< u < v ~< 4) is given in Section 2.3.3. All of these procedures are illustrated for the Acid Red experiment in Section 2.3.5. With further knowledge about drug testing, it may be reasonable to expect that, eventually, a drug dose will become toxic and the response will begin to decline. In this case, an umbrella alternative hypothesis may be of interest; that is //.3." [7"1 ~ "'" ~ 7"p--1 ~ 7"p ~ "'" ~ 7"6, with at least one strict inequality],

where the location of the "peak", p, of the umbrella (turning point) may or may not be known. Tests of H0 against an umbrella alternative hypothesis ~ will be discussed in Section 2.3.4, and illustrated for the Acid Red experiment in Section 2.3.5.

1.4. Experiment - two-way layout, with interaction, multiple observations per cell The data shown in Table 3 were given by Anderson and McLean (1974) and show the strength of a weld in a steel bar. The two factors of interest were time of welding (total time of the automatic weld cycle) and gage bar setting (the distance the weld die travels during the automatic weld cycle). We let ~Tij denote the expected response at the ith gage bar setting and jth welding time. It is often convenient to express this as rlij = /3i + 7-j + (¢3~-)ij, where Pi represents the effect of the ith gage bar setting, 7-j represents the effect of the jth time of welding, and (/37-)ij represents the effect of the ith gage bar setting and jth welding time that is not accounted for by the "additive model" rlij ---/3i + Tj. The term (/37-)ij is often called an interaction term. In Section 3.2.1, we discuss procedures that have been proposed for testing the hypothesis of additivity; that is, H~da: [rlij = /3i + 7"5, for all i and j] against an alternative hypothesis of non-additivity. We also present there an alternative procedure that tests for rank interaction in the two-way layout. Rank interaction is present if the ranking of the levels of one of the factors is not the same for each of the levels of Table 3 Strength of weld, Anderson and McLean (1974) Time of welding (j) Gage Bar Setting

i 1 2 3

1 10 12 15 t9 10 8

2 13 17 14 12 12 9

3 21 30 30 38 10 5

4 18 16 15 11 14 15

5 17 21 14 12 19 11

710

A. M. Dean and D. A. Wolfe

the other factor. In terms of the general model for a two-way layout, a test for rank interaction is a test of the pair of null hypotheses [71 + (97) 1 = 72 +

.

.

.

.

+

.

for i = 1 , . . . , b ] and g ~ *('r): []31 Jr- (]~T)Ij = 1~2 -I- ( ~ T ) 2 j . . . . .

~b -]- (]~T)bj,

for j = 1 , . . . , m ] . The general alternative hypotheses are that the corresponding parameters are not all equal. It is possible for rank interaction to show in one direction but not the other; that is, Ho *(n) might be rejected but not H0n*(~'), or vice versa.

1.5.

Experiment - two-way layout, no interaction, multiple observations p e r cell

Wood and Hartvigsen (1964) present the design and analysis of a qualification test program for a small rocket engine. In Table 4, we reproduce their data for the thrust duration of a rocket engine under absence or presence of altitude cycling (factor C at levels 1 or 2) and four different temperature levels (factor D at levels 1, 2, 3 or 4). We have ignored two other factors (which their analysis showed to be nonsignificant) for the purposes of illustration of the nonparametric methods. We let 13~,7j and (/37)~j denote, respectively, the effect on thrust duration of the ith level of altitude cycling, the jth level of temperature and their interaction. The RGLM test described in Section 4.3, can be used to show that the interaction between the two factors is negligible. Therefore it is of interest to test the main effect hypotheses H0~: [ill

~2]

=

and

H~': [T1 = T2 = 73 = 7-4]

against their respective general alternative hypotheses. Main effect tests against a general alternative hypothesis in the two-way layout are discussed in Section 3.2.3, and corresponding multiple comparison procedures are presented in Section 3.2.4. Table 4 Thrust duration of a rocket engine under varying conditions of altitude cycling and temperature Temperature 1

3

4

1

21.60 21.60

21.09 19.57

11.54 11.75

11.14 11.69

19.09 19.50

21.31 20.11

13.11 13.72

11.26 12.09

2

21.60 21.86

22.17 21.86

11.50 9.82

11.32 11.18

21.08 21.66

20.44 20.44

11.72 13.03

12.82 12.29

Altitude Cycling

2

Nonparametric analysis of experiments

711

1.6. Experiment - two-way layout, one observation per cell The data given in Table 5 form part of an experiment, described by Wilkie (1962), to examine the position of maximum velocity of air blown down the space between a roughened rod and a smooth pipe surrounding it. The two factors of interest were the height of ribs on the roughened rod at equally spaced levels 0.010, 0.015 and 0.020 inches (coded 1, 2, 3) and Reynolds number at six levels (coded 1-6) equally spaced logarithmically over the range 4.8 to 5.3. The responses (positions of maximum air velocity) were measured as y = ( d - 1.4) x 103, where d is the distance in inches from the center of the rod. Let ~ j denote the expected response at the ith rib height and jth Reynolds number. Then, as in Experiment 1.4, a null hypothesis of additivity can be expressed as

H~dd:

[7]ij = ~i + q-j, for all 1 ~< i ~< 3 and 1 ~< j ~< 6].

The general alternative hypothesis of non-additivity is H~aa: [~ij ¢/3i + "rj, for at least one ( i , j ) combination, l~
and

H~-: [r~ . . . . .

r6]

against their respective general alternative hypotheses H~: [/3i not all equal] and H~-: [~-j not all equal]. The null hypothesis Ho~ is the hypothesis that the effects of the rib heights are the same (taken as an average over Reynolds numbers) and the null Table 5 Position of maximum air velocity, Wilke (1962)

Rib Height

i

1

i 2 3

-24 33 37

Reynolds number (j) 2 3 4 5 -23 28 79

1 45 79

8 57 95

29 74 101

6 23 80 111

A. M. Dean and D. A. Wolfe

712

hypothesis H~ is the hypothesis that the effects of the different Reynolds numbers are the same (taken as an average over rib heights). The Friedman test for these hypotheses is described in Section 3.3.2, and a multiple comparisons procedure for comparing the effects of pairs of rib heights or pairs of Reynolds numbers is given in Section 3.3.3. The procedures are illustrated in Section 3.3.4. Since the levels of Reynolds number are ordered, an ordered alternative main effect hypothesis might be of interest rather than a general alternative hypothesis; that is, one may wish to see if the location of maximum velocity is further from the center of the rod as the Reynolds number increases. The alternative hypothesis would then be H I : [~q <~ ... ~< ~-6,with at least one strict inequality]. A test of H~ against H~ is discussed in Section 3.3.5, and a multiple comparisons procedure in Section 3.3.6. The procedures are illustrated in Section 3.3.7. All of the procedures discussed in Section 3 are also applicable to the randomized block design setting, where an additive model is often reasonable.

1.7. Experiment- incomplete layout The nonparametric procedures used for experiments with two treatment factors or for randomized block designs are not suitable when one or more of the cells are empty. We shall use the following experiment, described by Kuehl (1994) and run by J. Berry and A. Deutschman at the University of Arizona, to illustrate the test of the hypothesis of no treatment effects against a general alternative hypothesis in the special case of a balanced incomplete block design. The experiment was run to obtain specific information about the effect of pressure on percent conversion of methyl glucoside to monovinyl isomers. The conversion is achieved by addition of acetylene to methyl glucoside in the presence of a base under high pressure. Five pressures were examined in the experiment, but only three could be examined at any one time under identical experimental conditions, resulting in a balanced incomplete block design. The data and design are shown in Table 6. Under the assumption of additivity (no interaction between the treatment and the block factors), we discuss in Section 3.4.1 a procedure that can be used to test the

Table 6 Percent conversion of methyl glucoside to monovinyl isomers, Keuhl (1994) Block Pressure(psi)

I

II

250 325 400 475 550

16 18

19

32

III

26 39 46 45

61

IV

V

VI

VII

VIII

IX

20

13 13 34

21

24 10

19 21 35 55

47 48

33 31

30 52

50

X

24 31 37

Nonparametric analysis of experiments

713

null hypothesis of no difference in percent conversion of the different pressures, that is Hi:

.....

against a general alternative hypothesis that at least two of the pressures differ. In Section 3.4.3, we give a multiple comparison decision procedure for detecting differences between treatments in the balanced incomplete block design setting. These procedures are illustrated for the above experiment in Section 3.4.4. A test for H~- against a general alternative in an arbitrary two-way layout (that is, a two-way layout that does not necessarily satisfy the conditions of a balanced incomplete block design) is described in Section 3.4.2.

1.8. E x p e r i m e n t - h i g h e r w a y l a y o u t s

For an example of a three-way layout, we revisit Experiment 1.2. The currently available nonparametric procedures for analyzing factorial experiments with three or more factors all require at least one observation per cell. Consequently, for illustration purposes, we will analyze factors C, D and E only, allowing two observations in each of the eight cells. All factorial experiments can be analysed using the techniques of the one-way layout, as illustrated in Section 2.2.6 with factors A and H from Experiment 1.2 and the log variance response. In Section 4.2, we show a method of using two-way layout techniques to analyse factors C, D and E of this experiment using the average thickness response. The RGLM method, discussed in Section 4.3, is completely general, and can be used for any factorial experiment with multiple observations per treatment combination.

2. One-way layout 2.1. I n t r o d u c t i o n

In the setting of the one-way layout, we are generally interested in detecting whether there are any differences in the probability distributions of m populations based on data contained in independent random samples from each of them. These populations may correspond to m levels of a single treatment factor or m treatment combinations in a factorial experiment. Let Y I j , • • •, Y n j j , j = 1 , . . . , m , represent independent random samples from the m populations, where the jth population has cumulative distribution function (c.d.f.) Fj. Then, the null hypothesis that the distributions of the treatment populations are identical corresponds to F1 =- . . . = F m . In the one-way layout setting, the variety of possible alternative hypotheses almost always correspond to differences in locations for the m population distributions. As a result of this, we specialize our model assumptions and assume that the distribution functions F1,. •., Fm are connected by

A. M. Dean and D. A. Wolfe

714

the relationship F j ( t ) = F ( t - r j ) , - e ~ < t < oc, for j = 1 , . . . ,m, where F is a distribution function for a continuous probability distribution with unknown median 0 and where rj represents the unknown "treatment effect" for the jth population. These assumptions are equivalent to the model formulation Yij=O+rj+Eij,

i=l,...,njandj=l,...,m,

(1)

where 0 is the common median, rj is the effect of the jth treatment and the error variables Eij form a random sample from any continuous distribution with median 0. In this section and using model (1), we are interested in testing the "location" null hypothesis

Ho: [q . . . . .

~-,~]

(2)

versus alternative hypotheses that correspond to a variety of different (non-null) relationships between the rj's. As mentioned in Section 1.1, the procedures discussed in this section are also approximately valid for discrete data, provided that the support of the error distribution is sufficiently large. Distribution-free procedures in the one-way layout setting are based on the fact that, under the hypothesis that F1 =-- ... = F m = F (which is implied by (1) and (2)), every ordered arrangement of the sample observations Y n , . . . , Yn,~m is equally likely. Let N = nl + ... + n,~ denote the total number of observations and let R i j denote the rank (from least to greatest) of Y~j among all N observations. Let R = (Rll,. •., Rn11, R12, . . . , Rn22, ... , R I , ~ , . . . , Rn,~m) be the vector of these "joint ranks." Thus, in the absence of ties, R is a random permutation of the available joint ranks ( 1 , 2 , . . . , N ) . Under F1 = ... = F,~ = F, any test statistic that is a function of the response variables Y~j only through their joint ranks R will have the same probability distribution under H0 (2), regardless of the form of the common underlying continuous F. Such procedures based on the joint ranks R of the Yij's are, therefore, distribution-free tests of the null hypothesis H0 (2) and they control the type I error rate to be equal to the specified rate a over the entire class of continuous distributions F.

2.2. One-way layout - general alternatives 2.2.1. Test designed f o r general alternatives when the variances are equal Kruskal and Wallis (1952) proposed a test procedure that is designed to detect the broad class of alternatives to H0 (2) corresponding to

Hi: [not all ~-j's are equal].

(3)

The associated test statistic is a direct analogue of the usual classical one-way layout F-test with the original observations replaced by their joint ranks.

Nonparametric analysis of experiments

715

We use the notation R.j = (R U + . . . + t ~ j ) for the sum of the joint ranks assigned to the n j observations on the j t h treatment (j = 1 , . . . , m ) a n d / { . j = R.j/nj for their average. The Kruskal-Wallis statistic H for testing H0 (2) versus the general alternative//1 (3) is based on the sum of the squared distances of the average ranks from their common null expected value ( N + 1)/2. The specific form of the statistic is m

H = [12/N(N -t- 1)] ~

nj(/~ 4 -(N

+ 1)/2) 2

j=l m =

(12/N(N+I))ER2.j/nj

-3(N+l)"

(4)

j=l

The associated test of H0 versus H1 at significance level a is reject Ho if and only if H >~h,~(m, nl,... ,nm),

(5)

where ha (m, h i , . • . , r i m ) is the upper ath percentile for the distribution of the statistic H under the null hypothesis H0 (2). The complete null distribution of H is available in Hollander and Wolfe (1973) for m = 3 and ni = 1 , . . . , 5, i = 1,2, 3. Additional tables of ha(m, n l , . . . , nm) values for a ~< 0.10 are provided in Iman et al. (1975) for m = 3, ni ~< 6, i = 1,2,3; ra = 3, ni = n2 = •3 = 7,8; m = 4, ni ~ 4, i = 1 , 2 , 3 , 4 ; m = 5, ni ~< 3, i = 1 , . . . , 5 . Finally, values o f h ~ ( r n , n a , . . . , n m ) for a = 0.01, 0.05, and 0.10 are presented in Mosteller and Rourke (1973) for m = 3, nl + n 2 + n 3 ~< 13 and m = 4, nl -}- n2 -}- n3 -[- n4 ~ 9. When either the number of treatments or the sample size configurations are beyond the limits of these existing tables, we use the asymptotic distribution of H to provide approximate critical values for this test procedure. When the null hypothesis H0 (2) is true, the statistic H has an asymptotic distribution [as m i n ( n a , . . . , rim) -4 c~] which is chi-square with m - 1 degrees of freedom. Thus the large sample approximation to procedure (5) is reject H0 if and only if H >. x2(m - l ) ,

(6)

where X 2 ( m - 1) is the upper ath percentile of the chi-square distribution with m - 1 degrees of freedom. Calculation of the Kruskal-Wallis statistic H is available in a variety of software packages. These include, but are not limited to, M i n i t a b ' s KRUSKAL-WALLIS command, StatXact's K W / M O command, SAS's N P A R I W A Y or PROC R A N K - P R O C A N O V A commands, and IMSL's KRSKL command. An example of the Kruskal-Wallis test using Experiment 1.2 is given in Section 2.2.6.

A.M. Dean and D. A. Wolfe

716

2.2.2. Test designed for general alternatives when the variances may not be equal; The k-sample Behrens-Fisher setting Two of the basic assumptions associated with the test procedure discussed in Section 2.2.1 are that the underlying distributions belong to the same common family (F) and that they differ within this family at most in their medians. The less restrictive setting where these assumptions are relaxed to permit the possibility of differences in scale parameters as well as medians (but still requiring the same common family F ) is known as the k-sample Behrens-Fisher problem. Rust and Fligner (1984) proposed a modification of the Kruskal-Wallis statistic to deal with this broader Behrens-Fisher setting when general alternatives (3) are of interest. Defining

djj, --=P(Ylj > Ylj,),

for all j ¢ j ' = 1 , . . . , m,

(7)

the Rust-Fligner procedure is designed as a test of the less restrictive null and alternative hypotheses

H~: [djj, : 1

for all j 7~ j ' :

1,...,m]

(8)

versus

H~: [djj,• 1 f o r a t l e a s t o n e j ¢ j ' = l , . . . , m ] .

(9)

Define

Uj = [ n j / N ( N + I ) ] [ / ~ . j - ( N + I)/2],

for j =

1,...,m,

where/~.j is the average rank (from the joint ranking of all N observations) for the jth treatment. Set U = [U1, • • •, Urn]~and let Condition C denote the sample outcome that the minimum value in each of the m samples is less than the maximum value in each of the other samples (that is, every pair of samples overlaps). Then, the Rust-Fligner test statistic is defined to be

Q* =

[m

]

H ( n j - 1)/nj N U ' A - U ,

provided Condition C occurs,

j=l = 0% if Condition C does not occur,

(10)

where .4 is a sample estimator for the asymptotic covariance matrix for the vector N1/2U and -4- is the Moore-Penro~ generalized inverse of A (see Rust and Fligner, 1984, for computational details for A - ) . The statistic Q* is exactly distribution-free under the more restrictive null hypothesis H0 (2), bfit exact null distribution critical values for small sample sizes are not available in the literature. The large sample approximate level a Rust-Fligner test of the relaxed hypothesis H~ (8) versus H{ (9) is given by reject H~ if and only if Q* >~xa(m - 1),

(11)

Nonparametricanalysisof experiments

717

where XZ~( m - 1) is the upper ath percentile for the chi-square distribution with m - 1 degrees of freedom. As far as we know, the Rust-Fligner statistic Q* is not an option on any available software package. An example illustrating the calculation of Q* is not included in this paper because of space limitations.

2.2.3. Two-sided all treatments multiple comparisons based on joint rankings In this subsection, we present a multiple comparison procedure that is designed to make two-sided decisions about all m ( m - 1)/2 pairs of treatment effects. The procedure was suggested by Nemenyi (1963) and critical values for equal sample sizes were supplied by Nemenyi (1963) and McDonald and Thompson (1967). Additional critical values for more general sample sizes were later given by Damico and Wolfe (1987). In the context of this paper, the procedure is to be viewed as an appropriate follow-up procedure to the general alternatives Kruskal-Wallis test of Section 2.2.1. Since multiple comparison procedures are often used following the rejection of H0, it is not uncommon to see the use of experimentwise error rates larger (e.g.,. 10 or. 15) than would be used as significance levels for hypothesis tests. It often requires an experimentwise error rate higher than the p-value associated with the previous test in order to find such differences among the individual pairs of treatments. At an experimentwise error rate of a, the Nemenyi-McDonald-ThompsonDamico-Wolfe two-sided all-treatments multiple comparison procedure based on joint rankings reaches its m ( m - 1)/2 decisions through the criterion decide ~-,, ¢ ~-~ if and only if N*]/~.,~-/~.v] >~9=(m, n l , . . . , n , ~ ) ,

(12)

where /~j is the average rank for the jth treatment in the joint ranking of all N observations, for j = 1 , . . . , m, and N* is the least common multiple of n l , . • •, nm, and the critical value 9~(m, n l , . . . , nm) satisfies the probability restriction

Po(N*I/~.~ -/~.~l < 9=(m,n~,... ,nm), u=l,...,m-1;v=u+l,...,m)=l-oe,

(13)

with the probability P0(') being computed under H0 (2). Equation (13) corresponds to the guarantee that the probability is 1 - c~ that we will make all of the correct decisions under the strict null hypothesis that all of the treatment effects are equal. Values of 9~(m, n l , . . . , n m ) can be found in Damico and Wolfe (1987) for available experimentwise error rates c~ closest to but not exceeding .001, .005, .01(.005).05(.01).15 and useful combinations of either m = 3 and 1 ~< nl ~< n2 ~< n3 ~< 6 or m = 4 and 1 ~< nl ~< n2 ~< n3 ~< n 4 ~ 6. When either the number of treatments or the sample size configurations are beyond these existing tables for g~(m, n l , . . . , n~), we must consider a large sample approximation to (12)i In the case of equal numbers n of observations on each of the treatments, Miller (1966) suggested the large sample approximation to procedure (12) given by decide 7-u 7~ ~-v if and only if [/~.u - / ~ . v l ~> qa[m(ran + 1)/12] 1/;,

(14)

718

A. M. Dean and D. A. Wolfe

where qc~ is the upper ath percentile for the distribution of the range of m independent

N(O, 1) variables. Values ofq~ for c~ = .0001, .0005, .001, .005, .01, .025, .05, .10, and .20 and m = 3(1)20(2)40(10)100 can be found in Hollander and Wolfe (1973, Table A.IO). If the sample sizes are not equal, approximation (14) is no longer appropriate and we turn to Dunn's (1964) suggestion of using Bonferroni's Inequality to approximate the value of g~(m, n l , . . . , rim). The Dunn approximation to procedure (12) is decide ~-~ ~ ~-v if and only if

IR.~ - Rvl. >1 za/m(m-1)[N(N + 1)/12] '/2r'[knu + 7~v)/nuT~v] 1/2,

(15)

where za is the upper ath percentile of the standard normal distribution. The Dunn approximation (15) could also be applied to the case of equal sample sizes. However, it can be quite conservative in such settings and the Miller approximation (14) is preferred. The Nemenyi-McDonald-Thompson-Damico-Wolfe two-sided multiple comparison procedure based on joint rankings is illustrated in Section 2.2.6 for Experiment 1.2.

2.2.4. One-sided treatment versus control multiple comparisons based on joint rankings In this subsection, we present a multiple comparison procedure that is designed to make one-sided decisions about which treatment populations have higher median effects than that of a single control population. In the context of this paper, it is to be viewed as an appropriate follow-up procedure to rejection of H0 (2) against a general alternative hypothesis when one of the populations corresponds to a control treatment. The procedure was suggested by Nemenyi (1963) and additional critical values were obtained by Damico and Wolfe (1989). For simplicity of notation, we let treatment 1 correspond to the single control population. Let N* be the least common multiple of the sample sizes na, n 2 , . . . , rim, and let/~.j be the average rank for the jth treatment in the joint ranking of all N observations, for j -- 1 , . . . , m. At an experimentwise error rate of a, the Nemenyi-DamicoWolfe one-sided treatment versus control multiple comparison procedure based on joint rankings reaches its m - 1 decisions through the criterion decide ~-u > ~-1 if and only if N* (/~.u - / ~ . 1 ) ~ g~(m, n l , . . . ,rim), where the critical value 9~(m, n l , . . . , Po(N*(R.u

(16)

rim) satisfies the probability restriction

- R.1) < 9;(?'gl,,nl, . . . ,~2m), U = 2 , . . . , m ) = 1 -- o~,

(17)

with the probability Po(') being computed under H0 (2). Values of g*(m, n l , . . . , nm) are available in Damico and Wolfe (1989) for achievable experimentwise error rates ~x closest to but not exceeding .001, .005, .01(.005).05(.01).15 and useful combinations of either m = 3 and nl = 1(1)6, 1 ~< n2 ~< n3 ~ 6 or m = 4 and nl = 1(1)6, 1 ~, n2 ~ n3 ~ n 4 ~<6.

Nonparametric analysis of experiments

719

Note that we can make decisions regarding which treatments lead to a smaller response than does the control treatment by interchanging the subscripts u and 1 in (16). When either the number of experimental treatments or the sample size configurations are beyond the existing tables for 9~* (~ , nl, . . ., n,~), we consider two possible large sample approximations. First, if n2 = n3 . . . . . n m = n, and both nl and n are large, Miller (1966) suggested the large sample approximation to procedure (16) given by decide ~-~ > ~-1 if and only if (ff(~ - R.1) >~ q ~ [ N ( N + l)(n 1 q- n)/12nln] 1/2,

(18)

where q~ is the upper ath percentile for the distribution of the maximum of (m - 1) standard normal random variables with common correlation p = [n/(nl +n)]. Selected values of q~ for ra = 3 (1) 13 and p = . 1, .125, .2, .25, .3, .333, .375, .4, .5, .6, .625, .667, .7, .75, .8, .875, and .9 can be found in Hollander and Wolfe (1973, Table A.13). When the sample sizes of the experimental treatments are not equal, Dunn (1964) suggested using the approximation to procedure (16) given by decide ~-~ >

~-a

if and only if

(R.u -- /~.1) ~ z ~ / ( m - O [ U ( N + 1)(hi + n)/12nln] 1/2,

(19)

where z a is the upper ath percentile of the standard normal distribution. The Dunn approximation (19) could be applied in the case of equal sample sizes, but it can be quite conservative in such settings and the Miller approximation (18) is preferred. The Nemenyi-Damico-Wolfe one-sided treatment versus control multiple comparison procedure is illustrated in Section 2.2.6 in the context of Experiment 1.2. 2.2.5. Simultaneous confidence intervals or bounds for simple contrasts Nonparametric point estimates for contrasts Y'~j~-I CjTj(~jml Cj = 0) are available and, for these, the reader is referred to Spjctvoll (]-968) or H~)]-landerand Wolfe (1973, Section 6.4). Here, we discuss simultaneous confidence intervals and bounds for the collection of simple contrasts (pairwise comparisons) T. -- ~-~, 1 ~ U < V ~< m. The method was proposed by Critchlow and Fligner (1991). For each pair of treatments (u, v), rank the n~ + nv observations in the combined uth and vth samples, and let / ~ l , R u z , . . . , R u n , be those ranks assigned to the observations Y1~,..., Yn,~ in the vth sample. Let nv

W~=~R~j,

for 1 ~ < u < v ~ < m ,

j=l

and standardize W,~ as follows, W * v = (2)I/2[W~v - Eo(Wuv)]/[varo(Wuv)]U 2 = [Wuv - nv(nu + nv + 1)12]/{nunv(nu + nv + 1)/24} U2.

(20)

720

A. M. Dean and D. A. Wolfe

Let the upper ath percentile for the distribution of maximum{IW~l, u ¢ v = 1 , . . . , m } under H0 (2) be denoted by d ~ ( m , n l , . . . , n m ) . Values of d ~ ( m , n l , • .. , n m ) are given in Critchlow and Fligner (1991) for m = 3 and useful combinations of sample sizes in the range 2 4 nl ~< n2 ~< n3 4 7, and all achievable values of a ~ 0.2. Set nm)[n,~n~(n~ + n , + 1)/24] '/2 + 1

a~ = (n~n~/2) - d~(m, nl,...,

(21)

and buv =- auv - 1.

(22)

For each pair of treatments (u, v), u ¢ v -----1 , . . . , m, define the sample differences to be D~" = Yj~ - Yi~,

i=

l,...,n~;

j=

l,...,n,,

(23)

~v) ~< ... 4 D (n~,~,) uv denote the ordered values of the n~,nv sample and let D~,~ ~< D(2 differences, for u ¢ v = 1 , . . . , m. Then the simultaneous 100(1 - a ) % confidence intervals for the collection of all simple contrasts {~-~ - ~-~: 1 ~ u < v ~< m} are given by DUV

((a,~)),

UV D(n.~n,,-(b,,,)))

,

1 <. U < v <~ m ,

(24)

where (t) denotes the greatest integer less than or equal to t. When either the number of treatments or the sample size configurations are beyond the limits of the exact tables provided in Critchlow and Fligner (1991), the exact value d~(m, n , , . . . ,nm) can be approximated by q~, the upper ath percentile for the distribution of the range of m independent N(0, 1) variables. Values of q~ for a = .0001, .0005, .001, .005, .01, .05, .10, and .20 and m = 3(1)20(2)40(10)100 can be found in Hollander and Wolfe (1973, Table A.10). (When using this large sample approximation, the approximate value for auv can be less than 1. In such cases, take (auv) = 1 and (buy) = 0.) The Critchlow-Fligner procedure for simultaneous confidence intervals is illustrated in the next section in the context of Experiment 1.2. 2.2.6. Example - one-way layout, general alternatives

We illustrate the various one-way layout procedures using the log variance wafer data from Experiment 1.2. We consider only factors A and H, as though these were the only factors in the experiment. Thus we wish to compare the effects on log variance response of the m = 4 treatment combinations listed in Table 7. There are n = 4 observations per treatment combination. The joint ranks of the N = 16 data values (from least to greatest) are shown in parentheses in Table 7.

Nonparametricanalysisof experiments

721

Table 7 Log variance responses for factors A and H, Experiment 1.2 (joint ranks in parentheses) A

H

Treatmentj

lnS2j(Rlj)

InS2j(R2j)

lnSZj (R3j)

In $42j (R4j)

-1 -1 l 1

-1 1 -1 1

1 2 3 4

-0.443 (10) -1.199 (3) -0.313 (14) -0.437(11)

-0.327 -0.627 -0.229 -0.615

-0.651 -1.431 -0.347 -0.856

-0.497 -1.423 -0.119 -0.863

(13) (7) (15) (8)

(6) (1) (12) (5)

(9) (2) (16) (4)

R.j 38 13 57 28

First we test the null hypothesis Ho (2) against the general alternative hypothesis H1 (3), using the Kruskall-Wallis test. The value of the Kruskall-Wallis statistic (4) is H = [(12/(16 x 17))(382/4 + 132/4 + 572/4 + 282/4)] - 3(17) = 11.272. We compare the value of H = 11.272 with the percentile h.ol (4, 4, 4, 4, 4) given by Iman et al. (1975). Now, h.0o999 = 9.287. Since the observed value of H is larger than h.o0999, there is evidence of a difference among treatments in terms of log variance response. If we apply the Nemenyi-McDonald-Thompson-Damico-Wolfe multiple comparisons procedure of Section 2.2.3, we decide ~-~ ¢ ~'v if and only if IR.~ - R.v] ~> 9,~(4, 4, 4, 4, 4). Selecting a = .099, we obtain g~(4, 4, 4, 4, 4) = 30 from Damico and Wolfe (1987, Table II). Thus, using the values of R.j (j = 1,2, 3, 4) from Table 7, our 6 decisions at an experimentwise error rate of 0.099 are ~-1 = ~-2, ~-1 = ~-3, 7-i = ~'4, T2 ~ ~-3,T2 = T4, T3 = T4, and only treatments 2 and 3 are judged to have different effects on log variance response. We note that treatment 2 has both factors A and H at different levels from treatment 3. The lower variability is given by treatment 2. At an experimentwise error rate of 0.1239, treatment 3 would also be declared as having a different effect from treatment 4. These two treatments have factor H at different levels. Treatments 3 and 4 are the treatments closest to the operating conditions that were in place prior to the experiment. If treatment 4 were to be regarded as the control, then it is of interest to know if any other treatment would reduce the log variance response. Using the Nemenyi-Damico-Wolfe procedure (16) for the treatment versus control problem, we decide ~-~ < ~-4 if and only if (R.4

-

-

R.u) ~ 9 * ( 4 , 4 , 4 , 4 , 4 ) .

The critical value 9*(4, 4, 4 , 4 , 4 ) is not among those listed by Damico and Wolfe (1989). For illustration, we use Miller's large sample approximation (I 8) even though the sample sizes are not particularly large. Selecting a = . 1067, the critical value q*

A. M. Dean and D, A. Wolfe

722

for m - 1 = 3 experimental treatments and common correlation p = 0.5 is q*Ot = 1.70. We decide "r~ < ~-4 if and only if /~.4 - / ~ . ~ >/(1.70)[(16 × 17 × 8)/(12 x 4 x 4)] 1/2 = 5.723 or, equivalently, if and only if R.4-R.~ >~22.89. This, at significance level c~ = . 1067, we cannot conclude that any treatment gives a significantly lower log variance response than control treatment 4. Finally, we illustrate the calculation of simultaneous 90% confidence intervals for simple contrasts ~-~, - %. For illustration, let u = 3 and v = 4, then the sample differences D 3,4 ij (23) are D~i4

0.124,

/-)3,4

3,4

~12 = 0.302,

"'' '

D43 = 0.737,

3,4) = 0.124, D(2

...,

3,4 = 0.737, D(15)

/-)3,4 ~ 4 4 = 0.744,

giving the ordered values 3,4) = 0.090, D(1

3,4 = 0.744. D(16)

The value of da(4, 4, 4, 4, 4) required for (21) and (22) is not available in the literature, and we use the large sample approximation for 63,4. Selecting ~ = 0.1, we have 63, 4 =

(4)(4)/2 - q , [(4)(4)(4 + 4 + I)/24] 1/2 + 1

= 8 - (3.24)[6] 1/2 + 1 = 1.06, b3,4 = 0 . 0 6 .

Thus, as part of a set of simultaneous approximate 90% confidence intervals, we have D 3'4 ~ = [ 0 . 0 9 0 , 0 . 7 4 4 ) .

2.3. One-way layout- ordered alternatives 2.3.1. Test designed for monotonically ordered alternatives In this subsection, we discuss a test procedure proposed independently by Terpstra (1952) and Jonckheere (1954) that is especially effective at detecting restricted alternatives to Ho (2) that are ordered with respect to the treatment labels, corresponding to

//2: ['q <~ " " ~< ~-~, with at least one strict inequality.]

(25)

Nonparametric analysis of experiments

723

For each pair of treatments (u, v), let Uuv be the number of pairs of responses from treatments u and v for which treatment u has the smaller response (tied responses Contribute a value 1/2 to the total); that is, n~

~v

= Z E +(x,o,x ol,

1 ~
(26)

i=1 j = l

where ¢(a, b) = 1, 1/2, 0 if a <, =, > b. Then, U~v is the Mann-Whitney statistic between responses on the uth and vth treatments. The Jonckheere-Terpstra statistic J for testing H0 (2) versus the ordered alternative hypothesis//2 (25) is the sum of these k(k - 1)/2 Mann-Whitney counts, namely, v--1 m

J:

voo,

(27)

u=l v=2

and the associated test of Ho versus H2 at significance level c~ is reject Ho if and only if J >~jc~(m, n l , . . . ,nm),

(28)

where j,~(m, n l , . . . , 11,m) is the upper c~th percentile of the null sampling distribution of the statistic J (27). Tables of j,~(m, n l , . . . , nm) for selected significance levels o~ a n d r e = 3,2 ~< nl ~ n2 ~< n3 ~< 8 a n d m = 4 , 5 , 6 a n d n l ..... nm = 2 ( 1 ) 6 are available in Odeh (1971) and Jonckheere (1954) or Hollander and Wolfe (1973, Table A.8). When either the number of treatments or the sample size configurations are beyond the limits of these existing tables, the asymptotic behavior of a standardized version of the J statistic can be used to provide approximate critical values for the test procedure (28). When the null hypothesis Ho (2) is true, the standardized statistic

j* = [J - Eo( J)]/[varo( J)]l/2

(29)

has an asymptotic distribution [as m i n ( n l , . . . , nm) --+ c~] that is standard normal, where

j=l m

varo(J) =

N2(2N + 3) - ~ n Z ( 2 n j

+ 3) /72

(30)

j=l

are the expected value and variance, respectively, of J under the null hypothesis H0. The large sample approximation to procedure (28) is, then, reject Ho if and only if J* ~> z~,

(31)

724

A. M. Dean and D. A. Wolfe

where z~ is the upper c~th percentile of the standard normal distribution. Calculation of the Jonckheere-Terpstra statistic J is also available in a variety of software packages. These include, but are not limited to, S t a t X a c t ' s JT/MO command and IMSL's KTRND command. The Jonckheere-Terpstra test is illustrated in the context of Experiment 1.3 in Section 2.3.5.

2.3.2. One-sided all treatments multiple comparisons using pairwise rankings In this subsection, we present a multiple comparison procedure suggested by Hayter and Stone (1991) that is designed to make one-sided decisions about all m ( m 1)/2 pairs of treatment effects. In the context of this paper, it is to be viewed as an appropriate follow-up procedure to the Jonckheere-Terpstra test for monotonically ordered alternatives discussed in Section 2.3.1. Let W~,v be the standardized two-sample rank sum statistic for the uth and vth samples, as defined explicitly in (20). At an experimentwise error rate of c~, the Hayter-Stone one-sided all-treatments multiple comparison procedure based on pairwise rankings reaches its m ( m - 1)/2 decisions through the criterion decide 7-~ > ~-u if and only if W~v >~c~(m, n l , . . . ,n,~),

l<~u
(32)

where the critical value ca(m, n l , . . . , n,~) satisfies the probability restriction

Po(W*, < ca(m, n l , . . . , n ~ ) , u : 1 , . . . , m -

1;v = u-t- 1 , . . . , m )

= 1 - c~,

(33)

with the probability Po(') being computed under Ho (2). Values of ca(m, nl,..., 1%m) are given in Hayter and Stone (1991) for m = 3, 3 ~< nl, n2, n3 ~< 7, and all achievable experimentwise error rates less than .20. When either the number of treatments exceeds 3, or m = 3 and one of the sample sizes is larger than 7, Hayter and Stone (1991) proposed approximating the exact critical value ca(m, n l , . . . , n,~) by ka(m, n l , . . . , nm), where ka(m, n l , . . . , n ~ ) is the upper c~th percentile point for the distribution of K =

max

l~i
[(Zj - Z i ) / { ( n i

+ nj)/2niT~j}l/2],

(34)

where Z 1 , . . . , Z ~ are mutually independent and Zi has a N(0, n~<) distribution, for i = 1 , . . . , m. Thus, the large sample approximation to procedure (32) is decide ~-~ > ~-u if and only if W~, >>.ka(m, n l , . . . ,n,~),

l <~u < v < . m .

(35)

Values of ka(m, n l , . . . , nm) for large equal sample sizes nl . . . . . nm = n, m = 3(1)9 and experimentwise error rate c~ = .01, .05, and .10 can be found in Hayter

Nonparametric analysis of experiments

725

and Stone (1991). Values of ka(m, n l , . . . , nm) for large, but unequal, sample sizes Z~l,..., nm are not presently available in the literature. If one of the treatments is a control treatment, the Nemenyi-Damico-Wolfe multiple comparison procedure (Section 2.2.4) can be used instead of the Hayter-Stone procedure. The Hayter-Stone one-sided multiple comparison procedure based on pairwise rankings is illustrated for the Acid Red Experiment 1.3 data in Section 2.3.5, and the Nemenyi-Damico-Wolfe procedure is illustrated in Section 2.2.6 in the setting of an unordered alternative hypothesis for the wafer data of Experiment 1.2.

2.3.3. Simultaneous confidence bounds for simple contrasts in the ordered setting In this subsection, we operate under the ordered restriction ~'1 ~< r2 ~< .. • ~< ~-,,~,so that we are assuming a priori that each of the simple contrasts rv - ~-u, 1 ~< u < v ~< m, is non-negative. Therefore, in this setting, we will be interested only in simultaneous lower confidence bounds for this set of simple contrasts. uv) ~< D(z uv) ~< .-- ~< D uv Let D(1 ( ~ n , ) denote the ordered values of the nunv sample differences defined in (23), for u ¢ v = 1 , . . . , m . Let c~(m, n l , . . . , n ~ ) be the critical value for the Hayter-Stone one-sided all-treatments multiple comparison procedure with experimentwise error rate o~ as given in (33). Set

buy = (nunv/2) - ca(m, n l , . . . , n~)[nunv(n~, + n~ + 1)/24] 1/2 + 1. (36) Then, the simultaneous 100(1 - c~)% lower confidence bounds for the collection of all simple contrasts {~-v - 7-u: 1 ~< u < v ~< m} are given by [D(~(~v)), c~),

1 ~< u < v ~< m,

(37)

where (t) denotes the greatest integer less than or equal to t. When either the number of treatments or the sample sizes are beyond the extent of the exact critical value tables provided in Hayter and Stone (1991), the constant ca(m, n l , . . . , nm) can be approximated by ka(m, n l , . . . , nm), the upper c~th percentile point for the distribution of K (34) in the previous subsection. (When using this large sample approximation, the approximate value for h~,v can be less than 1. In such cases, take (buy) = 1.) An example illustrating the Hayter-Stone procedure for simultaneous confidence bounds is not included in this paper because of space limitations.

2.3.4. Tests designed for umbrella alternatives In this subsection, we consider test procedures that address problems where we have a priori knowledge that enables us to label the treatments in such a way that the treatment effects will be in a monotonically increasing relationship with the treatment labels (as in Section 2.3.1) or will be ordered so that there is an initial period of increases in the treatment effects, but with an eventual downturn for the later treatments.

A. M. Dean and D. A. Wolfe

726

Such alternative hypotheses are generally referred to as "umbrella alternatives" in the literature and they can be written as

//3: [~-1~< ~-2 ~ . . .

~< rp_l ~<'rp >~ rp+l ~> ... ~>'rm,

with at least one strict inequality, for some p E { 1 , 2 , . . . , m}].

(38)

The treatment label "p" is referred to as the peak of the umbrella and may either be known or unknown in practical settings. Note that, if we are interested in alternative hypotheses corresponding to "inverted umbrellas" of the form Tl ~> r2 ~> • " ~> ~-p-1 /> ~-p <~ "rp+l <. • • • <<.Win, with at least one strict inequality, we can apply the procedures described in this section to the negatives of the sample data. The negation of the sample values turns an inverted umbrella into an umbrella. The test procedure proposed by Mack and Wolfe (1981), is a direct extension of the Jonckheere-Terpstra procedure for ordered alternatives. The umbrella alternatives (38) with known peak can be viewed as two sets of ordered alternatives, one increasing monotonically from the initial treatment level up to the known level, p, and the second set beginning at treatment level p and decreasing monotonically to the final level m. For each pair of treatments (u, v), let U,,v(1 <. u < v ~ ra) be the Mann-Whitney statistic defined in (26). The Mack-Wolfe statistic Ap for testing Ho (2) versus the umbrella alternatives//3 (38), with peak known a priori to be at p, is given by v--1

p

v--1

ra

(39) u=l v=2

u=p v = p + l

and the associated test of H0 versus ~

at significance level a is

reject Ho if and only if Ap >~ ac~(p, m , n l , . . . , n m ) ,

(40)

where am(p, m , r~l,..., rim) is the upper c~th percentile of the null sampling distribution of the statistic Ap (39). Values of am (!9, m , n l , . . . , n m ) are generally not available in the published literature and, therefore, a large sample approximation to the exact null distribution of Ap is essential for its use in applications. Let N1 = nl + . . . -t- np and N2 = np + . • • + n m . When the null hypothesis Ho (2) is true, the standardized statistic

Ap : [Ap - Eo(A,)]/[varo(Ap)]'/2

(41)

has an asymptotic distribution [as m i n ( n l , . . . , n,~) -+ e~] that is standard normal, where Eo(Ap)=

N 2+N22-E

n2-n i=1

gl/

4

(42)

Nonparametric analysis of experiments

727

and m

2(N13 + N23) + 3(N 2 + N~)

varo(Ap) =

- ~n2(2ni + 3) i=l

-n2p(2np + 3) + 12npN1N2 - 12nZN} /72

(43)

are the expected value and variance, respectively, of Ap under the null hypothesis Ho. The large sample approximation to procedure (40) is then reject Ho if and only if



Ap ~ z,~,

(44)

where z~ is the upper ath percentile of the standard normal distribution. For the case of umbrella alternatives with unknown peak, we first estimate the peak of the umbrella configuration. To accomplish this, we calculate the m peak-picking statistics m

U.v=ZU~v,

forv=l,...m,

(45)

uCv

where U~v is the Mann-Whitney statistic (26) computed for the observations on the uth and vth treatments. Next, each U.v is standardized to obtain u;

=

-

(46)

= [u.v - nv(N

- nv)/2]/{nv(N

- nv)(N

+ 1)/12}

1/2 .

Let r be the number of treatments that are tied for the maximum U.~ value and let M be the subset of { 1 , 2 , . . . , m} that corresponds to these r treatments tied for the maximum. The Mack-Wolfe peak-unknown statistic Ap for testing H0 (2) versus the umbrella alternatives H3 (38) with unknown peak p is then given by

A; = (l/r) ~

[Aj -

Eo(Aj)]/{varo(Aj)} 1/2,

(47)

jEM

where Aj (39) is the peak-known statistic with peak at the jth treatment group and E0 (A j) and varo(Aj) are given in equations (42) and (43), respectively. The associated test of Ho versus//3 at significance level a is reject Ho if and only if

Ap >~a*~(m,n l , . . . ,

n,~),

(48)

where a*(m, nl, • • •, rim) is the upper ath percentile of the null sampling distribution of the statistic A~ (41). This sampling distribution properly takes into account the

A. M. Dean and D. A. WolJe

728

fact that we have first used the sample data to estimate the unknown peak. Tables of a*~(m, n l , . . . ,n,~) for c~ ~ .01, .05, and .10, m = 3(1)10, and equal sample sizes nl . . . . . n,~ = 2(1)10 are available in Mack and Wolfe (1981). For unequal sample sizes ( n l , . . . , rim), Mack and Wolfe (1981) suggest approximating a * ( m , ~ x , . . . , rim) by the corresponding critical value for equal sample sizes nl . . . . . n m = q, namely, a*(m, q , . . . , q), where q is the integer closest to the average sample size (nl + ... + nm)/m. If the value of q is greater than 10, they suggest using am(m, 1 0 , . . . , 10). Finally, if the number of treatments m is greater than 10, Mack and Wolfe suggest using the m = 10 critical value. An example of the Mack-Wolfe test using the data from the Acid Red Experiment 1,3 is presented in the next section.

2.3.5. Example - one-way layout, ordered and umbrella alternatives We illustrate various one-way layout analyses, when prior knowledge about ordering of the treatment effects is available. For the Acid Red data of Experiment 1.3, we first test the null hypothesis H0 (2) of no differences in the treatment effects against the ordered alternative hypothesis//2 (25), using the Jonckheere-Terpstra procedure of Section 2.3.1. For this experiment, we have ~ = 6, nj = 3 , j -- 1 , . . . , 6 , and N = 18. We compute the m ( m - 1)/2 = 15 individual Mann-Whitney statistics Uuv to be

U15 ~ 5 . 5 ,

U16 = 3.5,

U23 = 6,

U24 =

7,

U34 = 4,

U56 =

2.

Combining these counts in expression (27), we obtain J = 55. If we compare this value with the null distribution tables for J (Hollander and Wolfe, 1973, Table A.8), we find that the p-value for these data is greater than .50. Thus, there is no evidence of an increase in the number of revertant colonies as the dosage of Acid Red 114 increases. This might lead to the conclusion that we should not suspect the compound of being a potential mutagen. However, as Simpson and Margolin (1986) point out in their original discussion of these data, the Salmonella bacteria of strain TA 98 may actually succumb to the toxic effects of the higher doses of Acid Red 114, resulting in a reduction in the number of organisms at risk of mutation and leading to a downturn in the dose-response curve. To address this issue, we now illustrate the application of the Mack-Wolfe peakunknown procedure of Section 2.3.4 to the same data. We assume that, if there is a carcinogenic effect of the Acid Red 114, it will result in an umbrella pattern, but with unknown peak. First we need to estimate the peak of the umbrella through computation of the m = 6 combined-samples peak pickers U.I,. •., U,m, as given by equation (45). After calculating the necessary Mann-Whitney counts U~v, 1 ~< v ¢ u ~< 6, we find that U.1 ~- 9,

U.2 =

32,

U.3 = 38,

U. 4 =

38,

U.5 = 12.5,

U.6 =

5.5.

Nonparametric analysis of experiments

729

Since we have equal sample sizes nj = 3 (j = 1 , . . . , 6 ) , each of the combinedsamples Mann-Whitney statistics U.1,..., U.6 has the same null mean and variance. As a result, we do not need to compute the standardized forms U.] in (46), since the treatment group with the largest U.q value will also be the one with the largest U.] value. Therefore, for these Acid Red 114 data, we have a tie between dosage levels 333 #g/ml and 1000 #g/ml (treatments 3 and 4) for the maximum U.q value, giving r = 2 and M = {3, 4}. The resulting form of the Mack-Wolfe peak-unknown statistic is then given by

A~ = { [A3 - Eo(Z3)]/[Varo(Z3)] 1/2 + [A4 - Eo(A4)]/[varo(A4)] 1/2 }/2 = { [76 - 40.5]/[96.75] 1/2 + [69 - 40.5]/[96.75] 1/2}/2 = 3.25. Comparing this value of A~ with the null distribution in Mack and Wolfe (1981) we find that the p-value for these data is much smaller than .01, since ~*(.01,6, 3, 3, 3, 3, 3, 3) ~ 2.733. Thus, there is substantial evidence in favor of an umbrella effect in the number of revertant colonies over the studied range of Acid Red 114 doses, with the peak effect estimated to be at either the dosage 333 #g/ml or the dosage 1000#g/ml. To illustrate the Hayter-Stone one-sided multiple comparison procedure (32), we consider only the first three dosage levels (0, 100, and 333 #g/ml) so that we can make use of an exact critical value. With an experimentwise error rate of a = .1262, we obtain the value e.1262(3, 3, 3, 3) : 2.7775 from Table 1 in Hayter and Stone (1991). Since the sample size is 3 for each of the three dosages being considered, the standardizing constants in the calculation of W,~, (20) are the same for all of the pairwise rank sums, that is, :

-

10.5)/(5.25) '/2,

1

u < v

3.

As a result, the decision criterion for this setting can be rewritten to be decide % > ~-~ if and only if W~v >~ 15. We find the individual sums of pairwise ranks (using average ranks to break ties) to be W12 -- 15, W13 = 15, and W23 : 12. Hence our three decisions at experimentwise error rate c~ = .1262 are that 7-2 > T1,7-3 > ~-1, and ~'3 = 7-2, indicating that, while both dosage levels 100 #g/ml and 333 #g/ml produce significantly greater numbers of revertant colonies than does the 0 dosage, there is no significant difference in the effects of the two larger doses. (We note that .1262 is the smallest possible exact experimentwise error rate for ra = 3 and 7"/,1 : '/Z 2 : n 3 = 3.)

3. Two-way layout

3.1. Introduction Let Y~jl, Yij2,.. •, Y~j,~,j (i = 1 , . . . , b;j = 1 , . . . , m) represent independent random samples from bm populations, where the (ij)th population has cumulative distribu-

730

A. M. Dean and D. A. Wolfe

tion function (cdf) Fij. Then Y~jk is the response variable corresponding to the kth observation on the ith level of a factor B and jth level of a factor U. Specifying the assumptions as for the one-way layout, we write our model as (49) i=l,...,b;j--1,...,m;

k= 1,...,nij,

where 0 is the common median, j3i is the effect of the ith level of factor/3, Tj is the effect of the jth level of factor U, (/3~-)ij is the effect of the interaction between the ith level o f / 3 and the jth level of U, and the N ( = nil + " " + nb,~) error variables Eijk form a random sample from any continuous distribution with median 0. In the sections that follow, factor U is regarded as a treatment factor and factor/3 may be either a second treatment factor or a block factor. In some experiments, the response can be modelled as the sum of the effects of the two factors; that is

Y~jk =

O + -cj + /3~ +

i=l,...,b;

E~jk,

j=l,...,m;

k=l,...,nij.

(5o)

This is known as an additive model and, in this setting, main effect comparisons of the two factors are of interest. Tests for additivity are discussed in Sections 3.2.1 and 3.3.1. Main effect tests and multiple comparison procedures are described in Sections 3.2.3, 3.2.4 and Sections 3.3.2-3.3.6. Distribution-free procedures in this two-way layout additive model setting are most often based on the fact that under H~-: [~-j are all equal, j = 1 , . . . , m], every ordered arrangement of the ni. = ~jm= 1 n~j sample observations within the ith level of factor /3 is equally likely, separately and independently for each level i = 1 , . . . , b of factor /3. This leads directly to the use of the joint ranks of the sample observations within levels of factor B. Any test procedure that is a function of the observations only through their joint ranks within levels of factor B, will have the same probability distribution under H~- (61) regardless of the form of the common underlying continuous distribution for the error variables Eijk. An alternative design for the two-way layout is that of the nested design where the levels of one factor (say U) are observed at only one of the levels of the other factor (/3). The model for the two-way nested layout is

i=l,...,b;

j=l,...,m;

k=l,...,nij,

(51)

where 0 is the common median,/3i is the effect of the ith level of factor/3, ~-(/3)j(i) is the effect of the jth level of U observed at the ith level o f / 3 , and the N error variables Eijk form a random sample from a continuous distribution with median 0. The null hypothesis Ho(~): [~'(fl)i(j) are all equal, i - - 1 , . . . , b , j = 1 , . . . ,m]

(52)

Nonparametric analysis of experiments

731

is equivalent to a global hypothesis of no main effect of U and no interaction between B and U for the two-way crossed model; that is, HT: [Tj -~ (/~T)ij are all equal, i = 1 , . . . , b, j = 1 , . . . , m]

(53)

(see Section 3.5). Consequently, the same testing procedures can be used and we will not discuss the nested case further.

3.2. Two-way layout - multiple observations per cell 3.2.1. Testing for non-additivity - multiple observations per cell Under the two-way model (49), a test of additivity is a test of the null hypothesis

H~dd:

[?~ij = ~i q- Tj, for all 1 ~ i ~ b and 1 ~< j ~ m]

(54)

against the general alternative hypothesis H~dd: [r/ij ¢ /3/ + Tj, for at least one 1 ~< i ~< b, 1 ~< j ~< m],

(55)

where rhj is the expected response when factor B is at level i and factor U is at level j. Conover and Iman (1976, 1981) suggested the use of the rank transform approach for testing for additivity in the two-way model with multiple observations per cell. Such tests have also been advocated in some user manuals for statistical computer packages and libraries. The rank transform procedure first ranks all of the observations from 1 to N without regard for the corresponding levels of the factors, and then uses the usual analysis of variance formulae on the ranks. The procedures are simple to use, but unfortunately, recent studies have shown that, in the presence of two large main effects, the significance levels of the rank transform test for testing Hd dd (54) can be severely inflated (for example, see Blair et al. (1987) for simulation studies, and see Akritas (1990) and Thompson (1991), for theoretical investigations). Consequently, the rank transform procedure is not recommended for testing H~~d (54) (the only exception being when both factors have two levels). Akritas and Arnold (1994) show that the rank transform test of additivity does not, in fact, test H~ dd (54). Instead, it tests the null hypothesis that the common distribution, Fij, of the response variables in the ijth cell is a mixture of two distributions, one depending on the level i of factor B and the other depending on the jth level of factor U; that is, F , j ( U ) = a D , ( y ) + (1 -

where a is the same for every cell. Consequently, when such a hypothesis is of interest, the rank transform method would appear to give a valid test. Sawilowsky (1990), in a review article, discusses a number of different tests for H~ dd (54), but many have drawbacks such as being conditional on the data collected or requiring the data to be collected replication by replication (which is generally only

732

A. M. Dean and D. A. Wolfe

done when the replications are synonymous with blocks). Hettmansperger and McKean (1983), Aubuchon and Hettmansperger (1984) and Draper (1988) all give details of some of these tests. The only method that we shall discuss in this article is the rankbased robust general linear model (RGLM) method of McKean and Hettmansperger (1976). The RGLM method possesses many of the features of the method of least squares and is very flexible. It can be used for testing a null hypothesis of additivity or for testing null hypotheses of no main effects (both against general alternatives). It can be used for any number of factors and for fractional factorial experiments provided that there are sufficient degrees of freedom to provide an estimate of the error variability. In addition, the method allows contrast estimates and confidence intervals to be calculated. It should be noted, however, that in the case of a fully parameterized model with two observations per cell, the contrast estimates are identical to those provided by the method of least squares. Since the RGLM method can be applied in many different settings, we postpone discussion of the details to Section 4.3. We illustrate the procedure for the two-factor welding experiment (Experiment 1.4) in Section 4.3.2. When factor B is a block factor, an alternative to testing Ht~dd (54) is to test for a common ordering of treatment effects within the blocks. "Rank interaction" is said to occur when the orderings fail to be the same in all blocks simultaneously. Rank interaction can also occur when both factors are treatment factors, although it is perhaps less likely to be of interest in this setting. The null hypothesis Ho *(z) of no rank interaction of treatments with blocks is +

=

+

.....

+

fori= 1,...,b]

(56)

and the general alternative hypothesis is H~*(Z): [~q + (/~T)il, ~-2 + (f~-)i%..., 7"m + (/~'r)im,

(57)

are not all equal].

A test for the null hypothesis Ho *(~) (56) was developed by De Kroon and Van der Laan (1981) for equal numbers n of observations per cell as follows. Let Rijk denote the rank of Yijk among the r a n observations Y~I1,Yi12. . . . ,Yiln, • . . , Yi,~l, Y i m 2 , . . . , Y i m n associated with the ith level of factor B (i = 1 , . . . , b) and let R i = ( R i l l , Ri12 . . . . , R i l n . . . . , R i m 1 , . . . , R i m n ) be the vector of these joint ranks for level i of factor B. The test statistic developed by De Kroon and Van der Laan (1981) is

7'2 = n 2 m ( n m

+ 1)

Ri~" - b-1 I, j = l i=1

R2"J" j=l

'

(58)

Nonparametric analysis of experiments

733

Rijk and R.j. = ~ i b l 2~=1Rijk. Their test of the null hypoth-

where Rij. = ~ = l

esis Ho *(z) (56) of no rank interaction against the alternative hypothesis H~ *(;~) (57) is reject Ho *(~) if and only if 7'2 >~ t ~ ) (m, b, n , . . . , n),

(59)

where t~ ) is the upper c~th percentile of the null distribution of T2. The percentiles are tabulated by De Kroon and Van der Laan (1981) for various values of m, b and n in the range 2 ~< m ~< 4, 2 ~< b ~< 10, and 2 ~< n ~< 4. These authors state that the asymptotic distribution [as n --+ ec] of T2 is chi-square with (b - 1)(m - 1) degrees of freedom and, consequently, the large sample approximation to procedure (59) is reject Ho *(z) if and only if T2 >~ X2((m - 1)(b - 1)),

(60)

where x~(q) is the upper c~th percentile of the chi-square distribution with q degrees of freedom. When a null hypothesis of additivity or of no rank interaction is rejected, the individual cells can be compared using the techniques of the one-way layout. If the null hypothesis is not rejected, then analysis of the main effects is of interest (Sections 3.2.3 and 3.2.4).

3.2.2. Example - testing for non-additivity, multiple observations per cell. We use the welding data in Table 3 of Experiment 1.4 in order to illustrate the analysis of a two-way layout with an equal number of observations per cell. The hypothesis of additivity H~ dd (54) can be tested using the RGLM method. Since this method can be used for higher-way layouts also, we have postponed the illustration until Section 4.3.2. As an alternative, we may wish to test the hypothesis Ho *(~) (56) of no rank interaction. Ranking the observations in Table 3 within the i = 3 levels of the gage bar settings (and using average ranks to break the ties), we obtain the rank vectors /~1 R2 R3

= = =

(1, (6.5, (4.5,

2, 8, 2,

3, 4.5, 7,

5.5, 2.5, 3,

8.5, 9, 4.5,

10, 10, 1,

7, 6.5, 8,

4, 1, 9,

5.5, 4.5, 10,

8.5), 2.5), 6),

giving 7"2= (4)(5)(11) 12{

(32+ 1 4 . 5 2 + ' - . + 162) - 3 1 ( 2 4 2 + 2 5 " 5 2 + ' " + 3 7 2 ) }

= 15.38. No tables are given by De Kroon and Van der Laan (1981) for m = 5, b = 3, and n = 2. Here, n is not large, and therefore we should not expect the chi-square distribution with (b - 1)(m - 1) = 8 degrees of freedom to be an accurate approximation to the true distribution of T2. However, if we do compare the value T2 = 15.38 with the

A. M. Dean and D. A. Wolfe

734

percentiles of the X2(8) distribution, we see that 15.38 is around the 95th percentile. Table 5.3.6 of De Kroon and Van der Laan (1981) suggests that the percentile of the exact distribution is likely to be slightly lower, so there is some evidence at a significance level of just under a = 0.05 to reject H o *(~) (56) and to conclude that there is rank interaction of U with/3. 3.2.3. Main effect tests designed f o r general alternatives - at least one per cell In this subsection, we consider two-way layout settings with at least one per cell (that is, nij ~> 1 for every i = 1 , . . . , b and j = 1 , . . . , m). We the additive model (50) holds and we wish to test the null hypothesis of factor U,

observation observation assume that no effect of

H~-: ['rj are all equal, j = 1 , . . . , m]

(61)

against the general class of alternatives H{: [not all 7-j's equal].

(62) m

Let Rijk be the rank of Y~jk within the hi. = ~-~j=l nij observations on the ith level of B, for j = 1 , . . . , m and k = 1 , . . . , n i j . For each level of U (j = 1 , . . . , m ) , compute the sum of 'cell-wise weighted' average ranks, given by b

(63) ¢=1Lni" J nlj where Rij. = ~ k = l Rijk. Define the vector R to be R = (Sl - E0[Sll,...

=

S l --

, S i n - 1 -- E 0 [ S m - I ] )

[nil(n/. +

l)/2ni.],...,

i=1

S m - I --~-~[ni,m-l(ni. ~- 1)/2hi.]

(64)

.

i=1

Note that we have chosen to define R without a term for the ruth level of U. The S j ' s are linearly dependent, since a weighted linear combination of all rn of them is a constant. We could omit any one of them in the definition of R and the test would be exactly the same no matter which one was chosen for omission. The covariance matrix for R under H~ (61) has the form 270 = ((crst)), where b

cr~t = ~ i=1

[ni~(ni. - nis)(ni. + 1)/12n2.],

for s = t =

1,...,m-

1

Nonparametric analysis of experiments

735

b

=- ~

[nisnit(ni. + 1)/12n2.],

for s ¢ t = 1 , . . . , m -

1.

(65)

i=1

Letting 2701 denote the inverse of 270, a general test statistic proposed by Mack and Skillings (1980) is MS = R'2Jol-R,

(66)

and the associated level a test of H~" (61) versus H~ (62) is reject H~ if and only if MS ~> w~(b, m,

rill,...

,

nbm),

(67)

where w~ (b, m, nal,..., nbm) is the upper ath percentile for the null sampling distribution of MS (66). Values of w~(b, m, n11,..., nbm) are available in the literature for equal nij (see the subsection below) but not for arbitrary replications. However, the asymptotic distribution [as N -+ oo] of MS under H~" is chi-square with m - 1 degrees 73% of freedom where N = ~ i b I }-]~j=l nij. Thus, when N is large, the approximation to procedure (67) is reject H~- if and only if MS ~> X~(m - 1),

(68)

where X2 ( r a - 1) is the upper ath percentile of the chi-square distribution with r a - 1 degrees of freedom. Mack and Skillings (1980) note that this chi-square approximation tends to be conservative, especially for small levels of a.

Equal cell replications In the special case where we have an equal number of replications for each of the bm combinations of levels of factors U and B, the statistic MS (66) permits a closed form expression. Let nli = na2 . . . . . nbr~ = n. Then N = bran and ni. = m n for i = 1 , . . . , b, and the Mack-Skillings statistic MS (66) can be written in closed form as

MS = [ 1 2 / m ( N + b)] ~

(mSj - (N + b)/2) 2.

(69)

j=l

The associated level a test of H~- versus H~- is, then, reject H~- if and only if MS ) wa (b, m, n , . . . , n).

(70)

The exact critical values w~ (b, m, n , . . . , n), some obtained via simulation of the exact null distribution, can be found in Mack and Skillings (1980) for selected significance levels a and b = 2(1)5, rn = 2(1)5, and common number of replications n = 2(1)5.

736

A. M. Dean and D. A. Wolfe

3.2.4. Two-sided all treatments multiple comparisons - equal numbers of observations per cell In this subsection, we discuss a multiple comparison procedure that is designed to make two-sided decisions about all m ( m - 1)/2 pairs of treatment effects when we have an equal number n of replications from each combination of factors B and U in a two-way layout. It is appropriate as a follow-up procedure to the equalreplication Mack-Skillings test in the previous subsection and was proposed by Mack and Skillings (1980). Let Sj be as given in (63) with n~. = rnn for j = 1 , . . . , ra. At an experimentwise error rate no greater than c~, the Mack-Skillings two-sided all-treatments multiple comparison procedure reaches its ra(ra - 1)/2 decisions through the criterion decide ~-~ 7£ 7-~ if and only if IS~ - S~] ~> [w~(N + b)/6] 1/2,

(71)

where N = bran is the total number of observations and w~ = w~(b, m, n, n , . . . , n) is the upper c~th percentile for the null sampling distribution of the Mack-Skillings statistic MS (69). The multiple comparison procedure in (71) guarantees that the probability is at least 1 - c~ that we will make all of the correct decisions under the strict null hypothesis that all of the treatment effects are equal and controls the experimentwise error rate over the entire class of continuous distributions for the error terms. The procedure (71) often requires an experimentwise error rate higher than the p-value associated with any previous test in order to find the most important differences between the various treatments. When the total number of observations is large, the critical value [w~(N + b)/6] 1/2 can be approximated by [(N+b)/12]l/2q~, where qa is the upper c~th percentile for the distribution of the range of m independent N(0, 1) variables. Thus, the approximation to procedure (71) for N large is decide T~ ¢ ~-~ if and only if IS,,, - Svl >1 q~[(N + b)/12] 1/~.

(72)

Values of q~ (c~ = .0001, .0005, .001, .005, .01, .025, .05, .10, .20; m = 3(1)20(2) 40(10)100) can be found, for example, in Hollander and Wolfe (1973, Table A10). Mack and Skillings (1980) note that the procedure (71) is rather conservative; that is, the true experimentwise error rate might be a good deal smaller than the bound c~ provided by (71). As a result, they recommend using the approximation (72) whenever the number of observations is reasonably large.

3.2.5. Example - main effects tests and multiple comparisons, two-way layout, multiple observations per ceil We use the rocket thrust duration data in Table 4 of Experiment 1.5 to illustrate the Mack-Skillings test of the hypothesis H0 (61) against the general alternative hypothesis H1 (62). For these data, the RGLM test (Section 4.3) would indicate no interaction between the factors. Similarly, the test (59) would indicate no rank interaction. Consequently, the additive model (50) is an adequate representation of the response.

Nonparametric analysis of experiments

737

Table 8 Ranks within levels of altitude cycling of the rocket thrust duration data Temperature 1 1

Altitude Cycling

2

2

3

4

15.5 15.5

13 11

3

1

9

14

7

2

5

4

10

12

8

6

12 14.5

16 14.5

4 1

3 2

11 13

5 8

6 7

9.5 9.5

We let f a c t o r / 3 denote the i = 2 levels of altitude cycling and let the levels of factor U denote the m -- 4 levels of temperature. The ranks of the data in Table 4 within levels o f / 3 (with average ranks used to break the ties) are shown in Table 8. To test the hypothesis H~- (61) against the alternative hypothesis H~" (62), we first form the sum of cell-wise weighted average ranks (63), with hi. = m n = 16, and obtain $1--7,

$ 2 = 1.4375,

$3 = 5 . 5 ,

$4=3.0625.

Since we have an equal number n = 4 observations per cell, we use the closed form Mack-Skillings statistic (69) with N = bran = 32, giving M S = [ 1 2 / 4 ( 3 2 + 2)] {(4 x 7 -

17) z + . . . + (4 x 3 . 0 6 2 5 - 17) 2}

-- 26.04. Comparing this value with the exact null distribution tables for MS in Mack and Skillings (1980) with rn = 4, b = 2 and n -- 4 we find the p-value for these data is considerably smaller than 0.01. Thus there is strong evidence to suggest that temperature has an effect on the rocket thrust duration. In order to determine which temperatures differ in their effects on the thrust duration, we use the Mack-Skillings multiple comparison procedure (71) and, selecting a significance level of a = 0.0994, we decide ~'u ¢ ~-v if and only if ISu - Sv I ) [6.243(32 + 2)/6] 1/2 = 5.948. Thus, using the above values of Sj (j = 1, 2, 3, 4), our 6 decisions, at an experimentwise error rate of 0.0994, are that no two treatments differ in their effect on the rocket thrust duration (although at a slightly larger experimentwise error rate, we would conclude a difference between the effects of the first two levels of temperature). This indicates the conservative nature of the decision procedure as pointed out by Mack and Skillings (1980).

A. M. Dean and D. A. Wolfe

738

3.3. Two-way layout - one observation per cell 3.3.1. Testing for non-additivity - one observation per cell The test statistic MCRA was proposed by Hartlaub et al. (1993) for testing H~ dd (54) under the two-way model (49) with one observation per cell. The procedure is as follows. Arrange the data in a two-way table, with the levels of factor B defining the rows and those of factor U defining the columns. Subtract the jth column average from the data values in the jth column (j = 1 , . . . , m) to give Y/j - ~ j in the (ij)th cell (i = 1 , . . . , b; j = 1 , . . . , m). This is called aligning in the columns and removes the effect of factor U from the data in the table. (The column median can be subtracted instead, but this does not completely remove the effect of factor U from the test statistic.) Then, rank the aligned data values Y/1 - Y.I, Y/2 - Y.2,..., Y~,~ - ~ m in the ith row to give the rank vector Ri = (R/l, R i 2 , . . . , Rim), for each i = 1 , . . . , b. This removes the effect of factor B from the aligned data values in the cells. Any remaining variation is due to non-additivity. The m ( m - 1)/2 statistics Wjj, are computed as Wjj, =

~

(Rij - R e j - Rij, +

Ri,j,) 2

(73)

l<~i
for each 1 ~ j < j~ <~ m. The test statistic is then MCRA =

max

l<~j
Wjj,.

(74)

The statistic MCRA is more sensitive to some types of non-additivity than others. In particular, MCRA may detect non-additivity more easily when the rows and columns of the table are interchanged before the aligning and ranking takes place. If we let 3//1 be the statistic MCRA (74) with the levels of factors B and U defining the rows and columns, respectively, and let M2 be MCRA with the levels of factors U and B defining the rows and columns, respectively, the rejection rule suggested by Hartlaub et al. (1993) is reject H~ dd if and only if M1 > h~/2(m, b) or M2 > h~/2(b, m)

(75)

where h~/a(b, m) is the upper (c~/2)th percentile of the distribution of MCRA under a null hypothesis of no interaction and no main effects, and computed when the errors have a normal distribution. A FORTRAN computer program for calculating the critical values is available from the authors. The test (75) is not strictly distribution free, and the significance level can rise a little above its nominal level for skewed distributions (such as the exponential). For heavy-tailed distributions (such as the Cauchy distribution), a more powerful test can be obtained by subtracting the column medians rather than the column averages before ranking in the rows. (However, as mentioned above, this does not completely remove the effect of the factor defining the columns).

Nonparametric analysis of experiments

739

3.3.2. Main effect test f o r general alternatives - one observation per cell

In this subsection, we consider two-way layout settings with a single observation for each combination of levels of two factors B and U and where an additive model (50) holds. We are interested in testing the hypothesis H d- (61) of no effect of factor U against the general class of alternatives given by H~- (62). The most commonly used test statistic for this setting, proposed independently by Friedman (1937) and Kendall and Babington Smith (1939), but commonly known as the Friedman statistic, is closely related to the usual classical two-way layout F-statistic with the original observations replaced by their within-block joint ranks. For each i = 1 , . . . , b separately, let R~I, Ri2,.. •, Rim be the ranks of the observations Y~I, Y/2,..., Y~m for the ith level of factor B. Let R.j = R l j q - . . . q- Rbj be the sum of these ranks that are assigned to the observations on the jth level of factor U. Let /~.j = R . j / b be the corresponding average, for j = 1 , . . . ,m. The Friedman statistic for testing H~- (61) versus the general alternative H~- (62) is then given by m

S = [ 1 2 b / m ( m + 1)1 ~ ( / ~ . j - (ra + 1)/2) 2,

(76)

j=l

and the associated level a test of H~- versus H]- is reject H~- if and only if S ~> s~(m, b),

(77)

where s,~(m, b) is the upper ath percentile for the null sampling distribution of the statistic S (76): The complete null distribution of S is available in Hollander and Wolfe (1973) for m = 3,b = 2(1)13; m = 4, b = 2(1)8; and m = 5, b = 3,4,5. Additional tables for m = 5, b = 6(1)8 and m = 6, b = 2(1)6 can be found in Odeh (1977a). When the number of levels of factor B is large, the appropriate asymptotic distribution of S can be used to provide approximate critical values for the test procedure. When the null hypothesis H~- (61) is true, the statistic S has an asymptotic distribution [as b --+ c~] that is chi-square with m - 1 degrees of freedom. Thus, the approximation to procedure (77) for a large number of levels of B is reject H~- if and only if S >~ X2~(ra - 1),

(78)

where x 2 ( m - 1) is the upper c~th percentile of the chi-square distribution with r a - 1 degrees of freedom. We note that the chi-square approximation (78) can, for smaller values of b, be rather conservative in the sense that the stated upper-tail probabilities are larger than the true ones. For this reason, Iman and Davenport (1980) proposed the alternative approximation that compares the statistic Q = (b - 1 ) S / [ b ( m - 1) - S] with critical values from the F distribution having m - 1 and (b - 1)(m - 1) degrees of freedom in order to assess the approximate significance of the observed data. Calculation of the Friedman statistic S is available on a variety of software packages. These include M i n i t a b ' s FRIEDNAN command, SAS's PROC RANK/PROC ANOVA commands when applied to within-blocks ranks, and IMSL's FRDNN command.

740

A. M. Dean and D. A. Wolfe

3.3.3. Two-sided all-treatments multiple comparisons - no replications m m Nonparametric point estimates for contrasts }-]~j=l cj~-j ( ~ j = l cj = 0) are available and, for these, the reader is referred to Doksum (1967) or Hollander and Wolfe (1973, Section 7.4). Here, we present a multiple comparison procedure that is designed to make two-sided decisions about all ra(m - 1)/2 pairs of effects of factor U in a two-way layout with one observation per cell and which is an appropriate followup procedure to the Friedman test of Section 3.3.2. The procedure was described by Nemenyi (1963) and by McDonald and Thompson (1967), who attributed it to Wilcoxon. Let Ril, • • •, Rim be the ranks of the data within the ith level of factor B~ i = 1 , . . . , b and let R.j be the sum of these ranks assigned to the jth level of U, for j = 1 , . . . , m. At an experimentwise error rate of c~, the Wilcoxon-Nemenyi-McDonaldThompson two-sided all-treatments multiple comparison procedure reaches its r e ( m 1)/2 decisions through the criterion decide "ru # Tv if and only if IR.~ - R.v[ >, v~(rn, b),

(79)

where the critical value r~(m, b) satisfies the probability restriction

P o ( l R . ~ - R.~l ~ r~(m,b), u = l , . . . , m - 1 ;

v = u ÷ l,...,m)

= 1-o~,

(80)

with the probability P0(.) being computed under H~- (61). As with previously presented multiple comparison procedures, experimentwise error rates larger than those used as significance levels for hypothesis tests (e.g., .10 or .15) are frequently used in practice. Values of r~(m, b) can be found in McDonald and Thompson (1967) or in Hollander and Wolfe (1973, Table A.17) for selected experimentwise error rates and m = 3, b = 3(1)15 or m = 4(1)15, b = 2(1)15. When the number of levels of factor B is large, the critical value r~(m, b) can be approximated by the constant [bm(m + 1)/12]l/zq=, where q,~ is the upper c~th percentile for the distribution of the range of m independent N(0, 1) variables. Thus, the approximation to procedure (79) for b large is decide T~ ¢ Tv if and only if IR.~ - R.vl/> qc~[bm(m + 1)/12] 1/2.

(81)

Values of qa for c~ = .0001, .0005, .001, .005, .01, .025, .05, .10, and .20 and m = 3(1)20(2)40(10)100 can be found, for example, in Hollander and Wolfe (1973, Table

A.IO). 3.3.4. Example - two-way layout, no replications, general alternatives To illustrate the analysis of a two-way layout with one observation per cell, we use the air velocity data of Experiment 1.6 in Table 5. The response variable Yij is the position of maximum velocity of air blown down the space between a rib-roughened rod and a smooth pipe surrounding it. Let factor/3 denote the rib height, which has b = 3 levels, and let factor U denote the Reynolds number which has m = 6 levels.

Nonparametric analysis of experiments

741

Table 9 Aligned ranks for position of maximum air velocity. Column aligned data prior to ranking are shown in brackets Reynolds number (j) Rib Height

i

1

2

3

4

5

6

1 2 3

5 (-39.3) 6 (17.7) 1 (21.7)

1 (-51.0) 1 (0.0) 6 (51.0)

4 (-40.7) 2 (3.3) 3 (37.3)

3 (-45.3) 3 (3.7) 5 (41.7)

6 (-39.0) 4 (6.0) 2 (33.0)

2 (-48.3) 5 (8.7) 4 (39.7)

We use the M C R A procedure to test the hypothesis of additivity (54) against the general alternative hypothesis (55) of no additivity. Aligning the data in the columns of Table 5 by subtracting column averages, and then ranking in the rows, gives the aligned ranks of Table 9. If we calculate the statistics Wjj, (73) and the test statistic MCRA (74), we obtain M C R A = M1 = 182. When the table is transposed, and the new aligned ranks obtained, the test statistic (74) is MCRA = M2 -- 89. Critical values obtained from the program of Hartlaub, Dean and Wolfe are m3,6,.013

-=

200,

7n3,6,.072 = 182,

m6,3,.0075 = 128,

m6,3,.034 = 125.

If we select an overall significance level of a ~ 0.026, we have that M1 < m 3 , 6 , . 0 1 3 and M2 < m6,3,.0o75. Consequently, the hypothesis H~ dd (54) of no interaction would not be rejected at level c~ = .026. However, at a higher overall significance level, say c~ ~< 0.144, we have M1 ~> m3,6,.072 and the hypothesis H~ dd would be rejected. A plot of the data indicates only a small amount of interaction between factors, and this is caused mostly by the small value recorded for the combination of rib height 1 and Reynold's number 3. Accordingly, the main effects are likely to be of interest here. In order to use the Friedman test as described in Section 3.3.2, we let the m = 3 levels of factor U represent the rib heights and the b = 6 levels of factor/3 represent the Reynolds numbers. Ranking the three observations from least to greatest within each of the levels o f / 3 (Reynolds number), we find that Ril = 1, Ri2 = 2, Ri3 = 3 for each level i of/3. So the average "within-B rank" for each of the three rib heights, is /~.1 = 1, /~.2 = 2 and/~.3 = 3. Using these values in expression (76) for S, we obtain S = 12. If we compare this value with the exact null distribution tables for S (Table A.15 in Hollander and Wolfe, 1973, for example) with ra = 3 and b = 6, we find that the p-value for these data is less than .0005. This indicates strong evidence that there is a difference in the effects on position of maximum air velocity of the three rib heights. In order to detect which of the three rib heights (levels of factor U) differ in their effects on the position of maximum air velocity, we apply the Wilcoxon-NemenyiMcDonald-Thompson multiple comparison procedure (79). With an experimentwise error rate of a = .009, we see from Table A.17 in Hollander and Wolfe (1973), with

A. M. Dean and D. A. Wolfe

742

m = 3 and b = 6, that r.009(3, 6) = 10. Using the above values of the within-B ranks we see that IR.~ - R.21 = 16 - 121 = 6,

IR.1 - R.3[ = 16 - 181 = 12,

IR.2 - R.3I = 112 - 181 = 6. Comparing these observed differences in the treatment rank sums with the critical value 10, we see that our 3(2)/2 = 3 two-sided decisions are T1 = ~-2, T2 = T3, and T1 ~ ~-3. Thus, there is a statistically significant difference in the locations of maximum air velocity for the first and third rib heights.

3.3.5. Test for monotonically ordered alternatives - one observation per cell In this subsection, we consider the case of a two-way layout where there is a single observation for each combination of levels of the two factors B and U, and where we have a priori knowledge that enables us to label the levels of factor U so that their effects will be in a monotonically increasing relationship with the treatment labels. The test procedure that we now present was initially proposed by Page (1963) and is especially effective at detecting alternatives to H~" (61) that are ordered with respect to the treatment labels corresponding to H~-: [~-1~< ~-2 ~< " " ~ 7-m, with at least one strict inequality].

(82)

Let Ril, Ri2,. •., R~m be the ranks of the data for the ith level of factor B, i = 1 , . . . , b , and let R.j be the sum of the ranks assigned to the jth level of U, j = 1 , . . . ,m. The Page statistic for testing H~ (61) versus the ordered alternative hypothesis H~- (82) is defined by m

L = ~jR.j,

(83)

j=l

and the associated level c~ test of H~- versus H~- is reject H~- if and only if L >~ l,~(m, b),

(84)

where l,~(m, b) is the upper c~th percentile for the null distribution of the statistic L (83). The critical values lob(m, b) are given for c~ = .05, .01 and .001 by Page (1963), and also by Hollander and Wolfe (1973, Table A16), for m = 3, b = 2(1)20 and m = 4(1)8, b = 2(1)12. Additional critical values are given by Odeh (1977b) for c~ = .2, .1, .025, .005 and m = 3(1)8, b = 2(1)10. When the number of levels of B is large, the appropriate asymptotic distribution of a standardized form of L can be used to provide approximate critical values for the test procedure. When the null hypothesis H~- (61) is true, the standardized statistic

L* = [L - Eo(L)]/[varo(L)] '/2

(85)

Nonparametric analysis of experiments

743

has an asymptotic distribution [as b -+ c~] that is standard normal, where Eo(L) = b m ( m + 1)2/4

and

var0(L) = bm2(m + 1)2(ra - 1)/144

(86)

are the expected value and variance, respectively, of L under the null hypothesis HJ. Thus the approximation to procedure (85) for b large is reject H~- if and only if L* ~> z~,

(87)

where za is the upper ath percentile of the standard normal distribution. 3.3.6. One-sided treatment versus control multiple comparisons - one observation per cell In this subsection, we present a multiple comparison procedure that is designed for the two-way layout to make one-sided decisions about individual differences in the effects of levels 2 , . . . , m of factor U relative to the effect of a single control population corresponding to level 1 of U. In the context of this paper, it is to be viewed as an appropriate follow-up procedure to rejection of H~- (61) with either the Friedman procedure of Section 3.3.2 or the Page ordered alternatives procedure of Section 3.3.5 when one of the populations corresponds to a control population. Let R i a , . . . , Rim be the ranks of the data within the ith level of factor B, i = 1 , . . . , b and let R.j be the sum of these ranks assigned to the jth level of U, for j = 1 , . . . , rn. At an experimentwise error rate of c~, the procedure described by Nemenyi (1963), Wilcoxon-Wilcox (1964), and Miller (1966), reaches its rn - 1 decisions through the criterion decide ~-~ > ~-1 if and only if (R.~ - R.1) ~> r* (m, b),

(88)

where the critical value r~ (m, b) satisfies the probability restriction Po([R.u - R.,] ) r * ( m , b ) , u = 2 , . . . , m )

= 1 - a,

(89)

with the probability Po (.) computed under H~ (61). Values of r~ (m, b) can be found in Hollander and Wolfe (1973, Table A.18) for m = 3, b = 2(1)18 and m = 4, b = 2(1)5, and additional tables for m = 2(1)5, b = 2(1)8 and m = 6, b = 2(1)6 in Odeh (1977c). When the number of levels of factor B is large, the critical value r~ (ra, b) can be approximated by the constant q*,l/2[bm(m + 1)/6] a/2, where q'3/2 is the upper cah percentile for the distribution of the maximum of rn - 1 N(0, 1) variables with common correlation p = 1/2. Thus the approximation to procedure (88) for b large is decide ~-~ > ~-1 if and only if (R.= - R.,) >>.q~,l/2[brn(rn + 1)/6] 1/z.

(90)

* Selected values of q~,1/2 for m = 3(1)13 can be found in Hollander and Wolfe (1973, Table A.13).

744

A. M. Dean and D. A. Wolfe

3.3.7. Example- two-way layout, no replications, ordered alternatives To illustrate the test of H~- (61) against the ordered alternative H f (82) we use the air velocity data from Experiment 1.6. We are interested in testing for possible differences in effects among the m = 6 Reynolds numbers which we regard as levels of factor U. In this setting, it is reasonable to expect that, if there is a significant treatment effect associated with the Reynolds number, this effect will be a monotonically increasing function of the number. Therefore, we are interested in testing H~-: IT1 --- ~-2 . . . . . T6] versus H~': ['rl ~< "r2 ~< .-. ~< "r6, with at least one strict inequality]. The b -- 3 rib heights are the levels of factor B. We rank the m = 6 observations from least to greatest within each of the rib height levels (using average ranks to break the one tie). This gives the following rank sums for each of the six Reynolds numbers: R.1 = 4,

R.2 = 5.5,

R.3 = 8.5,

R.4 = 12,

R.5 = 16,

and

R.6 = 17. Page's test statistic (83) is then L = 270.5. If we compare this value with the exact tables for L (Table A.16 in Hollander and Wolfe (1973), for example) with m = 6 and b = 3, we see that 270.5 > 1.001(6, 3) = 260, so that the p-value for these data is less than .001. Thus, there is strong evidence that the distance (from the center of the rod) of maximum air velocity is a monotonically increasing function of the Reynolds number. In order to detect which of the Reynolds numbers yield maximum air velocities further from the center of the rod than for a baseline Reynolds number (designated as level 1 of U), we apply the Nemenyi-Wilcoxon-Wilcox-Miller one-sided treatment versus control procedure (88). Taking our experimentwise error rate to be a = .03475, we obtain r.*03475(6, 3 ) = 11 from Table I in Odeh (1977c), with m = 6 and b = 3. Using the above values of the within-B rank sums, we see that (R.2 - R . l ) = 1.5,

(R.5 - R.1) = 12,

(R.3 - R . l ) = 4.5,

(R.4 - R.1) = 8,

(R.6 - R.1) = 13.

If we compare these observed differences in the rank sums with the critical value 11, we see that our one-sided treatment versus control decisions are ~'2 = T1, ~-3 ----T1, ~-4 = ~-1, ~'5 > ~-1, and T6 > T1. Thus there is a statistically significant increase (over the effect of the baseline control Reynolds number corresponding to level 1 of U) in the position of maximum air velocity for the Reynolds numbers corresponding to levels 5 and 6 of U.

3.4. Two-way incomplete layout 3.4.1. Test for general alternatives in a balanced incomplete block design In this subsection, we discuss a distribution-free procedure that can be used to analyze data that arise from a balanced incomplete block design. A balanced incomplete block

Nonparametric analysis of experiments

745

design has b blocks with s (< m) treatments observed per block and no treatment observed more than once per block. Every pair of treatments occurs together in )~ blocks, and each of the rn treatments is observed p times. The parameters of a balanced incomplete block design satisfy p(s - 1) = , ~ ( m - 1). We are interested in testing H~(61) against the general alternatives H~- (62), under the assumption of no treatmentblock interaction. An appropriate nonparametric test statistic for a balanced incomplete block design was first proposed by Durbin (1951), with Skillings and Mack (1981) providing additional critical values. For this setting, we rank the observations within each block from 1 to s. Let R . 1 , . . . , R.m denote the sums of the within-blocks ranks for treatments 1,... ,m, respectively. The Durbin statistic for testing H~- (61) versus the general alternative H~- (62) is defined to be Tt~

T = [12/)~m(s + 1)] E ( R . j

- p(s + 1)/2) 2,

(91)

j=l

and the associated level c~ test of H~- versus H{ is reject H~" if and only if T >~ t~(m, s, A,p),

(92)

where tc~(m, s, .~,p) is the upper ath percentile for the null sampling distribution of the statistic T (91). Values (some of them obtained via simulation) of t~(zn, s, )bP) for a variety of balanced incomplete block designs and significance levels closest to .10, .05, and .01 have been tabulated by Skillings and Mack (1981). When the number of blocks is large, the statistic T has an asymptotic distribution [as b -+ ec] under H~- (61), that is chi-square with rn - 1 degrees of freedom. Thus the approximation to procedure (92) for large number of blocks is reject H~- if and only if T >~ ) / ~ ( m - 1),

(93)

where )/2 (m - 1) is the upper ath percentile of the chi-square distribution with m - 1 degrees of freedom. Skillings and Mack (1981) have noted that the chi-square approximation (93) can be quite conservative when a = .01 and either b or A is small. In particular, they suggest that the approximation is not adequate when A is either 1 or 2. In such cases, they strongly recommend the use of the exact tabulated values of t~ (rn, s, A, p) or the generation of 'exact' values via simulation in lieu of the chi-square approximation. 3.4.2. Main effect tests f o r general alternatives in arbitrary two-way incomplete layouts - no replications Not all incomplete block designs satisfy the necessary constraints to be balanced incomplete block designs. In this subsection, we present a procedure for dealing with data from a general two-way layout with at most one observation per cell. Let qi denote the number of treatments observed in block i, for i = 1 , . . . , b. (If q~ = 1 for any block i, remove that block from the analysis and let b correspond to the

746

A. M. Dean and D. A. Wolfe

number of blocks for which qi > 1.) We are interested in testing H~" (61) against the general alternatives H~" (62), under the assumption of no treatment-block interaction. Skillings and Mack (1981) proposed a test procedure that is appropriate for any incomplete two-way layout. We rank the data within block i from 1 to qi. For i = 1 , . . . , b and j = 1 , . . . , m , let

Rij = rank of Y~j among the observations present in block i, if nij

=

1,

= (qi + 1)/2, if nij = O. Next, we compute the adjusted sum of ranks for each of the m treatments as follows

b Aj = ~ [12/(qi + 1)]x/2[Rij - (qi + 1)/2], i=1

j = 1,...,m.

(94)

Set A = [A1,...,Am-1]'.

(95)

Without loss of generality, we have chosen to omit Am from the vector A. The Aj's are linearly dependent, since a weighted linear combination of all m of them is a constant. We could omit any one of the Aj's and the approach we now discuss would lead to the same test statistic• Now, the covariance matrix for A under H~" (61) is given by

~t~=2/~lt -A12

--A12 m

~ t ¢ 2 A2t

--A13

--~l,m-1

-A23

--)k2,ra--I

220 =

,

--Al,rn-1

--A2,m-1

(96)

--A3,m-1 "•" ~tm¢m_l )~m-l,t

where, for t ¢ j = 1 , . . . , m , Ajt = [number of blocks in which both treatments j and t are observed].

(97)

Let 220 be any generalized inverse for 22o. The Skillings-Mack statistic SM for testing H~- (61) versus the general alternatives H~- (62) is given by SM = X 2 2 o A .

(98)

If Ajt > 0 for all j ~ t, then the covariance matrix XT0 (96) has full rank and we can use the ordinary inverse 2201 in the definition of SM (98). The level c~ test of H~(61) versus H~" (62) is reject H~- if and only if SM ~> s*~(b,m, n11,... ,nb,~),

(99)

Nonparametric analysis of experiments

747

where s*~(b,ra, n , 1 , . . . , nbr,) is the upper c~th percentile for the null sampling distribution of the statistic SM (98). Values of s*~(b,m, nal,... ,nb,~) are not available in the literature for arbitrary incomplete block configurations. However, when every pair of treatments occurs in at least one block and the number of blocks is large, the statistic SM has an asymptotic distribution [as b -+ c~] under H~- that is chi-square with m - I degrees of freedom. Thus, when Ajt > 0 for all j ~ t = 1 , . . . , m, the approximation to procedure (99) for large b is reject H~- if and only if SM/> x ~ ( m - 1),

(100)

where X~ (m - 1) is the upper c~th percentile of the chi-square distribution with m - 1 degrees of freedom. Skillings and Mack (1981) have pointed out that the chi-square approximation (100) can be quite conservative when c~ is smaller than .01 and, in such cases, they recommend generation of 'exact' critical values s~(b, m, n , , , . . . , nbm) via simulation in lieu of the chi-square approximation. Note that if Ajt -- 0 for a particular pair of treatments j and t, so that j and t never appear together in a block, then H~- (61) could fail to be rejected even when ~-j and ~-t are quite different. Consequently, we recommend removing any such pairs of treatments from the analysis. If the sample size configuration satisfies the constraints for a balanced incomplete block design, then the Skillings-Mack statistic SM (98) is identical to the Durbin statistic T (91). If the sample sizes are all 1, so that we have a randomized complete block design, the Skillings-Mack statistic SM (98) is identical to the Friedman statistic S (76).

3.4.3. Two-sided all treatments multiple comparisons for data from a balanced incomplete block design In this subsection, we discuss a multiple comparison procedure that is designed to make two-sided decisions about all ra(m - 1)/2 pairs of treatment effects when we have data from a balanced incomplete block design. It is appropriate as a followup procedure to the general alternatives Durbin-Skillings-Mack test for equality of treatment effects in a balanced incomplete block design, as discussed in Section 3.4.1. Let R.,, . . . , R.m be the sums of the within-blocks ranks for treatments 1 , . . . , m, respectively. Let s denote the number of observations present in each of the blocks, let ,k denote the number of blocks in which each pair of treatments occurs together, and let p denote the total number of observations on each of the m treatments. At an experimentwise error rate no greater than a, the Skillings-Mack two-sided alltreatments multiple comparison procedure reaches its m ( m - 1)/2 decisions through the criterion decide ~-~ ¢ T. if and only if IR.~ - R.v[ ) [t~Am(s + 1)/6] '/2,

(101)

where t~ = ta (m, s, A, p) is the upper ath percentile for the null sampling distribution of the Durbin test statistic T (91), as discussed in Section 3.4.1. The associated multiple

748

A. M. Dean and D. A. Wolfe

comparison procedure (101) controls the experimentwise error rate over the entire class of continuous distributions for the error terms. Values of t~ = t ~ ( m , s, A,p) for a variety of balanced incomplete block designs and experimentwise error rates closest to .10, .05, and .01 are available in Skillings and Mack (1981). When the number of blocks is large, the critical value [t~Am(s + 1)/6] 1/2 can be approximated by [(s + 1)(ps - p + A)/12]l/2q~, where q~ is the upper ath percentile for the distribution of the range of m independent N(0, 1) variables. Thus, the approximation to procedure (101) for b large is decide ~-u ¢ % if and only if

IR.u - R.vl >1 q~[(s + 1)(ps - p + A)/1211/2.

(lO2)

Values of q~ for a = .0001, .0005, .001, .005, .01, .025, .05, .10, and .20 and m --- 3(1)20(2)40(10)100 can be found, for example, in Hollander and Wolfe (1973, Table A. 10). Skillings and Mack (1981) note that the procedure (101) is rather conservative; that is, the true experimentwise error rate might be a good deal smaller than the bound c~ provided by (101). As a result, they recommend using the approximation (102) whenever the number of blocks is reasonably large. 3.4.4. Example - incomplete two-way layout

We apply the Durbin procedure (92) for testing H~- (61) against a general alternative H~- (62) to the monovinyl isomers data in the balanced incomplete block design of Experiment 1.7. The blocks (levels of B) are shown as columns in Table 6, and the levels of treatment factor U ("pressure" measured in pounds per square inch) indicate the rows. If we rank the data within each of the ten blocks (using average ranks to break the one tie), we obtain the following treatment rank sums: R.1 = 7.5,

R.2 = 7.5,

R.3 = 13,

R .4 =

15,

R.5 = 17.

Using these treatment rank sums in expression (91), we obtain T = 15.1. If we compare this value with the null distribution for T (in Table 2 of Skillings and Mack, 1981) with ra = 5, b = 10,p = 6, s = 3 and A = 3, we find that the p-value for these data is less than .0105. This indicates strong evidence that there is a significant difference in the effects of the various pressure levels on the percent conversion of methyl glucoside to monovinyl isomers. In order to detect which of the five pressure levels differ, we apply the SkillingsMack multiple comparison procedure, as given in (10!). Taking our experimentwise error rate to be a = .0499, we see from Table 2 in Skillings and Mack (1981), with m = 5, b --- 10, A = 3,p = 6, and s = 3, that t.0499 = 9.200. Thus, our decision criterion is decide ~-~ ¢ ~-v if and only if IR.u - R.vl ) [(9:20)(3)(5)(4)/6] 1/2 = 9.59.

Nonparametric analysis of experiments

749

Comparing the differences in the above treatment rank sums with the critical value 9.59, we see that, with an experimentwise error rate of c~ = .0499, we cannot find any pairwise differences between the treatment effects, despite the fact that the p-value for the Durbin hypothesis test was less than .0105. This clearly illustrates the conservative nature of procedure (101), as noted by Skillings and Mack (1981). If we choose instead to use approximation (102), taking b = 10 to be 'large', our critical value for approximate experimentwise error rate c~ = .05 would be lower; that is, q.0514{6(3) - 6 + 3}/12] 1/2 = 3.858(2.236) = 8.627, where q.05 = 3.858 is obtained from Hollander and Wolfe (1973, Table A.10) with m = 5. The decisions associated with this approximate procedure (102) at c~ = .05 are that T1 = 7-2,7-1 = 7"3,7-1 = 7"4,7-1 • 7-5,7-2 = 7"3,7-2 ~-- 7-4,7-2 ¢ 7"5,7-3 = 7"4,7-3 = 7-5, and 7-4 = TS. Thus, with the approximate procedure (102), we are able to distinguish that pressures of 475 and 550 psi each lead to different percent conversions of methyl glucoside to monovinyl isomers than does the pressure 250 psi.

3.5. Other tests in the two-way layout

Mack (1981) describes a test of H d" (61) against the general class of alternatives H I (62) for the incomplete two-way layout with some cells empty and other cells with more than one observation per cell. A test against the class of ordered alternatives H~" (83) for this same setting is given by Skillings and Wolfe (1977, 1978). For equal sample sizes, Mack's statistic is identical to a statistic proposed by De Kroon and Van der Laan (1981) for testing the global hypothesis H0T (53) of no main effect of factor U and no interaction between factors/3 and U. We refer the reader to the original papers for details of each of these tests.

4. Higher way layouts 4.1. Introduction

Specialized procedures for the two-way layout were discussed in Section 3. In this section, we consider general procedures for experiments with two or more factors. Most of the techniques for the one-way layout, as described in Section 2, can be used for comparing the effects on the response of the various treatment combinations. However, there are available some specialized nonparame.tric techniques for investigating the effects of the factors separately. We discuss two such procedures in this section. In Section 4.2, we discuss a technique for analyzing a factorial experiment with any number of factors when there is at least one observation per cell and all interactions among the factors are expected to be negligible. The method investigates the main effects, one factor at a time, grouping the levels of the other factors together and using the techniques of the two-way layout.

750

A. M. Dean and D. A. Wolfe

A robust version of the general linear model least squares analysis (RGLM) was developed by McKean and Hettmansperger (1976). This procedure, which is described in Section 4.3, is very general, but requires a computer algorithm for its implementation. Hettmansperger and McKean (1983), Aubuchon and Hettmansperger (1984), and Draper (1988) describe and discuss other methods available in the literature for testing parameters in the general linear model. The results of a small power study run by Hettmansperger and McKean (1983) suggest that the RGLM method, with the traditional type of test statistic based on the difference of error sums of squares for the reduced and full models, gives tests at least as powerful as the competitors. In addition, the RGLM procedure has most of the features of least squares analysis and allows estimation of parameters and multiple comparison procedures to be implemented. Consequently, we only discuss the RGLM method of handling the general linear model in this article, and refer the reader to the above mentioned papers for aligned rank and other related procedures.

4.2. Using two-way layout techniques 4.2.1. Main effect tests - multi-factor experiment

Groggel and Skillings (1986) suggested a method for testing individually the hypotheses of no main effects in a multi-factor experiment which utilizes the two-way analysis of Mack and Skillings (see Section 3.2.3). The method requires that there are no interactions between the factors and that all treatment combinations are observed at least once. First of all, focus on just one of the factors and label it factor U with m levels, and regard the combinations of the levels of the other factors as the b levels of a combined factor/3. If there are nij observations in a particular cell in the original experiment, then there are still nij observations in an analogous cell in the experiment involving factors B and U. We write the model as an additive two-way layout model (50) and we wish to test the hypothesis H~-: [~-j are all equal, j = 1 , . . . , m], where 7j is the effect of the jth level of factor U, against a general alternative hypothesis that the effects of factor U are not all equal. The general formula for the Mack-Skillings statistic is given in (66). For factorial experiments with a large number of factors, it is usual that a small equal number n of observations are taken per treatment factor combination, in which case the test statistic reduces to (69). We take this equal number n of cell observations to be the case throughout the rest of this section. Let Rijk be the rank of Yijk within the m n observations on the ith level of the combined factor B. Let b

n

i=1 k=l

The decision rule for testing H~" against a general alternative hypothesis, as given by (70), is

Nonparametric analysis of experiments

751

reject Hff if and only if 12

( N + b)] 2

~[

r a ( N + b)

j=l

2

mSj

>~ w a ( b , m , n , . . .

,n),

(103)

where N = bran and w=(b, re, n , . . . , n) can be obtained from the table in Mack and Skillings (1980) for small values of b, m, and n, and can be approximated by X ~ ( m - 1) if either n or b is large (see Groggel and Skillings, 1986). The test is applied to each factor in turn, the remaining factors forming the combined factor/3.

4.2.2. Example - multi-factor experiment, main effect tests We use factors C, D and E from Experiment 1.2 as an illustration of Groggel and Skilling's two-way procedure, and we ignore the other five factors as though they had not been part of the experiment. (Since this method is not appropriate for fractional factorial experiments, it cannot be used to test the significance of the eight factors in the entire experiment.) We use the mean thickness response (averaged over seventy levels of noise variables) for this illustration. The levels of the three factors C, D and E and the observed mean thicknesses are shown in Table 10. (Note that the treatment combinations have been reordered from Table 1.) Table 10 Ranks of mean thickness of epitaxial layer - factors C, D and E Mean thickness

C

D

E -1 -1

14.821 14.757

14.878 14.843

-1 1

-1 -1

14.888 14.921

14.932 14.415

-1 1

-1 -1

14.165 14.037

13.972 13.907

-1 1

1 1

13.860 13.880

14.032 13.914

-1 1

1 1

Rij k

Rij.

2 1

4 3

6 4

1

2

4

6

1

3

1

4

-1 -1

4 3

2 1

6 4

1 2

4 3

5 5

1 1

First, let U be factor C with m = 2 levels, and take the b = 4 levels of factor /3 to be the four levels ( - 1 , - 1 ) , ( - 1 , 1), ( 1 , - 1 ) and (1, 1) of the combined factors D and E. Then the ranks assigned to the m = 2 levels of factor U (C) at each of the b = 4 levels of the combined f a c t o r / 3 (D and E ) are as shown in Table 10. The corresponding cell-wise weighted average ranks (63) for the two levels of U are S ~ = 2 3 / 4 = 5.75 and S ~ = 17/4 = 4.25. The decision rule (103) for testing H~against a general alternative hypothesis is to reject H~- if and only if 12 2 ( 1 6 + 4 ) [ ( 1 1 . 5 - 1 0 ) 2 + ( 8 . 5 - 10) 2] = 1.35 > w a ( 4 , 2 , 2 , . . . , 2 ) .

752

A, M. Dean and D. A. Wolfe

From the table of Mack and Skillings (1980) for m = 2, b = 4, and n = 4, we obtain w.0o77(4, 2, 2 , . . . , 2) = 7.35. Thus at level c~ = .0077 there is not sufficient evidence to reject the null hypothesis of no effect of the levels of C on the average thickness of the epitaxial layer. On the other hand, if we let U be factor D and let B represent the combined factors C and E, similar calculations lead to the rank sums for the two levels of U being S D = 28/4 = 7 and S D = 12/4 = 3. Using the decision rule (103) for testing H~against a general alternative hypothesis for factor D, we reject Hff since 12 [(14 - 10) 2 + (6 - 10) 2] = 9.6 > 7.35 = w.0077(4, 2, 2, 2). 2(16+4) "'" The hypothesis that the levels of E have no effect on the average thickness of the epitaxial layer would not be rejected since the value of the associated test statistic is 0.

4.3. Robust general linear model 4.3.1. Tests and confidence intervals - multi-factor experiments Hettmansperger and McKean (1977, 1978) describe a robust alternative (RGLM) to the method of least squares which gives a general theory for estimation and testing of parameters in the linear model without requiring assumptions about the common distribution of the independent error variables. Recent discussion articles on the method are given by Draper (1988), Aubuchon and Hettmansperger (1984) and McKean and Vidmar (1994). Let 3, be the vector of parameters corresponding to a complete set of linearly independent contrasts in all treatment and block effects (but excluding the general mean) and let X c be the corresponding design matrix, whose columns contain the contrast coefficients (so that the column sums of the X~ matrix are zero). We can then write the general linear model in matrix terms as Y = 10 + X c ~ ' + E where Y ' = [ Y ] , . . . , 1~)] and E ' = [ E l , . . . , EN] are the vectors of response and error variables, respectively, 1 is a vector of l's, and 3, is the vector of location parameters. Temporarily, we assume that the common distribution of the errors is symmetric around zero. We write the qth row of the matrix model as Yq = O+xq'7+Eq with Xq representing the qth row of the design (contrast) matrix Xc. We let y and e represent the "observed" values of Y and E , respectively. The traditional method of least squares selects an estimator "~ of "7 to minimize the sum of squares of the errors; that is, to minimize N N e r e ~- E eq: = q=l q=l

Nonparametric analysis of experiments

753

In a similar fashion, the RGLM method selects an estimator ~ of 7 to minimize the sum of the weighted errors, N

D(~') = E

N

[a (R(eq))] eq = E

q=l

a (Rq)eq,

(104)

q=l

where, for q = 1,... ,N, Rq = R(eq) is the rank of eq among all Net's and a (Rq) is some function of Rq, called the score. The most commonly used scores are the Wilcoxon scores, defined as (105) These scores ensure that negative and positive residuals of the same magnitude have equal weight in the minimization and that larger residuals have more weight than smaller ones (in the same way that a large squared residual has more weight than a small one in the method of least squares). It is possible to modify the scores so that residuals corresponding to outlying observations play little or no part in the minimization, as discussed by Hettmansperger and McKean (1977) and by Draper (1988), who also lists several other types of scores. Using the Wilcoxon scores, the RGLM estimate of 3' is then the value ~, that minimizes

D(~') = E

q=l

x/~

NT1

(106)

An experimental version of an algorithm which minimizes this function is available in the r r e g routine of all versions of MINITAB since version 8. McKean and Vidmar (1994) state that a SAS algorithm is currently under development and that a PC algorithm can be obtained from those authors. For a single replicate or fractional factorial experiment, the robust contrast estimates coincide with the estimates obtained from the method of least squares, although the test statistic, discussed below, differs. The same is true of any fully parameterized model with two observations per cell. The test was originally developed as an asymptotic test and, as such, is best with large numbers of observations per cell. In factorial experiments, large sample sizes rarely occur, and an approximation for small samples will be discussed below. In such cases, stated significance levels should be interpreted with caution. Suppose that the null hypothesis H~r: [Hq, = 0] is to be tested against the general alternative hypothesis H~: [ H 7 # 0], where H ' 7 represents a vector of h linearly independent estimable functions of the parameters. Then, as in the general linear model approach, a test statistic can be obtained by comparing the value D(Tn) of

754

A. M. Dean and D. A. Wol'fe

D(3") (106) for the reduced model under H i with D ( ~ F ) for the full model. The test statistic is

where 6 is an estimator of 6. For symmetric error distributions, 6 is taken by Hettmansperger and McKean (1977) to be

where W I , . . . , WN(N+I)/2 are the pairwise Walsh a v e r a g e s (~q + ~ p ) / 2 , with ~i = (yi - x ~ ) . Also, W@) is the ith largest among W1, • • •, W N ( N + O / 2 , and t is the lower critical point of a two-sided level c~ Wilcoxon signed-rank test, which is approximated by

The estimator 6 is called the H o d g e s - L e h m a n n e s t i m a t o r and is calculated in the HINITAB r r e g routine using c~ = 0.10 in (108). When the error distribution is not symmetric, the Wilcoxon scores can be replaced by the sign scores and the estimator of 6 by a kernel density estimator or a two-sample Hodges-Lehmann estimator (see Hettmansperger, 1984, 244-250 and Draper, 1988, 253). For a large number of observations per cell, the test statistic FR (107) has an approximate chi-square distribution divided by its degrees of freedom, h. However, for small numbers of observations per cell, the true distribution of FR is better approximated by an F ( h , N - h - 1) distribution. The test of the null hypothesis H~ against the general alternative hypothesis H~ at significance level c~ is then

where F ~ ( h , N - h - 1) is the upper c~th percentile of an F-distribution with h and N - h - 1 degrees of freedom. The sample estimates ~ of a set of linearly independent contrasts 3' and their corresponding estimated standard errors (given by the diagonal elements of 6 x / ( X ~ X c ) -1 ) can be obtained from the MINITAB r r e g routine. A standard set of linearly independent contrasts that can be used in a model with two factors are the treatment-control contrasts f l l - f l i , 7"1--Tj and the interaction contrasts (/3T)11 --(/3~-)il --(j3~-) U +(/3T)ij (i = 2 , . . . , b; j = 2 , . . . , ra). If there are more than two factors, then the standard set of contrasts would be extended by including similar sets of contrasts involving the extra factors, together with higher order interaction contrasts. Any other contrast can be written as a linear combination, I'3", of the standard contrasts. The contrast estimator

Nonparametric analysis of experiments

755

Table 11 Design matrix X~ for the welding experiment 1 1 1 1 1

--1 --1 --1 -I -l 0 0 0 0 0

1 1 1 1 1

1

1 0

-1 0 0 0

1 0 0

-1 0 0

1 0 0 0

-i 0

-1

2 1 0 -1 -2

-2 1 2 1 -2

0 1 1 1 1 0 0 0 --1 0 0 0 0 0 0 0 --1 0 0 0 0 0 0 0 --1 0 0 0 0 0 0 0 --1 0 0 --1 1 1 1 1 --2 2 --1 --1 0 0 0 --1 --1 --1 0 -1 0 0 0 --2 --1 0 0 --1 0 1 --1 --1 0 0 0 --1 2 2

-2 -1

2 0 1 2

4 2 0 --2 --4 --2 --1 0 1 2

-1 -2 -I 2

--4 2 4 2 --4 2 --1 --2 --1 2

would then be l'~,, with the estimated standard error ~ ~ / l ' ( X ~ c X c ) - l l . Confidence intervals for a set of contrasts can be calculated using the usual Scheff6 method of multiple comparisons; that is, a set of 100(1 - ~ ) % simultaneous confidence intervals for any number of contrasts of the form l' 7 is given by

l'~, =t=v/hF,~(h, N - h - 1)

~v/lt(XtaXa)-lg

(111)

(see Hettmansperger and McKean, 1978).

4.3.2. Example - R G L M method, multi-factor experiment We use the welding data of Experiment 1.4 to illustrate the R G L M approach. There are two treatment factors with 3 and 5 levels, respectively, and an interaction is expected. A fully parameterized model with two observations per cell will lead to contrast estimates identical to the least squares estimates. For illustration purposes we include in the matrix model just the treatment-control contrasts/31 - / 3 i and "q - ~-j, and the linearly independent trend contrasts representing the linear x linear, linear x quadratic, quadratic x linear and quadratic x quadratic components of the interaction (obtained from a table of orthogonal polynomials). The X c matrix is then as shown in Table 11, but with each row repeated twice, corresponding to the two observations per cell. If the columns of X e are entered into the Minitab r r e g command, the minimum value o f D ( ' y ) (106) for the full model (with main effects and the four interaction trend contrasts) is obtained as D ( ~ ' r ) = 118.06. Dropping the last 4 columns of the X ~ matrix (which correspond to the four interaction contrasts) gives the the minimum value of D ( ' y ) for the reduced model under the hypothesis o f negligible interaction contrasts as D ( ~ n ) = 136.44. These values are obtained, together with the value of = 5.359, from NTNITAB r r e g by specifying that the hypothesis o f interest is that the last 4 parameters (contrasts) are all zero. The statistic F n for testing the hypothesis

756

A. M. Dean and D. A. Wolfe

H~ ad (54) against an alternative hypothesis that the linear x linear, linear x quadratic, quadratic x linear and quadratic x quadratic c o m p o n e n t s of the interaction are not all zero is then given by (107) as F R = ( 1 3 6 . 4 4 - 1 1 8 . 0 6 ) / 4 _ 1.71. (5.359/2) Since/;0.o5(4, 25) = 3.06, the hypothesis that the low order interaction trend contrasts are negligible is not rejected at significance level c~ = 0.05. In Section 3.2.2, the hypothesis of no rank interaction was rejected at significance level just u n d e r 0.05. The reason for the apparent contradiction is that the strongest interaction trend contrasts are the quartic contrasts, and these were assumed negligible in the above m o d e l (as is often done in practice w h e n low order trends are fitted). T h e m a i n effect treatment-control contrasts are estimated by the M i n i t a b rreg routine as /3l~"~'-/32 = - 2 . 1 1 rl - T2 = 2.73 A

/31~"~/33 = 3.82 rl -- <~ = --6.67 A

A

rl

A

--

T4

=

0.65

71 - r5 = 0.03.

Other contrast estimates can be obtained as linear c o m b i n a t i o n s of these basic contrast estimates. For example, the estimate of the contrast which compares the third gage bar setting with the average of the other two is A

/ 3 3 - 1(/31 "-b/32) = 1[(/31--'~-/32)- 2(/31~~'~/33)] " =--4.88. If a confidence interval is required for this contrast as part of a simultaneous set of 95% Scheff6 intervals, we use (111) and obtain A

/33 - 1(/31 +/32) 6 - 4 . 8 8 ± X/10F0.05(4, 2 5 ) ( 5 . 3 5 9 ) ~ = - 4 . 8 8 ± 8.17.

References Akritas, M. G. (1990). The rank transform method in some two-factor designs. J. Amer. Statist. Assoc. 85, 73-78. Akritas, M. G. and S. E Arnold (1994). Fully nonparametric hypotheses for factorial designs I: Multivariate repeated measures designs. J. Amer. Statist. Assoc. 89, 336-343. Anderson, V. L. and R. A. McLean (1974). Design of Experiments: A Realistic Approach. Marcel Dekker, New York. Aubuchon, J. C. and T. P. Hettmansperger (1984). On the use of rank tests and estimates in the linear model. In: P. R. Krishnaiah and P. K. Sen, eds., Handbook of Statistics, Vol. 4. Elsevier, Amsterdam, 259-274. Blair, R. C., S. S. Sawilowsky and J. J. Higgins (1987). Limitations of the rank transform statistic in tests for interaction. Comm. Statist. Simulation Comput. 16, 1133-1145. Conover, W. J. and R. L. Iman (1976). On some alternative procedures using ranks for the analysis of experimental designs. Comm. Statist. Theory Methods 5, 1349-1368.

Nonparametric analysis of experiments

757

Conover, W. J. and R. L. Iman (1981). Rank transformations as a bridge between parametric and nonparametric statistics (with discussion). Amer. Statist. 35, 124--133. Critchlow, D. E. and M. A. Higner (1991). On distribution-free multiple comparisons in the one-way analysis of variance. Comm. Statist. Theory Methods 20, 127-139. Damico, J. A. and D. A. Wolfe (1987). Extended tables of the exact distribution of a rank statistic for all treatments multiple comparisons in one-way layout designs. Comm. Statist. Theory Methods 16, 23432360. Darnico, J. A. and D. A. Wolfe (1989). Extended tables of the exact distribution of a rank statistic for treatments versus control multiple comparisons in one-way layout designs. Comm. Statist. Theory Methods 18, 3327-3353. De Kroon, J. and E Van der Laan (1981). Distribution-free test procedures in two-way layouts; a concept of rank interaction. Statist. Neerlandica 35, 189-213. Dehnad, K. (1989). Quality Control, Robust Design and the Taguchi Method. Wadsworth and Brooks/Cole, Pacific Grove. Doksum, K. (1967). Robust procedures for some linear models with one observation per cell. Ann. Math. Statist. 38, 878-883. Draper, D. (1988). Rank-based robust analysis of linear models I: Exposition and review. Statist. Sci. 3, 239-271. Durbin, J. (1951), Incomplete blocks in ranking experiments. British J. Statist. Psychol. 4, 85-90. Dunn, O. J. (1964). Multiple comparions using rank sums. Technometrics 6, 241-252. Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Amer. Statist. Assoc. 32, 675-701. Groggel, D. J. and J. H. Skillings (1986). Distribution-free tests for main effects in mnltifactor designs. Amer. Statist. 40, 99-102. Hartlanb, B. A., A. M. Dean and D. A. Wolfe (1993). Nonpararnetric aligned-rank test procedures for the presence of interaction in the two-way layout with one observation per cell. Tech. Report 525, Department of Statistics, The Ohio State University. Hayter, A. J. and G. Stone (1991). Distribution free multiple comparisons for monotonically ordered treatment effects. Austral J. Statist. 33, 335-346. Hettmansperger, T. E (1984). Statistical Inference Based on Ranks. Wiley, New York. Hettmansperger, T. E and J. W. McKean (1977). A robust alternative based on ranks to least squares in analyzing linear methods. Technometrics 19, 275-284. Hettmansperger, T. E and J. W. McKean (1978). Statistical inference based on ranks. Psychometrika 43, 69-79. Hettmansperger, l". E and J. W. McKean (1983). A geomelric interpretation of inferences based on ranks in the linear model. J. Amer. Statist. Assoc. 78, 885-893. Hollander, M. and D. A. Wolfe (1973). Nonparametric Statistical Methods. Wiley, New York. Iman, R. L. and J. M. Davenport (1980). Approximations of the critical region of the Friedman statistic. Comm. Statist. Theory Methods 9, 571-595. Iman, R. L., D. Quade and D. A. Alexander (1975). Exact probability levels for the Kruskal-Wallis test. In: H. L. Hatter and D. B. Owen, eds., Selected Tables in Mathematical Statistics, Vol. III. Markham Press, Chicago, 329-384. Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered alternatives. Biometrika 41, 133-145. Kendall, M. G. and B. Babington Smith (1939). The problem of rn rankings. Ann. Math. Statist. 10, 275-287. Kruskal, W. H. and W. A. Wallis (1952). Use of ranks in one-criterion variance analysis. J. Amer. Statist. Assoc. 47, 583-621. Kuehl, R. O. (1994). Statistical Principles of Research Design and Analysis. Duxbury Press, Belmont, CA. Mack, G. A. (1981). A quick and easy distribution-free test for main effects in a two-factor ANOVA. Comm. Statist. Simulation Comput. 10, 571-591. Mack, G. A. and J. H. Skillings (1980). A Friedman-type rank test for main effects in a two-factor ANOVA. J. Amer. Statist. Assoc. 75, 947-951.

758

A. M. Dean and D. A. Wolfe

Mack, G. A. and D. A. Wolfe (1981). K-sample rank tests for umbrella alternatives. J. Amer Statist. Assoc. 76, 175-181. McDonald, B. J. and W. A. Thompson, Jr. (1967). Rank sum multiple comparisons in one- and two-way classifications. Biometrika 54, 487497. McKean, J. W. and T. P. Hettmansperger (1976). Tests of hypotheses of the general linear model based on ranks. Comm. Statist. Theory Methods 5, 693-709. McKean, J. W. and T. J. Vidmar (1994). A comparison of two rank-based methods for the analysis of linear models. Amer. Statist. 48, 220-229. Miller, R. G., Jr. (1966). Simultaneous Statistical Inference. McGraw-Hill, New York. Mosteller, E and R. E. K. Rourke (1973). Sturdy Statistics. Addison-Wesley, Reading, MA. Nemenyi, P. (1963). Distribution-free multiple comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ. Odeh, R. E. (1971). On Jonckheere's k-sample test against ordered alternatives. Technometrics 13, 912-918. Odeh, R. E. (1977a). Extended tables of the distribution of Friedman's S-statistic in the two-way layout. Comm. Statist. Simulation Comput. 6, 29-48. Odeh, R. E. (1977b). The exact distribution of Page's L-statistic in the two-way layout. Comm. Statist. Simulation Comput. 6, 49-61. Odeb, R. E. (1977c). Extended tables of the distributions of rank statistics for treatment versus control in randomized block designs. Comm. Statist. Simulation Comput. 6, 101-113. Page, E. B. (1963). Ordered hypotheses for multiple treatments: A significance test for linear ranks. J. Amer. Statist. Assoc. 58, 216-230. Randles, R. H. and D. A. Wolfe (1979). Introduction to the Theory of Nonparametric Statistics. John Wiley, New York (reprinted in 1991 by Krieger Publishing, Malabar, FL). Rust, S. W. and M. A. FISgner (1984). A modification of the Kruskal-Wallis statistic for the generalized Behrens-Fisher problem. Comm. Statist. Theory Methods 13, 2013-2027. Sawilowsky, S. S. (1990). Nonparametric tests of interaction in experimental design. Rev. Ed. Res. 60, 91-126. Simpson, D. G. and B. H. Margolin (1986). Recursive nonparametric testing for dose-response relationships subject to downturns at high doses. Biometrika 73, 589-596. Skillings, J. H. and G. A. Mack (1981). On the use of a Friedman-type statistic in balanced and unbalanced block designs. Technometrics 23, 171-177. Skillings, J. H. and D. A. Wolfe (1977). Testing for ordered alternatives by combining independent distribution-free block statistics. Comm. Statist. Theory Methods 6, 1453-1463. Skillings, J. H. and D. A. Wolfe (1978). Distribution-free tests for ordered alternatives in a randomized block design. J. Amer. Statist. Assoc. 73, 427-431. SpjCtvoll, E. (1968). A note on robust estimation in analysis of variance. Ann. Math. Statist. 39, 1486-1492. Terpstra, T. J. (1952). The asymptotic normality and consistency of Kendall's test against trend, when ties are present in one ranking. Indag. Math. 14, 327-333. Thompson, G. L. (1991). A unified approach to rank tests for multivariate and repeated measures designs. J. Amer Statist. Assoc. 86, 410-419. Wilcoxon, E and R. A. W~lcox (1964). Some Rapid Approximate Statistical Procedures, 2nd edn. American Cyanamid Co., Lederle Laboratories, Pear! River, NY. Wilkle, D. (1962). A method of analysis of mixed level factorial experiments. Appl. Statist. 11, 184-195. Wood, S. R. and D. E. Hartvigsen (1964). Statistical design and analysis of qualification test program for a small rocket engine, lndustr. Quality Contr. 20, 14-18.