On power and sample size computation for multiple testing procedures


Computational Statistics and Data Analysis 55 (2011) 110–122

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

Jie Chen a,∗, Jianfeng Luo b, Kenneth Liu c, Devan V. Mehrotra c

a Abbott Laboratories, P4ED, AP34-2, 100 Abbott Park Road, Abbott Park, IL 60064-3500, USA
b Department of Biostatistics, Fudan University, School of Public Health, Shanghai 200032, China
c Clinical Biostatistics, Merck Research Laboratories, PO Box 1000, North Wales, PA 19454, USA

Article info

Article history: Received 18 May 2009; received in revised form 7 February 2010; accepted 24 May 2010; available online 11 June 2010.

Keywords: Power; Sample size; Correlation; Multiple tests; Order statistics

Abstract

Power and sample size determination has been a challenging issue for multiple testing procedures, especially stepwise procedures, mainly because (1) there are several power definitions, (2) power calculation usually requires multivariate integration involving order statistics, and (3) expansion of these power expressions in terms of ordinary statistics, instead of order statistics, is generally a difficult task. Traditionally, power and sample size calculations rely on either simulations or some recursive algorithm; neither is straightforward or computationally economical. In this paper we develop explicit formulas for minimal power and r-power of stepwise procedures as well as complete power of single-step procedures for exchangeable and non-exchangeable bivariate and trivariate test statistics. With the explicit power expressions, we are able to directly calculate the desired power, given sample size and correlation. Numerical examples are presented to illustrate the relationship among power, sample size and correlation.

© 2010 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +1 215 718 9337.
0167-9473/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2010.05.024

1. Introduction

When multiple null hypotheses are tested simultaneously, multiplicity adjustment is generally performed to control some type of error, e.g., the familywise error rate (FWER), the probability of rejecting at least one true null hypothesis. Among the commonly used multiple testing methods, stepwise procedures are usually favored over single-step procedures due to power improvement and FWER control (Hochberg and Tamhane, 1987; Hsu, 1996; Tamhane, 1996). However, power and sample size determination in a stepwise multiple testing situation has been a challenging issue mainly because (1) there are several definitions of power, (2) power calculation usually requires multivariate integration involving order statistics, and (3) expansion of the probability expressions for power in terms of order statistics is generally a difficult task, especially for high-dimensional problems.

With multiple null hypotheses, the power can be defined as minimal power (the probability of rejecting at least one false null hypothesis), complete power (the probability of rejecting all false null hypotheses), individual power (the probability of rejecting a particular false null hypothesis), and average power (the average proportion of false null hypotheses that are rejected); the reader is referred to Westfall et al. (1999) for more discussion on different power definitions. Following the same line of minimal power and complete power, one may be interested in the power of rejecting at least r out of m false null hypotheses, 2 ≤ r ≤ m − 1; here we simply call it r-power.

Traditionally, power and sample size determination for stepwise multiple tests relies on either simulations such as the re-sampling method (Dunnett and Tamhane, 1993; Westfall et al., 1999; Bang et al., 2005; Xiong et al., 2005; Senn and Bretz, 2007), or some recursive algorithm for probability evaluation of order statistics (Dunnett and Tamhane, 1992; Dunnett et al., 2001). The simulation approaches, however, are usually computationally intensive and may not be able to discover certain subtleties. For example, intuitively, given a sample size, the minimal power and r-power of a multiple testing procedure are monotonically decreasing and the complete power is monotonically increasing as the correlation among test statistics increases. However, our illustration, both mathematically and graphically, shows that the minimal power and r-power of a step-up procedure are not monotonic over ρ ∈ [0, 1]. On the other hand, the recursive algorithm may not be applicable for direct evaluation of the impact of dependence on power, and can only be used for the equicorrelation case (Dunnett and Tamhane, 1992).

We develop in this paper explicit formulas of minimal power and r-power primarily for stepwise multiple testing procedures with exchangeable and non-exchangeable bivariate and trivariate test statistics. For comparison purposes, formulas for complete power of single-step procedures are also presented. The explicit power expressions allow us not only to calculate the power in a straightforward manner given a sample size and correlation, but also to evaluate the relation among power, sample size and correlation, both mathematically and numerically. As a direct method, it reduces the computational time substantially compared with the simulation approaches. By computing the powers for a series of sample sizes and a given correlation structure, one simply chooses the sample size that is required to achieve the desired level of the corresponding power. Even though we illustrate the power computation based on normally distributed data, the formulas are applicable to any multivariate distribution with known approximate correlation, since, as can be seen in the power formulas, the test statistics are not required to follow any particular distribution.
Finally, all the power formulas are expressed as probability functions in the form of marginal or joint cumulative distribution functions that can be directly calculated in many software packages such as SAS and R (pmvt).

The paper is organized as follows. Section 2 presents some preliminaries and notations on multiple testing procedures. Explicit formulas of minimal power and r-power for step-up and step-down procedures with exchangeable and non-exchangeable bivariate and trivariate test statistics are given in Sections 3 and 4, respectively. Complete power formulas of single-step procedures are presented in Section 5. An investigation of the relation among power, correlation and sample size is given in Section 6. An example of a clinical trial with two endpoints is presented in Section 7, in which the relation of sample size, correlation coefficient, and treatment effects is further investigated. Finally, some concluding remarks are given in Section 8.

2. Preliminaries and notations

Let $X_{ij} = \{X_{ijk} : k = 1, \ldots, m\}$, $i = 1, 2$ and $j = 1, \ldots, n_i$, be independent and identically distributed random vectors from an m-variate normal distribution with mean vector $\mu_i = \{\mu_{i\cdot k} : k = 1, \ldots, m\}$, a common variance $\sigma^2$ and a common correlation coefficient ρ. Let $\theta_k = \mu_{1\cdot k} - \mu_{2\cdot k}$ and $\theta = \{\theta_k : k = 1, \ldots, m\}$ be the vector of mean differences between the two groups. Suppose we are interested in testing $H_k : \theta_k \le \Delta_k$ against $\bar{H}_k : \theta_k > \Delta_k$, with $\Delta_k$ being a suitably pre-specified effect margin for the k-th hypothesis, k = 1, . . . , m. This type of hypothesis is frequently encountered in clinical trials with multiple endpoints, such as non-inferiority trials when $\Delta_k < \delta$ or superiority trials when $\Delta_k > \delta'$, where δ and δ′ are some pre-defined, clinically meaningful margins. Then the usual test statistic $T_k$ is given by

$$T_k = \frac{(\bar{X}_{1\cdot k} - \bar{X}_{2\cdot k}) - \Delta_k}{\sigma\sqrt{1/n_1 + 1/n_2}}, \qquad k = 1, \ldots, m, \tag{1}$$

where $\bar{X}_{i\cdot k} = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ijk}$, i = 1, 2. Note that with $\sigma^2$ being replaced by its estimate $\hat{\sigma}^2$ in (1), each of the $T_k$ marginally follows a univariate t-distribution with $d = m(n_1 + n_2 - 2)$ degrees of freedom, denoted by $t_d$, and $T = (T_1, \ldots, T_m)$ follows an m-variate t-distribution with d degrees of freedom and a common correlation coefficient ρ.

Suppose that there are $m_0$ true null hypotheses and $m_1$ false null hypotheses, $m_0 + m_1 = m$. Then the power of any multiple testing procedure to reject at least r false null hypotheses is given by

$$\psi_{r,m_1,m} = \Pr\Bigl\{\text{reject } \bigcap_{i \in I_r} H_i;\ I_r \subseteq m_1,\ |I_r| \ge r\Bigr\}. \tag{2}$$

In clinical efficacy evaluation of a new therapy, it is usually assumed that treatment is superior to placebo control or non-inferior to active control on all of the endpoints, i.e., $m_1 = m$. Throughout this paper, we will assume that all the null hypotheses are false. Following the ideas in Dunnett and Tamhane (1992), the power of a single-step procedure to detect at least r false null hypotheses, denoted by $\psi^{s}_{r,m,m}$, is simply

$$\psi^{s}_{r,m,m} = \sum_{j=0}^{m-r} \binom{m}{j} P\Bigl\{ \bigcap_{k=1}^{j} (T_k < c_m) \cap \bigcap_{k=j+1}^{m} (T_k \ge c_m) \Bigr\}, \tag{3}$$

where $c_m$ is the critical value of a single-step procedure and is determined to control the FWER at a desired α level under the configuration that all the null hypotheses are true. Note that j = 0 in (3) (and in subsequent Eqs. (4) and (5)) implies that all false null hypotheses are rejected. The power of a step-up procedure to reject at least r false null hypotheses, denoted by $\psi^{u}_{r,m,m}$, is given by

$$\psi^{u}_{r,m,m} = \sum_{j=0}^{m-r} \binom{m}{j} P\Bigl\{ \bigcap_{k=1}^{j} (T_{k:m} < u_k) \cap (T_{j+1:m} \ge u_{j+1}) \Bigr\} \tag{4}$$


and the power of a step-down procedure to reject at least r false null hypotheses, denoted by $\psi^{d}_{r,m,m}$, is

$$\psi^{d}_{r,m,m} = \sum_{j=0}^{m-r} \binom{m}{j} P\Bigl\{ (T_{j:m} < d_j) \cap \bigcap_{k=j+1}^{m} (T_{k:m} \ge d_k) \Bigr\}, \tag{5}$$

where $T_{1:m} < T_{2:m} < \cdots < T_{m:m}$ are the ordered components of the $T_k$, and $u_1 < u_2 < \cdots < u_m$ and $d_1 < d_2 < \cdots < d_m$ are the critical values of step-up and step-down procedures, respectively. Note that the $u_k$ and $d_k$ may be determined according to the marginal distribution of the $T_k$ for stepwise procedures based on marginal p-values (or test statistics). For example, for Hochberg step-up and Holm step-down procedures, we have $u_k = d_k$ for k = 1, . . . , m. Throughout this paper we denote the random variables by upper-case letters, e.g., T, and their realized values by the corresponding lower-case letters, e.g., t.

3. Minimal power

Minimal power is the right choice if one is interested in detecting at least one false null hypothesis. If all null hypotheses are true, then minimal power is equivalent to FWER.

3.1. Bivariate test statistics

We present in this subsection explicit formulas of minimal power for bivariate test statistics under both exchangeability and non-exchangeability assumptions.

3.1.1. Exchangeable case

Consider a simple single-step procedure, e.g., the Bonferroni procedure. It is easy to see that the minimal power of a single-step procedure is given by

$$\psi^{s,e}_{1,2,2} = 1 - P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_2) \Bigr\}. \tag{6}$$
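For exchangeable statistics, each summand of (3) is the probability that exactly j statistics fall below the critical value, so (3) is the probability that at least r statistics reach it. The sketch below evaluates that probability numerically under assumptions that are not part of the paper: standard normal statistics with a common correlation 0 ≤ ρ < 1, written in the one-factor form T_k = √ρ Z_0 + √(1 − ρ) Z_k, and a critical value c from which the standardized treatment effect has already been subtracted. The names `simpson` and `single_step_power` are hypothetical helpers introduced only for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def simpson(f, lo=-8.0, hi=8.0, steps=2000):
    """Composite Simpson's rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def single_step_power(r, m, c, rho):
    """P(at least r of m statistics are >= c) for equicorrelated standard normals.

    Assumption (not from the paper): T_k = sqrt(rho)*Z0 + sqrt(1-rho)*Z_k with
    0 <= rho < 1, so the T_k are independent given Z0 and the number of
    rejections is binomial conditionally on Z0.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)

    def integrand(z0):
        p = 1.0 - STD_NORMAL.cdf((c - a * z0) / b)  # P(T_k >= c | Z0 = z0)
        tail = sum(math.comb(m, i) * p**i * (1.0 - p) ** (m - i)
                   for i in range(r, m + 1))
        return STD_NORMAL.pdf(z0) * tail

    return simpson(integrand)


# Illustrative call: m = 2, r = 1 corresponds to the bivariate minimal power (6).
minimal_power = single_step_power(r=1, m=2, c=0.5, rho=0.3)
```

Setting r = m gives the probability that all statistics reach c, and r = 1 the probability that at least one does; the one-factor form is chosen here only because it reduces the multivariate integral to a one-dimensional one.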

Under the assumption of exchangeability of $T_1$ and $T_2$, i.e., P is symmetric in its arguments (critical values), the minimal power of a step-up procedure to reject at least one false null hypothesis for exchangeable bivariate test statistics is

$$\psi^{u,e}_{1,2,2} = 1 + 2P(T_1 < u_1) + P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_1) \Bigr\} - 4P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_k) \Bigr\}. \tag{7}$$

Similarly, the minimal power of a step-down procedure to reject at least one false null hypothesis for exchangeable bivariate test statistics is given by

$$\psi^{d,e}_{1,2,2} = 1 + 2P(T_1 < d_1) - P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_2) \Bigr\} - 2P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_k) \Bigr\}. \tag{8}$$

3.1.2. Non-exchangeable case

Now consider that the bivariate test statistics are non-exchangeable, i.e., P is asymmetric in its arguments or critical values. Examples of non-exchangeable bivariate test statistics include cases with unequal standardized variates, which may occur due to unequal variances or unequal pre-defined effect threshold values. Because of the non-exchangeability, each test statistic should have its own unique critical value at each step. Formulas in this subsection, as well as in subsequent subsections for non-exchangeable bivariate and trivariate test statistics, are derived from the corresponding exchangeable cases using the exhaustive combinatorial enumeration principle (Goulden et al., 1983) for each of the probability terms in the power expressions for exchangeable cases.

For the single-step Bonferroni procedure, the minimal power for non-exchangeable test statistics is simply

$$\psi^{s,n}_{1,2,2} = 1 - P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_{k2}) \Bigr\}, \tag{9}$$

where $c_{k2}$ is the critical value for the k-th test statistic and is determined analogously to $c_2$ in (3). Let $u_{k1} < u_{k2}$ be critical values of a step-up procedure for the k-th test statistic at steps 1 and 2, respectively. Then, according to (7), the minimal power of a step-up procedure for non-exchangeable bivariate test statistics can be written as

$$\psi^{u,n}_{1,2,2} = 1 + \sum_{k=1}^{2} P\{T_k < u_{k1}\} + P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_{k1}) \Bigr\} - 2\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j2})\}, \tag{10}$$

where $\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j2})\}$ is the summation of probabilities of the two combinations for i, j = 1, 2 and i ≠ j, i.e.,

$$\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j2})\} = P\{(T_1 < u_{11}) \cap (T_2 < u_{22})\} + P\{(T_1 < u_{12}) \cap (T_2 < u_{21})\}.$$


Similarly, according to (8), the minimal power of a step-down procedure for non-exchangeable bivariate test statistics is given by

$$\psi^{d,n}_{1,2,2} = 1 + \sum_{k=1}^{2} P\{T_k < d_{k1}\} - P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_{k2}) \Bigr\} - \sum_{i \ne j} P\{(T_i < d_{i1}) \cap (T_j < d_{j2})\}, \tag{11}$$

where $d_{k1} < d_{k2}$ are critical values of a step-down procedure and are defined analogously to $u_{k1} < u_{k2}$ in (10).



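3.2. Trivariate test statistics

3.2.1. Exchangeable case

Again consider a simple single-step Bonferroni procedure. The minimal power of this procedure for exchangeable trivariate test statistics is simply

$$\psi^{s,e}_{1,3,3} = 1 - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_3) \Bigr\}. \tag{12}$$

From (4), it can be shown that the minimal power of a step-up procedure for exchangeable trivariate test statistics is given by

$$\begin{aligned}
\psi^{u,e}_{1,3,3} = {} & 1 + 6P\{T_1 < u_1\} - 6P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_1) \Bigr\} - 18P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_k) \Bigr\} \\
& + 9P\Bigl\{ (T_1 < u_1) \cap \bigcap_{k=2}^{3} (T_k < u_2) \Bigr\} + 9P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_1) \cap (T_3 < u_3) \Bigr\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_1) \Bigr\};
\end{aligned} \tag{13}$$

see Appendix A.1 for a proof. Similarly, from (5) it can be shown that the minimal power of a step-down procedure for exchangeable trivariate test statistics is

$$\begin{aligned}
\psi^{d,e}_{1,3,3} = {} & 1 + 6P\{T_1 < d_1\} - 9P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_1) \Bigr\} + 6P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_k) \Bigr\} - 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_2) \Bigr\} \\
& + 9P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_1) \cap (T_3 < d_3) \Bigr\} - 6P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_k) \Bigr\} - 6P\Bigl\{ (T_1 < d_1) \cap \bigcap_{k=2}^{3} (T_k < d_3) \Bigr\} \\
& + 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_2) \cap (T_3 < d_3) \Bigr\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_3) \Bigr\};
\end{aligned} \tag{14}$$

see Appendix A.2 for a proof.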



3.2.2. Non-exchangeable case

Trivariate test statistics may be non-exchangeable if the standardized variates or their partial correlation coefficients are not equal. From (12) one can simply write the minimal power of a single-step procedure for non-exchangeable variates as

$$\psi^{s,n}_{1,3,3} = 1 - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_{k3}) \Bigr\}, \tag{15}$$

where $c_{k3}$ is defined analogously to $c_3$ in (3). Given the power expression (13) for the exchangeable case, the minimal power of a step-up procedure for non-exchangeable trivariate test statistics can be written as

$$\begin{aligned}
\psi^{u,n}_{1,3,3} = {} & 1 + 2\sum_{k=1}^{3} P\{T_k < u_{k1}\} - 2\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j1})\} - 3\sum_{i \ne j \ne k} P\{(T_i < u_{i1}) \cap (T_j < u_{j2}) \cap (T_k < u_{k3})\} \\
& + 3\sum_{i=1}^{3} P\Bigl\{ (T_i < u_{i1}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < u_{k2}) \Bigr\} + 3\sum_{i=1}^{3} P\Bigl\{ (T_i < u_{i3}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < u_{k1}) \Bigr\} \\
& - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_{k1}) \Bigr\},
\end{aligned} \tag{16}$$

where $u_{k1} < u_{k2} < u_{k3}$ are critical values of a step-up procedure for the k-th test statistic at steps 1, 2 and 3, respectively; $\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j1})\}$ is the summation of joint probabilities for all possible combinations of i, j = 1, 2, 3 with i ≠ j, i.e.,

$$\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j1})\} = P\{(T_1 < u_{11}) \cap (T_2 < u_{21})\} + P\{(T_1 < u_{11}) \cap (T_3 < u_{31})\} + P\{(T_2 < u_{21}) \cap (T_3 < u_{31})\},$$

and $\sum_{i \ne j \ne k} P\{(T_i < u_{i1}) \cap (T_j < u_{j2}) \cap (T_k < u_{k3})\}$ is the summation of joint probabilities for all possible combinations of i, j, k = 1, 2, 3 with i ≠ j ≠ k, i.e.,

$$\begin{aligned}
\sum_{i \ne j \ne k} P\{(T_i < u_{i1}) \cap (T_j < u_{j2}) \cap (T_k < u_{k3})\} = {} & P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_{kk}) \Bigr\} + P\{(T_1 < u_{11}) \cap (T_2 < u_{23}) \cap (T_3 < u_{32})\} \\
& + P\{(T_1 < u_{12}) \cap (T_2 < u_{21}) \cap (T_3 < u_{33})\} + P\{(T_1 < u_{12}) \cap (T_2 < u_{23}) \cap (T_3 < u_{31})\} \\
& + P\{(T_1 < u_{13}) \cap (T_2 < u_{21}) \cap (T_3 < u_{32})\} + P\{(T_1 < u_{13}) \cap (T_2 < u_{22}) \cap (T_3 < u_{31})\}.
\end{aligned}$$
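The six terms of the last sum correspond to the six ways of assigning the step indices 1, 2, 3 to the statistics T_1, T_2, T_3. A minimal sketch of that enumeration follows; the function names `step_assignments` and `format_term` are hypothetical and only generate the term labels, not the probabilities themselves.

```python
from itertools import permutations


def step_assignments(m):
    """All ways to give T_1..T_m distinct step indices 1..m.

    A tuple `steps` stands for the term
    P{ (T_1 < u_{1,steps[0]}) ∩ ... ∩ (T_m < u_{m,steps[m-1]}) }.
    """
    return list(permutations(range(1, m + 1)))


def format_term(steps, symbol="u"):
    """Render one assignment in the notation used in the text."""
    parts = [f"(T{k} < {symbol}{k}{s})" for k, s in enumerate(steps, start=1)]
    return "P{" + " ∩ ".join(parts) + "}"


# Term labels for the trivariate step-up sum over i != j != k.
terms = [format_term(steps) for steps in step_assignments(3)]
```

With m = 2 the same enumeration yields the two terms listed after (10), and passing `symbol="d"` produces the step-down labels.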


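Similarly, the minimal power of a step-down procedure for non-exchangeable trivariate test statistics can be expressed as

$$\begin{aligned}
\psi^{d,n}_{1,3,3} = {} & 1 + 2\sum_{k=1}^{3} P\{T_k < d_{k1}\} - 3\sum_{i \ne j} P\{(T_i < d_{i1}) \cap (T_j < d_{j1})\} \\
& + \sum_{i < j} \bigl[ P\{(T_i < d_{i1}) \cap (T_j < d_{j2})\} + P\{(T_i < d_{i2}) \cap (T_j < d_{j1})\} \bigr] - \sum_{i \ne j} P\{(T_i < d_{i2}) \cap (T_j < d_{j2})\} \\
& + 3\sum_{i=1}^{3} P\Bigl\{ (T_i < d_{i3}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < d_{k1}) \Bigr\} - \sum_{i \ne j \ne k} P\{(T_i < d_{i1}) \cap (T_j < d_{j2}) \cap (T_k < d_{k3})\} \\
& - 2\sum_{i=1}^{3} P\Bigl\{ (T_i < d_{i1}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < d_{k3}) \Bigr\} + \sum_{i=1}^{3} P\Bigl\{ (T_i < d_{i3}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < d_{k2}) \Bigr\} \\
& - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_{k3}) \Bigr\},
\end{aligned} \tag{17}$$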

where $d_{k1} < d_{k2} < d_{k3}$ are critical values of a step-down procedure for the k-th test statistic at steps 1, 2 and 3, respectively.

4. r-power

The r-power is useful when two or more rejections are required in order to declare a positive result. We present in this section r-power formulas for the trivariate case; there is no r-power defined for the bivariate case since there are only two null hypotheses in this situation.

4.1. Exchangeable case

Consider a single-step Bonferroni procedure that rejects at least two out of three false null hypotheses. The r-power of this procedure for exchangeable test statistics can be written as

$$\psi^{s,e}_{2,3,3} = 1 - 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_3) \Bigr\} + 2P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_3) \Bigr\}. \tag{18}$$
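Expression (18) can be checked against a direct count: with three exchangeable statistics, "at least two reach c_3" means either all three do or exactly one falls below. The sketch below compares the two numerically under assumptions that are not the paper's: equicorrelated standard normal statistics with 0 ≤ ρ < 1 in one-factor form, and an already-shifted critical value. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simpson(f, lo=-8.0, hi=8.0, steps=2000):
    """Composite Simpson's rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def equicorr_cdf(limits, rho):
    """P(T_1 < a_1, ..., T_m < a_m), equicorrelated standard normals, 0 <= rho < 1."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return simpson(lambda z: STD_NORMAL.pdf(z)
                   * math.prod(STD_NORMAL.cdf((x - a * z) / b) for x in limits))


def r_power_eq18(c3, rho):
    """Right-hand side of (18): 1 - 3 P(T1<c3, T2<c3) + 2 P(T1<c3, T2<c3, T3<c3)."""
    return (1.0 - 3.0 * equicorr_cdf([c3, c3], rho)
            + 2.0 * equicorr_cdf([c3, c3, c3], rho))


def r_power_direct(c3, rho):
    """P(at least two of three statistics are >= c3), counted conditionally on the factor."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)

    def integrand(z):
        q = STD_NORMAL.cdf((c3 - a * z) / b)  # P(T_k < c3 | factor = z)
        return STD_NORMAL.pdf(z) * ((1.0 - q) ** 3 + 3.0 * q * (1.0 - q) ** 2)

    return simpson(integrand)
```

The agreement reflects the pointwise identity (1 − q)³ + 3q(1 − q)² = 1 − 3q² + 2q³, which is the inclusion-exclusion step behind (18).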

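From (4) and (5), it can be shown that the r-power of a step-up procedure for exchangeable trivariate test statistics is given by

$$\begin{aligned}
\psi^{u,e}_{2,3,3} = {} & 1 + 6P\{T_1 < u_1\} + 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_1) \Bigr\} - 18P\Bigl\{ \bigcap_{k=1}^{2} (T_k < u_k) \Bigr\} \\
& - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_1) \Bigr\} + 9P\Bigl\{ (T_1 < u_1) \cap \bigcap_{k=2}^{3} (T_k < u_2) \Bigr\}
\end{aligned} \tag{19}$$

and the r-power of a step-down procedure for exchangeable trivariate test statistics is

$$\begin{aligned}
\psi^{d,e}_{2,3,3} = {} & 1 + 6P\{T_1 < d_1\} - 12P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_k) \Bigr\} - 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_2) \Bigr\} \\
& + 12P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_k) \Bigr\} - 6P\Bigl\{ (T_1 < d_1) \cap \bigcap_{k=2}^{3} (T_k < d_3) \Bigr\} \\
& + 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < d_2) \cap (T_3 < d_3) \Bigr\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_3) \Bigr\}.
\end{aligned} \tag{20}$$

Again, proofs of (19) and (20) follow the same lines as the proofs of (13) and (14) because $\psi^{u,e}_{2,3,3}$ (or $\psi^{d,e}_{2,3,3}$) is a component of $\psi^{u,e}_{1,3,3}$ (or $\psi^{d,e}_{1,3,3}$).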

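4.2. Non-exchangeable case

From (18), it is straightforward to see that the r-power of a single-step procedure for non-exchangeable trivariate test statistics is

$$\psi^{s,n}_{2,3,3} = 1 - \sum_{i \ne j} P\{(T_i < c_{i3}) \cap (T_j < c_{j3})\} + 2P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_{k3}) \Bigr\}, \tag{21}$$

where $c_{k3}$ is a critical value for the k-th test statistic as defined in (3). Analogously, from (19) the r-power of a step-up procedure for non-exchangeable trivariate test statistics is given by

$$\begin{aligned}
\psi^{u,n}_{2,3,3} = {} & 1 + 2\sum_{k=1}^{3} P\{T_k < u_{k1}\} + \sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j1})\} - 3\sum_{i \ne j} P\{(T_i < u_{i1}) \cap (T_j < u_{j2})\} \\
& - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < u_{k1}) \Bigr\} + 3\sum_{i=1}^{3} P\Bigl\{ (T_i < u_{i1}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < u_{k2}) \Bigr\},
\end{aligned} \tag{22}$$

where, as in (19), the second sum runs over the three pairs of statistics with both critical values taken at step 1, and the third sum runs over the six ordered pairs (i, j),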


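and from (20) the r-power of a step-down procedure for non-exchangeable trivariate test statistics is

$$\begin{aligned}
\psi^{d,n}_{2,3,3} = {} & 1 + 2\sum_{k=1}^{3} P\{T_k < d_{k1}\} - 2\sum_{i \ne j} P\{(T_i < d_{i1}) \cap (T_j < d_{j2})\} - \sum_{i < j} P\{(T_i < d_{i2}) \cap (T_j < d_{j2})\} \\
& + 2\sum_{i \ne j \ne k} P\{(T_i < d_{i1}) \cap (T_j < d_{j2}) \cap (T_k < d_{k3})\} - 2\sum_{i=1}^{3} P\Bigl\{ (T_i < d_{i1}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < d_{k3}) \Bigr\} \\
& + \sum_{i=1}^{3} P\Bigl\{ (T_i < d_{i3}) \cap \bigcap_{k=1, k \ne i}^{3} (T_k < d_{k2}) \Bigr\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < d_{k3}) \Bigr\},
\end{aligned} \tag{23}$$

where $u_{k1} < u_{k2} < u_{k3}$ and $d_{k1} < d_{k2} < d_{k3}$ are as defined in (16) and (17).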



5. Complete power

Complete power seems very attractive in the sense that one would undoubtedly like to reject all false null hypotheses, in which case there is no other possibility to select the most favorable configuration. Hence, multiplicity adjustment is not necessary and each individual hypothesis is simply tested at the same significance level as the overall Type I error rate. This method, usually referred to as the intersection–union test (Berger, 1982), is known not only to be very conservative (lower power to reject all false null hypotheses), but also to inflate the Type II error rate, the probability of accepting at least one false null hypothesis. For instance, with three independent test statistics each at 0.90 individual power, the complete power could be as low as 0.90³ = 0.729. As pointed out in the European Agency for the Evaluation of Medicinal Products guideline (EMEA, 2002), this inflation of the Type II error rate, or power loss, must be taken into consideration for a proper estimation of the sample size when designing a clinical trial.

With all false null hypotheses being rejected, each at the pre-defined overall significance level, all the tests are performed in a single step. Hence, the complete power for exchangeable bivariate test statistics is given by

$$\psi^{s,e}_{2,2,2} = 1 - 2P\{T_1 < c_1\} + P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_1) \Bigr\} \tag{24}$$

and for trivariate exchangeable test statistics is given by

$$\psi^{s,e}_{3,3,3} = 1 - 3P\{T_1 < c_1\} + 3P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_1) \Bigr\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_1) \Bigr\}. \tag{25}$$
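As a quick check of (25) against the independence example above, assume the three statistics are independent and each has individual power 0.90, so that $P\{T_k < c_1\} = 0.10$ and the joint probabilities factorize:

$$\psi^{s,e}_{3,3,3} = 1 - 3(0.10) + 3(0.10)^2 - (0.10)^3 = 1 - 0.3 + 0.03 - 0.001 = 0.729 = 0.90^3 .$$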

If the test statistics are non-exchangeable, then the complete powers for bivariate and trivariate cases can respectively be written as

$$\psi^{s,n}_{2,2,2} = 1 - \sum_{k=1}^{2} P\{T_k < c_{k1}\} + P\Bigl\{ \bigcap_{k=1}^{2} (T_k < c_{k1}) \Bigr\} \tag{26}$$

and

$$\psi^{s,n}_{3,3,3} = 1 - \sum_{k=1}^{3} P\{T_k < c_{k1}\} + \sum_{i \ne j} P\{(T_i < c_{i1}) \cap (T_j < c_{j1})\} - P\Bigl\{ \bigcap_{k=1}^{3} (T_k < c_{k1}) \Bigr\}. \tag{27}$$
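For the non-exchangeable bivariate case, (26) needs only two marginal probabilities and one joint probability. The sketch below evaluates it under the assumption, made here purely for illustration, that the two statistics are standard bivariate normal with correlation |ρ| < 1 and that the critical values have already been shifted by the standardized effects. The joint probability is obtained by conditioning on the first statistic; `bvn_cdf` and `complete_power_bivariate` are hypothetical names.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_cdf(a, b, rho, lower=-8.0, steps=2000):
    """P(T1 < a, T2 < b) for a standard bivariate normal with |rho| < 1.

    Integrates phi(x) * Phi((b - rho*x) / sqrt(1 - rho^2)) over x in (lower, a)
    with composite Simpson's rule; the mass below `lower` is ignored.
    """
    if abs(rho) >= 1.0:
        raise ValueError("rho must satisfy |rho| < 1")
    if a <= lower:
        return 0.0
    scale = math.sqrt(1.0 - rho * rho)

    def f(x):
        return STD_NORMAL.pdf(x) * STD_NORMAL.cdf((b - rho * x) / scale)

    h = (a - lower) / steps
    total = f(lower) + f(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    return total * h / 3.0


def complete_power_bivariate(c11, c21, rho):
    """Right-hand side of (26) with separate critical values c11 and c21."""
    return (1.0 - STD_NORMAL.cdf(c11) - STD_NORMAL.cdf(c21)
            + bvn_cdf(c11, c21, rho))
```

With equal critical values the same function reproduces the exchangeable expression (24), and at ρ = 0 it reduces to the product of the two individual powers.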

6. Power, sample size and correlation

We will show in this section the relation between power and correlation coefficient for exchangeable test statistics. The relationship for non-exchangeable test statistics can be demonstrated similarly, though with much more complexity. Sample sizes required to maintain some desired levels of power for various values of the correlation coefficient are presented for the cases with unknown common variance.

6.1. Bivariate test statistics

Suppose that $n_1 = n_2 = n$, $\sigma^2$ is known, and $(t_1, t_2)$ follow the standard bivariate normal distribution with correlation coefficient ρ. Then the minimal power of a step-up procedure (7) can be expressed as

$$\psi^{u}_{1,2,2} = 1 + 2\Phi(u_1) + \Phi(u_1, u_1) - 4\Phi(u_1, u_2), \tag{28}$$

where Φ(·) is the cumulative distribution function (cdf) of the standard normal and Φ(·, ·) is the cdf of the standard bivariate normal with correlation coefficient ρ. With an application of the reduction formula (Plackett, 1954) to (28), one has

$$h_u(\rho) = \frac{\partial \psi^{u}_{1,2,2}}{\partial \rho} = \frac{\partial^2 \psi^{u}_{1,2,2}}{\partial t_1 \partial t_2} = \phi(u_1, u_1) - 4\phi(u_1, u_2), \tag{29}$$

where φ(·, ·) is the density of the standard bivariate normal with correlation coefficient ρ. Expression (29) does not have a constant sign in ρ because it involves a quadratic form of ρ. Similar to Samuel-Cahn (1996), it can be shown that (29) is always negative in the neighborhood of ρ = 0 for appropriately chosen critical values $(u_1, u_2)$ and positive in the neighborhood of ρ = 1, i.e., (28) is decreasing for ρ ∈ [0, ρ1] and increasing for ρ ∈ [ρ1, 1], where ρ1 is one of the two roots satisfying $h_u(\rho) = 0$.

Table 1
Sample size required to achieve the desired level of indicated minimal power for Bonferroni (Bonf), Hochberg (HC) and Holm (HM) procedures under various correlations of exchangeable (t1, t2) for detecting treatment effect ∆/σ̂ = 0.2 at FWER α = 0.05.

       Power 0.80          Power 0.85          Power 0.90          Power 0.95
ρ      Bonf   HC    HM     Bonf   HC    HM     Bonf   HC    HM     Bonf   HC    HM
0      221    90    93     254    100   103    299    111   114    371    123   127
0.1    230    98    101    265    109   112    313    121   125    390    135   140
0.2    240    107   110    277    119   123    327    133   138    409    150   156
0.3    251    118   122    289    132   137    342    149   155    428    168   176
0.4    262    131   136    303    148   154    358    168   176    448    192   202
0.5    274    148   155    317    168   176    375    192   202    470    224   237
0.6    287    170   179    332    194   205    393    225   238    492    266   284
0.7    302    198   210    349    228   242    413    267   285    517    324   348
0.8    320    235   252    369    273   292    436    323   346    545    401   432
0.9    342    284   308    394    330   356    464    392   422    579    493   531

Similarly, the minimal power of a step-down procedure is given by

$$\psi^{d}_{1,2,2} = 1 + 2\Phi(d_1) - \Phi(d_2, d_2) - 2\Phi(d_1, d_2), \tag{30}$$

which is always a decreasing function of ρ, as

$$h_d(\rho) = \frac{\partial \psi^{d}_{1,2,2}}{\partial \rho} = \frac{\partial^2 \psi^{d}_{1,2,2}}{\partial t_1 \partial t_2} = -\phi(d_2, d_2) - 2\phi(d_1, d_2) \le 0$$

for all values of ρ ∈ [−1, 1] and $(d_1, d_2)$.
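The sign behaviour of $h_u$ and $h_d$ can be inspected numerically from the bivariate normal density alone. The sketch below does so for one illustrative configuration; the choice n = 100, ∆/σ = 0.2 and α = 0.05, the use of normal percentiles in place of t percentiles, and the function names are all assumptions made for this example.

```python
import math
from statistics import NormalDist


def bvn_pdf(x, y, rho):
    """Density of the standard bivariate normal with correlation rho at (x, y)."""
    one_minus = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * one_minus)
    return math.exp(-quad) / (2.0 * math.pi * math.sqrt(one_minus))


def critical_values(n, effect=0.2, alpha=0.05, m=2):
    """z_{1-alpha/k} - effect*sqrt(n/2) for k = 1..m (known-variance critical values)."""
    shift = effect * math.sqrt(n / 2.0)
    z = NormalDist()
    return [z.inv_cdf(1.0 - alpha / k) - shift for k in range(1, m + 1)]


def h_up(rho, u1, u2):
    """Expression (29): phi(u1, u1) - 4 phi(u1, u2)."""
    return bvn_pdf(u1, u1, rho) - 4.0 * bvn_pdf(u1, u2, rho)


def h_down(rho, d1, d2):
    """Derivative of (30): -phi(d2, d2) - 2 phi(d1, d2)."""
    return -bvn_pdf(d2, d2, rho) - 2.0 * bvn_pdf(d1, d2, rho)


u1, u2 = critical_values(n=100)
h_up_by_rho = {rho: h_up(rho, u1, u2) for rho in (0.0, 0.5, 0.9, 0.99)}
h_down_by_rho = {rho: h_down(rho, u1, u2) for rho in (0.0, 0.5, 0.9, 0.99)}
```

For this configuration `h_up` is negative at ρ = 0 and positive at ρ = 0.99, while `h_down` stays negative, in line with the statements above; other configurations would need to be checked separately.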

Under the same distributional setup, the complete power for exchangeable bivariate test statistics (24) can be written as

$$\psi^{u,e}_{2,2,2} = 1 - 2\Phi(u_1) + \Phi(u_1, u_1). \tag{31}$$

With the same arguments as above for the minimal power, the complete power (31) clearly is an increasing function of ρ, i.e., given a sample size the complete power increases with the correlation coefficient, as $\partial^2 \psi^{u,e}_{2,2,2} / (\partial t_1 \partial t_2)$ is always positive for all ρ ∈ [0, 1].

When $\sigma^2$ is unknown, as it usually is, $(t_1, t_2)$ follow a bivariate t-distribution. We use Hochberg step-up and Holm step-down procedures to illustrate the direct calculation of the powers. The critical values $u_k$ and $d_k$ are given by

$$u_k = d_k = t_{d,1-\alpha/k} - (\Delta/\hat{\sigma})\sqrt{n/2}, \tag{32}$$

where $t_{d,1-\alpha/k}$ is the 100(1 − α/k)-th percentile of the univariate t-distribution with d degrees of freedom. We also calculate the sample sizes for the single-step Bonferroni procedure, for which $c_2 = u_2 = d_2$. Sample sizes that are required to achieve a desired level of minimal power for bivariate variables with FWER being controlled at α = 0.05 by Bonferroni, Hochberg and Holm procedures for detecting treatment effect ∆/σ̂ = 0.2 are presented in Table 1. It can be seen that the sample size required to attain a desired minimal power increases with the correlation coefficient ρ for all three procedures, and hence that a study designed under the assumption of independence among endpoints may be underpowered when the endpoints actually show some positive dependence. It is observed from Table 1 that the difference in sample sizes between the Bonferroni procedure and the stepwise procedures becomes less substantial as the correlation coefficient increases and that the superiority of the Hochberg procedure over the other two is obvious.

6.2. Trivariate test statistics

We assume that $n_1 = n_2 = n_3 = n$ and a known common $\sigma^2$. Suppose that $(t_1, t_2, t_3)$ are exchangeable; then $(t_1, t_2, t_3)$ follow the standard trivariate normal distribution with common correlation coefficient ρ. The minimal power of a single-step procedure (12) can be written as

$$\psi^{s,e}_{1,3,3} = 1 - \Phi(c_3, c_3, c_3), \tag{33}$$

which is clearly a decreasing function of ρ, since, e.g., see Kotz et al. (2000, p. 293),

$$\frac{\partial \psi^{s,e}_{1,3,3}}{\partial \rho_{ij}} = -\Phi(c_3)\, \frac{\partial \Phi(c_3, c_3 \mid t_3 < c_3)}{\partial \rho_{ij}} = -\Phi(c_3)\, \phi(c_3, c_3; \rho \mid t_3 < c_3) \le 0$$

for all values of $c_3$ and ρ ∈ [0, 1]. With similar arguments, it can be shown that the r-power of a single-step procedure is a decreasing function and the complete power is an increasing function of the correlation coefficient ρ because

$$\frac{\partial \psi^{s,e}_{2,3,3}}{\partial \rho_{ij}} = -3\phi(c_3, c_3) + 2\Phi(c_3)\, \phi(c_3, c_3 \mid t_3 < c_3) \le 0$$

and

$$\frac{\partial \psi^{s,e}_{3,3,3}}{\partial \rho_{ij}} = 3\phi(c_1, c_1) - \Phi(c_1)\, \phi(c_1, c_1 \mid t_1 < c_1) \ge 0.$$

Fig. 1. Minimal power and r-power as functions of correlation coefficient ρ and sample size n for standard trivariate normal distribution with FWER α = 0.05 and treatment effect ∆/σ = 0.2. Panels: (a) minimal power of Hochberg procedure; (b) minimal power of Holm procedure; (c) r-power of Hochberg procedure; (d) r-power of Holm procedure.

However, it is difficult, if not impossible, to theoretically show how the minimal powers (13) and (14) and r-powers (19) and (20) of step-up and step-down procedures vary with the correlation coefficient ρ ∈ [0, 1]. Here we graphically illustrate the relationship of these powers with the correlation coefficient ρ by directly calculating the powers using the Hochberg step-up and Holm step-down procedures. Given the known variance $\sigma^2$, equal sample sizes $n_1 = n_2 = n_3 = n$ and correlation coefficient ρ, the critical values of these two procedures under the alternative hypotheses are the same as (32) with the t percentile being replaced by $z_{1-\alpha/k}$, the 100(1 − α/k)-th percentile of the standard normal distribution, for k = 1, 2, 3. Results of direct calculations show that, similar to the exchangeable bivariate case, the minimal power (13) of the Hochberg step-up procedure for exchangeable trivariate test statistics is a decreasing function of ρ ∈ [0, ρ2] and an increasing function of ρ ∈ [ρ2, 1], where ρ2 is the value at which (13) reaches its minimum, and the minimal power (14) for the Holm step-down procedure is a monotonically decreasing function of ρ ∈ [0, 1] (Fig. 1). The impact of the correlation coefficient on the r-powers is comparable to that on the minimal powers of Hochberg step-up and Holm step-down procedures (Fig. 1). As one can see, ρ2 is very close to 1 for the given settings, so the non-monotonicity may be immaterial in some practical scenarios. However, the influence of ρ on power and sample size for non-exchangeable cases could be non-negligible, as will be seen in Section 7.

Similar to the above subsection, when $\sigma^2$ is unknown, $(t_1, t_2, t_3)$ follow a trivariate t-distribution. For Hochberg step-up and Holm step-down procedures, the critical values $u_k$ and $d_k$ will be the same as in (32) for k = 1, 2, 3. Sample sizes that are required to achieve a desired level of minimal power and r-power for exchangeable trivariate test statistics with FWER being controlled at α = 0.05 for detecting treatment effect ∆/σ̂ = 0.2 are presented in Table 2. It is not surprising that both Hochberg and Holm procedures require smaller sample sizes to achieve the same level of minimal power in the trivariate case than in the bivariate case, while the Bonferroni procedure requires a larger sample size for the bivariate case than for the trivariate case if the correlation coefficient is small, and a smaller sample size for the bivariate case than for the trivariate case if the correlation coefficient is large (Tables 1 and 2). In addition to the trend of increasing sample size with correlation coefficient to maintain a desired minimal power for both bivariate and trivariate cases, the rate of sample size increase per unit increase in correlation coefficient is higher for the trivariate case than for the bivariate case (Tables 1 and 2). As expected, the sample size needed to achieve a desired r-power is larger than that needed to achieve the same level of minimal power; the difference is especially substantial if the correlation coefficient is small (Table 2).

Table 2
Sample size required to achieve the desired level of indicated powers for Bonferroni (Bonf), Hochberg (HC) and Holm (HM) procedures under various correlations of (t1, t2, t3) for detecting treatment effect ∆/σ̂ = 0.2 at FWER α = 0.05.

       Power 0.80          Power 0.85          Power 0.90          Power 0.95
ρ      Bonf   HC    HM     Bonf   HC    HM     Bonf   HC    HM     Bonf   HC    HM

Minimal power
0      185    38    39     211    41    42     247    45    46     305    48    49
0.1    198    40    42     227    44    45     267    48    49     330    52    54
0.2    212    44    46     244    48    50     287    53    55     357    57    59
0.3    226    49    51     261    54    56     308    59    62     384    65    67
0.4    242    56    59     280    62    65     330    68    72     412    75    78
0.5    260    67    71     300    74    78     354    81    86     443    89    95
0.6    279    83    89     322    91    99     380    101   110    475    112   122
0.7    301    108   120    347    121   134    409    135   151    511    151   170
0.8    327    156   180    376    176   205    443    201   236    552    231   277
0.9    361    251   299    414    289   346    486    340   409    603    421   512

r-power
0      363    143   149    399    150   157    447    159   166    523    167   175
0.1    372    146   154    411    155   163    462    165   173    544    175   184
0.2    381    151   160    422    162   171    477    173   183    565    185   196
0.3    390    159   170    433    172   183    492    185   197    585    199   213
0.4    398    170   182    444    184   198    506    201   216    605    219   236
0.5    406    184   200    454    202   220    519    222   242    624    246   269
0.6    413    204   225    464    226   250    532    253   280    642    286   318
0.7    421    232   259    474    260   292    545    296   333    660    345   392
0.8    428    269   306    484    307   348    558    356   404    678    430   493
0.9    435    316   364    493    363   417    570    427   487    696    531   602

6.3. Comparisons of minimal power, r-power and complete power

It is well known that the powers of rejecting all false null hypotheses, at least r > 1 false null hypotheses, and at least one false null hypothesis are in an increasing order. That is, given a sample size, the correlation structure, detectable treatment effect, and the Type I error rate, minimal power is always greater than r-power, which in turn is greater than complete power. The question is how much power loss may be incurred due to the rejection of all or at least r false null hypotheses, as compared with the power for at least one rejection. We employ the derived formulas for further examination in this regard and present the results below. Throughout the investigation we assume exchangeable test statistics with a sample size n = 100, treatment effect on each endpoint ∆/σ = 0.2 and FWER α = 0.05.
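The sample size calculations of this section amount to evaluating a power expression over a series of sample sizes and taking the smallest n that reaches the target. The sketch below does this for the minimal power of the single-step Bonferroni procedure only, under assumptions that differ from the tables: normal percentiles replace the t percentiles of (32), and the statistics are equicorrelated standard normals with 0 ≤ ρ < 1. Its results therefore need not coincide with the tabulated Bonferroni entries. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simpson(f, lo=-8.0, hi=8.0, steps=2000):
    """Composite Simpson's rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def equicorr_cdf(limits, rho):
    """P(T_1 < a_1, ..., T_m < a_m), equicorrelated standard normals, 0 <= rho < 1."""
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    return simpson(lambda z: STD_NORMAL.pdf(z)
                   * math.prod(STD_NORMAL.cdf((x - a * z) / b) for x in limits))


def bonferroni_minimal_power(n, m, rho, effect=0.2, alpha=0.05):
    """1 - P(all m statistics < c_m), with c_m = z_{1-alpha/m} - effect*sqrt(n/2)."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / m) - effect * math.sqrt(n / 2.0)
    return 1.0 - equicorr_cdf([c] * m, rho)


def smallest_n(target, m, rho, n_max=5000, **kwargs):
    """Smallest per-group n whose Bonferroni minimal power reaches `target`."""
    for n in range(2, n_max + 1):
        if bonferroni_minimal_power(n, m, rho, **kwargs) >= target:
            return n
    raise ValueError("target power not reached for n <= n_max")
```

A call such as `smallest_n(0.80, m=2, rho=0.0)` scans n upward from 2; a bisection search would be faster, but the linear scan keeps the logic of "compute the power for a series of sample sizes" explicit.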
Undoubtedly, for the bivariate case the complete power is much lower than the minimal power of either the single-step Bonferroni procedure or the stepwise Hochberg and Holm procedures, especially when the correlation coefficient is small; as correlation increases, the complete power approaches the minimal power of the Bonferroni procedure, but remains lower than those of the stepwise procedures (Table 3). For the trivariate case, the highest power is achieved by the minimal power, followed by the r-power, of the stepwise procedures, while complete power is the lowest, particularly for low correlation, although it is higher than the minimal power of the Bonferroni procedure with high correlation. It is understandable that the r-power of the Bonferroni procedure increases with the correlation coefficient, as does the complete power, whereas the r-power of the stepwise procedures moves in the opposite direction (Table 3).

7. An example

Two primary endpoints are commonly used in the assessment of treatment effect for Alzheimer's disease: the Alzheimer's Disease Assessment Scale-Cognitive Subscale (ADAS-Cog) and the Clinical Global Impression of Change (CGIC). The former measures cognitive performance, while the latter appraises the global function of individual patients (Rogers et al., 1998). The ADAS-Cog score is typically measured as the change from baseline or the annualized rate of change; CGIC is a


Table 3. Power comparisons of minimal power (Bonferroni: mBonf, Hochberg: mHoch, Holm: mHolm), r-power (Bonferroni: rBonf, Hochberg: rHoch, Holm: rHolm) and complete power (Complete) for exchangeable bivariate and trivariate test statistics with a sample size n = 100, treatment effect ∆/σ = 0.2 and α = 0.05.

Bivariate case

| ρ | mBonf | mHoch | mHolm | rBonf | rHoch | rHolm | Complete |
|---|---|---|---|---|---|---|---|
| 0 | 0.49 | 0.85 | 0.84 | –ᵃ | – | – | 0.16 |
| 0.1 | 0.48 | 0.81 | 0.80 | – | – | – | 0.18 |
| 0.2 | 0.47 | 0.77 | 0.76 | – | – | – | 0.20 |
| 0.3 | 0.46 | 0.73 | 0.72 | – | – | – | 0.21 |
| 0.4 | 0.44 | 0.69 | 0.68 | – | – | – | 0.23 |
| 0.5 | 0.43 | 0.65 | 0.63 | – | – | – | 0.24 |
| 0.6 | 0.41 | 0.60 | 0.58 | – | – | – | 0.26 |
| 0.7 | 0.40 | 0.55 | 0.53 | – | – | – | 0.28 |
| 0.8 | 0.38 | 0.49 | 0.47 | – | – | – | 0.31 |
| 0.9 | 0.35 | 0.43 | 0.40 | – | – | – | 0.34 |

Trivariate case

| ρ | mBonf | mHoch | mHolm | rBonf | rHoch | rHolm | Complete |
|---|---|---|---|---|---|---|---|
| 0 | 0.50 | 1.00 | 1.00 | 0.14 | 0.52 | 0.49 | 0.07 |
| 0.1 | 0.48 | 1.00 | 1.00 | 0.15 | 0.52 | 0.49 | 0.09 |
| 0.2 | 0.47 | 1.00 | 1.00 | 0.17 | 0.52 | 0.49 | 0.11 |
| 0.3 | 0.46 | 1.00 | 1.00 | 0.18 | 0.52 | 0.48 | 0.13 |
| 0.4 | 0.45 | 1.00 | 1.00 | 0.19 | 0.51 | 0.47 | 0.15 |
| 0.5 | 0.43 | 1.00 | 0.98 | 0.20 | 0.49 | 0.45 | 0.17 |
| 0.6 | 0.42 | 0.90 | 0.86 | 0.21 | 0.47 | 0.43 | 0.20 |
| 0.7 | 0.40 | 0.77 | 0.72 | 0.21 | 0.44 | 0.40 | 0.23 |
| 0.8 | 0.38 | 0.62 | 0.57 | 0.22 | 0.41 | 0.36 | 0.26 |
| 0.9 | 0.35 | 0.46 | 0.40 | 0.23 | 0.37 | 0.31 | 0.30 |

ᵃ Not applicable.

seven-point rating scale (from "Very much worse", CGIC = 7, to "Very much improved", CGIC = 1), with the mean rating at the final visit being the most useful clinical measurement. It is recognized that ADAS-Cog and CGIC may not follow a bivariate normal distribution; however, for illustrative purposes, we assume that both measures are asymptotically normally distributed. Let T1 and T2 denote, respectively, the negative of the ADAS-Cog change from baseline and the negative of CGIC at the final visit. The distribution of (T1, T2) depends on the severity of Alzheimer's disease, the length of the study period, and other characteristics of the study population. For patients aged 50 years or older with mild to moderate uncomplicated Alzheimer's disease, the estimated mean change of ADAS-Cog from baseline in a 24-week trial is approximately μ̂c1 = 1.82 with an estimated standard deviation of σ̂1 = 6.06, and the mean rating of CGIC at the final visit in the same study period is estimated to be μ̂c2 = 4.51 with a standard deviation of σ̂2 = 0.99 (Rogers et al., 1998). It is also known that the correlation coefficient between ADAS-Cog and CGIC is approximately ρ̂ = 0.5. Based on the above information, Xiong et al. (2005) provided sample size estimates to achieve a complete power of at least 80% with the FWER being controlled at α = 0.05 for various combinations of treatment effects ∆1 = µt1 − µc1 on ADAS-Cog and ∆2 = µt2 − µc2 on CGIC under the alternative hypotheses. Note that when ∆1 ≠ ∆2, T1 and T2 are non-exchangeable. We use formulas (7), (8), and (26) to calculate the sample sizes for maintaining a desired level of minimal power and complete power for various combinations of treatment effects ∆1 and ∆2, with FWER controlled at α = 0.05 and

$$
\begin{aligned}
c_{11} &= t_{d,\alpha} - (\Delta_1/\hat{\sigma}_1)\sqrt{n/2}, & c_{12} &= t_{d,\alpha/2} - (\Delta_1/\hat{\sigma}_1)\sqrt{n/2},\\
c_{21} &= t_{d,\alpha} - (\Delta_2/\hat{\sigma}_2)\sqrt{n/2}, & c_{22} &= t_{d,\alpha/2} - (\Delta_2/\hat{\sigma}_2)\sqrt{n/2}.
\end{aligned}
$$
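As a rough illustration of how the shifted critical constants enter a power and sample size calculation, the following sketch computes a complete power of the form P(T1 ≥ c11, T2 ≥ c21) for correlated bivariate statistics and searches for the smallest per-group n that reaches a target. Several things here are assumptions rather than the paper's method: t_{d,α} is taken to be the upper-α quantile and is approximated by the standard normal quantile, the bivariate normal probability is obtained by one-dimensional Simpson integration, and all function names are hypothetical. It is not claimed to reproduce the values in Xiong et al. (2005) or in Table 4.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()


def shifted_cutoffs(delta, sigma, n, alpha=0.05):
    """(c_k1, c_k2) for one endpoint; z quantiles stand in for t_{d,alpha}."""
    shift = (delta / sigma) * sqrt(n / 2)
    return STD.inv_cdf(1 - alpha) - shift, STD.inv_cdf(1 - alpha / 2) - shift


def upper_orthant(a, b, rho, steps=2000):
    """P(Z1 >= a, Z2 >= b) for a standard bivariate normal with |rho| < 1."""
    lo, hi = max(a, -9.0), 9.0          # mass beyond +/-9 is negligible
    if lo >= hi:
        return 0.0
    h = (hi - lo) / steps               # Simpson's rule needs an even step count
    scale = sqrt(1 - rho * rho)

    def f(z):
        # density of Z1 at z times P(Z2 >= b | Z1 = z)
        density = exp(-0.5 * z * z) / sqrt(2 * pi)
        return density * (1 - STD.cdf((b - rho * z) / scale))

    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def complete_power(delta1, sigma1, delta2, sigma2, n, rho, alpha=0.05):
    """Probability that both endpoints exceed their level-alpha cut-offs."""
    c11, _ = shifted_cutoffs(delta1, sigma1, n, alpha)
    c21, _ = shifted_cutoffs(delta2, sigma2, n, alpha)
    return upper_orthant(c11, c21, rho)


def smallest_n(target, **kwargs):
    """Smallest per-group n whose complete power reaches the target."""
    n = 2
    while complete_power(n=n, **kwargs) < target:
        n += 1
    return n


example = dict(delta1=2.0, sigma1=6.06, delta2=0.4, sigma2=0.99, rho=0.5)
n_needed = smallest_n(0.80, **example)
print(n_needed, round(complete_power(n=n_needed, **example), 3))
```

The linear search is deliberately simple; since this complete power is increasing in n, a bisection search would work equally well for larger problems.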

Our results for achieving at least 80% complete power are very close to those of Xiong et al. (2005, Table 5), who employed a simulation method in their computation. Sample sizes necessary to reach 90% minimal power are reported in Table 4. Interestingly, given ∆1 (or ∆2) and the minimal power, the relation between sample size and ∆2 (or ∆1) is not monotonic, i.e., the sample size does not monotonically decrease as ∆2 (or ∆1) increases. For example, given ∆2 = 0.4, a sample of 48 subjects per group is needed for the Hochberg procedure to achieve 90% minimal power if ∆1 = 0.5; however, a sample of 55 subjects per group is required if ∆1 = 2, given ∆2 = 0.4 and minimal power $\psi^{u,n}_{1,2,2}$ = 90%. This counter-intuitive non-monotonic relationship between sample size and treatment effects exists only for minimal power and r-power, for both exchangeable and non-exchangeable test statistics with non-zero correlation. Based on the principle of ordinary differential equations, one can check from (10) and (11) that, given non-zero correlation and one treatment effect, say ∆2, the other treatment effect ∆1 is a nonlinear function of sample size n. Fig. 2 displays the estimated sample size to achieve at least 80% minimal power for the Hochberg step-up procedure to detect one treatment effect ∆1 given another treatment effect ∆2 = 1, 1.5 and 2, for correlation coefficient ρ = 0, 0.5, 0.8 and 0.9 and σ1 = σ2 = 3. Clearly, the estimated sample size decreases as ∆1 increases when ρ = 0; however, when ρ > 0 the required sample size


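Table 4. Sample sizes required to achieve a minimal power of 90% for detecting the indicated treatment effects using Hochberg and Holm procedures for the Alzheimer's disease example (ρ = 0.5 and α = 0.05).

| Method | ∆1 | ∆2 = 0.2 | ∆2 = 0.4 | ∆2 = 0.6 | ∆2 = 0.8 | ∆2 = 1.0 | ∆2 = 1.2 | ∆2 = 1.4 |
|---|---|---|---|---|---|---|---|---|
| Hochberg | 0.5 | 197 | 48 | 22 | 13 | 9 | 7 | 6 |
| | 1 | 213 | 50 | 22 | 13 | 9 | 7 | 6 |
| | 1.5 | 141 | 54 | 23 | 14 | 9 | 7 | 6 |
| | 2 | 79 | 55 | 24 | 14 | 9 | 7 | 6 |
| | 2.5 | 49 | 48 | 25 | 14 | 10 | 7 | 6 |
| | 3 | 34 | 37 | 25 | 15 | 10 | 7 | 6 |
| | 3.5 | 25 | 27 | 24 | 15 | 10 | 7 | 6 |
| | 4 | 19 | 21 | 21 | 15 | 10 | 7 | 6 |
| | 4.5 | 15 | 17 | 17 | 15 | 10 | 8 | 6 |
| | 5 | 13 | 14 | 14 | 13 | 10 | 8 | 6 |
| Holm | 0.5 | 199 | 48 | 22 | 13 | 9 | 7 | 6 |
| | 1 | 222 | 51 | 23 | 13 | 9 | 7 | 6 |
| | 1.5 | 147 | 55 | 24 | 14 | 9 | 7 | 6 |
| | 2 | 81 | 57 | 25 | 14 | 9 | 7 | 6 |
| | 2.5 | 50 | 50 | 26 | 15 | 10 | 7 | 6 |
| | 3 | 34 | 38 | 26 | 15 | 10 | 7 | 6 |
| | 3.5 | 25 | 28 | 25 | 16 | 10 | 7 | 6 |
| | 4 | 19 | 22 | 22 | 16 | 11 | 8 | 6 |
| | 4.5 | 16 | 17 | 18 | 15 | 11 | 8 | 6 |
| | 5 | 13 | 14 | 15 | 14 | 11 | 8 | 6 |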

Fig. 2. Sample size as a function of treatment effects ∆1 and ∆2 given correlation coefficient ρ = 0, 0.5, 0.8 and 0.9 for the Hochberg step-up procedure to achieve a minimal power of 80%. Panels: (a) ρ = 0; (b) ρ = 0.5; (c) ρ = 0.8; (d) ρ = 0.9.

increases and then decreases as ∆1 increases, especially if ρ is large, regardless of ∆2 . When ρ approaches 1, sample size reaches the maximum when ∆1 and ∆2 are approximately equal (Fig. 2).
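For readers who wish to cross-check formula-based power values by simulation, the two stepwise decision rules used in this example can be sketched compactly. The sketch is illustrative only: the function names are hypothetical, and it works on one-sided p-values rather than on the test statistics and critical constants used in the formulas above.

```python
def holm_rejections(pvalues, alpha=0.05):
    """Holm step-down: test the smallest p-value first, stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])      # smallest first
    rejected = [False] * m
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (m - step):
            rejected[i] = True
        else:
            break                                           # retain all remaining
    return rejected


def hochberg_rejections(pvalues, alpha=0.05):
    """Hochberg step-up: test the largest p-value first, reject the rest on success."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest first
    rejected = [False] * m
    for step, i in enumerate(order):
        if pvalues[i] <= alpha / (step + 1):
            for j in order[step:]:                          # this and all smaller p-values
                rejected[j] = True
            break
    return rejected


print(holm_rejections([0.03, 0.04]))      # [False, False]
print(hochberg_rejections([0.03, 0.04]))  # [True, True]
```

The pair (0.03, 0.04) shows the familiar difference between the two rules: Holm stops because 0.03 > α/2, whereas Hochberg rejects both hypotheses because the larger p-value is already below α.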


8. Concluding remarks

We have developed explicit formulas that can be employed to directly compute the minimal power and complete power for bivariate and trivariate cases, as well as r-power for the trivariate case, for stepwise multiple testing procedures. The relationships between power, sample size, and correlation coefficient are investigated, both mathematically and graphically, using the presented power expressions. Numerical results for sample sizes are provided for various levels of minimal power and r-power of the Hochberg step-up and Holm step-down procedures to detect a given treatment effect. The power formulas presented in Sections 3–5 apply to both exchangeable and non-exchangeable random variables.

Unlike the univariate case, where sample size is inversely related to treatment effect size given the power, our example for bivariate endpoints with non-zero correlation shows that, given the FWER and one fixed treatment effect size, increasing the second correlated treatment effect requires a larger sample size to maintain the same level of minimal power over certain regions. Given the complex relationship among power, correlation structure, sample size, and treatment effect in stepwise procedures, it is advised that this relationship be explored in depth in order to choose the right sample size to achieve a desired level of power for reasonably selected treatment effects and correlation. It should be pointed out that when θk deviates from ∆k, Tk follows a non-central t distribution and the power formulas presented in this paper become approximate. As one can see, the explicit formulas for powers become more involved as the number of groups to be compared increases. Non-exchangeability of the test statistics, as often seen in clinical trials, would further complicate the problem.

In contrast to simulation approaches, the formulas presented in this paper offer a straightforward way to compute power directly and to investigate it thoroughly for up to three test statistics; however, simulation methods would be recommended for problems of higher dimension. Future development would include testing of two-sided null hypotheses. It would also be useful to develop formulas for a mixture of distributions, e.g., a normal distribution for one endpoint and a binary distribution for another endpoint, which may often be seen in clinical research.

Acknowledgements

We would like to thank Professor Sanat K. Sarkar for checking one of the proofs. We are also grateful to the two anonymous referees and the Associate Editor whose comments greatly helped improve the presentation of the paper.

Appendix

A.1. Proof of (13)

According to (4), the minimal power of a step-up procedure can be written as

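$$
\begin{aligned}
\psi^{u}_{1,3,3} &= P\{T_{1:3} \ge u_1\} + 3P\{(T_{1:3} < u_1) \cap (T_{2:3} \ge u_2)\} + 3P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < u_k) \cap (T_{3:3} \ge u_3)\Big\} \\
&= 1 + 2P\{T_{1:3} < u_1\} - 3P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < u_k)\Big\}.
\end{aligned}
\tag{34}
$$

Note that

$$
P\{T_{1:3} < u_1\} = 3P\{T_1 < u_1\} - 3P\Big\{\bigcap_{k=1}^{2}(T_k < u_1)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_k < u_1)\Big\}
$$

and

$$
\begin{aligned}
P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < u_k)\Big\} ={}& 6P\Big\{\bigcap_{k=1}^{3}(T_k < u_k)\Big\} - 3P\Big\{(T_1 < u_1) \cap \bigcap_{k=2}^{3}(T_k < u_2)\Big\} \\
& - 3P\Big\{\bigcap_{k=1}^{2}(T_k < u_1) \cap (T_3 < u_3)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_k < u_1)\Big\},
\end{aligned}
$$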

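which proves (13).

A.2. Proof of (14)

According to (5), the minimal power of a step-down procedure can be written as

$$
\psi^{d}_{1,3,3} = P\Big\{\bigcap_{k=1}^{3}(T_{k:3} \ge d_k)\Big\} + 3P\Big\{(T_{1:3} < d_1) \cap \bigcap_{k=2}^{3}(T_{k:3} \ge d_k)\Big\} + 3P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k) \cap (T_{3:3} \ge d_3)\Big\}.
$$

Note that

$$
P\Big\{\bigcap_{k=1}^{3}(T_{k:3} \ge d_k)\Big\} = 1 - \sum_{k=1}^{3} P\{T_{k:3} < d_k\} + \sum_{1 \le k_1 < k_2 \le 3} P\Big\{\bigcap_{j=k_1,k_2}(T_{j:3} < d_j)\Big\} - P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < d_k)\Big\},
$$

$$
P\Big\{(T_{1:3} < d_1) \cap \bigcap_{k=2}^{3}(T_{k:3} \ge d_k)\Big\} = P\{T_{1:3} < d_1\} - P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k)\Big\} - P\Big\{\bigcap_{k=1,3}(T_{k:3} < d_k)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < d_k)\Big\},
$$

and

$$
P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k) \cap (T_{3:3} \ge d_3)\Big\} = P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k)\Big\} - P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < d_k)\Big\}.
\tag{35}
$$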


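Hence,

$$
\begin{aligned}
\psi^{d}_{1,3,3} ={}& 1 + 2P\{T_{1:3} < d_1\} - P\{T_{2:3} < d_2\} - P\{T_{3:3} < d_3\} + P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k)\Big\} \\
& - 2P\Big\{\bigcap_{k=1,3}(T_{k:3} < d_k)\Big\} + P\Big\{\bigcap_{k=2}^{3}(T_{k:3} < d_k)\Big\} - P\Big\{\bigcap_{k=1}^{3}(T_{k:3} < d_k)\Big\}.
\end{aligned}
\tag{36}
$$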
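One way to see how (36) arises is to collect terms explicitly. Using the shorthand $P_S = P\{\bigcap_{k \in S}(T_{k:3} < d_k)\}$ for $S \subseteq \{1,2,3\}$ (a notation introduced here for brevity only and not used elsewhere in the paper), substituting the three identities stated after "Note that" into the expression for $\psi^{d}_{1,3,3}$ gives

$$
\begin{aligned}
\psi^{d}_{1,3,3} &= \big(1 - P_1 - P_2 - P_3 + P_{12} + P_{13} + P_{23} - P_{123}\big) + 3\big(P_1 - P_{12} - P_{13} + P_{123}\big) + 3\big(P_{12} - P_{123}\big) \\
&= 1 + 2P_1 - P_2 - P_3 + P_{12} - 2P_{13} + P_{23} - P_{123},
\end{aligned}
$$

which is (36) written in the shorthand.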

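Further note that (see, e.g., David, 1981, p. 109)

$$
P\{T_{r:n} < d\} = \sum_{m=r}^{n} (-1)^{m-r} \binom{m-1}{r-1} \binom{n}{m} P\Big\{\bigcap_{k=1}^{m}(T_k < d)\Big\},
$$

then we have

$$
P\{T_{1:3} < d_1\} = 3P\{T_1 < d_1\} - 3P\Big\{\bigcap_{k=1}^{2}(T_k < d_1)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_k < d_1)\Big\},
\tag{37}
$$

$$
P\{T_{2:3} < d_2\} = 3P\Big\{\bigcap_{k=1}^{2}(T_k < d_2)\Big\} - 2P\Big\{\bigcap_{k=1}^{3}(T_k < d_2)\Big\}.
\tag{38}
$$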
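The order-statistic identity quoted from David (1981) can be checked numerically in the simplest setting. The sketch below assumes independent, identically distributed statistics, so that the joint probability of m of them falling below d is p^m, and compares the formula with the direct binomial probability that at least r of the n statistics fall below d. The function names are hypothetical and the independence assumption is made only for this check.

```python
from math import comb


def order_stat_cdf(r, n, joint_below):
    """P{T_{r:n} < d} from joint_below(m) = P{T_1 < d, ..., T_m < d}."""
    return sum((-1) ** (m - r) * comb(m - 1, r - 1) * comb(n, m) * joint_below(m)
               for m in range(r, n + 1))


def binomial_tail(r, n, p):
    """P{at least r of n independent events occur}, each with probability p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(r, n + 1))


p = 0.3  # P{T_k < d} for each statistic (illustrative value)
for r in (1, 2, 3):
    lhs = order_stat_cdf(r, 3, lambda m: p ** m)
    print(r, round(lhs, 6), round(binomial_tail(r, 3, p), 6))
```

For r = 1 and r = 2 with n = 3 the formula reduces to the coefficient patterns 3, −3, 1 and 3, −2 that appear in (37) and (38).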

Also, according to Eq. (2.10) in Maurer and Margolin (1976), the joint cdf of any two order statistics is given by

$$
P\{T_{r:n} < d_r,\, T_{s:n} < d_s\} = (-1)^{r+s} \sum_{b=s}^{n} \sum_{a=r}^{b} (-1)^{a+b}\, \frac{n!}{(n-b)!\,(b-a)!\,a!} \binom{a-1}{a-r} \binom{b-a-1}{b-s} P_{a,b},
$$

where

$$
P_{a,b} = P\Big\{\bigcap_{k=1}^{a}(T_k < d_r) \cap \bigcap_{k=a+1}^{b}(T_k < d_s)\Big\}.
$$

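Then, for n = 3 we have the following:

$$
P\Big\{\bigcap_{k=2}^{3}(T_{k:3} < d_k)\Big\} = 3P\Big\{\bigcap_{k=1}^{2}(T_k < d_2) \cap (T_3 < d_3)\Big\} - 2P\Big\{\bigcap_{k=1}^{3}(T_k < d_2)\Big\},
\tag{39}
$$

$$
P\Big\{\bigcap_{k=1,3}(T_{k:3} < d_k)\Big\} = 3P\Big\{(T_1 < d_1) \cap \bigcap_{k=2}^{3}(T_k < d_3)\Big\} - 3P\Big\{\bigcap_{k=1}^{2}(T_k < d_1) \cap (T_3 < d_3)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_k < d_1)\Big\}
\tag{40}
$$

and

$$
P\Big\{\bigcap_{k=1}^{2}(T_{k:3} < d_k)\Big\} = 6P\Big\{\bigcap_{k=1}^{2}(T_k < d_k)\Big\} - 3P\Big\{\bigcap_{k=1}^{2}(T_k < d_1)\Big\} - 3P\Big\{(T_1 < d_1) \cap \bigcap_{k=2}^{3}(T_k < d_2)\Big\} + P\Big\{\bigcap_{k=1}^{3}(T_k < d_1)\Big\}.
\tag{41}
$$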
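The coefficients in (39)–(41) can be enumerated mechanically from the double-sum formula for the joint cdf of two order statistics. The sketch below is a check rather than part of the proof; it assumes the generalized binomial convention, under which C(−1, 0) = 1 and C(−1, 1) = −1, for the terms with a = b. That convention is an assumption made here because it reproduces the stated expansions. Function names are hypothetical.

```python
from math import factorial


def gen_binom(x, k):
    """Generalized binomial coefficient C(x, k) for integer x (may be negative), k >= 0."""
    result = 1
    for i in range(k):
        result = result * (x - i) // (i + 1)   # division is exact at every step
    return result


def joint_cdf_coefficients(n, r, s):
    """Map (a, b) -> coefficient of P_{a,b} in P{T_{r:n} < d_r, T_{s:n} < d_s}, r < s."""
    coeffs = {}
    for b in range(s, n + 1):
        for a in range(r, b + 1):
            multinomial = factorial(n) // (
                factorial(n - b) * factorial(b - a) * factorial(a))
            c = ((-1) ** (r + s + a + b) * multinomial
                 * gen_binom(a - 1, a - r) * gen_binom(b - a - 1, b - s))
            if c:
                coeffs[(a, b)] = c
    return coeffs


print(joint_cdf_coefficients(3, 2, 3))  # {(2, 3): 3, (3, 3): -2}, matching (39)
print(joint_cdf_coefficients(3, 1, 3))  # {(1, 3): 3, (2, 3): -3, (3, 3): 1}, matching (40)
print(joint_cdf_coefficients(3, 1, 2))  # {(1, 2): 6, (2, 2): -3, (1, 3): -3, (3, 3): 1}, matching (41)
```

Each key (a, b) refers to the probability that a of the statistics fall below the smaller threshold and a further b − a fall below the larger one, as in the definition of P_{a,b}.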

Then (14) is proved by substituting (37)–(41) into (36).

References

Bang, H., Jung, S.-H., George, S., 2005. Sample size calculation for simulation-based multiple-testing procedures. Journal of Biopharmaceutical Statistics 15 (6), 957–967.
Berger, R.L., 1982. Multiparameter hypothesis testing and acceptance sampling. Technometrics 24, 295–300.
David, H.A., 1981. Order Statistics. John Wiley & Sons.
Dunnett, C.W., Horn, M., Vollandt, R., 2001. Sample size determination in step-down and step-up multiple tests for comparing treatments with a control. Journal of Statistical Planning and Inference 97 (2), 367–384.
Dunnett, C.W., Tamhane, A.C., 1992. A step-up multiple test procedure. Journal of the American Statistical Association 87, 162–170.
Dunnett, C.W., Tamhane, A.C., 1993. Power comparisons of some step-up multiple test procedures. Statistics & Probability Letters 16, 55–58.
EMEA, 2002. Points to consider on multiplicity issues in clinical trials. CPMP/EWP/908/99.
Goulden, I., Jackson, D., 1983. Combinatorial Enumeration. Wiley, New York.
Hochberg, Y., Tamhane, A.C., 1987. Multiple Comparison Procedures. John Wiley & Sons.
Hsu, J.C., 1996. Multiple Comparisons: Theory and Methods. Chapman & Hall/CRC.
Kotz, S., Balakrishnan, N., Johnson, N.L., 2000. Continuous Multivariate Distributions. Volume I: Models and Applications. John Wiley & Sons.
Maurer, W., Margolin, B.H., 1976. The multivariate inclusion-exclusion formula and order statistics from dependent variates. Annals of Statistics 4, 1190–1199.
Plackett, R.L., 1954. A reduction formula for normal multivariate integrals. Biometrika 41, 351–360.
Rogers, S.L., Farlow, M.R., Doody, R.S., Mohs, R., Friedhoff, L.T., 1998. A 24-week, double-blind, placebo-controlled trial of donepezil in patients with Alzheimer's disease. Donepezil Study Group. Neurology 50, 136–145.
Samuel-Cahn, E., 1996. Is the Simes improved Bonferroni procedure conservative? Biometrika 83, 928–933.
Senn, S., Bretz, F., 2007. Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics 6, 161–170.
Tamhane, A.C., 1996. Multiple comparisons. Handbook of Statistics 13, 587–629.
Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D., Hochberg, Y., 1999. Multiple Comparisons and Multiple Tests Using the SAS System. SAS Institute Inc., Cary, NC.
Xiong, C., Yu, K., Gao, F., Yan, Y., Zhang, Z., 2005. Power and sample size for clinical trials when efficacy is required in multiple endpoints: application to an Alzheimer's treatment trial. Clinical Trials 2, 387–393.