Controlled Clinical Trials 24 (2003) 364–377
Matched-pair noninferiority trials using rate ratio: a comparison of current methods and sample size refinement

Man-Lai Tang, Ph.D.*

Channing Laboratory, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA

Manuscript received March 25, 2002; manuscript accepted February 17, 2003
Abstract

In this article, we consider the establishment of noninferiority between a reference test and a new test with respect to the ratio of sensitivities and/or specificities. We first review two (one-sided) noninferiority tests, namely the logarithmic transformation test and the Fieller-type test, and their associated sample size formulae proposed for matched-pair designs when the null hypothesis specifies a nonunity rate ratio. Different methods for implementing these one-sided noninferiority tests are reviewed: (1) the sample-based method, (2) the constrained least-squares estimation method, and (3) the constrained maximum likelihood estimation method. We conduct a simple empirical study to evaluate the performance of the various tests/methods. In summary, statistics based on constrained maximum likelihood estimation control the actual type I error rate much better than the other statistics. Moreover, the corresponding approximate sample size formulae are asymptotically valid in the sense that the exact powers associated with them are generally close to the prespecified power level. Methods based on constrained maximum likelihood estimation are illustrated with a real example from a clinical laboratory study. © 2003 Elsevier Inc. All rights reserved.

Keywords: Constrained maximum likelihood estimation; Noninferiority; Rate ratio; Sample size determination; Sensitivity; Specificity
Introduction

The establishment of equivalence and/or noninferiority between two treatments or two test procedures occurs in various types of controlled clinical trials. Determination of whether a new test procedure is as effective as a standard procedure can sometimes be demonstrated by a one-sided equivalence hypothesis, that is, a hypothesis that the new test procedure is not inferior to the standard procedure. This so-called noninferiority trial is particularly common when the new procedure is considered to be about as effective (perhaps not more effective), but less toxic, easier to administer, or cheaper than the standard procedure. For instance, Lui and Cumberland described a randomized clinical trial in congenital hearing impairment in newborns designed to demonstrate that the pure tone screening procedure, a less expensive and easier to administer method, was not inferior to the auditory brain stem response procedure, the standard procedure, in terms of detecting hearing loss in early childhood, without losing much precision in either sensitivity or specificity [1]. Dunnett and Gent provided another example, from health care trials, evaluating whether nurse-practitioners could substitute for physicians in treating mild illness without compromising the quality of patient care [2].

Hypothesis testing and sample size determination for noninferiority testing via rate difference in matched-pair designs were studied recently [3,4]. In this case, one usually specifies the maximal difference, ∆ > 0, between the sensitivity (or specificity) of the new procedure (denoted pN) and that of the standard procedure (denoted pS) that is considered clinically acceptable. The new procedure is then judged not to be noninferior to the standard procedure if the null hypothesis (i.e., H0: pN − pS ≤ −∆) is true and to be noninferior if the alternative hypothesis (i.e., H1: pN − pS > −∆) holds.

* Corresponding author: Man-Lai Tang, Ph.D., Channing Laboratory, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115. Tel.: +1-617-525-0341; fax: +1-617-731-1541. E-mail address: [email protected]

0197-2456/03/$—see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0197-2456(03)00025-4
Although a criterion based on rate difference has been proposed as a standard criterion [5], the rate ratio is considered an alternative for evaluation of equivalence or noninferiority when the rates are close to 1 or when the underlying rate of the standard test varies substantially between studies [1,6]. In the latter case, one usually specifies the maximal ratio, δ, between pN and pS that is considered clinically acceptable. The new procedure is then judged not to be noninferior to the standard procedure if the null hypothesis (i.e., H0: pN/pS ≤ δ) is true and to be noninferior if the alternative hypothesis (i.e., H1: pN/pS > δ) holds. Lachenbruch and Lynch proposed a so-called L-statistic for assessing equivalence of two HIV screening tests via rate ratio under a matched-pair design [7]. Tang et al. presented a score-type statistic for a similar task and showed that their score-type statistic controls type I error much better than existing statistics [8]. In addition, formulas based on this score-type statistic were proposed to provide a sample size estimate that guarantees a prespecified power of a hypothesis test at a certain significance level, or controls the width of a confidence interval at a certain confidence level [9]. Using logarithmic transformation and Fieller-type statistics based on sample-based and constrained least-squares estimation of nuisance parameters, Lui and Cumberland recently proposed sample size formulae for noninferiority testing of the rate ratio using paired-sample data [1]. Nam and Blackwelder studied the performance of a Wald-type statistic and the constrained maximum likelihood Fieller-type statistic, and the corresponding sample size formulae were derived [10]. They found that the constrained maximum likelihood Fieller-type statistic possessed actual type I error rates close to the nominal level and that its associated sample size formula generally provided satisfactory sample sizes to achieve the prespecified power level.
However, an evaluation of the different combinations of statistics (i.e., logarithmic transformation and Fieller-type statistics) and nuisance parameter estimation methods (i.e., sample-based, constrained least-squares, and constrained maximum likelihood estimation) has not been reported yet. In the simple empirical study considered in the present manuscript, we observe that statistics based on both the sample-based and constrained least-squares methods can possess inflated actual type I error rates. For instance, their actual significance levels can exceed the nominal level by more than 20%. Such liberality in committing type I error may not be tolerable in practice. Moreover, the empirical powers based on their sample size formulae are shown empirically to fall well short of the prechosen power level. In this article, we first briefly review the test statistics, along with the corresponding sample size formulae, for noninferiority testing of rate ratio based on the logarithmic transformation and Fieller-type statistics. We then consider different approaches, which differ only in the estimates of the variance of the test statistic under the null hypothesis, for implementing these statistics. We find that the Fieller-type statistic based on the constrained maximum likelihood estimation method reduces to the score-type test statistic. Extension of the proposed methods to simultaneously demonstrate noninferiority for both the sensitivities and the specificities is straightforward. Comparisons among the various statistics and approximate sample size formulae are investigated via a simple empirical study. In general, statistics based on constrained maximum likelihood estimation control type I error very well, and the associated approximate sample size formulae produce accurate sample size estimates for achieving the prespecified power level.
Equivalence test procedures and sample size formulae

Notation

In this article, we assume that there is a gold standard that can be adopted to determine whether a given subject possesses a certain disease. A random sample of ng subjects is drawn from the diseased (g = d) and nondiseased (g = d̄) populations, respectively. We assume further that each of these ng sampled subjects receives a reference test and a new screening test in random order. Following the definition of Lui and Cumberland, we define "consistent" as a positive test result on a diseased subject or a negative test result on a nondiseased subject [1]. The probability and outcome structure in population g (g = d, d̄) is summarized in the following 2 × 2 table:

                                  New test
Reference test    Consistent       Inconsistent     Total
Consistent        p11|g (n11|g)    p10|g (n10|g)    p1+|g (n1+|g)
Inconsistent      p01|g (n01|g)    p00|g (n00|g)    p0+|g (n0+|g)
Total             p+1|g (n+1|g)    p+0|g (n+0|g)    1 (ng)
where 0 ≤ pij|g ≤ 1 denotes the response probability of cell (i, j), pi+|g = pi1|g + pi0|g, and p+j|g = p1j|g + p0j|g, i = 0, 1, and j = 0, 1. Let nij|g be the frequency of cell (i, j), with ni+|g = ni1|g + ni0|g and n+j|g = n1j|g + n0j|g, i = 0, 1, and j = 0, 1. Hence, the sensitivities of the reference and new tests are given by p1+|d and p+1|d, respectively. Similarly, the specificities of the reference and new tests are given by p1+|d̄ and p+1|d̄, respectively.
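In code, the sample-based margin estimates from the four cell counts can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the original article):

```python
def margin_estimates(n11, n10, n01, n00):
    """Sample-based margin estimates from a paired 2 x 2 table.

    Returns (p1+_hat, p+1_hat): the reference-test and new-test
    "consistent" rates, i.e., the estimated sensitivity (g = d)
    or specificity (g = d-bar) of each test.
    """
    n = n11 + n10 + n01 + n00
    p1plus = (n11 + n10) / n  # row margin: reference test "consistent"
    pplus1 = (n11 + n01) / n  # column margin: new test "consistent"
    return p1plus, pplus1
```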
Here, the observed (sample-based) values of pij|g, pi+|g, and p+j|g are p̂ij|g = nij|g/ng, p̂i+|g = ni+|g/ng, and p̂+j|g = n+j|g/ng for i, j = 0, 1 and g = d, d̄.

Tests and sample size formulae for noninferiority of sensitivity (or specificity)

To test for noninferiority, one first specifies the maximum acceptable level of the relative difference, ∆0g (> 0), between p1+|g and p+1|g that is considered clinically acceptable. If the new test is noninferior to the reference test under consideration, one would anticipate that (p1+|g − p+1|g)/p1+|g < ∆0g, which is equivalent to p+1|g/p1+|g > δ0g, where δ0g = 1 − ∆0g. In other words, a noninferiority trial would use a one-sided test of the null hypothesis H0g: p+1|g/p1+|g ≤ δ0g (versus H1g: p+1|g/p1+|g > δ0g) at a prespecified nominal level (say, α) and a sensitivity (or specificity) ratio δ0g. In designing a noninferiority trial, one often chooses a sample size sufficiently large to yield a power of 0.8 or greater to demonstrate noninferiority if the sensitivities (or specificities) of the two test procedures are truly identical (i.e., p+1|g/p1+|g = 1). To accomplish the aforementioned tasks, Lui and Cumberland proposed two different tests (each with two different versions) for testing the null hypothesis H0g, and their corresponding sample size formulae were also derived [1]. We review their approaches and discuss improved sample size formulae based on test statistics utilizing the constrained maximum likelihood method. Here, we first outline some background material common to all approaches presented in this article. We will later show that the various approaches differ only in the estimates of the variance of the test statistic under the null hypothesis [11].

Tests for noninferiority

To test the null hypothesis H0g, we may base our statistical inference on the quantity p̂+1|g/p̂1+|g.
Noting that the sampling distribution of such a quantity is generally skewed, we usually consider its logarithmic transformation, log(p̂+1|g/p̂1+|g), to improve the normal approximation [1,2]. In this case, the usual delta method [12,13] yields

T1g = [log(p̂+1|g/p̂1+|g) − log(p+1|g/p1+|g)] / √(v̂1g²),    (1)

where v̂1g² is the estimate of Var[log(p̂+1|g/p̂1+|g)] under the null hypothesis and is given by v̂1g² = [(1 + δ0g)p˘1+|g − 2p˘11|g]/(ng δ0g p˘1+|g²). Here, p˘1+|g and p˘11|g are any estimates of p1+|g and p11|g under the null hypothesis. On the other hand, by employing the principle of Fieller's Theorem [1], one can readily reach the following statistic for testing the null hypothesis H0g:

T2g = [p̂+1|g − δ0g p̂1+|g] / √(v̂2g²),    (2)

where v̂2g² is the null variance estimate of p̂+1|g − δ0g p̂1+|g and is given by v̂2g² = δ0g[(1 + δ0g)p˘1+|g − 2p˘11|g]/ng. For sufficiently large ng, both T1g and T2g are asymptotically normally distributed, and the null hypothesis may then be tested by referring T1g or T2g to the standard normal distribution. It is noteworthy, however, that the accuracy of the normal approximation depends heavily on the variance estimates v̂1g² and v̂2g², and hence on p˘1+|g and p˘11|g [11]. Possible choices for these estimates will be discussed shortly.
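As a sketch (in Python, with our own function and argument names), the two statistics can be computed as follows; the null estimates p˘1+|g and p˘11|g are passed in, so the same routine covers the sample-based, constrained least-squares, and constrained maximum likelihood versions of the tests:

```python
import math

def noninferiority_statistics(n11, n10, n01, n00, delta0, p1p_null, p11_null):
    """Logarithmic transformation statistic T1 (Eq. 1, evaluated at the
    null ratio delta0) and Fieller-type statistic T2 (Eq. 2).

    p1p_null and p11_null are estimates of p_{1+|g} and p_{11|g} under H0;
    the choice of these estimates distinguishes the sample-based,
    constrained least-squares, and constrained ML versions of the tests.
    """
    n = n11 + n10 + n01 + n00
    p_plus1 = (n11 + n01) / n  # new-test "consistent" rate
    p_1plus = (n11 + n10) / n  # reference-test "consistent" rate

    common = (1 + delta0) * p1p_null - 2 * p11_null
    v1 = common / (n * delta0 * p1p_null ** 2)  # null variance of the log ratio
    v2 = delta0 * common / n  # null variance of p_plus1 - delta0 * p_1plus

    t1 = (math.log(p_plus1 / p_1plus) - math.log(delta0)) / math.sqrt(v1)
    t2 = (p_plus1 - delta0 * p_1plus) / math.sqrt(v2)
    return t1, t2
```

Referring t1 or t2 to the upper tail of the standard normal distribution then gives the one-sided test.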
Sample size formulae for noninferiority

Without loss of generality, we assume that δ1g = p+1|g/p1+|g > δ0g and that we seek a power of 100(1 − β)% of rejecting the null hypothesis H0g: p+1|g/p1+|g = δ0g at a nominal α-level when the alternative hypothesis H1g: p+1|g/p1+|g = δ1g (≠ δ0g) holds. Using T1g, one can readily derive the following formula:

ng = ceil( {zα √([(1 + δ0g)p̄1+|g − 2p̄11|g]/(δ0g p̄1+|g²)) + zβ √([(1 + δ1g)p1+|g − 2p11|g]/(δ1g p1+|g²))}² / [log(δ1g) − log(δ0g)]² ).    (3)

Here, ceil(x) denotes the smallest integer ≥ x, zγ is the upper 100γth percentile of the standard normal distribution, and p̄1+|g and p̄11|g are the asymptotic limits of p˘1+|g and p˘11|g for sufficiently large ng given a true ratio δ = δ1g = p+1|g/p1+|g. Based on T2g, one can obtain the following formula:

mg = ceil( {zα √(δ0g[(1 + δ0g)p̄1+|g − 2p̄11|g]) + zβ √((δ1g + δ0g²)p1+|g − 2δ0g p11|g − (δ1g − δ0g)² p1+|g²)}² / [p1+|g(δ1g − δ0g)]² )
(4)

Note that the parameter p11|g must satisfy the constraint max{0, (1 + δ1g)p1+|g − 1} ≤ p11|g ≤ δ0g p1+|g.

Estimating p˘1+|g, p˘11|g, p̄1+|g, and p̄11|g

We reiterate that the accuracy of the normal approximation for both T1g and T2g relies heavily on their variance estimates (i.e., v̂1g² and v̂2g²), and hence on the p˘1+|g and p˘11|g appearing in the denominators [11]. Similarly, the accuracy of the sample size formulae relies on the choices of the asymptotic limits p̄1+|g and p̄11|g. In establishing noninferiority with respect to the rate ratio of sensitivity and/or specificity in matched-pair designs, there are three existing methods for estimating p˘1+|g and p˘11|g. We outline these methods and the determination of p̄1+|g and p̄11|g as follows.

Sample-based method

Under this approach, p˘1+|g and p˘11|g are taken to be the observed (sample-based) values p̂1+|g = n1+|g/ng and p̂11|g = n11|g/ng. In particular, statistics (1) and (2) then correspond to the statistics Z(g1) and Z(g3) proposed by Lui and Cumberland [1]. Moreover, it is readily seen that the asymptotic limits p̄1+|g and p̄11|g are given by the true values p1+|g and p11|g. The resulting sample size formulae based on T1g and T2g coincide with those (i.e., ng and mg) given by Lui and Cumberland. Also, the formula based on T2g reduces to the formula based on the Wald-type statistic [10].

Constrained least-squares method

Under this method, p˘1+|g and p˘11|g are replaced by the estimates of p1+|g and p11|g under the null hypothesis restriction H0g: p+1|g/p1+|g = δ0g, obtained using least-squares estimation
[1]. They can be shown to be p˘1+|g = (δ0g p̂+1|g + p̂1+|g)/(1 + δ0g²) and p˘11|g = p̂11|g. In this case, the asymptotic limits p̄1+|g and p̄11|g are obtained by replacing p̂+1|g, p̂1+|g, and p̂11|g in p˘1+|g and p˘11|g by their large-sample limits p+1|g, p1+|g, and p11|g. That is, p̄1+|g = p1+|g(1 + δ0gδ1g)/(1 + δ0g²) and p̄11|g = p11|g.

Before proceeding to the constrained maximum likelihood estimation method, we briefly review some serious drawbacks of applying the aforementioned methods in comparative binomial trials [11]. Farrington and Manning found that approximating the null variance by its value under the alternative hypothesis may, depending on the context, either underestimate or overestimate its true value, leading to an incorrect approximate sample size formula. A similar observation was made in matched-pair trials [1]. For instance, ng and mg given in Eqs. (3) and (4) (based on either the sample-based or the constrained least-squares method) were empirically shown to underestimate the sample sizes necessary for the desired power level [1]. Moreover, both T1g and T2g discussed so far may have inflated actual type I error rates; that is, the actual type I error rate can exceed the prespecified nominal level by a substantial margin. In our empirical study discussed later, we find that their actual type I error rates can exceed the nominal level by more than 20%.

Constrained maximum likelihood method

Under this method, p˘1+|g and p˘11|g are chosen to be the maximum likelihood estimates of p1+|g and p11|g under the null hypothesis restriction H0g: p+1|g/p1+|g = δ0g. Following Tang et al. [8], we readily obtain

p˘1+|g = [1 − p̂00|g − p˘10|g]/δ0g and p˘11|g = [1 − p̂00|g − (1 + δ0g)p˘10|g]/δ0g,

where p˘10|g = [√(B² − 4AC) − B]/(2A), with A = 1 + δ0g, B = p̂1+|g δ0g² − (p̂+1|g + 2p̂10|g), and C = p̂10|g(1 − δ0g)(p̂+1|g + p̂10|g) (see the Appendix for an outline of the proof).

The asymptotic limits p̄1+|g and p̄11|g are obtained by replacing p̂+1|g, p̂1+|g, and p̂10|g in p˘1+|g and p˘11|g by their large-sample limits p+1|g, p1+|g, and p10|g. That is, p̄1+|g = p̄11|g + p̄10|g and p̄11|g = [(δ1g p1+|g + p10|g) − (1 + δ0g)p̄10|g]/δ0g, with p̄10|g = [√(B1² − 4A1C1) − B1]/(2A1), where A1 = 1 + δ0g, B1 = p1+|g δ0g² − (δ1g p1+|g + 2p10|g), and C1 = p10|g(1 − δ0g)(δ1g p1+|g + p10|g). It is noteworthy that the statistic T2g based on constrained maximum likelihood estimation is identical to the score-type statistic and the constrained maximum likelihood Fieller-type statistic [8,10]. Moreover, the corresponding sample size formula is identical to the formulas discussed in Tang et al. and Nam and Blackwelder [9,10], which were derived to guarantee a prespecified power of the noninferiority hypothesis test at a certain significance level. For the purpose of demonstrating two-sided equivalence of two diagnostic test procedures, a sample size formula that controls the width of a confidence interval at a certain confidence level would be more appropriate [9]. In practice, when nij|g = 0 for some i and j, we employ the ad hoc procedure of adding 0.5 to each cell, using (nij|g + 0.5)/(ng + 2) as the estimate p̂ij|g, to avoid problems on the boundary of 0.
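A direct transcription of these closed-form constrained maximum likelihood estimates (function name ours; inputs are the observed proportions):

```python
import math

def constrained_mle(p1p_hat, pp1_hat, p10_hat, p00_hat, delta0):
    """Constrained MLEs of p_{10|g}, p_{1+|g}, p_{11|g} under the null
    restriction p_{+1|g} / p_{1+|g} = delta0, obtained as the positive
    root of a quadratic."""
    A = 1 + delta0
    B = p1p_hat * delta0 ** 2 - (pp1_hat + 2 * p10_hat)
    C = p10_hat * (1 - delta0) * (pp1_hat + p10_hat)
    p10 = (math.sqrt(B * B - 4 * A * C) - B) / (2 * A)
    p1p = (1 - p00_hat - p10) / delta0
    p11 = (1 - p00_hat - (1 + delta0) * p10) / delta0
    return p10, p1p, p11
```

With the positive-control data of the later worked example (p̂1+|d = 19/30, p̂+1|d = 26/30, p̂10|d = 2/30, p̂00|d = 2/30, δ0 = 0.9), this reproduces p˘10|d ≈ 0.2428 and p˘1+|d ≈ 0.7672.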
For the constrained estimates p˘1+|g and p˘11|g, we set p˘1+|g = 1 when p˘1+|g > 1; p˘11|g = max{0, (1 + δ0g)p˘1+|g − 1} when p˘11|g < max{0, (1 + δ0g)p˘1+|g − 1}; and p˘11|g = δ0g p˘1+|g when p˘11|g > δ0g p˘1+|g.

Simultaneously testing noninferiority for both the sensitivity and the specificity

In some applications, one may want to assess simultaneously the noninferiority of a new test to a reference test with respect to both the sensitivity and the specificity, especially when both parameters of the new test are possibly lower than those of the reference test. In this case, we establish noninferiority between the new and reference tests only when we reject both null hypotheses (i.e., H0d: p+1|d/p1+|d ≤ δ0d and H0d̄: p+1|d̄/p1+|d̄ ≤ δ0d̄), that is, when Tjd > zαd and Tjd̄ > zαd̄, where j = 1, 2, and 0 ≤ αd, αd̄ ≤ 1 are values such that αd × αd̄ = α, the prespecified overall nominal level. A common choice is α′ = αd = αd̄ = √α; similarly, the power for each individual test is taken to be √(1 − β), that is, β′ = 1 − √(1 − β). By replacing α by α′ and β by β′ in formulas (3) and (4), we readily obtain the required approximate sample sizes, denoted Ng and Mg, for a desired power of 100(1 − β)% of simultaneously rejecting the two null hypotheses H0g: p+1|g/p1+|g = δ0g, for g = d, d̄, versus the alternative hypotheses H1g: p+1|g/p1+|g = δ1g at a nominal α-level when T1g and T2g (g = d, d̄) are adopted, respectively. Replacing p̄1+|g and p̄11|g by p1+|g and p11|g in Ng and Mg leads to the approximate sample size formulae proposed by Lui and Cumberland [1]. We emphasize that noninferiority can be established only when we reject both null hypotheses, not merely one of them. For practical implementation issues of simultaneous equivalence/noninferiority testing, one could consult Lui and Cumberland.
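Formulae (3) and (4) and the simultaneous-testing adjustment can be sketched as follows (function and parameter names are ours; the limits p̄1+|g and p̄11|g are supplied by whichever estimation method is chosen):

```python
import math
from statistics import NormalDist

def z(gamma):
    """Upper 100*gamma-th percentile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - gamma)

def n_log(delta0, delta1, p1p, p11, pbar1p, pbar11, alpha=0.05, beta=0.20):
    """Approximate sample size from Eq. (3) (log-transformation statistic)."""
    t0 = z(alpha) * math.sqrt(((1 + delta0) * pbar1p - 2 * pbar11)
                              / (delta0 * pbar1p ** 2))
    t1 = z(beta) * math.sqrt(((1 + delta1) * p1p - 2 * p11)
                             / (delta1 * p1p ** 2))
    return math.ceil((t0 + t1) ** 2
                     / (math.log(delta1) - math.log(delta0)) ** 2)

def n_fieller(delta0, delta1, p1p, p11, pbar1p, pbar11, alpha=0.05, beta=0.20):
    """Approximate sample size from Eq. (4) (Fieller-type statistic)."""
    t0 = z(alpha) * math.sqrt(delta0 * ((1 + delta0) * pbar1p - 2 * pbar11))
    t1 = z(beta) * math.sqrt((delta1 + delta0 ** 2) * p1p
                             - 2 * delta0 * p11
                             - (delta1 - delta0) ** 2 * p1p ** 2)
    return math.ceil((t0 + t1) ** 2 / (p1p * (delta1 - delta0)) ** 2)

def simultaneous_levels(alpha, beta):
    """Per-test level and type II error for simultaneous testing:
    alpha' = sqrt(alpha) and 1 - beta' = sqrt(1 - beta)."""
    return math.sqrt(alpha), 1 - math.sqrt(1 - beta)
```

For example, with the sample-based limits (p̄1+|g = p1+|g, p̄11|g = p11|g), δ0g = 0.90, δ1g = 0.95, p1+|g = 0.80, and p11|g = 0.56, both formulae give sample sizes of roughly 1500.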
Empirical studies

In this section, we conduct a simple empirical study to evaluate (1) the robustness of the various test statistics and (2) the accuracy of the sample size formulae based on the constrained maximum likelihood method. We first consider testing one-sided equivalence of the sensitivities (or specificities) of a reference test and a new test. Following Cochran, we say a test is robust if its actual significance level does not exceed the nominal level by more than 20% (for example, at most 0.06 when the nominal significance level is 0.05) [14]. This is a less stringent definition than that of Heeren and D'Agostino, who suggested within 10% of the nominal level for robustness [15]. In this article, a statistical test whose actual significance level exceeds the nominal level by more than 20% is called liberal, and a test with an actual level below the nominal level is called conservative. One should view a rejection of the null hypothesis by a liberal test with caution, for the type I error exceeds the prechosen nominal error rate. Conservative tests are of less concern, for the type I error rate is controlled. In this respect, we share the views of Sullivan and D'Agostino [16]. Instead of computing the empirical significance levels of the various test statistics via Monte Carlo simulation, we obtain their actual significance levels. To do so, we note that the exact power of a particular α-level test Tkg (k = 1, 2) at a given sample size ng and response probability vector p = (p11|g, p10|g, p01|g) is given by
Π(Tkg; α, ng, p) = Σ_{n11|g=0}^{ng} Σ_{n10|g=0}^{ng−n11|g} Σ_{n01|g=0}^{ng−n11|g−n10|g} [ng! / (n11|g! n10|g! n01|g! (ng − n11|g − n10|g − n01|g)!)]
    × (p11|g)^{n11|g} (p10|g)^{n10|g} (p01|g)^{n01|g} (1 − p11|g − p10|g − p01|g)^{ng − n11|g − n10|g − n01|g} × I(Tkg; α),

where I(Tkg; α) = 1 if Tkg indicates a rejection of the null hypothesis at the α-level and 0 otherwise. In particular, if p = (p11|g, p10|g, p01|g) is chosen under the null hypothesis, Π(Tkg; α, ng, p) reduces to the actual significance level. Since a sample size ng greater than 2000 could lead to prohibitive computing time, we restrict our attention to situations in which the approximate sample sizes are no more than 2000. To achieve this, we consider δ0g = 0.90; δ1g = 0.95, 1.0; p11|g ranging from max{0, (1 + δ1g)p1+|g − 1} to δ0g p1+|g (in steps of 0.02); and p1+|g = 0.80, 0.90, 0.95, such that each combination of these parameters leads to a valid set of response probabilities, with pij|g ≥ 0 for all i and j under both the null and alternative hypotheses. We first apply the approximate sample size formulae ng (3) and mg (4) under the constrained maximum likelihood method to obtain the desired sample sizes for a power of 80% at the 0.05 level (one-sided test) for each configuration determined by these parameters (see column 7 in Tables 1 and 2).
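The trinomial enumeration above can be coded directly; a sketch (names ours), parameterized by a user-supplied rejection rule so that the same routine yields the actual significance level under H0 and the exact power under H1:

```python
import math

def exact_power(n, p11, p10, p01, rejects):
    """Sum the multinomial probabilities of all outcomes (n11, n10, n01, n00)
    with n11 + n10 + n01 + n00 = n for which rejects(n11, n10, n01, n00)
    returns True."""
    p00 = 1.0 - p11 - p10 - p01
    total = 0.0
    for n11 in range(n + 1):
        for n10 in range(n - n11 + 1):
            for n01 in range(n - n11 - n10 + 1):
                n00 = n - n11 - n10 - n01
                if not rejects(n11, n10, n01, n00):
                    continue
                # log multinomial coefficient via lgamma for numerical stability
                logp = (math.lgamma(n + 1)
                        - math.lgamma(n11 + 1) - math.lgamma(n10 + 1)
                        - math.lgamma(n01 + 1) - math.lgamma(n00 + 1))
                impossible = False
                for k, p in ((n11, p11), (n10, p10), (n01, p01), (n00, p00)):
                    if k == 0:
                        continue
                    if p <= 0.0:
                        impossible = True  # positive count in a zero-probability cell
                        break
                    logp += k * math.log(p)
                if not impossible:
                    total += math.exp(logp)
    return total
```

Passing a rejection rule that applies T1g or T2g at level α to each outcome reproduces the ASL and EP columns of Tables 1 and 2; the enumeration is O(ng³), which is why attention is restricted to ng ≤ 2000.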
Table 1. The approximate sample size for a desired power of 80% using the logarithmic transformation statistic with δ0 = 0.90 at α = 0.05 and the corresponding actual significance level (ASL) and exact power (EP)

                             Sample-based      Constrained            Constrained maximum
                             method            least-squares method   likelihood method
p1+|g  δ1    p11|g  ng       ASL (EP)          ASL (EP)         ng    ASL (EP)
0.80   0.95  0.56   1532     5.08 (79.98)      5.08 (79.98)     1557  4.94 (80.23)
             0.58   1411     5.10 (80.34)      5.10 (80.35)     1419  4.93 (80.14)
             0.60   1262     5.12 (80.10)      5.13 (80.12)     1281  4.92 (80.14)
             0.62   1127     5.16 (80.22)      5.18 (80.25)     1144  4.90 (80.15)
             0.64   985      5.22 (80.14)      5.25 (80.18)     1008  4.87 (80.20)
             0.66   838      5.33 (79.82)      5.36 (79.87)     873   4.84 (80.24)
             0.68   702      5.52 (79.89)      5.56 (79.99)     739   4.77 (80.32)
             0.70   567      5.97 (80.02)      5.98 (80.14)     607   4.64 (80.40)
             0.72   430      6.94 (80.07)a     4.53 (80.16)     478   4.32 (80.56)
0.80   1.00  0.60   350      5.25 (80.16)      5.27 (82.45)     360   4.84 (80.45)
             0.62   313      5.32 (79.92)      5.35 (79.95)     326   4.81 (80.36)
             0.64   280      5.47 (80.09)      5.48 (80.18)     293   4.77 (80.52)
             0.66   246      5.63 (80.27)      5.70 (80.30)     259   4.71 (80.53)
             0.68   210      5.99 (80.10)      6.10 (80.19)     226   4.58 (80.71)
             0.70   173      6.79 (79.68)      6.73 (79.87)     194   4.31 (81.04)
             0.72   139      8.00 (79.68)      4.07 (80.09)     162   3.73 (81.27)
0.90   0.95  0.76   684      5.46 (81.55)      5.50 (81.64)     683   4.79 (80.30)
             0.78   565      5.74 (81.26)      5.80 (81.33)     578   4.69 (80.35)
             0.80   464      6.37 (82.07)      6.07 (82.18)     475   4.49 (80.46)
0.90   1.00  0.80   139      7.65 (80.81)      5.76 (80.81)     156   4.08 (80.73)
0.95   0.95  0.85   376      6.75 (80.25)      5.42 (80.25)     425   4.42 (80.04)

a Actual significance level is set in bold whenever it exceeds the true 0.05-level by 20%.
Table 2. The approximate sample size for a desired power of 80% using the Fieller-type statistic with δ0 = 0.90 at α = 0.05 and the corresponding actual significance level (ASL) and exact power (EP)

                             Sample-based      Constrained            Constrained maximum
                             method            least-squares method   likelihood method
p1+|g  δ1    p11|g  mg       ASL (EP)          ASL (EP)         mg    ASL (EP)
0.80   0.95  0.56   1539     5.08 (80.15)      5.11 (80.24)     1550  5.00 (80.24)
             0.58   1406     5.10 (80.23)      5.13 (80.31)     1412  4.99 (80.35)
             0.60   1251     5.13 (79.80)      5.16 (79.89)     1275  4.99 (80.16)
             0.62   1123     5.17 (80.11)      5.21 (80.21)     1138  4.98 (80.18)
             0.64   980      5.23 (79.97)      5.27 (80.09)     1001  4.97 (80.18)
             0.66   843      5.33 (80.02)      5.39 (80.15)     866   4.94 (80.24)
             0.68   707      5.52 (80.14)      5.59 (80.31)     732   4.91 (80.31)
             0.70   566      5.97 (79.96)      6.00 (80.15)     600   4.82 (80.41)
             0.72   430      6.94 (80.07)a     4.62 (80.22)     471   4.56 (80.54)
0.80   1.00  0.60   347      5.25 (80.00)      5.32 (80.03)     356   4.98 (80.37)
             0.62   313      5.32 (79.92)      5.40 (80.01)     322   4.96 (80.23)
             0.64   280      5.47 (80.09)      5.51 (80.28)     288   4.93 (80.30)
             0.66   243      5.65 (79.86)      5.75 (80.03)     255   4.89 (80.43)
             0.68   209      6.00 (79.96)      6.18 (80.16)     222   4.82 (80.53)
             0.70   174      6.78 (79.88)      6.74 (80.22)     189   4.67 (80.71)
             0.72   138      7.98 (79.45)      4.06 (79.96)     158   3.93 (81.21)
0.90   0.95  0.76   683      5.46 (81.50)      5.53 (81.66)     677   4.92 (79.79)
             0.78   567      5.74 (81.38)      5.82 (81.54)     572   4.86 (80.37)
             0.80   465      6.36 (82.14)      6.13 (82.34)     469   4.72 (80.46)
0.90   1.00  0.80   135      7.72 (78.99)      5.72 (79.93)     152   4.42 (80.82)
0.95   0.95  0.85   376      6.75 (80.25)      5.44 (80.25)     419   4.67 (80.15)

a Actual significance level is set in bold whenever it exceeds the true 0.05-level by 20%.
Based on these approximate sample sizes, we then obtain the corresponding (1) actual significance level (i.e., Π(Tkg; α, ng, p) under the null hypothesis H0g: p+1|g/p1+|g = δ0g) and (2) exact power (i.e., Π(Tkg; α, ng, p) under the alternative hypothesis H1g: p+1|g/p1+|g = δ1g) (see column 8 in Tables 1 and 2). For statistics based on the sample-based and constrained least-squares methods, we simply adopt the empirically adjusted sample sizes reported in Table 6 of Lui and Cumberland [1] (see column 4 in Tables 1 and 2). The corresponding actual significance levels and exact powers are computed accordingly (see columns 5 and 6 in Tables 1 and 2). From Tables 1 and 2, we see that both T1g and T2g based on the sample-based method and the constrained least-squares method should be considered liberal tests. The closer p11|g is to δ0g p1+|g, the more liberal the tests, which in turn leads to inflated actual type I error rates. Hence, tests and sample size formulae based on the sample-based method and the constrained least-squares method are not recommended. On the other hand, tests based on the constrained maximum likelihood method can be regarded as conservative. Although their actual significance levels are generally smaller than the prespecified nominal level, the difference is quite small except when p11|g is close to δ0g p1+|g. Again, conservative tests are of less concern, for their type I error rates are controlled. Based on the constrained maximum likelihood method, we observe that the logarithmic transformation
Table 3. The approximate sample size for a desired power of 80% of simultaneous noninferiority testing using the logarithmic transformation and Fieller-type statistics (based on the constrained maximum likelihood method) with δ0 = 0.90 at α = 0.01 and the corresponding actual significance level (ASL) and exact power (EP)

                           Log-transformation       Fieller's theorem
p1+|g  δ1    p11|g  Ng     ASL (EP)          Mg     ASL (EP)
0.80   0.95  0.56   1602   9.93 (89.43)      1599   9.99 (89.50)
             0.58   1459   9.93 (89.40)      1456   9.99 (89.42)
             0.60   1317   9.92 (89.41)      1313   9.99 (89.49)
             0.62   1175   9.91 (89.42)      1171   10.00 (89.50)
             0.64   1033   9.89 (89.42)      1030   9.98 (89.44)
             0.66   892    9.85 (89.43)      889    9.97 (89.46)
             0.68   753    9.81 (89.48)      749    9.95 (89.48)
             0.70   614    9.71 (89.48)      611    9.90 (89.52)
             0.72   478    9.44 (89.50)      476    9.72 (89.61)
0.80   1.00  0.60   369    9.85 (89.64)      366    9.99 (89.49)
             0.62   334    9.84 (89.56)      331    10.00 (89.50)
             0.64   299    9.79 (89.64)      296    9.99 (89.57)
             0.66   264    9.77 (89.70)      261    9.95 (89.61)
             0.68   229    9.64 (89.75)      226    9.90 (89.64)
             0.70   195    9.53 (89.91)      192    9.81 (89.77)
             0.72   162    8.68 (90.14)      159    9.14 (90.00)
0.90   0.95  0.76   696    9.81 (89.43)      693    9.96 (89.49)
             0.78   587    9.75 (89.51)      584    9.91 (89.53)
             0.80   479    9.62 (89.54)      476    9.81 (89.58)
0.90   1.00  0.80   157    9.30 (90.30)      154    9.67 (89.92)
0.95   0.95  0.85   427    9.54 (89.41)      424    9.77 (89.51)
statistic commits slightly fewer type I errors than does the Fieller-type statistic, while their sample size estimates are quite close (differing by at most 7 in total). Most importantly, the approximate sample sizes obtained via the constrained maximum likelihood method produce actual powers quite close to the prespecified power level. In view of the above, we recommend tests based on constrained maximum likelihood estimation. These results are consistent with the observations reported for comparative binomial trials [11]. If one can afford to recruit a few more subjects, the logarithmic transformation statistic seems preferable, as it commits fewer type I errors than does the Fieller-type statistic. Table 3 summarizes the approximate sample sizes (based on the constrained maximum likelihood method) required from each population g (g = d, d̄) for a desired power of 80% of simultaneously rejecting the null hypotheses H0g: p+1|g/p1+|g = δ0g = 0.90, for g = d, d̄, versus the alternative hypotheses H1g: p+1|g/p1+|g = δ1g = 0.95, 1.0 at the 0.01 level (one-sided test), in situations in which p11|g ranges from max{0, (1 + δ1g)p1+|g − 1} to δ0g p1+|g (in steps of 0.02) and p1+|g = 0.80, 0.90, 0.95, such that each combination of these parameters leads to a valid set of response probabilities, with pij|g ≥ 0 for all i and j under both the null and alternative hypotheses (see columns 4 and 6). Since equivalence/noninferiority is established only when we reject both null hypotheses, rather than either one of the two, the actual significance level and exact power for each individual test should now be compared with the adjusted nominal level 0.1 (= √0.01) and
adjusted power level 0.894 (= √0.80), respectively. Consistent with our previous findings, tests based on the constrained maximum likelihood method can be regarded as conservative. Although their actual significance levels are generally smaller than the prespecified nominal level, the difference is quite small except when p11|g is close to δ0g p1+|g. Moreover, the approximate sample sizes obtained via the constrained maximum likelihood method produce actual powers extremely close to the prespecified power level. Therefore, we recommend tests based on constrained maximum likelihood estimation.

Example from a clinical laboratory study

The quantitation of in vitro immunoglobulin E (IgE) antibodies to the benzylpenicilloyl determinant (BPO) is a useful tool for evaluating suspected penicillin-allergic subjects. Garcia et al. studied the sensitivity and specificity of several radioallergosorbent test (RAST) methods for quantitating specific IgE antibodies to the BPO [17]. Thirty positive control sera (serum samples from penicillin-allergic subjects with a positive clinical history and a positive penicillin skin test) and 30 negative control sera (sera from subjects with no history of penicillin allergy and negative skin tests) were tested for BPO-specific IgE antibodies by RAST using different conjugates coupled to the solid phase. Benzylpenicillin conjugated to human serum albumin (HSA) has been widely used and is considered here as the reference test. Benzylpenicillin conjugated to an aminospacer (SP) has only recently been reported and is considered here as the experimental test. The RAST cutoff points offering the best compromise between sensitivity and specificity for the two assays were reported to be 0.9 for BPO-HSA and 0.5 for BPO-SP. The results are summarized in Table 4.
In the subsequent presentation, all results are based on the constrained maximum likelihood estimation method, and we are interested in testing noninferiority for both the sensitivity and the specificity simultaneously. Suppose we want to demonstrate simultaneously that the sensitivity and specificity of the experimental test are at least a factor δ0d = δ0d̄ = 0.90 of those of the reference test. Hence, the null hypotheses of interest are H0d: p+1|d/p1+|d ≤ 0.90 or
H0d̄: p+1|d̄/p1+|d̄ ≤ 0.90.
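Because noninferiority must be demonstrated for sensitivity and specificity simultaneously, each one-sided test is carried out at the adjusted level α′ = √α, with the per-test power target likewise √0.80. The adjusted constants used in this example can be checked with a short sketch (our own illustration, using only the Python standard library):

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.01   # overall one-sided significance level
power = 0.80   # overall target power for the simultaneous test

# Each marginal test runs at alpha' = sqrt(alpha), so rejecting both
# null hypotheses has overall level alpha'^2 = alpha; the per-test
# power target is likewise sqrt(power).
alpha_adj = sqrt(alpha)                        # 0.1
power_adj = sqrt(power)                        # ~0.894
z_crit = NormalDist().inv_cdf(1 - alpha_adj)   # upper alpha' quantile, ~1.2816

print(alpha_adj, power_adj, z_crit)
```

Each marginal statistic is then compared against z_crit rather than the unadjusted upper α quantile.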
In this case, we have p̂1+|d = 19/30, p̂+1|d = 26/30, p̂1+|d̄ = 25/30, and p̂+1|d̄ = 29/30, and the estimates of the rate ratios are given by δ̂0d = 26/19 = 1.37 and δ̂0d̄ = 29/25 = 1.16. The constrained maximum likelihood estimates are p̆10|d = 0.2428, p̆1+|d = 0.7672, p̆10|d̄ = 0.1788, and p̆1+|d̄ = 0.9125. Hence, based on the constrained maximum likelihood method we have the logarithmic transformation statistics T1d = 2.16 and T1d̄ = 2.13, and the Fieller-type statistics T2d = 2.68 and T2d̄ = 2.42. Let the nominal level be α = 0.01. Hence, α′ = αd = αd̄ = √α = 0.1 with zα′ = 1.2816.

Table 4. Observed frequencies for BPO-HSA and BPO-SP for the positive and negative control groups

                Penicillin allergy: + (BPO-SP)      Penicillin allergy: − (BPO-SP)
BPO-HSA         Consistent      Inconsistent        Consistent      Inconsistent
Consistent          17               2                  24               1
Inconsistent         9               2                   5               0

As a result, both tests
indicate that BPO-SP is noninferior to BPO-HSA in terms of both sensitivity and specificity at the 0.01 level. Now suppose we are planning a trial with δ0d = δ0d̄ = 0.90, δ1d = δ1d̄ = 1.00, α = 0.01, p11|d = 0.57, p1+|d = 0.63, p11|d̄ = 0.80, and p1+|d̄ = 0.83. The asymptotic powers for the logarithmic transformation and the Fieller-type statistics based on these settings and nd = nd̄ = 30 are 0.2040 and 0.1865, respectively. This indicates that the study by Garcia et al. would not have had reasonable power if the two conjugates coupled to the solid phase simultaneously had equal sensitivities and specificities [17]. In order to achieve a power of 80%, we need sample sizes (for simultaneously testing noninferiority for both the sensitivity and the specificity) of Nd = 199 and Nd̄ = 77 for the logarithmic transformation statistic, and Md = 196 and Md̄ = 74 for the Fieller-type statistic. It is noteworthy that the results for the two statistics do not differ substantially.
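The constrained maximum likelihood estimates in this example can be reproduced from the Table 4 counts using the quadratic equation derived in the Appendix. A minimal sketch (the function name is ours; the helper assumes the matched-pair cell ordering n11, n10, n01, n00):

```python
from math import sqrt

def constrained_mle(n11, n10, n01, n00, delta0):
    """Constrained MLEs of p10 and p1+ under H0: p+1/p1+ = delta0,
    using the quadratic equation from the Appendix."""
    n = n11 + n10 + n01 + n00
    p1p = (n11 + n10) / n   # observed p1+ (reference-test positive rate)
    pp1 = (n11 + n01) / n   # observed p+1 (experimental-test positive rate)
    p10 = n10 / n
    p00 = n00 / n
    A = 1 + delta0
    B = p1p * delta0**2 - (pp1 + 2 * p10)
    C = p10 * (1 - delta0) * (pp1 + p10)
    p10_t = (sqrt(B * B - 4 * A * C) - B) / (2 * A)  # constrained MLE of p10
    p1p_t = (1 - p00 - p10_t) / delta0               # Eq. (5)
    return p10_t, p1p_t

# Table 4 counts (rows: BPO-HSA consistent/inconsistent;
# columns: BPO-SP consistent/inconsistent)
pos = (17, 2, 9, 2)   # positive control sera (sensitivity, g = d)
neg = (24, 1, 5, 0)   # negative control sera (specificity, g = d-bar)

print(constrained_mle(*pos, 0.90))   # ~ (0.2428, 0.7672)
print(constrained_mle(*neg, 0.90))   # ~ (0.1788, 0.9125)
```

These match the constrained estimates quoted above; the estimated rate ratios 26/19 and 29/25 follow directly from the same marginal counts.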
Discussion
In practice, standard null hypotheses of no difference do not apply if the requirement is to establish equivalence or noninferiority between a reference test and a new test. We systematically review two statistics (one derived from the logarithmic transformation and the other from Fieller's theorem) based on three estimation methods, together with their corresponding sample size formulae, applicable to matched-pair trials when the null hypothesis is of a specified nonunity ratio of sensitivities (or specificities). Method 1, the sample-based method, and Method 2, the constrained least-squares method, may produce incorrect sample sizes. This is consistent with the findings in comparative binomial trials [11]. Moreover, tests based on these methods can lead to inflated type I error rates and are considered liberal; they are therefore not recommended. Method 3, constrained maximum likelihood estimation, on the other hand, produces accurate sample size formulae, and the associated statistics control the type I error rate well for moderate to large sample sizes. We therefore recommend them in practice. In particular, the logarithmic transformation statistic is preferable, as its sample size estimate is generally close to that of the Fieller-type statistic but it commits type I errors less frequently. Our investigation is mainly confined to statistics based on the logarithmic transformation and Fieller's theorem, and to the sample-based, constrained least-squares, and constrained maximum likelihood estimation methods. Other statistics or estimation methods for which the asymptotic properties discussed in this article hold may well exist. However, we are unaware of any procedure for deriving such tests or methods and leave this question open for further research.
Acknowledgments
The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CUHK4261/00M). The author is grateful to the editor and two referees for their valuable suggestions, which greatly enhanced the manuscript.
Appendix
For any given δ0g, p1+|g, and p10|g, we have p11|g = p1+|g − p10|g and p01|g = (δ0g − 1)p1+|g + p10|g. Hence, the log-likelihood function can be written as

lnL = n11|g ln(p1+|g − p10|g) + n10|g ln(p10|g) + n01|g ln[(δ0g − 1)p1+|g + p10|g] + n00|g ln[1 − δ0g p1+|g − p10|g],

and the corresponding first-order partial derivatives with respect to p1+|g and p10|g are given by

∂lnL/∂p1+|g = n11|g/(p1+|g − p10|g) + (δ0g − 1)n01|g/[(δ0g − 1)p1+|g + p10|g] − δ0g n00|g/(1 − δ0g p1+|g − p10|g)

and

∂lnL/∂p10|g = −n11|g/(p1+|g − p10|g) + n10|g/p10|g + n01|g/[(δ0g − 1)p1+|g + p10|g] − n00|g/(1 − δ0g p1+|g − p10|g).

Here, δ0g is the parameter of interest, and p1+|g and p10|g are considered nuisance parameters. The constrained maximum likelihood estimators (under the null hypothesis H0g: p+1|g/p1+|g = δ0g) of p1+|g and p10|g (i.e., p̆1+|g and p̆10|g) for any δ0g can be obtained by solving the two equations ∂lnL/∂p1+|g = 0 and ∂lnL/∂p10|g = 0. Therefore, p̆1+|g(∂lnL/∂p1+|g = 0) + p̆10|g(∂lnL/∂p10|g = 0) yields

ng − n00|g/(1 − δ0g p̆1+|g − p̆10|g) = 0,

or

p̆1+|g = [1 − p̂00|g − p̆10|g]/δ0g.    (5)
After substituting Eq. (5) into ∂lnL/∂p1+|g = 0 and some simple algebra, it is easy to see that p̆10|g satisfies the quadratic equation

Ax² + Bx + C = 0,

with A = 1 + δ0g, B = p̂1+|g δ0g² − (p̂+1|g + 2p̂10|g), and C = p̂10|g(1 − δ0g)(p̂+1|g + p̂10|g). Moreover, p̆10|g = [√(B² − 4AC) − B]/(2A) can be shown to be the desired constrained maximum likelihood estimate of p10|g.

References
[1] Lui KJ, Cumberland WG. Sample size determination for equivalence test using rate ratio of sensitivity and specificity in paired sample data. Control Clin Trials 2001;22:373–389.
[2] Dunnett CW, Gent M. Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables. Biometrics 1977;33:593–602.
[3] Nam J-M. Establishing equivalence of two treatments and sample size requirements in matched-pairs design. Biometrics 1997;53:1422–1430.
[4] Lu Y, Bean J. On the sample size for one-sided equivalence of sensitivities based upon McNemar's test. Stat Med 1995;14:1831–1839.
[5] Farrington CP, Manning G. Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med 1990;9:1447–1454.
[6] Liu JP, Hsueh HM, Hsieh E, Chen JJ. Tests for equivalence or non-inferiority for paired binary data. Stat Med 2002;21:231–245.
[7] Lachenbruch PA, Lynch CJ. Assessing screening tests: extensions of McNemar's test. Stat Med 1998;17:2207–2217.
[8] Tang NS, Tang ML, Chan ISF. On tests of equivalence via non-unity relative risk for matched-pair design. Stat Med 2003;22:1217–1233.
[9] Tang ML, Tang NS, Chan ISF, Chan BPS. Sample size determination for establishing equivalence/noninferiority via ratio of two proportions in matched-pair design. Biometrics 2002;58:957–963.
[10] Nam J, Blackwelder WC. Analysis of the ratio of marginal probabilities in a matched-pair setting. Stat Med 2002;21:689–699.
[11] Farrington CP, Manning G. Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med 1990;9:1447–1454.
[12] Katz D, Baptista J, Azen SP, Pike MC. Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 1978;34:469–474.
[13] Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, MA: MIT Press, 1980.
[14] Cochran WG. The χ² test of goodness of fit. Ann Math Stat 1952;23:315–345.
[15] Heeren T, D'Agostino RB. Robustness of the two independent sample t-test when applied to ordinal scaled data. Stat Med 1987;6:79–90.
[16] Sullivan LM, D'Agostino RB. Robustness and power of analysis of covariance applied to data distorted from normality by floor effects: homogeneous regression slopes. Stat Med 1996;15:477–496.
[17] Garcia JJ, Blanca M, Moreno F, et al. Determination of IgE antibodies to the benzylpenicilloyl determinant: a comparison of the sensitivity and specificity of three radioallergosorbent test methods. J Clin Lab Anal 1997;11:251–257.