Controlled Clinical Trials 25 (2004) 3 – 12 www.elsevier.com/locate/conclintrial
A simple alternative confidence interval for the difference between two proportions
Guangyong Zou a,b,*, Allan Donner b
a Robarts Clinical Trials, Robarts Research Institute, London, Ontario, Canada N6A 5K8
b Department of Epidemiology and Biostatistics, University of Western Ontario, London, Ontario, Canada
Received 28 February 2003; accepted 25 August 2003
Abstract

The difference between two proportions is often the focus of interest in prospective comparative studies such as randomized controlled trials that have a binary outcome. Consequently, interval estimation for this parameter has received considerable attention in the literature. A hybrid procedure resulting from combining two sets of confidence limits for a single proportion as proposed by Newcombe has been previously recommended for this purpose because of its superior properties and relative simplicity. In this paper, we propose a simple alternative approach based on Fisher’s z transformation. The results of an exact evaluation study show that this new procedure performs as well as Newcombe’s procedure in terms of percent coverage and expected confidence interval width. Several examples are presented.
© 2004 Elsevier Inc. All rights reserved.

Keywords: Confidence interval; Fisher’s z transformation; Number needed to treat; Risk difference; Score method
* Corresponding author. Robarts Clinical Trials, Robarts Research Institute, P.O. Box 5015, 100 Perth Drive, London, Ontario, Canada N6A 5K8. Tel.: +1-519-663-3400x34092; fax: +1-519-663-3807. E-mail address: [email protected] (G. Zou).
0197-2456/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.cct.2003.08.010

1. Introduction

The difference between two proportions, or the risk difference, is often the primary focus of interest in controlled clinical trials and other prospective studies that have a binary outcome. Consequently, the problem of obtaining interval estimates for this parameter has gained considerable attention in the literature [1,2], with a very recent review provided by Agresti [3]. The application of interval estimation to the risk difference is of particular interest to clinicians, given its close relationship to the number needed to treat (NNT) [4]. Since the NNT is defined as the reciprocal of the risk difference, corresponding confidence limits are obtained by taking the reciprocals of the upper and lower confidence limits for the latter parameter, with special care required when this interval contains zero [5].

So-called exact methods could also be applied to this problem. However, as noted by Agresti [3], by design these methods produce systematically conservative results. Thus our focus in this paper will be on approximate procedures.

Following an extensive comparison of 11 methods ranging from simple asymptotic procedures to iterative profile likelihood based methods, Newcombe [6] recommended an approach that combines score confidence intervals as computed for the two separate proportions [7], an approach that has also been suggested as appropriate for calculating confidence limits for the NNT [8]. It is largely its existence in closed form that makes the score method more attractive to practitioners than iterative procedures, such as those based on quasi-exact methods [9].

In this paper, we propose and evaluate an alternative closed-form procedure based on the familiar Fisher’s z transformation. The results obtained from an exact evaluation demonstrate that this simple method performs as well as Newcombe’s procedure [6] in terms of percent coverage and expected interval width. In addition, the proposed procedure can be easily extended to more complex problems.
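The reciprocal relationship between the NNT and the risk difference noted in this section can be sketched in a few lines of Python. This is only an illustration: the function name `nnt_from_risk_difference` and the returned `spans_zero` flag are assumptions introduced here, and the handling of an interval that contains zero is deliberately minimal (see Ref. [5] for the recommended presentation in that case).

```python
import math


def nnt_from_risk_difference(lower, upper):
    """Map risk-difference confidence limits to NNT limits by taking reciprocals.

    Returns (nnt_a, nnt_b, spans_zero). When spans_zero is True the two
    numbers do not bound an ordinary interval: the NNT range passes
    through infinity, so the result needs special care in reporting.
    """
    def reciprocal(value):
        # A limit of exactly zero corresponds to an infinite NNT.
        return math.inf if value == 0 else 1.0 / value

    spans_zero = lower <= 0.0 <= upper
    # Taking reciprocals reverses the order of two limits of the same sign.
    return reciprocal(upper), reciprocal(lower), spans_zero


# Illustrative limits for a risk difference: (0.0533, 0.2406)
nnt_low, nnt_high, flag = nnt_from_risk_difference(0.0533, 0.2406)
print(round(nnt_low, 2), round(nnt_high, 2), flag)
```

When the flag is set, the two reciprocals should be read as the ends of two separate ranges (benefit and harm) rather than as a single interval.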
2. Methods

Let $p_1$ and $p_2$ be the true probabilities (risks) of an outcome event in the control and the treatment groups, respectively, implying that the absolute risk reduction is given by $\Delta = p_1 - p_2$, where $-1 \le \Delta \le 1$. Let $n_1$ and $n_2$ be the number of subjects randomized to these two groups, respectively. Estimates of $p_1$ and $p_2$ are given by $\hat{p}_1 = x_1/n_1$ and $\hat{p}_2 = x_2/n_2$, where $x_1$ and $x_2$ denote the number of observed events in the two groups. A simple large-sample procedure for constructing a $(1-\alpha)100\%$ confidence interval for $\Delta$, commonly referred to as a Wald procedure, is given by

$$\hat{\Delta} \pm z_{\alpha/2}\,\hat{\sigma},$$

where $\hat{\Delta} = \hat{p}_1 - \hat{p}_2$,

$$\hat{\sigma} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}},$$

and $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.

Aside from the possibility that this procedure may yield limits out of range, previously reported simulation results [2] show that it performs poorly in terms of percent coverage. However, to retain its inherent simplicity, many authors, including Hauck and Anderson [2] and Agresti and Caffo [10], have attempted to improve its performance by introducing various refinements, such as adding ‘‘pseudocounts.’’ Nevertheless, such adjustments fail to avoid the problem of providing out-of-range limits, a problem exacerbated by the highly skewed nature of the sampling distribution of $\hat{\Delta}$. Moreover, a strategy of using quantiles of the t-distribution [11] will also not be generally helpful, since ‘‘the poor performance of the Wald interval does not occur because it is too short’’ (Ref. [10], p. 284). A further complication of this approach is the need to calculate the degrees of freedom using the method of Satterthwaite [12]. As a consequence of these difficulties, we consider a new strategy described below.
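As a minimal sketch of the Wald procedure just described, the following Python function computes the interval from the four counts. The function name `wald_ci` is an assumption made for illustration, and no boundary adjustment is applied, so a zero or full event count in both groups yields a degenerate interval.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(x1, n1, x2, n2, alpha=0.05):
    """Wald interval for the risk difference p1 - p2 (unadjusted)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    delta_hat = p1_hat - p2_hat
    # Estimated standard error of the difference between two proportions.
    sigma_hat = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # Upper alpha/2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return delta_hat - z * sigma_hat, delta_hat + z * sigma_hat


print(wald_ci(11, 67, 1, 63))  # moderate sample sizes
print(wald_ci(6, 7, 1, 7))     # small samples: the upper limit exceeds 1
```

The second call illustrates the out-of-range problem mentioned above: with 6/7 versus 1/7 events the upper limit falls outside the admissible range for a risk difference.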
Transformations have often been employed for correcting the skewness of sampling distributions. Although it has been stated that ‘‘unlike the case of a single proportion, there is no convenient function that may be used to yield asymmetric confidence limits on the risk difference that are then bounded by $(-1, 1)$’’ (Ref. [13], p. 24), we proceed with the notion that the parameter $\Delta$ and a population correlation coefficient have the same domain. Moreover, it is well known that Fisher’s z transformation is very effective in removing the skewness of the sampling distribution of an estimated correlation coefficient [14]. Thus an alternative to the score-based hybrid method is to apply this transformation to the estimator $\hat{\Delta}$. Letting $\hat{\sigma}$ denote the estimated standard error of $\hat{\Delta}$, the resulting confidence interval for $\Delta$ is given by
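$$\left( \frac{\exp(2l) - 1}{\exp(2l) + 1},\ \frac{\exp(2u) - 1}{\exp(2u) + 1} \right),$$

where $l, u$ are given by

$$\frac{1}{2}\log\frac{1+\hat{\Delta}}{1-\hat{\Delta}} \pm \frac{z_{\alpha/2}\,\hat{\sigma}}{1-\hat{\Delta}^2}.$$

Note that this transformation is defined only if 0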
$$\left[ \hat{\Delta} - z_{\alpha/2}\sqrt{\frac{l_1(1-l_1)}{n_1} + \frac{u_2(1-u_2)}{n_2}},\ \ \hat{\Delta} + z_{\alpha/2}\sqrt{\frac{u_1(1-u_1)}{n_1} + \frac{l_2(1-l_2)}{n_2}} \right].$$
The notion of combining two sets of limits in this manner has very general application. For example, it can also be applied to inference problems concerning a difference between paired proportions [16]. A justification for the approach has been provided by Donner and Zou [17], who applied it to the problem of constructing a confidence interval for the difference between two independent kappa statistics.
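The two closed-form intervals of this section can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that $(l_i, u_i)$ in the hybrid formula are the Wilson score limits [7] for the individual proportion $p_i$, which is how the Introduction describes Newcombe's approach; it also applies no boundary adjustment, so `fisher_z_ci` requires $|\hat{\Delta}| < 1$.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def _z(alpha):
    # Upper alpha/2 quantile of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


def fisher_z_ci(x1, n1, x2, n2, alpha=0.05):
    """Interval for p1 - p2 obtained by applying Fisher's z to delta_hat."""
    p1, p2 = x1 / n1, x2 / n2
    delta = p1 - p2
    sigma = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    centre = atanh(delta)                      # 0.5 * log((1 + d) / (1 - d))
    half = _z(alpha) * sigma / (1 - delta ** 2)
    # tanh(t) equals (exp(2t) - 1) / (exp(2t) + 1), the back-transformation.
    return tanh(centre - half), tanh(centre + half)


def wilson_limits(x, n, z):
    """Wilson score limits for a single proportion (assumed to be l_i, u_i)."""
    p = x / n
    centre = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(x * (1 - p) + z * z / 4) / (n + z * z)
    return centre - half, centre + half


def hybrid_score_ci(x1, n1, x2, n2, alpha=0.05):
    """Hybrid interval combining the score limits of the two proportions."""
    z = _z(alpha)
    delta = x1 / n1 - x2 / n2
    l1, u1 = wilson_limits(x1, n1, z)
    l2, u2 = wilson_limits(x2, n2, z)
    lower = delta - z * sqrt(l1 * (1 - l1) / n1 + u2 * (1 - u2) / n2)
    upper = delta + z * sqrt(u1 * (1 - u1) / n1 + l2 * (1 - l2) / n2)
    return lower, upper


print(fisher_z_ci(11, 67, 1, 63))
print(hybrid_score_ci(11, 67, 1, 63))
```

Using `atanh` and `tanh` is simply a compact way of writing the logarithmic transformation and its inverse; both limits returned by `fisher_z_ci` are therefore guaranteed to lie strictly inside $(-1, 1)$.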
3. Exact coverage evaluation

We evaluated the coverage probabilities associated with the procedures considered above by numerically computing all possible $(n_1+1)(n_2+1)$ outcomes. The coverage probability for a given interval estimate $(L, U)$ is easily shown to be given by

$$\sum_{x_1=0}^{n_1} \sum_{x_2=0}^{n_2} 1(\Delta \in [L,U]) \binom{n_1}{x_1} p_1^{x_1}(1-p_1)^{n_1-x_1} \binom{n_2}{x_2} p_2^{x_2}(1-p_2)^{n_2-x_2},$$

where $1(\Delta \in [L,U])$ is 1 if $[L,U]$ contains $\Delta$, 0 otherwise. Similarly, the expected length of an interval can be obtained as

$$\sum_{x_1=0}^{n_1} \sum_{x_2=0}^{n_2} (U - L) \binom{n_1}{x_1} p_1^{x_1}(1-p_1)^{n_1-x_1} \binom{n_2}{x_2} p_2^{x_2}(1-p_2)^{n_2-x_2}.$$

Fig. 1. Exact coverage probability of the interval estimator for risk difference with 0.95 nominal level, using Wald (dotted line), Score (solid line), and Fisher’s z (dash line) methods.
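The two double sums above can be evaluated by direct enumeration. The sketch below is one possible arrangement, not the authors' program: `exact_coverage_and_width` is a hypothetical helper that accepts any interval function with the signature `(x1, n1, x2, n2)`, and an unadjusted Wald interval is used purely as a stand-in.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_ci(x1, n1, x2, n2, alpha=0.05):
    """Unadjusted Wald interval, used here only as an example interval."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return p1 - p2 - z * se, p1 - p2 + z * se


def exact_coverage_and_width(interval, n1, n2, p1, p2):
    """Sum over all (n1 + 1)(n2 + 1) outcomes, weighting by binomial mass."""
    delta = p1 - p2
    coverage = 0.0
    expected_width = 0.0
    for x1 in range(n1 + 1):
        w1 = comb(n1, x1) * p1 ** x1 * (1 - p1) ** (n1 - x1)
        for x2 in range(n2 + 1):
            w2 = comb(n2, x2) * p2 ** x2 * (1 - p2) ** (n2 - x2)
            lower, upper = interval(x1, n1, x2, n2)
            if lower <= delta <= upper:   # indicator 1(delta in [L, U])
                coverage += w1 * w2
            expected_width += (upper - lower) * w1 * w2
    return coverage, expected_width


cov, width = exact_coverage_and_width(wald_ci, 20, 20, 0.15, 0.05)
print(cov, width)
```

Passing the interval as a function keeps the enumeration separate from the method under study, so the same loop can be reused for any of the procedures compared in this section.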
We therefore compared the score method (S), the proposed method based on the Fisher z transformation (Z), and the classical Wald interval (W) in terms of coverage probability and expected interval width. Parameter values $p_2 = 0.05$, 0.10, 0.30, and 0.50 were considered for this purpose, with parameter values for $\Delta$ ranging from 0 to 0.9, 0 to 0.8, $-0.2$ to 0.6, and $-0.4$ to 0.4, respectively, spaced at equal intervals of 0.10. The values of $n_1$ evaluated were equally spaced at intervals of 10 from 20 to 80, with values of $n_2$ equally spaced at intervals of 20 from 20 to 80. To deal with extreme cases, $\hat{p}_i$ was set to $1/(2n_i)$ if $x_i = 0$, and $1 - 1/(2n_i)$ if $x_i = n_i$, for $i = 1, 2$.

Coverage probabilities are shown in Figs. 1–4 for $p_2 = 0.05$, 0.10, 0.30, 0.50, respectively. As expected, the traditional Wald method performs poorly in reasonably large sample sizes for $p_2 \le 0.30$ (Fig. 3). We also note that the results from method S are very close to nominal for all parameter combinations considered, a finding that is in agreement with previously published evaluations [6,10]. Method Z shows performance virtually identical to that of method S, except when the event rates are small, i.e., 0.05 or 0.10, in the presence of small and imbalanced sample sizes, in which case the results tend to be erratic.

Fig. 2. Exact coverage probability of the interval estimator for risk difference with 0.95 nominal level, using Wald (dotted line), Score (solid line), and Fisher’s z (dash line) methods.

Fig. 3. Exact coverage probability of the interval estimator for risk difference with 0.95 nominal level, using Wald (dotted line), Score (solid line), and Fisher’s z (dash line) methods.

Fig. 4. Exact coverage probability of the interval estimator for risk difference with 0.95 nominal level, using Wald (dotted line), Score (solid line), and Fisher’s z (dash line) methods.
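The rule used for extreme cases in the evaluation (an observed count of zero events, or of events in every subject) is simple enough to state as code. The helper below is a sketch only; the name `adjusted_proportion` is an assumption and does not come from the paper.

```python
def adjusted_proportion(x, n):
    """Estimated proportion with the boundary rule used in the evaluation.

    x == 0  ->  1 / (2n)
    x == n  ->  1 - 1 / (2n)
    otherwise the usual estimate x / n is returned unchanged.
    """
    if x == 0:
        return 1 / (2 * n)
    if x == n:
        return 1 - 1 / (2 * n)
    return x / n


for x, n in [(0, 29), (5, 5), (5, 56)]:
    print(x, n, adjusted_proportion(x, n))
```

The adjustment keeps the estimated standard error strictly positive at the boundaries, so that the Wald and Fisher z limits remain computable for every outcome in the enumeration.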
Fig. 5. Expected interval width ratio of the score to Fisher’s z with sample size n2=20 (solid line), 40 (dotted line), 60 (dash line), and 80 (broken line).
Fig. 5 presents a comparison between methods S and Z of expected interval width. Method S tends to provide wider intervals than method Z in the presence of small event rates, with the relationship reversed for large event rates.
4. Illustrative examples

For illustrative purposes, we consider the results obtained from applying the three methods considered to 11 example data sets (Table 1). The first four examples are taken from studies published in the journal Evidence-based Medicine and presented by Bender [8]. In agreement with the results of our evaluation, all three methods provide similar sets of confidence limits in sample sizes that are relatively large, but otherwise method Z provides intervals having the narrowest width.
Table 1
The 95% confidence intervals using the Fisher z transformation (Z) as compared to the score-based hybrid method (S) and the traditional Wald method (W), as well as the interval width ratios of S to Z (WS/WZ)

x1/n1, x2/n2                  Estimate   W                     S                     Z                     WS/WZ
Real data (a)
11/67, 1/63                    0.1483    (0.0544, 0.2422)      (0.0503, 0.2555)      (0.0533, 0.2406)      1.10
148/5493, 192/5492            -0.0080    (-0.0145, -0.0015)    (-0.0145, -0.0015)    (-0.0145, -0.0015)    1.00
7/135, 1/130                   0.0442    (0.0039, 0.0845)      (0.0006, 0.0959)      (0.0038, 0.0844)      1.18
47/643, 29/640                 0.0278    (0.0020, 0.0536)      (0.0018, 0.0543)      (0.0020, 0.0535)      1.02
Artificial data
6/7, 1/7                       0.7143    (0.3477, 1.0809)      (0.1906, 0.8800)      (0.1463, 0.9281)      0.88
5/56, 0/29                     0.0893    (-0.0164, 0.1605)     (-0.0381, 0.1926)     (-0.0167, 0.1597)     1.31
55/56, 0/29                    0.9821    (0.9062, 1.0236)      (0.8423, 0.9968)      (0.8214, 0.9935)      0.90
5/5, 0/5                       1         (0.4281, 1.1719)      (0.3855, 1)           (0.0655, 0.9722)      0.68
0/100, 0/100                   0         (-0.0196, 0.0196)     (-0.0370, 0.0370)     (-0.0195, 0.0195)     1.89
0/1,000,000, 0/1,000,000       0         (±1.96 × 10^-6)       (±3.84 × 10^-6)       (±1.96 × 10^-6)       1.96
0/10, 0/100,000                0         (-0.0851, 0.1851)     (-0.00004, 0.2775)    (-0.0852, 0.1834)     1.03

The Estimate column gives the observed risk difference x1/n1 - x2/n2.
(a) See Bender [8] for references.
The remaining examples are artificial and presented only as a means of providing further insight into the properties of the competing procedures. Again, method S tends to provide narrower intervals when the risk difference is large, as can be seen in the first of these examples. The next two artificial examples deal with the case of a zero event rate in one of the two samples. Again, method S provides narrower intervals when the risk difference is large and wider intervals when the difference is small. The subsequent example illustrates a case in which the risk difference is estimated as 1.0, where method Z can produce a calculated interval that does not include the point estimate. Method S does not suffer from this drawback, although the situation rarely arises in practice. The last three examples deal with the case in which the event rates in both samples are zero. The results for the first two of these examples show that method Z tends to provide narrower intervals than method S. When n1 and n2 are dramatically different, as reflected in the last example, the two methods tend to provide confidence intervals having similar width.
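The example with 5/5 versus 0/5 events can be worked through numerically. The sketch below assumes that the boundary rule from the evaluation section (replace a proportion of 0 by $1/(2n_i)$ and a proportion of 1 by $1 - 1/(2n_i)$) was also applied when computing the Fisher z limits in Table 1; this is an inference from the reported numbers rather than something stated in the text, and the function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def adjusted_proportion(x, n):
    # Assumed boundary rule: keep the estimate away from exactly 0 or 1.
    if x == 0:
        return 1 / (2 * n)
    if x == n:
        return 1 - 1 / (2 * n)
    return x / n


def fisher_z_ci_adjusted(x1, n1, x2, n2, alpha=0.05):
    """Fisher z interval computed from boundary-adjusted proportions."""
    p1 = adjusted_proportion(x1, n1)
    p2 = adjusted_proportion(x2, n2)
    delta = p1 - p2                      # 0.9 - 0.1 for 5/5 versus 0/5
    sigma = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma / (1 - delta ** 2)
    return tanh(atanh(delta) - half), tanh(atanh(delta) + half)


lower, upper = fisher_z_ci_adjusted(5, 5, 0, 5)
print(lower, upper)
print("contains observed difference 1.0:", lower <= 1.0 <= upper)
```

Under this assumption the upper limit stays strictly below 1, which is why the unadjusted point estimate of 1.0 falls outside the interval.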
5. Discussion

The problem of constructing a confidence interval for a difference between two independent proportions has received considerable attention in the literature. Closed-form methods are naturally attractive for this purpose because of their computational simplicity. In this paper, we have proposed and evaluated a simple alternative to the score method proposed by Newcombe [6], finding that it is very competitive both in terms of percent coverage and interval width. Both methods can be applied easily in the context of routine data analysis.

A limitation of our method is that the point estimate of the risk difference $\Delta$ is not included in the resulting interval when $\hat{\Delta} = \pm 1.0$. However, the probability of such a case arising in practice is given by $p_1^{n_1}(1-p_2)^{n_2} + p_2^{n_2}(1-p_1)^{n_1}$, which is very small even in sample sizes as small as 20 subjects per group.
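The probability just given is easy to evaluate for any design. The following sketch uses a hypothetical function name and illustrative parameter values (0.5 and 0.1) that are not taken from the paper.

```python
def prob_extreme_difference(p1, p2, n1, n2):
    """Probability that the estimated risk difference equals +1 or -1.

    +1 requires x1 = n1 and x2 = 0; -1 requires x1 = 0 and x2 = n2.
    """
    prob_plus_one = p1 ** n1 * (1 - p2) ** n2
    prob_minus_one = p2 ** n2 * (1 - p1) ** n1
    return prob_plus_one + prob_minus_one


# Illustrative values only: 20 subjects per group, risks of 0.5 and 0.1.
print(prob_extreme_difference(0.5, 0.1, 20, 20))
```

For these illustrative values the probability is well below one in a million, although it grows when the two risks are close to 1 and 0, respectively.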
When this situation does arise and a confidence interval is still considered desirable, we would recommend applying method S.

An important advantage of the approach proposed here as compared to the score method is that it depends directly on the standard error of the estimated risk difference, a property which facilitates its extension to more general problems. For example, it may be readily applied to inference problems arising from cluster randomization trials, using the appropriate standard error for the estimated risk difference [18]. A further application of the method is to the binomial regression modeling of a risk difference [19]. Evaluation of these extensions is in progress.

Acknowledgements

We thank Ralf Bender and an anonymous referee for their valuable and helpful comments, which improved the paper. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada.

References

[1] Santner TJ, Snell MK. Small-sample confidence intervals for p1 − p2 and p1/p2 in 2 × 2 contingency tables. J Am Stat Assoc 1980;75:386–94.
[2] Hauck WW, Anderson S. A comparison of large-sample confidence interval methods for the difference of two binomial probabilities. Am Stat 1986;40:318–22.
[3] Agresti A. Dealing with discreteness: making ‘exact’ confidence intervals for proportions, differences of proportions and odds ratios more exact. Stat Methods Med Res 2003;12:3–21.
[4] Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med 1988;318:1728–33.
[5] Altman DG. Confidence intervals for the number needed to treat. BMJ 1998;317:1309–12.
[6] Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med 1998;17:873–90.
[7] Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc 1927;22:209–12.
[8] Bender R. Calculating confidence intervals for the number needed to treat. Control Clin Trials 2001;22:102–10.
[9] Chen X. A quasi-exact method for the confidence intervals of the difference of two independent binomial proportions in small sample cases. Stat Med 2002;21:943–56.
[10] Agresti A, Caffo B. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Am Stat 2000;54:280–88.
[11] Pan W. Approximate confidence intervals for one proportion and difference of two proportions. Comput Stat Data Anal 2002;40:143–57.
[12] Satterthwaite FE. Synthesis of variance. Psychometrika 1941;6:309–16.
[13] Lachin JM. Biostatistical Methods: The Assessment of Relative Risks. New York: Wiley; 2000.
[14] Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 1915;10:507–21.
[15] Simonoff JS, Hochberg Y, Reiser B. Response to Brownie’s reader reaction. Biometrics 1988;44:621.
[16] Newcombe RG. Improved confidence intervals for the difference between binomial proportions based on paired data. Stat Med 1998;17:2635–50.
[17] Donner A, Zou G. Interval estimation for a difference between intraclass kappa statistics. Biometrics 2002;58:209–15.
[18] Donner A, Klar N. Confidence interval construction for effect measures arising from cluster randomization trials. J Clin Epidemiol 1993;46:123–31.
[19] Wacholder S. Binomial regression in GLIM: estimating risk ratios and risk differences. Am J Epidemiol 1986;123:174–84.