J Chron Dis Vol. 40, No. 12, pp. 1141-1143, 1987
Printed in Great Britain
0021-9681/87 $3.00 + 0.00 Pergamon Journals Ltd
Letter to the Editors

SAMPLE SIZE DETERMINATION IN CASE-CONTROL STUDIES

In a recent article, McKeown-Eyssen and Thomas [1] derived equations for sample size in case-control studies with a continuous exposure variable, using the t-test of means and two models of disease risk (linear and exponential). In a letter to the editor [2], Rao generalized the results to an arbitrary distribution of exposure $f(x)$ when the risk function is from the exponential family, $r(x) = \exp(a + bx)$, which permits use of results on generating functions. However, two major problems remain: the ambiguous definition of $r(x)$ as the "rate of disease among those exposed at level x", and confusion between the total population and the sampled cases and controls. Unfortunately these faults affect all of the results in the original paper. The results of the derivation by Rao are still applicable, but in a different manner. In this note the general framework is presented and the example of a linear risk function is treated; then the formulas of Rao are applied to the general framework; and finally an alternative function for $r(x)$ is proposed.

Let us begin again. Suppose an exposure variable $X$ has p.d.f. $f(x)$ in the population, with mean $\mu_0$ and variance $\sigma_0^2$. Let a variable $Y$ define whether an individual becomes diseased ($Y = 1$, a case) or not ($Y = 0$, a control). Let $r(x)$ denote the risk (or probability) of disease given exposure level $x$; then $1 - r(x)$ is the probability of not contracting the disease, i.e.

$$r(x) = \Pr(Y = 1 \mid X = x), \qquad 0 < r(x) < 1 \quad \forall x.$$

Here $r(0)$ is the probability of becoming diseased among non-exposed individuals.

To derive expressions for the mean exposures for cases and controls, denote the proportion diseased in the population as $A$ and the proportion nondiseased as $1 - A$:

$$A = \int_0^\infty f(x)\, r(x)\, dx.$$

The p.d.f. of exposure in cases is therefore $g(x) = f(x) r(x)/A$, and the mean and variance of exposure for cases are:

$$\mu_1 = \frac{\int_0^\infty x f(x) r(x)\, dx}{A} \qquad \text{and} \qquad \sigma_1^2 = \frac{\int_0^\infty x^2 f(x) r(x)\, dx}{A} - \mu_1^2.$$

(This agrees with corrected A.2 and A.3.) Note that in general $f(x)$ is discontinuous at 0 since $f(0)$ represents the proportion unexposed. Similarly, the p.d.f. of exposure in controls is $h(x) = f(x)[1 - r(x)]/(1 - A)$, and the mean and variance of exposure for controls are thus:

$$\mu_2 = \frac{\int_0^\infty x f(x)[1 - r(x)]\, dx}{1 - A} \qquad \text{and} \qquad \sigma_2^2 = \frac{\int_0^\infty x^2 f(x)[1 - r(x)]\, dx}{1 - A} - \mu_2^2.$$

In the original paper and the comment of Rao, the confusion arose in comparing $\mu_1$ and $\mu_0$, which are the mean exposures for cases and the whole population respectively; the appropriate comparison is rather between $\mu_1$ and $\mu_2$, the mean exposures of cases and controls respectively. The sample size equation for comparison of mean exposures is:

$$n = \frac{(t_\alpha + t_\beta)^2\, 2\sigma_0^2}{(\mu_1 - \mu_2)^2}$$

where $\sigma_0^2$ is used since the null hypothesis of equal means usually implicitly assumes that $\sigma_1^2 = \sigma_2^2 = \sigma_0^2$ as well, i.e. that cases and controls are drawn randomly from the same population.

For the linear risk $r(x) = a + bx$ we find $A = a + b\mu_0$ and
$$\mu_1 = \frac{\int_0^\infty x f(x)(a + bx)\, dx}{A} = \frac{\mu_0(a + b\mu_0) + b\sigma_0^2}{A}$$

and

$$\mu_2 = \frac{\int_0^\infty x f(x)[1 - a - bx]\, dx}{1 - A} = \frac{\mu_0(1 - a - b\mu_0) - b\sigma_0^2}{1 - A}.$$

Therefore,

$$\mu_1 - \mu_2 = \frac{b\sigma_0^2}{(a + b\mu_0)(1 - a - b\mu_0)}$$

and

$$n = \frac{2\sigma_0^2 (t_\alpha + t_\beta)^2 \left[(a + b\mu_0)(1 - a - b\mu_0)\right]^2}{b^2 \sigma_0^4}$$

which is very different from equation (A.9). Using the values of $\mu_0$, $\sigma_0$, $\alpha$, $\beta$, $a$, and $b$ given for colon cancer in the paper, the corrected values of sample sizes for the linear model are as shown in Table 1; these are consistently larger (by 20 to 65%) than those presented in the original paper.

Table 1. Sample sizes based on parameters of linear model from international data*

Data source                     Corrected sample size
Canada
  Calgary & Toronto
    Men                         108
    Women                        77
  Guelph
    Men                          98
  Toronto
    Men                         111
    Women                        61
Scandinavia
  Men from
    Them                        314
    Parikkala                   348
    Helsinki                    206
    Copenhagen                  335
Scotland
  Edinburgh
    Men                         666
    Women                       890

*This corrects column 3 of the original table.

To modify the development of Rao for this framework, let $r(x) = \exp(a + bx)$; then $A = \exp(a) M_X(b)$, where

$$M_X(t) = \int_0^\infty f(x) \exp(tx)\, dx$$

and

$$\mu_1 = \frac{M_X'(b)}{M_X(b)}$$

with $M'$ denoting the derivative. But now, to extend this to controls also, recall that $h(z) = f(z)(1 - r(z))/(1 - A)$. Thus

$$M_Z(t) = \frac{1}{1 - A} \int_0^\infty f(z)(1 - r(z)) \exp(tz)\, dz$$

$$= \frac{1}{1 - A} \int_0^\infty f(z)(1 - \exp(a + bz)) \exp(tz)\, dz$$

$$= \frac{1}{1 - \exp(a) M_X(b)} \left[ M_X(t) - \exp(a) M_X(t + b) \right]$$

so

$$\mu_2 = \left[ \frac{\partial M_Z(t)}{\partial t} \right]_{t=0} = \frac{1}{1 - \exp(a) M_X(b)} \left[ M_X'(0) - \exp(a) M_X'(b) \right].$$

A check is provided since

$$\mu_0 = A\mu_1 + (1 - A)\mu_2 = M_X'(0),$$

i.e. the mean exposure in the population is the weighted average of the mean exposures of cases and controls. Substituting the expressions for $A$, $\mu_1$, and $\mu_2$, $M_X'(0)$ is found. The quantity of interest for sample size is $\mu_1 - \mu_2$ (not $\mu_1 - \mu_0$). In this development, substituting and simplifying gives

$$\mu_1 - \mu_2 = \frac{M_X'(b) - M_X(b) M_X'(0)}{M_X(b)\left(1 - \exp(a) M_X(b)\right)}.$$

With these modifications, the special cases for $r(x)$ given by Rao can be applied.

However, within this framework there is a problem with both the linear and exponential models: the risk functions proposed are not necessarily bounded by 0 and 1. One could restrain the coefficients $a$ and $b$ so that $\exp(a + bx) < 1$. If the maximum value of $x$ is $w$, this is accomplished (for $b > 0$) by setting $a < -bw$. But this greatly restricts the flexibility of the model, which was the original reason for introducing it. A better choice is the logistic function, $r(x) = 1/(1 + \exp(a + bx))$, which is always in the interval (0, 1). This puts the matter in the more appropriate context of logistic regression analysis in case-control studies, where the theoretical work is done [3]. Though exact expressions for sample size are not available for this model, since iterative solutions are necessary, Whittemore [4] has presented approximations which can be applied.
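As a numerical check on the framework of this letter, the following minimal sketch evaluates the proportion diseased and the mean exposures of cases and controls by quadrature and compares them with the closed-form differences for the linear and exponential risk functions. Everything concrete here is an assumption made for illustration and does not come from the original paper: the uniform exposure density on [0, w] with no point mass at zero, the parameter values, the helper names (`integrate`, `case_control_moments`, `sample_size`), and the use of normal deviates with a two-sided alpha in place of the t-values.

```python
import math
from statistics import NormalDist


def integrate(g, lo, hi, n=2000):
    """Composite Simpson's rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3


def case_control_moments(f, r, lo, hi):
    """Return (A, mu1, mu2): proportion diseased, mean exposure of cases,
    and mean exposure of controls, for density f and risk function r."""
    A = integrate(lambda x: f(x) * r(x), lo, hi)
    mu1 = integrate(lambda x: x * f(x) * r(x), lo, hi) / A
    mu2 = integrate(lambda x: x * f(x) * (1 - r(x)), lo, hi) / (1 - A)
    return A, mu1, mu2


def sample_size(mu1, mu2, var0, alpha=0.05, power=0.80):
    """n = 2 * var0 * (z_alpha + z_beta)^2 / (mu1 - mu2)^2.
    Assumption: normal deviates, two-sided alpha, stand in for t-values."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)
    z_b = z.inv_cdf(power)
    return 2 * var0 * (z_a + z_b) ** 2 / (mu1 - mu2) ** 2


# Hypothetical exposure: uniform on [0, w], so mu0 = w/2 and var0 = w^2/12.
w = 10.0
f = lambda x: 1.0 / w
mu0, var0 = w / 2, w * w / 12

# Linear risk r(x) = a + b*x (illustrative values; r stays inside (0, 1)).
a, b = 0.01, 0.002
A, mu1, mu2 = case_control_moments(f, lambda x: a + b * x, 0.0, w)
diff_linear_closed = b * var0 / ((a + b * mu0) * (1 - a - b * mu0))
n_linear = sample_size(mu1, mu2, var0)

# Exponential risk r(x) = exp(a_e + b_e*x), checked against the
# generating-function expression; for the uniform density,
# M(t) = (exp(t*w) - 1) / (t*w) and M'(0) = mu0.
a_e, b_e = math.log(0.01), 0.1
M = lambda t: (math.exp(t * w) - 1) / (t * w)
dM = lambda t: (t * w * math.exp(t * w) - math.exp(t * w) + 1) / (t * t * w)
A_e, mu1_e, mu2_e = case_control_moments(
    f, lambda x: math.exp(a_e + b_e * x), 0.0, w
)
diff_exp_closed = (dM(b_e) - M(b_e) * mu0) / (
    M(b_e) * (1 - math.exp(a_e) * M(b_e))
)

print("linear:      numeric", mu1 - mu2, "closed form", diff_linear_closed)
print("exponential: numeric", mu1_e - mu2_e, "closed form", diff_exp_closed)
print("sample size for the linear example:", n_linear)
```

A bounded exposure range is used deliberately: with an unbounded density, the linear and exponential risk functions would exceed 1 for large exposures, which is the difficulty raised in the closing paragraph of the letter.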
REFERENCES

1. McKeown-Eyssen GE, Thomas T. Sample size determination in case control studies: the influence of the distribution of exposure. J Chron Dis 1985; 38: 559-568.
2. Rao RR. Sample size determination in case control studies: the influence of the distribution of exposure. Letter to the Editor. J Chron Dis 1986; 39: 941-943.
3. Prentice R, Pyke R. Logistic disease incidence models and case control studies. Biometrika 1979; 66: 403-411.
4. Whittemore AS. Sample size for logistic regression with small response probability. J Am Stat Assoc 1981; 76(373): 27-32.
STAN BECKER
Division of Reproductive Health
Center for Health Promotion and Education
Centers for Disease Control
Atlanta, GA 30333
U.S.A.