Computer Methods and Programs in Biomedicine (2004) 74, 261—265
A computer program to estimate power and relative efficiency to assess multiplicative interactions in flexibly matched case-control studies
T. Stürmer a,*, O. Gefeller b, H. Brenner a
a Department of Epidemiology, German Centre for Research on Ageing at the University of Heidelberg, Bergheimer Strasse 20, Heidelberg 69115, Germany
b Department of Medical Informatics, Biometry, and Epidemiology, University of Erlangen–Nuremberg, Waldstr. 6, Erlangen 91054, Germany
Received 31 January 2003; received in revised form 29 July 2003; accepted 6 August 2003
KEYWORDS Case-control studies; Efficiency; Sample size; Epidemiologic methods; Research design
Summary
Matching on one of two possibly interacting exposures can increase power and efficiency to estimate multiplicative interactions in case-control studies. We recently introduced the concept of flexible matching strategies with varying proportions of a dichotomous matching factor among controls to further increase power and efficiency. To facilitate the application of this concept, we developed a computer program which provides estimates of power and efficiency while varying the proportion of the matching factor in controls over all values from 1 to 99%. The program allows one to estimate the effect of frequency matching on power and efficiency to study multiplicative interactions and to assess the optimal prevalence of the matching factor in selected controls for a given scenario, which often differs from the prevalence in cases (aimed at in traditional ‘fixed’ frequency matching). Our program will strongly facilitate assessing the benefits of matching and flexible matching strategies in case-control studies addressing multiplicative interactions, including gene–environment interactions. © 2003 Elsevier Ireland Ltd. All rights reserved.
*Corresponding author. Present address: Brigham and Women’s Hospital, Division of Pharmacoepidemiology and Pharmacoeconomics, 1620 Tremont Street, Suite 3030, Boston MA 02120, USA. Tel.: +1-617-278-0627; fax: +1-617-232-8602. E-mail address: [email protected] (T. Stürmer).
1. Introduction
Frequency matching is a popular design option to improve the power of case-control studies [1]. The benefits of matching may be particularly pronounced in studies assessing multiplicative interactions, including gene–environment interactions [2], which often lack adequate power. However, unmatched and frequency matched studies are but two distinct possibilities of control selection according to the prevalence of a matching factor. We recently introduced the concept of flexible matching strategies, which allow the proportion of the matching factor among controls to vary over a wide range between and beyond the proportion among cases (matched design) and among the population (unmatched design). We found that traditional ‘fixed’ frequency matching is often suboptimal
0169-2607/$ — see front matter © 2003 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2003.08.003
with respect to gain in power and efficiency and that a larger gain can often be achieved with a different prevalence of the matching factor among controls [3,4]. To our knowledge, however, no software has so far been available to calculate power and relative efficiency of frequency matched case-control studies assessing multiplicative interactions, either for ‘fixed’ matching or for flexible matching. The aim of this paper is to present convenient software that lets the epidemiologic user evaluate prevalences of a dichotomous and possibly interacting exposure, i.e. the matching factor, in selected controls ranging from 1 to 99% with respect to their influence on power and efficiency to estimate multiplicative interactions in case-control studies. These evaluations allow one to find the optimum prevalence of the environmental exposure in selected controls and to assess the expected losses with specific non-optimal prevalences.
2. Computational methods and theory
The program provides the power and relative efficiency of case-control studies assessing the multiplicative interaction between two dichotomous exposures with respect to risk for disease. To avoid confusion about which of the two exposures is the matching factor, we will use the example of a gene–environment interaction with an environmental exposure (the matching factor) and a marker of genetic susceptibility (which is unknown at the time of recruitment) from now on.
3. Program description
The program is written as a SAS macro to allow easy implementation and widespread use. The source code of the SAS macro, called ‘pe0int.sas’, is open code under the conditions of the GNU GPL license [5]. Consequently, the source code can be easily adapted by future users to their needs. The macro can be downloaded free of charge from the statistical archive network maintained by the Department of Medical Informatics, Biometry and Epidemiology at the University of Erlangen–Nuremberg (http://www.imbe.med.uni-erlangen.de/issan/issan.htm) or obtained on diskette by written request to the first author. Extensive validation efforts have been made to verify the correctness of the macro.
The following parameters need to be specified when the macro is invoked (see Table 1): the prevalence of the genetic susceptibility (PG) and of the environmental exposure (the matching factor, PE) in the population, the odds ratio for the association between the genetic susceptibility and the environmental exposure in the population (ORGE), the odds ratio of the exposure–disease association in non-susceptibles (ORED|g), the multiplicative gene–environment interaction (INT), the odds ratio for the genetic susceptibility–disease association in unexposed (ORGD|e), the number of cases (NCASES), and the control-to-case ratio (CCRATIO).
Using these parameters, expected cell counts for cases and controls are calculated for situations in which the marginal prevalence of the matching factor in selected controls is varied from 1 to 99% in steps of 1% according to the methodology previously presented [4]. For all situations, the expected variance of the multiplicative gene–environment interaction is then calculated assuming independence of the odds ratios for the environmental exposure–disease association in both strata of genetic susceptibility. The relative efficiency of estimation in percent, compared to the unmatched design, is obtained by dividing the variance in the unmatched study by the variance in the study with the corresponding prevalence of the matching factor (×100). The
Table 1 Parameters to be specified when the macro is invoked
Notation   Meaning
PG         Prevalence of the genetic susceptibility in the population
PE         Prevalence of the environmental exposure (the matching factor) in the population
ORGE       Odds ratio of genetic susceptibility–exposure association in the population
ORED|g     Odds ratio of exposure–disease association in non-susceptibles
INT        Multiplicative interaction (odds ratio of exposure–disease association in susceptibles (ORED|G) / odds ratio of exposure–disease association in non-susceptibles (ORED|g))
ORGD|e     Odds ratio of genetic susceptibility–disease association in unexposed
NCASES     Number of cases
CCRATIO    Control-to-case ratio
Table 2 Parameters included in the program output
Notation   Meaning
PE0        Marginal prevalence of the matching factor in selected controls
PE1        Marginal prevalence of the matching factor in cases
Nijk       Expected cell counts, where i = 1 (i = 0) for cases (controls), j = 1 (j = 0) for susceptibles (unsusceptibles), and k = 1 (k = 0) for exposed (unexposed)
Power      Expected power
Powerp     Expected power gain in percent of the maximum possible gain by flexible matching
V ln INT   Expected variance of the logarithm of the interaction
RE         Expected relative efficiency compared to the unmatched design (100%)
REp        Expected efficiency gain in percent of the maximum possible gain by flexible matching
power is determined as follows: the standard error of the interaction under the null hypothesis is derived from the test statistic evaluating homogeneity of the odds ratios for the environmental exposure–disease association in individuals with and without the genetic susceptibility according to the formula developed by Woolf [6]. This test-based standard error is multiplied by 1.96 to find the cutpoint under the null hypothesis of no multiplicative interaction with a two-sided significance level of 0.05. The distance of this cutpoint to the logarithm of the expected interaction is then divided by the square root of the expected variance of this interaction as mentioned above yielding a new statistic that follows asymptotically a standard normal distribution. Finally, the power corresponds to one minus the area under the standard normal curve below the actual value of this new statistic. The following information is then provided in the output (see Table 2): marginal prevalence of the environmental exposure in controls (PE0 ) and cases (PE1 ), expected cell counts Nijk where i = 1 (i = 0) for cases (controls), j = 1 (j = 0) for susceptibles (unsusceptibles), and k = 1 (k = 0) for exposed (unexposed), power (Power), power gain in percent of the maximum possible gain by flexible matching (Powerp), variance of the logarithm of interaction (V ln INT), relative efficiency (RE), and efficiency gain in percent of the maximum possible gain by flexible matching (REp).
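The calculation steps described above can be sketched in Python (the macro itself is written in SAS; this is an illustration, not the authors' code). All function and argument names are hypothetical. The sketch assumes that case cells are proportional to the population cell proportions times the corresponding odds ratios, that controls within each exposure stratum carry the population distribution of the genetic susceptibility, and that the variance of ln INT is the sum of the reciprocal expected cell counts. As a further simplification, the null standard error is set equal to the square root of this expected variance, whereas the macro derives it from the homogeneity test statistic.

```python
from math import log, sqrt
from statistics import NormalDist


def population_cells(pg, pe, or_ge):
    """Joint population proportions keyed by (g, e), g = susceptibility, e = exposure."""
    if or_ge == 1.0:
        p11 = pg * pe  # independence of the two factors
    else:
        # Solve the quadratic implied by the odds ratio OR_GE for the G+E+ cell.
        a = or_ge - 1.0
        b = 1.0 + (pg + pe) * a
        p11 = (b - sqrt(b * b - 4.0 * or_ge * a * pg * pe)) / (2.0 * a)
    return {(1, 1): p11, (1, 0): pg - p11,
            (0, 1): pe - p11, (0, 0): 1.0 - pg - pe + p11}


def expected_cells(pg, pe, or_ge, or_ed_g, inter, or_gd_e,
                   n_cases, cc_ratio, pe0):
    """Expected counts keyed by (i, j, k): i = case status, j = susceptibility, k = exposure."""
    pop = population_cells(pg, pe, or_ge)
    odds = {(0, 0): 1.0, (1, 0): or_gd_e, (0, 1): or_ed_g,
            (1, 1): or_gd_e * or_ed_g * inter}
    weights = {cell: pop[cell] * odds[cell] for cell in pop}
    total = sum(weights.values())
    cells = {}
    for (g, e), w in weights.items():
        cells[(1, g, e)] = n_cases * w / total
        # Assumption: within an exposure stratum, controls follow P(g | e) of the population.
        stratum_prev = pe if e else 1.0 - pe
        control_share = pe0 if e else 1.0 - pe0
        cells[(0, g, e)] = n_cases * cc_ratio * control_share * pop[(g, e)] / stratum_prev
    return cells


def var_ln_int(cells):
    """Variance of ln(INT) as the sum of reciprocal expected cell counts."""
    return sum(1.0 / n for n in cells.values())


def power(variance, inter, z_alpha=1.96):
    """Power as described in the text; assumes inter > 1 and null SE = sqrt(variance)."""
    se_null = sqrt(variance)  # simplification, see lead-in
    z = (z_alpha * se_null - log(inter)) / sqrt(variance)
    return 1.0 - NormalDist().cdf(z)


params = dict(pg=0.2, pe=0.1, or_ge=1.0, or_ed_g=2.0, inter=3.0,
              or_gd_e=1.0, n_cases=400, cc_ratio=1.0)
v_unmatched = var_ln_int(expected_cells(**params, pe0=0.10))
v_matched = var_ln_int(expected_cells(**params, pe0=0.24))
re_matched = 100.0 * v_unmatched / v_matched
power_unmatched = power(v_unmatched, params["inter"])
```

The parameter values in `params` are those of the example in Section 4, so the resulting variances, relative efficiency and power can be compared with the corresponding rows of Table 3. Agreement of the power values depends on the simplified null standard error and should not be expected in general.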
4. Sample of program run
To illustrate the application of the program, we use a scenario with a prevalence of the genetic susceptibility and of the environmental exposure of 20 and 10%, respectively, independence between the genetic susceptibility and the environmental exposure in the population (ORGE = 1.0), an odds ratio of the exposure–disease association in non-susceptibles (ORED|g) of 2.0, a multiplicative gene–environment interaction (INT) of 3.0 (equivalent to an odds ratio of the exposure–disease association in susceptibles (ORED|G) of 6.0), no effect of the genetic susceptibility per se on disease risk (corresponding to an odds ratio of the genetic susceptibility–disease association in the absence of exposure (ORGD|e) of 1.0), 400 cases, and a control-to-case ratio (CCRATIO) of 1.
In the scenario presented, the unmatched study corresponds to a study in which the prevalence of the environmental exposure in selected controls (PE0) equals the prevalence in the population (PE), i.e. 10% (see Table 3). The power of this unmatched study design is 62%. The relative efficiency (RE) of this design compared to the unmatched design is by definition 100%, and the percent gain in power (Powerp) and efficiency (REp) compared to the maximum possible gain are both zero. In a ‘fixed’ frequency matched study, the prevalence of the environmental exposure in selected controls (PE0) equals 24%, i.e. the prevalence of the environmental exposure in cases (PE1). The power increases to 81% and the relative efficiency is 159%. Maximum power (87%) and relative efficiency (188%) are, however, obtained with a prevalence of the matching factor in selected controls between approximately 40 and 60%.
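The relative efficiency and percent-gain figures quoted for this example can be retraced from the variances listed in Table 3. The following Python sketch is only an illustration with hypothetical names. It takes the V ln INT column as given and applies the definitions stated in the text: RE is the unmatched variance divided by the design variance (×100), and REp is read here as the gain over the unmatched design expressed as a percentage of the largest gain in the table.

```python
# V ln INT by prevalence of the matching factor in controls (PE0), copied from Table 3.
V_LN_INT = {
    0.01: 1.6418, 0.10: 0.2371, 0.24: 0.1492, 0.36: 0.1313, 0.39: 0.1292,
    0.40: 0.1286, 0.45: 0.1266, 0.50: 0.1260, 0.55: 0.1266, 0.60: 0.1286,
    0.61: 0.1292, 0.70: 0.1379, 0.80: 0.1612, 0.90: 0.2371, 0.99: 1.6418,
}
PE_POPULATION = 0.10  # PE0 equal to PE corresponds to the unmatched design


def relative_efficiency(variance, variance_unmatched):
    """Relative efficiency in percent compared to the unmatched design."""
    return 100.0 * variance_unmatched / variance


def percent_of_max_gain(value, baseline, best):
    """Gain over the baseline as a percentage of the maximum possible gain."""
    return 100.0 * (value - baseline) / (best - baseline)


v_unmatched = V_LN_INT[PE_POPULATION]
re = {pe0: relative_efficiency(v, v_unmatched) for pe0, v in V_LN_INT.items()}
re_best = max(re.values())
rep = {pe0: percent_of_max_gain(r, 100.0, re_best) for pe0, r in re.items()}

# Prevalence in controls with the smallest variance among the listed rows.
best_pe0 = min(V_LN_INT, key=V_LN_INT.get)

for pe0 in (0.10, 0.24, 0.50):
    print(f"PE0={pe0:.2f}  RE={re[pe0]:.0f}%  REp={rep[pe0]:.0f}%")
```

Because only the abbreviated rows of Table 3 are used, `best_pe0` is the best of the listed prevalences rather than the result of the full 1 to 99% scan performed by the macro.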
5. Discussion
Our previous observations regarding possible gains in power and efficiency by using frequency matching [2] and flexible matching strategies [4] have demonstrated that these approaches offer a large potential to increase power and efficiency to assess multiplicative gene–environment interactions in case-control studies. We now present well-documented and easy-to-use software that allows one to conveniently assess the advantages of matching and flexible matching strategies in case-control studies. These advantages are likely
Table 3 Abbreviated program output for parameter values of example
PE0   PE1   N000  N010  N001  N011  Power  Powerp  V ln INT  RE   REp
0.01  0.24  317   79    3     1     14     −188    1.6418    14   −97
0.10  0.24  288   72    32    8     62     0       0.2371    100  0
0.24  0.24  243   61    77    19    81     76      0.1492    159  67
0.36  0.24  205   51    115   29    86     95      0.1313    181  91
0.39  0.24  195   49    125   31    86     97      0.1292    184  95
0.40  0.24  192   48    128   32    87     97      0.1286    184  96
0.45  0.24  176   44    144   36    87     99      0.1266    187  99
0.50  0.24  160   40    160   40    87     100     0.1260    188  100
0.55  0.24  144   36    176   44    87     99      0.1266    187  99
0.60  0.24  128   32    192   48    87     97      0.1286    184  96
0.61  0.24  125   31    195   49    86     97      0.1292    184  95
0.70  0.24  96    24    224   56    84     88      0.1379    172  82
0.80  0.24  64    16    256   64    78     65      0.1612    147  53
0.90  0.24  32    8     288   72    62     −0      0.2371    100  −0
0.99  0.24  3     1     317   79    14     −188    1.6418    14   −97
Information in the output: marginal prevalence of the environmental exposure (the matching factor) in controls (PE0 ) and cases (PE1 ); expected cell counts: Nijk where i = 1 (i = 0) for cases (controls), j = 1 (j = 0) for susceptibles (unsusceptibles), and k = 1 (k = 0) for exposed (unexposed); expected cell frequencies of cases are omitted from the table since they are independent of PE0 : N100 = 244, N110 = 61, N101 = 54, N111 = 41; power (Power), percent power gain (Powerp); variance of the logarithm of the interaction (V ln INT); relative efficiency (RE); and percent efficiency gain (REp). Parameters: PG = 0.2; PE = 0.1; ORGE = 1; ORED|g = 2; INT = 3; ORGD|e = 1; NCASES = 400; CCRATIO = 1.
to be most relevant for gene–environment interactions, but extend to the assessment of multiplicative interactions in general. The main disadvantages of matching are (i) the loss of the ability to assess the association between the matching factor and the disease [7], (ii) the loss of the ability to use scales other than the multiplicative one to determine interaction [8], (iii) the additional complexity and cost of control sampling [9], and (iv) the necessity to control all analyses for the matching variable [10]. Unless the prevalence of the matching factor in selected controls equals the one in the population (corresponding to an unmatched study), these disadvantages are likely to be largely independent of the prevalence of the environmental exposure in selected controls over a wide range. Therefore, optimum power and efficiency should be used as the main criteria for matching rather than the prevalence of the environmental exposure observed in cases. As we demonstrated previously [3,4], the use of the prevalence observed in cases to sample matched controls, although routinely applied in ‘fixed’ frequency matching, is not ideal with respect to power and efficiency in many realistic situations. Furthermore, it does not seem to be justified by theoretical or mathematical concepts. While the computer program provided in this paper offers, for the first time, convenient assessment of power and efficiency of flexibly matched
case-control studies assessing gene–environment interactions, the following limitations should be kept in mind. The calculations of power and relative efficiency provided in the program are based on large sample approximations. To assess the validity of these approximations, we compared the results with estimates obtained from 10,000 simulations for each scenario. We found that for the range of parameters covered in our simulation study [4], values of power and relative efficiency obtained from calculations were very close to those obtained from simulations. So far, the program is applicable for a dichotomous exposure and dichotomous classification of the genetic susceptibility. Future developments should include extensions to more complex situations, including multi-level or continuous exposures, or more complex definitions of genetic susceptibility, such as separate categorization of homozygotes and heterozygotes. The program presented here can be easily adapted and modified by potential users, and may therefore be a useful starting point for such extensions. In conclusion, we offer access to computer software which should help to optimize frequency matching for an environmental exposure in case-control studies assessing multiplicative gene–environment interactions. Our program will strongly facilitate assessing the benefits of flexible matching strategies in the design phase of
case-control studies as well as during the recruitment of controls.
References
[1] O. Gefeller, A. Pfahlberg, H. Brenner, J. Windeler, An empirical investigation on matching in published case-control studies, Eur. J. Epidemiol. 14 (1998) 321–325.
[2] T. Stürmer, H. Brenner, Potential gain in efficiency and power to detect gene–environment interactions by matching in case-control studies, Genet. Epidemiol. 18 (2000) 63–80.
[3] T. Stürmer, H. Brenner, Degree of matching and gain in power and efficiency in case-control studies, Epidemiology 12 (2001) 101–108.
[4] T. Stürmer, H. Brenner, Flexible matching strategies to increase power to detect and efficiency to estimate gene–environment interactions in case-control studies, Am. J. Epidemiol. 155 (2002) 593–602.
[5] Free Software Foundation, The GNU General Public License, accessed 12 December 2002 (http://www.gnu.org/licenses/licenses.html).
[6] N.E. Breslow, N.E. Day, Statistical Methods in Cancer Research, The Analysis of Case-Control Data, IARC, Lyon, vol. I, 1980.
[7] K.J. Rothman, S. Greenland (Eds.), Modern Epidemiology, second ed., Raven, Philadelphia, 1998.
[8] S. Greenland, Tests for interaction in epidemiologic studies: a review and a study of power, Stat. Med. 2 (1983) 243–251.
[9] W.D. Thompson, J.L. Kelsey, S.D. Walter, Cost and efficiency in the choice of matched and unmatched case-control study designs, Am. J. Epidemiol. 116 (1982) 840–851.
[10] O.S. Miettinen, Matching and design efficiency in retrospective studies, Am. J. Epidemiol. 91 (1970) 111–118.