Addictive Behaviors 32 (2007) 187 – 193
Short Communication
The use of GEE for analyzing longitudinal binomial data: A primer using data from a tobacco intervention

Ji-Hyun Lee *,1, Thaddeus A. Herzog, Cathy D. Meade, Monica S. Webb, Thomas H. Brandon

H. Lee Moffitt Cancer Center and Research Institute, The University of South Florida, Tampa, Florida, USA
Abstract

Longitudinal study designs are common in addictive behaviors research, as researchers have focused increasingly on how various explanatory variables affect responses over time. In particular, such designs are used in intervention studies that have multiple follow-up points. These designs typically involve repeated measurement of participants’ responses, and thus correlation within each participant is expected. Correct inferences can be obtained only by taking into account this within-participant correlation between repeated measurements, which can complicate the analysis of longitudinal data. In recent years, the generalized estimating equations (GEE) approach has become a standard method for analyzing non-normal longitudinal data, yet it is often not used by addiction researchers. The goal of this article is to provide behavioral researchers with an overview of the GEE approach for analyzing correlated binary data, using data from an intervention study on the prevention of relapse to tobacco smoking.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Statistical analysis; Binary longitudinal data; GEE; Tobacco relapse; Relapse-prevention-intervention
* Corresponding author. Biostatistics Core, H. Lee Moffitt Cancer Center and Research Institute, University of South Florida, 12902 Magnolia Dr., Tampa, FL 33612, U.S.A. E-mail address: [email protected] (J.-H. Lee).
1 The more detailed manuscript of this article is available upon request to the first author.
0306-4603/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.addbeh.2006.03.030

1. Introduction

It is common in addictive behaviors research to have longitudinal data in the form of repeated measurements on the same participants over time. For instance, consider the data from a study of smoking relapse prevention (Brandon, Meade, Herzog, Chrisikos, Webb, & Cantor, 2004), which will be further analyzed in the current article. In the study, the response variable of interest was point-prevalence smoking abstinence, and participants were assessed on whether or not they had smoked any cigarettes over the 7 days preceding follow-up points of 12, 18, and 24 months postintervention.

Such longitudinal studies typically have been analyzed using a series of univariate analyses at specific time points, such as the end of treatment and various follow-up points. These classical analyses, although simple to implement and interpret, do not model the pattern of the response variable across time. In other words, the classical approaches are insufficient when the study asks not only the usual scientific question of how the mean response differs across treatment conditions, but also how the mean response changes over time and how that change differs by condition.

Beginning with Liang and Zeger (1986), the method of generalized estimating equations (GEE) has quickly become a standard approach for analyzing correlated data obtained in longitudinal studies. Although there have been many papers explaining the theoretical basis for the GEE approach, there are few practical introductions to GEE application for behavioral researchers. This paper provides an overview of the GEE method for longitudinal data and describes the application of this method to a randomized trial of a smoking relapse-prevention-intervention (Brandon et al., 2004), wherein the repeatedly measured response variable was binary.
2. A review of the GEE method

2.1. Background

When we use a classical regression model, a single observation of the response variable is obtained for each observational unit (e.g., a human participant). In this case, one of the fundamental assumptions of standard statistical modeling is independence between observations. When a study has a longitudinal design, however, this independence assumption is no longer met, because the data consist of repeated observations over time on the same set of units, and data obtained from the same unit can be closely correlated. If the correlation of data within a unit is ignored, the standard errors may be incorrect, resulting in invalid hypothesis tests and confidence intervals.

In addition, an important consideration in the statistical modeling of correlated data concerns the type of outcome variable (i.e., continuous or categorical). When longitudinal data are normally distributed, it is feasible to assume a multivariate normal probability distribution as the basis for statistical inference. As a result, methods for correlated continuous data are the most developed (e.g., repeated measures ANOVA and random effects models). For non-normally distributed categorical outcomes, however, the resulting probability models can be so complex that fitting them to real data may be a practical challenge.

In 1986, Liang and Zeger proposed the GEE approach to deal with this impractical probability distribution in handling correlated responses. The approach does not require specification of a full probability model, but rather requires only the mean and covariance structure of the responses. They introduced the concept of a “working correlation” structure to account for the correlation within each unit, which must be specified for each analysis.
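To make the cost of ignoring within-unit correlation concrete, the following minimal sketch relies on a standard result that is not part of the original text: for t equally correlated observations with common variance sigma2 and pairwise correlation rho, the variance of their mean is (sigma2 / t) * (1 + (t - 1) * rho). The function name and the example values (t = 3, rho = 0.5) are hypothetical and chosen only for illustration.

```python
def variance_of_unit_mean(sigma2, t, rho):
    """Variance of the mean of t equicorrelated observations.

    sigma2: common variance of each observation
    t:      number of repeated observations on the unit
    rho:    pairwise correlation between observations on the unit
    """
    return sigma2 / t * (1.0 + (t - 1) * rho)


# Hypothetical values: 3 repeated measurements with unit variance.
naive_variance = variance_of_unit_mean(1.0, 3, 0.0)   # independence assumed
actual_variance = variance_of_unit_mean(1.0, 3, 0.5)  # correlation of 0.5

# Ratio of the variance under correlation to the variance under independence.
inflation = actual_variance / naive_variance
```

With these illustrative values the variance under correlation is twice the variance computed under independence, so a standard error that ignores the correlation would be too small in this setting.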
2.2. Longitudinal logistic regression model with GEE

The GEE model for a binary response (having the value 0 or 1, depending on whether or not an event of interest occurs) is an extension of the standard logistic regression model from the generalized linear model approach (McCullagh & Nelder, 1989). The extension appears in the subscript j for time, which indicates that the same participants can be measured repeatedly over time. Let $Y_{ij}$ denote the response from participant i at time j, for i = 1, ..., N and j = 1, ..., t. Further, define $\mu_{ij} = \Pr(Y_{ij} = 1)$ as the marginal mean of $Y_{ij}$. It is the probability of observing the event of interest, $Y_{ij} = 1$, for participant i at time j.

The first step of the GEE approach is to relate the response to a linear combination of the explanatory variables (i.e., to set up a regression model). In general, the objective of a regression model is to determine whether any of the explanatory variables are associated with the response variable. However, because the mean of a binary outcome is a probability, it must lie between 0 and 1. Therefore, one needs a link function that maps the range of $\mu_{ij}$ onto all possible values of the linear model. In this case, the log of the odds, called the logit, is the link function, and is written as

$$\log\left(\frac{\mu_{ij}}{1-\mu_{ij}}\right) = \beta_0 + \beta_1 X_{ij1} + \cdots + \beta_p X_{ijp}, \tag{1}$$

where $X_{ij1}, \ldots, X_{ijp}$ are the explanatory variables for participant i at time j, and $\beta_0, \beta_1, \ldots, \beta_p$ are, respectively, the intercept and the regression coefficients of the explanatory variables, which are to be estimated.

The second step of the GEE approach is to define the variance of $Y_{ij}$. For a binary response variable the variance is completely determined by the mean of $Y_{ij}$: $\operatorname{var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij})$.

The last step is to choose the working correlation structure between observations on the same participant.
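The first two steps can be checked numerically. The sketch below is only an illustration: the function names, the coefficient values, and the covariate value are hypothetical and do not come from the study. It shows that the inverse of the logit link always returns a value between 0 and 1, and that the variance of a binary response follows directly from its mean.

```python
import math


def logit(mu):
    """Logit link: log odds of a probability mu in (0, 1)."""
    return math.log(mu / (1.0 - mu))


def inverse_logit(eta):
    """Map any value of the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def binary_variance(mu):
    """Variance of a 0/1 response whose mean is mu."""
    return mu * (1.0 - mu)


# Hypothetical intercept, slope, and covariate value (not study estimates).
beta0, beta1 = -1.0, 0.5
x = 2.0

eta = beta0 + beta1 * x          # linear predictor
mu = inverse_logit(eta)          # marginal mean, a probability
variance = binary_variance(mu)   # determined entirely by mu
```

Because the linear predictor may take any real value while the mean must remain a probability, the model is written on the logit scale and converted back to the probability scale only for interpretation.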
The working correlation structure may depend on a vector of unknown parameters, which is assumed to be the same for all participants. The most commonly used working correlation structures are exchangeable (also called compound symmetry), auto-regressive (AR1), and unstructured. A thorough discussion of the properties of various working correlations can be found in Liang and Zeger (1986).

Once the mean, the variance of the responses, and a working correlation structure are defined as described above, the GEE estimates of the unknown parameters in the model are obtained by solving the GEE score equation. The GEE approach generally produces a consistent estimator of the covariance matrix of the estimated parameters, even when the working correlation has been misspecified. This is called the robust or empirical covariance estimator. On the other hand, if the working correlation is correctly specified, a consistent estimator of the covariance matrix of the estimated parameters reduces to a simpler form of the empirical covariance estimator, called the model-based or naive covariance estimator.
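As an illustration of what these working correlation structures look like, the sketch below builds the independent, exchangeable, and AR1 matrices for a participant measured at t time points. The function names and the value alpha = 0.5 are hypothetical. The AR1 form assumes that the correlation between two measurements is alpha raised to the number of measurement occasions separating them; the unstructured form is omitted because it has a separate parameter for every pair of time points.

```python
def exchangeable(t, alpha):
    """Same correlation alpha between every pair of time points."""
    return [[1.0 if j == k else alpha for k in range(t)] for j in range(t)]


def ar1(t, alpha):
    """Correlation alpha ** lag, decaying as the occasions move apart."""
    return [[alpha ** abs(j - k) for k in range(t)] for j in range(t)]


def independent(t):
    """No correlation between repeated measurements."""
    return exchangeable(t, 0.0)


# Hypothetical example: 4 measurement occasions, alpha = 0.5.
ar1_matrix = ar1(4, 0.5)
exchangeable_matrix = exchangeable(4, 0.5)

for row in ar1_matrix:
    print(["%.3f" % value for value in row])
```

Under AR1 a single parameter governs the whole matrix, which is why the estimated AR1 correlations reported with Table 2 (0.57, 0.33, 0.19) fall off approximately as successive powers of the lag-one correlation.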
3. Application

The following illustration uses the smoking relapse prevention data reported previously in detail by Brandon et al. (2004). Brandon et al. tested the efficacy of self-help relapse-prevention booklets for smokers who had already achieved initial abstinence at baseline. The study employed a 2 × 2 factorial design, in which one factor was the amount of content (1 booklet vs. 8 booklets), and the other factor was the frequency of contact (1 mailing vs. 8 mailings). Although the study was longitudinal, with multiple follow-up points, the authors used the classical approach of performing separate logistic regressions for each of the 3 follow-up assessments.

Four hundred thirty-one participants who were abstinent at baseline were randomly assigned to groups receiving the high or low contact and the high or low content interventions. The main outcome variable, smoking status (smoking/abstinent), was determined at each of 3 follow-up points (12, 18, and 24 months), and was coded as a binary variable. Some participants did not complete every follow-up assessment, resulting in missing observations. In the current analyses, as an illustration of the GEE approach, we focus on only the ‘content’ factor.

Table 1 summarizes the number of participants with missing observations and the number of participants who had relapsed to smoking at each of the follow-up measurements. It indicates that at 12 months, the high content intervention group had a lower proportion of participants who had relapsed into smoking than the low content group. Both groups showed a subsequent increase in the number of participants who relapsed, but for the low content group this increase was more pronounced. Our goal is to evaluate whether the effect of the high content intervention compared to the low content intervention is significant over a period of 24 months.

Table 1
Number (%) of participants who were smoking (relapsed), not smoking, or missing at each of the follow-up measurements, by content group

Follow-up    Smoking    Low content    High content
12 months    Yes        40 (21.4)      27 (13.8)
12 months    No         117 (62.6)     136 (69.4)
12 months    Missing    30 (16.0)      33 (16.8)
18 months    Yes        54 (28.9)      38 (19.4)
18 months    No         118 (63.1)     146 (74.5)
18 months    Missing    15 (8.0)       12 (6.1)
24 months    Yes        58 (31.0)      40 (20.4)
24 months    No         122 (65.2)     146 (74.5)
24 months    Missing    7 (3.7)        10 (5.1)

Fig. 1 shows the prevalence of smoking in the follow-up time period, revealing a steady increase from baseline in both intervention groups, with a maximum at the last time point.

Fig. 1. Sample means of smoking relapse in percent by content intervention groups across times.

We first consider a model that represents the mean as a straight line over time and includes the explanatory variables of group (content low = 0, high = 1) and time (in months). Let $\mu_{ij}$ denote the mean, or equivalently, the probability of smoking relapse, for participants i = 1, ..., 431 and times j = 1 (baseline), 12, 18, 24 months. Using the logit link function for the binary responses, the model

$$\operatorname{logit}(\mu_{ij}) = \beta_0 + \beta_1\,\text{Group} + \beta_2\,\text{Time} \tag{2}$$

is fitted. Here $\beta_0$, $\beta_1$, $\beta_2$ are the regression coefficient parameters for intercept, group, and time. After inspecting different working correlation structures and based on our previous experience, we fit model (2) with an auto-regressive (AR1) working correlation. Table 2 presents the results of the GEE analysis assuming an AR1 structure. The p-values are based on the empirical standard errors. From the results, the high content intervention was significantly more successful in preventing relapse of smoking over the total follow-up period (p = 0.01). The result also indicates that there is a significant linear increase over time in the log odds of smoking relapse ($\beta_2$ = 0.10, p < 0.001).

Table 2
Analysis of the GEE parameter estimates based on empirical standard error estimates, using the auto-regressive (AR1) working correlation structure (a), with smoking relapse as outcome variable, and content group and time as explanatory variables

Parameter    Estimate    Standard error    Z         p-value
Intercept    −2.900      0.134             −21.70    < .0001
Group        −0.565      0.220             −2.56     0.0100
Time         0.102       0.005             22.54     < .0001

(a) The estimated AR1 working correlation coefficients:

             Baseline    12 months    18 months    24 months
Baseline     1.00
12 months    0.57        1.00
18 months    0.33        0.57         1.00
24 months    0.19        0.33         0.57         1.00

In the present example, we are also interested in whether and how the effect of the content intervention changes over time. To investigate this, we added a (Group × Time) interaction term to model (2):

$$\operatorname{logit}(\mu_{ij}) = \beta_0 + \beta_1\,\text{Group} + \beta_2\,\text{Time} + \beta_3\,(\text{Group} \times \text{Time}). \tag{3}$$

Model (3) is fitted using several different working correlations: exchangeable, AR1, unstructured, and independent correlation structures. Table 3 displays the results of the analysis for model (3). We found, in the current study, that the results did not depend on the working correlation assumption, judging by the similar values of the estimated parameters. We tested the null hypothesis that the slopes are the same in the two groups or, equivalently, that the interaction coefficient $\beta_3$ for (Group × Time) in model (3) equals zero. The negative sign of the regression coefficient of the interaction term (across all the different working correlation structures) indicates that the log odds of relapse increased more slowly over time in the high content group than in the low content group. However, the p-values imply that there is no significant interaction effect between the intervention and time.

Table 3
Analysis of the GEE parameter estimates based on empirical standard error estimates, using various working correlation structures, with smoking relapse as outcome variable, and content intervention group, time, and the interaction between group and time as explanatory variables

Correlation         Parameter       Estimate    Standard error    p-value    Odds ratio    95% CI lower    95% CI upper
Exchangeable        Intercept       −2.778      0.142             < .0001
                    Group           −0.426      0.229             0.063      0.65          0.42            1.02
                    Time            0.102       0.006             < .0001    1.11          1.09            1.12
                    Group × time    −0.008      0.009             0.413      0.99          0.97            1.01
AR1                 Intercept       −2.963      0.131             < .0001
                    Group           −0.406      0.213             0.057      0.67          0.44            1.01
                    Time            0.106       0.006             < .0001    1.11          1.10            1.12
                    Group × time    −0.008      0.009             0.359      0.99          0.97            1.00
Unstructured (a)    Intercept       −1.106      0.247             < .0001
                    Group           −0.511      0.383             0.182      0.60          0.28            1.27
                    Time            0.016       0.011             0.146      1.02          0.99            1.04
                    Group × time    −0.001      0.017             0.938      1.00          0.97            1.03
Independent         Intercept       −2.946      0.154             < .0001
                    Group           −0.415      0.246             0.091      0.66          0.41            1.07
                    Time            0.106       0.007             < .0001    1.11          1.10            1.13
                    Group × time    −0.008      0.010             0.467      0.99          0.97            1.01

(a) The baseline measurements were excluded from the data for this working correlation structure, because errors occurred in the GEE estimation routine.
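The quantities reported in Tables 2 and 3 are linked by simple arithmetic, which the following sketch reproduces under stated assumptions. It takes the Group coefficient and the intercept to be negative, which is consistent with the reported odds ratios below 1 and with the interpretation in the text; it uses a normal (Wald) approximation with critical value 1.96; and the function names are hypothetical. It is not the estimation routine used in the study, only a check on the published summaries.

```python
import math


def wald_summary(estimate, std_error, z_crit=1.96):
    """Wald z statistic, two-sided p-value, odds ratio, and 95% CI."""
    z = estimate / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "z": z,
        "p": p_value,
        "odds_ratio": math.exp(estimate),
        "ci_lower": math.exp(estimate - z_crit * std_error),
        "ci_upper": math.exp(estimate + z_crit * std_error),
    }


def fitted_probability(group, months, b0=-2.900, b1=-0.565, b2=0.102):
    """Relapse probability implied by model (2) with the Table 2 estimates.

    group: 0 for low content, 1 for high content (signs of b0 and b1 assumed).
    """
    eta = b0 + b1 * group + b2 * months
    return 1.0 / (1.0 + math.exp(-eta))


# Group effect in model (3) under the exchangeable working correlation.
group_exchangeable = wald_summary(-0.426, 0.229)

# Model-implied relapse probabilities at 24 months for each content group.
low_24 = fitted_probability(group=0, months=24)
high_24 = fitted_probability(group=1, months=24)
```

Rounded to two decimals, the odds ratio and confidence limits for the Group effect agree with the exchangeable row of Table 3, and the p-value is close to the reported 0.063. The fitted probabilities are model-implied values, not the observed percentages in Table 1.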
4. Summary and discussion

This paper addressed the issue of binary longitudinal data analysis, presenting the GEE approach both as a tutorial and as a practical guide. An application of the GEE to data on smoking relapse behavior was used as an example. The target readers are researchers who are aware of the need for longitudinal designs, yet have not fully understood this relatively new technique for the analysis of longitudinal data.

We understand that most behavioral researchers are attracted to the classical approaches when analyzing longitudinal data, possibly because they are more familiar with them than with the more complex advanced approaches. Despite their wide use, the classical approaches have limitations for analyzing longitudinal data. The most severe limitation may be that they require complete data: no observations may be missing at any time point, and all units must have an equal number of observations across time points. In addition, classical approaches cannot explicitly model time or meaningful patterns over time, which is one of the goals of a longitudinal study design.

As a practical guide, we want to point out that the choice of working correlation structure depends on what one believes is most realistic for the particular data. However, the correlation among repeated observations within a unit may have a substantial effect on the estimated variances of the regression coefficients that are used for testing the significance of the coefficients. Thus, one must be cautious when choosing a working correlation structure for data involving multiple data points per participant (see Fitzmaurice, 1995). We recommend the use of the GEE only if the number of participants is at least 30, and if 3 to 5 data points per participant are assessed. This recommendation is derived from our experience and from the fact that the GEE model is based on large sample theory, that is, on the asymptotic properties of the regression parameter estimators.
Several statistical packages (e.g., SAS, S-PLUS, or Stata) have built-in support for analyzing longitudinal data with GEE, so that user-written programs require minimal programming effort. The goal of this article was to introduce readers to the basic rationale of the GEE, and to emphasize the key role of correlations among repeated observations within a unit. Understanding this issue is essential before employing statistical software for GEE analysis.
Acknowledgments

The collection of the data reported in this article was supported by the National Cancer Institute grant R01 CA80706.
References

Brandon, T. H., Meade, C. D., Herzog, T. A., Chrisikos, T. N., Webb, M. S., & Cantor, A. B. (2004). Efficacy and cost-effectiveness of a minimal intervention to prevent smoking relapse: Dismantling the effects of amount of content versus contact. Journal of Consulting and Clinical Psychology, 72, 797–808.

Fitzmaurice, G. M. (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics, 51, 309–317.

Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman and Hall.