Journal of Clinical Epidemiology 55 (2002) 329–337
Attrition in longitudinal studies: How to deal with missing data Jos Twisk*, Wieke de Vente Institute for Research in Extramural Medicine, Vrije Universiteit, Vd Boechorststraat 7, 1081 BT Amsterdam, The Netherlands Received 15 October 2000; received in revised form 20 July 2001; accepted 24 July 2001
Abstract The purpose of this paper was to illustrate the influence of missing data on the results of longitudinal statistical analyses [i.e., MANOVA for repeated measurements and Generalised Estimating Equations (GEE)] and to illustrate the influence of using different imputation methods to replace missing data. Besides a complete dataset, four incomplete datasets were considered: two datasets with 10% missing data and two datasets with 25% missing data. In both situations missingness was considered independent and dependent on observed data. Imputation methods were divided into cross-sectional methods (i.e., mean of series, hot deck, and cross-sectional regression) and longitudinal methods (i.e., last value carried forward, longitudinal interpolation, and longitudinal regression). Besides these, also the multiple imputation method was applied and discussed. The analyses were performed on a particular (observational) longitudinal dataset, with particular missing data patterns and imputation methods. The results of this illustration shows that when MANOVA for repeated measurements is used, imputation methods are highly recommendable (because MANOVA as implemented in the software used, uses listwise deletion of cases with a missing value). Applying GEE analysis, imputation methods were not necessary. When imputation methods were used, longitudinal imputation methods were often preferable ab9ove cross-sectional imputation methods, in a way that the point estimates and standard errors were closer to the estimates derived from the complete dataset. Furthermore, this study showed that the theoretically more valid multiple imputation method did not lead to different point estimates than the more simple (longitudinal) imputation methods. However, the estimated standard errors appeared to be theoretically more adequate, because they reflect the uncertainty in estimation caused by missing values. © 2002 Elsevier Science Inc. All rights reserved. Keywords: Attrition; Longitudinal studies; Missing data
1. Introduction One of the main methodological problems in longitudinal studies is attrition. Attrition, missing data, dropouts; all terms are used for the situation that not all N subjects have data on all T repeated measurements [1–3]. Attrition is generally seen at the end of a longitudinal study, although it is also possible that subjects miss a particular measurement, and then return in the study at the next follow-up. A few decades ago, few methods were available to analyse longitudinal data. The available methods (e.g., MANOVA for repeated measurements) had a major drawback, namely that if one of the repeated measurements was missing, all other available data of that subject were excluded from the analysis as well. To overcome this problem imputation methods for missing data have been developed [3]. With today’s sophisticated methods to analyse longitudinal data, such as
* Corresponding author. Tel.: 31 20 4448405; fax: 31 20 4448181. E-mail address:
[email protected] (J. Twisk)
Generalised Estimating Equations (GEE), subjects with incomplete data are not excluded from the analyses. If a particular subject is missing one or more out of T repeated measurements, the remaining available data from the other measurements for that particular subject are used in the analyses. In other words, when more sophisticated methods for the analysis of longitudinal data are used, it is probably less urgent to estimate the missing data. In general, a distinction is made between three types of attrition or missingness: (1) missing completely at random (MCAR, attrition is independent of both unobserved and observed data), (2) missing at random (MAR, attrition depends on observed data, but not on unobserved data), and (3) missing not at random (MNAR, attrition depends on unobserved data). Information on the type of attrition and the possible determinants of attrition is important for a proper interpretation of the results of longitudinal data analysis. However, that is not the focus of this paper. The purpose of this paper is to give an introduction on the possibilities how to deal with missing data in longitudinal studies; particularly aiming at researchers who are less experienced in this field. Therefore, several of the available
0895-4356/02/$ – see front matter © 2002 Elsevier Science Inc. All rights reserved. PII: S0895-4356(01)00 4 7 6 - 0
330
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
imputation methods to replace missing data will be discussed. In an example, the influence of missing data on the results of statistical analyses and the influence of different imputation methods on these results will be illustrated. Because the topic of this supplement concerns longitudinal observational studies, the illustration will be focussed on a specific observational longitudinal dataset. The results of this illustration should, therefore, be interpreted within the limitations of this choice. 2. Methods 2.1. Dataset The longitudinal dataset consists of a continuous outcome variable Y, which has been measured six times, and four predictor variables: X1, a continuous time-independent predictor variable, X2, a continuous time-dependent predictor variable, X3, a dichotomous time-dependent predictor variable and X4, a dichotomous time-independent predictor variable. All timedependent predictor variables were measured at the same six occasions as the outcome variable Y. The number of subjects in the complete dataset was 147. The dataset used in this paper is a “real” dataset taken from the Amsterdam Growth and Health Study, an observational longitudinal study investigating the longitudinal relation between lifestyle and health in adolescence and young adulthood [4]. The abstract notation of the different variables (Y, X1 to X4) is used because for the topic of this paper, it is unimportant what these variables actually are. In this particular dataset the outcome variable Y was total serum cholesterol expressed in mmol/l. X1 was fitness level at baseline (measured as maximal oxygen uptake on a treadmill), X2 was body fatness (estimated by the sum of the thickness of four skinfolds), X3 was smoking behavior (dichotomized as smoking vs. nonsmoking) and X4 was gender. 2.2. Imputation methods Imputation methods can be divided into cross-sectional and longitudinal imputation methods; both can be used to replace missing data in longitudinal studies. The cross-sectional methods described in this paper are: mean of series method, the hot-deck method, and the cross-sectional linear regression method. Longitudinal imputation methods that are discussed are: last value carried forward method, linear interpolation method, and two longitudinal linear regression methods. Furthermore, the multiple imputation method is considered [5–7]. 2.3. Cross-sectional imputation methods In all variants of the mean of series imputation method, the average value of the available data for a particular variable at a particular time point is calculated. This average value is imputed for the missing values. A somewhat different approach is called the hot deck imputation method. With this approach, the average value of, or a random draw from
a subset of comparable cases (e.g., cases with the same gender, age, etc.) is imputed for the missing cases. With crosssectional regression methods, a linear regression of the complete cases is used, with all available predictor variables at the time at which the outcome variable Y was missing. This means for instance that for Y at t 2, a cross-sectional regression analysis with X1, X2 at t 2, X3 at t 2 and X4 was performed. This regression analysis provided the predicted values for the outcome variable Y at that particular time point, and the predicted value is used for the imputation of the missing cases. It should be noted that this approach is only suitable in situations when (for a particular subject) only the outcome variable is missing and not the (possible) predictor variables. 2.4. Longitudinal imputation methods The simplest longitudinal imputation method is called last value carried forward (LVCF). In this approach the value of a particular variable at t t is imputed for a missing value for that particular subject at t t 1, assuming that the variable is more or less constant over time. Another longitudinal imputation method is the linear interpolation imputation method. With this method a missing value at t t is imputed by the average value of the value at t t1 and t t 1, assuming a linear development in time of the variables with missing data. Comparable, but somewhat more sophisticated is the individual longitudinal regression imputation method. This method assesses the linear regression between the outcome variable Y and time for each case with a missing value. The predicted value belonging to the time point of the missing value is imputed for that missing value. A fourth longitudinal imputation method, the population longitudinal regression imputation method, is based on a longitudinal regression of Y on the previous measurement of Y, on the four predictor variables X1 to X4 and on time. For the time-dependent predictor variables X2 and X3 the values measured at the time at which the outcome variable Y was missing were used in this regression. Again, the predicted value belonging to the time point of the missing value is imputed for that missing value. 2.5. Multiple imputation method With the multiple imputation method [5–7], various (say M) imputation values are calculated for every missing value. With the M imputations, M complete data sets can be made. On each data set created in this way, statistical analyses can be performed. The M sets of imputations are repeated draws under one model for missingness (e.g., crosssectional regression, hot deck, longitudinal regression, etc.). The M complete data set summary statistics (e.g., regression coefficients) can be combined to form one summary statistic. The point estimate of the summary statistic is calculated as the average of the M imputations. The variance of the summary statistic is calculated from two components. One component reflects the within-imputation variance (the average of the variances of the summary statistics of the M im-
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
331
putations) and the other component reflects the betweenimputation variance (the difference between the summary statistic of each imputation and the average of the summary statistics of the M imputations) [5–7]. The major advantage of this method is that the combined variance accounts for the uncertainty introduced by estimating the missing values. In principle, the M imputations of the missing values are M repetitions of the posterior predictive distribution of the missing values. In this paper the multiple imputation method will be illustrated with two models for missingness: the cross-sectional hot deck method and the individual longitudinal regression method.
tion of the missing values were performed to create five complete datasets. For the multiple imputation hot-deck approach, a random draw from the subset based on the value of X4 (i.e., the first hot-deck approach) was taken for imputation. For the longitudinal regression multiple imputation approach five independent drawings were taken from the posterior predictive distribution created by an individual linear regression analysis between the outcome variable Y and time. It should be noted that for each dataset with missing values, at least three observations were present for each subject. All imputation methods were done either by hand or by SPSS-PC [8].
2.6. Generating datasets
2.7. Statistical analysis
Besides a complete dataset, a few incomplete datasets were considered. All incomplete datasets were derived from the complete dataset and at each time point only the outcome variable Y was assumed to be missing. In the first dataset with missing values, at each repeated measurement (including the initial measurement at t 1) 10% of the data was missing in a random manner (missing completely at random, MCAR). In the second incomplete dataset, 10% of the data was missing at each of the repeated measurements dependent on the observed data (missing at random, MAR); i.e., the subjects with the highest values of the outcome variable Y at t t were assumed to be missing at t t 1. In this MAR dataset, a full dataset at t 1 was considered. In the third incomplete dataset, 10% of the data was missing at each of the repeated measurements dependent on the unobserved data (missing not at random, MNAR); i.e., the subjects with the highest values of the outcome variable Y at t t were assumed to be missing at t t. Besides datasets with 10% missing data, also the same incomplete datasets with 25% of the data missing were analyzed. All statistical analyses were carried out on: (1) the complete dataset, (2) the incomplete datasets, and (3) datasets in which the missing data were replaced using the imputation methods mentioned before. Regarding the hot-deck method, two different approaches were followed: (1) the average values for the two different categories for the time-independent variable X4 were imputed. So when X4 equaled 1 for a subject with missing data, the average of the subjects of whom X4 also equaled 1 was imputed and for a subject of whom X4 equaled 2, the average of the subjects of whom also X4 equaled 2 was imputed (i.e., missing data for males were imputed with the average value of all other males and missing data for females were imputed with the average value of all other females). (2) Besides X4, also the values of X1, X2 and X3 at the time of measurement at which Y was missing were used to define the subjects most similar to the subject with a missing value for Y. The average value of Y for these subjects was used for imputing the missing value. In this example the four most similar subjects were chosen. For both multiple imputations used in this example, five independent drawings from the posterior predictive distribu-
First of all, a MANOVA for repeated measurements was carried out. The univariate approach was used and when the assumption of sphericity was not met, the Greenhouse-Geisser adjustment was applied. MANOVA for repeated measurements was used to investigate (1) whether there was an overall development in time for outcome variable Y, (2) whether there was an overall difference between the different categories of the time-independent dichotomous variable X4 (i.e., males and females), and (3) whether there was a difference in development in time for the different categories of X4 (i.e., males and females) [9]. Furthermore, in an additional analysis, the relationship between the baseline values of the other predictor variables (X1, X2, and X3) with the outcome variable Y was assessed simultaneously. Second, a Generalized Estimating Equations (GEE) analysis [10–13] was carried out to investigate the longitudinal relationship between the outcome variable Y and the four predictor variables (X1 to X4) and time. Because the relationship between the outcome variable Y and time is quadratic rather than linear, also time2 was added to the GEE model. Besides this, the interaction between time and X4, was analyzed. With GEE, the following model was analyzed: Y it = β 0 + β 1 X1 i + β 2 X2 it + β 3 X3 it + β 4 X4 i + 2 β 5 time + β 6 time + β 7 ( X4 i × time ) + ε it where Yit observation of subject i at time t; 0 intercept; 1 to 7 regression coefficients; X1 and X4 timeindependent predictor variables; X2 and X3 time-dependent predictor variables; X4 time interaction between X4 and time and εit measurement error of subject i at time t. Because the primary purpose of this study was to compare different imputation methods within one statistical technique, the model analysed with GEE slightly differs from the model analyzed with MANOVA. For both statistical techniques relatively simple models were chosen, which are frequently used in research involving longitudinal observational data. It should be noted that the results and conclusions of the illustrative analyses only hold for these “simple” models. MANOVA for repeated measurements were carried out by SPSS-PC [8] and the GEE analyses were carried out by SPIDA [14].
332
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
Table 1 Descriptive information of outcome variable Y and the predictor variables (X1 to X4) measured at six repeated occasionsa X41 X42 X41 X42 X41 X42 X41 X42 X41 X42
Y (total serum cholesterol) X1 (fitness at baseline) X2 (sum of skinfolds) X3 (smoking behaviour) X4 (gender)
males females
t1
t2
t3
t4
t5
t6
4.5 (0.6) 4.4 (0.7) 2.1 (0.2) 1.9 (0.2) 2.7 (1.0) 3.7 (1.3) 1.4% 3.8% N69 N78
4.3 (0.7) 4.4 (0.7)
4.1 (0.7) 4.4 (0.7)
4.0 (0.6) 4.4 (0.7)
4.4 (0.8) 4.9 (0.7)
5.0 (0.9) 5.3 (0.9)
2.7 (1.0) 4.0 (1.3) 7.2% 7.7%
2.7 (1.0) 4.3 (1.4) 13.0% 17.9%
2.8 (1.0) 5.2 (1.6) 13.0% 24.4%
3.4 (1.2) 5.2 (1.6) 33.3% 32.1%
3.6 (1.4) 4.6 (1.7) 33.3% 21.8%
a For outcome variable Y and the continuous predictor variables (X1 and X2) mean and standard deviation are given, for the dichotomous predictor variables X3 the percentage of smokers (X31) is given and for the dichotomous predictor variable X4, the actual numbers in each category are given.
3. Results Table 1 shows descriptive information of outcome variable Y (i.e., total serum cholesterol) and the four predictor variables (fitness level, the sum of skinfolds, smoking behaviour and gender). Table 2 shows the interperiod correlation coefficients for outcome variable Y. Because the interperiod correlation coefficients were quite high, the MNAR and the MAR dataset were comparable. We decided only to report the results of the statistical analysis for the MAR dataset. The reason for choosing this dataset instead of the MNAR dataset is that it is impossible to investigate whether the missing data depend on the unobserved data, while it is possible to investigate whether the missing data depend on observed data. So in research, one never knows whether one is dealing with a MNAR dataset or not. 3.1. MANOVA for repeated measurements Tables 3 and 4 show the results of MANOVA for repeated measurements for all generated datasets. With MANOVA for repeated measurements, the overall difference between two groups (indicated by X4; i.e., males and females), the overall development in time, and the difference in development in time between the two groups (indicated by the time*X4 interaction) were analyzed. In an additional analysis the relationships with the baseline values of the other three determinants were investigated. For the within-subject effects (i.e., the effects involving time) explained variances were calculated. Explained variance was defined as the proportion of the total variability in the outcome variable that is accounted
Table 2 Interperiod correlation matrix for outcome variable Y Y1 Y2 Y3 Y4 Y5 Y6
Y1
Y2
Y3
Y4
Y5
Y6
—
0.76 —
0.70 0.77 —
0.67 0.78 0.85 —
0.64 0.67 0.71 0.74 —
0.59 0.59 0.63 0.65 0.69 —
for by variation in the particular predictor variable; i.e., it is the ratio of the sum of squares of the particular effect to the total sum of squares. For the between-subject effect, regression coefficients and standard errors were calculated and the averages of the six regression coefficients and standard errors were reported. The results of the within-subject and between-subject effects are given in Table 3. In general, the influence of missing data and the influence of different imputation methods on the results of the MANOVA for repeated-measurements was marginal. The final conclusions of the MANOVA for repeated measurements, if based on P-values, were similar for all datasets. This is because all effects were highly statistically significant. The exception was the datasets with missing values, in which the effect of X4 was not significant. This is caused by the reduced power of the analyses. With MANOVA for repeated measurements only the “complete” cases (without missing data) can be analyzed. The explained variances, however, indicate that the observed associations were comparable with the results from the other datasets. For all datasets, imputation of missing data led to an underestimation of the overall time effect (i.e., lower explained variance). An additional interesting finding is that both the cross-sectional and the longitudinal multiple imputation methods underestimate the overall time effect (compared to the complete dataset and the “single” imputation methods). This phenomenon is stronger when the percentage of missing data is higher. The underestimation of the overall time effect by multiple imputation methods can be seen as an indicator for the uncertainty in the estimation of parameters, due to missing data. Compared to the cross-sectional imputation methods, especially in the MAR datasets, the longitudinal imputation methods behave better. This is due to the fact that in these datasets the missingness depends on the outcome variable Y at t t1, which is used in the longitudinal imputation methods. In an additional analysis (including X4 as a between-subject factor), the relationship with the baseline values of the other three determinants was investigated. These results are presented in Table 4. Like for the between-subject effect X4,
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
333
Table 3 Results of MANOVA for repeated measurements analyzing an overall group effect [the between-subject effect for X4a, expressed in a regression coefficient () and standard error (SE)]b, the overall development in time (the within-subject effect for time) of outcome variable Y c, and a possible difference in development between the two groups (the within-subject effect: time*X4 interactiond; both within-subject effects are expressed as explained variancese and P-values) (Complete dataset, two datasets with 10% missing data, two datasets with 25% missing data, and datasets in which the missing data were replaced using different imputation methods)f
Complete dataset 10% missing (MCAR) Cross-sectional imputation Multiple imputationg Longitudinal imputation Multiple imputationh 10% missing (MAR) Cross-sectional imputation Multiple imputationg Longitudinal imputation Multiple imputationh 25% missing (MCAR) Cross-sectional imputation Multiple imputationg Longitudinal imputation Multiple imputationh 25% missing (MAR) Cross-sectional imputation Multiple imputationg Longitudinal imputation Multiple imputationh
Between-subjects effect
Within-subjects effects
X4
SE
Time Explained variancee
P-value
Time*X4 interactiond Explained variancee
P-value
0.26 0.19 0.23/0.26 0.27 0.27/0.28 0.26 0.12 0.21/0.23 0.22 0.26/0.27 0.27 0.36 0.19/0.25 0.26 0.22/0.25 0.22 0.09 0.19/0.25 0.25 0.23/0.27 0.25
0.12 0.16 0.12 0.12 0.12 0.12 0.11 0.11 0.11 0.12 0.12 0.30 0.10/0.11 0.12 0.11/0.12 0.12 0.11 0.10/0.11 0.11 0.12 0.12
0.42 0.43 0.39/0.40 0.33 0.37/0.40 0.36 0.46 0.31/0.39 0.34 0.41/0.44 0.40 0.40 0.36/0.38 0.27 0.33/0.40 0.24 0.51 0.35/0.38 0.30 0.37/0.43 0.35
.01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01
0.05 0.05 0.05 0.05 0.05 0.05 0.07 0.03/0.04 0.03 0.05/0.06 0.05 0.10 0.03/0.06 0.05 0.05/0.06 0.05 0.10 0.03/0.04 0.03 0.05/0.07 0.05
.01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01 .01
a
Gender (males vs. females). The regression coefficients and standard errors are averaged over the six time points. c Total serum cholesterol. d Different development for males and females. e Explained variance is defined as the proportion of the total variability in the outcome variable Y that is accounted for by variation in the particular predictor variable. It is the ratio of the sum of squares of the particular effect to the total sum of squares. f For the cross-sectional and longitudinal imputation methods, the range in regression coefficients (or explained variances) and standard errors is reported. g Five independent drawings from the subset based on X4. h Five independent drawings from the predictive distribution created by an individual linear regression analysis between the outcome variable Y and time. b
for the predictor variables X1, X2, and X3, the average of the six regression coefficients is reported. It is clear that all analyses on datasets with missing data without imputation led to different results than the analysis on the complete dataset. For X1, in the datasets with 10% missing data the different imputation techniques did not differ much regarding the magnitude of the regression coefficients, while in the datasets with 25% missing data, especially the cross-sectional imputation techniques led to an underestimation of the regression coefficients. For X2, the point estimates were mostly underestimated when cross-sectional imputation techniques were used; however, the differences with the results of the complete dataset were only marginal. For X3, in the MAR datasets every imputation led to an overestimation of the regression coefficients, while for the MCAR datasets this was not observed. The most striking difference in the results was, however, found for the standard errors of the different regression coefficients. All cross-sectional imputation techniques led to an underestimation of these standard errors. This underesti-
mation was more pronounced in the datasets with 25% missing data and was more pronounced for X1 and X3 than for X2. Applying multiple imputation methods to the missing datasets led to an “overestimation” of the standard errors of the regression coefficients. This reflects the increase of uncertainty in the estimation of the regression coefficients caused by the missing data, which is part of the theoretical background of multiple imputations. 3.2. GEE analysis In Table 5, the results of the GEE analyses are shown. With this analysis the longitudinal relationship between outcome variable Y and the four predictor variables X1 to X4, time, time2, and the time X4 interaction was analyzed. Unlike the MANOVA, in the GEE analysis X2 and X3 were treated as time dependent. For each predictor variable, the regression coefficients and standard errors were given. Although the overall conclusions (if based on P-values) for the different predictor variables did not differ between the several datasets, the point estimates and standard errors of (es-
334
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
Table 4 Results of MANOVA for repeated measurements analyzing the relationship between outcome variable Y a and (baseline values of) predictor variables X1, X2, and X3b, expressed in a regression coefficient and a standard errors ( ))c (Complete dataset, two datasets with 10% missing data, two datasets with 25% missing data, and datasets in which the missing data were replaced using different imputation methods)d Complete dataset 10% missing (MCAR) Cross-sectional imputation Multiple imputatione Longitudinal imputation Multiple imputatationf 10% missing (MAR) Cross-sectional imputation Multiple imputatione Longitudinal imputation Multiple imputatationf 25% missing (MCAR) Cross-sectional imputation Multiple imputatione Longitudinal imputation Multiple imputatationf 25% missing (MAR) Cross-sectional imputation Multiple imputatione Longitudinal imputation Multiple imputatationf
X1
X2
X3
0.47 (0.28) 0.50 (0.38) 0.46/0.43 (0.27) 0.46 (0.33) 0.47/0.45 (0.27–0.28) 0.46 (0.29) 0.36 (0.23) 0.47/0.43 (0.25/0.26) 0.43 (0.29) 0.51/0.48 (0.28) 0.48 (0.31) 0.74 (0.65) 0.43/0.33 (0.24/0.25) 0.39 (0.31) 0.47/0.40 (0.26/0.28) 0.48 (0.31) 0.22 (0.26) 0.47/0.34 (0.24/0.25) 0.39 (0.35) 0.51/0.41 (0.27/0.28) 0.42 (0.30)
0.18 (0.05) 0.20 (0.06) 0.17/0.18 (0.04/0.05) 0.18 (0.06) 0.18/0.19 (0.05) 0.19 (0.06) 0.13 (0.05) 0.15/0.17 (0.04) 0.15 (0.06) 0.18/0.19 (0.05) 0.18 (0.06) 0.30 (0.13) 0.17/0.18 (0.04/0.05) 0.15 (0.06) 0.17/0.18 (0.04/0.05) 0.17 (0.06) 0.10 (0.06) 0.14/0.18 (0.04) 0.14 (0.05) 0.17/0.20 (0.05) 0.18 (0.06)
0.16 (0.38) 0.15 (0.38) 0.16 (0.36/0.37) 0.15 (0.40) 0.17/0.15 (0.37/0.38) 0.17 (0.40) 0.17 (0.39) 0.25/0.19 (0.35) 0.22 (0.38) 0.19/0.18 (0.38/0.39) 0.19 (0.45) 0.04 (0.73) 0.16 (0.36/0.37) 0.07 (0.41) 0.27/0.14 (0.35/0.38) 0.10 (0.42) 0.38 (0.48) 0.32/0.25 (0.32/0.33) 0.28 (0.40) 0.34/0.27 (0.37) 0.26 (0.41)
a
Total serum cholesterol. The regression coefficients and standard errors are averaged over the six time points. c X1 fitness at baseline, X2 sum of four skinfolds, X3 smoking behaviour (yes/no). d For the cross-sectional and longitudinal imputation methods, the range in regression coefficients and standard errors is reported. e Five independent drawings from the subset based on X4 (i.e., gender). f Five independent drawings from the predictive distribution created by an individual linear regression analysis between the outcome variable Y and time. b
pecially) the time-independent predictor variables X1 and X4 differ remarkably between the different imputation methods. For X1, the point estimate of the regression coefficient was very unstable. This was caused by the large standard error of this coefficient in the complete dataset. Especially in the datasets with 25% missing data, there is a large difference in regression coefficients between the different imputation methods. Standard errors of the regression coefficients derived with use of cross-sectional imputation methods were underestimated, leading to smaller confidence intervals and lower P-values. This could lead to wrong conclusions, which is especially relevant in situations of borderline significance. A less clear, but comparable situation is found for the timeindependent predictor variable X4. The fact that GEE underestimates the standard errors of the regression coefficients of the time-independent predictors when cross-sectional imputation methods are applied, has also been found in analyses in other datasets analyzed with GEE (unpublished results) and was also found in the results of MANOVA for repeated measurements. This has to do with the decreased variance in outcome variable Y due to the cross-sectional imputation techniques and with the estimation procedures of GEE regarding time-independent predictors [14]. In the 10% MAR dataset, the decrease in variance in outcome variable Y was small, and therefore, the underestimation of the standard errors was less pronounced. For the time-dependent predictor variable X2, for time,
for time2 and for the time X4 interaction, the results of all datasets were more or less the same. Only marginal differences in both the regression coefficients and the standard errors of the regression coefficients were found. An exception was found for the datasets with 25% missing data, in which the cross-sectional imputation techniques seem to overestimate the linear time effect, while the longitudinal imputation techniques seem to underestimate this effect. For X3, in the MAR datasets, missingness led to an underestimation of the regression coefficients. This indicates that the relationship between Y and X3 is especially observed in the upper part of the distribution. With the crosssectional imputation methods, this problem is not solved, while with the longitudinal imputation methods, the underestimation of the regression coefficients was less pronounced. This is caused by the fact that the longitudinal imputation methods use information of the individual development of each subject, while the cross-sectional imputation methods only use information of the cross-sectional data of all other subjects. As has also been seen in the MANOVA for repeated measurements, in GEE analysis, the multiple imputation led to an “overestimation” of the standard errors of the estimated regression coefficients. This is caused by the between-imputation variance, which is added as an extra component to the overall variance. The larger standard errors led to broader confidence intervals and therefore to higher P-values, which
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
335
Table 5 Regression coefficients and standard errors ( ) derived from GEE analysis investigating the longitudinal relationships between outcome variable Y a and several predictor variables (X1 to X4)b, time, the interaction between time and X4, and timeb (Complete dataset, two datasets with 10% missing data, two datasets with 25% missing data, and datasets in which the missing data were replaced using different imputation methods)c X1 Complete dataset 10% missing (MCAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 10% missing (MAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 25% missing (MCAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 25% missing (MAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione
Complete dataset 10% missing (MCAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 10% missing (MAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 25% missing (MCAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatatione 25% missing (MAR) Cross-sectional imputation Multiple imputationd Longitudinal imputation Multiple imputatation5
0.03 (0.28) 0.01 (0.28) 0.04/0.07 (0.23/0.27) 0.00 (0.32) 0.02/0.00 (0.28) 0.02 (0.30) 0.02 (0.27) 0.06/0.05 (0.23/0.24) 0.05 (0.30) 0.08/0.03 (0.24/0.28) 0.03 (0.31) 0.06 (0.28) 0.04/0.07 (0.22/0.23) 0.08 (0.33) 0.11/0.02 (0.24/0.30) 0.14 (0.31) 0.06 (0.29) 0.01/0.02 (0.21) 0.07 (0.27) 0.06/0.04 (0.28/0.30) 0.02 (0.33)
X2
X3
0.10 (0.02) 0.10 (0.02) 0.09/0.13 (0.02) 0.10 (0.03) 0.10 (0.02) 0.10 (0.03) 0.11 (0.02) 0.09/0.12 (0.02) 0.10 (0.03) 0.09/0.11 (0.02) 0.10 (0.03) 0.11 (0.02) 0.10/0.13 (0.02) 0.11 (0.02) 0.09/0.11 (0.02) 0.09 (0.03) 0.11 (0.02) 0.09/0.13 (0.02) 0.09 (0.03) 0.08/0.11 (0.02) 0.09 (0.02)
0.07 (0.06) 0.07 (0.06) 0.05/0.01 (0.05/0.06) 0.04 (0.09) 0.08/0.07 (0.05/0.06) 0.07 (0.07) 0.01 (0.05) 0.01/0.04 (0.05/0.06) 0.01 (0.09) 0.07/0.01 (0.05) 0.04 (0.07) 0.06 (0.06) 0.04/0.01 (0.06) 0.03 (0.11) 0.06/0.03 (0.05/0.06) 0.03 (0.11) 0.02 (0.06) 0.00/0.04 (0.05) 0.01 (0.09) 0.08/0.02 (0.05) 0.02 (0.07) X4 time
X4
time
0.14 (0.14) 0.13 (0.14) 0.14/0.13 (0.13) 0.13 (0.17) 0.16/0.13 (0.13/0.14) 0.16 (0.15) 0.17 (0.14) 0.18/0.14 (0.12) 0.14 (0.15) 0.17/0.14 (0.14/0.15) 0.14 (0.15) 0.17 (0.15) 0.25/0.16 (0.11/0.12) 0.21 (0.18) 0.20/0.11 (0.12/0.15) 0.26 (0.20) 0.17 (0.14) 0.19/0.11 (0.11/0.12) 0.13 (0.15) 0.18/0.11 (0.14/0.15) 0.11 (0.17)
0.63 (0.05) 0.64 (0.06) 0.66/0.65 (0.05 /0.06) 0.66 (0.08) 0.64/0.55 (0.05) 0.56 (0.07) 0.66 (0.05) 0.70/0.64 (0.05) 0.68 (0.07) 0.64/0.61 (0.05) 0.61 (0.07) 0.66 (0.06) 0.62/0.66 (0.05/0.06) 0.67 (0.09) 0.58/0.51 (0.05/0.06) 0.43 (0.09) 0.64 (0.05) 0.69/0.67 (0.04/0.05) 0.68 (0.07) 0.58/0.51 (0.05) 0.50 (0.06)
0.07 (0.02) 0.07 (0.02) 0.07 (0.02) 0.07 (0.03) 0.07/0.08 (0.02) 0.08 (0.03) 0.08 (0.02) 0.06/0.07 (0.02) 0.06 (0.02) 0.08 (0.02) 0.08 (0.03) 0.08 (0.02) 0.06/0.09 (0.02) 0.10 (0.04) 0.07/0.09 (0.02) 0.09 (0.04) 0.08 (0.02) 0.06/0.08 (0.02) 0.07 (0.03) 0.07/0.08 (0.02) 0.07 (0.03)
timeb 0.09 (0.01) 0.09 (0.01) 0.09/0.10 (0.01) 0.07 (0.01) 0.08/0.09 (0.01) 0.06 (0.01) 0.09 (0.01) 0.09/0.10 (0.01) 0.07 (0.01) 0.08/0.09 (0.01) 0.06 (0.01) 0.09 (0.01) 0.09 (0.01) 0.08 (0.01) 0.07/0.08 (0.01) 0.06 (0.01) 0.09 (0.01) 0.09/0.10 (0.01) 0.07 (0.01) 0.07/0.08 (0.01) 0.06 (0.01)
a
Total serum cholesterol. X1 fitness at baseline, X2 sum of four skinfolds, X3 smoking behavior (yes/no), X4 gender (males /females). c For the cross-sectional and longitudinal imputation methods, the range in regression coefficients and standard errors is reported. d Five independent drawings from the subset based on X4. e Five independent drawings from the predictive distribution created by an individual linear regression analysis between the outcome variable Y and time. b
is the reason for more conservative conclusions about the relationships analyzed. However, it may be considered theoretically justified that the imputation uncertainty is reflected in the final results of the statistical analyses. 4. Discussion In this paper we have examined the consequences of missing data in longitudinal studies for the results of statistical analyses. This was done by comparing datasets without
missing data, datasets with missing data, and datasets in which the missing data were imputed by different imputation methods. Furthermore, we considered two statistical methods; i.e., MANOVA for repeated measurements, in which listwise deletion of cases with missing data occurred, and GEE analysis in which it is assumed that attrition depends on covariates only [15]. To summarize the results, for the MANOVA for repeated measurements it was found that with cross-sectional imputation techniques the overall time effects were underestimated
336
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
and in the MAR datasets the interaction between time and the between subject factor (i.e., gender) was underestimated. Both underestimations were more pronounced in the datasets with 25% missing data. The cross-sectional imputation techniques also underestimated the standard errors of the effects of the between-subject factor and the baseline values of the other predictor variables. The underestimation was strongest in the dataset with 25% missing data. For the results of the cross-sectional imputation techniques, no difference was found between the MCAR and the MAR datasets. Analyses on the incomplete datasets showed distorted results; i.e., unstable regression coefficients and large standard errors. In addition to X1, in MANOVA, X2, and X3 were also treated as time independent. Because of this, it was a bit surprising that the effects of missingness and imputations were different for X1 compared to X2 and X3. We do not have a clear explanation for this, although a comparison of all analyses shows that the regression coefficients for X1 are highly unstable, and that the standard errors are large. For the GEE analysis it was found that with cross-sectional imputation techniques, the standard errors of the regression coefficients for the time-independent predictors were underestimated. This effect was strongest in the dataset with 25% missing data and in the MAR dataset. The results of the analyses on incomplete datasets were comparable to the results of the analysis on datasets with any of the imputation techniques. With the multiple imputation technique all standard errors were “overestimated.” The differences in point estimates between multiple imputations and the other imputation methods were not consistent, which was probably due to the choice for the (relatively simple) probability distributions for the multiple imputations. When interpreting these results, it should be taken into account that to prevent complexity, in this example only four missing data scenarios were illustrated, while in a real life situation infinite patterns of missing data can occur. In two of the missing data scenarios, the attrition rate was not very high (10%). Furthermore, dependency on the observed data in the MAR dataset was large; i.e., the highest 10 or 25% of the distribution at a particular measurement were assumed to drop out at the subsequent measurement. Another limitation of the present example is that only the Y-values were set missing, while it is just as likely to have missing data in the predictor variables as well. Furthermore, many more models for imputation can be chosen. The multiple imputation methods were for instance limited to two (relatively) simple models, and they were limited to only five replications of the imputed values. Another important issue is that the choice for the example dataset, partly determines the performance of different imputation techniques. For instance, the stronger the relationship between the predictor variables and the outcome variable Y, the better the imputation methods using these predictor variables will behave. When the outcome variable Y does not change much over time, the performance of longitudinal imputation methods using the information of Y at earlier time points will be bet-
ter. In general, one should be aware of the choices made, when evaluating the results of the different imputation methods. It has been argued that using imputation methods for missing data results in a decrease in variability of that particular variable. In theory, this is quite obvious for the mean of series and the hot-deck approach, but a decrease in variability is also a problem in the regression approaches. In the regression approaches all imputed values lie exactly on the estimated regression line. To overcome this problem, a value randomly chosen from a range of values, for instance, covering the 95% confidence interval around the mean or predicted value, can be imputed. Another solution to overcome the problem of reduced variability is to use the multiple imputation method [5–7]. In theory, the multiple imputation method is the most elegant solution for the imputation of missing data. The performance of the multiple imputation method, however, highly depends on the chosen model for missingness. In the presented example two relatively simple models were used for multiple imputation; the results derived from these multiply imputed datasets did not show different regression coefficients than the ones from the datasets with single imputation methods. The difference was found in more adequate (i.e., higher) standard errors of the regression coefficients in both the MANOVA for repeated measurements and the GEE analyses. It is possible that if more sophisticated models for missingness were used (e.g., using SOLAS software [16]), the performance of the “multiple imputation” method would lead to point estimates, which would be closer to the point estimates derived from the complete dataset. In this paper, for each multiple imputation method one model for missingness was used. In the cross-sectional imputation method, the hot-deck “model” was used and in the longitudinal imputation method an individual regression “model” was used. By using more than one model for missingness within one multiple imputation method, it has been argued that the multiple imputation method can also account for the uncertainty of the model for missingness [3]. Although it was not the purpose of this paper to compare the results of MANOVA for repeated measurements with the results of the GEE-analysis, it was striking that the regression coefficient for X4 derived from MANOVA (0.24) was totally different from the coefficient derived from the GEE analysis. For the GEE analysis, combining the X4 regression coefficient and the time X4 interaction, an overall regression coefficient of 0.11 was found for X4 in the complete dataset. This difference was caused by the fact that in the MANOVA analysis, the effect of the dichotomous predictor variable X4 was assessed without correcting for the other three predictor variables. When the effect of X4 was derived from an MANOVA analysis correcting for the baselines values of X1, X2, and X3, the regression coefficients were comparable to the regression coefficients for X4 derived from a GEE analysis. In fact, when the GEE analysis was performed on a complete dataset, correcting for the
J. Twisk, W. de Vente / Journal of Clinical Epidemiology 55 (2002) 329–337
baseline values of X1, X2 and X3, the regression coefficients for X4 were exactly the same (i.e., 0.13) for the two statistical methods. In the example a continuous outcome variable was used. The background of this choice is that there are more imputation methods available for continuous outcome variables than for dichotomous and/or categorical outcome variables. For these noncontinuous types of missing data, a cross-sectional imputation method is the imputation of missing values with the value of the category with the highest frequency. This can be either based on the total population or on a particular subset (“hot-deck” approach). The most frequently used longitudinal imputation method available for dichotomous and categorical missing data is the LVCF method. Linear interpolation can be used, but the average value of the output variable at the two surrounding time points has to be rounded off. For dichotomous outcome variables, both cross-sectional and longitudinal logistic regression can be used to predict missing data. In this situation the predicted values also have to be rounded off. In this paper several imputation methods to replace missing data have been discussed. Replacing missing data is, however, not the only solution to the attrition problem. Many alternative approaches are especially suitable for experimental studies. An example is the alternative approach of Shih and Quan [17], who suggest to combine the results of two analyses: (1) the comparison of the outcome variable between the groups analyzed in the complete cases, and (2) the comparison of the percentage of outcome related dropouts between the groups analyzed. The P-values of these two analyses can be combined to give the P-value for the “real” difference between the groups. For further information about the modeling of the dropout mechanism in longitudinal studies, an excellent (mathematical) overview is given by Little [15]. 5. Conclusions The present example with its mentioned limitations (i.e., specific observational longitudinal dataset, four missing data scenarios, limited number of imputation techniques, missingness dependent on the outcome variable, two statistical methods, less advanced multiple imputation estimation procedures) shows that when MANOVA for repeated measurements is used to analyze a longitudinal dataset with missing data, imputation methods to replace these missing data are
337
highly recommendable (because MANOVA as implemented in the software used, uses listwise deletion of cases with a missing value). When GEE is used to analyze a longitudinal dataset with missing data, not imputing at all may be better than any of the imputation methods applied. If one chooses to impute missing values, longitudinal methods are generally preferred above cross-sectional methods. Using the more refined multiple imputation method to impute missing values did not lead to different point estimates than the single imputation techniques. The estimated standard errors were higher than the ones obtained from the complete dataset, which seems to be theoretically justified, because they reflect uncertainty in estimation caused by missing values. References [1] Magnusson D, Bergman LR, editors. Data quality in longitudinal research. Cambridge: Cambridge University Press, 1990. [2] Diggle PJ. Testing for random dropouts in repeated measurement data. Biometrics 1989;45:1255–8. [3] Little RJA, Rubin DB. Statistical analysis with missing data. New York: John Wiley, 1987. [4] Kemper HCG, editor. The Amsterdam Growth Study: a longitudinal analysis of health, fitness and lifestyle. HK sports science monograph series, Vol 6. Champaign, IL: Human Kinetics Publishers Inc., 1995. [5] Rubin DB. Multiple imputation for nonresponse in surveys. New York: John Wiley & Sons, 1987. [6] Rubin DB. Multiple imputation after 18 years. J Am Stat Assoc 1996;91:473–89. [7] Schafer JL. Analysis of incomplete multivariate data. New York: Chapman & Hall, 1997. [8] SPSS-X user’s guide. 3rd ed. Chicago: SPSS Inc., 1988. [9] Kleinbaum DG, Kupper LL, Muller KE. Applied regression analysis and other multivariable methods. Boston: PWS-KENT Publishing Company, 1988. [10] Zeger SL, Liang K-Y. Longitudinal data analysis for discrete and continuous outcomes. Biometrics 1986;42:121–30. [11] Zeger SL, Liang K-Y. An overview of methods for the analysis of longitudinal data. Stat Med 1992;11:1825–39. [12] Liang K-Y, Zeger SL. Regression analysis for correlated data. Annu Rev Publ Health 199314:43–68. [13] Twisk JWR. Different statistical models to analyze epidemiological observational longitudinal data: an example from the Amsterdam Growth and Health Study. Int J Sports Med 1997;18(Suppl3):S216–24. [14] Gebski V, Leung O, McNeil D, Lunn D, editors. SPIDA user manual. version 6. NSW, Australia: Macquarie Univ., 1992. [15] Little RJA. Modelling the drop-out mechanism repeated measures studies. JASA 1995;90:1112–21. [16] SOLAS for missing data analysis. Statistical solutions. Saugus, MA: Stonehill Corporate Center, 1997. [17] Shih WJ, Quan H. Testing for treatment differences with dropouts present in clinical trials—a composite approach. Stat Med 1997;16:1225–39.