Journal of Clinical Epidemiology 72 (2016) 75–83
ORIGINAL ARTICLES
Different methods to analyze stepped wedge trial designs revealed different aspects of intervention effects J.W.R. Twiska,b,*, E.O. Hoogendijkb,c,d, S.A. Zwijsenb,c, M.R. de Boerb,e a Department of Epidemiology and Biostatistics, VU University Medical Centre, Amsterdam, The Netherlands EMGOþ Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands c Department of General Practice & Elderly Care Medicine, VU University Medical Center, Amsterdam, The Netherlands d Gerontop^ole, Toulouse University Hospital, Toulouse, France e Section Methodology and Applied Biostatistics, Department of Health Sciences, VU University, Amsterdam, The Netherlands b
Accepted 3 November 2015; Published online 14 November 2015
Abstract

Objectives: Within epidemiology, a stepped wedge trial design (i.e., a one-way crossover trial in which several arms start the intervention at different time points) is increasingly popular as an alternative to a classical cluster randomized controlled trial. Despite this increasing popularity, there is huge variation in the methods used to analyze data from a stepped wedge trial design.

Study Design and Setting: Four linear mixed models were used to analyze data from a stepped wedge trial design on two example data sets. The four methods were chosen because they have been (frequently) used in practice. Method 1 compares all the intervention measurements with the control measurements. Method 2 treats the intervention variable as a time-independent categorical variable, comparing the different arms with each other. In method 3, the intervention variable is a time-dependent categorical variable comparing groups with different numbers of intervention measurements, whereas in method 4, the changes in the outcome variable between subsequent measurements are analyzed.

Results: In the first example data set, methods 1 and 3 showed a strong positive intervention effect, which disappeared after adjusting for time. Method 2 showed an inverse intervention effect, whereas method 4 did not show a significant effect at all. In the second example data set, the results were the opposite: both methods 2 and 4 showed significant intervention effects, whereas the other two methods did not. For method 4, the intervention effect attenuated after adjustment for time.

Conclusion: Different methods to analyze data from a stepped wedge trial design reveal different aspects of a possible intervention effect. The choice of a method partly depends on the type of the intervention and the possible time-dependent effect of the intervention. Furthermore, it is advised to combine the results of the different methods to obtain an interpretable overall result.
© 2016 Elsevier Inc. All rights reserved.

Keywords: Baseline adjustment; Intervention; Longitudinal studies; Stepped wedge trial design; Statistical methods; Time adjustment
1. Introduction

The stepped wedge trial design is a one-way crossover trial in which several arms start with the intervention at different time points (see Fig. 1). The starting point of the intervention is randomized, and although this randomization can theoretically be at an individual level, it is mostly at a cluster level, such as hospitals, departments, or neighborhoods. In the literature, there is some debate about the usefulness of a stepped wedge trial design. This
* Corresponding author. Tel.: 020 4444495. E-mail address: [email protected] (J.W.R. Twisk).
http://dx.doi.org/10.1016/j.jclinepi.2015.11.004
0895-4356/© 2016 Elsevier Inc. All rights reserved.
is not only related to practical and logistical reasons but also to statistical power issues [1–6]. However, within epidemiology, the stepped wedge trial design is increasingly popular as an alternative to the classical cluster randomized controlled trial (RCT). Besides the discussion about the usefulness of a stepped wedge trial design (a discussion that will not be covered in this article), there is also much confusion about the way data from a stepped wedge trial design should be analyzed. In a systematic review, Brown and Lilford [7] mentioned that "no two studies use the same method in analyzing data," whereas Mdege et al. [8] concluded that there was huge variation in the statistical methods used, varying from simple cross-sectional statistical methods, such as
What is new?

Key findings
- There is huge variation in the methods used to analyze data from a stepped wedge trial.
- Whether an adjustment for time and an adjustment for the baseline value of the outcome are necessary depends on the method used.
- Different methods reveal different aspects of the intervention effect.
- The choice of a method partly depends on the type of intervention and the possible time-dependent effect of the intervention.

What this adds to what was known?
- The paper provides an overview of different methods that can be used to analyze data from a stepped wedge trial design.
- The paper evaluates different methods that can be used to analyze data from a stepped wedge trial design.
- The paper provides a table with pros and cons of the different methods that can be used to analyze data from a stepped wedge trial design.

What is the implication and what should change now?
- Researchers can use this paper as a reference to choose a suitable method to analyze data from a stepped wedge trial design.
- Researchers should combine the results of the different methods to obtain an interpretable overall result.
t-tests or Mann–Whitney U tests, to more complicated methods, such as mixed models.

Most stepped wedge trial designs are longitudinal in nature. This means that the same group of subjects is followed over time and the different clusters receive the intervention at different points in time. There are also stepped wedge trial designs that are cross-sectional regarding the subjects: at each interval, new subjects are included and, depending on the timing and the cluster to which they are randomized, they receive either the intervention or the control condition. It is also possible that a stepped wedge trial design is a combination of both. The focus of the present article is on stepped wedge trial designs that are (partly) longitudinal in nature.

The most important issue to be considered in the analysis of data from a longitudinal stepped wedge trial design is the
                  Time
Arm(s)    1    2    3    4    5    6
  1       0    X    X    X    X    X
  2       0    0    X    X    X    X
  3       0    0    0    X    X    X
  4       0    0    0    0    X    X
  5       0    0    0    0    0    X

0 = control condition; X = intervention condition

Fig. 1. Schematic illustration of a cluster-stepped wedge design with five arms and six repeated measurements.
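The allocation in Fig. 1 follows a simple rule: arm a remains under the control condition for its first a measurements and is under the intervention thereafter. As an illustration (this generator is not from the original paper), the schedule can be built programmatically:

```python
def stepped_wedge_schedule(n_arms, n_times):
    """Build the 0/X allocation grid of a stepped wedge design in which
    one new arm starts the intervention after each measurement:
    arm a (1-based) is '0' (control) for its first a measurements
    and 'X' (intervention) afterwards."""
    return [["X" if t >= a else "0" for t in range(n_times)]
            for a in range(1, n_arms + 1)]

# Reproduce the five-arm, six-measurement layout of Fig. 1
for arm, row in enumerate(stepped_wedge_schedule(5, 6), start=1):
    print(arm, " ".join(row))
```

The same function also generates the four-arm, five-measurement layout of the ACT trial with `stepped_wedge_schedule(4, 5)`.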
one-way crossover nature of the design. Because of that, the effect of the intervention is measured partly within the subject (each subject moves at a certain point in time from the control to the intervention condition) and partly between subjects (at a certain point in time, the intervention group can be compared with the control group). Ideally, these two aspects of the effect should be combined in the analysis. To do so, data from a stepped wedge trial design must be analyzed with a method that is capable of combining these effects, that is, a mixed model analysis [9]. Therefore, in the present article, only variations of mixed models will be considered as appropriate ways to analyze data from stepped wedge trial designs.

Besides the combination of the within- and between-subject effects, the time variable can also play an important role in the analysis of data from a stepped wedge trial design. In a classical RCT, the time variable is of no interest because the control and the intervention group are measured at the same time points; that is, the intervention variable is time independent [10], and therefore, adjustment for time cannot influence the estimated intervention effect. In a stepped wedge trial design, this is different because the intervention variable is time dependent, and an adjustment for time can influence the estimated intervention effect.

Finally, it should be evaluated whether an adjustment for baseline differences in the outcome variable should be made. Although in classical RCTs there is an ongoing debate about whether an adjustment for the baseline value of the outcome variable is necessary to obtain a valid estimate of the intervention effect [11–13], most researchers argue that this adjustment should always be made [14]. In the present article, we will also take this issue into consideration regarding a stepped wedge trial design.
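The mixed model referred to above can be written down generically. As a sketch (the authors' exact specifications are given in Appendix B; the symbols used here are illustrative), a three-level model with a single time-dependent intervention indicator and random intercepts at the cluster and subject level is:

```latex
Y_{ijt} = \beta_0 + \beta_1 X_{ijt} + v_j + u_{ij} + \varepsilon_{ijt},
\qquad
v_j \sim N(0, \sigma_v^2),\quad
u_{ij} \sim N(0, \sigma_u^2),\quad
\varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2)
```

where Y_ijt is the outcome of subject i in cluster j at measurement t, X_ijt the intervention indicator, v_j the cluster-level random intercept, and u_ij the subject-level random intercept.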
To contribute to the debate regarding the analysis of stepped wedge trial data, the purpose of this article is to compare several statistical methods that can be used to analyze data from a stepped wedge trial design, using two example data sets.
2. Methods

2.1. Data sets

2.1.1. The ACT trial

The frail older adults: care in transition (ACT) trial was based on a geriatric care model and ran over a 24-month
period [15]. Primary care practices in the intervention group delivered care according to the geriatric care model, whereas practices in the control group offered usual care. The stepped wedge trial was conducted among 1,147 frail older adults in 35 primary care practices in the Netherlands, which were randomized over four arms that started the intervention at different time points. Outcome measurements were administered at baseline and at 6, 12, 18, and 24 months. The data collection took place between May 2010 and March 2013. The primary outcome of the study was quality of life as measured by the 12-item Short Form questionnaire (SF-12). The SF-12 measures quality of life in two domains: a mental health component score (MCS) and a physical health component score. In the present study, only the MCS score was used.

2.1.2. The Grip trial

One of the aims of the Grip on Challenging Behaviour trial was to evaluate the effectiveness of a multidisciplinary care program for managing challenging behavior of residents with dementia residing in nursing homes [16]. In this clustered stepped wedge trial, 659 residents participated. They came from 17 participating units, which were randomly divided over five arms. Six measurements took place: one baseline measurement and a measurement every 4 months during a period of 20 months (February 2010–October 2011). After each measurement cycle, except the last one, a new arm started the intervention. Challenging behavior, measured with the Cohen-Mansfield Agitation Inventory (CMAI), was used as the main effect measure in the present study. The CMAI consists of 29 questions regarding agitated behavior. Each behavior can be scored from absent (score = 1) up to occurring several times per hour (score = 7), resulting in a range of 29–203.

2.2. Statistical methods

Four different mixed model analyses were used to analyze the data from the two example data sets.
The four methods were chosen because they have been (frequently) used in practice and are all potentially useful for the analysis of data from a stepped wedge trial design [7,8,17]. Method 1 compares all the intervention measurements with the control measurements. With this approach, the intervention variable is a time-dependent dichotomous variable. The estimated effect of the intervention reveals the difference between all the measurements after an intervention period and all the measurements after a control period. Because the intervention effect is reflected in one number, this method does not provide an answer to the question whether long-term exposure to the intervention differs from short-term exposure. Methods 2 and 3 do make such a distinction. The difference between methods 2 and 3 is that in method 2, the intervention variable is a time-independent categorical variable
comparing the different arms with each other. Each arm is a different combination of intervention and control measurements. In method 3, the intervention variable is a time-dependent categorical variable comparing groups with different numbers of intervention measurements with all the control measurements. The number of intervention measurements reflects the amount of time a particular subject has received the intervention. Method 3 is basically an extension of method 1, in which the intervention group from method 1 is divided into subgroups defined according to the length of the intervention. The figure in Appendix A at www.jclinepi.com illustrates one of the example data sets and the way the different intervention variables were defined. The last method (method 4) differs in that, instead of the observed values of the outcome variable at the different measurement occasions, the changes in the outcome variable between subsequent measurements are analyzed. These transitions are then compared between three transition groups: (1) subjects moving from control condition to control condition, (2) subjects moving from control condition to intervention condition, and (3) subjects moving from intervention condition to intervention condition. Because changes between subsequent measurements are used as outcome in this method, the within-subject correlation reduces to almost zero [10]. In Appendix B at www.jclinepi.com, the statistical models behind the four methods are presented.

For all methods, four analyses were performed. First, a crude analysis was performed, that is, an analysis without any adjustments besides the relevant adjustments for the within-patient and the within-cluster correlation. Second, an adjustment was made for follow-up time (i.e., the time that a particular subject is in the study). In these analyses, follow-up time was treated as a categorical variable represented by dummy variables.
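The four intervention variables can be made concrete with a small coding sketch. This is illustrative only (the names and the one-new-arm-per-measurement layout are assumptions, not the authors' code): given a subject's arm and the measurement occasion, methods 1 to 3 derive their exposure variables, and method 4 its transition category.

```python
def intervention_coding(arm, t):
    """Intervention variables for a subject in `arm` (1-based) at
    measurement occasion `t` (1-based), assuming one new arm starts
    the intervention after each measurement, as in Fig. 1."""
    exposed = t > arm                    # under intervention at occasion t?
    return {
        "method1": int(exposed),         # time-dependent dichotomous
        "method2": arm,                  # time-independent categorical (arm)
        "method3": max(0, t - arm),      # number of intervention measurements
    }

def transition_group(arm, t):
    """Method 4 category for the change score between occasions t-1 and t."""
    before, after = t - 1 > arm, t > arm
    return {(False, False): "C-C",       # control -> control
            (False, True): "C-I",        # control -> intervention
            (True, True): "I-I"}[(before, after)]
```

For example, a subject in arm 1 is coded as control at baseline, makes the control-to-intervention transition at the second occasion, and contributes intervention-to-intervention transitions thereafter.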
In the third analysis, an adjustment was made for the baseline value of the outcome variable, and in the fourth analysis, adjustments for both follow-up time and the baseline value of the outcome variable were performed. The adjustment for the baseline value was achieved by discarding the baseline observations from the outcome and using them as a time-independent covariate in the analyses. Because in method 4 the changes between subsequent measurements were used as outcome, the baseline value was not part of the outcome in the crude analyses for this method, so the baseline value was simply added as a time-independent covariate to the model. All analyses were performed with a mixed model analysis to adjust both for the dependency of the repeated measures within the patient and for the dependency of the measurements of a particular patient within the primary care practice (for the ACT trial) or the participating nursing home (for the GRIP trial). MLwiN (version 2.27, Bristol, UK) was used for all mixed model analyses [18].
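The baseline-adjustment step described above, discarding the baseline observations from the outcome and carrying them along as a time-independent covariate, amounts to a simple restructuring of the long-format data. A minimal sketch (field names are illustrative, not the authors' variable names):

```python
def baseline_as_covariate(records):
    """records: long-format rows {'id', 'time', 'y'}, with time == 0 the
    baseline measurement. Returns the follow-up rows only, each with the
    subject's baseline outcome attached as the covariate 'y0'."""
    baseline = {r["id"]: r["y"] for r in records if r["time"] == 0}
    return [dict(r, y0=baseline[r["id"]])
            for r in records if r["time"] != 0]

rows = [{"id": 1, "time": 0, "y": 49.9},
        {"id": 1, "time": 6, "y": 50.7},
        {"id": 1, "time": 12, "y": 51.4}]
print(baseline_as_covariate(rows))
```

The follow-up rows then enter the mixed model as outcome, with `y0` as a fixed covariate.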
Table 1. Mean (standard deviation) of the MCS score for the different groups at each time point for the different methods (ACT trial)

                 T1           T2           T3           T4           T5
Method 1
  Intervention   —            50.7 (10.9)  51.4 (10.0)  52.0 (10.3)  53.2 (9.9)
  Control        49.9 (10.6)  50.4 (10.6)  53.0 (9.6)   53.8 (10.2)  —
Method 2
  Arm 1          49.1 (11.5)  50.7 (10.9)  50.6 (10.4)  52.1 (10.4)  52.8 (10.2)
  Arm 2          50.2 (9.4)   49.7 (11.2)  52.9 (9.1)   52.4 (9.4)   52.9 (9.8)
  Arm 3          50.3 (9.7)   49.5 (10.6)  52.5 (9.2)   51.4 (10.7)  53.1 (10.0)
  Arm 4          50.8 (10.6)  51.9 (9.8)   53.7 (10.0)  53.8 (10.2)  54.5 (9.0)
Method 3
  0 Months       49.9 (10.6)  50.4 (10.6)  53.0 (9.6)   53.8 (10.2)  —
  6 Months       —            50.7 (10.9)  52.9 (9.1)   51.4 (10.7)  54.5 (9.0)
  12 Months      —            —            50.6 (10.4)  52.4 (9.4)   53.1 (10.0)
  18 Months      —            —            —            52.1 (10.4)  52.9 (9.8)
  24 Months      —            —            —            —            52.8 (10.2)
Method 4
  C–C            —            0.66 (10.2)  1.97 (10.4)  0.39 (8.1)   —
  C–I            —            1.06 (10.8)  2.24 (10.6)  0.78 (10.3)  0.46 (7.9)
  I–I            —            —            0.13 (9.5)   0.69 (9.1)   0.68 (9.3)

ICC patient level: 0.511; ICC cluster level: 0.059

Abbreviations: MCS, mental health component score; C–C, control–control transition; C–I, control–intervention transition; I–I, intervention–intervention transition; ICC, intraclass correlation coefficient.
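The intraclass correlation coefficients at the bottom of Table 1 summarize how the outcome variance splits over the three levels of the model. Under the usual variance-components definition (a sketch: the component values below are chosen to reproduce the reported ICCs and are not taken from the paper):

```python
def iccs(var_cluster, var_subject, var_residual):
    """ICCs for measurements nested in subjects nested in clusters:
    the proportion of total variance shared at or above each level."""
    total = var_cluster + var_subject + var_residual
    return {"cluster": var_cluster / total,
            "patient": (var_cluster + var_subject) / total}

# Illustrative variance components on a total of 1.0
print(iccs(0.059, 0.452, 0.489))
```

Any decomposition with the same ratios yields the reported patient-level ICC of 0.511 and cluster-level ICC of 0.059.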
3. Results

3.1. ACT trial

Table 1 shows the development over time of the observed MCS score for the different groups as they were defined for the different methods, and the intraclass correlation coefficients for both the individual and the cluster level. The table shows a general increase over time in the MCS score, independent of the intervention. Table 2 shows the results of the different analyses. Looking at the crude results, the estimated intervention effects differed between the four methods. Methods 1 and 3 showed a strong positive intervention
effect, whereas method 2 showed an inverse intervention effect, which was significant for the comparison between arm 1 (the arm that starts the intervention first) and arm 4 (the arm that starts the intervention last). Method 4 (comparing the changes between subsequent time points between the different transition states) did not show a significant effect at all. Adjusting for time especially influenced the results of methods 1 and 3: the significant intervention effects observed in the unadjusted analyses (totally) disappeared after the adjustment for time. The adjustment for the baseline value of the outcome led, in general, to a slight decrease in the estimated intervention effects.
Table 2. Regression coefficients and standard errors of the effect estimates of the ACT trial derived from four different methods

                              Crude(a)        Adjusted        Adjusted        Adjusted for time
                                              for time        for baseline    and baseline
Method 1(b)
  Intervention                1.76 (0.25)     0.05 (0.40)     1.37 (0.34)     0.09 (0.41)
Method 2(c)
  Arm 1                       -1.84 (0.77)    -1.81 (0.77)    -0.80 (0.67)    -0.78 (0.67)
  Arm 2                       -1.04 (0.91)    -1.07 (0.90)    -0.96 (0.78)    -0.97 (0.77)
  Arm 3                       -1.53 (0.90)    -1.50 (0.89)    -1.29 (0.78)    -1.26 (0.77)
Method 3(b)
  6 Months                    1.41 (0.31)     0.22 (0.42)     1.05 (0.37)     0.06 (0.44)
  12 Months                   1.55 (0.36)     0.60 (0.55)     1.35 (0.42)     0.65 (0.58)
  18 Months                   2.51 (0.42)     0.03 (0.71)     2.28 (0.49)     0.15 (0.76)
  24 Months                   3.08 (0.52)     0.22 (0.95)     3.03 (0.59)     0.25 (1.01)
Method 4(d)
  Control–intervention        0.48 (0.47)     0.46 (0.48)     0.31 (0.47)     0.25 (0.48)
  Intervention–intervention   0.16 (0.42)     0.20 (0.53)     0.08 (0.42)     0.46 (0.53)

(a) An analysis without any adjustments besides the relevant adjustments for the within-patient and the within-cluster correlation.
(b) Compared with the control condition.
(c) Compared with arm 4 (the arm with the most control measurements).
(d) Compared with control–control transition.
Table 3. Mean (standard deviation) of the CMAI_log score for the different groups at each time point for the different methods (GRIP trial)

                 T1           T2           T3           T4           T5           T6
Method 1
  Intervention   —            3.78 (0.35)  3.89 (0.34)  3.87 (0.33)  3.85 (0.32)  3.87 (0.35)
  Control        3.87 (0.34)  3.94 (0.34)  3.91 (0.35)  3.91 (0.35)  3.94 (0.41)  —
Method 2
  Arm 1          3.84 (0.34)  3.78 (0.35)  3.87 (0.36)  3.87 (0.35)  3.82 (0.32)  3.81 (0.31)
  Arm 2          3.84 (0.29)  3.89 (0.30)  3.92 (0.32)  3.88 (0.33)  3.84 (0.34)  3.84 (0.35)
  Arm 3          3.87 (0.32)  3.93 (0.30)  3.89 (0.33)  3.87 (0.31)  3.83 (0.29)  3.79 (0.30)
  Arm 4          3.89 (0.37)  3.90 (0.32)  3.83 (0.32)  3.90 (0.34)  3.98 (0.34)  3.85 (0.30)
  Arm 5          3.90 (0.39)  4.07 (0.44)  4.00 (0.38)  3.92 (0.36)  3.94 (0.41)  4.10 (0.39)
Method 3
  0 Months       3.87 (0.34)  3.94 (0.34)  3.91 (0.35)  3.91 (0.35)  3.94 (0.41)  —
  4 Months       —            3.78 (0.35)  3.92 (0.32)  3.87 (0.31)  3.98 (0.34)  4.10 (0.39)
  8 Months       —            —            3.87 (0.36)  3.88 (0.33)  3.83 (0.29)  3.85 (0.30)
  12 Months      —            —            —            3.87 (0.35)  3.84 (0.34)  3.79 (0.30)
  16 Months      —            —            —            —            3.82 (0.32)  3.84 (0.35)
  20 Months      —            —            —            —            —            3.81 (0.31)
Method 4(a)
  C–C            —            3.33 (15.3)  2.08 (16.6)  0.46 (21.2)  2.79 (22.8)  —
  C–I            —            2.97 (15.5)  2.44 (14.1)  0.96 (16.3)  3.00 (14.7)  5.37 (17.2)
  I–I            —            —            5.58 (12.4)  0.49 (15.0)  2.43 (14.4)  1.47 (15.5)

ICC patient level: 0.600; ICC cluster level: 0.093

Abbreviations: C–C, control–control transition; C–I, control–intervention transition; I–I, intervention–intervention transition.
(a) For method 4, the changes between subsequent measurements were calculated for the observed values instead of the log-transformed values.
3.2. Grip trial

Table 3 shows the development over time of observed behavioral problems for the different groups as they were defined for the different methods, and the two intraclass correlation coefficients. Because of a skewed distribution of the CMAI score, a natural log transformation was used in all analyses except for method 4. From Table 3, it can be seen that the CMAI_log score stays almost constant over time. Table 4 shows the results of the different analyses.
As in the ACT trial, the estimated intervention effects of the Grip trial differ between the four methods. In the crude analyses, both method 2 (comparing the randomization arms) and method 4 (comparing the differences between subsequent measurements) showed significant intervention effects, whereas the other two methods did not. Adjusting for time attenuated the significant intervention effect obtained in method 4 for the intervention–intervention condition, which is, after adjustment,
Table 4. Regression coefficients and standard errors of the effect estimates of the GRIP trial derived from four different methods

                              Crude(a)         Adjusted         Adjusted         Adjusted for time
                                               for time         for baseline     and baseline
Method 1(b)
  Intervention                0.007 (0.012)    0.016 (0.017)    0.002 (0.016)    0.032 (0.021)
Method 2(c)
  Arm 1                       -0.208 (0.068)   -0.209 (0.067)   -0.153 (0.061)   -0.152 (0.062)
  Arm 2                       -0.174 (0.063)   -0.174 (0.062)   -0.099 (0.057)   -0.100 (0.057)
  Arm 3                       -0.186 (0.063)   -0.187 (0.063)   -0.105 (0.057)   -0.105 (0.057)
  Arm 4                       -0.119 (0.068)   -0.120 (0.067)   -0.120 (0.062)   -0.125 (0.063)
Method 3(b)
  4 Months                    0.028 (0.014)    0.009 (0.019)    0.011 (0.018)    0.025 (0.024)
  8 Months                    0.003 (0.016)    0.018 (0.024)    0.002 (0.021)    0.029 (0.031)
  12 Months                   0.022 (0.017)    0.048 (0.031)    0.028 (0.023)    0.011 (0.040)
  16 Months                   0.021 (0.022)    0.054 (0.039)    0.025 (0.029)    0.031 (0.052)
  20 Months                   0.033 (0.033)    0.079 (0.053)    0.070 (0.043)    0.025 (0.071)
Method 4(d)
  Control–intervention        0.061 (1.138)    0.352 (1.193)    0.170 (1.250)    0.160 (1.300)
  Intervention–intervention   -2.135 (0.890)   -1.274 (1.192)   -1.920 (1.028)   -0.590 (1.358)

(a) An analysis without any adjustments besides the relevant adjustments for the within-patient and the within-cluster correlation.
(b) Compared with the control condition.
(c) Compared with arm 5 (the arm with the most control measurements).
(d) Compared with control–control transition.
no longer significant. For methods 1 and 3, the adjustment for time also influenced the regression coefficients. As expected, for method 2, the adjustment for time did not lead to a change in the magnitude of the regression coefficients. The adjustment for the baseline value of the outcome attenuated the intervention effect in most analyses. Surprisingly, the estimated crude intervention effect from the analysis with the dichotomous intervention variable (method 1) does not seem to correspond with the observed difference between the intervention and control measurements shown in Table 3. Table 3 shows that the intervention measurements were lower than the control measurements over the whole follow-up period, whereas the crude intervention effect estimated with method 1 reveals a slightly higher value for the intervention measurements. This difference arises because in Table 3 no adjustment is made for the dependency of the repeated measurements within the patient. When an analysis was performed without this adjustment, a strong significant intervention effect was observed: the corresponding regression coefficient was -0.042, with a standard error of 0.014 and a P-value <0.01. However, when an adjustment was made for the dependency of the repeated measurements within the patient (i.e., when a random intercept on the patient level was added to the model), the difference between the intervention and the control condition almost totally disappeared (regression coefficient 0.004, standard error 0.011). A further adjustment for the clustering of the patient observations within the nursing home led to the reported regression coefficient of 0.007.
4. Discussion

In the present article, four methods to estimate the effect of an intervention in a stepped wedge trial design were compared with each other. The key issue is that the different methods analyze different aspects of the intervention effect, and therefore, the results obtained from the four methods are different. Besides that, it was shown that an adjustment for time and/or an adjustment for baseline values influenced the results of these analyses to a different extent for the four methods.

As mentioned previously, different methods analyze different aspects of the intervention effect. With method 1, in which all intervention measurements were compared with all control measurements, information about the length of the intervention is not taken into account. Methods 2 and 3, on the other hand, try to estimate the effect of the length of the intervention. Although the purpose of the two methods is comparable, the results were quite different. The results obtained from method 2 can be interpreted as the average difference between the arms with different intervention durations. In fact, these differences reflect the between-subject part of the intervention effect.
The results obtained from method 3 reveal a more direct effect of the different intervention durations and reflect both the between- and within-subject parts of the intervention effect. Although this is an advantage, method 3 also has a disadvantage, namely the reduction of the number of subjects with a longer duration of the intervention. This reduces the power of the analysis and makes the method more vulnerable to random fluctuations. Method 4, in which the changes between subsequent measurements were analyzed, mostly captures the within-subject part of the intervention effect [10].

Power issues can play a role in stepped wedge trial designs. As mentioned previously, in method 3, when the number of patients (i.e., clusters) in one of the comparison groups becomes small, the power of the comparison is reduced. Although this is recognized by others as well [3,5], it is also argued that a stepped wedge design requires a lower sample size compared with a classical cluster RCT [6].

Because different methods analyze different aspects of the intervention effect, in a full analysis the different results should be combined into a meaningful overall result. For instance, regarding the ACT trial, we can say that the outcome variable increases over time irrespective of the intervention, that there is neither a short-term nor a long-term intervention effect, and that the observed differences between the stepped wedge arms on average over time are due to the differences at baseline.

4.1. Should we adjust for time in a stepped wedge trial design?

One of the reasons why a stepped wedge trial design is chosen is the possibility to detect a potential time-dependent intervention effect, for instance due to a learning effect within the system that provides the intervention.
However, it must be realized that the stepped wedge trial design is also used in situations where such a time-dependent intervention effect is not expected, for example, stepped wedge trial designs involving infectious diseases [19,20]. Regarding an adjustment for time, Hussey and Hughes [21] claim that effect estimates derived from a stepped wedge trial design are biased when time is not included in the model. This makes sense when there is either an increase or a decrease over time in the outcome variable independent of the intervention. This is different from a classical RCT, in which an adjustment for time is not necessary: because the intervention and control groups are measured at the same time points, time is not related to the group variable. Based on the definition of confounding (i.e., a possible confounder must be related to both the outcome variable and the independent variable), time cannot be a confounder in a classical RCT. In a stepped wedge trial design, the situation is different because the intervention variable is related to time: when time increases, the number of patients receiving the intervention increases. So when time is also associated with the outcome
variable (i.e., when there is either a decrease or an increase over time), time can be a confounder. The influence of time on the estimation of the effect of the intervention was nicely illustrated in the ACT trial. Within this trial, an increase over time was observed in the whole population, irrespective of whether the patient received the intervention or not. Because at the end of the follow-up period there are more patients in the intervention group (due to the stepped wedge trial design), this increase over time is wrongly attributed to the intervention when time is not taken into account. Adjusting for time leads to a huge decrease in the estimated intervention effect. However, when the randomization arms are compared with each other (method 2), an adjustment for time is not really necessary. This is because belonging to a particular arm is not related to time, and therefore time cannot be a confounder. The small differences observed in the example data sets between the analyses of the arms with and without an adjustment for time are due to an increase in missing values over time and the selectiveness of the missing data. In a full data set (i.e., without any missing data), the results of method 2, that is, the analysis comparing the different arms with each other, would be the same with or without an adjustment for time. However, not including time can lead to inflated variance estimates. In addition, in the transition method (method 4), an adjustment for time makes sense because in this method, the independent variable (i.e., the transition group) is also related to time. Although the adjustment for time makes sense in the methods in which the group variable is related to time, one should be careful with the interpretation of the results. With method 1, for instance, at the first measurement all subjects receive the control condition, whereas at the last measurement all subjects receive the intervention.
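The confounding mechanism described here is easy to demonstrate with a deterministic toy example (not the trial data): give every subject a pure time trend and a null intervention effect, and the naive method 1 contrast inherits the trend, whereas a within-occasion (time-adjusted) contrast does not.

```python
# Toy stepped wedge: 4 arms, 5 occasions; arm a is under control for its
# first a occasions. The outcome rises 1 unit per occasion for everyone,
# and the intervention itself has no effect.
n_arms, n_times = 4, 5
rows = [{"arm": a, "t": t, "on": t > a, "y": float(t)}
        for a in range(1, n_arms + 1) for t in range(1, n_times + 1)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive method 1 contrast: intervention measurements sit later in time,
# so the secular trend is wrongly attributed to the intervention.
naive = (mean(r["y"] for r in rows if r["on"])
         - mean(r["y"] for r in rows if not r["on"]))

# Time-adjusted contrast: compare within each occasion at which both
# conditions are observed (t = 1 is all control, t = 5 all intervention).
adjusted = mean(mean(r["y"] for r in rows if r["t"] == t and r["on"])
                - mean(r["y"] for r in rows if r["t"] == t and not r["on"])
                for t in range(2, n_times))

print(naive, adjusted)  # the naive contrast is positive, the adjusted one is 0
```

In this construction the naive contrast equals the average time gap between intervention and control measurements, exactly the "wrongly allocated" trend of the ACT example.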
When an adjustment for time is performed in the analysis, this will lead to an inefficient estimation of the intervention effect. Basically, the same holds for method 3, in which subjects with different lengths of the intervention are compared with each other. In this method, especially the estimation of the intervention effect for the group with the longest duration of the intervention is unreliable. This is because this group is only measured at the last follow-up measurement, whereas the control condition is not measured at the last measurement. In the analysis adjusted for time, this leads to a sort of "empty cell" problem, which produces an unreliable result. One of the problems with the analysis of data from a stepped wedge trial design is that the usefulness of a method depends on the time dependency of the effect of the intervention. When the effect of the intervention is time independent, all methods provide valid results; only the interpretation of the estimated intervention effect differs, and for methods 1, 3, and 4, an adjustment for time is necessary when there is an increase or decrease in the outcome variable over time independent of the intervention. However, when, for instance, the effect of starting the intervention differs at different time points, the estimation of the effect of the intervention
in methods 3 and 4 will be biased (i.e., either underestimated or overestimated). To evaluate the time dependency of the effect of starting the intervention, an interaction with time can be added to the model.

4.2. Should we adjust for the baseline value of the outcome variable in a stepped wedge trial design?

In a classical RCT, an adjustment for baseline differences is performed to adjust for possible regression to the mean. It is argued that this is necessary because the two groups to be compared are taken from the same source population and the differences observed at baseline are due to random fluctuations and measurement error [10,11]. When these differences are not taken into account, this can lead to either an overestimation or an underestimation of the intervention effect. Within a stepped wedge trial design, basically the same arguments can be used. However, it should then be considered whether the intervention variable is analyzed as a time-dependent or a time-independent variable. In method 2, in which the arms were compared with each other, the intervention variable (i.e., the arm) is analyzed as time independent. Then basically the same rules can be applied as for a classical RCT: the arms are randomized, and the differences between the arms at baseline are due to random fluctuations and measurement error. Therefore, in method 2, where the arms are analyzed directly, an adjustment for the baseline value of the outcome variable must be applied. In the other three methods, the intervention variable is analyzed as a time-dependent variable. Then, the question whether an adjustment for baseline differences should be made depends on whether the patients in the groups that are compared with each other differ from the patients in the control condition. In method 3, for instance, the patients are defined according to the length of the intervention.
Suppose that the patients who start the intervention first have lower values of the outcome variable at baseline compared with the other patients. The patients who start the intervention first are the only patients with the longest length of the intervention, so these baseline differences should be taken into account to obtain a valid estimate of the intervention effect. The same holds for method 4, in which the groups are defined according to the transition status. In this case, the patients who start the intervention first do not have a control–control transition, whereas the patients who start the intervention last do not have an intervention–intervention transition. For method 1 (comparing all the intervention measurements with all the control measurements), the situation is slightly more complicated. All patients who receive the intervention have a control measurement at baseline. However, because the number of intervention measurements differs between the patients, it is also possible that baseline differences influence the estimation of the intervention effect.

It should be realized that in the present study, an adjustment was made for the baseline value on the subject level and not on the cluster level. Besides that, in the present study, no adjustments were made for other covariates besides the baseline value of the outcome variable and/or time. However, the way this adjustment is done does not differ between the methods discussed in the present study. It should further be realized that in the first three methods, the results obtained from the analysis adjusted for the baseline value of the outcome variable cannot be compared directly with the results obtained from the analyses without that adjustment. This is because, without an adjustment for the baseline value of the outcome variable, the first measurement is treated as part of the outcome. In the transition method (method 4), however, the data sets used in the analyses with and without an adjustment for the baseline value of the outcome variable were identical, so for method 4, the results of the different analyses can be compared directly with each other.

It is sometimes argued that model fit indicators should decide which of the methods should be used. This can be done for the comparison between nested models, that is, the choice between methods 1 and 3. Based on the likelihood ratio test, one can decide which of the methods is the most appropriate. The likelihood ratio test can be used for other purposes as well: within method 3, for instance, it can be used to answer the question whether the time variable could be treated as continuous, assuming a linear development over time, instead of categorical. Although this argument makes sense, and it is indeed true that the likelihood ratio test (or fit indicators in general) can decide which of the methods or models is statistically the most appropriate, the relevance of the interpretation of the intervention effect should rather decide which of the methods should be used.

Table 5. Considerations for the different methods used to analyze data from a stepped wedge trial design

| Pros and cons | Method 1 | Method 2 | Method 3 | Method 4 |
| --- | --- | --- | --- | --- |
| Between-subjects and/or within-subject effects | Both | Only between | Both | Mostly within |
| Is it necessary to adjust for time? | Yes | No | Yes | Yes |
| Possibility to analyze influence of length of intervention | No | Yes | Yes | No |
| Is it necessary to adjust for the baseline value? | As in classical RCT | As in classical RCT | As in classical RCT | As in classical RCT |
| Assuming time-independent intervention effect of starting the intervention | Yes | No | Yes | Yes |
| Possibility to detect delay in treatment effect | No | Yes | Yes | Yes (partly) |
| Different number of intervention effects for different design layouts | No | Yes | Yes | No |

Abbreviation: RCT, randomized controlled trial.
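To make the four codings of the intervention variable concrete, the sketch below constructs them for a hypothetical three-arm stepped wedge layout with four measurement waves (an all-control baseline plus three steps, arm k crossing over just before wave k). This is an illustrative toy layout and not the designs or code used in the article.

```python
# Hypothetical three-arm stepped wedge layout (not the article's data):
# four measurement waves t = 0..3; arm k crosses over to the
# intervention just before wave k, so wave 0 is an all-control baseline
# and wave 3 is an all-intervention measurement.

ARMS, WAVES = 3, 4

def on_intervention(arm: int, t: int) -> bool:
    """True when `arm` has already crossed over at wave `t`."""
    return t >= arm

cells = [(a, t) for a in range(1, ARMS + 1) for t in range(WAVES)]

# Method 1: time-dependent dichotomous indicator
# (all intervention measurements vs. all control measurements)
method1 = {(a, t): int(on_intervention(a, t)) for a, t in cells}

# Method 2: time-independent categorical variable: the arm itself
method2 = {(a, t): f"arm{a}" for a, t in cells}

# Method 3: time-dependent categorical variable: the number of
# intervention measurements received so far (0 = still under control)
method3 = {(a, t): sum(on_intervention(a, s) for s in range(t + 1))
           for a, t in cells}

# Method 4: transition status between subsequent waves:
# "CC" (control -> control), "CI" (crossover), "II"
def transition(arm: int, t: int) -> str:
    return ("I" if on_intervention(arm, t - 1) else "C") + \
           ("I" if on_intervention(arm, t) else "C")

method4 = {(a, t): transition(a, t)
           for a in range(1, ARMS + 1) for t in range(1, WAVES)}

# The "empty cell" noted in the text: the longest intervention length
# (3) occurs only at the last wave, where no control is observed.
print(method3[(1, 3)], method4[(1, 1)], method4[(3, 1)])  # -> 3 CI CC
```

Note that in this layout, arm 1 never contributes a control–control transition and arm 3 never contributes an intervention–intervention transition, which mirrors the baseline-difference concern raised above for method 4.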
5. Conclusion

It is important to realize that the analysis of data from a stepped wedge trial design is not as straightforward as for a classical (cluster) RCT. Different methods reveal different aspects of a possible intervention effect, and the choice of a method partly depends on the type of intervention and the possible time-dependent effect of the intervention. Table 5 summarizes the pros and cons of the different methods discussed in the present article. When the effect of the intervention is time independent, the results of method 3 probably give the best estimation of the intervention effect, because this method combines the between-subjects and within-subject effects and takes the length of the intervention into account. However, when the effect of starting the intervention is time dependent, method 3 does not give a valid estimation of the effect of the intervention. The most stable results are obtained from method 2, especially because its results are not influenced by time. The disadvantage, however, is that with method 2, only the between-subjects part of the intervention effect is estimated. The possible adjustment for baseline differences in the outcome variable does not differ from the way this is done in a classical (cluster) RCT. Furthermore, it is advised to combine the results of the different methods to obtain an interpretable overall result.
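The need to adjust method 1 for time when the outcome drifts independently of the intervention can be illustrated with a small deterministic sketch. The numbers are invented, and the "adjustment" is deliberately simplified to stratifying on wave rather than fitting a mixed model: with a secular upward trend and a true intervention effect of zero, the crude method 1 contrast is positive, whereas the within-wave contrast correctly recovers zero.

```python
# Toy three-arm stepped wedge (hypothetical numbers, no random error):
# arm k crosses over just before wave k; cluster means follow a secular
# trend of +1.0 per wave, and the true intervention effect is 0.0.
TREND, TRUE_EFFECT = 1.0, 0.0

def on_intervention(arm, t):
    return t >= arm

def outcome(arm, t):
    return 10.0 + TREND * t + TRUE_EFFECT * on_intervention(arm, t)

cells = [(a, t) for a in range(1, 4) for t in range(4)]

def mean(xs):
    return sum(xs) / len(xs)

# Method 1 without adjustment for time: all intervention measurements
# vs. all control measurements -- confounded by the secular trend.
naive = (mean([outcome(a, t) for a, t in cells if on_intervention(a, t)])
         - mean([outcome(a, t) for a, t in cells if not on_intervention(a, t)]))

# A crude stand-in for "adjusting for time": compare the conditions
# within each wave at which both are observed (waves 1 and 2 here).
within_wave = [
    mean([outcome(a, t) for a in range(1, 4) if on_intervention(a, t)])
    - mean([outcome(a, t) for a in range(1, 4) if not on_intervention(a, t)])
    for t in (1, 2)
]
adjusted = mean(within_wave)

print(round(naive, 2), round(adjusted, 2))  # -> 1.67 0.0
```

The bias disappears in the within-wave contrast because, within a wave, all clusters share the same value of the trend; in the article's setting, a mixed model with time included as a covariate plays the same role.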
Supplementary data Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.jclinepi.2015.11.004.
References

[1] Kotz D, Spigt M, Arts ICW, Crutzen R, Viechtbauer W. Researchers should convince policy makers to perform a classic cluster randomised controlled trial instead of a stepped wedge design when an intervention is rolled out. J Clin Epidemiol 2012;65:1255–6.
[2] Kotz D, Spigt M, Arts ICW, Crutzen R, Viechtbauer W. Use of the stepped wedge design cannot be recommended: a critical appraisal and comparison with the classic cluster randomised controlled trial design. J Clin Epidemiol 2012;65:1249–52.
[3] Hemming K, Girling A. The efficiency of stepped wedge vs. cluster randomised trials: stepped wedge studies do not always require a smaller sample size. J Clin Epidemiol 2013;66:1427–9.
[4] Hemming K, Girling A, Martin J, Bond SJ. Stepped wedge cluster randomised trials are efficient and provide a method of evaluation without which some interventions would not be evaluated. J Clin Epidemiol 2013;66:1058–9.
[5] Kotz D, Spigt M, Arts IC, Crutzen R, Viechtbauer W. The stepped wedge design does not inherently have more power than a cluster randomized controlled trial. J Clin Epidemiol 2013;66:1059–60.
[6] Woertman W, de Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL, Teerenstra S. Stepped wedge designs could reduce the required sample size in cluster randomised trials. J Clin Epidemiol 2013;66:752–8.
[7] Brown CA, Lilford RJ. The stepped wedge trial design: a systematic review. BMC Med Res Methodol 2006;6:54.
[8] Mdege ND, Man M-S, Taylor CA, Torgerson DJ. Systematic review of stepped wedge cluster randomized trials shows that design is particularly used to evaluate interventions during routine implementation. J Clin Epidemiol 2011;64:936–48.
[9] Hemming K, Haines TP, Chilton PJ, Girling AJ, Lilford RJ. The stepped wedge cluster randomized trial: rationale, design, analysis and reporting. BMJ 2015;350:h391.
[10] Twisk JWR. Applied longitudinal data analysis for epidemiology. A practical guide. 2nd ed. Cambridge, UK: Cambridge University Press; 2013.
[11] Austin PC, Manca A, Zwarenstein M, Juurlink DN, Stanbrook MB. A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: a review of trials published in leading medical journals. J Clin Epidemiol 2010;63:142–53.
[12] Saquib N, Saquib J, Ioannidis JP. Practices and impact of primary outcome adjustment in randomized controlled trials: meta-epidemiologic study. BMJ 2013;347.
[13] Senn S. Seven myths of randomisation in clinical trials. Stat Med 2013;32:1439–50.
[14] Twisk J, Proper K. Evaluation of the results of a randomized controlled trial: how to define changes between baseline and follow-up. J Clin Epidemiol 2004;57:223–8.
[15] Muntinga ME, Hoogendijk EO, van Leeuwen KM, van Hout HPJ, Twisk JWR, van der Horst HE, et al. Implementing the chronic care model for frail older adults in the Netherlands: study protocol of ACT (frail older adults: care in transition). BMC Geriatr 2012;12:19.
[16] Zwijsen SA, Smalbrugge M, Zuidema SU, Koopmans RTCM, Bosmans JE, van Tulder MW, et al. Grip on challenging behaviour: a multidisciplinary care programme for managing behavioural problems in nursing home residents with dementia. Study protocol. BMC Health Serv Res 2011;11:41.
[17] Zwijsen SA, Smalbrugge M, Eefsting JA, Twisk JW, Gerritsen DL, Pot AM, et al. Coming to grips with challenging behavior: a cluster randomised controlled trial on the effects of a multidisciplinary care program for challenging behavior in dementia. J Am Med Dir Assoc 2014;15:e1–10.
[18] Goldstein H, Rasbash J, Plewis I, Draper D, Browne W, Yang M, et al. A user's guide to MLwiN. London: Institute of Education; 1998.
[19] Piszczek J, Partlow E. Stepped-wedge trial design to evaluate Ebola treatments. Lancet Infect Dis 2015;15:762–3.
[20] Shimakawa Y, Lemoine M, Mendy M, Njai HF, D'Alessandro U, Hall A, et al. Population-based interventions to reduce the public health burden related with hepatitis B virus infection in the Gambia, West Africa. Trop Med Health 2014;42:59–64.
[21] Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomised trials. Contemp Clin Trials 2007;28:182–91.