Cluster-randomized controlled trials evaluating complex interventions in general practices are mostly ineffective: a systematic review


Journal of Clinical Epidemiology 94 (2018) 85–96

Andrea Siebenhofer a,b,*, Michael A. Paulitsch a, Gudrun Pregartner c, Andrea Berghold c, Klaus Jeitler b,c, Christiane Muth a, Jennifer Engler a

a Institute of General Practice, Goethe University Frankfurt am Main, Theodor-Stern-Kai 7, Frankfurt am Main 60590, Germany
b Institute of General Practice and Evidence-based Health Services Research, Medical University of Graz, Auenbruggerplatz 2/9/IV, Graz 8036, Austria
c Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Auenbruggerplatz 2, Graz 8036, Austria

Accepted 17 October 2017; Published online 31 October 2017

Abstract

Objectives: The aim of this study was to evaluate how frequently complex interventions are shown to be superior to routine care in general practice-based cluster-randomized controlled studies (c-RCTs) and to explore whether potential differences explain results that come out in favor of a complex intervention.

Study Design and Setting: We performed an unrestricted search in the Central Register of Controlled Trials, MEDLINE, and EMBASE. Included were all c-RCTs that included a patient-relevant primary outcome in a general practice setting with at least 1-year follow-up. We extracted effect sizes, P-values, intracluster correlation coefficients (ICCs), and 22 quality aspects.

Results: We identified 29 trials with 99 patient-relevant primary outcomes. After adjustment for multiple testing on a trial level, four outcomes (4%) in four studies (14%) remained statistically significant. Of the 11 studies that reported ICCs, in 8, the obtained ICC was equal to or smaller than the assumed ICC. In 16 of the 17 studies with an available sample size calculation, effect sizes were smaller than anticipated.

Conclusion: More than 85% of the c-RCTs failed to demonstrate a beneficial effect on a predefined primary endpoint. All but one study were overly optimistic with regard to the expected treatment effect. This highlights the importance of weighing up the potential merit of new treatments and planning prospectively when designing clinical studies in a general practice setting. © 2017 Elsevier Inc. All rights reserved.

Keywords: Cluster-randomized controlled trial; General practice; Effectiveness; Complex intervention; Shortcomings; Systematic review

Registration: PROSPERO 2014: CRD42014009234.
Funding: This systematic and methodological review was funded by the German Federal Ministry of Education and Research (FKZ: 01KG1504), Germany.
Ethical approval: No ethical approval was sought.
Data sharing: For each study, all extracted items relating to the primary outcomes can be found in detail in the published SRDR. This link will be activated after publication of the manuscript.
Conflict of interest: All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf and declare: no support from any organization for the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years; and no other relationships or activities that could appear to have influenced the submitted work.
* Corresponding author. Tel.: +43 316 385-73558; fax: +43 316 385-79654. E-mail address: [email protected] (A. Siebenhofer).
https://doi.org/10.1016/j.jclinepi.2017.10.010
0895-4356/© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Cluster-randomized controlled trials (c-RCTs) are considered to be a suitable study design for examining patient-relevant clinical questions in primary care. A

c-RCT is characterized by the random assignment of groups of people from, for example, communities, families, or medical practices [1,2]. However, methodological shortcomings are common [3]. Publications such as the extended Consolidated Standards of Reporting Trials (CONSORT) statement [1,2] and the Ottawa Statement [4] were developed to address this problem. As most interventions in general practice are multifaceted and often have partly interacting components, they are considered to be complex, meaning that the study team has to consider many different aspects when designing the study [5,6]. Furthermore, study authors can be expected to be more or less convinced of the superiority of the new intervention when planning such an elaborate enterprise. However, in our feasibility project, which was restricted to journals relevant to general practice, only 33% of included studies showed statistically significant effects on a patient-relevant primary endpoint [7].


What is new?

Key findings
- A very low number of studies evaluating a complex intervention in a general practice setting showed superiority compared to routine care.
- Underestimates of intracluster similarities were unlikely to have been the principal reason behind the large number of negative findings.
- Anticipated effect sizes were mostly higher than those actually obtained in studies examining complex interventions vs. routine care.

What this adds to what was known?
- In a recently published systematic review of randomized controlled trials, only 50–60% demonstrated the superiority of new interventions when compared with standard treatment. In our review, more than 85% of cluster-randomized controlled studies (c-RCTs) failed to even demonstrate a beneficial effect on a predefined primary endpoint.

What is the implication and what should change now?
- Our findings highlight the importance of weighing up the potential merit of new treatments, considering the appropriateness of c-RCTs, or any other study design, and planning prospectively, when designing clinical studies in a general practice setting.

This was even less than in a recently published systematic review of randomized controlled trials (RCTs), which reported that 50–60% had demonstrated the superiority of new interventions compared to standard treatment [8]. At the Institute of General Practice at Goethe University Frankfurt, two recent c-RCTs [9,10] involving complex interventions were unable to demonstrate superiority, even though great effort was put into designing these trials. As a result, the current methodological project was initiated. The primary objective of our systematic and methodological review was to evaluate how frequently and to what extent complex interventions are shown to be superior to routine care in general practice-based c-RCTs. A further aim was to explore whether potential differences in methodological and other factors could explain results that come out in favor of a complex intervention.

2. Methods

The protocol and the results of the feasibility project for this systematic and methodological review were recently

published in BMJ Open [7]. The protocol was registered at PROSPERO [11].

2.1. Eligibility criteria

We conducted a systematic and methodological review of all published and available c-RCTs, independent of patient age, and with the general practice setting as the level of randomization. In a superiority trial, the intervention group had to have investigated a complex intervention in accordance with the recommendations of the latest Medical Research Council guidance [12], and the control group had to have received routine care. To prevent additional heterogeneity between studies arising from active comparators, the control group had to have continued to receive treatment as usual (routine care). For inclusion in our review, trialists either had to have explicitly defined the patient-relevant primary outcome(s) that they used as primary or main outcome(s) in a power and sample size calculation or to have listed it (these) as the main outcome(s) in their trial's objectives [13]. The primary outcome(s) had to be patient relevant, and detailed criteria for the assessment of the patient-relevant endpoints had to have been determined in accordance with the Institute for Quality and Efficiency in Health Care (IQWiG) methods version 4.2, which gives a concise and literature-based definition of what is meant by patient relevance [14]. In this connection, "patient relevant" is considered to refer to how a patient feels, functions, or survives, that is, whether indicators of mortality, morbidity, health-related quality of life, hospitalization, and/or treatment satisfaction are provided. Furthermore, data had to be provided on a patient level, studies had to be of at least 12 months' duration, and patient-relevant primary outcome(s) had to have been measured after a period of at least 12 months had elapsed. Language was not a criterion for exclusion. In addition to those used in the feasibility project, two further criteria were added: psychotherapeutic interventions (e.g., cognitive behavioral therapy) were excluded because they are not typically performed by general practitioners, as were interventions that did not take place in a general practice setting and were not conducted by one or several members of a general practice team. For further details of the eligibility criteria, see PROSPERO [11] and the published protocol in BMJ Open 2016 [7].

2.2. Search strategy

As described in the protocol publication [7], we followed the validated strategy recommended by Taljaard et al. [15] and Bland [16], and performed an unrestricted search in the Central Register of Controlled Trials (CCTR, August 2015), MEDLINE (from 1946), and EMBASE (from 1988) until September 14, 2015.


A combination of subject headings and text words relating to "general practice" and "cluster randomized" was used in the search, and articles in any language were considered. To ensure literature saturation, we also scanned references in methodological and relevant secondary literature found in the three electronic databases, as well as the reference lists of included studies published after 01/2010, and personal files. The full search strategy is presented in Supplementary Table 1 at www.jclinepi.com.

2.3. Outcome measures

The aim of the study was to assess outcomes and methodological bias when c-RCTs are used to examine novel complex interventions vs. routine care in a general practice setting, irrespective of the disease, by:

1. Summarizing the evidence from c-RCTs and describing the distribution of estimates of treatment effects with respect to direction (in favor of the complex intervention or routine care), magnitude (size of the effect), and statistical significance (or confidence interval).
2. Evaluating how frequently complex interventions in c-RCTs are shown to be statistically superior to routine care.
3. Exploring the extent to which methodological [e.g., power calculations and intraclass correlation coefficients (ICCs)] and other factors (e.g., ethical approval, funding) explain whether the results of c-RCTs are statistically significant or not.

For each study, all extracted items relating to the primary outcomes can be found in detail in the published Systematic Review Data Repository (SRDR) [17].

2.4. Study selection

Titles and abstracts of potentially relevant publications were independently screened for inclusion by two reviewers (G.P., M.A.P., and J.E.). Final eligibility was determined through examination of full-text articles. Disagreements were resolved by discussion and the inclusion of a third reviewer. If data on one of our eligibility criteria were missing, we e-mailed the authors.

2.5. Data extraction and treatment of missing data

To allow authors at different sites to work together on one database, we used the SRDR [18]. The SRDR is a publicly available online tool for the extraction and management of data for systematic review or meta-analysis that is provided by the Agency for Healthcare Research and Quality of the US Department of Health and Human Services. Data from each study were assessed independently by two authors (G.P., J.E., and M.A.P.). Disagreements were resolved through discussion involving a third reviewer.


Missing data required for risk-of-bias assessments and ICCs were requested by e-mail from the authors of the original study.

2.6. Quality assessment

The criteria used for the assessment of methodological quality were developed during the feasibility phase and were based on the CONSORT statement extension to cluster-randomized trials [1], the extraction sheets for RCTs used by IQWiG [14], the Cochrane Handbook for Systematic Reviews of Interventions [19], and the systematic review of Froud et al. [20]. As we were only interested in patient-relevant primary outcomes as defined in our protocol paper [7,11], we assessed risk of bias with special regard to these outcomes. In addition, if an outcome was measured by an instrument, we determined whether a validated or a nonvalidated instrument had been used (instruments were assumed to have been validated when an appropriate article had been published). We assessed the risk of bias in recruitment and identification of study participants using an item specifically developed for c-RCTs by Eldridge et al. [21] that differentiates between four bias categories (possible, not possible, unlikely, and unclear). Quality assessment evaluated the risk of bias as high, low, or unclear for random sequence generation, allocation concealment, blinding of participants and outcome assessors, group and cluster similarity at baseline, involvement of clustering in analysis, incomplete outcome data and selective outcome reporting, intention-to-treat analysis (ITT), timing of outcome assessments, compliance, and any additional bias. The criteria were independently assessed by two authors (G.P., J.E., K.J., and M.A.P.). Disagreements were resolved by discussion and the inclusion of a third reviewer. We used full texts and authors' responses to obtain any missing information. This study is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [22] (see Supplementary File at www.jclinepi.com for the PRISMA checklist).

2.7. Data analysis and presentation of results

Treatment effects and their direction were evaluated by means of risk ratios and differences in mean values for patient-relevant primary outcomes. Statistical significance was recorded when it had been demonstrated using a P-value (<0.05) or confidence interval. However, since many studies did not consider the issue of multiple testing, we also performed a Bonferroni correction for all studies that showed a significant effect but did not specifically address this issue; the significance level demonstrated in a study was divided by the reported number of patient-relevant primary outcome measures.
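To make the per-study adjustment described above concrete, the following is a minimal sketch; the function name and the outcome counts and P-values are hypothetical illustrations, not taken from any included study or from the authors' own analysis code.

```python
# Minimal sketch of a per-study Bonferroni adjustment: the significance level
# is divided by the number of patient-relevant primary outcomes reported for
# that study. All numbers are invented for illustration.

def bonferroni_alpha(alpha: float, n_primary_outcomes: int) -> float:
    """Return the study-wise significance level divided by the outcome count."""
    return alpha / n_primary_outcomes

# Hypothetical study with 5 patient-relevant primary outcomes
adjusted_alpha = bonferroni_alpha(0.05, 5)            # 0.01
reported_p_values = [0.004, 0.03, 0.20, 0.41, 0.76]   # hypothetical P-values
still_significant = [p for p in reported_p_values if p < adjusted_alpha]

print(adjusted_alpha)      # 0.01
print(still_significant)   # [0.004] -> only one outcome survives the correction
```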


We determined how many studies reported sample size calculations and how many accounted for clustering, and we assessed whether studies had accounted for clustering in their analysis. Furthermore, values for assumed and obtained ICCs are presented. Finally, we looked at studies that reported a sample size calculation for at least one patient-relevant primary outcome and had accounted for the clustered structure of the data in both the sample size calculation and the analysis (higher quality "reliable" studies) and assessed how well they had met the assumptions made for sample size calculations. We then calculated the differences between the assumed and obtained effect sizes presented by the authors. When an observed effect is smaller than the minimum effect considered to be important a priori, the value is less than 1, regardless of measurement scale [23]. The results are presented as absolute and relative frequencies and in the form of tables and graphs. Due to variability in outcomes and patient populations, we were not able to conduct a meta-analysis.

2.8. Patient involvement

No patients were involved in deciding on the research question or the outcome measures, nor were they involved in developing plans for the design or implementation of the study. No patients were asked to advise on interpreting or writing up the results. There are no plans to disseminate the results of the research to study participants or the relevant patient community. However, patients' participation in research studies is always voluntary, and by providing researchers with new methodological knowledge on the potential weaknesses of c-RCTs in a general practice setting, we hope to be able to protect our patients from wasting their time and patience participating in studies that are poorly designed.

3. Results

3.1. Study selection

The search in electronic databases identified 9,148 references, of which 4,765 were duplicates and were removed. The titles and abstracts of 4,383 publications were assessed, of which 4,205 were excluded on the basis of our exclusion criteria. The remaining 178 publications underwent full-text examination. After scanning the reference lists of methodological papers and included full-text articles, no further publications were added. Twenty-nine studies were ultimately included [24–52] (see Fig. 1).

3.2. Response to authors' requests

We requested missing data from all authors except one [48]; 18 authors responded [25,26,28,30–32,34–36,40,42,43,45,47,49–52], and all requested data were provided by nine authors [26,30,31,34,36,42,43,45,49]. Their responses were marked in the SRDR, where further details can be found [17].

Fig. 1. Flowchart of the search for and selection of relevant trials.

3.3. Description of included studies

The 29 studies included 34 populations and subpopulations in their research: 26 studies investigated one population, one study examined two populations (analyzed together) [25], one examined three populations (analyzed together) [36], and one studied three separately analyzed subpopulations [39]. Seven trials included patients with lung diseases (e.g., asthma in adults or children), seven included patients with psychiatric diagnoses (e.g., depression or dementia), and four included patients with diabetes. Two studies looked at populations of people with back pain or arthritis and one at patients with cardiovascular diseases. Three studies investigated multiple patient populations, whereby the conditions under examination included diabetes, cardiovascular diseases, chronic obstructive pulmonary disease, and irritable bowel syndrome. Four studies described the investigated populations as "elderly," and one study investigated "children" irrespective of disease. Data for 99 patient-relevant primary outcomes were collected from the 29 studies. Of these outcomes, 54 measured different aspects of quality of life and functioning (e.g., general aspects such as physical functioning or specific aspects such as asthma-related issues), 38 outcomes were related to morbidity (e.g., severity of depression or injuries), five outcomes measured patients' assessments of their care, and one outcome was mortality. One additional outcome involved mortality and aspects of morbidity (e.g., nonlethal strokes).


Since the diversity of outcomes across the studies made meta-analysis impossible, the magnitude of the effect sizes was not evaluated further.

Fig. 2. Chronological distribution of total number of patient-relevant primary endpoints per study.

The 99 outcomes in the studies comprised 71 different endpoints. Fifty-four measurement methods (e.g., questionnaires, number of cases, and physical examination) were used, including 19 instruments (35%) that we considered to be validated and 2 that had been adapted from previously validated instruments. Twelve of the included studies assessed three or more patient-relevant primary outcomes, and six of them assessed five or more (see Supplementary Table 2 at www.jclinepi.com). As we considered only patient-relevant primary outcomes, the total number of primary outcomes per study may have been even higher. As shown in Fig. 2, the total number of patient-relevant primary outcomes per study declined following the publication of the CONSORT extension for c-RCTs in 2004 [1]. Further details on the included studies, and especially on the type of complex intervention, can be found using the publicly available SRDR online tool [17].

3.4. Outcome measures

3.4.1. Distribution of estimates of treatment effects

All in all, the 29 studies reported 99 patient-relevant primary outcomes (1–24 outcomes per study). Of these 99 outcomes, effect sizes were reported for 72 (73%). For a further 24 (24%) outcomes, only summary statistics per group were presented. Three outcomes were left out of the reports completely. Forty-four of the 72 effect sizes (61%) indicated that the outcome favored the complex intervention (of which 15 showed a significant effect before Bonferroni correction), whereas in 20 cases (28%), the outcomes tended to favor the control group (1 of which showed a significant effect before correction), and 1 showed virtually no difference. An assessment of the direction was not possible in six cases (8%) because the study used an adapted questionnaire without clarifying whether high or low values indicated a favorable patient outcome. Finally, one intervention arm in a three-arm study fared better than the control group, whereas the others fared worse.

3.4.2. Frequency of statistical superiority of complex interventions in c-RCTs

All in all, eight studies reported the statistical superiority of a complex intervention at the 5% level for at least one patient-relevant primary outcome [26,28,30,37,39,44,51,52]. However, after adjusting for multiple testing (Bonferroni correction), only four studies (14%) were still able to show the superiority of a complex intervention for one patient-relevant primary outcome [28,30,37,52]. For further details, see Supplementary Table 2 at www.jclinepi.com. The distribution of reported P-values is shown in Fig. 3.

3.4.3. Key components of included studies

3.4.3.1. General information. The criteria "patient consent" (69%) and "ethical approval" (90%) were reported in the majority of trials. Details on publication of a study protocol/feasibility trial (31%), trial registration (31%), funding by the pharmaceutical industry (21%), and conflict of interest (14%) were provided less frequently. In comparison, the four studies that demonstrated a significant intervention effect after the Bonferroni correction all mentioned ethical approval, and a conflict of interest was addressed in 50% of them (see Table 1).

3.4.3.2. Sample size calculation and ICCs. Twenty-one (72%) of the 29 studies reported that they had performed a sample size calculation for at least one patient-relevant primary endpoint; 18 of these (86%) accounted for clustering and described the assumed ICC or design effect. All of the studies that showed a significant effect after Bonferroni correction reported that they had conducted a sample size calculation for the primary endpoint, considered clustering in the sample size calculation, and provided ICCs (see Table 1). Values for assumed and obtained ICCs for the same patient-relevant primary endpoint were available for 13 outcomes in 11 studies [24,28,30,31,35,36,42–44,47,48]. For eight of these outcomes, the obtained ICC was equal to or smaller than the assumed ICC [24,28,30,31,36,42,47,48] (see Supplementary Fig. 1 at www.jclinepi.com). For two of the four studies that demonstrated a significant effect on the patient-relevant primary endpoint, both assumed and obtained ICCs were available, with the obtained ICCs being smaller than the assumed ICCs [28,30]. The two other studies assumed ICCs for sample size calculations but did not provide obtained ICCs [37,52] (see Supplementary Fig. 1 at www.jclinepi.com). Of the 18 studies that reported an adequate sample size calculation for at least one patient-relevant outcome, 12 (67%) had successfully recruited the required number of clusters and 11 (61%) the anticipated number of patients.
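As background for the ICC figures above, clustering usually enters a sample size calculation through the design effect. The sketch below uses the textbook formula DEFF = 1 + (m − 1) × ICC with invented numbers; it is not taken from any of the included studies.

```python
# Textbook design-effect inflation for cluster-randomized sample size
# calculations; the practice size, ICC, and target sample size below are
# invented for illustration.
import math

def design_effect(mean_cluster_size: float, icc: float) -> float:
    """DEFF = 1 + (m - 1) * ICC, with m the average number of patients per practice."""
    return 1.0 + (mean_cluster_size - 1.0) * icc

def clustered_sample_size(n_individual: int, mean_cluster_size: float, icc: float) -> int:
    """Inflate an individually randomized sample size by the design effect."""
    return math.ceil(n_individual * design_effect(mean_cluster_size, icc))

print(design_effect(20, 0.05))               # 1.95
print(clustered_sample_size(300, 20, 0.05))  # 585 patients instead of 300
```

If the ICC obtained in a trial turns out to be smaller than the assumed value, the design effect, and hence the required sample size, was if anything overestimated, which is consistent with the conclusion that ICC underestimates do not explain the negative findings.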


Fig. 3. Distribution of reported P-values.

3.4.3.3. Assessment of methodological bias. In 11 of the 29 studies (38%), bias in recruitment and identification according to Eldridge 2008 [21] was either unlikely or not possible. Adequate randomization and allocation concealment were performed in 19 (66%) and 11 (38%) cases, respectively. In 17 studies (59%), no risk of performance bias resulted from nonblinding of patients. In 14 studies (48%), no risk of detection bias resulted from nonblinding of outcome assessors. In 16 studies (55%), group similarity at baseline meant there was no risk of selection bias. All but two studies (93%) accounted for the clustered structure of the data in their analyses. Cluster size was identical in 7 (24%) cases, 8 (28%) dealt with patient dropout, and all 29 studies stated that they had performed an ITT analysis. Reporting was adequate in 23 (79%) of them, and no additional bias could be detected in 26 (90%). The assessed risk of bias was lower in studies for which the superiority of the intervention continued to exist after the Bonferroni correction (see Tables 1 and 2).

3.4.4. Discrepancies between assumed and obtained effect sizes under certain reliable conditions

Eighteen studies (62%) provided information on sample size calculations for at least one patient-relevant primary outcome (21 outcomes in total) and accounted for the clustered structure of their data in both the sample size calculation and analysis (see Supplementary Fig. 2 at www.jclinepi.com).

For 20 outcomes from 17 studies, an expected treatment effect, or minimal clinically relevant difference, was reported; obtained effect sizes were either provided directly or could be calculated from other data. All but one study [30] were overly optimistic with regard to the expected treatment effect. The median relative reduction for the outcomes for which a comparison was possible was 68% (interquartile range: 45–89%).
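One plausible reading of the comparison reported above is the ratio of the obtained to the assumed effect size, with the relative reduction being one minus that ratio. The exact computation is not spelled out in the text, so the sketch below is an assumption, and all effect sizes in it are invented.

```python
# Hedged sketch of comparing assumed (from the sample size calculation) and
# obtained effect sizes; the formula (1 - obtained/assumed) is an assumption
# based on the description in the text, and all values are invented.
from statistics import median, quantiles

assumed_effects  = [0.50, 0.40, 0.30, 0.60, 0.25]   # hypothetical targets
obtained_effects = [0.10, 0.22, 0.05, 0.18, 0.20]   # hypothetical results

ratios = [obs / exp for obs, exp in zip(obtained_effects, assumed_effects)]
relative_reductions = [1.0 - r for r in ratios]      # ratio < 1 -> positive reduction

print(median(relative_reductions))                   # median relative reduction
print(quantiles(relative_reductions, n=4))           # quartiles (interquartile range)
```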

4. Discussion

This systematic and methodological review identified 29 long-term c-RCTs that compared a complex intervention to routine care in a primary care setting. Overall, 99 patient-relevant primary endpoints were identified, but only four outcomes (4%) in four studies (14%) showed superiority after Bonferroni correction. This means that 86% of the analyzed studies and 96% of the patient-relevant primary endpoints showed no intervention effect, which is a striking number, especially in view of the large numbers of patients, substantial effort, and monetary resources that studies on complex interventions generally require.


Table 1. Reported key methodological components of included studies and differences between studies with and without an intervention effect on the primary outcome (the last two columns subdivide the studies reporting an intervention effect)

| Methodological components | Overall studies, n = 29 (%) | Studies without intervention effect, n = 21 (%) | Intervention effect not kept after Bonferroni correction, n = 4 (%) | Intervention effect kept after Bonferroni correction, n = 4 (%) |
|---|---|---|---|---|
| General information | | | | |
| Consent (patients) | 20 (69) | 14 (67) | 4 (100) | 2 (50) |
| Consent (clusters) | 6 (21) | 3 (14) | 1 (25) | 2 (50) |
| Ethical approval | 26 (90) | 19 (91) | 3 (75) | 4 (100) |
| Publication of study protocol/feasibility trial | 9 (31) | 7 (33) | 1 (25) | 1 (25) |
| Trial registration number | 9 (31) | 7 (33) | 1 (25) | 1 (25) |
| Funding by the pharmaceutical industry | 6 (21) | 5 (24) | 0 (0) | 1 (25) |
| Conflict of interest according to the author(s) | 4 (14) | 2 (10) | 0 (0) | 2 (50) |
| c-RCT evident from title | 16 (55) | 11 (52) | 2 (50) | 3 (75) |
| Sample size calculation | | | | |
| Sample size calculation for patient-relevant primary endpoint (a) | 21 (72) | 16 (76) | 1 (25) | 4 (100) |
| Involvement of the cluster in the sample size calculation (a) | 18 (62) | 13 (62) | 1 (25) | 4 (100) |
| Randomization and blinding process | | | | |
| Recruiting/identification not biased | 11 (38) | 8 (38) | 1 (25) | 2 (50) |
| Adequate randomization method | 19 (66) | 12 (57) | 3 (75) | 4 (100) |
| Adequate allocation concealment | 11 (38) | 8 (38) | 1 (25) | 2 (50) |
| Blinding of patients | 17 (59) | 13 (62) | 1 (25) | 3 (75) |
| Blinding of outcomes assessors | 14 (48) | 11 (52) | 1 (25) | 2 (50) |
| Analysis | | | | |
| Group similarity at baseline | 16 (55) | 11 (52) | 2 (50) | 3 (75) |
| Identical cluster size at baseline | 7 (24) | 5 (24) | 1 (25) | 1 (25) |
| Involvement of clustering in analysis (b) | 27 (93) | 19 (91) | 4 (100) | 4 (100) |
| Dealing with dropout (patients) | 8 (28) | 5 (24) | 0 (0) | 3 (75) |
| ITT | 29 (100) | 21 (100) | 4 (100) | 4 (100) |
| Adequate reporting (item "selective reporting") | 23 (79) | 17 (81) | 3 (75) | 3 (75) |
| No additional bias | 26 (90) | 19 (91) | 4 (100) | 3 (75) |

Abbreviations: c-RCT, cluster-randomized controlled study; ITT, intention-to-treat analysis.
(a) Condition is assumed to be fulfilled if at least one patient-relevant primary endpoint in a study met the criteria.
(b) Condition is assumed to be fulfilled if all patient-relevant primary endpoints in a study were adjusted.

In six studies, five or more patient-relevant primary outcomes were assessed, even though the total number of patient-relevant outcomes per study declined following the publication of the CONSORT extension for c-RCTs in 2004 [1]. In order to describe the differences between studies with and without an intervention effect, we analyzed 22 quality aspects exploratively. An analysis of all 29 studies revealed clear potential for overall improvement in the way they were conducted. More information was reported for over half of the 22 quality criteria in the four studies that, after Bonferroni correction, were able to show the statistically significant superiority of an intervention on at least one patient-relevant primary outcome.

These included information on ethical approval, funding by a pharmaceutical company, a stated conflict of interest, details on the sample size calculation, the randomization and blinding process, group similarity at baseline, cluster size at baseline, and information on how dropouts were dealt with. The CONSORT statement extension to cluster-randomized trials [1,2] requires that both the assumed and obtained ICCs be reported. When this information was not available, the authors were contacted directly. We were able to identify 11 studies that quoted ICC values, of which eight had an obtained ICC that was equal to or smaller than the assumed ICC. This indicates that underestimates of intracluster similarities were unlikely to have been a principal reason for the large number of negative findings.


Table 2. Assessment of methodological bias by study

| Study | Recruitment/identification (Eldridge 2008) | Random sequence generation (selection bias) | Allocation concealment (selection bias) | Group similarity at baseline (selection bias) | Blinding of participants (performance bias) | Compliance (performance bias) | Blinding of outcome assessor (detection bias) | Timing of outcome assessments (detection bias) | Incomplete outcome data (attrition bias) | Intention-to-treat analysis (attrition bias) | Adequate reporting (reporting bias) | Additional bias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bennewith 2002 | not possible | + | + | + | + | − | ? | + | + | + | + | + |
| Black 2013 | unlikely | ? | + | + | + | ? | + | + | ? | + | + | + |
| Byng 2004 | possible | + | − | − | + | + | + | ? | − | + | − | + |
| Coleman 1999 | unlikely | ? | + | ? | − | − | − | + | − | + | − | − |
| Elley 2003** | unclear | + | ? | + | + | + | ? | + | + | + | + | + |
| Gask 2004 | possible | + | − | + | + | + | − | + | − | + | + | + |
| Gensichen 2009* | possible | + | − | + | − | ? | − | + | + | + | + | + |
| Griffin 2011 | possible | + | − | + | + | ? | + | ? | + | + | + | + |
| Holton 2010 | unlikely | + | + | + | + | ? | + | + | ? | + | + | + |
| Jellema 2005 | possible | + | − | ? | + | ? | + | + | ? | + | − | + |
| Juul 2014 | not possible | ? | + | ? | + | + | + | + | + | + | + | + |
| Kendrick 1999 | not possible | ? | + | ? | + | − | − | + | ? | + | + | − |
| Kennedy 2013 | possible | + | − | + | − | − | − | + | ? | + | + | + |
| Kerse 1999** | unlikely | + | + | + | + | ? | + | + | ? | + | − | − |
| Kruis 2014 | unclear | + | ? | + | − | ? | − | + | ? | + | + | + |
| Lobo 2004 | possible | + | − | + | ? | ? | ? | + | − | + | + | + |
| Lord 1999 | possible | ? | − | − | − | + | − | + | + | + | − | + |
| McGeoch 2006 | unlikely | ? | + | − | − | ? | − | + | + | + | + | + |
| Menn 2012 | possible | ? | − | ? | + | ? | + | + | + | + | + | + |
| Metzelthin 2013 | possible | + | − | ? | − | − | − | + | + | + | + | + |
| Murphy 2009 | unlikely | + | + | + | − | ? | − | + | + | + | + | + |
| Olivarius 2001 | possible | + | − | + | + | + | + | + | − | + | + | + |
| Rea 2004 | unclear | + | ? | ? | + | ? | + | + | + | + | + | + |
| Rosendal 2007 | possible | + | − | + | + | ? | + | + | ? | + | + | + |
| Smith 2012 | unlikely | + | + | + | + | + | + | + | + | + | + | + |
| Thoonen 2003 | possible | ? | − | + | − | ? | − | + | + | + | − | + |
| van Marwijk 2008 | possible | ? | − | ? | − | ? | + | + | + | + | + | + |
| Wagner 2001 | unclear | ? | ? | ? | − | − | − | + | ? | + | + | + |
| Zwarenstein 2007* | unlikely | + | + | ? | + | + | + | + | + | + | + | + |

Risk of bias: + low; − high; ? unclear. Bias in recruitment/identification according to Eldridge 2008 is rated as not possible, unlikely, possible, or unclear.
Studies with at least one patient-relevant primary outcome that remained significant after Bonferroni adjustment of alpha levels are marked: * with a single patient-relevant primary outcome; ** with several patient-relevant primary outcomes.

Finally, effect sizes were smaller than anticipated in all but one [30] of the 17 higher-quality studies whose results appeared to be reliable. In our systematic and methodological review, we primarily aimed to summarize the evidence from c-RCTs in order to describe the distribution of estimates of treatment effects and to evaluate how frequently complex interventions in a primary care setting are shown to be statistically superior to routine care. This is a novel research question, and we are not aware of any previous study on the topic. However, the overall assessment of reported methodological and other factors shows that, with only a few exceptions, the amount of reported methodological information was comparable to that provided in other previously performed reviews [3,20,23,53]. Unlike Rutterford et al. [23], we did not assess all six elements necessary for a full sample size calculation.

Nevertheless, in comparison to the review by Rutterford et al. [23], which included 300 studies published between 2000 and 2008, a greater percentage of studies in our study pool, which includes studies up to 2015, accounted for clustering in their sample size calculations (66% vs. 34%). Another recently published systematic review by Ivers et al. [3], which dealt with the same sample of 300 cluster-randomized trials as Rutterford et al., clearly showed that despite the publication of the CONSORT statement on the reporting and methodological quality of c-RCTs in 2004 [1], very few recommendations had been adhered to, and further effort was necessary to improve methodological quality. However, our study sample showed some improvement over the years: compared with the review of Ivers et al. [3], more studies reported on the blinding of outcome assessors (48% vs. 40%), more included clustering in their sample size calculation (66% vs. 33%), and more included it in their analysis (93% vs. 84%). These findings should not be overinterpreted, however, because our study sample included several studies published after 2008 and was restricted to studies in general practices that included patient-relevant primary outcomes.


While in the study by Rutterford et al. [23] only 18 of 300 trials presented information on both assumed and obtained ICCs, this information was available for 11 of 29 studies in our review because authors' responses had provided us with information on seven further pairs. As in the review by Rutterford et al. [23], the actual ICC was smaller than or equal to anticipated levels in 70% of trials. In addition, a comparison between obtained and assumed treatment effects in a sample of "reliable" studies (ICC accounted for in the analysis as well as in the sample size calculation for a patient-relevant primary outcome) revealed that in all but one trial, actual effect sizes were smaller than those targeted for the power calculation. This result is very similar to the findings of Rutterford et al. [23], who argued that this may be because interventions are often ineffective or because investigators assume minimum important target effects that are unachievable using the type of intervention under evaluation. At this point, it is important to reflect on whether the inconclusiveness of study results is attributable to the absence of power calculations and clustering adjustments, as well as to other potential flaws. Two recent publications by Pocock et al. [54,55] dealt with the problem that labeling randomized trials as positive or negative on the basis of a P-value for the primary outcome of less than 0.05 is not sufficient. All available evidence must be interpreted to determine the credibility of results. Questions should be asked about the appropriateness of patients, endpoints, and treatment regimens, the adequacy of sample size calculations, the possible existence of flaws in the trial design and execution, the magnitude of the treatment effect, and its clinical importance. For the four studies that, with respect to mean differences, showed statistical significance after the Bonferroni correction, a standardized effect size (Cohen's d) was calculated for only one patient-relevant primary outcome in one study [30]; however, at d = 0.26, it is rather small in comparison with the available literature [56]. Two other outcomes that were statistically significant were from studies with eight [28] and five [37] patient-relevant primary outcomes. It can therefore not be ruled out that the statistically significant outcomes are the result of a cumulative error in the alpha value. Furthermore, the differences in two of the outcomes would appear to be rather small: SF-36 subscales range from 0 to 100, but the resulting difference in the study of Elley et al. [28] was 4.51. The outcome in the study by Kerse [37] showed a difference of 0.30 on a five-point rating scale. In Zwarenstein et al. [52], only one patient-relevant primary outcome, ranging from 0 to 9, was used, and the adjusted difference was 0.85. These results have encouraged us to look more closely at the practical relevance of the "positive" findings. Our systematic and methodological review was based on a thorough search strategy and trial assessment. In addition, data were made publicly available in the SRDR [17].

93

In contrast to previously published reviews [3,23], we used a new pool of studies and developed our own search strategy. Furthermore, screening and study quality assessments were conducted by two reviewers independently of each other, and all authors were contacted by us. We are convinced that we have developed a new and valid database that can be used as the basis for answering future research questions. The high proportion of c-RCTs in a general practitioner setting that showed no beneficial effect may be attributable to several previously mentioned problems, such as common methodological shortcomings, the high quality of existing standard care, or the fact that the new (complex) treatment options usually investigated in c-RCTs are simply not superior to existing routine care. In addition, contamination effects can never be ruled out in c-RCTs [57]. However, if the results obtained in c-RCTs are not compared with the results obtained using an alternative study design, such assumptions are purely hypothetical. Specific findings obtained using different study designs (e.g., observational studies with routinely collected health data vs. RCTs) performed on the same topic have shown that studies using routinely collected health data often substantially overestimate treatment effects [58]. Assuming that c-RCTs possibly often investigate (complex) interventions that lack superiority, a follow-up project that aims to identify studies that have used alternative designs to deal with the same research questions as in our c-RCTs may help to provide an answer. However, when results from c-RCTs and other trials such as observational studies disagree, the situation may still be difficult to interpret, because results from reliable RCTs, whether they use clusters or not, are generally more trustworthy than results obtained from observational data, as these have their own specific limitations [59]. Our assessment of study quality was based on the content of journal publications and authors' responses, so we were unable to check assumptions made by the authors. A further limitation is that the feasibility project [7] led us to believe we would identify 150 studies for inclusion, whereas only 29 studies were actually identified. The size of the study pool was thus vastly overestimated. There are several reasons for this. First, the restricted search in "general practice-related journals" in our feasibility trial had already identified most of the trials, and second, the addition of two new exclusion criteria increased the number of excluded studies. As we are aware that we have a valuable new study pool of c-RCTs with patient-relevant outcomes in a general practice setting at our disposal, we are planning to conduct further studies. For example, we plan to analyze the complex interventions themselves in accordance with the recommendations of Möhler et al. [5,6]. As negative results are very disappointing for authors, we are also very interested in authors' own interpretations and explanations as to why their studies did not show a positive effect. These will therefore be extracted from the discussion sections of included studies.


5. Conclusions and implications for research and practice

This systematic and methodological review of all c-RCTs evaluating complex interventions in a general practice setting identified only a small number of studies that were able to show the superiority of the intervention. This is alarming because, in the opinion of the authors of the present study, such trials can be expected to be laborious, complex, and expensive, and to be planned and performed to the best ability of the investigators. This article addresses methodological issues and should therefore be seen as a discussion paper reminding researchers to invest the appropriate amount of effort in the development of rigorous study designs that prevent patients, health care staff, and researchers from wasting their time. Our review shows that when studies are poorly designed and implemented, it is often clear from the outset that they are not going to yield useful information. As we know from previously published methodological reviews, several problems remain to be dealt with, such as the adequate treatment of missing data [60] and the infrequent use of appropriate statistical methods in cluster-randomized crossover trials [61]. It is therefore to be welcomed that a number of initiatives have recently been set up to improve this process [62,63], such as the design of graphical tools to identify risk of bias in cluster-randomized trials [63].

Acknowledgments

Paul Glasziou, Sarah Thorning, and Elaine Beller extensively discussed the project idea during the research fellowship of Andrea Siebenhofer at the Center for Research in Evidence-Based Practice, Bond University, Australia, in 2013. The authors thank Peter Sawicki and Paul Glasziou for their helpful comments on this manuscript and Jeanet W. Blom, who translated one article for us.

Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.jclinepi.2017.10.010.

References

[1] Campbell MK, Elbourne DR, Altman DG. CONSORT statement: extension to cluster randomised trials. BMJ 2004;328:702–8.
[2] Campbell MK, Piaggio G, Elbourne DR, Altman DG. Consort 2010 statement: extension to cluster randomised trials. BMJ 2012;345:e5661.
[3] Ivers NM, Taljaard M, Dixon S, Bennett C, McRae A, Taleban J, et al. Impact of CONSORT extension for cluster randomised trials on quality of reporting and study methodology: review of random sample of 300 trials, 2000-8. BMJ 2011;343:d5886.
[4] Taljaard M, Weijer C, Grimshaw JM, Eccles MP. The Ottawa Statement on the ethical design and conduct of cluster randomised trials: precis for researchers and research ethics committees. BMJ 2013;346:f2838.

[5] Möhler R, Bartoszek G, Köpke S, Meyer G. Proposed criteria for reporting the development and evaluation of complex interventions in healthcare (CReDECI): guideline development. Int J Nurs Stud 2012;49(1):40–6.
[6] Möhler R, Köpke S, Meyer G. Criteria for reporting the development and evaluation of complex interventions in healthcare: revised guideline (CReDECI 2). Trials 2015;16:204.
[7] Siebenhofer A, Erckenbrecht S, Pregartner G, Berghold A, Muth C. How often are interventions in cluster-randomised controlled trials of complex interventions in general practices effective and reasons for potential shortcomings? Protocol and results of a feasibility project for a systematic review. BMJ Open 2016;6(2):e009414.
[8] Djulbegovic B, Kumar A, Glasziou P, Miladinovic B, Chalmers I. Medical research: trial unpredictability yields predictable therapy gains. Nature 2013;500:395–6.
[9] Erler A, Beyer M, Petersen JJ, Saal K, Rath T, Rochon J, et al. How to improve drug dosing for patients with renal impairment in primary care - a cluster-randomized controlled trial. BMC Fam Pract 2012;13(1):91.
[10] Hoffmann B, Müller V, Rochon J, Müller B, Albay Z, Weppler K, et al. Effects of a team-based assessment and intervention on patient safety culture in general practice: an open randomised controlled trial. BMJ Qual Saf 2014;23(1):35–46.
[11] Siebenhofer A, Muth C, Fullerton B, Glasziou P, Beller E, Thorning S, et al. The shortcomings of cluster-randomized controlled trials for the assessment of complex interventions in general practices: a systematic review. Centre for Reviews and Dissemination (CRD). PROSPERO 2014:CRD42014009234. Available at http://www.crd.york.ac.uk/PROSPERO/display_record.asp?ID=CRD42014009234. Accessed December 7, 2017.
[12] Craig P, Dieppe P, Macintyre S, Michie S, Nazareth I, Petticrew M. Developing and evaluating complex interventions: the new Medical Research Council guidance. BMJ 2008;337:a1655.
[13] Djulbegovic B, Kumar A, Glasziou PP, Perera R, Reljic T, Dent L, et al. New treatments compared to established treatments in randomized trials. Cochrane Database Syst Rev 2012;10:MR000024.
[14] Institute for Quality and Efficiency in Health Care. General Methods: Version 4.2 of 22 April 2015. Available at https://www.iqwig.de/download/IQWiG_General_Methods_Version_%204-2_no_longer_valid.pdf. Accessed December 7, 2017.
[15] Taljaard M, McGowan J, Grimshaw JM, Brehaut JC, McRae A, Eccles MP, et al. Electronic search strategies to identify reports of cluster randomized trials in MEDLINE: low precision will improve with adherence to reporting standards. BMC Med Res Methodol 2010;10:15.
[16] Bland JM. Cluster randomised trials in the medical literature: two bibliometric surveys. BMC Med Res Methodol 2004;4:21.
[17] Siebenhofer A, Muth C, Fullerton B, Glasziou P, Beller E, Thorning S, et al. Our specific SRDR project database to be published after acceptance of our paper.
[18] Systematic Review Data Repository [Software]. Available at http://srdr.ahrq.gov/. Accessed December 7, 2017.
[19] Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011] 2016. Available at www.handbook.cochrane.org. Accessed December 7, 2017.
[20] Froud R, Eldridge S, Diaz Ordaz K, Marinho VCC, Donner A. Quality of cluster randomized controlled trials in oral health: a systematic review of reports published between 2005 and 2009. Community Dent Oral Epidemiol 2012;40(Suppl 1):3–14.
[21] Eldridge S, Ashby D, Bennett C, Wakelin M, Feder G. Internal and external validity of cluster randomised trials: systematic review of recent trials. BMJ 2008;336:876–80.
[22] Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009;6(7):e1000097.
[23] Rutterford C, Taljaard M, Dixon S, Copas A, Eldridge S. Reporting and methodological quality of sample size calculations in cluster randomized trials could be improved: a review. J Clin Epidemiol 2015;68:716–23.


[24] Bennewith O, Stocks N, Gunnell D, Peters TJ, Evans MO, Sharp DJ. General practice based intervention to prevent repeat episodes of deliberate self harm: cluster randomised controlled trial. BMJ 2002;324:1254–7.
[25] Black DA, Taggart J, Jayasinghe UW, Proudfoot J, Crookes P, Beilby J, et al. The Teamwork Study: enhancing the role of non-GP staff in chronic disease management in general practice. Aust J Prim Health 2013;19(3):184–9.
[26] Byng R, Jones R, Leese M, Hamilton B, McCrone P, Craig T. Exploratory cluster randomised controlled trial of shared care development for long-term mental illness. Br J Gen Pract 2004;54(501):259–66.
[27] Coleman EA, Grothaus LC, Sandhu N, Wagner EH. Chronic care clinics: a randomized controlled trial of a new model of primary care for frail older adults. J Am Geriatr Soc 1999;47(7):775–83.
[28] Elley CR, Kerse N, Arroll B, Robinson E. Effectiveness of counselling patients on physical activity in general practice: cluster randomised controlled trial. BMJ 2003;326:793.
[29] Gask L, Dowrick C, Dixon C, Sutton C, Perry R, Torgerson D, et al. A pragmatic cluster randomized controlled trial of an educational intervention for GPs in the assessment and management of depression. Psychol Med 2004;34:63–72.
[30] Gensichen J, von Korff M, Peitz M, Muth C, Beyer M, Güthlin C, et al. Case management for depression by health care assistants in small primary care practices: a cluster randomized trial. Ann Intern Med 2009;151:369–78.
[31] Griffin SJ, Borch-Johnsen K, Davies MJ, Khunti K, Rutten GE, Sandbæk A, et al. Effect of early intensive multifactorial therapy on 5-year cardiovascular outcomes in individuals with type 2 diabetes detected by screening (ADDITION-Europe): a cluster-randomised trial. Lancet 2011;378:156–67.
[32] Holton CH, Beilby JJ, Harris MF, Harper CE, Proudfoot JG, Ramsay EN, et al. Systematic care for asthma in Australian general practice: a randomised controlled trial. Med J Aust 2010;193:332–7.
[33] Jellema P, van der Windt DA, van der Horst HE, Twisk JW, Stalman WA, Bouter LM. Should treatment of (sub)acute low back pain be aimed at psychosocial prognostic factors? Cluster randomised clinical trial in general practice. BMJ 2005;331:84.
[34] Juul L, Maindal HT, Zoffmann V, Frydenberg M, Sandbaek A. Effectiveness of a training course for general practice nurses in motivation support in type 2 diabetes care: a cluster-randomised trial. PLoS One 2014;9:e96683.
[35] Kendrick D, Marsh P, Fielding K, Miller P. Preventing injuries in children: cluster randomised controlled trial in primary care. BMJ 1999;318:980–3.
[36] Kennedy A, Bower P, Reeves D, Blakeman T, Bowen R, Chew-Graham C, et al. Implementation of self management support for long term conditions in routine primary care settings: cluster randomised controlled trial. BMJ 2013;346:f2882.
[37] Kerse NM, Flicker L, Jolley D, Arroll B, Young D. Improving the health behaviours of elderly people: randomised controlled trial of a general practice education programme. BMJ 1999;319:683–7.
[38] Kruis AL, Boland MRS, Assendelft WJJ, Gussekloo J, Tsiachristas A, Stijnen T, et al. Effectiveness of integrated disease management for primary care chronic obstructive pulmonary disease patients: results of cluster randomised trial. BMJ 2014;349:g5392.
[39] Lobo CM, Frijling BD, Hulscher MEJL, Bernsen RMD, Grol RPTM, Prins A, et al. Effect of a comprehensive intervention program targeting general practice staff on quality of life in patients at high cardiovascular risk: a randomized controlled trial. Qual Life Res 2004;13:73–80.
[40] Lord J, Victor C, Littlejohns P, Ross FM, Axford JS. Economic evaluation of a primary care-based education programme for patients with osteoarthritis of the knee. Health Technol Assess 1999;3:1–55.
[41] McGeoch GRB, Willsman KJ, Dowson CA, Town GI, Frampton CM, McCartin FJ, et al. Self-management plans in the primary care of patients with chronic obstructive pulmonary disease. Respirology 2006;11(5):611–8.
[42] Menn P, Holle R, Kunz S, Donath C, Lauterberg J, Leidl R, et al. Dementia care in the general practice setting: a cluster randomized trial on the effectiveness and cost impact of three management strategies. Value Health 2012;15(6):851–9.
[43] Metzelthin SF, van Rossum E, de Witte LP, Ambergen AW, Hobma SO, Sipers W, et al. Effectiveness of interdisciplinary primary care approach to reduce disability in community dwelling frail older people: cluster randomised controlled trial. BMJ 2013;347:f5264.
[44] Murphy AW, Cupples ME, Smith SM, Byrne M, Byrne MC, Newell J. Effect of tailored practice and patient care plans on secondary prevention of heart disease in general practice: cluster randomised controlled trial. BMJ 2009;339:b4220.
[45] Olivarius NF, Beck-Nielsen H, Andreasen AH, Horder M, Pedersen PA. Randomised controlled trial of structured personal care of type 2 diabetes mellitus. BMJ 2001;323:970.
[46] Rea H, McAuley S, Stewart A, Lamont C, Roseman P, Didsbury P. A chronic disease management programme can reduce days in hospital for patients with chronic obstructive pulmonary disease. Intern Med J 2004;34(11):608–14.
[47] Rosendal M, Olesen F, Fink P, Toft T, Sokolowski I, Bro F. A randomized controlled trial of brief training in the assessment and treatment of somatization in primary care: effects on patient outcome. Gen Hosp Psychiatry 2007;29(4):364–73.
[48] Smith JR, Noble MJ, Musgrave S, Murdoch J, Price GM, Barton GR, et al. The at-risk registers in severe asthma (ARRISA) study: a cluster-randomised controlled trial examining effectiveness and costs in primary care. Thorax 2012;67(12):1052–60.
[49] Thoonen BPA, Schermer TRJ, van den Boom G, Molema J, Folgering H, Akkermans RP, et al. Self-management of asthma in general practice, asthma control and quality of life: a randomised controlled trial. Thorax 2003;58(1):30–6.
[50] van Marwijk HW, Ader H, de Haan M, Beekman A. Primary care management of major depression in patients aged ≥55 years: outcome of a randomised clinical trial. Br J Gen Pract 2008;58(555):680–6, I–II; discussion 687.
[51] Wagner EH, Grothaus LC, Sandhu N, Galvin MS, McGregor M, Artz K, et al. Chronic care clinics for diabetes in primary care: a system-wide randomized trial. Diabetes Care 2001;24:695–700.
[52] Zwarenstein M, Bheekie A, Lombard C, Swingler G, Ehrlich R, Eccles M, et al. Educational outreach to general practitioners reduces children's asthma symptoms: a cluster randomised controlled trial. Implement Sci 2007;2:30.
[53] Diaz-Ordaz K, Slowther AM, Potter R, Eldridge S. Consent processes in cluster-randomised trials in residential facilities for older adults: a systematic review of reporting practices and proposed guidelines. BMJ Open 2013;3(7):e003057.
[54] Pocock SJ, Stone GW. The primary outcome is positive - is that good enough? N Engl J Med 2016;375:971–9.
[55] Pocock SJ, Stone GW. The primary outcome fails - what next? N Engl J Med 2016;375:861–70.
[56] Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Erlbaum; 1988.
[57] Puffer S, Torgerson D, Watson J. Evidence for risk of bias in cluster randomised trials: review of recent trials published in three general medical journals. BMJ 2003;327:785–9.
[58] Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: meta-epidemiological survey. BMJ 2016;352:i493.


[59] Hemkens LG, Contopoulos-Ioannidis DG, Ioannidis JPA. Current use of routinely collected health data to complement randomized controlled trials: a meta-epidemiological survey. CMAJ Open 2016;4(2):E132–40.
[60] Diaz-Ordaz K, Kenward MG, Cohen A, Coleman CL, Eldridge S. Are missing data adequately handled in cluster randomised trials? A systematic review and guidelines. Clin Trials 2014;11:590–600.
[61] Arnup SJ, Forbes AB, Kahan BC, Morgan KE, McKenzie JE. Appropriate statistical methods were infrequently used in cluster-randomized crossover trials. J Clin Epidemiol 2016;74:40–50.
[62] Brown AW, Li P, Bohan Brown MM, Kaiser KA, Keith SW, Oakes JM, et al. Best (but oft-forgotten) practices: designing, analyzing, and reporting cluster randomized controlled trials. Am J Clin Nutr 2015;102:241–8.
[63] Caille A, Kerry S, Tavernier E, Leyrat C, Eldridge S, Giraudeau B. Timeline cluster: a graphical tool to identify risk of bias in cluster randomised trials. BMJ 2016;354:i4291.