Study Designs for Effectiveness and Translation Research: Identifying Trade-offs

Shawna L. Mercer, MSc, PhD, Barbara J. DeVinney, PhD, Lawrence J. Fine, MD, DrPH, Lawrence W. Green, DrPH, Denise Dougherty, PhD

Background: Practitioners and policymakers need credible evidence of effectiveness to justify allocating resources to complex, expensive health programs. Investigators, however, face challenges in designing sound effectiveness and translation research with relevance for “real-world” settings.

Methods:

Research experts and federal and foundation funders (n ≈ 120) prepared for and participated in a symposium, held May 4–5, 2004, to weigh the strengths, limitations, and trade-offs between alternate designs for studying the effectiveness and translation of complex, multilevel health interventions.

Results:

Symposium attendees acknowledged that research phases (hypothesis generating, efficacy, effectiveness, translation) are iterative and cyclical, not linear, since research in advanced phases may reveal unanswered questions in earlier phases. Research questions thus always need to drive the choice of study design. When randomization and experimental control are feasible, participants noted that the randomized controlled trial with individual random assignment remains the gold standard for safeguarding internal validity. Attendees highlighted trade-offs of randomized controlled trial variants, quasi-experimental designs, and natural experiments for use when randomization or experimental control or both are impossible or inadequately address external validity. Participants discussed enhancements to all designs to increase confidence in causal inference while accommodating greater external validity. Since no single study can establish causality, participants encouraged replication of studies and triangulation using different study designs. Participants also recommended participatory research approaches for building population relevance, acceptability, and usefulness.

Conclusions: Consideration of the study design choices, trade-offs, and enhancements discussed here can guide the design, funding, completion, and publication of appropriate policy- and practice-oriented effectiveness and translational research for complex, multilevel health interventions. (Am J Prev Med 2007;33(2):139–154) © 2007 American Journal of Preventive Medicine

Introduction

The demand for public health and healthcare practice and policy to be based on scientific evidence continues to grow, affecting programs, services, and research.1–6 Attempts to describe characteristics of valid, high-quality research and evaluation that policymakers and practitioners should value are multiplying.7–14 The United States Preventive Services Task Force,15 Task Force on Community Preventive Services,16,17 Cochrane Collaboration,18 Campbell Collaboration,19 and the United Kingdom’s National Institute for Health and Clinical Excellence (NICE)20 put a premium on rigorous design as they assess the literature and develop recommendations for practice and policy. The U.S. Institute of Medicine and the National Quality Forum have recommended focusing on the


scientific soundness of quality improvement activities.13,21 U.S. federal agencies are subjecting applications for intervention research funding to more rigorous external peer review than in previous years, and the U.S. Department of Education is placing higher priority on evaluation projects that use rigorous research methods to assess intervention effectiveness.22 Many quality-of-evidence ratings emphasize internal validity (Does this intervention work under the conditions set forth in the study?) without also giving consideration to external validity (Will it work in other settings and contexts?).23 Assessing internal validity is of paramount importance, for without it, one cannot be sure whether the intervention works at all. Yet growing recognition of gaps between research and practice has led researchers, policymakers, practitioners, and government officials to call for more research that has relevance for practice and policy across a wide range of real-world settings and situations.17,23–30 Some practitioners and policymakers question the effectiveness for their particular situations of interventions deemed efficacious in studies using populations or circumstances different from their own. Others wonder whether interventions whose effectiveness has been established within some practices or communities can be generalized or transferred to a broad range of settings. Studies that consider external as well as internal validity are important for informing real-world decision making in such situations.5,23,27,31 While efficacy research assesses whether an intervention works under ideal conditions, effectiveness research examines whether it works under real-world conditions.32 Translation research, while defined differently within and across disciplines, involves exploring how to translate (or transfer) scientific discoveries into practical applications to improve health.25,27,30 Challenges arise when researchers attempt to design effectiveness and translation research to evaluate complex, multilevel health interventions in real-world settings. It may be challenging, for example, to devise an appropriate control group in studies assessing multicomponent interventions when enough is known about the individual components to raise ethical objections if any of them were to be withheld, and when receiving nothing or a placebo would be unacceptable. Or it may be difficult to determine how to evaluate long-term follow-up when significant subject attrition is expected over time. Other design challenges may arise when the intervention can take many forms or requires program- or population-specific adaptations, when investigators cannot control how the intervention is implemented by different practitioners, when the quality with which the intervention is delivered can vary widely, or when the individuals who would volunteer or agree to participate are different from the target population as a whole. Financial or logistic complications may occur 140

when trying to secure adequate sample size for studies in which large organizational units, entire communities, or nations are the unit of analysis. Finally, the intervention of interest may not be able to be randomly assigned to individuals or groups because they will not agree to be randomized or because all potential participants are exposed to the intervention (e.g., a law). These design challenges suggest that it might be worthwhile to consider what valuable information can be gained from employing a variety of study designs.33 One can also ask whether the weight of evidence from nonrandomized study designs can offset the strength of evidence lost when randomization is impossible or inappropriate. The purpose of this project was therefore to explore the strengths, limitations, and trade-offs among a variety of designs applicable to effectiveness and translation research. Of particular interest were designs that would provide evidence not only for whether these interventions would work in the setting in which they were first studied, but also whether they could produce findings generalizable to other settings and contexts. The intent was to identify directions that could be taken to strengthen the evidence base for real-world decision making.

Methods A series of symposia was initiated to bring together methodological and subject matter experts to examine trade-offs among study designs for effectiveness and translation research. The second symposium forms the basis of this manuscript. The first symposium is briefly mentioned here to provide context, and two later symposia on related topics are described in the Discussion section. The first symposium was designed by National Institutes of Health (NIH) and Centers for Disease Control and Prevention (CDC) staff to initiate broad-based discussion of design issues in translational research surrounding diabetes and obesity. The symposium consisted of one session held during a NIH/CDC-sponsored meeting entitled From Clinical Trials to Community: The Science of Translating Diabetes and Obesity Research. Proceedings from this meeting, held January 12–13, 2004, are available from NIH and online.34 The second symposium consisted of a two-day meeting convened May 4 –5, 2004, and was entitled Research Designs for Complex, Multi-Level Health Interventions and Programs. Sponsored by NIH and CDC, with participation from the Agency for Healthcare Research and Quality (AHRQ) and the Robert Wood Johnson Foundation (RWJF), the symposium’s objectives were to: (1) understand how opportunities and challenges in effectiveness and translational research lead to consideration of a variety of research designs, (2) recognize key trade-offs among alternative research designs, and (3) identify one or more useful research designs for effectiveness and translational studies. This paper presents lessons learned through this second symposium and a subsequent year of interagency discussion, with the aim of stimulating further action on the funding, conduct, and publication of promising research design options.


Initial preparations for the May 4 –5 symposium involved holding a series of intra- and inter-agency planning meetings with NIH and CDC personnel (the list of federal advisory group members is available online at http://obssr.od.nih. gov/Conf_Wkshp/Complex%20Interventions.htm), and engaging AHRQ and RWJF staff in discussing design challenges, identifying important programmatic and policy questions, and suggesting methodologic and subject matter experts. Six topics were chosen to illustrate alternative research designs: prevention of type 2 diabetes, prevention of childhood obesity, promotion of physical activity, tobacco control and cessation among adolescents, improving the management of asthma in high-risk populations, and reducing underage drinking. A working group was established for each scenario, consisting of the authors of this article, leading nongovernmental researchers with methodologic and substantive expertise, and interested federal staff (working group members are identified at http://obssr.od.nih.gov/Conf_ Wkshp/Complex%20Interventions.htm). Groups worked intensively to delineate current research needs and practice constraints surrounding complex, multilevel health interventions, identify a hypothetical but realistic effectiveness or translational research question, and devise at least two study designs. “Complex” was defined as multicomponent, and “multilevel” as intervening on two or more levels of determinants of health (e.g., individual, familial, organizational, political, social, economic, environmental). At least one of the two designs had to be a quasi-experiment (without randomization or experimental control) or a natural experiment (without both) to allow the greatest possible comparisons. The six scenarios were presented to the (⬃120) assembled experts at the May 2004 symposium to stimulate discussion of strengths, weaknesses, and trade-offs between the designs— initially by scenario, and then evolving toward broad tradeoffs and lessons learned.

Results Since the authors, scenario group members, and symposium participants came from diverse disciplines, challenges arose from their reliance on different study designs, use of different terminology to describe similar aspects of study design, and different understandings of the same terms. How the terms “comparison group” and “control group” are used in different traditions provides a good example: (1) in some, they are considered synonymous; (2) in others, “control group” is only used when subjects are randomized into intervention and control groups, while “comparison group” is used when groups are not randomized; and (3) in others, a “comparison group” may receive either no intervention, a placebo, or an alternate intervention while a “control group” may receive a placebo or no intervention or be wait-listed for intervention, but may not receive an alternate intervention (as that would be a comparison group)—irrespective of whether groups are randomized.32,35 Given the symposium’s focus on studies of complex interventions with behavioral components, the research design terminology and study August 2007

design schematics outlined in Shadish et al.32 provided helpful common ground. These terms and descriptions may be more familiar to those within behavioral science, health behavior, health promotion, health education, and evaluation traditions. In Shadish et al.,32 use of “control group” and “comparison group” is consistent with the third case above. Two of the scenarios, and comparison of the tradeoffs between the two designs proposed for each, can be found in Table 1. Table 2 summarizes key strengths and weaknesses of the designs, along with suggested design enhancements. The footnotes to Table 2 list potential threats to internal and external validity. Highlighted in the text of this paper are the strengths, weaknesses, and trade-offs that scenario developers and other symposium participants felt were most worthy of consideration, along with generic observations and recommendations. Visual schematics of the various designs, taken from and building on Shadish et al.32 can be found in Appendix A online at www.ajpm-online.net. (Full details of the scenarios and designs, as well as copies of all symposium presentations and videocasts of the full symposium can be accessed at http://obssr.od.nih. gov/Conf_Wkshp/Complex%20Interventions.htm.) Since some individuals presented twice, the presentation being referred to is specified.

General Observations and Recommendations Symposium participants acknowledged the philosophical debate about whether and how causality can truly be determined,32 and the varied opinions across disciplines on appropriate methods for assessing causality. Most agreed that no one study establishes causality, and that policy, program, and practice decisions must often be taken in the absence of certainty about causes.42 Different researchers and traditions have attempted to delineate the various phases of research. Since symposium participants highlighted the need for understanding and collaboration across disciplines, the authors of this manuscript attempt to show their relative equivalence in Figure 1. Discussions at the symposium revealed that studying what interventions work is an iterative and cyclical rather than a linear process. As symposium participants considered the effectiveness of interventions deemed efficacious, and translation to new situations of interventions considered effective, in-depth exploration often revealed unanswered questions about basic underlying theory; clinical, behavioral, or organizational factors and their relationships; and even whether the interventions were entirely efficacious.5,23,42 Moreover, the lines between effectiveness and translational research remained fuzzy, because in almost all cases where interventions had been deemed effective, the interventions had not demonstrated effectiveness in a variety of real-world conditions, let alone in Am J Prev Med 2007;33(2)


Table 1. Design issues considered in developing hypothetical scenarios presented at the May 4–5, 2004 symposium: two example scenarios

Scenario #1: Reduce diabetes risk and prevalence
Proposed study intervention: Provide menu of options for increased physical activity (e.g., buddy system for walking) and improved diet (modeled on Diabetes Prevention Program36)
Setting(s) and population(s): African-American and Latino adult church attendees with impaired fasting glucose 100–125 mg/dl and body mass index ≥30, in Los Angeles and Chicago
Desired outcomes: Weight loss
Two research designs compared for scenario: (1) RET with random assignment of individuals within churches (random assignment at church level also considered) to encouragement to adopt aspects of Diabetes Prevention Program or to control group; (2) SET
Key trade-offs between the two designs: RET controls for secular trends but SET does not. RET may allow recruitment of a more representative sample because participants are given choice in selecting intervention components. SET may increase subject retention relative to RET because all subjects eventually participate in the intervention arm.

Scenario #3: Increase “utilitarian” physical activity
Proposed study intervention: Neighborhood light rail as alternative to automobile use in mass transportation
Setting(s) and population(s): 6 urban and suburban settings
Desired outcomes: Increased time spent walking
Two research designs compared for scenario: (1) ITS with 6 sites receiving intervention and serving as their own comparison groups pre-intervention; (2) PP with 3 intervention and 3 comparison communities (can be analyzed as 6 independent samples or 3 paired comparisons)
Key trade-offs between the two designs: PP with comparison group accounts for secular trends. ITS permits detection of trends and careful measurement of effect size and maintenance over time, with potential for analytic refinements. The size of the detectable effect is smaller (i.e., fewer minutes of increase in physical activity) for ITS than for either an independent-samples t-test or a paired t-test in the PP design; thus, the ITS has less risk of a false negative result.

Notes: Design issues faced by all six hypothetical scenarios can be found at http://obssr.od.nih.gov/Conf_Wkshp/Complex%20Interventions.htm. Content for these hypothetical scenarios was developed for the symposium by the members of the scenario work groups, identified on the same website. ITS, interrupted time series design; PP, pre–post design; RET, randomized encouragement trial; SET, staggered enrollment trial.

Table 2. Strengths and limitations of, and enhancements to, alternative designs discussed at the May 4–5, 2004 symposium for studying complex, multilevel health interventions

Columns: Design | Key strengthsa,b | Key limitations | Enhancements that could strengthen design

True experiments: randomized controlled designs32

Gold standard for establishing causation because randomization creates probabilistically equivalent treatment and control groups leading to high internal validityc Protects against most threats to internal validityc: ambiguous temporal precedence, selection, history, maturation, testing, instrumentation, regression to the mean

May have low external validityd

Could increase external validity and understanding of process by assessing implementation and sustainability in natural settings

Could have differential attrition between intervention and control groups May have low external validityd

Consider using practical clinical trials5 including attention to (1) selection of clinically relevant alternative interventions for comparison, (2) including diverse study participants, (3) recruiting participants from heterogeneous settings, and (4) collecting data on a broad range of health outcomes Consider relevance of grounded theory by Shadish et al.32 and apply their principles for achieving generalized causal inference If encouragement strategies are developed collaboratively between researcher and participants, can promote an even more equitable relationship between researcher and participant Can reduce cost if use of an intermediate variable as study endpoint rather than a disease endpoint is defensible

Traditional RCT with individual as unit of RA32

RET37,38 (see also Chin’sc presentation on diabetes scenario)

SET (see Chin’s presentation on SETc)

Stronger external validity than traditional RCT with individuals as the unit of RA and stronger internal validity than observational and quasi-experimental studies RA to encouragement (persuasive communication) to have the intervention or to select from a menu of options more closely mimics the delivery of preventive services in real-world settings Can reveal participants’ decision-making process (i.e., models real-world behavior of treatment choices) Researchers and community are partners in the research; community and individual preferences are considered May provide a more equitable relationship between researcher and participant than mandated treatment assignment Subjects can serve as own controls when those originally in the control arm receive the intervention

Internal validity may be lower than a traditional RCT with individuals as the unit of RA; need to collect extensive quantitative and qualitative data to measure intensity of and fidelity to implementation of intervention; because it is less controlled than an RCT, RETs tend to have smaller effect sizes and greater within-group variance, therefore requiring larger sample sizes; cost may be very high due to data collection requirements and smaller effect sizes and greater within-group variance than RCTs

No controls for longer-term secular trends May have contamination and extended learning effects by controls who were exposed to general ideas of trial

Add a nonequivalent dependent variable



GRT39,40 (see also Camargo’s presentation on asthma scenario and Murray’s presentation on GRTsc)

Key strengthsa,b

Key limitations

May have greater enrollment and subject retention among controls than a traditional RCT with individual RA because they know they will receive intervention at a definable future point Staggered enrollment can allow some examination of secular trends through having subjects initiate intervention at different times With proper randomization and enough groups, bias is similar across study conditions Can use a GRT design with a small number of groups for (1) feasibility study or preliminary evidence of effectiveness, and (2) estimating effect or intraclass correlation coefficient without needing causal inference

May have autocorrelation (correlation of consecutive observations over time) in the analyses among individuals who begin as controls and cross over to treatment group

Extra variation attributable to groups increases standard error of measurement Degrees of freedom are limited with small numbers of groups, reducing the benefits of randomization Complicated logistics Large-scale GRTs can be very expensive

Quasi-experimental designs: nonrandomized designs with or without controls

PP32
Key strengths: May be useful for testing feasibility of an intervention; (nonrandomized) PP with control or comparison group can account for secular trends
Key limitations: PP without control or comparison group has many threats to internal validity: selection, history, maturation, testing, and instrumentation; limited external validity for other units, settings, variations in treatment, outcome measures

ITS32,41 (see also Feldman’s presentation on ITSb)
Key strengths: Repeated measures enable examination of trends before, during, and after intervention; boosts power to detect effect by providing a precise picture of pre- and post-intervention through taking advantage of order and patterns—both observed and expected—over time; pre-intervention series of data points allows for examination of historical trends, threats to internal validity
Key limitations: No accounting for concurrent historical trends without control group; instrumentation changes can lead to identification of spurious change; selection biases if composition of sample changes at intervention


Enhancements that could strengthen design

Can decrease variation attributable to groups through adjustment for covariates (reducing the intraclass correlation coefficient) and modeling time Employ more and smaller groups rather than fewer and larger groups Match or stratify groups a priori Include independent evaluation personnel who are blind to conditions Pay particular attention to recruiting representative groups and members Add control or comparison group Add nonequivalent dependent variable

Add control group; qualitatively or quantitatively assess whether other events or changes in composition of sample might have caused effect or whether data collection methods changed; add nonequivalent dependent variables; remove treatment at known time; use switching replications design; use multiple jurisdictions with varying degrees and timing of interventions and similar surveillance data



Multiple baseline (see Brown’s presentation on the tobacco control scenario and Sanson-Fisher’s presentation on multiple baseline designsb)

RD32 (see also Shadish’s presentation on RDb)

NEs32 (see also Murray’s presentation on obesity scenario and Gortmaker’s presentation on NEsb)

Key strengthsa,b Can closely assess effect size, speed, and maintenance over time Each unit acts as its own control All settings can get the intervention if ongoing analyses suggest that it is beneficial Can use individuals and small and large groups as units of analysis Appropriate and accepted statistical analyses exist If an intervention strategy appears, through ongoing analyses, not to be beneficial, that strategy can be modified or replaced by another strategy before the intervention is placed in another jurisdiction/site Can study various components of an intervention individually Design is consistent with decision-making process used by a wide range of influential groups, such as policymakers, police, educators, and health officials When properly implemented and analyzed, RD yields an unbiased estimate of treatment effect Allows communities to be assigned to treatment based on their need for treatment, which is consistent with how many policies are implemented Incorporates characteristics of multiple designs, including multiple baseline and switching replication

Provide the potential to study more innovative, large-scale, expensive, or hard-toimplement programs and policies than typically can be studied in project funded through regular mechanisms available to funders

Enhancements that could strengthen design

Key limitations

Having fewer study units may limit generalizability Interventions can be affected by chance in some units Measures must be suited for repeated use Must determine how to define a stable baseline The design depends on temporal relationship between intervention and measures that is either abrupt or must be able to predict time lag following intervention Must determine how far apart interventions should be staggered

Increase number of study units Research costs are reduced if data are routinely collected surveillance data Can incorporate switching replication Can randomize within sets of communities to determine order of entry into study

Complex variable specification and statistical analysis Statistical power is considerably less than randomized experiment of same size due to collinearity between assignment and treatment variables Effects are unbiased only if functional form of relationship between assignment variable and outcome variable is correctly specified, including nonlinear relationships and interactions

Correctly model functional form of relationship between assignment and outcome variables prior to treatment. This can be done with surveillance data Power can be enhanced by combining RD with randomized experiment. Rather than using cutoff score for assignment to treatment and control, use cutoff interval. Cases above interval are assigned to treatment and those below are controls. Those within cutoff interval are randomly assigned to treatment or control Can increase internal validity with more data points in the pre- and post-intervention periods, using multiple baseline or time series methods

Selection biases May have limited generalizability and this is difficult to examine because (1) there is no RA to conditions, (2) matching with comparison groups may be based on limited number of variables; (3) experimenter does not control intervention; and (4) lower internal validity than designs with RA



Provides opportunity to study interventions for which typical funding mechanisms would be too slow to capture such opportunities prospectively; policymakers and laypeople understand NEs; can reduce costs if extant data can be used

a Key strengths, limitations, and enhancements were generated through presentations and discussions at the symposium. b All presentations can be accessed at http://obssr.od.nih.gov/Conf_Wkshp/Complex%20Interventions.htm. c Threats to internal validity30: ambiguous temporal precedence—lack of clarity of cause and effect may result from being unsure of which variable occurred first; selection—participants in intervention and control groups may differ in an important way; history—events outside the study might affect results but not be related to the intervention; maturation—subjects may change over a study due to the passage of time only; testing—prior measurement of the dependent variable may affect subsequent measurements; instrumentation—reliability of instrument that assesses the dependent variable or controls the independent variable may change over the study; regression to the mean—those with extreme scores tend to have scores closer to the mean on a second measurement; mortality/attrition—differential attrition from study between intervention and control groups. d Threats to external validity30: interaction of the causal relationship with the units—the extent to which the study results can be generalized from the specific sample that was studied to various defined populations; interaction of the causal relationship over intervention variations—the extent to which an effect found with one variation of an intervention can be generalized to other variations of the intervention; interaction of the causal relationship with outcomes—the extent to which an effect found on one kind of outcome variation would hold if other outcome variations were used; interactions of the causal relationship with settings—the extent to which the study results can be generalized from the study’s set of conditions to other settings; context-dependent mediation—the extent to which an explanatory mediator in one context mediates in another context. GRT, group randomized trial; ITS, interrupted time series design; NE, natural experiment; PP, pre–post design; RA, random assignment; RCT, traditional randomized controlled trial with individuals as the unit of RA; RD, regression discontinuity design; RET, randomized encouragement trial; SET, staggered enrollment trial.

Figure 1. Phases of research described by various traditions, and estimation of their relative equivalence across schemas. Note: Although the phases are portrayed in a linear fashion to facilitate comparability, they need to be viewed as iterative and cyclical (see discussion in text). The figure compares four schemas:

Basic descriptions of research phases used in social science, epidemiology, health care, public health, and other health-related fields: basic research; formative/descriptive/hypothesis-generating research; analytic/hypothesis-testing research; efficacy research; effectiveness research; translational research.

The “Levy arrow,” phases originally developed to illustrate the continuum of research at the National Institutes of Health43: I. Basic research; II. Applied research and development; III. Clinical investigations; IV. Clinical trials; V. Demonstration and education research.

Flay’s “eight phases of research” for the development of health promotion programs44: I. Basic research; II. Hypothesis development; III. Pilot-applied research; IV. Prototype evaluation studies; V. Efficacy trials; VI. Treatment effectiveness trials; VII. Implementation effectiveness trials; VIII. Demonstration evaluations.

Framework for design and evaluation of complex interventions to improve health45: preclinical theory; Phase I, modeling; Phase II, exploratory trial; Phase III, definitive randomized controlled trial; Phase IV, long-term implementation.


the seemingly infinite number of specific population– setting– circumstance interactions.23,27 For example, some of the community-level tobacco and underage drinking interventions had been tested with randomized designs, but there was uncertainty about whether they could be applied without modification or adaptation to different communities or in other countries. Rather than proceeding with a large-scale effectiveness test in a new country, Pechacek, Brown, and other members of the tobacco scenario working group noted that a more appropriate research question would be to first adapt the “proven” intervention to the new settings and determine its impact on the important outcomes. Additionally, studying the transferability to other real-world communities of the tobacco intervention revealed that assumptions had been made about certain aspects of its implementation and functionality. Pechacek, Brown, and colleagues suggested that those assumptions might need additional pilot testing or efficacy research prior to proceeding with studies of translatability. Key conclusions of the symposium, therefore, included the need to allow cycling back to earlier phases of research as new questions arise,33 and to enable the research question to drive the choice of study design rather than allowing a preference for one design or a linear view of research phases to alter the essential question and the circumstances and context in which it needs to be answered. Within each study design, there are considerable differences across published studies in both quality and the degree to which they seek to address the needs of practitioners and policymakers. Decision makers consider both study quality and utility when determining whether to filter information out or to pull it into their stock of knowledge—which they can then call into action when required.46,47 Enhancements that can strengthen study quality, utility, or both are therefore suggested for each study design in the following section and in Table 2. In terms of quality, for example, control or comparison groups and non-equivalent variables can be added to interrupted time series and pre–post designs. In terms of utility, designs should seek to address threats to external validity that may result from interactions between intervention characteristics (such as intervention intensity and the skill of personnel implementing the intervention) and contextual factors in settings to which one might wish to generalize. This involves considering the extent to which study results can be generalized from the specific conditions in the study to various defined populations, other variations of the intervention, other outcome variations, other settings, and other contexts (see discussion of external validity in the footnote to Table 2).32 For example, rather than conducting highly restricted randomized controlled trials with strict protocols and narrowly defined participant groups, researchers can design August 2007

practical clinical trials that aim to answer decision makers’ questions and that therefore choose relevant interventions, include diverse participants from heterogeneous settings, and measure outcomes of relevance to decision makers.5 Symposium participants highlighted the importance of ensuring that designers of all effectiveness and translation studies actively seek to increase study quality and consider practice needs from the earliest stages of study design, with the aim of providing practice-based evidence.27,48 Symposium attendees further identified that one of the most valuable ways to build population relevance, acceptability, and usefulness is to use participatory research approaches.49 –52 These approaches require researchers to engage those who are expected to be the users, beneficiaries, and stakeholders of the research not just as subjects of the research but as active participants in the research process itself—including them in identifying research needs, honing research questions, designing and conducting the study, and interpreting and applying the study findings. As an approach to research rather than a specific research design, participatory research can be used with all of the designs discussed here, including randomized designs. Since no single study can establish causality, and given the various trade-offs among study designs, symposium discussants underscored the importance of using different study designs to address the same research question (triangulation) and encouraging more replication of studies.32,42,53 Such replication and triangulation can further increase confidence in causal inference, in the likelihood that findings represent stable effects, and in the results being generalizable; and it can help to offset strength of evidence lost when randomization is impossible. Replication also facilitates systematic reviews and meta-analyses.53 Novel approaches to design can also be used to take advantage of scarce research dollars. A relatively rare approach, discussed at the symposium by Lanier, is to design a study including two different interventions addressing non-overlapping health conditions (e.g., smoking cessation and injury control), where each intervention can serve as the control or comparison group for the other. This provides information about the intervention and the health topic area, while controlling for Hawthorne (that knowledge of being studied can influence one’s behavior35) and other reactive effects.

Strengths and Limitations of Various Designs and Trade-offs Among Them

Research designs proposed by the working groups included randomized controlled designs (often called true experimental designs) and nonrandomized designs with or without controls (often known as quasi-experimental designs). Also proposed were natural


experiments—a category of research and evaluation within which various design options can be employed, rather than a type of design.32 Randomized controlled designs: true experimental options. Randomized controlled designs were acknowledged throughout the symposium as the typically preferred options. Those discussed in the symposium included the traditional randomized controlled trial with individuals as the unit of random assignment (RA), randomized encouragement trial, staggered enrollment trial, and group randomized trial. Randomized controlled trial with individuals as the unit of random assignment. The traditional RCT with RA at the level of the individual is considered by many to be the “gold standard” for clinical and other intervention research because it protects against threats to internal validity due to history, maturation, selection, testing, and instrumentation biases, ambiguous temporal precedence, and the tendency for measurements to regress to the mean (see Table 2).32 For these reasons, this design is particularly ideal for early clinical efficacy research.32 Although designing practical clinical trials can increase the external validity of RCTs with individual RA,5 almost none of the scenarios presented at this symposium (Table 1) could be studied efficiently using this design because individual RA was not possible for interventions already underway or planned for the community level, differential attrition and/or contamination were likely, potential subjects would not agree to be randomized with a chance of never receiving the intervention, functioning of the setting would make it challenging to adhere to individual RA, and/or sample sizes would have been prohibitively large (reasons were identified in the various scenario presentations and discussed in Cook’s and Shadish’s commentaries—all available online).8,10,54,55 Randomized encouragement trial. Some scenario groups chose a randomized encouragement trial (RET) as one of their study design options to receive the benefits of randomization while simultaneously mimicking the delivery of many preventive services in real-world settings. An RET encourages subjects in the intervention group to participate in the intervention or to choose among a menu of specifically defined intervention options (as in the diabetes scenario; see Table 1), while subjects in the control group are neither offered nor encouraged to participate in the intervention. Randomization in an RET can be at the individual level or higher. An RET may allow recruitment of a more representative sample since participants are active partners in selecting their treatment. Support from community leaders may be greater because participants are given choice, and encouragement strategies can be developed collaboratively with the community (see Chin’s 148

presentation for diabetes).37,38 An RET’s internal validity may be lower than a traditional RCT with individual RA, if assessing individual components, but higher than an observational or quasi-experimental study.37,38 A well-done RET may have stronger external validity than a traditional RCT with RA, and it provides an indication of the uptake or participation rate among participants. Because there may be substantial variability in each participant’s intervention, an RET requires extensive measurement of intervention intensity and fidelity (Table 2). Self-tailoring could lead some participants to select few intervention components, resulting in smaller effect sizes and therefore requiring larger sample sizes. Yet, participants could also select more components than usual resulting in a larger effect size than would be expected with a RCT. Additionally, if appropriate data are collected, an RET can provide considerable insight into participants’ decision-making processes. Mangione and colleagues in the diabetes scenario estimated that their proposed RET could have required four times the sample size of the Diabetes Prevention Program (DPP) RCT.36 However, since the DPP already demonstrated a causal connection among weight loss, diet and physical activity, and the prevention or delay in onset of diabetes,36 their scenario studying transferability to real-world settings could use weight loss as the outcome variable rather than diabetes onset, thus reducing the number of biological and related measurements. This, and delivering their intervention at the group level, would have lowered the cost for their proposed RET below that for a traditional RCT with individual RA. A similar cost savings might be possible in other effectiveness or translation studies where an intermediate variable exists that earlier efficacy or effectiveness studies have clearly demonstrated is in the causal pathway. Nevertheless, RETs may still be as or possibly even more expensive than traditional RCTs with individual RA if substantial observation is needed to study individuals’ choices. Staggered enrollment trial. There are multiple designs within the staggered enrollment trial (SET), which begins by randomizing subjects into the intervention or control arm for a defined period of time. During this period, the trial design is the same as that of a traditional RCT. Then at the end of this first follow-up period, the initial control subjects are either started on the intervention (similar to wait-list controls) or randomized a second time to intervention or control, with all subjects eventually participating in the intervention (see Chin’s presentation on SET). In the former case, the comparison for the control subjects now in the intervention is the time when they were in the control group. In the latter case, the comparison for the intervention subjects is the subjects who remain in the control arm (Table 2).


As discussed in the context of the diabetes scenario (Table 1), SETs are likely to have greater subject enrollment and retention than RETs and traditional RCTs because patients who are randomized to the control group know that they will receive the intervention at some definable future point. At the same time, this design does not provide any controls for longerterm outcomes or secular trends unless enrollment is staggered over a long time frame. As with traditional RCTs, there may also be contamination and extended learning effects in the control group by participants having been exposed to the general ideas of the trial. Another caution noted at the symposium is that there may be autocorrelation (correlation of consecutive observations over time32) in the analyses among individuals who begin as controls and cross over to the intervention group. Group randomized trial. Three scenario groups (obesity, tobacco use, and asthma) selected a group randomized trial (GRT)—where groups, rather than individuals, are randomized— because while they believed that RA would guard against threats to internal validity, their settings (schools, communities, and emergency departments, respectively) were too complex to enable conduct of multicomponent interventions with individual RA. Group randomization is also beneficial for other complex settings such as worksites and clinical practices—the latter because much health care today takes place in clinical microsystems rather than between a single provider and patient. The main strength of the GRT is that with proper randomization and enough groups, potential sources of bias are equally distributed across intervention and control groups (Table 2) and, assuming a valid analysis, inferences can be as strong as those obtained from a traditional RCT with individual RA.39 Well-done GRTs require intervention and control groups to be matched on several stable independent correlates of the outcome such as age or problem severity, or to be similar on such correlates if the number of groups is large enough.40 A primary disadvantage of GRTs is the need for large numbers of groups.39 Extra variation or noncomparability within groups can also threaten internal validity, and intragroup correlation threatens power. In his presentation on GRTs, Murray advised meeting participants to reserve full-scale GRTs for situations in which (1) experimental evidence is needed for causal inference, (2) individual randomization is not desirable, (3) there is preliminary evidence for feasibility and effectiveness or translatability, and (4) there is sufficient information available to size the study. He also noted that smaller GRTs (e.g., eight groups or fewer) are useful for studying the feasibility of a fullscale GRT through providing an effect estimate. August 2007
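The cost that intraclass correlation imposes on a GRT can be made concrete with the standard design-effect calculation. The sketch below is illustrative only: the effect size, group size, and candidate intraclass correlation coefficient (ICC) values are assumed for the example and are not drawn from the symposium scenarios. The point is simply how quickly the required number of groups per arm grows as the ICC or group size increases.

```python
# Illustrative sketch (assumed planning values, not from the symposium):
# how intraclass correlation inflates the size of a group randomized trial.
import math
from scipy.stats import norm

def n_per_arm_individual(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sample comparison of means
    under individual random assignment (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

def groups_per_arm(delta, sd, members_per_group, icc, alpha=0.05, power=0.80):
    """Inflate the individually randomized n by the design effect
    1 + (m - 1) * ICC, then convert to whole groups per arm."""
    n_ind = n_per_arm_individual(delta, sd, alpha, power)
    design_effect = 1 + (members_per_group - 1) * icc
    return math.ceil(n_ind * design_effect / members_per_group)

# Hypothetical planning values: detect a 0.25 SD difference with
# 50 members per group, across a plausible range of ICCs.
for icc in (0.01, 0.02, 0.05):
    g = groups_per_arm(delta=0.25, sd=1.0, members_per_group=50, icc=icc)
    print(f"ICC={icc:.2f}: about {g} groups per arm")
```

Even a small ICC can push the design from a handful of groups per arm to well over a dozen, which is consistent with Murray's advice to reserve full-scale GRTs for questions that genuinely require group-level randomization and for which there is sufficient information to size the study.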

Nonrandomized Designs With or Without Controls: Quasi-Experimental Designs Nonrandomized designs considered by one or more scenario groups included pre–post, interrupted time series, multiple baseline, and regression discontinuity designs. Pre–post (PP). Traditional PP designs without randomization measure variables of interest at a single point before and a single point after an intervention. PP designs without a control or comparison group have numerous limitations affecting internal and external validity (Table 2).32,53 Adding control or comparison groups, as was done in the physical activity scenario (Table 1), can help account for secular trends. Nonrandomized PP designs may be useful for testing the feasibility of an intervention and are better than nonrandomized post-test only designs, unless the pretest creates a strong interaction with the intervention and biases the results. Generally a nonrandomized PP design, even with a control or comparison group, should not be the sole source for causal inference.53 Interrupted time series. In an interrupted time series (ITS) design, a string of consecutive observations is interrupted by the imposition of an intervention to see if the slope or level of the series changes following the intervention (see Feldman’s presentation on ITS in the “Exploring the Tradeoffs” session).32 Each site acts as its own comparison prior to implementation. This design is appropriate when one knows the specific point at which a policy, service, or other intervention will occur in prospective studies, or when it occurred for retrospective studies and, ideally, when most people were exposed to it. It is a strong alternative when randomization is not feasible due to inability to control who receives an intervention.41 As discussed for the physical activity scenario, an ITS has an advantage over a traditional PP design because it allows detection of trends before, during, and after intervention implementation (Table 1). The pre-treatment series of data points allows examination of potential threats to internal validity and the post-treatment series allows description of the speed of the change and persistence of the effect (Table 2). Limitations of ITS designs (Table 2) can be reduced by adding one or more nonrandomized control or comparison groups, quantitatively or qualitatively assessing whether other events might have caused effects, removing the intervention at a known time, or using a switching replications design in which nonrandomized groups receive the intervention at different times and serve as controls for each other.32 Use of comparison or control groups can be further enhanced if there are multiple jurisdictions with varying degrees and timing of interventions. An additional enhancement includes measuring non-equivalent dependent variables— Am J Prev Med 2007;33(2)


variables that are not expected to change because of the intervention but that are expected to respond to some or all of the contextually important threats to internal validity in a similar fashion as the dependent variable.32 In the physical activity scenario, tennis playing functioned as a non-equivalent type of exercise because walking was expected to increase with light rail transit implementation, but tennis was not. Finally, Feldman’s presentation on ITS noted a number of analytic options for time series including approaches that characterize and compare trends. Multiple baseline. The multiple baseline (MB) design is a form of ITS design that is used most often when components of interventions are being developed or combinations of components within effective interventions are being tested.41 Sanson-Fisher noted in his presentation on MB designs that they can take a “mission-oriented” approach in which numerous components are included at the outset with the aim of causing change in the outcome of interest early on, and then engaging in component analysis through selective removal of components to determine which are the most effective. Alternatively, a “component-oriented” approach involves consecutively adding components to the intervention until the desired effect is achieved. If a given component does not work, a different one can be substituted or the current component can be modified before testing it in another community. Another MB approach is to study similar interventions simultaneously in different settings, as was suggested for the tobacco control scenario (see Brown’s presentation online). Key disadvantages of the MB design (Table 2) relate to requiring measures that are suitable for repeated measurements and needing to know how to define a stable baseline and how far apart to stagger interventions. The ability to individually study different components of an intervention provides an advantage over designs that implement a whole package of interventions. The most important advantage of the MB design is that it is consistent with the decision-making processes of policymakers, police, educators, and health officials when they periodically examine administrative records and surveillance data, since resources for interventions may be allocated differentially over time (Table 2). Regression discontinuity design. In a regression discontinuity (RD) design, the researcher assigns participants (individuals or groups) to intervention and comparison or control conditions (or two or more intervention conditions) based on their exceeding or falling below a cut-off on an assignment variable, rather than randomly.32 The assignment variable can be any measure taken before the intervention—such as scores on a pre-test, a measure of illness severity, or arrests for drunk driving, as was considered in the alcohol sce150

nario. When an intervention effect is seen, the regression line for the intervention group is discontinuous from the regression line for the comparison or control group (see Shadish’s presentation on RD designs). The major strength of RD (Table 2) is that, when properly implemented and analyzed, RD yields an unbiased estimate of the intervention effect.32,56 An additional advantage for community-based interventions is that it allows communities to be assigned to intervention based on their greater need, which is consistent with how many policies are implemented (see Shadish’s presentation on RD designs). Yet, because of collinearity between the assignment and intervention variables, statistical power is considerably less than in a RCT of the same size. RD requires over two times as many subjects as a randomized experiment to reach 0.80 power.57 Furthermore, effects are unbiased only if the functional form of the relationship between the assignment and outcome variables is correctly specified, including nonlinear relationships and interactions.32 Once the relationship is correctly modeled, any threats to internal validity would cause a sudden discontinuity in the regression line at the cut-off point of the assignment variable and this is typically considered implausible. It is possible to combine the RD design and randomization by defining a cut-off interval, assigning all participants above the interval to one condition, all participants below to another, and randomly assigning participants within the interval to the various conditions. If feasible, this allows randomization of middle participants when it is not clear where the cut-off should be set, allows estimation of regression lines for both intervention and control participants within the randomization interval, and increases power over the RD design alone.32
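To illustrate how an RD analysis is typically specified, the sketch below simulates a sharp-cutoff study and regresses the outcome on the centered assignment variable, a treatment indicator defined by the cutoff, and their interaction. The data, cutoff, and effect size are invented for illustration; this is a minimal sketch of the general approach, not the analysis proposed for the alcohol scenario, and a real application would also probe for nonlinearity in the assignment-outcome relationship, as noted above.

```python
# Illustrative sketch (simulated data, not the alcohol scenario): a basic
# regression discontinuity (RD) analysis with a sharp cutoff.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, cutoff, true_effect = 500, 3.0, 2.0

# Assignment variable (e.g., a pre-test score); units at or above the
# cutoff receive the intervention, as in a needs-based assignment rule.
assign = rng.uniform(0, 6, n)
treated = (assign >= cutoff).astype(int)

# Outcome depends smoothly on the assignment variable, plus a jump
# (the intervention effect) at the cutoff.
outcome = 10 + 1.5 * assign + true_effect * treated + rng.normal(0, 2, n)

df = pd.DataFrame({
    "y": outcome,
    "treated": treated,
    "x_c": assign - cutoff,  # center at the cutoff so 'treated' is the jump there
})

# Allow separate slopes on each side of the cutoff; a misspecified
# functional form (e.g., ignored curvature) would bias the estimate.
model = smf.ols("y ~ treated + x_c + treated:x_c", data=df).fit()
print(model.params["treated"])          # estimated effect at the cutoff
print(model.conf_int().loc["treated"])  # its confidence interval
```

With the interaction term included, the coefficient on the treatment indicator is the estimated effect at the cutoff; power is still noticeably lower than in a randomized trial of the same size, as discussed above.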

Natural Experiments

A natural experiment (NE) involves investigating an existing, newly developing, or anticipated naturally occurring situation in which an intervention usually cannot be manipulated by the researcher.32 Non-experimental (not discussed at the symposium), quasi-experimental, and, very rarely, randomized designs can be used to study NEs. Natural experiments often enable study of innovative, large-scale, expensive, hard-to-implement, rapid, and/or jurisdiction-wide programs and policies that would be difficult to fund through regular funding mechanisms, or for which getting funding mechanisms in place would be too slow to capture opportunities prospectively. Murray stated that NEs often have limited generalizability because there is typically no RA to conditions and because matching of intervention and comparison groups is often based on a limited number of variables (see Murray’s obesity scenario presentation). The internal validity of NEs can be increased by adding data points pre- and post-intervention, by applying MB, ITS, or RD methods, and by having a large number of comparison groups—as in comparing data from one or two states that underwent a change in policy or program to data from the remaining states (see Table 2). Efforts should also be made to ensure that comparison groups are well matched. Under such conditions, Gortmaker noted that NEs are valuable and underutilized for studying complex interventions (see his presentation on NEs).
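The "one or two policy-change states versus the remaining states" comparison described above is often analyzed with a difference-in-differences model, which contrasts the pre-to-post change in the policy state(s) with the corresponding change in the comparison states. The sketch below is a minimal illustration built entirely on assumed data; the state labels, years, policy start date, and effect size are invented, and a real analysis would use more pre- and post-intervention time points and check that comparison states are well matched, as recommended above.

```python
# Illustrative difference-in-differences analysis of a natural experiment in
# which one state adopts a policy in a known year and the remaining states
# serve as comparison groups. All values here are assumed for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for i in range(50):                               # hypothetical states
    policy_state = (i == 0)                       # the single policy-change state
    for year in range(2000, 2006):
        post = year >= 2003                       # policy assumed to start in 2003
        rate = (30 - 0.5 * (year - 2000)
                - (3.0 if (policy_state and post) else 0.0)
                + rng.normal(0, 1))
        rows.append({"state": i, "year": year, "policy": int(policy_state),
                     "post": int(post), "rate": rate})
df = pd.DataFrame(rows)

# The 'policy:post' interaction estimates the policy effect under the assumption
# that the policy and comparison states would otherwise follow parallel trends.
fit = smf.ols("rate ~ policy + post + policy:post", data=df).fit()
print(fit.params["policy:post"])
```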

Discussion

Given increasing practice and policy demands for answers about what works in real-world environments, the symposium discussed in this paper explored the strengths and weaknesses of, and trade-offs among, designs for conducting complex, multilevel effectiveness and translation research. Symposium planners, presenters, and expert invitees highlighted a number of lessons and recommendations. As Shadish reminded symposium attendees in his presentation on trade-offs, and as discussed in the literature,30,58 there is “no free lunch” when assessing causality.

Some reasons given for not employing randomized designs are questionable. Some argue that traditional RCTs with individual RA are not appropriate when the intervention will be locally adapted. However, randomized experiments do not require interventions that are consistent across intervention sites.59 Shadish noted that variability in the intervention is relevant to its construct validity, requiring collection of sufficient information to describe variation across sites. Other arguments relate to cost. Quasi-experiments and NEs can be as expensive as traditional RCTs, however, particularly if appropriate modeling for selection bias is included. Other factors being constant, sample size requirements can be substantial for GRTs and quasi-experiments. Overall, the expense, time, and participant commitment required to secure equivalent quality in data collection and measurement may be similar across designs and higher for designs in which the intervention varies across sites.
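The sample size pressure on GRTs comes largely from the design effect induced by intraclass correlation (ICC): with clusters of size m, the individually randomized sample size is inflated by roughly 1 + (m - 1) × ICC. The worked example below uses assumed values (an individual-level requirement of 300 per arm, clusters of 50, ICC of 0.02) purely to show the arithmetic; it is a sketch, not a symposium recommendation.

```python
# Worked example of the design effect for a group-randomized trial (GRT).
# Assumed inputs (illustrative only): 300 individuals per arm would suffice
# under individual randomization, clusters of 50 members, and an ICC of 0.02.
import math

def grt_sample_size(n_individual: int, cluster_size: int, icc: float) -> dict:
    """Inflate an individually randomized sample size by the GRT design effect."""
    design_effect = 1 + (cluster_size - 1) * icc
    n_per_arm = math.ceil(n_individual * design_effect)
    clusters_per_arm = math.ceil(n_per_arm / cluster_size)
    return {"design_effect": design_effect,
            "individuals_per_arm": n_per_arm,
            "clusters_per_arm": clusters_per_arm}

print(grt_sample_size(n_individual=300, cluster_size=50, icc=0.02))
# For these inputs the design effect is close to 2, so roughly twice as many
# individuals per arm are needed as under individual randomization.
```

With larger clusters or higher ICCs the inflation grows quickly, which is one reason even modest GRTs can demand far more participants than an individually randomized trial of the same question.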

When randomization and/or experimental control truly are impossible, impractical, or will not enable sufficient examination of external validity, careful selection among the nonrandomized designs discussed here, use of the suggested enhancements to those designs, and replication and triangulation of research can all increase confidence in causal inference and offset the strength of evidence lost by forgoing randomization.32,42,53 Given the regular initiation of jurisdiction-wide health initiatives and policies with no lead time and no chance of experimental control or randomization, symposium attendees highlighted the importance of building quick-response expertise, capacity, and support to enable capitalizing on emergent NE opportunities and collecting adequate baseline information. Gortmaker suggested that groups similar to CDC’s Epidemic Intelligence Service60 could be established and ready on short notice to investigate policy changes and other NEs. Since the cost of NEs is substantially reduced when using archival or routinely collected data, symposium participants noted the benefits of strengthening and extending existing surveillance systems and developing new systems, so that required data are in place before change occurs.

While not explicitly discussed at the symposium, Shadish et al.32 recently provided a grounded theory and a set of five principles for assessing generalized causal inference—exploring the extent to which the causal relationship is generalizable over variations in interventions, outcomes, persons (or other units), and settings. The principles are (1) surface similarity—generalizing by judging apparent similarities between things studied and targets to which one wishes to generalize; (2) ruling out irrelevancies—generalizing by identifying the attributes of persons, settings, interventions, and outcome measures that are irrelevant because they do not change a generalization; (3) making discriminations—generalizing by making discriminations that limit generalization; (4) interpolation and extrapolation—generalizing by interpolating to unsampled values within the range of the sampled persons, settings, interventions, and outcomes, and extrapolating beyond the sampled range; and (5) causal explanation—generalizing by developing and testing explanatory theories about the target of generalization.32 All five principles must be met to adequately address generalized causal inference, but they differ in how practical they are for use within an individual study.

Although this symposium concentrated on study designs for assessing causality, Shadish noted that RCTs—and designs structured to be their equivalent—are often not the best designs for noncausal questions. Different designs may be required to examine related descriptive and process questions whose answers are also essential for guiding the translation of effective interventions into practice across a range of real-world populations, settings, and conditions.61 For example, process and implementation studies can be used to explore which practitioners will adopt and sustain effective practices, needs assessments can identify which patients need to adopt and sustain interventions, and cost-effectiveness studies can determine the direct and indirect costs of programs. In addition, relatively inexpensive “early-phase” research can be used to determine the feasibility of an intervention. Some of the designs discussed here can be useful in these types of studies. Small GRTs may be useful for estimating effect sizes and/or intraclass correlations and for determining whether an intervention is worth pursuing. MB designs are useful for examining feasibility and choosing design elements or settings for implementation.


Although beyond the scope of the symposium and the current paper, further attention to choosing among alternative designs for such noncausal questions is essential. Special care also needs to be given to delineating design, methodologic, and analytic components that can be addressed within, or added to, effectiveness and translation studies that ask causal questions, to enable simultaneous study of intervention implementation and fidelity. If the results of causal effectiveness and translation studies are null or the effects are smaller than expected, it is essential to tease out whether the intervention was ineffective, whether implementation of the intervention was incomplete, whether certain components were counterproductive, or whether other factors are responsible. Designs such as the RET, ITS, and MB are particularly amenable to process and implementation evaluation.

Because all the scenarios involved substantial behavioral components, the study designs considered at the symposium are most familiar to those engaged in health behavior, health promotion, and evaluation. With the health field’s increasing need for interdisciplinary and transdisciplinary research, symposium participants highlighted the importance of reviewing study designs across all of the disciplines and traditions that contribute to causal inference within health, including epidemiology, economics, and medical anthropology, among others. Finally, the symposium did not deal fully with questions that should be considered alongside design choice, such as analysis issues or modeling selection bias in quasi-experiments. These issues are also beyond the scope of the symposium and the current paper, but they deserve in-depth attention.

Next Steps and Conclusions

The enthusiasm generated by the May 4–5, 2004 symposium, along with ongoing consideration by the planning group of lessons learned, spawned several other symposia on related issues. One, cosponsored by AHRQ, CDC, NIH, RWJF, and the Department of Veterans Affairs and held September 13–15, 2005, focused on the needs of health care and public health quality improvement (QI), with the aims of reviewing a range of QI interventions and their relevant research and evaluation questions, considering designs and methods for answering QI questions, and suggesting changes in funding, review, training, and publication to accelerate reliable QI research methods and grow the field. Materials from this symposium are currently available on the Internet (at www.hsrd.research.va.gov/quality2005/) and manuscripts are in preparation. One other symposium built on related work by one of the current authors (LWG) and Glasgow, who proposed a set of criteria for assessing external validity27 that can be used alongside existing guidelines and rating scales for internal validity such as CONSORT,62,63 TREND,64,65 and the Jadad scale,66 which are employed by the Cochrane Collaboration,67 AHRQ Evidence-Based Practice Centers,68 U.S. Preventive Services Task Force,69 and the Task Force on Community Preventive Services.16,17 That symposium, sponsored by RWJF, NIH, CDC, and AHRQ, brought together editors of several influential public health journals to receive their feedback on the value and operationalization of incorporating external validity criteria into manuscript review.

The process of selecting the optimal combination of specific design elements in effectiveness and translation research is not simple. The choice of study design is shaped by the specific research question; the level of understanding and certainty about the underlying theory, mechanisms, and efficacy of an intervention; the possibility of randomizing individuals or groups; the availability of natural experiments; the level of available resources; the extent of generalization required; and the views of intended users of the research and study subjects. Nevertheless, well-designed studies of complex, multilevel interventions provide exciting opportunities to increase knowledge about what works when and where, and how to make future improvements.

We would like to express our sincere appreciation to all those who brought their expertise, experience, and enthusiasm to the planning and execution of the symposium that is discussed in this article—the Hill Group, who coordinated logistics for the symposium; the Centers for Disease Control and Prevention and National Institutes of Health Symposium Advisory Teams (members listed online at http://obssr.od.nih.gov/Conf_Wkshp/Complex%20Interventions.htm); the Scenario Working Groups (members listed online at http://obssr.od.nih.gov/Conf_Wkshp/Complex%20Interventions.htm); those who presented on behalf of the Scenario Working Groups (Anthony Biglan, PhD, K. Stephen Brown, PhD, Ross C. Brownson, PhD, Carlos A. Camargo, Jr., MD, DrPH, Marshall Chin, MD, Deborah A. Cohen, MD, MPH, Henry A. Feldman, PhD, Brian R. Flay, DPhil, Steve L. Gortmaker, PhD, Ralph W. Hingson, ScD, MPH, Harold Holder, PhD, Robert W. Jeffrey, PhD, Carol M. Mangione, MD, MSPH, David M. Murray, PhD, William R. Shadish, PhD, and Sandra R. Wilson, PhD); and the additional presenters and commentators (Marshall Chin, MD, Thomas D. Cook, PhD, Henry A. Feldman, PhD, Steve L. Gortmaker, PhD, David M. Murray, PhD, Mary E. Northridge, PhD, MPH, Rob Sanson-Fisher, PhD, and William R. Shadish, PhD). Further thanks are due to Rob Sanson-Fisher for helping us originate the idea for this initiative, and to William R. Shadish for providing methodologic insights throughout the symposium planning process. We also thank David Lanier, MD, and Terry F. Pechacek, PhD, for helpful comments made at the symposium that are included in this manuscript.

LWG was employed by CDC from 1999 to 2004 and has since received various honoraria and reimbursements for chairing panels, consulting, and speaking.



He served as a member of the Board of Scientific Counselors for the National Human Genome Research Institute and was a speaker, expert panel member, and consultant for other NIH, SAMHSA, and AHRQ units and contractors. All of these agencies have some stake in the allocation of resources to the various types of research and evaluation discussed and criticized in the three papers on which he is a co-author and in the introduction to them. Some of his university colleagues at UCSF could gain, and others lose, resources for their research if the allocation of resources to specific types of research is influenced by this set of papers. No other authors reported financial disclosures.

This work was undertaken when SLM and LWG were affiliated with the Office of Science and Extramural Research, Public Health Practice Program Office, CDC; LJF was affiliated with the Office of Behavioral and Social Science Research (OBSSR), NIH; and Barbara J. DeVinney was a contractor with OBSSR. The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the CDC, NIH, or AHRQ.

References

1. Eddy DM, Billings J. The quality of medical evidence: implications for quality of care. Health Aff (Millwood) 1988;7:19–32.
2. Garber AM. Evidence-based coverage policy. Health Aff (Millwood) 2001;20:62–82.
3. International Union for Health Promotion and Education. The evidence of health promotion effectiveness: a report for the European Commission by the International Union for Health Promotion and Education. Brussels and Luxembourg: ECSC-EC-EAEC, 1999.
4. Tang KC, Ehsani JP, McQueen DV. Evidence based health promotion: recollections, reflections, and reconsiderations. J Epidemiol Community Health 2003;57:841–3.
5. Tunis S, Stryer D, Clancy C. Practical clinical trials: increasing the value of clinical research for decision making in clinical and health policy. JAMA 2003;290:1624–32.
6. Woolf SH, George JN. Evidence-based medicine: interpreting studies and setting policy. Hematol Oncol Clin North Amer 2000;14:761–84.
7. Brownson RC, Baker EA, Leet TL, Gillespie KN. Evidence-based public health. Oxford: Oxford University Press, 2003.
8. Cook TD. Causal generalization: how Campbell and Cronbach influenced my theoretical thinking on this topic. In: Alkin M, ed. Evaluation roots: tracing theorists’ views and influences. Thousand Oaks CA: Sage, 2004:88–113.
9. Des Jarlais D, Lyles C, Crepaz N, The TREND Group. Improving the reporting quality of nonrandomized evaluations: the TREND statement. Am J Public Health 2004;94:361–6.
10. Eccles M, Grimshaw J, Campbell M, Ramsay C. Research designs for studies evaluating the effectiveness of change and improvement strategies. Qual Saf Health Care 2003;12:47–52.
11. Grades of Recommendation, Assessment, Development, and Evaluation (GRADE) Working Group. Grading quality of evidence and strength of recommendations. BMJ 2004;328:1490.
12. Grol R, Grimshaw J. From best evidence to best practice: effective implementation of change in patients’ care. Lancet 2003;362:1225–30.
13. Institute of Medicine. Crossing the quality chasm: a new health system for the 21st century. Washington DC: National Academy Press, 2001.
14. Institute of Medicine Board on Health Care Services. The 1st annual crossing the quality chasm summit: a focus on communities. Washington DC: National Academies Press, 2004.
15. Harris R, Helfand M, Woolf S, et al. Current methods of the U.S. Preventive Services Task Force: a review of the process. Am J Prev Med 2001;20(suppl 3):21–35.
16. Briss P, Zaza S, Pappaioanou M, et al. Developing an evidence-based guide to community preventive services—methods. Am J Prev Med 2000;18:35–43.
17. Task Force on Community Preventive Services. The guide to community preventive services: what works to promote health? Zaza S, Briss PA, Harris KW, managing eds. New York: Oxford Press, 2005.


18. Cochrane Collaboration. Methods groups (MGs). Available at: www.cochrane.org/contact/entities.htm#MGLIST.
19. Shadish W, Myers D. Campbell Collaboration research design policy brief. November 11, 2004. Available at: www.campbellcollaboration.org/MG/ResDesPolicyBrief.pdf.
20. National Institute for Health and Clinical Excellence. The guidelines manual, April 2006. Available at: www.nice.org.uk/page.aspx?o=phmethods.
21. National Quality Forum. A national framework for healthcare quality measurement and reporting: a consensus report. Washington DC: National Forum for Healthcare Quality Measurement and Reporting, 2002.
22. U.S. Department of Education. Scientifically based evaluation methods. Available at: www.eval.org/doe.fedreg.htm.
23. Green LW. From research to “best practices” in other settings and populations. Am J Health Behav 2001;25:165–78.
24. Bero LA, Montini T, Bryan-Jones K, Mangurian C. Science in regulatory policy making: case studies in the development of workplace smoking restrictions. Tob Control 2001;10:329–36.
25. Clancy CM, Slutsky JR, Patton LT. Evidence-based health care 2004: AHRQ moves research to translation and implementation. Health Serv Res 2004;39:xv–xxiii.
26. Gerberding JL. Protecting health—the new research imperative. JAMA 2005;294:1403–6.
27. Green LW, Glasgow RE. Evaluating the relevance, generalization, and applicability of research: issues in translation methodology. Eval Health Prof 2006;29:1–28.
28. Hanney S, Gonzalez-Block M, Buxton M, Kogan M. The utilization of health research in policy-making: concepts, examples and methods of assessment. Health Res Policy Syst 2003;1:2.
29. Stryer D, Tunis S, Hubbard H, Clancy C. The outcomes of outcomes and effectiveness research: impacts and lessons from the first decade. Health Serv Res 2005;35:977–93.
30. Zerhouni E. Policy forum: medicine. The NIH Roadmap. Science 2003;302:63–72.
31. Mittman BS. Creating the evidence base for quality improvement collaboratives. Ann Intern Med 2004;140:897–901.
32. Shadish W, Cook T, Campbell D. Experimental and quasi-experimental designs. Boston: Houghton-Mifflin, 2002.
33. Campbell M, Fitzpatrick R, Haines A, Kinmonth AL, Sandercock P, Spiegelhalter D, Tyrer P. Framework for design and evaluation of complex interventions to improve health. BMJ 2000;321:694–6.
34. National Institutes of Health, Centers for Disease Control and Prevention. Hiss R, Green LW, Garfield S, et al., eds. From clinical trials to community: the science of translating diabetes and obesity research. Bethesda: National Institutes of Health, 2004. Available at: www.niddk.nih.gov/fund/other/Diabetes-Translation/conf-publication.pdf.
35. Last JM, ed. A dictionary of epidemiology. 2nd ed. New York: Oxford University Press, 1988.
36. Diabetes Prevention Program Research Group. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. N Engl J Med 2002;346:393–403.
37. Braslow JT, Daun N, Weisz JR, Wells KB, Starks SL. Randomized encouragement trial: A pragmatic paradigm for clinical research. Health Services Research and Development 2004 National Meeting, Washington DC, March 9–11, 2004 (abstract).
38. Doan N, Braslow J, Weisz J, Wells K. Randomized encouragement trial (RET): a design paradigm for public health evaluation. Society for Psychotherapy Research International Conference 2002, Santa Barbara CA, June 23–27, 2002 (abstract).
39. Murray DM. Design and analysis of group-randomized trials. New York: Oxford University Press, 1998.
40. Murray D, Varnell S, Blitstein J. Design and analysis of group-randomized trials: a review of recent methodological developments. Am J Public Health 2004;94:423–32.
41. Biglan A, Ary D, Wagenaar AC. The value of interrupted time-series experiments for community intervention research. Prev Sci 2000;1:31–49.
42. Shadish WR Jr, Cook TD, Leviton LC. Foundations of program evaluation: theories of practice. Newbury Park CA: Sage.
43. Blackburn H. Research and demonstration projects in community cardiovascular disease prevention. J Public Health Policy 1983;4:398–421.
44. Flay BR. Efficacy and effectiveness trials (and other phases of research) in the development of health promotion programs. Prev Med 1986;15:451–74.



45. U.K. Medical Research Council. A framework for development and evaluation of RCTs for complex interventions to improve health. Medical Research Council, April 2000. Available at: www.mrc.ac.uk/pdf-mrc_cpr.pdf.
46. Weiss CH, Bucuvalas MJ. Social science research and decision-making. New York: Columbia University Press, 1980.
47. Weiss CH, Bucuvalas MJ. Truth tests and utility tests: decision-makers’ frame of reference for social science research. In: Freeman HE, Solomon MA, eds. Evaluation studies review annual. Beverly Hills CA: Sage, 1981;6:695–706.
48. Green LW, Kreuter MW. Health program planning: an educational and ecological approach. 4th ed. New York: McGraw-Hill, 2005.
49. Green LW, Mercer SL. Can public health researchers and agencies reconcile the push from funding bodies and the pull from communities? Am J Public Health 2001;91:1926–9.
50. Israel BA, Eng E, Schulz AJ, Parker EA. Methods in community-based participatory research for health. San Francisco: Jossey-Bass Publishers, 2005.
51. Minkler M, Wallerstein N. Community-based participatory research for health. San Francisco: Jossey-Bass Publishers, 2003.
52. Van De Ven A, Johnson P. Knowledge for theory and practice. Acad Manag Rev 2006;31.
53. Wilson DB, Lipsey MW. The role of method in treatment effectiveness research: evidence from meta-analysis. Psychol Methods 2001;6:413–29.
54. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA 1996;276:637–9.
55. Kirkwood B. Making public health interventions more evidence based. BMJ 2004;328:965–6.
56. Rubin DB. Assigning to treatment group on the basis of a covariate. J Educ Stat 1977;2:1–26.
57. Cappelleri JC, Darlington RB, Trochim WMK. Power analysis of cutoff-based randomized clinical trials. Eval Rev 1994;18:141–52.
58. Rosen L, Manor O, Engelhard D, Zucker D. In defense of the randomized controlled trial for health promotion research. Am J Public Health 2006;96:18–24.


59. Angrist J, Imbens G, Rubin D. Identification of causal effects using instrumental variables, with discussion. J Am Stat Assoc 1996;91:444–72.
60. Centers for Disease Control and Prevention. Epidemic intelligence service. Available at: www.cdc.gov/eis.
61. Tucker JA, Roth DL. Extending the evidence hierarchy to enhance evidence-based practice for substance use disorders. Addiction 2006;101:918–32.
62. Gross CP, Mallory R, Heiat A, Krumholz HM. Reporting the recruitment process in clinical trials: who are these patients and how did they get there? Ann Intern Med 2002;137:10–6.
63. Moher D, Schulz KF, Altman DG, Lepage L. The CONSORT statement: revised recommendations for improving the quality of reports. JAMA 2001;285:1987–91.
64. Des Jarlais DC, Lyles C, Crepaz N, TREND Group. Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: the TREND statement. Am J Public Health 2004;94:361–6.
65. Dzewaltowski DA, Estabrooks PA, Klesges LM, Glasgow RE. TREND: an important step, but not enough. Am J Public Health 2004;94:1474.
66. Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996;17:1–12.
67. Jackson N, Waters E; Guidelines for Systematic Reviews of Health Promotion and Public Health Interventions Task Force. The challenges of systematically reviewing public health interventions. J Public Health 2004;26:303–7.
68. Agency for Healthcare Research and Quality. Evidence-based practice centers: synthesizing scientific evidence to improve quality and effectiveness in health care. Available at: www.ahrq.gov/clinic/epc.
69. U.S. Preventive Services Task Force. The guide to clinical preventive services 2005. Rockville MD: Agency for Healthcare Research and Quality, 2005.



Appendix A. Schematic Diagrams of Research Study Designs Discussed in This Article

Notation and schematic diagrams are reproduced from Shadish et al.1 and extended for designs not covered there. Schematic diagrams are for the basic designs; schematics of design enhancements can be found in Shadish et al.1

Key to Notation

C = units are assigned to conditions on the basis of a cutoff score
NR = nonrandom assignment to intervention and control/comparison groups; NR is placed at the front of each schematic diagram; however, it sometimes occurs before and sometimes after the pre-test
OA = preassignment measure of the assignment variable
ON = pre-test or post-test measures/observations (the subscript indexes the observation)
RE = random assignment at the individual or group level to (1) encouragement to undertake the intervention or to choose among a menu of intervention options, or (2) a control/comparison condition that is neither offered nor encouraged to participate in the intervention (they receive no intervention or usual services); RE is placed at the front of each schematic diagram but sometimes occurs before and sometimes after the pre-test
RG = random assignment at the group level to intervention and control/comparison conditions; RG is placed at the front of each schematic diagram but sometimes occurs before and sometimes after the pre-test
RI = random assignment at the level of the individual to intervention and control/comparison groups; RI is placed at the front of each schematic diagram but sometimes occurs before and sometimes after the pre-test
X = intervention
XC = an intervention with one or more components
XC+1 = adding an intervention component to the existing intervention components
XT = the entire multicomponent intervention
XT-1 = the entire multicomponent intervention minus one component
- - - - = a horizontal dashed line between groups indicates that they were not randomly formed

Randomized Controlled Designs: True Experimental Options

Traditional randomized controlled trial with individuals as the unit of R
RI   O   X   O
RI   O       O

Randomized encouragement trial
RE   O   X   O
RE   O       O

Staggered enrollment trial
RI or G   O   X   O       O
RI or G   O       O   X   O
OR
O   } RI or G   X   O       O
O   } RI or G       O   X   O

Group randomized trial
RG   O   X   O
RG   O       O

Nonrandomized Designs With or Without Control/Comparison Groups: Quasi-Experimental Designs

Pre–post design
Intervention group only
O1   X   O2
With a nonrandomized control/comparison group
NR   O1   X   O2
NR   O1        O2

Interrupted time series design
Intervention group only
O1   O2   O3   O4   O5   X   O6   O7   O8   O9   O10
With a nonrandomized control/comparison group
O1   O2   O3   O4   O5   X   O6   O7   O8   O9   O10
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
O1   O2   O3   O4   O5       O6   O7   O8   O9   O10



Multiple baseline design
O1   O2   O3   O4   O5   XT   O6   O7   O8   O9   O10   XT-1   O11   O12   etc.
OR
O1   O2   O3   O4   O5   XC   O6   O7   O8   O9   O10   XC+1   O11   O12   etc.

Regression discontinuity design
OA   C   X   O2
OA   C       O2

References

1. Shadish W, Cook T, Campbell D. Experimental and quasi-experimental designs. Boston: Houghton-Mifflin, 2002.

