Clinical Opinion
Are we making a mountain out of a mole hill? A call to appropriate interpretation of clinical trials and population-based studies

Suchitra Chandrasekaran, MD; Mary D. Sammel, ScD; Sindhu K. Srinivas, MD, MSCE
The volume of scientific articles being published continues to grow. Simultaneously, the ease with which both patients and providers can access scholarly articles is also increasing. One potential benefit of easy accessibility is broader awareness of new evidence. However, with a large volume of literature, it is often difficult to review each article thoroughly. Therefore, busy clinicians and inquisitive patients may often read the conclusion as their take-away message. In this article, we provide an overview of a few challenging study designs that are at increased risk for over- or understated conclusions, potentially leading to changes in clinical practice.

Key words: comparative effectiveness research, noninferiority trials, study design
Availability of high-quality medical literature is critical to providing excellent health care. The number of studies being published and the types of study designs being utilized have increased significantly in the past 10 years.1-3 As with any situation, the benefit of accessing numerous articles with the click of a mouse is balanced by the challenge of adeptly reviewing the data from each study presented. Every study design has specific advantages and limitations that have an impact on its conclusions. Therefore, when stating study conclusions, it is vital that authors accurately report the study findings without over- or understating their implications.4
From the Maternal and Child Health Research Program, Department of Obstetrics and Gynecology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA. Received Sept. 12, 2012; revised Nov. 14, 2012; accepted Nov. 20, 2012. The authors report no conflict of interest. Reprints: Suchitra Chandrasekaran, MD, The Hospital of the University of Pennsylvania, 3400 Spruce St., 2000 Courtyard, Philadelphia, PA 19102.
[email protected]. 0002-9378/$36.00 © 2013 Mosby, Inc. All rights reserved. http://dx.doi.org/10.1016/j.ajog.2012.11.029
For the practicing clinician, it can be challenging to thoroughly analyze multiple studies, determine applicability, and ultimately decide whether there is sufficient evidence to warrant a change in clinical practice. This prompts physicians to rely heavily on the authors' stated conclusions to assist in the interpretation of study findings. In addition, because patients can also access articles through the Internet, they may take stated conclusions at face value. This challenges practicing physicians to have a thorough understanding of the literature to appropriately address patient questions and concerns. Therefore, the onus of accurately reporting the study design, statistical analyses, and conclusions rests heavily on the investigators, peer reviewers, and editorial boards of journals, who must ensure that stated conclusions do not go beyond what the methods and results of a study can support.

Studies generally fall into 2 broad categories: analytic studies and descriptive studies. Study subtypes exist within these categories. Analytic studies include clinical trials, retrospective and prospective cohort studies, and case-control studies. Descriptive studies include analyses of secular trends (eg, ecological studies), case series, and case reports.
Certain study designs are more challenging to power, analyze, and interpret. In this clinical opinion, we discuss several of these designs and their respective analytical methods, with the aim of increasing awareness of the potential limitations and implications pertaining to each.
Changing times influence study designs

Historically, most studies were based on the concept of superiority, or establishing a difference between 2 or more exposure or treatment groups. In a superiority trial design, the null hypothesis states that the 2 exposures being studied are equal, and the alternative hypothesis states that the 2 exposures being studied are different.5 For example, when comparing intervention A with intervention B, the null hypothesis would be that there is no difference in the outcome between the 2 interventions, and the alternative would be that 1 intervention has a better outcome than the other. In this scenario, a type I (alpha) error occurs when the study incorrectly concludes that the interventions are different (erroneously rejects the null hypothesis). A type II (beta) error occurs when the study incorrectly concludes that the outcomes are similar between the interventions, that is, fails to conclude that the outcomes are different (erroneously rejects the alternate hypothesis).

In current practice, multiple interventions may already exist for various clinical scenarios. In such situations, it may be unethical to withhold treatment or use a placebo for comparison. This has led to a dramatic increase in the number of comparative effectiveness research studies (studies assessing a direct comparison of existing interventions). Instead of demonstrating superiority, these trials seek to conclude whether a new treatment is equivalent to, or at least no worse than, a second or reference treatment within a preset margin. These studies are referred to as equivalence or noninferiority designs.6
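Stated symbolically (a schematic restatement in our own notation, not the trialists': Δ denotes the true difference in outcome between intervention B and intervention A, δ > 0 is the prespecified noninferiority margin, and larger outcomes are assumed better):

```latex
% Superiority: is there any difference between A and B?
H_0:\ \Delta = 0 \qquad\text{vs}\qquad H_A:\ \Delta \neq 0

% Noninferiority: is B no worse than A by more than the margin \delta?
H_0:\ \Delta \le -\delta \ (\text{B inferior to A})
\qquad\text{vs}\qquad
H_A:\ \Delta > -\delta \ (\text{B not inferior to A})
```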
TABLE
Superiority vs noninferiority trials

Superiority
  H0 (null hypothesis): intervention A = B
  HA (alternate hypothesis): intervention A ≠ B
  Type I error (erroneous rejection of null hypothesis): erroneously reject the truth that A = B
  Type II error (erroneous rejection of alternate hypothesis): erroneously reject the truth that A ≠ B

Noninferiority (assuming A is the standard of care)
  H0 (null hypothesis): intervention B is inferior to A (B less than A)
  HA (alternate hypothesis): intervention B is (1) not inferior to A or (2) superior to A (B = A or greater)
  Type I error: erroneously reject the truth that B is less than A
  Type II error: erroneously reject the truth that B = A or greater

Chandrasekaran. A call to appropriate interpretation of clinical trials and population-based studies. Am J Obstet Gynecol 2013.
Compared with superiority trials, noninferiority trials are significantly more complex to design. In a noninferiority trial, the null hypothesis is that the new treatment is inferior to the standard of care; the null and alternative hypotheses are thus reversed compared with a superiority design.5 In other words, when comparing interventions A and B, and presuming intervention A is the standard of care, the null hypothesis would be that B is an inferior intervention compared with A. The alternate hypothesis would be that B is not inferior, that is, superior or equal, to A. This is completely the opposite of the superiority trial. Hence, the type I (alpha) error in this trial represents the erroneous rejection of the null hypothesis (B is worse than A), and the type II (beta) error is the erroneous rejection of the alternate hypothesis (B is not inferior, ie, superior or equal, to A) (Table). The practical implication of this reversal of type I and type II errors is generally different standards for acceptable errors, which results in the requirement of a much larger sample size to demonstrate that treatments or exposures are similar or that one is not inferior to the other. The other challenge in calculating sample size for a noninferiority trial is the choice of the clinically acceptable difference between the 2 groups being compared, otherwise known as the margin of noninferiority. This margin should be
smaller than the clinically relevant effect, with the aim of demonstrating similarity. Furthermore, some would argue that the true goal of a noninferiority trial is not only to demonstrate effects no worse than those of the reference treatment but also to indirectly infer superiority to placebo through comparison with an intervention already proven superior to placebo.6 Therefore, if outcomes with intervention A are better than placebo by 10%, then a new intervention B might need to be at least 5% better than placebo and thus no more than 5% worse than intervention A.5 In situations in which trials comparing placebo with currently accepted clinical regimens have been performed, this information can inform the process of setting the noninferiority margin. However, it is not unusual for current treatments or interventions to be considered standard of care because of years of clinical efficacy, without rigorous trials comparing placebo vs current treatment. This increases the difficulty of accurately setting a margin of noninferiority. In these situations, combining available objective data and expert opinion is necessary to set a clinically acceptable noninferiority margin.5

Although equivalence trials share similar methodology with noninferiority trials, the aim of an equivalence trial is to determine whether one intervention is truly equivalent to another for a particular outcome. Equivalence is often incorrectly inferred when the results are within a preset margin. To state equivalence, a much smaller margin or observed difference between the 2 groups is required, which results in larger sample size requirements than are necessary for
a noninferiority trial.5,6 Therefore, noninferiority does not imply equivalence and the 2 terms are not interchangeable.6
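In confidence interval terms (again a schematic in our notation: L and U are the lower and upper bounds of the confidence interval for the true difference Δ, and δ is the margin), the two claims require different things:

```latex
% Noninferiority of B: the CI for \Delta must lie entirely above -\delta
L(\Delta) > -\delta

% Equivalence of A and B: the CI must lie entirely inside (-\delta, +\delta)
-\delta < L(\Delta) \le U(\Delta) < +\delta
```

The second condition is strictly stronger, which is why an equivalence claim demands a tighter margin and a larger sample than a noninferiority claim.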
An obstetrics example of comparative effectiveness research: conclude with caution

A recent study by Khandelwal et al7 attempted to establish noninferiority between 12- and 24-hour dosing intervals of antenatal steroids. Although this study proposed to determine noninferiority, it is unclear whether the appropriate type I and type II errors were utilized to adequately power the study. For example, to accurately demonstrate noninferiority, a higher threshold for power is typically assumed than is used for a standard superiority study, most commonly 90% or higher. However, this study assumed a power of 80%. Furthermore, a margin of 20% was set to observe noninferiority for the primary outcome of respiratory distress syndrome (RDS) between the 2 strategies being compared. By published guidelines, that margin is overly generous for a noninferiority trial.5 Whenever possible, data from published meta-analyses or clinical judgment should be used to set a noninferiority margin that could be designated a minimally important difference. Therefore, given the chosen margin, achieved power, and sample size, it is difficult to conclude noninferiority with such a large allowable difference in the outcome between the 2 groups. With the noninferiority margin reduced to 10% and the power raised to 90%, a sample size of approximately 225 patients should have been randomized to each group; this is about 3 times larger than the study sample size of 161 and 67 patients in the 2 groups.
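A back-of-the-envelope check of that figure, using the standard normal-approximation formula for a noninferiority comparison of 2 proportions; the 15% baseline RDS rate and the one-sided alpha of 0.05 are our illustrative assumptions, not parameters reported by the trial:

```python
from math import ceil
from scipy.stats import norm

def ni_sample_size(p: float, margin: float, alpha: float = 0.05,
                   power: float = 0.90) -> int:
    """Per-group n for a noninferiority test of a risk difference,
    assuming the same true event rate p in both arms."""
    z_a = norm.ppf(1 - alpha)  # one-sided alpha
    z_b = norm.ppf(power)
    return ceil((z_a + z_b) ** 2 * 2 * p * (1 - p) / margin ** 2)

# Illustrative 15% RDS rate (our assumption): a 10% margin at 90% power
# needs roughly 220 patients per group, near the 225 quoted above...
print(ni_sample_size(0.15, margin=0.10, power=0.90))  # -> 219
# ...whereas the trial's 20% margin at 80% power needs far fewer.
print(ni_sample_size(0.15, margin=0.20, power=0.80))  # -> 40
```

The contrast between the two calls shows how sensitive the required sample size is to the margin: halving the margin roughly quadruples the per-group n.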
Despite these issues, the conclusions stated that 12-hour dosing "is equivalent to 24-hour dosing." As mentioned earlier, a noninferiority trial cannot conclude equivalence, and a noninferiority margin of 20% would allow neither a conclusion of noninferiority nor one of equivalence. In addition, this study was unable to definitively demonstrate no harm with this dosing interval. Inarguably, studying the dosing interval of antenatal corticosteroids inherently lends itself to a noninferiority study design. However, this study by Khandelwal et al7 illustrates the difficulties of accurately powering a study to demonstrate noninferiority and the potential clinical pitfalls in concluding no difference or equality between 2 treatments.

Another component of understanding noninferiority or equivalence involves reviewing the necessary statistical analysis. Superiority trials are analyzed based on the intention-to-treat principle. In this process, patients are grouped according to their initial randomization, regardless of whether they received the intervention or treatment. This analysis errs on the side of biasing findings toward no difference (the null hypothesis). In a noninferiority trial, however, an intention-to-treat analysis alone would not be clinically acceptable, because the hypothesis being tested seeks to demonstrate similarity or no difference, and a bias toward no difference therefore favors the study hypothesis. Whereas, generally, an as-treated analysis may bias the results toward the null hypothesis (ie, inferiority), it is possible that bias toward noninferiority can occur when deviation from treatment is related to treatment efficacy. It is therefore important to consider the impact of both approaches: intention-to-treat and as-treated analyses.8

A study analyzing induction vs expectant management of intrauterine growth restriction from the Disproportionate Intrauterine Growth Intervention Trial at Term (DIGITAT) data attempted to establish equivalence between these 2 clinical modalities.9 This study appropriately switched the alpha and beta errors in establishing power. However, the analysis was performed based only on intention-to-treat principles. Therefore,
their conclusion that no difference was noted between the induction of labor group and the expectant monitoring group could be untrue, because the intention-to-treat analysis in this study would bias the results toward an erroneous conclusion of no difference. This further demonstrates the complexity involved in accurately designing and executing noninferiority and equivalence trials to draw appropriate conclusions that are clinically meaningful and useful.
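This dilution effect is easy to see in a toy simulation (entirely hypothetical numbers, not DIGITAT data): arm B is truly worse, but one-fifth of patients assigned to B cross over to A, and the intention-to-treat comparison shrinks the observed gap toward "no difference":

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000  # per arm; large so the toy rates are stable

# Randomized assignment, with 20% of arm B crossing over to A.
assigned = np.repeat(["A", "B"], n)
crossover = (assigned == "B") & (rng.random(2 * n) < 0.20)
received = np.where(crossover, "A", assigned)

# True failure rates depend on the treatment actually received:
# 10% on A vs 16% on B, ie, B really is inferior.
event = rng.random(2 * n) < np.where(received == "A", 0.10, 0.16)

df = pd.DataFrame({"assigned": assigned, "received": received, "event": event})
print(df.groupby("assigned")["event"].mean())  # ITT: gap ~4.8%, diluted
print(df.groupby("received")["event"].mean())  # as-treated: gap ~6%, the truth
```

In a superiority trial this dilution is conservative; in a noninferiority or equivalence trial it works in favor of the hypothesis being tested, which is why both analyses should be examined.8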
Population-based studies: what can they conclude?

In addition to an increase in the number of papers addressing comparative effectiveness, another growing study design is the retrospective or prospective cohort utilizing population-based data. Over time, the creation of population-based datasets has increased significantly. These datasets are created at the international, national, and local levels. They contain years of data collection, with a multitude of data points from different centers in a region, ascertained through various means including discharge coding, national registries, and primary data collection. Population-based studies also sometimes rely on data collected for other purposes because such data are readily available for all or most members of the population being studied. When data have been collected for other purposes, they should be validated to determine accuracy.

Longitudinal population data offer the ability to study disease processes that have a low prevalence and/or incidence rate, as well as long-term clinical outcomes. Additionally, large population-based data allow for the evaluation of quality metrics. However, because of the methods of data ascertainment, the potential for misclassification, and incomplete data, there are potential issues with this type of data that must be acknowledged when reporting results and drawing conclusions.

Population-based cohort studies often follow a selectively defined population for longitudinal assessment of the exposure-outcome relationship. The main justification for conducting a population-based study is to study less common diseases and to increase the external validity
of potential conclusions.10 For example, the Young Finns Study was a large population-based study designed as a collaborative effort between 5 universities and other institutions in Finland.11,12 The study included 3596 children recruited in 1980 and followed up 27 years after recruitment. One paper from this study examined preconception cardiovascular risk factors and pregnancy outcomes such as hypertensive disorders, low birthweight, and gestational diabetes. Although there are definite strengths in having a large number of subjects to study a potential association, International Classification of Diseases, ninth revision (ICD-9) codes from birth registry data were used to identify patients with the complications of interest. In addition, hypertensive disorders were studied as a composite diagnosis, and preeclampsia was not divided by specific phenotype or severity. Although the authors discussed the validity of their population-based data and acknowledged that the diagnoses of preeclampsia and abnormal glucose tolerance were underreported in their registry, they did not mention the possible impact of these limitations. Hence, the final conclusion that high lipid levels before pregnancy predict an increased risk of preeclampsia and gestational diabetes could be true; however, given the potential for error and misclassification in using birth registries, ICD-9 codes, and underreported data, it may represent a type I error in which the alternate hypothesis was incorrectly accepted.
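A toy simulation (invented numbers, not the Young Finns data) shows how differential ascertainment alone can manufacture such a false-positive association: here there is no true effect, but coded diagnoses are more complete in exposed women:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# No true effect: preeclampsia risk is 5% regardless of exposure.
exposed = rng.random(n) < 0.5
true_pe = rng.random(n) < 0.05

# Differential ascertainment: suppose exposed (eg, high-lipid) women are
# screened more intensively, so their disease is coded more completely.
sensitivity = np.where(exposed, 0.9, 0.6)
coded_pe = true_pe & (rng.random(n) < sensitivity)

for label, pe in [("true diagnoses", true_pe), ("registry codes", coded_pe)]:
    rr = pe[exposed].mean() / pe[~exposed].mean()
    print(f"{label}: RR = {rr:.2f}")  # true ~1.00, coded ~1.50
```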
Another type of population-based study is the ecological analysis. These studies are useful in evaluating coincident trends and may serve to generate hypotheses. In this type of study, trends are assessed using population data, potentially from different datasets or at different times. These datasets often comprise birth certificate data or hospital discharge coding data at the state or national level. Therefore, although conclusions about trends can be inferred, causality often cannot be assessed.

An interesting ecological study by Zhang et al13 assessed the association between the rise in singleton preterm births and the impact of labor induction using recorded birth data. In this study, rates of preterm birth and induction were analyzed in birth registry data from 2 separate time frames, and both were found to have increased. Zhang et al13 explained that, in a previous study, they used an ecological design to decrease the confounding by clinical indication that occurs when analyzing individual-level associations between obstetric intervention and birth outcome. They noted a strong association between US state-level changes in rates of labor induction and a decline in gestational age and birthweight. They then assumed that differences in labor induction or cesarean delivery rates across states are likely driven by differences in regional or local medical practice style rather than by state-level differences. Based on this assumption, Zhang et al13 considered it unlikely that there was confounding by indication for labor induction or cesarean delivery at the state level and hence chose the ecological study design. However, they did not provide data to support this assumption. Therefore, this study could have produced biased results because individual data were not obtained, and this needs to be acknowledged. Furthermore, in this type of study design, it cannot automatically be concluded that labor induction caused preterm birth. This concept of ecological fallacy (true, true, and unrelated) needs to be acknowledged, and conclusions cannot suggest causality. However, it must be noted that ecologic studies can produce unbiased results if the within-group variability in the exposure of interest is low or absent and the between-group variability in the exposure is high.14
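The ecological fallacy is easy to reproduce in a few lines (a contrived example with made-up state-level numbers): induction has no effect on any individual's preterm risk, yet the state-level correlation between the 2 rates is nearly perfect because both track an underlying state characteristic:

```python
import numpy as np

rng = np.random.default_rng(2)
per_state_x, per_state_y = [], []

# 10 hypothetical states whose baseline risk profiles differ.
for base in np.linspace(0.05, 0.15, 10):
    n = 20_000
    induced = rng.random(n) < 2 * base  # induction rate tracks the profile...
    preterm = rng.random(n) < base      # ...but does not cause preterm birth
    per_state_x.append(induced)
    per_state_y.append(preterm)

x, y = np.concatenate(per_state_x), np.concatenate(per_state_y)
print(np.corrcoef(x, y)[0, 1])  # individual level: ~0.01

gx = [s.mean() for s in per_state_x]
gy = [s.mean() for s in per_state_y]
print(np.corrcoef(gx, gy)[0, 1])  # state level: ~0.99
```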
Finally, the case-control study design can also be used with population-based data. In these studies, the outcome of interest has already occurred, and subjects are recruited on the basis of their outcome status as a "case" (with the outcome) or a "control" (without the outcome). In this study design, the potential for recall bias or misclassification bias exists. Because the outcome of the study has already occurred, there is always the possibility of inaccurately recalling an exposure.15

A population-based case-control study from Hungary attempted to study the association between preeclampsia with or without superimposed chronic hypertension and the risk of congenital anomalies in the offspring.16 In this study, preeclampsia was diagnosed prospectively because it was documented that retrospective diagnoses were not reliable. The dataset included 22,843 cases with 38,151 age-matched controls. Additionally, as noted in the study, the diagnosis of hypertension prior to the pregnancy was not documented. Therefore, the possibility of misclassification of chronic hypertension and preeclampsia also exists. Although the conclusion of this study noted that preeclampsia was not associated with a higher risk of congenital abnormalities in the offspring and mentioned that these findings need confirmation in other studies, the conclusion should also have clarified the potential for misclassification bias. This allows readers to consider the potential limitations of such a study design and does not overstate the findings.
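To see why that caveat matters, consider a sketch (hypothetical rates, not the Hungarian registry's) of nondifferential exposure misclassification, which pulls an odds ratio toward 1 and can therefore make a real association look null:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

def odds_ratio(exposed: np.ndarray, case: np.ndarray) -> float:
    a = np.sum(exposed & case)    # exposed cases
    b = np.sum(exposed & ~case)   # exposed controls
    c = np.sum(~exposed & case)   # unexposed cases
    d = np.sum(~exposed & ~case)  # unexposed controls
    return (a * d) / (b * c)

# A true association: exposure doubles the odds of the anomaly.
exposed = rng.random(n) < 0.05
case = rng.random(n) < np.where(exposed, 0.02, 0.01)

# Nondifferential misclassification: 25% of exposure labels are wrong,
# equally among cases and controls (eg, chronic hypertension coded as
# preeclampsia and vice versa).
recorded = exposed ^ (rng.random(n) < 0.25)

print(f"true OR:     {odds_ratio(exposed, case):.2f}")   # ~2.0
print(f"recorded OR: {odds_ratio(recorded, case):.2f}")  # ~1.1, near null
```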
Call to action

To provide the highest-quality medical care, a strong and accurate evidence base is critical. It is well understood that all study designs have their individual strengths and weaknesses. Even with the most impeccably designed and executed study, limitations will exist. Therefore, the ever-increasing responsibility of investigators and reviewers is not only to perform well-designed studies but also to be cautious about the conclusions they state, avoiding over- or understatement of findings and interpreting study results accurately. This involves carefully choosing terminology, such as equivalence vs noninferiority, as demonstrated by comparative effectiveness studies. In large population-based studies, it includes qualifying statements about positive or negative associations in the conclusion with a description of the study's limitations. Finally, it also involves understanding that the level of evidence a study may be assigned based on its design does not necessarily reflect the accuracy of the study's conclusions. Hence, we challenge all investigators, reviewers, and journal readers to carefully assess all conclusion statements and to examine potential design issues before using new data
to influence clinical management and guidelines.

REFERENCES
1. Rahman M, Saito M, Fukui T. Articles with high-grade evidence: trend in the last decade. Contemp Clin Trials 2005;26:510-1.
2. Druss BG, Marcus SC. Growth and decentralization of the medical literature: implications for evidence-based medicine. J Med Libr Assoc 2005;93:499-501.
3. Lacroix EM, Mehnert R. The US National Library of Medicine in the 21st century: expanding collections, nontraditional formats, new audiences. Health Info Libr J 2002;19:126-32.
4. Guyatt GH, Rennie D, eds. Users' guides to the medical literature: a manual of evidence-based clinical practice. Chicago, IL: AMA Press; 2002.
5. Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJ; CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA 2006;295:1152-60.
6. Julious SA. The ABC of non-inferiority margin setting: an investigation of approaches. Pharm Stat 2011;10:448-53.
7. Khandelwal M, Chang E, Hansen C, Hunter K, Milcarek B. Betamethasone dosing interval: 12 or 24 hours apart? A randomized, noninferiority open trial. Am J Obstet Gynecol 2012;206:201.e1-11.
8. Gomberg-Maitland M, Frison L, Halperin JL. Active-control clinical trials to establish equivalence or noninferiority: methodological and statistical concepts linked to quality. Am Heart J 2003;146:398-403.
9. Boers KE, Vijgen SM, Bijlenga D, et al. Induction versus expectant monitoring for intrauterine growth restriction at term: randomised equivalence trial (DIGITAT). BMJ 2010;341:c7087.
10. Szklo M. Population-based cohort studies. Epidemiol Rev 1998;20:81-90.
11. Raitakari OT, Juonala M, Rönnemaa T, et al. Cohort profile: the cardiovascular risk in Young Finns Study. Int J Epidemiol 2008;37:1220-6.
12. Harville EW, Viikari JS, Raitakari OT. Preconception cardiovascular risk factors and pregnancy outcome. Epidemiology 2011;22:724-30.
13. Zhang X, Kramer M. The rise in singleton preterm births in the USA: the impact of labour induction. BJOG 2012;119:1309-15.
14. Webster TF. Bias magnification in ecologic studies: a methodological investigation. Environ Health 2007;6:17.
15. Grimes DA, Schulz KF. False alarms and pseudo-epidemics: the limitations of observational epidemiology. Obstet Gynecol 2012;120:920-7.
16. Bánhidy F, Szilasi M, Czeizel AE. Association of pre-eclampsia with or without superimposed chronic hypertension in pregnant women with the risk of congenital abnormalities in their offspring: a population-based case-control study. Eur J Obstet Gynecol Reprod Biol 2012;163:17-21.