Availability of Large-Scale Evidence on Specific Harms from Systematic Reviews of Randomized Trials

Panagiotis N. Papanikolaou, MD, John P. A. Ioannidis, MD

PURPOSE: To assess how frequently systematic reviews of randomized controlled trials convey large-scale evidence on specific, well-defined adverse events.

METHODS: We searched the Cochrane Database of Systematic Reviews for reviews containing quantitative data on specific, well-defined harms for at least 4000 randomized subjects, the minimum sample required for adequate power to detect an adverse event due to an intervention in 1% of subjects. Main outcome measures included the number of reviews with eligible large-scale data on adverse events, the number of ineligible reviews, and the magnitude of recorded harms (absolute risk, relative risk) based on large-scale evidence.

RESULTS: Of 1727 reviews, 138 included evidence on ≥4000 subjects. Only 25 (18%) had eligible data on adverse events, while 77 had no harms data, and 36 had data on harms that were nonspecific or pertained to <4000 subjects. Of 66 specific adverse events for which there were adequate data in the 25 eligible reviews, 25 showed statistically significant differences between comparison arms; most pertained to serious or severe adverse events and absolute risk differences <4%. In 29% (9/31) of a sample of large trials in reviews with poor reporting of harms, specific harms were presented adequately in the trial reports but were not included in the systematic reviews.

CONCLUSION: Systematic reviews can convey useful large-scale information on adverse events. Acknowledging the importance and difficulties of studying harms, reporting of adverse effects must be improved in both randomized trials and systematic reviews.

Am J Med. 2004;117:582–589. ©2004 by Elsevier Inc.
Randomized controlled trials are a potentially important source of information on adverse events associated with medical interventions (1). However, single randomized trials usually lack the power to study major but uncommon harms, such as those that occur in approximately 1% of treated patients but are improbable in the absence of treatment. Moreover, data from randomized trials tend to be poorly collected, analyzed, reported, and utilized for this purpose (2). To date, most information on adverse events has come from large observational studies (3). Evidence from randomized trials might become a more useful source of information on adverse events if meta-analyses of many trials or single large trials could be performed and adequate emphasis were provided for standardized collection and reporting of harms (4). Empirical evidence is needed on whether it is feasible to generate large-scale evidence on specific, well-defined harms using systematic reviews of randomized trials. Such meta-analyses might generate accurate information on not only common, but also relatively uncommon, risks of interventions. Here, we sought to assess how frequently evidence on specific, well-defined adverse events is available in a quantitative fashion from systematic reviews of large randomized trials, and whether common and uncommon harms may be evaluated with this approach.
From the Clinical Trials and Evidence-Based Medicine Unit (PNP, JPAI), Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece; and the Department of Medicine (JPAI), Tufts–New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts. Requests for reprints should be addressed to John P. A. Ioannidis, MD, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina 45110, Greece, or
[email protected]. Manuscript submitted January 20, 2004, and accepted in revised form April 15, 2004.
© 2004 by Elsevier Inc. All rights reserved.
0002-9343/04/$–see front matter doi:10.1016/j.amjmed.2004.04.026

Table 1. Systematic Reviews with Sample Size of at Least 4000 for at Least One Quantitative Comparison

| Characteristic | Number (%) |
| --- | --- |
| Type of disease or condition | |
| Infectious diseases* | 34 (24.6) |
| Perinatal conditions and pregnancy | 29 (21) |
| Cancer | 16 (11.6) |
| Cardiovascular diseases | 11 (8) |
| Cerebrovascular diseases | 10 (7.2) |
| Smoking | 8 (5.8) |
| Bone and joint diseases | 5 (3.6) |
| Mental health | 4 (2.9) |
| Other | 21 (15.2) |
| Interventions | |
| Vaccines or passive immunization | 10 (7.2) |
| Anti-infective agents | 20 (14.5) |
| Antiplatelet agents or anticoagulants | 10 (7.2) |
| Other drugs | 33 (23.9) |
| Surgical or other invasive procedures | 8 (5.8) |
| Noninvasive technology | 13 (9.4) |
| Behavioral/psychotherapy/counseling | 14 (10.1) |
| Combinations of the above | 6 (4.4) |
| Other | 24 (17.4) |

* Some of the infectious diseases-related topics may also pertain to other types of diseases or conditions of those listed.

METHODS

We selected systematic reviews of randomized controlled trials for which there were quantitative data on the occurrence of at least one very specific adverse event in each study and per study arm for ≥4000 subjects, and for which a formal meta-analysis of these data had been performed. The cutoff of 4000 subjects was decided a priori. With 4000 subjects assigned randomly and approximately equally to two intervention arms, there is about 80% power to detect an adverse event due to an intervention that occurs in about 1% of subjects but that is rare (0.25% or less) otherwise (α = 0.05). Eligible adverse events included all adverse events that had been clearly defined based on clinical or laboratory criteria and with explicit severity or seriousness grading that was consistent enough across studies to allow the inclusion of these data in a formal meta-analysis and the clinical use of the results. Thus, the following types of meta-analyses were excluded: analyses where adverse events represented composite counts of several adverse events from the same organ system or several organ systems, or according to some other merging approach, because it is not possible to know the risk of each adverse event and the composite adverse events may have very different clinical connotations; analyses of withdrawals due to adverse events and toxicity, unless these pertained only to a single type of adverse event and withdrawal was an explicit measure of severity or seriousness; analyses in which counts of otherwise specific adverse events had been recorded per arm across trials, but there was large diversity in the severity or seriousness threshold for counting an event across studies (e.g., one study considering only severe cases, while another considering any case, including mild ones); and analyses of nonrandomized studies. Whenever both randomized and observational data were included in the systematic reviews, we used only data from randomized trials. Quasi-randomized data (e.g., alternate allocation studies) were used.

Two independent investigators screened all the reviews in the Cochrane Database of Systematic Reviews (issue 3, 2003), except for the reviews classified as withdrawn. Pilot screening on 200 reviews was performed first. Issues and discrepancies arising during this phase were discussed to enhance clarity of definitions. In case of disagreements, consensus was reached through discussion. Reviews were categorized into three categories: reviews with <4000 subjects for any quantitative comparison; those with ≥4000 subjects for at least one quantitative comparison that provided some eligible safety data; and those with ≥4000 subjects for at least one quantitative comparison that did not provide any eligible safety data.
For the last category of review, we recorded the reasons for not meeting the eligibility criteria as follows: no separate quantitative data provided on any adverse events (specific or not); quantitative data provided only on adverse events that were not specific; quantitative data provided on specific adverse events but on fewer than 4000 subjects, despite the meta-analysis including at least 4000 subjects for efficacy outcomes; and quantitative data provided on nonspecific adverse events and on fewer than 4000 subjects, despite the meta-analysis including at least 4000 subjects for efficacy outcomes.

When systematic reviews did not contain quantitative data on harms, this may have been due to lack of interest or proper attention on the part of the systematic reviewers, or lack of reporting of such data in the randomized trials. Thus, for a sample of the systematic reviews that contained no eligible data on any harms, we examined whether the largest randomized trial in the review contained any quantitative data on any specific harms. We sampled all instances where the largest trial had at least 4000 subjects on its own (to include trials that would qualify for large-scale evidence even when considered alone) and had been published in a peer-reviewed journal with an impact factor of >1 per the Institute for Scientific Information, Journal Citation Reports (2002 rankings), so as to exclude trials in reports or journals that were difficult to find.
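The a priori cutoff of 4000 subjects rests on a power statement (about 80% power for a 1% event rate against a background of 0.25%, two arms of roughly 2000, α = 0.05). The text does not say which formula was used, so the following is only a rough check under an assumed normal approximation for comparing two proportions; the function names are illustrative and not taken from the study.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_proportion_power(p_control: float, p_treated: float,
                         n_per_arm: int, z_alpha: float = 1.959964) -> float:
    """Approximate power of a two-sided test comparing two proportions.

    Assumption: plain normal approximation without continuity correction;
    the negligible contribution of the opposite tail is ignored.
    """
    diff = abs(p_treated - p_control)
    p_bar = (p_control + p_treated) / 2.0
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(2.0 * p_bar * (1.0 - p_bar) / n_per_arm)
    se_alt = sqrt(p_control * (1.0 - p_control) / n_per_arm
                  + p_treated * (1.0 - p_treated) / n_per_arm)
    return normal_cdf((diff - z_alpha * se_null) / se_alt)


# 4000 subjects split equally; harm in 1% of treated vs 0.25% of controls.
power_4000 = two_proportion_power(p_control=0.0025, p_treated=0.01, n_per_arm=2000)
print(f"Approximate power with 4000 subjects: {power_4000:.2f}")
```

Under these assumptions the result falls somewhat above 0.8, which is in line with the "about 80%" stated in the text; a continuity-corrected or exact calculation would give a slightly lower figure.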
Data Analysis

For each meta-analysis that provided eligible information on adverse events, we recorded the interventions compared, the types of harms for which there was eligible information, the random-effects risk difference and relative risk (binary outcomes) or weighted mean difference (continuous outcomes) for the meta-analysis for each eligible harm (including 95% confidence intervals), as well as whether there was any statistically significant between-study heterogeneity based on the Q statistic (significant for P <0.10). Calculations were performed in RevMan (Cochrane Collaboration, Oxford, United Kingdom) and Meta-Analyst (Joseph Lau, Boston, Massachusetts). P values are two-tailed.

Table 2. Systematic Reviews With Data on Efficacy Outcomes for at Least 4000 Subjects but with No Eligible Data on Harms

| Reason for Lack of Eligibility | Number (%) |
| --- | --- |
| No separate quantitative data on any adverse event | 77 (68.1) |
| Quantitative data on ≥4000 subjects only on nonspecific adverse events | 17 (15.0) |
| - Composite counts of several different adverse events | 8 (7.1) |
| - Lack of grading | 2 (1.8) |
| - Reporting only aggregate withdrawals due to toxicity | 4 (3.5) |
| - Combination of reasons above | 3 (2.6) |
| Quantitative data on specific adverse events on <4000 subjects | 10 (8.9) |
| Quantitative data on nonspecific adverse events on <4000 subjects | 9 (8.0) |

October 15, 2004 THE AMERICAN JOURNAL OF MEDICINE® Volume 117
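The random-effects pooling and Q statistic described under Data Analysis can be illustrated with a small sketch. The study used RevMan and Meta-Analyst; the code below is not their implementation but an assumed DerSimonian-Laird estimator for the risk difference, and the trial counts are hypothetical numbers chosen only to make the example run.

```python
from math import sqrt


def pooled_risk_difference(trials):
    """Random-effects (DerSimonian-Laird, assumed) pooled risk difference.

    trials: list of (events_treated, n_treated, events_control, n_control).
    Assumption: every arm has at least one event and one non-event, so
    that no within-trial variance is zero.
    """
    effects, variances = [], []
    for a, n1, c, n0 in trials:
        p1, p0 = a / n1, c / n0
        effects.append(p1 - p0)
        variances.append(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(trials) - 1

    # Between-study variance (tau^2), truncated at zero.
    c_const = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c_const) if df > 0 and c_const > 0 else 0.0

    # Random-effects weights and pooled estimate with a 95% interval.
    w_re = [1.0 / (v + tau2) for v in variances]
    rd = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = sqrt(1.0 / sum(w_re))
    return {"rd": rd, "ci_low": rd - 1.96 * se, "ci_high": rd + 1.96 * se,
            "q": q, "df": df, "tau2": tau2}


# Hypothetical trials: (events treated, n treated, events control, n control).
example_trials = [(30, 1000, 12, 1000), (25, 1500, 10, 1500), (40, 1200, 20, 1180)]
result = pooled_risk_difference(example_trials)
print(f"Pooled risk difference: {100 * result['rd']:.2f}% "
      f"({100 * result['ci_low']:.2f}% to {100 * result['ci_high']:.2f}%)")
print(f"Q = {result['q']:.2f} on {result['df']} degrees of freedom")
```

To mirror the heterogeneity criterion in the text, Q would be compared with a chi-square distribution on k − 1 degrees of freedom at the 0.10 level; that step is left out here to keep the sketch free of external libraries.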
RESULTS

Of 1754 systematic reviews, 27 had been withdrawn and 1589 did not have at least 4000 randomized subjects. There were 138 systematic reviews with a sample size of ≥4000 for at least one quantitative comparison (Table 1). Of those, only 25 (18%) provided eligible data on specific harms. Among reviews with at least 4000 subjects but no eligible data on harms, most did not provide separate quantitative data on any adverse events (specific or not), but about a third provided some information on harms (Table 2).

We screened the 113 systematic reviews that did not present specific large-scale evidence on harms for the largest randomized trial that was included in each of them, and identified 31 trials with a sample of at least 4000 subjects that had been published in a journal with an impact factor >1. Among these 31 trials, nine (29%; 95% confidence interval: 14% to 48%) presented detailed enough data on specific harms that would qualify for our definition of large-scale evidence on specific harms, but they had nonetheless not been included in the systematic reviews. The systematic reviews included either no data on these harms (five trials) or nonspecific information (four trials) even though the original reports provided data on harms, including the level of severity or seriousness. Data from the randomized trial were not conveyed to the systematic review for topics on specific harms, such as heptavalent pneumococcal vaccine, digoxin toxicity, toxicity of calcium and aspirin when given for the prevention of preeclampsia, and adverse reactions to antihypertensive agents. Information that was transmitted lost its specific, well-defined quality in topics such as necrotizing enterocolitis (separating proven vs. suspected cases) in women given antibiotics for preterm labor with or without rupture of membranes, adverse effects of the cholera vaccine, and the use of salmeterol versus salbutamol in asthma patients.
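The interval quoted for the nine of 31 trials (29%; 14% to 48%) can be approximated with an exact binomial calculation. The text does not name the interval method, so the Clopper-Pearson approach below is an assumption, and the helper names are illustrative.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(func, lo: float = 0.0, hi: float = 1.0) -> float:
    """Root of an increasing function on [lo, hi], found by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) two-sided confidence interval for x/n."""
    # Lower limit: p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else _bisect_increasing(
        lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper limit: p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else _bisect_increasing(
        lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper


low, high = clopper_pearson(9, 31)
print(f"9/31 = {9 / 31:.0%}, exact 95% CI {low:.0%} to {high:.0%}")
```

If the exact method is indeed what was used, the printed limits should land close to the reported 14% and 48%; other methods (for example Wilson) would give slightly different bounds.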
Of the 25 eligible systematic reviews with large-scale evidence on harms, 18 pertained to drugs, one to vaccines, three to surgical procedures, and three to other nondrug interventions. Eligible large-scale evidence was available on a total of 66 specific adverse events (Table 3). The most common specific harms addressed were bleeding complications (n = 20), which included major extracranial hemorrhage, symptomatic intracranial hemorrhage, severe gastrointestinal hemorrhage, and major blood loss.
Statistically significant differences (P <0.05) were observed for 25 of the 66 specific adverse events (24 binary outcomes and 1 continuous outcome) for 21 comparisons of various interventions (Table 4). In six of the 25 examples, all the evidence was from a single trial. Among binary harm outcomes, 22 had statistically significant risk differences and relative risks, one was significant only in risk difference, and another was significant only in relative risk. Statistically significant between-study heterogeneity (P <0.10) was seen for five specific harms in both metrics, five others in risk difference only, and one other in relative risk only. With the exception of a few harms, such as pyrexia, blood pressure changes, dry mouth, and weight change, all adverse effects with statistically significant differences between the comparison arms were severe or serious (Table 4), and all observed risk differences for these serious or severe harms were ≤4%. For the majority of harms that were statistically significant (17/25), events were rare in control groups with crude total event rates of ≤1%. In five of the eight cases where events occurred in >1% of the control group, the absolute risk difference was ≥2%, while in the other three, the absolute excess in risk was <2%.
Table 3. Evaluated Interventions, and Types of Adverse Events for Which There Were Specific Data on at Least 4000 Subjects, in a Meta-analysis

| Study No. | Comparison in Meta-analysis | Adverse Event | Trials (n) | Sample (n) |
| --- | --- | --- | --- | --- |
| 1 | Acellular vs. whole cell vaccines for preventing whooping cough in children | Encephalopathy | 9 | 113,762 |
| 2 | Acellular vs. whole cell vaccines for preventing whooping cough in children | Convulsions | 15 | 124,387 |
| 3 | Acellular vs. whole cell vaccines for preventing whooping cough in children | Hypotonic hyporesponsiveness | 11 | 121,573 |
| 4 | Acellular vaccines vs. placebo/diphtheria-tetanus vaccine for preventing whooping cough in children | Encephalopathy | 2 | 18,650 |
| 5 | Acellular vaccines vs. placebo/diphtheria-tetanus vaccine for preventing whooping cough in children | Convulsions | 4 | 25,901 |
| 6 | Acellular vaccines vs. placebo/diphtheria-tetanus vaccine for preventing whooping cough in children | Hypotonic hyporesponsiveness | 4 | 25,901 |
| 7 | Active vs. expectant management in the third stage of labor | Diastolic blood pressure >100 mm Hg | 3 | 4636 |
| 8 | Mid-trimester amniocentesis vs. control for prenatal diagnosis | Neonatal respiratory distress syndrome | 1 | 4507 |
| 9 | Early vs. mid-trimester amniocentesis for prenatal diagnosis | Spontaneous miscarriage | 1 | 4334 |
| 10 | Aprotinin vs. control for minimizing perioperative allogeneic blood transfusion | Mortality | 28 | 4913 |
| 11 | Antifibrinolytic agents vs. control for minimizing perioperative allogeneic blood transfusion | Nonfatal myocardial infarction | 31 | 4239 |
| 12 | Antileukotriene agents vs. inhaled corticosteroids for recurrent or chronic asthma in adults and children | Death | 12 | 4912 |
| 13 | Anticholinergic drugs vs. placebo for overactive bladder syndrome in adults | Dry mouth | 20 | 4632 |
| 14 | Anticoagulants vs. control in acute presumed ischemic stroke | Major extracranial bleed | 16 | 22,049 |
| 15 | Anticoagulants vs. control in acute presumed ischemic stroke | Symptomatic intracranial bleed | 15 | 22,794 |
| 16 | Anticoagulants vs. antiplatelet agents for acute ischemic stroke | Major extracranial bleed | 6 | 11,721 |
| 17 | Anticoagulants vs. antiplatelet agents for acute ischemic stroke | Symptomatic intracranial bleed | 5 | 11,646 |
| 18 | Anticoagulants + antiplatelet agents vs. antiplatelet agents for acute ischemic stroke | Major extracranial bleed | 2 | 9720 |
| 19 | Anticoagulants + antiplatelet agents vs. antiplatelet agents for acute ischemic stroke | Symptomatic intracranial bleed | 2 | 9720 |
| 20 | Antiplatelet agents vs. control in acute ischemic stroke | Major extracranial bleed | 9 | 41,399 |
| 21 | Antiplatelet agents vs. control in acute ischemic stroke | Symptomatic intracranial bleed | 9 | 41,399 |
| 22 | Carotid endarterectomy vs. control for symptomatic carotid stenosis | 30-day stroke or death | 2 | 5841 |
| 23 | Carotid endarterectomy vs. control for symptomatic carotid stenosis (<70% stenosis) | 30-day stroke or death | 2 | 4656 |
| 24 | Norethindrone/ethinylestradiol vs. desogestrel/ethinylestradiol (at specific doses) | Weight change after 6 cycles | 1 | 5328 |
| 25 | Continuous electronic fetal heart rate monitoring vs. intermittent auscultation during labor | Cesarean delivery | 10 | 18,792 |
| 26 | Continuous electronic fetal heart rate monitoring vs. intermittent auscultation during labor | Operative vaginal delivery | 9 | 18,546 |
| 27 | Dipyridamole vs. control for preventing stroke and other vascular events in patients with vascular disease | Major extracranial bleed | 11 | 7647 |
| 28 | Low molecular weight heparins vs. unfractionated heparin for venous thromboembolism | Major bleeds | 14 | 4754 |
| 29 | Isoniazid vs. placebo for preventing tuberculosis in non–HIV-infected persons | Hepatitis | 1 | 20,874 |
| 30 | Isoniazid vs. placebo for preventing tuberculosis in non–HIV-infected persons | Hepatitis-related deaths | 2 | 25,714 |
| 31 | Isoniazid 6 months vs. isoniazid 12 months for preventing tuberculosis in non–HIV-infected persons | Hepatitis | 1 | 13,884 |
| 32 | Laparoscopic vs. open techniques for inguinal hernia repair | Vascular injury | 26 | 5256 |
| 33 | Laparoscopic vs. open techniques for inguinal hernia repair | Visceral injury | 22 | 4914 |
| 34 | Laparoscopic vs. open techniques for inguinal hernia repair | Mesh/deep infection | 23 | 4654 |
| 35 | Laparoscopic vs. open techniques for inguinal hernia repair | Wound/superficial infection | 29 | 5565 |
| 36 | Laparoscopic vs. open surgery for suspected appendicitis in adults | Wound infections | 34 | 4324 |
| 37 | Low molecular weight heparins vs. unfractionated heparin for acute coronary syndromes | Major bleeds | 6 | 11,022 |
| 38 | Magnesium sulfate vs. control for preeclampsia | Major neurologic problem | 2 | 10,677 |
| 39 | Magnesium sulfate vs. control for preeclampsia | Major respiratory problem | 2 | 10,677 |
| 40 | Periconceptional supplementation with folate or multivitamins vs. control to prevent neural tube defects | Miscarriage | 3 | 7600 |
| 41 | Periconceptional supplementation with folate or multivitamins vs. control to prevent neural tube defects | Ectopic pregnancy | 3 | 7600 |
| 42 | Periconceptional supplementation with folate or multivitamins vs. control to prevent neural tube defects | Stillbirth | 3 | 7600 |
| 43 | Periconceptional supplementation with folate or multivitamins vs. control to prevent neural tube defects | Multiple gestation | 3 | 6241 |
| 44 | Platelet glycoprotein IIb/IIIa blockers in any percutaneous coronary revascularization | 30-day major bleeding | 12 | 17,469 |
| 45 | Platelet glycoprotein IIb/IIIa blockers in percutaneous transluminal coronary angioplasty | 30-day major bleeding | 8 | 12,305 |
| 46 | Platelet glycoprotein IIb/IIIa blockers in stent | 30-day major bleeding | 4 | 5164 |
| 47 | Platelet glycoprotein IIb/IIIa blockers in unstable angina and non–ST-elevation myocardial infarction | 30-day major bleeding | 8 | 29,920 |
| 48 | Upright or lateral vs. supine position/lithotomy during second stage of labor | Blood loss >500 mL | 10 | 17,469 |
| 49 | Upright or lateral vs. supine position/lithotomy during second stage of labor | Second-degree perineal tears | 10 | 12,305 |
| 50 | Oral misoprostol vs. injectable uterotonics to prevent postpartum hemorrhage | Severe shivering | 3 | 19,038 |
| 51 | Oral misoprostol vs. injectable uterotonics to prevent postpartum hemorrhage | Pyrexia (≥38°C) | 4 | 21,059 |
| 52 | Rofecoxib vs. naproxen for rheumatoid arthritis | Fatal infarction/sudden death | 1 | 8076 |
| 53 | Rofecoxib vs. naproxen for rheumatoid arthritis | Nonfatal myocardial infarction | 1 | 8076 |
| 54 | Rofecoxib vs. naproxen for rheumatoid arthritis | Unstable angina | 1 | 8076 |
| 55 | Rofecoxib vs. naproxen for rheumatoid arthritis | Transient ischemic attack | 1 | 8076 |
| 56 | Rofecoxib vs. naproxen for rheumatoid arthritis | Ischemic stroke | 1 | 8076 |
| 57 | Rofecoxib vs. naproxen for rheumatoid arthritis | Peripheral thrombotic | 1 | 8076 |
| 58 | Clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe neutropenia | 1 | 19,185 |
| 59 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe thrombocytopenia | 1 | 19,185 |
| 60 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe gastrointestinal hemorrhage | 2 | 19,247 |
| 61 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe extracranial hemorrhage | 1 | 19,185 |
| 62 | Clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe skin rash | 1 | 19,185 |
| 63 | Clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe diarrhea | 1 | 19,185 |
| 64 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Severe indigestion/nausea/vomiting | 1 | 19,185 |
| 65 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (high-risk patients) | Symptomatic intracranial bleed | 3 | 22,316 |
| 66 | Ticlopidine or clopidogrel vs. aspirin to prevent stroke/serious vascular events (in transient ischemic attack or stroke) | Symptomatic intracranial bleed | 2 | 9500 |

HIV = human immunodeficiency virus.

DISCUSSION

Systematic reviews of randomized controlled trials rarely provide large-scale evidence on specific, well-defined adverse events associated with the tested interventions. Only 25 systematic reviews in the entire Cochrane Database of Systematic Reviews contained such evidence on at least one type of harm. More than three quarters of systematic reviews with at least 4000 subjects in randomized trials lacked such information. The Cochrane Library is known for the high quality of its reviews (5,6) and comprehensive approach to outcomes, so it seems unlikely that other systematic reviews would fare better.

Since adverse events are poorly reported in randomized trials (2,7–11), systematic reviews and meta-analyses may provide inadequate information concerning safety data, or they may even abandon all effort to present or synthesize information (12) if the data are of very poor quality or highly heterogeneous. Indeed, when appropriate information on harms was missing from the systematic reviews, this information often was also missing or not presented appropriately in the reports of the largest randomized trials that were included in these reviews. We also identified several instances where the largest trials provided appropriate specific data on harms that were not conveyed appropriately in the systematic reviews. Thus, the lack of information reflects both the poor quality of safety data presented in primary trial reports (2,7–11) and the further loss of important information in the
generation of systematic reviews of evidence from randomized trials. Finally, the quantitative synthesis of harms based on composite counts, different grades, and different types of toxicity across studies may yield even more misleading results that are impossible to interpret for clinical decision making.

In this report we only looked at adverse events from randomized trials. We did not use observational studies, which are sometimes also considered in the Cochrane systematic reviews, even though they provide useful data on harms. We focused on randomized trials because we wanted to determine if in principle large-scale evidence on harms could be provided by evidence from randomized trials and the subsequent systematic reviews. Our cutoff of 4000 subjects was also arbitrary, but it was set a priori to ensure that there was sufficient evidence to target adverse events for which the absolute risk conferred by an intervention was small. A sample of 4000 subjects would still provide suboptimal power for detecting an excess frequency of 1% if the adverse event also occurred in a certain number of control patients. Thus, 4000 subjects is likely a conservative threshold.

The likelihood of many severe or serious adverse events is quite low. While other adverse events are common even among control subjects, the absolute increase in risk conferred by an intervention is small. In both of these situations, it has been traditionally accepted that risks cannot be addressed by randomized trials owing to limitations in statistical power (3). Theoretically, these limitations may be overcome if there is randomization of sufficiently large numbers of subjects, which our analysis demonstrates is feasible. In the majority of eligible systematic reviews, the absolute risk difference between the treatment arms was relatively small, but several examples were identified showing notable increases in adverse events. A small absolute risk is still clinically important if an adverse event is serious or severe, or if the absolute benefit of the intervention is also small. What constitutes a sufficiently large sample varies by study, and even several thousand subjects might not suffice for meaningful evidence on harms in certain instances. The risks of hormone replacement therapy, such as venous thrombosis and breast cancer, have been quantified in larger trials such as the Women’s Health Initiative (which has about 16,000 participants) (13), larger observational cohorts such as the Million Women Study (14), and case-control studies from larger background samples such as the U.K. General Practice Data Base (15).

Most of the specific harms with eligible data that we identified (41 of 66) did not reach statistical significance, but this does not necessarily mean that there was no difference between the treatment arms. With 4000 subjects, power would be negligible in detecting very small excesses of major adverse events (e.g., aplastic anemia occurring in 1 of 10,000 patients taking chloramphenicol). Even relatively frequent harms with a longer latency period (e.g., of more than 2 or 3 years) cannot be quantified easily by randomized controlled trials or meta-analyses, since it is rare to have sufficient length of follow-up of sufficient numbers in several trials. Thus, observational data should still be considered alongside evidence from randomized trials.

One other limitation is that only adverse events that are thought of in a similar way by different investigators during the planning phase of the trials can be investigated reliably by subsequent meta-analysis. For example, several of the 25 systematic reviews that provided good evidence on adverse events involved anticoagulation or antiplatelet treatment. Investigators were aware of the risk of major bleeding complications and reported them in a similar fashion. Unanticipated harms that are detected during a trial might be reported in markedly different ways by different investigators. Harms that become apparent after marketing or by clinical observation might remain buried in trial reports and be difficult to uncover even in retrospect.

The idea that randomized controlled trials might yield better data on adverse events, or even detect harms, was first proposed by Skegg and Doll in 1977, who advocated the recording of all medical events that occur in trial participants (16). This practice is still not performed efficiently enough. Data on harms should be better collected, recorded, and reported, and made accessible for quantitative synthesis (17). Despite their limitations, meta-analyses of randomized trials can be made more useful for quantifying known and well-specified harms by giving proper attention to the reporting of harms in randomized controlled trials and systematic reviews.

Table 4. Eligible Systematic Reviews with Statistically Significant Differences between the Comparison Arms (P <0.05)*

| Study No. | Adverse Event | Crude Rate (Control) | % Risk Difference (95% Confidence Interval) | Relative Risk (95% Confidence Interval) | Heterogeneity in Effect Metrics |
| --- | --- | --- | --- | --- | --- |
| 2 | Convulsions with whole cell vs. acellular pertussis vaccines | 0.05% | 0.04 (0.01 to 0.07) | 2.1 (1.4 to 3.2) | Neither |
| 3 | Hypotonic hyporesponsiveness with whole cell vs. acellular pertussis vaccines | 0.08% | 0.05 (0.01 to 0.09) | 3.8 (1.2 to 12) | Relative risk |
| 7 | Diastolic blood pressure >100 mm Hg with active vs. expectant management in third stage of labor | 0.38% | 0.94 (0.46 to 1.43) | 3.8 (1.1 to 13) | Neither |
| 8 | Neonatal respiratory distress syndrome with mid-trimester amniocentesis | 0.53% | 0.59 (0.06 to 1.12) | 2.1 (1.1 to 4.2) | (1 study only) |
| 9 | Spontaneous miscarriage with early vs. mid-trimester amniocentesis | 0.79% | 1.8 (1.0 to 2.5) | 3.2 (1.9 to 5.5) | (1 study only) |
| 13 | Dry mouth with anticholinergic drugs | 15% | 21 (16 to 26) | 2.4 (1.7 to 3.3) | Both |
| 14 | Major extracranial bleed with anticoagulants | 0.38% | 0.91 (0.67 to 1.16) | 3.3 (2.4 to 4.7) | Neither |
| 15 | Symptomatic intracranial bleed with anticoagulants | 0.48% | 0.80 (0.56 to 1.04) | 2.6 (2.0 to 3.6) | Neither |
| 17 | Symptomatic intracranial bleed with anticoagulants vs. antiplatelet agents | 0.56% | 0.68 (0.14 to 1.22) | 2.1 (1.3 to 3.5) | Risk difference |
| 18 | Major extracranial bleed with anticoagulants + antiplatelet agents vs. antiplatelet agents | 0.47% | 1.3 (−0.5 to 3.1)† | 3.2 (1.1 to 9.6) | Both |
| 20 | Major extracranial bleed with antiplatelet agents | 0.56% | 0.37 (0.20 to 0.53) | 1.7 (1.3 to 2.1) | Neither |
| 24 | Weight change (cycle 6) with norethindrone/ethinylestradiol vs. desogestrel/ethinylestradiol | NP | NP | NP | (1 study only) |
| 25 | Cesarean delivery with continuous electronic fetal heart rate monitoring vs. intermittent auscultation | 3.5% | 3.9 (1.7 to 6.1) | 1.6 (1.3 to 2.1) | Both |
| 26 | Operative vaginal delivery with continuous fetal heart rate monitoring vs. intermittent auscultation | 10% | 2.4 (0.3 to 4.5) | 1.1 (1.0 to 1.3)† | Both |
| 29 | Hepatitis with isoniazid | 0.10% | 0.45 (0.31 to 0.60) | 5.5 (2.6 to 12) | (1 study only) |
| 35 | Wound/superficial infection with open vs. laparoscopic techniques | 1.5% | 0.85 (0.07 to 1.6) | 1.8 (1.1 to 2.9) | Risk difference |
| 36 | Wound infections with open vs. laparoscopic surgery | 3.9% | 3.0 (1.7 to 4.3) | 1.8 (1.4 to 2.3) | Neither |
| 39 | Major respiratory problem with magnesium sulfate | 0.49% | 0.44 (0.15 to 0.74) | 2.0 (1.2 to 3.2) | Neither |
| 45 | 30-day major bleeding with platelet glycoprotein IIb/IIIa blockers | 4.4% | 1.8 (0.03 to 3.6) | 1.4 (1.0 to 1.8) | Both |
| 47 | 30-day major bleeding with platelet glycoprotein IIb/IIIa blockers | 3.6% | 0.48 (0.16 to 0.79) | 1.4 (1.1 to 1.9) | Risk difference |
| 48 | Blood loss >500 mL with upright or lateral vs. supine position/lithotomy | 3.8% | 2.5 (0.2 to 4.8) | 1.6 (1.1 to 2.4) | Both |
| 50 | Severe shivering with oral misoprostol vs. injectable uterotonics | 0.23% | 4.5 (0.3 to 8.7) | 7.0 (4.5 to 11) | Risk difference |
| 51 | Pyrexia (≥38°C) with oral misoprostol vs. injectable uterotonics | 0.95% | 8.3 (4.6 to 12) | 6.3 (4.2 to 9.6) | Risk difference |
| 53 | Nonfatal myocardial infarction with rofecoxib vs. naproxen | 0.10% | 0.35 (0.12 to 0.57) | 4.5 (1.5 to 13) | (1 study only) |
| 62 | Severe skin rash with clopidogrel vs. aspirin | 0.10% | 0.16 (0.04 to 0.28) | 2.5 (1.2 to 5.2) | (1 study only) |

* All comparisons are phrased so that the arm with the excess harms is stated first. The study numbers correspond to those in Table 3. “Crude rate (control)” describes the proportion of subjects with the adverse event in the control arm (the arm with the lower event rate), obtained by pooling data across trials. “Heterogeneity” describes in which metric statistically significant between-study heterogeneity was observed (P <0.10 by the Q statistic). “Neither” indicates that there was no significant between-study heterogeneity for risk difference and relative risk. “Both” indicates that there was significant between-study heterogeneity for both metrics.
† P ≥0.05.
NP = not pertinent (continuous measure outcome; weighted mean difference, 0.26 kg).
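The Discussion's remark that 4000 subjects give negligible power for very rare harms (such as an event in 1 of 10,000 treated patients) can be made concrete with a back-of-the-envelope calculation. The sketch below assumes 2000 subjects in the treated arm and independent patients with a constant risk; the function name is illustrative.

```python
def prob_at_least_one_event(risk: float, n_exposed: int) -> float:
    """Probability that at least one of n_exposed patients has the event.

    Assumption: each patient independently has the same event risk.
    """
    return 1.0 - (1.0 - risk) ** n_exposed


# A harm occurring in 1 of 10,000 treated patients, 2000 patients treated.
p_any = prob_at_least_one_event(risk=1 / 10_000, n_exposed=2000)
print(f"Chance of seeing even one event: {p_any:.1%}")
```

Under these assumptions the chance of observing even a single case is roughly 18%, so most reviews of this size would record no events at all in the treated arm, let alone a statistically significant excess.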
ACKNOWLEDGMENT
We are thankful to Professor Jan P. Vandenbroucke, Department of Clinical Epidemiology, University of Leiden Medical School, Leiden, The Netherlands, for discussing our study protocol and for critical reading and suggestions for the discussion in the manuscript.
REFERENCES
1. Ioannidis JP, Lau J. Improving safety reporting from randomised trials. Drug Saf. 2002;25:77–84.
2. Ioannidis JP, Lau J. Completeness of safety reporting in randomized trials: an evaluation of 7 medical areas. JAMA. 2001;285:437–443.
3. Jick H. The discovery of drug-induced illness. N Engl J Med. 1977;296:481–485.
4. Berlin JA, Colditz GA. The role of meta-analysis in the regulatory process for foods, drugs, and devices. JAMA. 1999;281:830–834.
5. Jadad AR, Moher M, Browman GP, et al. Systematic reviews and meta-analyses on treatment of asthma: critical evaluation. BMJ. 2000;320:537–540.
6. Jadad AR, Cook DJ, Jones A, et al. Methodology and reports of systematic reviews and meta-analyses: a comparison of Cochrane reviews with articles published in paper-based journals. JAMA. 1998;280:278–280.
7. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal, antiinflammatory drugs in rheumatoid arthritis [published correction in Control Clin Trials. 1989;10:356]. Control Clin Trials. 1989;10:31–56.
8. Loke YK, Derry S. Reporting of adverse drug reactions in randomised controlled trials - a systematic survey. BMC Clin Pharmacol. 2001;1:3.
9. Edwards JE, McQuay HJ, Moore RA, Collins SL. Reporting of adverse effects in clinical trials should be improved: lessons from acute postoperative pain. J Pain Symptom Manage. 1999;18:427–437.
10. Hayashi K, Walker AM. Japanese and American reports of randomized trials: differences in the reporting of adverse effects. Control Clin Trials. 1996;7:99–110.
11. Martin RCG, Brennan MF, Jaques DP. Quality of complication reporting in the surgical literature. Ann Surg. 2002;235:803–812.
12. Ernst E, Pittler MH. Systematic reviews neglect safety issues. Arch Intern Med. 2001;161:125–126.
13. Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA. 2002;288:321–333.
14. Million Women Study Collaborators. Breast cancer and hormone-replacement therapy in the Million Women Study. Lancet. 2003;362:419–427.
15. Jick H, Derby LE, Myers MW, et al. Risk of hospital admission for idiopathic venous thromboembolism among users of postmenopausal oestrogens. Lancet. 1996;348:981–983.
16. Skegg DC, Doll R. The case for recording events in clinical trials. BMJ. 1977;2:1523–1524.
17. McPherson K, Hemminki E. Synthesising licensing data to assess drug safety. BMJ. 2004;328:518–520.