Assessing the Quality of Randomized Trials: Reliability of the Jadad Scale
Heather D. Clark, MD, CM, FRCPC, George A. Wells, PhD, Charlotte Huët, MD, Finlay A. McAlister, MD, FRCPC, L. Rachid Salmi, MD, PhD, Dean Fergusson, MHA, and Andreas Laupacis, MD, MSc, FRCPC
The Department of Medicine, University of Ottawa (H.D.C., G.A.W., F.A.M., A.L.), The Clinical Epidemiology Unit, Loeb Health Research Institute, Ottawa, Ontario (G.A.W., F.A.M., D.F., A.L.), Canada; and the Université Victor Segalen Bordeaux 2, Bordeaux (C.H., L.R.S.), France
ABSTRACT: Jadad et al. developed and validated an instrument to assess the quality of clinical trials, using studies from the pain literature. Our study determined the reliability of the Jadad scale and the effect of blinding on interrater agreement in another group of primary studies. Four raters independently assessed blinded and unblinded versions of 76 randomized trials. Interrater agreement was calculated among combinations of the four raters for blinded and unblinded versions of the studies. A 4 × 2 × 2 repeated measures design was employed to evaluate the effect of blinding. The interrater agreement for the Jadad scale was poor (kappa 0.37 to 0.39), but agreement improved substantially (kappa 0.53 to 0.59) with removal of the third item (an explanation of withdrawals). Blinding did not significantly affect the Jadad scale scores. A more precise description of how to score the withdrawal item and careful conduct of a practice set of articles might improve interrater agreement. In contrast with the conclusions reached by Jadad, we were unable to demonstrate a significant effect of blinding on the quality scores.
Control Clin Trials 1999;20:448–452. Elsevier Science Inc. 1999
KEY WORDS: Randomized trials, quality, methodology, meta-analysis, interrater agreement
Address reprint requests to: Dr. Heather D. Clark, 405-737 Parkdale Ave, Ottawa, Ontario, Canada K1Y 1J8. E-mail: [email protected]
Received March 25, 1998; accepted June 8, 1999.
Controlled Clinical Trials 20:448–452 (1999). Elsevier Science Inc. 1999, 655 Avenue of the Americas, New York, NY 10010. 0197-2456/99/$–see front matter. PII S0197-2456(99)00026-4
INTRODUCTION
Meta-analyses of randomized clinical trials are being published with increasing frequency [1, 2], resulting in great interest in assessing the quality of randomized clinical trials included in meta-analyses [3–10]. Numerous scales and checklists have been suggested to evaluate the quality of randomized clinical trials [4]. However, a 5-item scale developed by Jadad et al. is the only known scale developed with standard scale development techniques [3]. Although the
scale was developed and validated to assess the quality of reports of pain relief, it has been used extensively in other clinical areas because it is efficient to use. Jadad reported that the scores were lower and more consistent if quality assessment was blinded [3]. The current study examines the impact of blinding on the quality score, as well as the reliability of assessments by multiple raters, in publications from a different field than those evaluated by Jadad.
METHODS
Articles describing 76 individual randomized trials were obtained from the International Study of Perioperative Transfusion (ISPOT) investigators, who were conducting meta-analyses of technologies that reduce perioperative allogeneic blood transfusion during elective surgery [11, 12]. The articles were presented to reviewers in blinded and unblinded form (original references available from the author). Blinding was achieved by masking the names and affiliations of the authors, the journal, the date of publication, the source of funding, and the acknowledgments of the primary studies.
Four reviewers (FM, HC, CH, LRS), none of whom was involved in the original meta-analyses, performed the quality assessments using the Jadad scale. The reviewers were grouped into two pairs by country of residence (FM and HC from Canada, CH and LRS from France). A distinct set of 10 studies was evaluated as a practice set. The two pairs of reviewers then met separately to discuss the interpretation of the Jadad scale and to resolve any differences in scoring within the pair. All four reviewers discussed difficulties in the interpretation of the Jadad scale on a telephone conference call before proceeding with the study.
A 4 × 2 × 2 repeated measures design was used to analyze the effect of blinding and country of reviewer. The 76 articles were randomly allocated into four groups. The articles were reviewed during two time periods, 2 months apart. The articles were evaluated in two forms, blinded and unblinded.
Each reviewer reviewed half of the articles in the first time period and the other half in the second time period. Each pair of reviewers evaluated every article, with one member of the pair reviewing the article blinded and the other reviewing it unblinded. For example, if one member of the pair evaluated the article in the first period in a blinded fashion, then the other member of the pair evaluated it in an unblinded fashion in the second period. With this design, no rater reviewed any article more than once. This design was chosen because a single reviewer would most likely recall the rating they had given to a study, even after a time interval.
To assess interrater agreement, the kappa statistic for multiple categories was used for both two and four raters [13]. Interrater agreement for two raters was calculated separately for the blinded and the unblinded articles; for four raters, interrater agreement was calculated ignoring the presence or absence of blinding.
RESULTS
For the practice set of 10 articles, the interrater agreement within the two pairs of raters was a kappa of 1.00 for HC and FM and a kappa of 0.51 for LRS and CH.
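As an illustration of how a two-rater kappa over several score categories can be computed, the following is a minimal sketch. It assumes the standard unweighted Cohen's kappa; the article cites Fleiss [13] but does not spell out the formula, so this may differ in detail from the authors' calculation. The function name `cohen_kappa` and the example scores are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters and any number of categories.

    Assumption: this is the textbook formula, not necessarily the exact
    variant used in the article.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of articles on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical Jadad totals given by two raters to the same 10 articles.
scores_rater_1 = [0, 1, 1, 2, 2, 2, 3, 3, 1, 0]
scores_rater_2 = [0, 1, 2, 2, 2, 1, 3, 3, 1, 1]
kappa = cohen_kappa(scores_rater_1, scores_rater_2)
print(round(kappa, 2))
```

In this made-up example the raters agree on 7 of 10 articles, but part of that agreement is expected by chance, so kappa is lower than the raw agreement.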
The practice set of articles ranged in quality on the Jadad scale from 0 to 3 (maximum score of 5), with an average quality of 1.5. The pair of reviewers from France met to discuss the reasons for disagreement in the practice set and reached a consensus about their scoring differences. A conference call involving all four reviewers after the practice set, but before initiation of the actual study, identified the item about withdrawals as the most subjective component of the Jadad scale.
The reports on the 76 trials ranged in quality on the Jadad scale from 0 to 5, with an average quality of 2.64. The average quality of the articles rated blinded and unblinded was 2.66 and 2.61, respectively. The average quality of the articles by reviewer was as follows: FM 2.83; HC 2.74; CH 2.60; LRS 2.36.
For the 4 × 2 × 2 repeated measures design, the interaction effects were significant (p < 0.001) for the between factor (article group) and the two within factors (country and blinding). Because of the interactions, we were not able to assess the main effects. Interaction effects were examined graphically; no obvious systematic trend for change in quality score was demonstrated for blinding and country, and the interaction consistently appeared for each group of studies (data not shown, but available from the corresponding author on request).
Because of the interaction effect, the data were reanalyzed in two different ways. First, the withdrawal item was omitted from the scale, as this item contributed the majority of the discrepancy between raters in the practice set. The quality scale now ranged from 0 to 4, and the reports on the 76 articles now had an average quality of 2.18. For the 4 × 2 × 2 repeated measures design, the interaction term was not significant (p = 0.16) when the withdrawal item was excluded from the Jadad scale.
The main effects evaluated were country of reviewer and blinding, neither of which demonstrated a statistically significant effect (p = 0.90 for country of reviewer and p = 0.81 for blinding). The mean and 95% confidence interval (CI) for the blinded and unblinded studies, irrespective of country of reviewer, were 2.20 ± 0.17 and 2.18 ± 0.17, respectively. The mean and 95% CI for the reviewers from France and Canada were 2.18 ± 0.17 and 2.18 ± 0.17, respectively.
The second method of analysis was to exclude the French reviewers. The Canadian reviewers had a perfect kappa result on the practice set. For the reduced 4 × 2 repeated measures design, the interaction effect for the between factor (article group) and the within factor (blinding) was not significant. Because the interaction effect was not significant, we were able to assess the main effect, that is, blinding for the complete Jadad scale, which again was not significant (p = 0.92).
The interrater agreement calculated for the total Jadad scale ranged from 0.37 to 0.39, depending on two or four raters, blinded or unblinded (Table 1). The results improved with omission of the withdrawal item (kappa of 0.53 to 0.59). The kappa for each individual item of the Jadad scale varied dramatically, as demonstrated in Table 1. It was a condition for inclusion in the meta-analysis that all trials described in the articles were randomized. Also, the majority of the trials were double-blind. Because nearly all articles therefore fell into a single category for these items, chance-expected agreement was high, and a low kappa was obtained even with high percent agreement.
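The combination of high percent agreement and low kappa can be made concrete with a short worked calculation. The counts below are hypothetical and are not taken from the study: suppose two raters score a yes/no item on 100 articles, both answer "yes" for 93 articles, only rater A answers "yes" for 4, only rater B answers "yes" for 3, and neither answers "no" together. Rater A then says "yes" for 97 articles and rater B for 96. Assuming the usual two-rater kappa, with observed agreement p_o and chance-expected agreement p_e:

```latex
\begin{align*}
p_o &= \frac{93 + 0}{100} = 0.93 \\
p_e &= (0.97)(0.96) + (0.03)(0.04) = 0.9324 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.93 - 0.9324}{1 - 0.9324} \approx -0.04
\end{align*}
```

Here the raters agree on 93% of the articles, yet kappa is slightly negative, because almost all of that agreement would be expected by chance when one category dominates.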
Table 1 Interrater Agreement: kappa (SE)
                           2 Raters,       2 Raters,       4 Raters, Blinded   4 Raters,
Items                      Blinded         Unblinded       and Unblinded       Percent
                           kappa (SE)      kappa (SE)      kappa (SE)          Agreement
Total Score
  Total (5 Items)          0.39 (0.05)     0.37 (0.05)     0.37 (0.05)         24
  Adjusted (4 Items)       0.53 (0.08)     0.59 (0.8)      0.55 (0.07)         50
Individual Items
  Randomization            −0.01 (0.10)    0.65 (0.09)     0.35 (0.10)         93
  Randomization (bonus)    0.63 (0.09)     0.54 (0.09)     0.53 (0.09)         71
  Double-blind             0.86 (0.10)     0.86 (0.10)     −0.03 (0.10)        87
  Double-blind (bonus)     0.64 (0.10)     0.62 (0.09)     0.51 (0.08)         66
  Withdrawal               0.43 (0.09)     0.36 (0.08)     −0.09 (0.10)        38
SE, Standard error.
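For the four-rater column, a multi-rater kappa is needed. The sketch below assumes the standard Fleiss' kappa for a fixed number of raters per article; the article cites Fleiss [13] without giving the formula, so this is an assumption and not a reproduction of the authors' computation. The function name `fleiss_kappa` and the example ratings are hypothetical.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per article.

    ratings: list of per-article lists, one category label per rater.
    categories: list of all possible category labels.
    Assumption: textbook Fleiss' kappa, not necessarily the article's variant.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every article needs the same number (>= 2) of ratings")
    # counts[i][j] = number of raters who put article i in category j.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Per-article agreement: proportion of rater pairs that agree.
    per_article = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    mean_agreement = sum(per_article) / n_subjects
    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    chance = sum(p * p for p in shares)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (mean_agreement - chance) / (1.0 - chance)


# Hypothetical yes/no (1/0) scores from four raters on five articles.
item_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
kappa_four = fleiss_kappa(item_scores, categories=[0, 1])
print(round(kappa_four, 2))
```

The same function could be applied to total scores by passing the score values 0 to 5 as the categories.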
DISCUSSION
Our study suggests that there may be considerable interrater variability in the results of the Jadad scale as it is currently used. We did not carry out a second practice set of articles before proceeding with the study to establish that our agreement had improved after discussion of the issues. This could have contributed to the low kappa, but it reflects how most research groups use a quality assessment scale. When using the Jadad scale, it may be important to ensure that good agreement is achieved before the scale is applied.
Unlike in Jadad's original article, a consistent effect of blinding on the mean score was not detected. However, our ability to detect such an effect was impeded by an interaction between article group, blinding, and country of reviewer. When the data were reanalyzed eliminating the withdrawal item or using only the Canadian reviewers, there was still no significant effect of blinding on the Jadad score.
The design of our study was complicated, but it had the advantage of using multiple raters over time with no duplicate rating of individual studies by the same reviewer. Our reviewers were experienced in research methods, but had minimal experience in formal quality assessments of reports of randomized trials. Their amount of experience is likely to be similar to that of many individuals using the Jadad scale.
In summary, in this study the interrater reliability of the Jadad scale was relatively low, due largely to the item about withdrawal. We were unable to demonstrate that blinding had a significant effect upon the quality score. Given the discrepancy between this study and the original report, more studies with different raters and different reports of randomized trials are still needed to evaluate the reliability of the Jadad scale.
This study was partially supported by the First International Fellowship of the International Society of Technology Assessment in Health Care, funded by the Private Patients Plan (PPP) Medical Trust, PPP Medical Institute, UK, which was awarded to Andreas Laupacis. Andreas Laupacis is a Scientist of the Medical Research Council of Canada. Finlay McAlister is a recipient of a Health Research Fellowship funded by the Medical Research Council of Canada.
REFERENCES
1. Felson DT. Bias in meta-analytic research. J Clin Epidemiol 1992;45:885–892.
2. Moher D, Olkin I. Meta-analysis of randomized controlled trials. A concern for standards. JAMA 1995;274:1962–1964.
3. Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Control Clin Trials 1996;17:1–12.
4. Moher D, Jadad AR, Nichol G, et al. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 1995;16:62–73.
5. Naylor CD. Two cheers for meta-analysis: problems and opportunities in aggregating results of clinical trials. CMAJ 1988;138:891–895.
6. Sacks HS, Berrier J, Reitman D, et al. Meta-analyses of randomized controlled trials. N Engl J Med 1987;316:450–455.
7. Moher D, Jadad AR, Tugwell P. Assessing the quality of randomized controlled trials. Current issues and future directions. Int J Technol Assess Health Care 1996;12:195–208.
8. Khan KS, Daya S, Jadad A. The importance of quality of primary studies in producing unbiased systematic reviews. Arch Intern Med 1996;156:661–666.
9. Assendelft WJ, Koes BW, Knipschild PG, Bouter LM. The relationship between methodological quality and conclusions in reviews of spinal manipulation. JAMA 1995;274:1942–1948.
10. Detsky AS, Naylor CD, O'Rourke K, et al. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol 1992;45:255–265.
11. Laupacis A, Fergusson D, for the International Study of Perioperative Transfusion (ISPOT) Group. Drugs to minimize perioperative blood loss in cardiac surgery: Meta-analyses using perioperative blood transfusion as the outcome. Anesth Analg 1997;85:1258–1267.
12. Laupacis A, Fergusson D, for the International Study of Perioperative Transfusion (ISPOT) Group. Erythropoietin to minimize peri-operative blood transfusion. Four meta-analyses of randomized trials. Transfus Med 1998;8:309–317.
13. Fleiss JL. The measurement of interrater agreement. In: Statistical methods for rates and proportions. New York: Wiley; 1982:212–236.