Interobserver Variability in Applying a Radiographic Definition for ARDS*

Gordon D. Rubenfeld, MD, MSc; Ellen Caldwell, MS; John Granton, MD, FCCP; Leonard D. Hudson, MD, FCCP; and Michael A. Matthay, MD, FCCP†
Context: Acute lung injury (ALI) and ARDS are currently defined by the American-European Consensus Conference (AECC) definition criteria, which contain a radiographic criterion. The accuracy and reliability of this consensus radiographic definition have not been evaluated, and no radiographic definition of ALI-ARDS has been evaluated by a large international group of experts.

Objective: To study the interobserver variability in applying the AECC radiographic criterion for ALI-ARDS.

Design: Survey.

Participants: A convenience sample of 21 experts selected from participants attending the 1997 Toronto Mechanical Ventilation Workshop and from members of the National Institutes of Health ARDS Network.

Outcome measures: Participants reviewed 28 randomly selected chest radiographs from critically ill, hypoxemic (Pao2/fraction of inspired oxygen ratio < 300) patients and decided whether each radiograph fulfilled the AECC definition for ALI-ARDS.

Results: Interobserver agreement in applying the AECC definition for ALI-ARDS was moderate (k = 0.55; 95% confidence interval, 0.52 to 0.57). Thirteen radiographs (43%) showed nearly complete agreement (defined as 20 or 21 readers in agreement). Nine radiographs (32%) had five or more dissenting readers. The percentage of radiographs interpreted as consistent with ALI-ARDS by individual readers ranged from 36 to 71%. Participants commented that mild infiltrates, pleural effusions, atelectasis, isolated lower lobe involvement, radiographic technique, and overlying monitoring equipment posed the most difficulties.

Conclusions: The radiographic criterion used in the current AECC definition for ALI-ARDS showed high interobserver variability when applied by expert investigators in the fields of mechanical ventilation and ARDS. This variability may result in differences in ALI-ARDS populations at different clinical research centers and may make it difficult for clinicians to apply the results of clinical trials to their patients. Modifications to the radiographic criterion or annotated reference radiographs may improve the reliability of future definitions for ALI-ARDS. (CHEST 1999; 116:1347-1353)

Key words: ARDS; chest radiography; interobserver variability; lung injury

Abbreviations: AECC = American-European Consensus Conference; ALI = acute lung injury; Fio2 = fraction of inspired oxygen; NIH = National Institutes of Health
*From the Harborview Medical Center and the Division of Pulmonary and Critical Care Medicine (Drs. Rubenfeld and Hudson and Ms. Caldwell), University of Washington, Seattle, WA; the Toronto Hospital and the Critical Care Medicine Program (Dr. Granton), University of Toronto, Toronto, Ontario, Canada; and the Cardiovascular Research Institute (Dr. Matthay), University of California, San Francisco, CA.
†See the Appendix for a complete list of participants who read chest radiographs in the study.
Supported by NIH grants SCOR HL96014 (Drs. Rubenfeld and Hudson and Ms. Caldwell) and RO1 HL51856 (Dr. Matthay).
Manuscript received January 11, 1999; accepted May 5, 1999.
Correspondence to: Gordon D. Rubenfeld, MD, MSc, Pulmonary & Critical Care Medicine, Harborview Medical Center, Box 359762, 325 9th Ave, Seattle, WA 98104; e-mail: nodrog@u.washington.edu

Since the original description of ARDS in 1967, the chest radiograph has been an essential part of its definition.1 In the initial report, the chest radiograph was described as presenting "patchy, bilateral alveolar infiltrates."1 Since then, a variety of radiographic criteria have been described in the definition of acute lung injury (ALI)-ARDS, including "reduction in the longitudinal pulmonary diameter,"2 "interstitial and alveolar edema,"3 and at least two radiographic scoring methods.4,5 Critics have hypothesized that the variability in reports of the incidence, risk factors, and outcomes of ARDS was due in part to poorly characterized definitions and to heterogeneous patient populations.6,7 To address this problem and to set standard definitions, an American-European Consensus Conference (AECC) was convened in 1992.8 In their report, members of the conference defined ALI as the acute onset of arterial hypoxemia (Pao2/fraction of inspired oxygen [Fio2] ratio ≤ 300), a pulmonary artery wedge pressure ≤ 18 mm Hg or no clinical evidence of left atrial hypertension, and bilateral infiltrates consistent with pulmonary edema on frontal chest radiograph. The authors specifically noted that the pulmonary infiltrates could be mild. ARDS is defined by the same criteria as ALI, but with more severe hypoxemia (Pao2/Fio2 ≤ 200). No radiographic distinction was made between ALI and ARDS; therefore, for this report, we will refer to a common entity, ALI-ARDS.

Interobserver variability in the interpretation of diagnostic radiographs has been reported by a number of investigators. Evaluations of mammograms, ventilation-perfusion scans, and chest radiographs in cases of pneumonia and pneumoconiosis have demonstrated poor agreement between readers.9-11 We hypothesized that the AECC radiographic definition was not specific enough to lead readers to a reliable and reproducible interpretation of chest radiographs, and that, in applying the definition, readers would show a wide range of individual thresholds for determining that infiltrates were consistent with pulmonary edema and, consequently, high interobserver variability.
Materials and Methods

The study design was a survey of volunteers recruited from participants at the Toronto Mechanical Ventilation Workshop held in November 1997 and from the National Institutes of Health (NIH) ARDS Network. Chest radiographs were obtained from three sources. Two institutions that prospectively identify hypoxemic critically ill patients for the presence of ALI-ARDS (University of Washington, Harborview Medical Center, Seattle, WA; and University of California, Moffitt-Long Hospital, San Francisco, CA) contributed a random selection of chest radiographs from intubated patients with a Pao2/Fio2 < 300. In addition, randomly selected chest radiographs from patients enrolled in a recently completed study of mechanical ventilation in patients at risk for developing ARDS were also contributed.12 Radiographs could have come from any time during the patient's course of intubation as long as the patient met the hypoxemia threshold on the day of the radiograph. In selecting the radiographs, the investigators did not know which patients had received diagnoses of ALI-ARDS, cardiogenic pulmonary edema, pneumonia, or any other condition at the contributing center; therefore, this information could not affect radiograph selection. All were routine, anteroposterior, portable radiographs taken for clinical use, and no attempt was made to standardize the technique. The radiographs from Seattle and San Francisco were 10 × 14-inch digitized computed radiographs, and the Toronto radiographs were 14 × 17-inch standard analog radiographs. Patient age, institution, and other identifying information were obscured by tape.

Eighteen participants read the radiographs at the Toronto meeting, and 3 others received the series of radiographs by mail. All readers who volunteered to participate in the study were included. Identical instructions, provided to each reader, stated: "All radiographs were taken from intubated patients with Pao2/Fio2 < 300. Does this chest radiograph fulfill the AECC definition for ALI and ALI-ARDS, 'bilateral infiltrates consistent with pulmonary edema'? Note that the
American-European Consensus Conference definition specifically included mild and patchy infiltrates." No clinical history or additional information was provided, and no time constraint was placed on the readers. Responses reported as "positive" indicated that the chest radiograph fulfilled the definition of the AECC. The readers were asked to indicate aspects of the radiographs that made the definition difficult to apply to a specific radiograph.

Data were analyzed to determine the percentage of readers who interpreted each radiograph as positive and the percentage of radiographs read as positive or negative by each reader, and to measure interobserver variability (k-statistic).13 The approximate normal test was used to test for statistical significance between k-statistic values. All analyses were performed on a computer (IBM-PC; Danbury, CT) using statistical software (SAS; SAS Institute; Cary, NC).
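For readers who wish to reproduce this type of analysis, the following is a minimal sketch of the multirater agreement statistic used above (Fleiss' extension of the k-statistic, following reference 13) together with an approximate normal test for comparing two k values. It is an illustration, not the authors' SAS code: the function names and example counts are hypothetical, and the standard errors for the comparison are assumed to be supplied by the kappa estimation.

```python
import numpy as np
from math import sqrt, erf

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for counts[subjects, categories], where each row holds
    how many raters assigned that subject to each category."""
    n, _ = counts.shape
    m = counts[0].sum()                       # raters per subject (constant)
    p_j = counts.sum(axis=0) / (n * m)        # overall category proportions
    p_i = ((counts ** 2).sum(axis=1) - m) / (m * (m - 1))  # per-subject agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

def compare_kappas(k1: float, se1: float, k2: float, se2: float):
    """Approximate normal (z) test for the difference of two kappa values,
    given their large-sample standard errors."""
    z = (k1 - k2) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p value
    return z, p

# Toy example: 28 radiographs, 21 readers, 2 categories (positive, negative).
# Each row is (No. of positive reads, No. of negative reads) for one radiograph.
counts = np.array([[21, 0]] * 8 + [[1, 20]] * 5 + [[15, 6]] * 15)
print(round(fleiss_kappa(counts), 2))
```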
Results

The 21 participants had a median length of experience in critical care practice of 12 years (range, 3 to 28 years). Seventeen were from North America, and 4 were from Europe or South America. Twenty participants (95%) have lectured on mechanical ventilation or lung injury, 19 (90%) are currently engaged in clinical research in lung injury, and all have coauthored research papers on mechanical ventilation or lung injury (see "Appendix" section). Seven of the readers are investigators in the NIH ARDS Clinical Trial Network.

Twenty-eight chest radiographs (10 from Seattle, 4 from San Francisco, and 14 from Toronto) were evaluated. The percentages of readers who scored each radiograph as fulfilling the radiographic definition "bilateral infiltrates consistent with pulmonary edema" are listed in Table 1. The number of readers who agreed on the interpretation of chest radiographs varied. Thirteen radiographs (8 read as consistent with ALI-ARDS and 5 as negative for ALI-ARDS) showed nearly complete agreement (0 or 1 dissenting reader). Nine radiographs had five or more dissenting readers. The k-statistic for interobserver agreement was 0.55 (95% confidence interval, 0.52 to 0.57). There was no statistically significant difference in agreement between the 7 NIH ARDS Clinical Trial Network readers and the 14 other readers. To evaluate whether digital imaging had an effect on agreement, we compared the k-statistic values for the two radiographic techniques. Agreement on the analog radiographs was superior to that on the digital radiographs (k-statistic, 0.72 vs 0.38, respectively; p < 0.0001).

The readers showed different thresholds for concluding that a chest radiograph was positive. The reader least likely to call a radiograph positive interpreted 36% of the radiographs as positive, while the reader most likely to call a radiograph positive interpreted 71% as positive (Table 2).
Table 1—Chest Radiograph Interpretation Sorted by Majority Interpretation*

  Interpretation    Agreement, %
  Positive          100
  Positive          100
  Positive          100
  Positive          100
  Negative          100
  Negative          100
  Negative          100
  Positive           95
  Positive           95
  Positive           95
  Positive           95
  Negative           95
  Negative           95
  Positive           90
  Positive           86
  Positive           86
  Negative           86
  Negative           86
  Negative           81
  Positive           76
  Positive           76
  Positive           76
  Negative           71
  Positive           67
  Negative           67
  Negative           57
  Positive           52
  Positive           52

*Positive = majority of readers interpreted the chest radiograph as showing bilateral infiltrates consistent with pulmonary edema; Negative = majority of readers interpreted the chest radiograph as not showing bilateral infiltrates consistent with pulmonary edema. Agreement, % = percentage of the 21 readers concurring with the majority interpretation for each of the 28 chest radiographs in the study. Radiographs with agreement above 90% were interpreted with nearly complete agreement (0 or 1 dissenting reader).
Positive chest radiographs that were interpreted with the greatest agreement had dense alveolar consolidation in all four lung quadrants (Fig 1). Radiographs that were interpreted with high variability had opacities that some readers interpreted as atelectasis and others interpreted as infiltrates, bilateral lower lobe infiltrates, small lung volumes, or overlying monitoring equipment that obscured the pulmonary parenchyma (Fig 2). Pleural effusions also accounted for disagreement when some readers identified infiltrates "behind" the effusion and others did not (Fig 3). Radiographic features that might be consistent with mild pulmonary edema, such as increased interstitial markings, indistinct vessels, and blurring of the hilar structures, were frequently cited by the readers as problematic (Fig 4).

[Figure 1. Chest radiograph with 100% agreement: consistent with the AECC radiographic criterion.]
[Figure 2. Chest radiograph with 71% agreement; the majority interpretation was inconsistent with ALI-ARDS. Readers commented on low lung volumes and atelectasis complicating interpretation.]
[Figure 3. Chest radiograph with 86% agreement; the majority interpretation was inconsistent with ALI-ARDS. Readers commented on the presence of a right pleural effusion.]
[Figure 4. Chest radiograph with 52% agreement; the majority interpretation was consistent with ALI-ARDS. Readers commented on mild interstitial infiltrates.]
Table 2—Responses for Each of the 21 Readers Sorted by Percentage of Radiographs Interpreted as Positive (Consistent With AECC Radiographic Definition)

  Reader No.    Positive, %
  22            36
  3             43
  4             43
  6             46
  13            46
  15            46
  19            46
  11            57
  20            57
  1             57
  14            61
  7             61
  16            61
  18            64
  21            64
  5             68
  23            68
  24            68
  8             71
  17            71
  2             71
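The summaries in Tables 1 and 2 can be computed directly from the raw responses. The following is a minimal sketch, assuming the responses are available as a binary reader-by-radiograph matrix; the randomly generated matrix here is a stand-in for the study data, and all variable names are illustrative.

```python
import numpy as np

# Hypothetical data: 21 readers x 28 radiographs, 1 = read as positive.
rng = np.random.default_rng(seed=1)
reads = rng.integers(0, 2, size=(21, 28))

# Table 2: percentage of radiographs each reader interpreted as positive.
reader_positive_pct = 100 * reads.mean(axis=1)

# Table 1: majority interpretation and percent agreement per radiograph.
frac_positive = reads.mean(axis=0)        # fraction of positive reads
majority_positive = frac_positive >= 0.5  # with 21 readers, >= 11 positive reads
agreement_pct = 100 * np.where(majority_positive, frac_positive, 1 - frac_positive)

# Readers disagreeing with the majority, per radiograph.
dissenters = np.rint(21 * (1 - agreement_pct / 100))
print((dissenters >= 5).sum(), "radiographs with five or more dissenting readers")
```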
Discussion

Experts in the field of ALI do not agree when they apply the current consensus radiographic definition for ALI-ARDS. A k-statistic value of 0.55 indicates only moderate agreement.13 k-statistic values in this range have raised concerns in the interpretation of mammograms, ventilation-perfusion scans, and chest radiographs in community-acquired pneumonia.10,14,15 There was full agreement on less than half the radiographs. Chest radiographs that were interpreted consistently as positive were obtained from patients with dense alveolar infiltrates in four lung quadrants. Infiltrates limited to lower lung zones, atelectasis, small lung volumes, mild involvement, pleural effusions, and overlying monitoring devices were all identified as contributing to the high variability of radiograph interpretations. There was a twofold difference in the positive radiograph rate between the reader least likely to call a radiograph positive (36%) and the reader most likely to call a radiograph positive (71%). This wide range in the percentage of radiographs read as positive was not due to isolated outliers, but reflected a continuous distribution across all readers (Table 2). These data provide empiric evidence for the concerns that have been raised regarding the reproducibility of the ALI-ARDS definition across institutions.6,7 If the variability in radiographic interpretation causes variability in the clinical diagnosis of ALI-ARDS, this finding may account for some of the
geographic and institutional variation in the incidence, risk factors, resource use, and outcome of ALI-ARDS. Only one other published study has evaluated
interobserver variability in ARDS chest radiographs.16 These investigators found excellent agreement between two radiologists and poor agreement between four nonradiologist clinicians in calculating a radiographic score in a series of chest radiographs from patients who had ARDS diagnoses. Ours is the first study to explore the reliability of an accepted consensus conference radiographic definition (as opposed to a scoring system) in a set of chest radiographs. In addition, the radiographs in our study
were selected only on the basis of intubation and hypoxemia, therefore representing a broad range of radiographs, unlike the earlier study, in which radiographs were selected from patients who had already received diagnoses of ARDS. An additional strength of our study was the group of 21 international experts in ALI-ARDS and mechanical ventilation who were the participating readers. The previous study used readers from a single institution who interpreted the radiographs together. Our study sample may therefore be more representative of interobserver variability. Finally, we report feedback from the readers on specific aspects of the radiographs that led to variability in interpretation; definitions can be modified based on these findings.

There was more agreement on the analog radiographs than on those taken with computed radiography. There are several possible explanations for this unexpected observation. All of the analog radiographs were selected from patients who were actually enrolled in a clinical trial, reflecting, therefore, only a subset of patients considered for inclusion. Thus, difficult radiographs may be underrepresented. If this is true, then our k-statistic estimate of 0.55 overestimates the agreement that would be seen in a sample of radiographs evaluated in ALI-ARDS screening. It is possible that the digital radiographs were of poorer quality than the analog radiographs or that the smaller size of the digital radiographs made them more difficult to interpret. Participants may have lacked experience in evaluating digital radiographs. However, the seven readers from the ARDS Network had reviewed digital radiographs from patients with ALI-ARDS at several planning sessions, and their agreement, as a group, on radiograph interpretations was no different from that of the other readers. It is not possible to distinguish among these possibilities without studying the interpretation of digital and analog radiographs of the same patients.

There are several potential limitations to this study. We did not study the readers' accuracy in diagnosing ALI-ARDS using the entire AECC definition. This would have required a presentation of clinical data including onset, history, physical examination, and laboratory tests in the form of a vignette. Had we used vignettes with this information, the resulting agreement on the diagnosis of ALI-ARDS would have reflected our skills, or lack thereof, in abstracting clinical information and writing vignettes. Because we were interested specifically in the readers' agreement with each other in applying the radiographic definition, and because there is no "gold standard" with which to compare the readers' decisions, accuracy was not particularly relevant. The question was not whether the readers were right or wrong in some objective sense, but whether they agreed with each other in applying a standard radiographic definition.

Because any diagnostic test may perform poorly in a specific spectrum of cases, it is possible that the poor agreement in this study reflects the sample of radiographs.17 We tried to simulate the broad range of chest radiographs that would be encountered in screening patients for ALI-ARDS for enrollment in a clinical trial by using a sample of patients who were critically ill, intubated, and met the oxygenation criterion for ALI-ARDS. It is possible that our sample size of 21 readers and 28 radiographs was too small to estimate the true k-statistic value. This uncertainty is reflected in the confidence intervals around the k-statistic value, which exclude excellent agreement. The k-statistic is also affected by the prevalence of positive readings in the sample. If, for example, the average reader had read the radiographs as 90% positive or 90% negative for ALI-ARDS, the k-statistic might have appeared low when, in fact, considerable agreement existed among readers (the sketch at the end of this section illustrates this effect). However, this limitation does not apply to our study, because the wide variability among readers led to a broad range of positive readings, and the average prevalence was nearly optimal at 54% (Table 2).

Some aspects of the chest radiograph presentation process may have contributed to the level of observed agreement. Serial radiographs were not available for review as they might be in clinical practice, and such review could have improved apparent agreement on radiographs with pleural effusions or with overlying monitoring devices. In addition, we chose to study the readings of pulmonary and critical care physician experts rather than radiologists. We consider it a unique opportunity to have studied their performance; however, it is possible that a group of radiologists would have interpreted the chest radiographs with less variability. Because the diagnosis of ALI-ARDS and the decision to enroll patients in clinical trials are evaluations frequently made by a clinician at the bedside, we believe the study participants were a valid choice to address the research question. We did not proctor the readers when they interpreted the radiographs. However, any collaborative reading would only have biased the study toward finding a higher level of agreement than actually exists.

To reduce interobserver variability, future panels charged with revising the definition for ALI-ARDS should consider the issues raised by this study. An annotated set of training radiographs clarifying the interpretation of the difficult radiographic patterns identified by our readers might be a useful adjunct to a written definition. The effect of analog vs digital technique on agreement needs to be further evaluated. Modifications to the definition may also improve agreement. For example, a "negative" definition that specifies which radiographic patterns are inconsistent with ALI-ARDS may lead to greater agreement than the current version. Specific instructions to interpret radiographs strictly by the definition, even if this results in positive radiographs that the reader might personally consider negative for ALI-ARDS, may facilitate consistent readings. It is interesting to note that the ARDS Network investigators, who have read chest radiographs together as part of clinical trial planning, interpreted radiographs no more consistently as a group than the other participants. Group reading exercises may be insufficient to ensure agreement, and a modified definition or example radiographs may be necessary. Finally, it is important that definitions proposed by consensus panels be empirically evaluated for interobserver agreement by the readers who will be using them.

This study has important implications for consensus panels charged with defining critical care syndromes in general and for the interpretation of clinical trials. Sepsis syndrome, multiple organ dysfunction syndrome, and ARDS are diagnosed on the basis of operational definitions proposed by experts.8,18,19 These definitions frequently make clinical sense and therefore seem valid. However, they are rarely subjected to empiric testing to evaluate their reliability. As we have shown, interobserver variability may be high, particularly with regard to radiographic or clinical features that are difficult to standardize among clinicians. Because the accuracy of critical care syndrome definitions cannot be verified, as accepted "gold standard" diagnostic tests do not exist, it is particularly important that future critical care syndrome definitions demonstrate their reliability. It is important to appreciate that the absence of effective therapies for ALI-ARDS limits the clinical consequences of its diagnosis; therefore, the findings of this study are largely a challenge to the research community. However, when effective treatments are found, their applicability at the bedside will depend on clinicians' abilities to identify and treat patients similar to those enrolled in the clinical trials.12,20 If the efficacy of an ALI-ARDS therapy depends, in part, on radiographic aspects of the syndrome, and if clinicians cannot identify similar patients because of variability in applying the definition, then the treatment will not be as effective in their patients as in the clinical trial subjects. Before we can expect clinicians to consistently identify chest radiographs that meet criteria for ALI-ARDS, tools should be developed to help experts apply the definition consistently.
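To make the prevalence caveat discussed above concrete, the following minimal sketch (invented for illustration, not part of the original analysis) shows how the k-statistic can fall even when raw agreement is unchanged, once positive readings become very common. Both hypothetical 2 × 2 tables show two readers agreeing on 90% of 100 radiographs.

```python
import numpy as np

def cohen_kappa(table) -> float:
    """Cohen's kappa for a two-reader confusion matrix of counts."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_observed = np.trace(t) / n
    p_expected = (t.sum(axis=0) * t.sum(axis=1)).sum() / n ** 2  # chance agreement
    return (p_observed - p_expected) / (1 - p_expected)

# Both tables show 90% raw agreement between two readers on 100 radiographs.
balanced = [[45, 5], [5, 45]]  # ~50% of readings positive -> kappa = 0.80
skewed = [[85, 5], [5, 5]]     # ~90% of readings positive -> kappa ~ 0.44
print(cohen_kappa(balanced), cohen_kappa(skewed))
```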
Appendix

The following individuals participated as readers: Marcelo Amato, MD, Sao Paulo, Brazil; Derek C. Angus, MD, MPH, Pittsburgh, PA; Gordon R. Bernard, MD, Nashville, TN; Desmond J. Bohn, MD, Toronto, Ontario, Canada; Roy G. Brower, MD, Baltimore, MD; Deborah Cook, MD, Hamilton, Ontario, Canada; Timothy Evans, MB, London, UK; John T. Granton, MD, Toronto, Ontario, Canada; Ronald B. Hirschl, MD, Ann Arbor, MI; Leonard D. Hudson, MD, Seattle, WA; Brian P. Kavanagh, MB, Toronto, Ontario, Canada; Paul Lanken, MD, Philadelphia, PA; John J. Marini, MD, St. Paul, MN; Alan H. Morris, MD, Salt Lake City, UT; Polly Parsons, MD, Denver, CO; V. Marco Ranieri, MD, Bari, Italy; Gordon D. Rubenfeld, MD, MSc, Seattle, WA; Arthur S. Slutsky, MD, Toronto, Ontario, Canada; Thomas E. Stewart, MD, Toronto, Ontario, Canada; Jesus Villar, MD, Canary Islands, Spain; and Herbert P. Wiedemann, MD, Cleveland, OH.
References

1. Ashbaugh DG, Bigelow DB, Petty TL, et al. Acute respiratory distress in adults. Lancet 1967; 2:319-323
2. Esteban A, Fernandez-Segoviano P, Oliete S, et al. Radiographic findings for the adult respiratory distress syndrome in patients with peritonitis. Crit Care Med 1983; 11:880-882
3. Wegenius G, Erikson U, Borg T, et al. Value of chest radiography in adult respiratory distress syndrome. Acta Radiol Diagn (Stockh) 1984; 25:177-184
4. Murray JF, Matthay MA, Luce JM, et al. An expanded definition of the adult respiratory distress syndrome. Am Rev Respir Dis 1988; 138:720-723
5. Johnson KS, Bishop MH, Stephen CM, et al. Temporal patterns of radiographic infiltration in severely traumatized patients with and without adult respiratory distress syndrome. J Trauma 1994; 36:644-650
6. Schuster DP. What is acute lung injury? What is ARDS? Chest 1995; 107:1721-1726
7. Garber BG, Hebert PC, Yelle JD, et al. Adult respiratory distress syndrome: a systemic overview of incidence and risk factors. Crit Care Med 1996; 24:687-695
8. Bernard GR, Artigas A, Brigham KL, et al. The American-European Consensus Conference on ARDS: definitions, mechanisms, relevant outcomes, and clinical trial coordination. Am J Respir Crit Care Med 1994; 149:818-824
9. Wagner GR, Attfield MD, Parker JE. Chest radiography in dust-exposed miners: promise and problems, potential and imperfections. Occup Med 1993; 8:127-141
10. Christiansen F, Andersson T, Rydman H, et al. Rater agreement in lung scintigraphy. Acta Radiol 1996; 37:754-758
11. Elmore JG, Wells CK, Howard DH, et al. The impact of clinical history on mammographic interpretations. JAMA 1997; 277:49-52
12. Stewart TE, Meade MO, Cook DJ, et al. Evaluation of a ventilation strategy to prevent barotrauma in patients at high risk for acute respiratory distress syndrome. N Engl J Med 1998; 338:355-361
13. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981
14. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994; 331:1493-1499
15. Melbye H, Dale K. Interobserver variability in the radiographic diagnosis of adult outpatient pneumonia. Acta Radiol 1992; 33:79-81
16. Beards SC, Jackson A, Hunt L, et al. Interobserver variation in the chest radiograph component of the lung injury score. Anaesthesia 1995; 50:928-932
17. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299:926-930
18. Bone RC, Balk RA, Cerra FB, et al. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis: the ACCP/SCCM Consensus Conference Committee; American College of Chest Physicians/Society of Critical Care Medicine. Chest 1992; 101:1644-1655
19. Bone RC. Sepsis, the sepsis syndrome, multi-organ failure: a plea for comparable definitions. Ann Intern Med 1991; 114:332-333
20. Amato MBP, Barbas CSV, Medeiros DM, et al. Effect of a protective-ventilation strategy on mortality in the acute respiratory distress syndrome. N Engl J Med 1998; 338:347-354