Comparison of Independent Double Readings and Computer-Aided Diagnosis (CAD) for the Diagnosis of Breast Calcifications

Yulei Jiang, PhD, Charles E. Metz, PhD, Robert M. Nishikawa, PhD, Robert A. Schmidt, MD
Rationale and Objectives. The aim of this study was to compare independent double reading by radiologists and computer-aided diagnosis (CAD) in the diagnostic interpretation of mammographic calcifications.

Materials and Methods. Ten radiologists independently interpreted 104 mammograms containing clustered microcalcifications; 46 of these were malignant and 58 were benign at biopsy. Radiologists read the images with and without a computer aid by using a counterbalanced study design. Sensitivity and specificity were calculated from observer biopsy recommendations, and receiver operating characteristic (ROC) curves were computed from their diagnostic confidence ratings. Unaided double-reading sensitivity and specificity values were derived post hoc by using three different objective rules and an additional rule of simulated-optimal double reading, which assumed that consultations for resolving two radiologists' different independent diagnoses always produce the correct clinical recommendation. ROC curves of unaided double readings were obtained according to the literature.

Results. Single reading without the computer aid yielded 74% sensitivity and 32% specificity, whereas CAD reading yielded 87% sensitivity and 42% specificity and appeared on a higher ROC curve (P < .0001). The three methods of formulating independent double readings generated sensitivities between 59% and 89%, specificities between 50% and 13%, and operating points that moved essentially along the average unaided single-reading ROC curve. ROC curves of unaided independent double readings showed small, statistically insignificant improvement over those of unaided single readings. Results of the simulated-optimal double reading were similar to those of CAD: 89% sensitivity and 50% specificity.

Conclusion. Independent double reading of mammographic calcifications may not improve diagnostic performance. CAD reading improves diagnostic performance to an extent approaching the maximum possible performance.

Key Words. Diagnostic radiology, observer performance; receiver operating characteristic (ROC) curve; breast radiography; breast neoplasm, diagnosis; breast neoplasm, calcification; computer, diagnostic aid.

© AUR, 2006
Acad Radiol 2006; 13:84–94

1 From the Department of Radiology, The University of Chicago, 5841 South Maryland Avenue, Chicago, IL 60637. Received August 4, 2005; revision received and accepted September 20. Address correspondence to: Y.J. e-mail: [email protected]

© AUR, 2006  doi:10.1016/j.acra.2005.09.086

Researchers have long been interested in improving radiographic interpretations by having two or more radiologists read the same images (1–3). Double reading of mammograms has been shown to increase breast-cancer detection
by 3%–15% (4–7). This finding, with both logistical and economic implications, has led some clinical practices to have screening mammograms routinely read by two radiologists (8). Computer-aided diagnosis (CAD), introduced in recent years, is conceptually similar to double reading by two radiologists. In CAD, a computer technique analyzes the image automatically and makes its results available to an interpreting radiologist. Laboratory studies have shown that computer aids can help radiologists improve diagnostic performance significantly (9–18).
CAD offers potential advantages compared with double reading by radiologists: it is inherently more consistent than human readers, it can be improved continually over time, and it is potentially more economical than an additional human reader. However, the effects of these two approaches on the performance of diagnostic mammography have not been compared. In this report, we compare CAD and independent double readings by radiologists through analysis of data from radiologists' retrospective interpretations of mammograms with and without a computer aid that we developed. We derive the double-reading performance post hoc by using several objective methods. In this analysis, we evaluate only independent (ie, blind) double readings that can be ascribed to objective rules, which excludes such subjective methods as (non-blind) consultation.
MATERIALS AND METHODS

Mammograms

From the University of Chicago Hospitals radiology files, we identified mammograms of 104 consecutive patients who had undergone breast biopsies for tissue diagnosis of clustered microcalcifications. The study included only calcification cases because the computer aid was designed specifically to analyze this common type of mammographic lesion (12,19). Microcalcifications often are the only mammographic indication of breast cancer (20). This case series consists of 46 consecutive cancers and 58 consecutive benign lesions. Of the cancers, nine were invasive and 37 were ductal carcinoma in situ. To increase statistical power, as in other receiver operating characteristic (ROC) studies, the proportion of cancer cases in this series was enriched compared with that typically seen in clinical practice (21). Therefore, cancer cases necessarily were drawn from a longer time span than benign cases. Criteria in selecting these mammograms were: (1) a cluster of microcalcifications was the only significant lesion that led to the biopsy, and pathological results were definitive; (2) original mammograms of at least two standard views and one magnification view were available; and (3) technical quality of the mammograms was adequate for interpretation (12). Our Institutional Review Board (IRB) approved a waiver of patient consent for this study because the study involved only retrospective review of existing mammograms with known diagnoses. As in similar studies, previous mammograms were not included in this series because not all
cases had previous mammograms. Clinical information was withheld from the observers to focus the analysis on the interpretation of mammograms. Locations of microcalcification clusters were marked on each film (bracketed between two arrows) by an experienced mammographer (R.A.S.) after radiology and pathology reports were reviewed.

Radiologist Observers

Ten radiologists with experience in mammography who had not previously seen the study cases participated as observers. Five of these radiologists were general attending radiologists from the Chicago metropolitan area, and five were senior radiology residents from our institution. The attending radiologists were Mammography Quality Standards Act (MQSA)–certified readers for whom mammography constituted an average of 30% of their clinical practice. They had read mammograms for an average of 9 years (median = 6 years, range = 1–30 years) and had read at least 1000 mammograms in the preceding year. Residents had limited experience acquired from training rotations of 1–2 months. IRB-approved informed consent was obtained from all observers.

The Computer Aid

The computer aid was a numerical estimate of the likelihood of malignancy made by a computer scheme we developed on the basis of eight image features extracted by the computer from digitized mammograms. Details of this computer aid have been reported previously (12,19). Computer analysis was performed on standard-view mammograms only, but radiologist observers interpreted both standard- and magnification-view mammograms. Standard-view mammograms were digitized with a laser scanner (Lumisys, Sunnyvale, CA) at a 0.1-mm pixel size and 12-bit gray scale. Locations of individual microcalcifications were identified manually on a computer monitor (19). Observers were instructed to consider the computer results when those results were presented to them.
They were told that the computer’s accuracy was approximately 90% sensitivity and 61% positive predictive value for the study cases when a threshold of 30% was applied to the computer-estimated likelihood of malignancy. One way to interpret this result is that any observer could have achieved the same accuracy by recommending a biopsy only when the computer reported a 30% or greater likelihood of malignancy.
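The computer aid's standalone accuracy at the 30% threshold follows from a simple thresholding rule. The sketch below illustrates the calculation on invented case data (not the study's actual computer outputs):

```python
def sensitivity_and_ppv(likelihoods, is_malignant, threshold=0.30):
    """Recommend biopsy when the computer-estimated likelihood of
    malignancy meets or exceeds the threshold, then score the rule."""
    recommend = [p >= threshold for p in likelihoods]
    tp = sum(1 for r, m in zip(recommend, is_malignant) if r and m)
    fp = sum(1 for r, m in zip(recommend, is_malignant) if r and not m)
    fn = sum(1 for r, m in zip(recommend, is_malignant) if not r and m)
    sensitivity = tp / (tp + fn)   # fraction of cancers sent to biopsy
    ppv = tp / (tp + fp)           # fraction of biopsies that are cancer
    return sensitivity, ppv

# Hypothetical cases: (computer likelihood, malignant at biopsy?)
cases = [(0.85, True), (0.40, True), (0.25, True), (0.90, True),
         (0.70, False), (0.10, False), (0.05, False), (0.35, False)]
sens, ppv = sensitivity_and_ppv([p for p, _ in cases],
                                [m for _, m in cases])
```

On these eight invented cases the rule yields 75% sensitivity and 60% positive predictive value; on the study cases the corresponding figures were 90% and 61%.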
Mammogram Review and Data Acquisition

Each observer independently reviewed all mammograms twice: once without the computer aid and once with the computer aid. Readings of the same mammogram were separated by 10–60 days. A counterbalanced study design was used: half the mammograms were read without the computer aid in the first reading session and then with the computer aid in the second reading session; the other half of the mammograms were read first with the computer aid and then without. The study protocol was designed carefully to minimize potential biases in comparing the interpretation of mammograms with and without the computer aid; the details are described elsewhere (12,21,22). After reviewing the mammograms for each case, observers were asked to report: (1) their degree of suspicion that a lesion was malignant, on a visual analogue scale of 0%–100%; and (2) their choice of clinical recommendation from: (a) surgical biopsy, (b) alternative tissue sampling, (c) short-term follow-up, and (d) routine follow-up.

Analysis of Single Readings With and Without the Computer Aid

ROC curves (23,24) were computed from each observer's quasicontinuous confidence data (25,26). We used the LABROC4 algorithm to obtain fitted binormal ROC curves (27). Two curves were obtained for each observer, from readings of mammograms with and without the computer aid, respectively. We then obtained pooled "average" ROC curves for the observers as a group by averaging the slope and intercept parameters of the individual binormal curves (21). Again, two curves were obtained: an average ROC curve for reading mammograms without the computer aid and another for reading mammograms with the computer aid. Sensitivity and specificity were calculated from the observers' clinical recommendation data. Sensitivity is defined as the fraction of cancers that were recommended for surgical biopsy or alternative tissue sampling.
Specificity is defined as the fraction of benign lesions that were recommended for short-term or routine follow-up. Sensitivity- and specificity-value pairs were plotted as operating points and compared with ROC curves to show the relationships between the observers' clinical recommendations and their quasicontinuous confidence data. Two pairs of sensitivity and specificity values were calculated for each observer: one for reading mammograms without the computer aid and one for reading mammograms with
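In code, these recommendation-based indices amount to the following (a sketch; the category strings are our own encoding of the four recommendation choices):

```python
# Tissue-sampling recommendations count as "positive" calls; follow-up
# recommendations count as "negative" calls.
TISSUE_SAMPLING = {"surgical biopsy", "alternative tissue sampling"}

def sens_spec(recommendations, is_malignant):
    """Sensitivity: fraction of cancers recommended for tissue sampling.
    Specificity: fraction of benign lesions recommended for follow-up."""
    cancer_calls = [r in TISSUE_SAMPLING
                    for r, m in zip(recommendations, is_malignant) if m]
    benign_calls = [r not in TISSUE_SAMPLING
                    for r, m in zip(recommendations, is_malignant) if not m]
    return (sum(cancer_calls) / len(cancer_calls),
            sum(benign_calls) / len(benign_calls))

# Hypothetical readings for six cases (first three malignant):
recs = ["surgical biopsy", "short-term follow-up",
        "alternative tissue sampling", "routine follow-up",
        "surgical biopsy", "short-term follow-up"]
truth = [True, True, True, False, False, False]
sens, spec = sens_spec(recs, truth)
```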
the computer aid. Average sensitivity and average specificity values with and without the computer aid also were calculated for the observers as a group.

Table 1
Definitions and Clinical Ramifications of Independent Double-Reading Rules

Rule                                Biopsy Recommended If and Only If
Logical-OR                          Either radiologist recommends biopsy
Logical-AND                         Both radiologists recommend biopsy
Tiebreaking by a third attending    (1) Both radiologists recommend biopsy; or (2) only one
  radiologist                       radiologist recommends biopsy and a third, tiebreaking,
                                    attending radiologist also recommends biopsy
Simulated-optimal double reading    (1) Both radiologists recommend biopsy; or (2) only one
                                    radiologist recommends biopsy and the lesion is malignant
                                    (a hypothetical situation)

Sensitivity and Specificity of Independent Double Readings

Sensitivity and specificity of independent double readings were calculated post hoc from the observers' clinical recommendations by pairing two (or more) observers' data. To ensure validity, we used only objective rules of independent double reading, in which radiologists read mammograms independently, without knowledge of other radiologists' interpretations, and their diagnoses subsequently were combined according to an objective rule. Table 1 lists three rules of independent double reading and one additional rule of simulated-optimal double reading.

The logical-OR rule results in biopsy recommendations for cases that either of two radiologists considers suspicious enough for biopsy. Sensitivity of this rule is expected to be greater than that of a single reader, but usually with a loss in specificity. This method has been studied in clinical mammography because of its potential to increase sensitivity (4,5). The logical-AND rule requires that both radiologists consider a lesion suspicious enough for a combined biopsy decision. This method generally does not cause as many benign biopsy results as a single reader does, but its sensitivity is expected to be lower. Consequently, this method generally is not studied in clinical mammography, presumably because of its unfavorable effect on sensitivity.
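The logical-OR and logical-AND combinations can be expressed directly. The snippet below (an illustrative sketch with invented readings, not the study's software) shows how OR trades specificity for sensitivity and AND does the reverse:

```python
def combine_or(a, b):
    """Biopsy if either reader recommends biopsy (True = biopsy)."""
    return [x or y for x, y in zip(a, b)]

def combine_and(a, b):
    """Biopsy only if both readers recommend biopsy."""
    return [x and y for x, y in zip(a, b)]

def sens_spec(calls, truth):
    """Sensitivity and specificity of a list of biopsy calls."""
    sens = sum(c for c, m in zip(calls, truth) if m) / sum(truth)
    spec = (sum(not c for c, m in zip(calls, truth) if not m)
            / (len(truth) - sum(truth)))
    return sens, spec

# Two hypothetical readers over six cases (first three malignant):
truth    = [True, True, True, False, False, False]
reader_a = [True, False, True, True, False, False]
reader_b = [False, True, True, False, True, False]

or_sens, or_spec = sens_spec(combine_or(reader_a, reader_b), truth)
and_sens, and_spec = sens_spec(combine_and(reader_a, reader_b), truth)
```

Here the OR combination reaches perfect sensitivity at the cost of specificity, and the AND combination reaches perfect specificity at the cost of sensitivity, mirroring the expectations stated above.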
The rule of tiebreaking by a third attending radiologist when one primary reader recommends biopsy and the other recommends follow-up may be a less time-consuming method than consensus-seeking consultation among radiologists to resolve conflicting recommendations. This objective rule is equivalent to a majority vote of three radiologists. Ideally, a more accurate radiologist should serve as the tiebreaker because this radiologist's decision determines the final recommendation. Note that clinical consultation often differs somewhat from the tiebreaking rule studied here. In clinical practice, the tiebreaking radiologist normally would know the recommendations made by the primary readers and may consider the primary readers' decisions, individual accuracy, and patterns of recommending biopsies in making his or her own decision in a particular case. However, in the present study, because all study radiologists interpreted the mammograms independently, the tiebreaking radiologist was blinded to the primary readers' decisions. Whereas the previous two rules each produced 45 different pairings of two radiologists from the pool of 10 radiologists (45 = 10 × 9/2), the rule of tiebreaking generated 180 different groupings of three radiologists: two primary readers from the pool of 10 radiologists and one additional tiebreaking radiologist from the pool of five attending radiologists (compare Figure 1a and b with c).

The rule of simulated-optimal double reading estimates an upper bound on performance when double reading involves consultation between two radiologists; it otherwise is difficult to ascribe the highly subjective consultation process to an objective mathematical model.
The rule of simulated-optimal double reading assumes that two radiologists enter into a consultation only if one radiologist recommends biopsy and the other recommends follow-up, and that through the consultation the radiologists always reach the correct decision: biopsy for a cancer and follow-up for a benign lesion. This method obviously is hypothetical because no consultation can result in the correct diagnosis in all cases; hence, the method estimates an upper bound on performance that may or may not be achievable in practice. Note that this rule can still produce incorrect decisions when both radiologists agree on one, albeit incorrect, diagnosis, because they would not enter into a joint consultation in that situation. Note also that, as immediate consequences of its definition, the sensitivity of simulated-optimal double reading equals that of the logical-OR rule and its specificity equals that of the logical-AND rule.

ROC Curves of Independent Double Readings

In addition to sensitivity and specificity values, ROC curves also were obtained for double readings by using a method described by Metz and Shen (28), who quantitatively predicted the improvement of double-reading ROC curves over single-reading ROC curves. Analysis of double-reading ROC curves complements the analysis of sensitivity and specificity because double-reading ROC curves were derived from the radiologists' diagnostic confidence data, whereas sensitivity and specificity values were calculated from the radiologists' clinical recommendation data. A gain in the double-reading ROC curves is expected because radiologists' interpretations of mammograms are affected by random variations in the individual's decision-making process, which lower the single-reading ROC curve and the area under that curve (AZ). In contrast, the average of two independent interpretations tends to decrease the magnitude of these random variations, thereby raising the double-reading ROC curve. These calculations are described in Appendix A.
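The tiebreaking and simulated-optimal rules, and the grouping counts quoted above (45 reader pairs, 180 tiebreak groupings), can be checked mechanically. A sketch under our own encoding (readers 0–4 attendings, 5–9 residents; True = biopsy):

```python
from itertools import combinations

def tiebreak(a, b, third):
    """Majority vote of three: biopsy if both primaries agree on biopsy,
    or if they disagree and the tiebreaker recommends biopsy."""
    return [(x and y) or ((x != y) and t) for x, y, t in zip(a, b, third)]

def simulated_optimal(a, b, truth):
    """On disagreement, the hypothetical consultation always reaches the
    correct decision: biopsy if and only if the lesion is malignant."""
    return [(x and y) or ((x != y) and m) for x, y, m in zip(a, b, truth)]

# Grouping counts: 45 pairs of primaries from 10 readers; 180 triples
# when the tiebreaker is one of the 5 attendings not already a primary.
readers, attendings = range(10), set(range(5))
pairs = list(combinations(readers, 2))
triples = [(a, b, t) for a, b in pairs for t in attendings - {a, b}]

# Identity noted in the text: on cancers, simulated-optimal matches
# logical-OR; on benign lesions, it matches logical-AND.
truth = [True, True, False, False]
a = [True, False, True, False]
b = [False, False, True, True]
opt = simulated_optimal(a, b, truth)
```

Counting the triples confirms the 180 figure: 45 × 5 candidate tiebreakers, minus the groupings in which the tiebreaker is already a primary reader.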
RESULTS

Single Readings With and Without the Computer Aid

For single reading without the computer aid, average sensitivity was 73% and average specificity was 32%. For single reading with the computer aid (ie, the CAD reading), average sensitivity increased to 87% (P = .0006) and average specificity increased to 42% (P = .003, Student's two-tailed t-test for paired data; Table 2). Figure 1 shows the average single-reading ROC curve without the computer aid and the average CAD-reading ROC curve. Figure 1 also shows operating points (pairs of sensitivity and specificity values) for each of the 10 observers in both single-reading conditions (with and without aid). Average AZ values were 0.61 for single reading without the computer aid and 0.75 for CAD reading. This difference in AZ values was statistically significant (P < .0001) according to the Dorfman-Berbaum-Metz jackknife analysis-of-variance (ANOVA) method2 (29).

2 Note that this method uses the LABROC algorithm in calculating the AZ value of individual ROC curves (for details, see http://xray.bsd.uchicago.edu/krl/roc_soft.htm).
Figure 1. Sensitivity and specificity values produced by the independent double-reading rules of: (a) logical-OR, (b) logical-AND, (c) tiebreaking by a third attending radiologist, and (d) simulated-optimal double reading, in comparison to values produced by single reading without computer aid and the CAD reading. Results were obtained from 10 radiologists’ independent interpretations of malignant and benign lesions identified only as clustered microcalcifications in 104 mammograms. The computer aid was an estimate of the likelihood of malignancy. Rules are listed in Table 1. Average ROC curves of the 10 radiologists’ single readings of mammograms with (dashed curve) and without (solid curve) the computer aid also are shown. Note that some double-reading combinations produced identical results, so their data points are superimposed.
Average sensitivity of the five attending radiologists reading mammograms without the computer aid was 75% ± 13% (average ± SD of reader-specific values), average specificity was 29% ± 18%, and average AZ was 0.62 ± 0.07. Corresponding average values for the five residents were 72% ± 10% for sensitivity, 34% ± 14%
Table 2
Comparison of Sensitivity and Specificity

                                          Sensitivity (%)                  Specificity (%)
                                          Mean ± SD  95% CI*               Mean ± SD  95% CI*
Single reading without aid                73 ± 11    —                     32 ± 15    —
Single reading with computer aid (CAD)    87 ± 9     (0.05, 0.23)          42 ± 15    (0.01, 0.21)
Independent double readings
  Logical-OR                              89 ± 6     (0.10, 0.20)          13 ± 8     (−0.22, −0.14)
  Logical-AND                             58 ± 10    (−0.19, −0.11)        50 ± 13    (0.14, 0.22)
  Tiebreaking by a third attending
    radiologist                           79 ± 7     (0.01, 0.10)          26 ± 9     (−0.11, −0.01)
  Simulated-optimal double reading        89 ± 6     (0.10, 0.20)          50 ± 13    (0.14, 0.22)

Note. For a reading condition to be considered significantly better than single reading without aid, the 95% confidence intervals for both sensitivity and specificity must not include zero and both must be positive. One cannot infer a significant gain in diagnostic accuracy from the combination of an entirely positive and an entirely negative 95% confidence interval because a significant change of this kind can be caused by movement along a single ROC curve. SD is of reader-specific values.
*Of the average difference between the designated reading method and single reading without aid, estimated from 10,000 bootstraps (see Appendix B for details).
for specificity, and 0.61 ± 0.05 for AZ. The small differences in sensitivity, specificity, and AZ values between the two groups were not statistically significant (P > .50, Student's two-tailed t-test). Based on these results and results from our previous analyses (12), data from attending radiologists and residents were combined in subsequent analyses.

Sensitivity and Specificity of Independent Double Readings

As expected, the logical-OR rule generated the highest sensitivity but the lowest specificity (Figure 1a; Table 2). Conversely, the logical-AND rule generated the highest specificity but the lowest sensitivity (Figure 1b; Table 2). The rule of tiebreaking by a third attending radiologist resulted in intermediate sensitivity and specificity values (Figure 1c; Table 2). Operating points generated by these rules of independent double reading fall near the average single-reading ROC curve without the computer aid (Figure 1a–c). These double-reading rules were able to move the operating points toward higher sensitivity or higher specificity, but none of the rules moved the operating points onto a higher ROC curve. The deviation of the operating points away from the average ROC curve was qualitatively similar to that of the single-reading operating points (30). Therefore, the results do not indicate that any of these rules improved diagnostic performance.
Table 3
Comparison of Area Under the ROC Curve (AZ)

ROC Curves                                AZ (Mean ± SD)     95% CI*
Single reading without aid                0.614 ± 0.056      —
Single reading with computer aid (CAD)    0.755 ± 0.030      (0.07, 0.22)
Double reading†
  Empirical value                         0.625 ± 0.042      (−0.01, 0.03)
  Theoretically expected value            0.638 ± 0.038      (0.00, 0.04)

Note. SD is of reader-specific values.
*Of the average difference in AZ values between the designated reading method and single reading without aid, estimated from 10,000 bootstraps (see Appendix B for details).
†See Appendix A for details.
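The 95% confidence intervals in Tables 2 and 3 were obtained from 10,000 bootstrap resamples. A minimal percentile-bootstrap sketch on per-reader differences (illustrative only; the exact resampling unit and procedure are defined in the paper's Appendix B):

```python
import random

def percentile_bootstrap_ci(diffs, n_boot=10_000, seed=1):
    """95% CI for the mean of per-reader differences between a designated
    reading method and single reading without aid: resample with
    replacement, recompute the mean, and take the 2.5th and 97.5th
    percentiles of the resampled means."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choice(diffs) for _ in range(n)) / n
                   for _ in range(n_boot))
    return means[round(0.025 * n_boot)], means[round(0.975 * n_boot) - 1]

# Hypothetical per-reader sensitivity differences (CAD minus unaided):
diffs = [0.22, 0.04, 0.13, 0.16, 0.09, 0.19, 0.11, 0.14, 0.08, 0.17]
lo, hi = percentile_bootstrap_ci(diffs, n_boot=2000)
```

Because the interval excludes zero for these invented differences, the corresponding gain would be judged significant under the criterion stated in the note to Table 2 (provided the specificity interval behaved likewise).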
Sensitivity and Specificity of the Simulated-Optimal Double Reading

For simulated-optimal double reading, average sensitivity was 89% and average specificity was 50% (Figure 1d; Table 2). The average sensitivity was similar to, and the average specificity slightly higher than, that of the CAD reading (Table 2). Operating points of the simulated-optimal double reading clearly fall above the average single-reading ROC curve without the computer aid, indicating an improvement in diagnostic accuracy, and they overlap with the operating points of the CAD reading.
ROC Curves of Double Readings

Table 3 lists average AZ values for single reading without the computer aid, CAD reading, and double readings. Two double-reading AZ values are listed: one empirical value (0.625) and one value expected on theoretical grounds, as described by Metz and Shen (28) (0.638; see Appendix A for details). Both double-reading AZ values were slightly higher than the AZ value for single reading without aid (0.614). However, all three of these AZ values were small compared with the AZ obtained from CAD reading (0.755).
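The AZ values above come from fitted binormal curves. For reference, a binormal ROC curve with intercept a and slope b has area AZ = Φ(a/√(1 + b²)); the sketch below applies this standard identity (it is not the study's software, and the parameter values are chosen only to land near the reported unaided AZ of about 0.61):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binormal_az(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return phi(a / sqrt(1.0 + b * b))

# A chance-level curve (a = 0) has AZ = 0.5; increasing the intercept
# raises AZ toward 1.
az_chance = binormal_az(0.0, 1.0)
az_better = binormal_az(0.41, 1.0)   # roughly the AZ ≈ 0.61 regime
```

Averaging the a and b parameters across readers, as described in the Materials and Methods section, yields the pooled curves whose areas appear in Table 3.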
DISCUSSION

Bird et al (4) found a 3.4% increase and Thurfjell et al (5) found a 15% increase in the detection of breast cancers (ie, sensitivity) when mammograms were read by two radiologists. Unfortunately, the literature does not always report the callback rate, which relates to specificity, an integral component of diagnostic performance (31). If the gain in sensitivity was achieved with a simultaneous increase in callback rate, then that gain might be a result of calling back more women for workup and thus does not represent a true diagnostic improvement (32). In a study of 108 radiologists, Beam et al (33) showed that outcomes of independent double reading of mammograms depended on individual practices and did not always produce beneficial results. Consistent with these reports, our results show that independent double readings can increase sensitivity, but the increase tends to be accompanied by a decrease in specificity. When combined by objective post hoc methods, independent double readings can move the operating points along the average single-reading ROC curve but are unable to move them onto a higher ROC curve (Figure 1a–c).

The three objective methods we analyzed for implementing independent double readings yield results that generally agree with expectations. The logical-OR rule generated the highest sensitivity because it is optimized for high sensitivity: it recommends biopsy for any case that any one radiologist finds suspicious, regardless of whether other readers would seek to decrease the number of likely-benign biopsies in that particular case. The logical-AND rule goes to the other extreme and is optimized for high specificity: the requirement of a consensus for recommending biopsy prevents questionably suspicious benign lesions from going to biopsy. However, it is clear that cancers
will be missed in this kind of double reading because many malignant lesions also are only questionably, rather than strongly, suspicious. Between the two extremes of the logical-OR and logical-AND rules is tiebreaking by a third attending radiologist. Perhaps not entirely expected is the result that all three methods generated operating points that fall near the average single-reading ROC curve, implying that these methods of independent double reading do not improve radiologists' ability to distinguish malignant from benign calcifications.

Results of our analysis of sensitivity and specificity in double readings are consistent with the results of the double-reading ROC curves. Although we used ad hoc methods motivated by clinical practice to calculate sensitivity and specificity, the analysis of double-reading ROC curves had an explicit statistical basis (28). Although we observed relatively small gains in AZ, less than previously shown (28), all our analyses (theoretical calculation of AZ, empirical calculation of AZ, and empirical calculation of sensitivity and specificity) consistently show relatively small gains compared with that achieved by the CAD reading.

We presented data from both the attending and resident radiologists based on the observation that the performance of the two groups is similar. Analysis of data from attending radiologists alone did not change the results (Tables 2 and 4). Note that the double-reading operating points obtained by attending radiologists are a subset of those shown in Figure 1 and therefore do not change the implication of these results: that double readings failed to improve radiologists' ability to distinguish malignant from benign calcifications. We believe inclusion of residents in this analysis is clinically relevant because senior residents are at a local peak in their mammography training, as they are studying for the American Board of Radiology oral examinations.
Moreover, the majority of recent residents and fellows who go into private practice are assigned to read screening mammograms.

The three objective rules of independent double reading are clinically relevant methods for implementing double reading. The logical-OR rule has been studied in screening mammography and increased the number of breast cancers detected from mammograms (4,5). It is logistically efficient because two radiologists can read mammograms at different times and an office assistant subsequently can combine their independent interpretations. The logical-AND rule minimizes overdiagnosis, but it may not be appropriate for screening mammography because the goal of screening is to detect early-stage breast cancers. Tiebreaking is another practical clinical
Table 4
Comparison of Results Obtained from Attending Radiologists With Results From Residents

Sensitivity (%)
                                          Attending (n = 5)       Resident (n = 5)
                                          Mean ± SD  Range        Mean ± SD  Range
Single reading without aid                75 ± 13    52–87        72 ± 10    61–87
Single reading with computer aid (CAD)    88 ± 9     74–96        87 ± 10    74–100
Independent double readings
  Logical-OR                              90 ± 6     78–98        87 ± 5     78–94
  Logical-AND                             59 ± 12    46–78        57 ± 8     44–70
  Tiebreaking by a third attending
    radiologist                           81 ± 7     70–91        78 ± 7     63–91
  Simulated-optimal double reading        90 ± 6     78–98        87 ± 5     78–94

Specificity (%)
                                          Attending (n = 5)       Resident (n = 5)
                                          Mean ± SD  Range        Mean ± SD  Range
Single reading without aid                29 ± 18    9–53         34 ± 14    19–52
Single reading with computer aid (CAD)    42 ± 15    28–60        42 ± 16    22–67
Independent double readings
  Logical-OR                              13 ± 8     5–29         14 ± 4     5–21
  Logical-AND                             45 ± 14    17–62        54 ± 13    33–78
  Tiebreaking by a third attending
    radiologist                           25 ± 9     12–38        26 ± 9     10–48
  Simulated-optimal double reading        45 ± 14    17–62        54 ± 13    33–78

Note. SD is of reader-specific values.
method in which a senior attending radiologist can be called on to help make a diagnosis in cases that appear to be difficult based on the disagreement between two primary readers. However, as we pointed out earlier, clinical implementations of tiebreaking often are not truly independent, and thus differ from the objective rule in this analysis, in that the tiebreaking radiologist usually is not blinded to the primary readers' interpretations. Unfortunately, our results suggest that none of these forms of double reading, if implemented in a truly independent (ie, blind) fashion, can help radiologists become more accurate in deciding whether breast calcifications are malignant.

Consultation among radiologists that seeks to resolve conflicting independent diagnoses and reach a consensus recommendation is commonplace in clinical practice. We did not analyze this form of double reading because its subjective nature cannot be captured by an objective rule. In consultation, radiologists may point out features in a mammogram overlooked by others and discuss the reasoning behind a recommendation, during which each radiologist's prior experience consciously or subconsciously enters into the consensus-making process. These attributes are not available when radiologists read mammograms independently; it therefore is inappropriate to model consultation by using objective rules. However, the simulated-optimal double reading studied here was designed specifically to analyze consultation by estimating an upper bound on its performance. Not surprisingly, this hypothetical method generated a true improvement by moving
operating points above the average single-reading ROC curve. Therefore, unlike the three methods of independent double reading, consultation appears to have the potential to enhance radiologists' diagnostic accuracy (34). However, the amount of improvement to be gained in this way, which will depend on how closely a consultation can approach our simulated-optimal double-reading process, generally is unknown.

Other investigators have shown that CAD can significantly improve radiologists' diagnostic accuracy (9–18). This also is shown in the present analysis. In addition, our results show that the substantial improvement from CAD contrasts with the small improvements, if any, obtained from objective post hoc combination of independent double readings by two radiologists (Tables 2 and 3). The CAD approach has three clear advantages compared with double reading by radiologists. First, in this study, the computer's accuracy in analyzing mammographic lesions was better than that of any of the individual radiologists (12). Second, the computer analysis was completely independent of any radiologist's interpretation and was not affected by the subjective factors that affect all radiologists (19). Third, the radiologist assumes final decision-making responsibility, with no need for arbitration by a third party. These observations support our finding that greater improvement in diagnostic accuracy and productivity can be gained from pairing a radiologist with our computer aid than from pairing a radiologist with a fellow radiologist.
JIANG
Remarkably, the performance achieved by CAD was similar to that of the simulated-optimal double reading; this indicates that CAD potentially offers a practical pathway for radiologists to approach optimal performance in routine clinical practice. Viewed from a different perspective, CAD (one radiologist using a computer aid) is akin to the radiologist "double-reading" with the computer and also "arbitrating" when there is a discrepancy. Our results indicate that, at least for analysis of calcifications, CAD can be used to obtain results similar to those of double reading with perfect consultation, and this can be done without the logistical complication of requiring additional radiologists.

The substantial gain obtained with CAD derives in part from the high performance of the computer analysis alone (12). One then could ask whether a similar gain can be achieved by pairing a radiologist with an expert: a specialist who is more accurate than others from long experience and who rivals the computer in accuracy. The combined performance of an expert and a nonexpert radiologist might be significantly better than that of the nonexpert alone. However, it is not clear in this situation whether the nonexpert radiologist makes a real contribution. Beam et al (33) showed that nonexperts may hurt the expert's performance by forcing their combined performance to be inferior to that of the expert alone. In contrast, the computer analysis in a CAD reading setting can provide the performance of an expert, while the nonexpert radiologist still contributes by making the final, improved diagnostic decision. Furthermore, it is not clear how many expert radiologists in mammography exist in the United States, and, unfortunately, not all clinical practices have access to an expert reader. There is no standardized test to identify "experts," and radiologists' self-assessments have been shown to be unreliable (35).
A computer tool therefore may be more readily accessible to clinical practices than an expert, once CAD techniques become widely available.

Our results were obtained from analysis of digitized screen-film mammograms. We expect the findings to be applicable also to full-field digital mammograms. If we assume that the computer aid can achieve the same high performance on full-field digital mammograms as on digitized screen-film mammograms (36), one would expect radiologists to achieve gains of the same magnitude in diagnostic performance from use of our computer aid. This prediction needs to be tested in future studies.
Academic Radiology, Vol 13, No 1, January 2006
To our knowledge, this is the first study to examine double reading of mammograms for a diagnostic or characterization task, rather than a perceptual or detection task. From a mass screening perspective, the present analysis suggests that patients would benefit substantially by having this particular decision-making task performed by radiologists with access to CAD algorithms. This potential benefit of CAD reading comes from the accurate performance of the computer image analysis, compared with that of the radiologists, in deciding whether to perform a biopsy for suspect microcalcifications (12). The alternative double-reading pathway of CAD tested here provides hope and, potentially, a practical course for improving diagnoses in the future.

ACKNOWLEDGMENT
This work was funded in part by the National Institutes of Health (NIH) through grants CA60187, CA92361, and GM57622, the US Army through grant DAMD17-001-0197, and the Cancer Research Foundation of America. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of any of the supporting organizations. Conflict of Interest Statement: Robert Nishikawa, Robert Schmidt, and Charles Metz are shareholders of, and receive royalties from, R2 Technology, Inc. (Sunnyvale, CA). Yulei Jiang, Robert Nishikawa, and Charles Metz receive research funding from R2 Technology, Inc. It is the University of Chicago Conflict of Interest Policy that investigators publicly disclose actual or potential significant financial interests that may appear to be affected by the research activities.

REFERENCES

1. Yerushalmy J, Harkness J, Cope J, Kennedy B. The role of dual reading in mass radiography. Am Rev Tuberc 1950; 61:443–464.
2. Hillman BJ, Hessel SJ, Swensson RG, Herman PG. Improving diagnostic accuracy: a comparison of interactive and Delphi consultations. Invest Radiol 1977; 12:112–115.
3. Hessel SJ, Herman PG, Swensson RG. Improving performance by multiple interpretations of chest radiographs: effectiveness and cost. Radiology 1978; 127:589–594.
4. Bird RE, Wallace TW, Yankaskas BC. Analysis of cancers missed at screening mammography. Radiology 1992; 184:613–617.
5. Thurfjell EL, Lernevall KA, Taube AA. Benefit of independent double reading in a population-based mammography screening program. Radiology 1994; 191:241–244.
6. Warren RM, Duffy SW. Comparison of single reading with double reading of mammograms, and change in effectiveness with experience. Br J Radiol 1995; 68:958–962.
7. Brown J, Bryan S, Warren R. Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms. BMJ 1996; 312:809–812.
8. Blanks RG, Wallis MG, Given-Wilson RM. Observer variability in cancer detection during routine repeat (incident) mammographic screening in a study of two versus one view mammography. J Med Screen 1999; 6:152–158.
9. Getty DJ, Pickett RM, D'Orsi CJ, Swets JA. Enhanced interpretation of diagnostic images. Invest Radiol 1988; 23:240–252.
10. Chan HP, Doi K, Vyborny CJ, et al. Improvement in radiologists' detection of clustered microcalcifications on mammograms: the potential of computer-aided diagnosis. Invest Radiol 1990; 25:1102–1110.
11. Kegelmeyer WP Jr, Pruneda JM, Bourland PD, Hillis A, Riggs MW, Nipper ML. Computer-aided mammographic screening for spiculated lesions. Radiology 1994; 191:331–337.
12. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6:22–33.
13. Chan HP, Sahiner B, Helvie MA, et al. Improvement of radiologists' characterization of mammographic masses by using computer-aided diagnosis: an ROC study. Radiology 1999; 212:817–827.
14. Huo Z, Giger ML, Vyborny CJ, Metz CE. Breast cancer: effectiveness of computer-aided diagnosis observer study with independent database of mammograms. Radiology 2002; 224:560–568.
15. Kobayashi T, Xu XW, MacMahon H, Metz CE, Doi K. Effect of a computer-aided diagnosis scheme on radiologists' performance in detection of lung nodules on radiographs. Radiology 1996; 199:843–848.
16. Difazio MC, MacMahon H, Xu XW, et al. Digital chest radiography: effect of temporal subtraction images on detection accuracy. Radiology 1997; 202:447–452.
17. Monnier-Cholley L, MacMahon H, Katsuragawa S, Morishita J, Ishida T, Doi K. Computer-aided diagnosis for detection of interstitial opacities on chest radiographs. AJR Am J Roentgenol 1998; 171:1651–1656.
18. Ashizawa K, MacMahon H, Ishida T, et al. Effect of an artificial neural network on radiologists' performance in the differential diagnosis of interstitial lung disease using chest radiographs. AJR Am J Roentgenol 1999; 172:1311–1315.
19. Jiang Y, Nishikawa RM, Wolverton DE, et al. Malignant and benign clustered microcalcifications: automated feature analysis and classification. Radiology 1996; 198:671–678.
20. Sickles EA. Breast calcifications: mammographic evaluation. Radiology 1986; 160:289–293.
21. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24:234–245.
22. Swets JA, Pickett RM. Evaluation of Diagnostic Systems: Methods From Signal Detection Theory. New York: Academic, 1982.
23. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986; 21:720–733.
24. Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240:1285–1293.
25. Rockette HE, Gur D, Metz CE. The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques. Invest Radiol 1992; 27:169–172.
26. Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: some quantitative considerations. Acad Radiol 2001; 8:328–334.
27. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med 1998; 17:1033–1053.
28. Metz CE, Shen JH. Gains in accuracy from replicated readings of diagnostic images: prediction and assessment in terms of ROC analysis. Med Decis Making 1992; 12:60–75.
29. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731.
30. Jiang Y, Nishikawa RM, Schmidt RA, Toledano AY, Doi K. Potential of computer-aided diagnosis to reduce variability in radiologists' interpretations of mammograms depicting microcalcifications. Radiology 2001; 220:787–794.
COMPARISON OF DOUBLE READINGS AND CAD
31. Beam CA, Sullivan DC. What are the issues in the double reading of mammograms? Radiology 1994; 193:582.
32. Taplin SH, Rutter CM, Elmore JG, Seger D, White D, Brenner RJ. Accuracy of screening mammography using single versus independent double interpretation. AJR Am J Roentgenol 2000; 174:1257–1262.
33. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996; 3:891–897.
34. Beam C, Conant E, Sickles E, Guse C. The value of consensus reading in screening mammography. Radiology 1999; 213(P):239–240.
35. Schmidt RA, Newstead GM, Linver MN, et al. Mammographic screening sensitivity of general radiologists. In: Karssemeijer N, Thijssen M, Hendriks J, van Erning L, eds. Digital Mammography. Dordrecht: Kluwer, 1998; 383–388.
36. Jiang Y, Rana RS, Schmidt RA, et al. Computer classification of malignant and benign calcifications in full-field digital mammograms. In: Pisano E, ed. Digital Mammography 2004. Dordrecht: Kluwer, (in press).
37. Metz CE, Wang P-L, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. In: Deconinck F, ed. Information Processing in Medical Imaging. The Hague: Nijhoff, 1984; 432–445.
APPENDIX A. DOUBLE-READING ROC CURVES

Metz and Shen (28) provided a theoretical model for the expected gain in Az from double readings. They showed that lower correlation between two readers' confidence ratings produces higher gains in Az, and vice versa. They also showed that Az values calculated from their theoretical model agreed with Az values obtained empirically from double-reading ROC curves. Following their method, we calculated two double-reading Az values: a theoretically expected value and an empirical value obtained from fitted ROC curves.

To calculate the theoretically expected Az value for a given pair of double-reading radiologists, we used the CLABROC algorithm (now obsolete; its functionality is incorporated in the ROCKIT algorithm, available at http://xray.bsd.uchicago.edu/krl/roc_soft.htm) to fit a pair of binormal ROC curves to the correlated diagnostic confidence data from the two radiologists and, from the curve-fitting results, obtained an estimate of the correlation between the two radiologists' latent, normally distributed decision-variable outcomes (28,37). Based on the estimated correlation coefficients, we then calculated the expected values of the binormal-model slope and intercept parameters (28) and subsequently calculated the expected Az value.

To calculate the empirical Az value for a given pair of double-reading radiologists, we calculated the average of the two radiologists' diagnostic confidence data for each case and then fitted a "double-reading" ROC curve to the resulting averaged confidence data. In all, 45 theoretically expected Az values and 45 empirical double-reading ROC curves were obtained from all possible pairings of the 10 radiologists.

As listed in Table 3, the theoretical and empirical double-reading Az values were similar. In addition, Table 3 compares these results with 10 single-reading ROC curves without the computer aid and 10 single-reading ROC curves of CAD, obtained by using the LABROC4 algorithm (27). Consistent with our analysis of sensitivity and specificity, the two double-reading Az values were only slightly higher than the single-reading Az value, indicating that little improvement in ROC curves was obtained from averaging the diagnostic confidence ratings.
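The two Az computations described in this appendix can be sketched in miniature. This is a hedged illustration, not the CLABROC/LABROC maximum-likelihood fits used in the study: the binormal intercept a and slope b are assumed given rather than fitted, the empirical value is computed with the nonparametric Mann-Whitney statistic rather than a fitted curve, and the confidence ratings are invented toy data.

```python
import math
from itertools import product

def binormal_az(a, b):
    """Binormal-model area under the ROC curve: Az = Phi(a / sqrt(1 + b^2)),
    where a is the intercept and b the slope of the binormal ROC line."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(1.0 + b * b) / math.sqrt(2.0)))

def empirical_auc(benign, malignant):
    """Nonparametric AUC (Mann-Whitney): P(malignant rating > benign rating),
    counting ties as 1/2."""
    wins = sum((m > x) + 0.5 * (m == x) for m, x in product(malignant, benign))
    return wins / (len(malignant) * len(benign))

# "Double-reading" data: average two readers' confidence ratings per case
reader1_benign,    reader2_benign    = [10, 20, 35, 40], [15, 25, 30, 50]
reader1_malignant, reader2_malignant = [30, 55, 70, 90], [45, 50, 80, 85]

avg_benign    = [(x + y) / 2 for x, y in zip(reader1_benign, reader2_benign)]
avg_malignant = [(x + y) / 2 for x, y in zip(reader1_malignant, reader2_malignant)]

print(round(empirical_auc(avg_benign, avg_malignant), 3))  # empirical double-reading AUC
print(round(binormal_az(1.5, 1.0), 3))                     # Az for assumed a=1.5, b=1.0
```

In the study, the averaged ratings were instead fed to a binormal curve fit, and the theoretical Az came from expected slope and intercept parameters derived from the estimated inter-reader correlation; the sketch only shows the shape of the two calculations.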
APPENDIX B. BOOTSTRAP ESTIMATION OF CONFIDENCE INTERVALS

We estimated 95% confidence intervals of the difference in a performance measure (sensitivity, specificity, or Az) between two reading conditions from 10,000 bootstrap reader samples and case samples. One of the two reading conditions was always single reading without the computer aid (Tables 2 and 3). These estimates may be conservative (ie, confidence intervals may be wider than they should be) because the bootstrap reader samples tended to contain fewer groups of readers for double readings than the exhaustive pairings of different readers in the original experiment. The reason is as follows. By nature, bootstrap sampling of readers tends to draw duplicate copies of identical readers into a single bootstrap sample. By definition, the double-reading rules (Table 1) generate a null effect from identical data when a reader is paired with himself or herself. Therefore, to prevent underestimating the effects of double reading when readers are sampled in this way, we did not pair any reader with himself or herself in any double-reading calculation, thereby often necessarily obtaining fewer groups of readers for double readings than the exhaustive pairings of readers in the original experiment.
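The reader-and-case bootstrap described above, with self-pairings excluded, might be outlined as follows. This is an illustrative sketch under assumed data shapes (a dict of per-reader binary recommendations and a toy statistic), not the study's actual code, and it bootstraps a single condition rather than a difference between conditions.

```python
import random
from itertools import combinations

def bootstrap_double_reading(stat_fn, readings, truth, n_boot=1000, seed=0):
    """
    readings: dict reader_id -> list of 0/1 biopsy recommendations per case.
    stat_fn(rec1, rec2, truth_sample) -> performance value for one reader pair.
    Returns a 95% percentile confidence interval over bootstrap samples of
    both readers and cases.
    """
    rng = random.Random(seed)
    readers = list(readings)
    n_cases = len(truth)
    stats = []
    for _ in range(n_boot):
        # Resample readers and cases with replacement
        r_sample = [rng.choice(readers) for _ in readers]
        c_sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        t = [truth[i] for i in c_sample]
        # Pair readers, skipping pairs of identical readers: pairing a reader
        # with a duplicate of itself would yield a null double-reading effect
        vals = [stat_fn([readings[a][i] for i in c_sample],
                        [readings[b][i] for i in c_sample], t)
                for a, b in combinations(r_sample, 2) if a != b]
        if vals:
            stats.append(sum(vals) / len(vals))
    stats.sort()
    lo = stats[int(0.025 * len(stats))]
    hi = stats[int(0.975 * len(stats)) - 1]
    return lo, hi

# Toy statistic: sensitivity of an OR combination rule (hypothetical data)
def or_sensitivity(r1, r2, t):
    rec = [a | b for a, b in zip(r1, r2)]
    pos = sum(t)
    return sum(r and y for r, y in zip(rec, t)) / pos if pos else 0.0

truth = [1, 1, 1, 0, 0, 0]
readings = {"A": [1, 1, 0, 0, 1, 0],
            "B": [1, 0, 1, 0, 0, 0],
            "C": [0, 1, 1, 1, 0, 0]}
lo, hi = bootstrap_double_reading(or_sensitivity, readings, truth)
print(lo, hi)
```

The `a != b` filter is the sketch's analogue of the paper's exclusion of self-pairings; because duplicated readers are dropped rather than paired, some bootstrap samples contribute fewer pairs than the exhaustive pairings of distinct readers, which is exactly why the resulting intervals tend to be conservative.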