An Empirical Comparison of Discrete Ratings and Subjective Probability Ratings




Original Investigations

An Empirical Comparison of Discrete Ratings and Subjective Probability Ratings1 Kevin S. Berbaum, PhD, Donald D. Dorfman, PhD, E. A. Franken, Jr, MD, Robert T. Caldwell, MFA

Rationale and Objectives. The authors compared receiver operating characteristic (ROC) data from a five-category discrete scale with that from a 101-category subjective probability scale to determine how well the latter categories define the ROC curve.

Materials and Methods. The authors analyzed data from a pilot study performed for another purpose in which 10 radiologists provided both a five-point confidence rating and a subjective probability rating of abnormality for each interpretation. ROC operating points were plotted for a five-category scale and a 101-category scale to determine how well the observed points covered the range of false-positive probabilities. ROC curves were fitted to the subjective probability data according to the standard ROC model.

Results. For these data, subjective probability ratings were somewhat more effective in populating the range of false-positive probability with ROC points. For three observers, the ROC curves inappropriately crossed the chance line. For another four, prevention of such crossing seemed to depend on one or two ROC points near the upper right corner of the ROC space, points based on discriminations within the discrete category “no abnormality to report.”

Conclusion. Subjective probability ratings should provide substantially better coverage of the ROC space with operating points, preventing inappropriate crossing of the chance line. Unfortunately, the protection offered by subjective probability ratings was unreliable and depended on ROC points derived from discriminations not directly related to apparent abnormality. The use of proper ROC models to fit data may offer a better solution.

Key Words. Diagnostic radiology, observer performance; images, interpretation; quality assurance; receiver operating characteristic curve (ROC).

© AUR, 2002

Acad Radiol 2002; 9:756–763

1 From the Departments of Radiology (K.S.B., D.D.D., E.A.F., R.T.C.) and Psychology (K.S.B., D.D.D.), University of Iowa, 3170 Medical Laboratories, Iowa City, IA 52242. Received December 27, 2001; revision requested February 6, 2002; revision received February 8; accepted February 15. Supported by U.S. Public Health Service grants R01 CA 42453 and R01 CA 62362 from the National Cancer Institute, Bethesda, Md. Address correspondence to K.S.B.

The use of a quasi-continuous subjective probability response scale has been recommended to improve receiver operating characteristic (ROC) curve fitting (1–6). The quasi-continuous response scale with many discrete categories may then be transformed to distribute the counts uniformly into fewer categories (3,4), thereby avoiding


data difficult to fit with the standard binormal ROC model because of binormal degenerate data sets (7–9). Metz (10) defines binormal degenerate data sets as those that result in exact-fit binormal ROC curves of inappropriate shape, a series of horizontal and/or vertical line segments in which the ROC curve crosses the chance line, with the probability of true-positive findings less than the probability of false-positive findings, P(FP). An alternative approach to the problem of binormal data degeneracy is to fit the data with a “proper” ROC model, one that does not permit inappropriate chance line crossing (11–13). An advantage of using a small number of discrete categories is that the observer rather than the experimenter determines the data counts within analyzed category boundaries. ROC operating points based on derived categories are convenient for extrapolation to regions of the ROC space that contain no data (14). The typical discrete-category scale with few categories uses semantic definitions, such as “definitely normal,” “probably normal,” “possibly abnormal,” “probably abnormal,” and “definitely abnormal.” It is sometimes possible to define a discrete-category scale in terms of medical actions, in which successively more aggressive treatment is associated with greater certainty in the presence of abnormality. We have argued previously that ROC operating points based on medical actions may be more valid than those that are not (14). By definition, ROC operating points defined by clinical actions are those used by physicians, whereas ROC operating points based on derived categories convenient for extrapolation to regions of the ROC space that contain no data may not be clinically meaningful. In this article, we compare the ROC data from a five-category discrete scale with that from a 101-category subjective probability scale to determine whether subjective probability categories provide better coverage of the domain of the ROC curve with ROC operating points than discrete-category ratings.

MATERIALS AND METHODS

Rating Data

The data for this investigation were taken from a pilot study on the search-based satisfaction of search effect in abdominal examinations (K.S.B., unpublished data). In that study, detection of 23 plain radiographic abnormalities in 44 patients with plain radiographic and contrast material–enhanced examinations was measured in 10 radiologist observers, who provided both a five-point confidence rating and a subjective probability rating of abnormality for each interpretation. The observers were instructed to indicate a discrete level of confidence that the feature was abnormal and a percentage likelihood that the feature was abnormal.
Confidence ratings were therefore both discrete (categorized with these terms: suspicious, but likely normal; possibly abnormal; probably abnormal; and definitely abnormal) and quasi-continuous (0%–100%). If no abnormalities were detected, a box was checked indicating “normal, no abnormalities to report.” In these cases the observer was still asked to give a percentage likelihood that an abnormality might be present. Location responses were used to determine which ratings applied to abnormalities demonstrated with contrast enhancement and which applied to those visible without enhancement. Because the former abnormalities were an experimental manipulation, responses to them were not included in the scoring. Analysis of these data found no significant differences in either the area under the ROC curves or the decision thresholds for detection of abnormalities on the plain radiographs and the unenhanced regions of contrast-enhanced studies. Given the lack of differences between the conditions, the data of the plain radiographs and contrast-enhanced studies were combined in the current investigation, yielding a case sample of 88 cases with 46 abnormalities.

Data Analysis

Two-way tables were constructed with column headings for the five categorical ratings and row headings for the subjective probability ratings used for normal and abnormal cases by each observer. Entries in the table indicated how often a particular five-category rating and a particular subjective probability rating were used together. Marginal sums were available for columns and rows and allowed ROC operating points to be computed and plotted for the five-category and 101-category scales. Of critical interest is how well the observed points covered the range of P(FP). In addition to plotting the ROC points, we fitted the subjective probability data with the standard binormal model (10) with a computer program, RSCORE 4.66 (ftp://perception.radiology.uiowa.edu). This version of RSCORE supports up to 20 rating categories. The number of categories was not a limitation, in that no observer used more than 17 of the 101 subjective probability categories. This analysis indicated how well the subjective probability ratings avoid inappropriate chance line crossings with standard binormal fits.

RESULTS

Figures 1 and 2 present the data for observers 1 and 2, respectively. Note that only 15 of the subjective probability categories were used by observer 1 and only nine were used by observer 2. These numbers are quite typical.
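The operating points plotted in these figures follow from cumulating the marginal category counts; a minimal sketch, using hypothetical counts rather than the study's data:

```python
# Hypothetical five-category counts (NOT the study's data): cases per
# confidence category, ordered from least to most suspicious.
normal_counts = [30, 7, 3, 1, 1]      # 42 normal cases
abnormal_counts = [5, 6, 10, 11, 14]  # 46 abnormal cases

def roc_points(neg, pos):
    """Turn rating-category counts into ROC operating points.

    Each category boundary acts as a decision threshold: cumulate counts
    from the most-suspicious category downward, so that
      P(FP) = fraction of normal cases rated at or above the boundary,
      P(TP) = fraction of abnormal cases rated at or above the boundary.
    """
    points = []
    cum_neg = cum_pos = 0
    for n, p in zip(reversed(neg), reversed(pos)):
        cum_neg += n
        cum_pos += p
        points.append((cum_neg / sum(neg), cum_pos / sum(pos)))
    return points

for p_fp, p_tp in roc_points(normal_counts, abnormal_counts):
    print(f"P(FP) = {p_fp:.3f}  P(TP) = {p_tp:.3f}")
```

Cumulating from the most-suspicious category guarantees that the points march monotonically toward (1, 1); the same computation applies whether the categories number 5 or 101.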
There is a very strong correlation between the five-category and the 101-category ratings, but each five-category rating is split into two or more categories of subjective probability. While more points are available from the subjective probability scale, most of them fall between those from the five-category scale. Figures 1 and 2 also show another major feature of the ROC points. One problem with discrete ratings is that




Figure 1. Five-category and 101-category responses for observer 1 are cross-tabulated, and ROC points are plotted for each scale. F = ROC points from the five-category data, E = ROC points from the 101-category data.

Figure 2. Five-category and 101-category responses for observer 2 are cross-tabulated, and ROC points are plotted for each scale. F = ROC points from the five-category data, E = ROC points from the 101-category data.

they sometimes leave large sections of the range of P(FP), the x axis of the ROC curve, without ROC operating points. Large sections without points allow curves that cross the chance line. Subjective probability ratings are supposed to fill in ROC points more evenly across P(FP) values so that regions without data do not occur. In Figures 1 and 2, both discrete and quasi-continuous rating scales show large sections of the P(FP) range without ROC points. Although sections without points are somewhat smaller for quasi-continuous ratings, the situation is not much improved. For none of the observers in our experiment did the subjective probability ratings fill in ROC points evenly or eliminate the problem of large sections of the P(FP) range without ROC points. There was always a major gap in the ROC points generated from the 101-point subjective probability scale. For the 10 observers, the 101-point subjective probability ratings left the following gaps in P(FP) values: 0.286–0.833



Figure 3. Five-category and 101-category responses for observer 5 are cross-tabulated, and ROC points are plotted for each scale. F = ROC points from the five-category data, E = ROC points from the 101-category data. The ROC curve, fitted to the 101-category data with the standard binormal model, inappropriately crosses the chance line.

(observer 1); 0.381–1.0 (observer 2); 0.096–0.786 (observer 3); 0.310–0.833 (observer 4); 0.333–1.0 (observer 5); 0.238–0.738 (observer 6); 0.381–0.929 (observer 7); 0.333–1.0 (observer 8); 0.119–0.738 (observer 9); and 0.500–0.881 (observer 10). Rather than being filled in evenly across the range of P(FP), the individual ROC points for subjective probability show large gaps, spanning 0.548, 0.619, 0.690, 0.524, 0.667, 0.500, 0.548, 0.667, 0.619, and 0.381 of the range. Thus, the average range that remains empty is 0.576 ± 0.030, nearly 60% of the range of P(FP). Seven observers (observers 1, 3, 4, 6, 7, 9, and 10) showed the type of data demonstrated in Figure 1, with one or two ROC points in the upper right corner of the ROC space to the right of the gap in the points. Figure 2 is representative of data for the other three observers (observers 2, 5, and 8), in which there were no such points. The ROC curve crossed the chance line at P(FP) = 0.59 for observer 2, P(FP) = 0.84 for observer 5, and P(FP) = 0.73 for observer 8. Figures 3 and 4 show the data for observers 5 and 8 with the ROC curve fitted to the data. For the three observers who had no operating points near the upper right corner, the fitted ROC curves crossed the chance line. For the seven observers who did have operating points near the upper right corner, the fitted ROC curves did not cross the chance line or crossed it so near P(FP) = 1 that the crossing had no effect on the estimated area under the ROC curve. These findings suggest that operating points near the upper right corner were critical in preventing the inappropriate chance line crossings. To test this hypothesis, we performed additional analyses on the data from the seven observers with operating points in the upper right corner and without chance line crossings. We could eliminate the points in the upper right corner by combining the adjacent categories that define the operating points. When these points were thus eliminated for the seven observers in whom ROC curves fitted to the original data did not cross the chance line, the new curves crossed the chance line for four of them, observers 4, 7, 9, and 10, at P(FP) values of 0.69, 0.87, 0.70, and 0.70, respectively. Figures 5 and 6 show the data for observers 9 and 10, respectively. The upper ROC plot in each figure shows the ROC curves fitted to the original subjective probability data. The lower ROC plot shows ROC curves fitted to the subjective probability data but with the final point eliminated by combining the final two categories. Categories 0% and 5% were combined to eliminate the upper-right-corner point in Figure 5, and categories 5% and 10% were combined to eliminate the one in Figure 6. Although the ROC curves fitted to the original data did not cross the chance line in these figures, those fitted to data without the upper-right-corner points crossed the chance line at values of P(FP) that substantially affect the estimated areas under the ROC curves.
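The mechanics of such a crossing are visible in the standard binormal model itself, which writes P(TP) = Φ(a + b·z), where z = Φ⁻¹(P(FP)). Unless the slope b equals 1, the curve meets the chance line P(TP) = P(FP) at z = a/(1 − b); the crossing is harmless only when it falls at an extreme value of P(FP). A minimal sketch, with parameter values chosen for illustration rather than fitted to these data:

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binormal_tpf(fpf_z, a, b):
    """Standard binormal ROC: P(TP) = Phi(a + b*z), z = Phi^-1(P(FP))."""
    return phi(a + b * fpf_z)

def chance_line_crossing(a, b):
    """P(FP) at which the binormal curve meets the chance line P(TP) = P(FP).

    Phi(a + b*z) = Phi(z) implies z = a / (1 - b), so every binormal curve
    with b != 1 crosses the chance line somewhere in (0, 1); the crossing is
    only innocuous when it lies beyond the observed range of P(FP).
    """
    if b == 1.0:
        return None  # curve parallel to the chance line in z-coordinates
    z = a / (1.0 - b)
    return phi(z)

# Hypothetical parameters (not fitted to the study's data): a modest
# intercept with slope b < 1 puts the crossing well inside (0, 1).
a, b = 0.8, 0.5
print(f"crossing at P(FP) = {chance_line_crossing(a, b):.3f}")
```

Beyond the crossing, P(TP) falls below P(FP), which is exactly the inappropriate behavior described above for curves lacking upper-right-corner operating points.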




Figure 4. Five-category and 101-category responses for observer 8 are cross-tabulated, and ROC points are plotted for each scale. F = ROC points from the five-category data, E = ROC points from the 101-category data. The ROC curve, fitted to the 101-category data with the standard binormal model, inappropriately crosses the chance line.

DISCUSSION

For these data, subjective probability ratings were ineffective in populating most of the range of P(FP) with ROC points. For three observers, curves fitted to the data with the standard binormal model inappropriately crossed the chance line. For another four, preclusion of inappropriate chance line crossings seemed dependent on one or two ROC points near the upper right corner of the ROC space, based primarily on boundaries between low ratings of subjective probability. It is pertinent to consider the judgments responsible for these ROC points in the upper right corner. They were defined by low subjective probability ratings that were mostly accompanied by the discrete rating “normal, no abnormality to report,” which included no localization response. The second discrete category, probably normal, did require localization of a potential abnormality. The second category should have collected judgments in which a specific image feature was considered suspicious. Thus, the low subjective probability ratings that define the points in the upper right corner are not based on specific locations considered suspicious, and the words used are “no abnormality to report.” However, if these low subjective probability ratings are not based on judgments about specific possible abnormalities, what might they be based on?


Figure 5. Five-category and 101-category responses for observer 9 are cross-tabulated. F = ROC points from the five-category data, E = ROC points from the 101-category data. On the lower ROC plot, the final category boundary for the 101-category data (0% vs 5%) has been deleted, with the two categories merged. The ROC curves were fitted to the 101-category data with the standard binormal model. The final ROC point for the 101-category scale is removed in the lower plot, and the ROC curve fitted to those data inappropriately crosses the chance line.

After the experiment, we showed observers the subjective probability ratings they gave when they used the discrete category “normal, no abnormality to report” on normal cases and asked them how they assigned the subjective probability ratings. Most observers had some difficulty answering. Some said that the subjective probability rating reflected their “gut feeling” of how insecure they felt about a case. One reported that “2%, I was pretty darn sure it was normal, 5% is my standard ‘I think it is normal and am not going to think about it anymore’ number, and 3% is between them, and the single case of 10% was probably one where I wanted to call it normal but there was some nagging suspicion that I couldn’t put a handle on.” Others seemed to base it on image quality or how diagnostic they considered the image. For example, one observer reported, “my answer (5% vs 10% vs 15%) depended on the amount of bowel gas (did it obscure any bony structures?) or was there complete coverage of the corners of the radiograph in terms of diaphragm and symphysis pubis? Is there contrast in the bowel obscuring a portion of the abdomen or is there a lot of air or stool in the bowel overlying a kidney?” These ratings and ROC points may indicate a different standard from the other rating categories. While the higher rating categories with location responses seem to reflect the certainty that an identifiable feature or location actually reflects a perceived medical abnormality, the lower certainty rating when the image is also called “normal, no abnormalities to report” may reflect concern about what the examination does not reveal, rather than what it does reveal. If a different image dimension was rated to produce the ROC points in the upper right corner, this may account for the substantial gap between those points and the others in terms of distance along the false-positive axis.

Whatever the source of the ROC points in the upper right corner, they were critical to the resistance of the standard binormal model to inappropriate chance line crossings. The fitted ROC curves for the three observers without upper-right-corner points all inappropriately crossed the chance line. When these points were eliminated in the seven observers for whom the ROC curves fitted to their original data did not inappropriately cross the chance line, four of the new ROC curves did inappropriately cross the chance line. The protection against inappropriate chance line crossing offered by subjective probability ratings was unreliable. When resistance to inappropriate chance line crossing was obtained, it seemed to depend on ROC operating points derived from discriminations along a stimulus dimension not directly related to apparent abnormality.

The data we used to compare the two rating scales might not be representative of all data in radiology. Our abnormalities were diverse, and scoring involved location dependency. Perhaps other radiology data do not yield this kind of distribution of ROC points for subjective probability rating or inappropriate chance line crossings of standard binormal ROC curves. Moreover, with larger case samples, additional ROC points might fill in some of the P(FP) range lacking in our data.




Figure 6. Five-category and 101-category responses for observer 10 are cross-tabulated. F = ROC points from the five-category data, E = ROC points from the 101-category data. On the lower ROC plot, the final category boundary for the 101-category data (5% vs 10%) has been deleted, with the two categories merged. The ROC curves were fitted to the 101-category data with the standard binormal model. The final ROC point for the 101-category scale is removed in the lower plot, and the ROC curve fitted to those data inappropriately crosses the chance line.
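The merging described in Figures 5 and 6 amounts to pooling the counts of two adjacent rating categories, which deletes the boundary between them and hence the ROC operating point that boundary defines. A minimal sketch, again with hypothetical counts rather than the study's data:

```python
# Hypothetical rating counts (NOT the study's data), ordered from the
# least-suspicious subjective probability category (e.g., 0%) upward.
normals = [30, 7, 3, 1, 1]
abnormals = [5, 6, 10, 11, 14]

def merge_adjacent(counts, i):
    """Pool category i with category i + 1 by summing their counts."""
    return counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]

# Merging the two least-suspicious categories (e.g., 0% with 5%) removes
# the boundary that generated the operating point nearest the upper right
# corner of the ROC space; every other operating point is unchanged.
merged_normals = merge_adjacent(normals, 0)
merged_abnormals = merge_adjacent(abnormals, 0)
print(merged_normals, merged_abnormals)  # [37, 3, 1, 1] [11, 10, 11, 14]
```

Refitting the binormal model to the merged counts corresponds to the lower plots in Figures 5 and 6: one fewer category means one fewer operating point to constrain the fit near the upper right corner.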

Our data are not the only example of clustered ROC points or inappropriate chance line crossing with subjective probability ratings. Swensson et al (15), analyzing subjective probability ratings of interstitial disease, pulmonary nodule, pneumothorax, alveolar infiltrate, and rib fracture on chest radiographs in 539 patients, found a large range of P(FP) without ROC points and inappropriate chance line crossing for some of their reader-disease rating combinations. For those data, there was a large case sample and nothing unusual about the scoring.

Even if subjective probability ratings provided ROC points where they are most needed for effective curve fitting, the necessary case sample size would be difficult to predict, as some observers could choose to use few subjective probability categories. We have argued elsewhere and maintain here that recommended minimum case sample sizes should depend on representative sampling and statistical precision rather than on limitations of ROC analysis (14).

Our analysis suggests that subjective probability ratings fitted with the standard binormal model may sometimes yield inappropriate chance line crossings with potentially reduced power to detect differences between experimental conditions (16,17). With the increasing availability of proper ROC models and software to fit them to rating-method ROC data (13,18–22), there may be less need for subjective probability ratings to prevent chance line crossings. The question of whether the subjective probability rating scale offers better data for fitting ROC curves than the discrete rating scale should be investigated further with data from ROC studies in radiology that include a variety of imaging and observer variables. We recommend, therefore, that other investigators collect both discrete and subjective probability ratings to provide the data needed for this purpose.

REFERENCES

1. Rockette HE, Gur D, Metz CE. The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques. Invest Radiol 1992; 27:169–172.
2. King JL, Britton CA, Gur D, Rockette HE, Davis PL. On the validity of the continuous and discrete confidence rating scales in receiver operating characteristic studies. Invest Radiol 1993; 28:962–963.
3. Metz CE, Shen JH, Herman BA. New methods for estimating a binormal ROC curve from continuously-distributed test results. Presented at the 1990 annual meeting of the American Statistical Association, Anaheim, Calif, August 7, 1990.
4. Metz CE, Herman BA, Shen JH. Maximum-likelihood estimation of ROC curves from continuously-distributed data. Stat Med 1998; 17:1033–1053.
5. Houn FH, Bright RA, Bushar HF, et al. Study design in the evaluation of breast cancer imaging technologies. Acad Radiol 2000; 7:684–692.
6. Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: some quantitative considerations. Acad Radiol 2001; 8:328–334.
7. Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals: rating method data. J Math Psych 1969; 6:487–496.
8. Dorfman DD. RSCORE II. In: Swets JA, Pickett RM, eds. Evaluation of diagnostic systems: methods from signal detection theory. New York, NY: Academic Press, 1982; 212–232.
9. Dorfman DD, Berbaum KS. Degeneracy and discrete ROC rating data. Acad Radiol 1995; 2:907–915.
10. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24:234–245.

11. Egan JP. Signal detection theory and ROC analysis. New York, NY: Academic Press, 1975; 37–38, 195, 266–268.
12. Metz CE. “Proper” binormal ROC curves: theory and maximum-likelihood estimation. Presented at the Sixth Farwest Image Perception Conference, Philadelphia, Pa, October 13–15, 1995.
13. Dorfman DD, Berbaum KS, Metz CE, Lenth RV, Hanley JA, Abu-Dagga H. Proper receiver operating characteristic analysis: the bigamma model. Acad Radiol 1997; 4:138–149.
14. Dorfman DD, Berbaum KS, Brandser EA. A contaminated binormal model for ROC data. I. Some interesting examples of binormal degeneracy. Acad Radiol 2000; 7:420–426.
15. Swensson RG, King JL, Gur D. A constrained formulation for the receiver operating characteristic (ROC) curve based on probability summation. Med Phys 2001; 28:1597–1609.
16. Berbaum KS, Dorfman DD, Franken EA Jr, Caldwell RT. Proper ROC analysis and joint ROC analysis of the satisfaction of search effect in chest radiography. Acad Radiol 2000; 7:945–958.
17. Berbaum KS, Brandser EA, Franken EA Jr, Dorfman DD, Caldwell RT, Krupinski EA. Gaze dwell times on acute trauma injuries missed because of satisfaction of search. Acad Radiol 2001; 8:304–314.
18. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996; 23:1709–1725.
19. Pan X, Metz CE. The “proper” binormal model: parametric receiver operating characteristic curve estimation with degenerate data. Acad Radiol 1997; 4:380–389.
20. Metz CE, Pan X. Proper binormal ROC curves: theory and maximum-likelihood estimation. J Math Psych 1999; 43:1–33.
21. Dorfman DD, Berbaum KS. A contaminated binormal model for ROC data. II. A formal model. Acad Radiol 2000; 7:427–437.
22. Dorfman DD, Berbaum KS. A contaminated binormal model for ROC data. III. Initial evaluation with detection ROC data. Acad Radiol 2000; 7:438–447.
