Estimation and Comparison of CAD System Performance in Clinical Settings¹

Hans Bornefalk, MSc

Rationale and Objectives. Computer-aided detection (CAD) systems are frequently compared using free-response receiver operating characteristic (FROC) curves. While there are ample statistical methods for comparing FROC curves, when one is interested in comparing the outcomes of 2 CAD systems applied in a typical clinical setting, there is the additional matter of correctly determining the system operating point. This article shows how the effect of the sampling error on determining the correct CAD operating point can be captured. By incorporating this uncertainty, a method is presented that allows estimation of the probability with which a particular CAD system performs better than another on unseen data in a clinical setting.

Materials and Methods. The distribution of possible clinical outcomes from 2 artificial CAD systems with different FROC curves is examined. The sampling error is captured by the distribution of possible system thresholds of the classifying machine that yields a specified sensitivity. After introducing a measure of superiority, the probability of one system being superior to the other can be determined.

Results. It is shown that for 2 typical mammography CAD systems, each trained on independent representative datasets of 100 cases, the FROC curves must be separated by 0.20 false positives per image in order to conclude that there is a 90% probability that one is better than the other in a clinical setting. Also, there is no apparent gain in increasing the size of the training set beyond 100 cases.

Discussion. CAD systems for mammography are modeled for illustrative purposes, but the method presented is applicable to any computer-aided detection system evaluated with FROC curves. The presented method is designed to construct confidence intervals around possible clinical outcomes and to assess the importance of training set size and separation between FROC curves of systems trained on different datasets.

Key Words. CAD; performance evaluation; sampling error; confidence interval; mammography; operating point estimation.

© AUR, 2005

Acad Radiol 2005; 12:687–694. doi:10.1016/j.acra.2005.02.005

¹ From the Royal Institute of Technology, AlbaNova University Center, Department of Physics, SE-106 91 Stockholm, Sweden. Received November 5, 2004; revision received January 1, 2005; accepted February 3. Address correspondence to H.B. e-mail: [email protected]

Free-response receiver operating characteristic (FROC) curves (1,2) have long been used to illustrate the performance of mammography computer-aided detection (CAD) systems (3,4). The recommended methods (5) of comparing CAD systems are the same as those for comparisons between imaging modalities: FROCFIT (6), AFROC (7,8), or JAFROC (9). The particular choice depends on which assumptions one is willing to make about the distribution of true positive (tp) and false positive (fp) marks. As this article does not address the issue of comparing FROC curves per se, we take a somewhat different approach to assessing the performance difference between 2 CAD systems.
Instead of making a maximum likelihood estimation of a pair of underlying parameters most likely to generate the observed combination of tp's and fp's, and thus obtaining some summary index with error margins like A1 in the AFROC methodology, we ask the related but different question: how likely is it that the outcome in a clinical setting of one system is better than that of the other? This approach is different in 2 ways. First and most obviously, it allows explicit modeling of the clinical circumstances in which the CAD systems are to be used, and this affects the estimated probabilities; second and perhaps more subtly, it captures the inherent uncertainty in determining the detection threshold that will yield a desired operating point on unseen cases.

The FROC curve is generated by combining the true positive fractions and false positive markings per image on a certain training set for different detection thresholds (3,10). The FROC curve is thus parameterized by the detection threshold, and this system parameter must be determined before the system can be used in a clinical application. (The same applies to any other system parameters that greatly affect the trade-off between tp's and fp's (11,12), but in this article we will assume that one parameter essentially sweeps out the entire FROC curve.) Since the system threshold can only be estimated on the limited training sample available to the CAD manufacturer, this introduces a sampling error. The sampling error will affect both the location of the FROC curve and its parameterization. This situation is similar to the one with human readers in ROC studies of medical imaging: there is no unique operating point in ROC space; there is not even a unique ROC curve. Instead there is a spread of operating points both in the across-curve direction and along the ROC curve direction (13).

In this article it is shown how the sampling error's effect along the direction of the FROC curve can be captured by estimating the distribution of system thresholds that yield a desired operating point, using a method first proposed in reference (14). In that sense, this method regards the FROC curve location as given: one could have applied a maximum likelihood procedure for fitting a FROC curve to the data, or just used the empirically obtained curve. It is the method by which we estimate the uncertainty in the direction along the FROC curve that gives us the necessary means for constructing confidence intervals for typical clinical outcomes around the FROC curve.

The method is developed by analyzing 2 artificial CAD systems with different FROC curves, where the first clearly outperforms the second on a certain dataset. Since there is no clear ordering of points in 2 dimensions, some measure of superiority must be introduced before the probability can be calculated that a typical clinical outcome from one system is superior to that of another. With the model developed, we examine how these probabilities vary with the training sample size and the separation between the FROC curves. We have chosen an "average" mammography CAD system for illustrative purposes, but the method presented applies to all CAD systems that are evaluated using FROC curves and where the distributional assumptions made are fulfilled.

MATERIALS AND METHODS

Two Artificial CAD Systems

Let the FROC curve be parameterized by θ, which could be regarded as the decision threshold of some classifying machine. In our analysis we use the parameterization below to resemble real mammography CAD systems, but in a real-world application these equations could be replaced by the empirical or fitted FROC curves. For system 1, the sensitivity is denoted y_1(θ) and is given by

y_1(\theta) = e^{-1/(a \cdot \theta)^b},    (1)

where θ ∈ (0, ∞). The number of false positives per image is denoted x_1(θ) and modeled by

x_1(\theta) = \max(0, \theta - \beta).    (2)

The FROC curve for system 2, the locus of points (x_2(θ), y_2(θ)), is given by

y_2(\theta) = y_1(\theta),    (3)

x_2(\theta) = \max(0, x_1(\theta) + \alpha).    (4)

Here α is the separation in false positives between the 2 systems at a given sensitivity. With a = 7/3, b = 3, β = 0.3, and α = 0.2, the resulting FROC curves are shown in Figure 1. At 90% sensitivity, there are 0.61 and 0.81 false positive markings per image, respectively. We assume that these FROC curves are representative of the entire population, ie, that there exists no systematic bias in the training dataset. We also assume that the detection criteria are the same for the 2 systems; since reported performance is so highly dependent on scoring protocol (15,16), any comparison between systems with different detection criteria would be futile.

Figure 1. System 1 is everywhere better than system 2 on a particular training set. At any given sensitivity, the number of false positive markings per image in system 1 is α = 0.2 less than in system 2.
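To make the parameterization concrete, the short Python sketch below (our illustration, not code from the original article; NumPy is assumed and the names are ours) evaluates Equations 1-4 and recovers the operating points quoted above.

```python
import numpy as np

a, b, beta, alpha = 7 / 3, 3, 0.3, 0.2

def y1(theta):
    """Sensitivity of system 1 (Equation 1)."""
    return np.exp(-1.0 / (a * theta) ** b)

def x1(theta):
    """False positives per image of system 1 (Equation 2)."""
    return np.maximum(0.0, theta - beta)

def x2(theta):
    """False positives per image of system 2 (Equation 4); y2 = y1 (Equation 3)."""
    return np.maximum(0.0, x1(theta) + alpha)

# Threshold giving 90% population sensitivity (inverting Equation 1)
theta_star = (1 / a) * (-1 / np.log(0.90)) ** (1 / b)
print(round(float(theta_star), 2), round(float(x1(theta_star)), 2), round(float(x2(theta_star)), 2))
# -> 0.91 0.61 0.81
```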

Assumptions

To be able to construct confidence intervals, or rather confidence areas, around the FROC curve for possible clinical outcomes, some distributional assumptions must be made. These assumptions are the same as those made in many curve-fitting methods for observer studies (2,6), namely that true positive markings are binomially distributed, that false positive markings per image are Poisson distributed, and that these are independent of each other. These assumptions have, however, received some criticism (17,18). For instance, the Poisson assumption is most certainly faulty, since it places positive probability on there being more false positive markings in an image than there are pixels. However, within reasonable bounds, this assumption is more likely to be fulfilled for automated readers than for observer studies (19). Furthermore, there is no psychological satisfaction effect preventing an algorithm from continuing its search after having made the first mark; thus an algorithm is more likely to have stationary performance than a human observer (18). This makes the independence assumption more plausible. As we shall see, however, the independence assumption has a very limited effect on the presented results.

Given a certain threshold θ, yielding a detection probability of y(θ), the number of cancers actually detected in a population of n cancerous breasts is assumed to be binomially distributed, bin(n, y(θ)). A rescaling yields the distribution of the true positive fraction y given the threshold level:

P(y = k/n \mid \theta) = P(ny = k \mid \theta) = \binom{n}{k} y(\theta)^k (1 - y(\theta))^{n-k},    (5)

where k = 0, 1, …, n is the number of cancers detected.

Similarly, the distribution of false positive markings per image is assumed to be Po(x(θ)), since for each image in an unseen case the number of false positive markings can be k = 0, 1, 2, … and the expected value is x(θ). Since the Poisson distribution is additive, the total number of false positive markings in m normal images would then be Po(mx), and a simple rescaling yields the distribution of false positives per image, x:

P(x = k/m \mid \theta) = P(mx = k \mid \theta) = e^{-m x(\theta)} (m x(\theta))^k / k!    (6)
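As an illustration of the two sampling assumptions, the following sketch (ours, not the author's; NumPy is assumed and the helper name is hypothetical) draws clinical outcomes from the binomial and Poisson models of Equations 5 and 6 for a fixed threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_outcome(y_theta, x_theta, n_cancers, m_images, size=1):
    """Draw (sensitivity, fp/image) outcomes for a fixed threshold theta.

    Detected cancers follow the binomial model of Equation 5 and the total
    number of false positive marks follows the Poisson model of Equation 6.
    """
    tp = rng.binomial(n_cancers, y_theta, size=size)      # cancers detected
    fp = rng.poisson(m_images * x_theta, size=size)       # total fp marks
    return tp / n_cancers, fp / m_images

# Example: a threshold tuned to 90% sensitivity and 0.61 fp/image (system 1),
# evaluated on 75 cancers and 60,000 normal images.
sens, fppi = sample_outcome(0.90, 0.61, n_cancers=75, m_images=60_000, size=5)
print(sens, fppi)
```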

Wherever appropriate, we use continuous approximations and then denote the probability density functions above f_{y|θ} and f_{x|θ}, respectively.

If one is reluctant to make any distributional assumptions, one can still apply the method presented in this article. In reference (14) it is shown how the necessary distributions can be estimated with a nonparametric method based on bootstrapping the leave-one-out data generated in the training and evaluation of a CAD system. That approach produces confidence intervals for the number of false positive markings per image at a given expected sensitivity similar to those obtained under the above distributional assumptions. The confidence intervals in reference (14) are estimated for a real CAD system, again indicating that the distributional assumptions are reasonable.

Effect of Sampling Error

Now assume that the CAD manufacturer has determined that 90% sensitivity is the goal to aim for in a certain population. For a given CAD system there is one and only one value of θ that would yield 90% sensitivity over this population. However, the system will be trained on only a subset of the population, and this introduces a sampling error. The manufacturer will have to set the decision threshold θ* yielding 90% sensitivity based on the available sample. For instance, for systems 1 and 2 above, the threshold yielding 90% sensitivity for the entire population is θ* = (1/a)(−1/ln(0.90))^{1/b} = 0.91 from Equation 1, but it is not at all clear that θ* is the threshold that yields 90% sensitivity, or the same number of false positives per image, when the system is evaluated on some other subset. This problem is pointed out by Yoshida et al (20), but no solution is proposed. Given f_{y|θ} from Equation 5, Bayes's rule can be used to determine the distribution of θ's yielding the desired sensitivity y*:

f_{\theta \mid y=y^*} = \frac{f_\theta \, f_{y=y^* \mid \theta}}{\int_0^\infty f_\theta \, f_{y=y^* \mid \theta} \, d\theta}.    (7)

Figure 2. The sampling error's effect on determining the correct threshold.

Figure 3. 99% confidence interval for possible outcomes in a clinical setting after one year. The threshold θ is set to yield an average sensitivity of 90%.

The prior f_θ is the parameterization of the FROC curve and is assumed uniform over the interval of interest. In Figure 2 the resulting distributions f_{θ|y=y*} are shown for y* = 90% for different realistic choices of the sample size n (many CAD research articles present FROC curves for systems trained on 50-150 cases). The distributions show that there is a considerable effect of the sampling error on correctly determining the system threshold.

Distribution of Possible Clinical Outcomes

What is the distribution of possible outcomes in a typical clinical setting if the system is evaluated after, say, one year? With 300 women receiving their mammographic screening examination weekly, we get about 15,000 patients per year. With 2 projections (mediolateral oblique and craniocaudal) of each breast, this gives us m = 60,000 images per year. The expected number of cancer patients will vary with the age of the population invited to screening, but a reasonable incidence rate to assume is 0.5% (21), which translates to 75 cancers per year. Now assume that a certain system threshold θ has been determined. If the distributions of false positives per image and true positive fraction are assumed to be independent for a given θ, an assumption almost trivially fulfilled at an incidence rate of 0.5% since almost all images contain only false positive marks (if any), the joint probability of false positive marks per image and sensitivity is given by

f_{xy|\theta} = f_{x|\theta} \, f_{y|\theta}.    (8)
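A minimal sketch of Equation 8, assuming SciPy and the clinical numbers introduced above (m = 60,000 images, n = 75 cancers); the grid truncation and variable names are our own choices.

```python
import numpy as np
from scipy import stats

a, b, beta = 7 / 3, 3, 0.3
m, n = 60_000, 75            # normal images and cancers per year
theta = 0.91                 # threshold giving ~90% population sensitivity

y_theta = np.exp(-1.0 / (a * theta) ** b)   # expected sensitivity (Equation 1)
x_theta = max(0.0, theta - beta)            # expected fp/image (Equation 2)

# Binomial pmf over the possible sensitivities k/n (Equation 5)
k_tp = np.arange(n + 1)
p_sens = stats.binom.pmf(k_tp, n, y_theta)

# Poisson pmf over total fp counts, restricted to a +-6 sigma window (Equation 6)
mu = m * x_theta
k_fp = np.arange(int(mu - 6 * np.sqrt(mu)), int(mu + 6 * np.sqrt(mu)))
p_fppi = stats.poisson.pmf(k_fp, mu)

# Joint probability of (fp/image, sensitivity) for this fixed theta (Equation 8):
# under the independence assumption it is the outer product of the marginals.
joint = np.outer(p_fppi, p_sens)
fppi_axis, sens_axis = k_fp / m, k_tp / n
print(joint.sum())           # ~1 (probability mass inside the truncation window)
```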

With m = 60,000 we can use a normal approximation for f_{x|θ} in Equation 6. For f_{y|θ} we use n = 75 in Equation 5. For the (unknown, but correct) choice θ = 0.91, the distribution of possible outcomes is illustrated in Figure 3. Note that the variability in false positives per image is much smaller than in the sensitivity, owing to the larger sample size on which this statistic is calculated (60,000 images as opposed to 75 cancers for the sensitivity). However, as we have seen from Figure 2, there is considerable uncertainty in determining the threshold that yields a specified sensitivity (here we assume that the system is tuned to yield 90% sensitivity). By multiplying Equations 8 and 7 and integrating over θ, we get the joint distribution of false positives and sensitivity, conditional on the desired sensitivity (which is how we have assumed these systems are applied):

f_{xy|y=y^*} = \int f_{xy|\theta} \, f_{\theta|y=y^*} \, d\theta.    (9)

This distribution is depicted in Figure 4 when it is assumed that the CAD system is trained on 100 samples (n = 100 in Equations 5 and 7). Note how spread out the two distributions are and how much they overlap. As all of the remaining analysis is based on the validity of the distribution f_{xy|y=y*}, it is verified by simulation in the Appendix.
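The following sketch (our own illustration under the article's assumptions; the grid, seed, and names are arbitrary) chains Equations 5, 7, and 9: it forms the posterior over θ given a training set of n = 100 cases showing 90% sensitivity, and then draws clinical outcomes from the resulting mixture.

```python
import numpy as np
from scipy import stats

a, b, beta = 7 / 3, 3, 0.3
n_train, y_star = 100, 0.90             # training cases and target sensitivity

def y_of(theta):                        # Equation 1
    return np.exp(-1.0 / (a * theta) ** b)

def x_of(theta):                        # Equation 2
    return np.maximum(0.0, theta - beta)

# Posterior over theta given 90% observed training sensitivity (Equation 7),
# with a uniform prior on a grid covering the interval of interest.
theta_grid = np.linspace(0.3, 2.0, 2000)
k_obs = round(y_star * n_train)                              # 90 detected cancers
lik = stats.binom.pmf(k_obs, n_train, y_of(theta_grid))      # f_{y=y*|theta}
post = lik / lik.sum()                                       # f_{theta|y=y*} on the grid

# Clinical outcomes marginalized over the threshold uncertainty (Equation 9):
# draw theta from the posterior, then an outcome from Equations 5 and 6.
rng = np.random.default_rng(1)
m, n_cancer, runs = 60_000, 75, 10_000
theta_draws = rng.choice(theta_grid, size=runs, p=post)
sens = rng.binomial(n_cancer, y_of(theta_draws)) / n_cancer
fppi = rng.poisson(m * x_of(theta_draws)) / m
print(sens.mean(), fppi.mean())         # roughly 0.9 sensitivity and 0.6 fp/image
```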

RESULTS

The Pareto Optimality Criterion

If we want to determine the likelihood that system 1 is better than system 2, we must first define what "better" means.


Figure 4. Distribution of typical clinical outcomes accounting for the sampling error's effect on correctly determining the system threshold θ that yields an expected sensitivity of 90%. Confidence intervals (areas) marked at 90%, 95%, and 99%.

In 2 dimensions, there is no clear ordering of points. The Pareto criterion of optimality is perhaps the most uncompromising definition and states that a combination of variables is Pareto dominant over another if at least one variable is superior (in the easily defined one-dimensional sense) and the others are not inferior. Applied to the distribution of clinical outcomes, this means that a point from system 1 is better, in the Pareto sense, than one from system 2 if it has either a higher sensitivity and an equivalent or lower number of false positive markings, or a lower number of false positive markings and an equivalent or higher sensitivity. The comparison between 2 points, where one has a higher sensitivity and a simultaneously higher number of false positives, is indeterminate. Equipped with this definition, we can answer the question of how likely it is that the outcome in a clinical setting of system 1 is better than that of system 2 in the Pareto sense:

p = \int_{x=0}^{\infty} \int_{y=0}^{1} {}^{1}f_{xy|y=y^*}(x, y) \cdot h(x, y) \, dx \, dy.    (10)

ⁱf_{xy|y=y*}(x, y) is the probability that the outcome from system i is (x, y) when the threshold is determined so as to yield an expected sensitivity of y*. h(x, y) is the fraction of outcomes from system 2 that (x, y) is Pareto superior to, ie, all combinations of points with a higher number of false positive markings per image, x' ∈ (x, ∞), and simultaneously a lower sensitivity, y' ∈ (0, y):

h(x, y) = \int_{x'=x}^{\infty} \int_{y'=0}^{y} {}^{2}f_{xy|y=y^*}(x', y') \, dx' \, dy'.    (11)
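A Monte Carlo reading of Equations 10-11 (our sketch, not the author's code): given independent samples of clinical outcomes from the two systems, the fraction of sample pairs in which the system-1 outcome Pareto-dominates the system-2 outcome estimates p. The outcome samples below are fabricated purely to show the mechanics.

```python
import numpy as np

def pareto_superior_prob(x1, y1, x2, y2):
    """Monte Carlo estimate of Equation 10: the fraction of (outcome-1, outcome-2)
    pairs in which outcome 1 Pareto-dominates outcome 2, i.e. has no more false
    positives per image and no lower sensitivity, with at least one strict
    inequality."""
    x1, y1, x2, y2 = map(np.asarray, (x1, y1, x2, y2))
    no_worse = (x1[:, None] <= x2[None, :]) & (y1[:, None] >= y2[None, :])
    strictly = (x1[:, None] < x2[None, :]) | (y1[:, None] > y2[None, :])
    return (no_worse & strictly).mean()

# Fabricated outcome samples; in practice these would be draws of f_{xy|y=y*}
# for each system, as in the earlier sketch.
rng = np.random.default_rng(2)
x_sys1, y_sys1 = rng.normal(0.61, 0.05, 2000), rng.normal(0.90, 0.03, 2000)
x_sys2, y_sys2 = rng.normal(0.81, 0.05, 2000), rng.normal(0.90, 0.03, 2000)
print(pareto_superior_prob(x_sys1, y_sys1, x_sys2, y_sys2))
```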

Figure 5. Probability of the clinical outcome of one system being better than another as a function of training dataset size (n) and separation in false positive markings per image (α). The Pareto definition of optimality is used.

For the distributions ¹f_{xy|y=y*}(x, y) and ²f_{xy|y=y*}(x, y) depicted in Figure 4, the probability that an outcome from system 1 is better than one from system 2 is p = 0.455. The distributions f_{xy|y=y*}(x, y) depend on the training sample size n, and the probability that an outcome from system 1 is better than one from system 2 is thus also dependent on n. In Figure 5 we plot the dependence of p on n and on the separation between the FROC curves (α in Equation 4). For a separation of 0.3 fp/im between the FROC curves of the 2 systems, the probability that the "better" CAD system actually performs better in a clinical setting, when it is trained on 50 cases, is only 44%. Also, the curves level out at a value that depends only on the asymmetry of the distributions ⁱf_{xy|y=y*}(x, y) around the line y = y*, and not on the separation α. This is due to the inability of the Pareto criterion to make a trade-off between the 2 variables. For instance, the comparison between a point with a sensitivity of 90% and 0.6 fp/im and one with 91% sensitivity and 2 false positives per image would be indeterminate. In the next section, we propose a method to remedy this.

First, a subtle point on correlation²: if 2 algorithms are trained on the same dataset, the estimated distributions of system thresholds of the 2 systems will be correlated, and thus the distributions ¹f_{xy|y=y*} and ²f_{xy|y=y*} are not independent. This invalidates the approach of Equation 10, and that is why the presented method should only be used when comparing systems evaluated on independent datasets.

² The author is grateful to an anonymous referee for pointing this out.

FROC Slope Trade-off

The slope of the FROC curve at the chosen operating point defines the marginal trade-off between sensitivity and false positives per image that the user is de facto making. This trade-off can be used when comparing different combinations of sensitivity and false positives. Our alternative method for ranking 2 points depends on which side of a line through the first point with direction (ẋ_1(θ*), ẏ_1(θ*)) the second point lies. If point 2 lies to the right of this line, it is inferior to point 1, and vice versa. The probability that an outcome of system 1 is superior to that of system 2 is then given by

p = \int_{x=0}^{\infty} \int_{y=0}^{1} {}^{1}f_{xy|y=y^*}(x, y) \cdot \tilde{h}(x, y) \, dx \, dy,    (12)

where h̃(x, y) is the fraction of points from system 2 that the chosen point from system 1 is better than. With I being the indicator function, I(x) = 1 if x > 0 and zero otherwise, w = (x' − x, y' − y), and n = (−ẏ(θ*), ẋ(θ*)), the normal to the FROC curve at the chosen operating point, we get:

\tilde{h}(x, y) = \int_{x'=0}^{\infty} \int_{y'=0}^{1} {}^{2}f_{xy|y=y^*}(x', y') \, I(\mathbf{w} \cdot \mathbf{n}) \, dx' \, dy'.    (13)

With this new criterion, the probability that an outcome from system 1 is better than one from system 2 (as depicted in Figure 4) in a clinical setting is 0.89. The dependence of this probability on the training set size n and on α is illustrated in Figure 6. Note especially that, by introducing a trade-off between false positives and sensitivity, the derivatives dp/dα are everywhere strictly positive.

Figure 6. Probability of the clinical outcome of one system being better than another as a function of training dataset size (n) and separation in false positive markings per image (α). Trade-off between sensitivity and false positives as defined by the slope of the FROC curve at the chosen operating point.

The interpretation of the curves is the following. If the CAD system has been trained on 50 cases, the separation between 2 FROC curves must be at least 0.24 false positive markings per image to allow the conclusion that there is a 90% probability that system 1 is better than system 2 in a clinical setting. If the systems had been trained on 100 cases, a separation of 0.21 fp/im would be sufficient to draw the same conclusion. The graph also shows that there is limited gain, from a sampling error point of view, in increasing the training dataset size beyond 100 cases.
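A sketch of the slope-based criterion of Equations 12-13 (our illustration, not the author's code; the sign convention follows the textual description that a point lying to the right of the line counts as inferior, and the fabricated sample inputs stand in for draws of f_{xy|y=y*}).

```python
import numpy as np

a, b, beta = 7 / 3, 3, 0.3
theta_star = (1 / a) * (-1 / np.log(0.90)) ** (1 / b)     # operating point

# Direction of the FROC curve at the operating point: derivatives of
# Equations 1 and 2 with respect to theta (theta_star > beta, so x_dot = 1).
x_dot = 1.0
y_dot = np.exp(-1.0 / (a * theta_star) ** b) * b / (a ** b * theta_star ** (b + 1))

def slope_superior_prob(x1, y1, x2, y2):
    """Monte Carlo version of Equations 12-13: count an outcome of system 2 as
    inferior when it lies to the right of the line through the system-1
    outcome with direction (x_dot, y_dot), as described in the text."""
    w_x = np.asarray(x2)[None, :] - np.asarray(x1)[:, None]
    w_y = np.asarray(y2)[None, :] - np.asarray(y1)[:, None]
    cross = x_dot * w_y - y_dot * w_x     # > 0: point 2 left of the line, < 0: right
    return (cross < 0).mean()

# Same kind of fabricated samples as in the Pareto sketch, for demonstration:
rng = np.random.default_rng(2)
x_sys1, y_sys1 = rng.normal(0.61, 0.05, 2000), rng.normal(0.90, 0.03, 2000)
x_sys2, y_sys2 = rng.normal(0.81, 0.05, 2000), rng.normal(0.90, 0.03, 2000)
print(slope_superior_prob(x_sys1, y_sys1, x_sys2, y_sys2))
```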


DISCUSSION

This article has described a method for constructing confidence intervals for possible clinical outcomes of mammography CAD systems. The focus has been on a problem confronting the CAD manufacturer before a system can be used in a clinical setting: the necessity of determining the appropriate operating point by setting a detection threshold. The limited number of training samples introduces a sampling error that affects the choice of operating point. By incorporating this effect, confidence intervals for clinical outcomes around desired operating points of FROC curves can be constructed. If the training datasets of 2 CAD systems are independent of each other and considered representative of the underlying population, the estimated confidence intervals lend themselves to estimation of the probability that one CAD system will outperform another in a typical clinical setting.

In order to estimate the probability that one system will perform better than another, some measure of superiority is necessary to rank possible outcomes. We propose to use the slope of the FROC curve at the desired operating point as the trade-off when comparing combinations of sensitivity and false positive markings. The estimated probability will depend on both the separation between the FROC curves and the size of the training sets, and this allows us to draw 2 main conclusions. Firstly, for typical training set sizes of 50-150 cases, the separation between the FROC curves from 2 typical mammography CAD systems (at 90% sensitivity) must be about 0.2 false positives per image to be able to draw the conclusion that one is better than the other at the 90% confidence level. Secondly, the effect of increasing the dataset beyond 100 cases is very limited in terms of increased probability that one system outperforms the other in a clinical setting.


Throughout this article, we have chosen an "average" mammography CAD system to illustrate the method. However, the method presented is general to all CAD systems that are evaluated using the FROC methodology. We hope that the presented method for translating the difference between 2 FROC curves into a probability that one system will actually outperform the other in a clinical setting will be useful to other researchers.

REFERENCES

1. Egan JP, Greenberg GZ, Schulman AI. Operating characteristics, signal detectability and the method of free response. J Acoust Soc Am 1961; 33:993–1007.
2. Bunch PC, Hamilton JF, Sanderson GK, Simmons AH. A free-response approach to the measurement and characterization of radiographic-observer performance. J Appl Photogr Eng 1978; 4:166–171.
3. Giger ML. Current issues in CAD for mammography. In: Doi K, Giger ML, Nishikawa RM, Schmidt RA, eds. Digital Mammography '96. Philadelphia, Pa: Elsevier Science, 1996; 53–59.
4. Metz CE. Evaluation of CAD methods. In: Doi K, MacMahon H, Giger ML, Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam: Elsevier Science B.V., 1999; 543–554.
5. Chakraborty DP. The FROC, AFROC and DFROC variants of the ROC analysis. In: Beutel J, Kundel HL, Van Metter RL, eds. Handbook of Medical Imaging. Vol 1: Physics and Psychophysics. Bellingham, Wa: SPIE Press, 2000; 786–788.
6. Chakraborty DP. Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data. Med Phys 1989; 16:561–568.
7. Chakraborty DP, Winter HL. Free-response methodology: alternate analysis and a new observer-performance experiment. Radiology 1990; 174:873–881.
8. Chakraborty DP. Statistical power in observer-performance studies: comparison of the receiver operating characteristic and free-response methods in tasks involving localization. Acad Radiol 2002; 9:147–156.
9. Chakraborty DP, Berbaum K. Observer studies involving detection and localization: modeling, analysis and validation. Med Phys 2004; 31:2313–2330.
10. Bowyer KW. Validation of medical image analysis techniques. In: Sonka M, Fitzpatrick JM, eds. Handbook of Medical Imaging. Vol 2: Medical Image Processing and Analysis. Bellingham, Wa: SPIE Press, 2000; 574.
11. Woods K, Bowyer KW. Generating ROC curves for artificial neural networks. IEEE Trans Med Imaging 1997; 16:329–337.
12. Anastasio MA, Kupinski MA, Nishikawa RM. Optimization and FROC analysis of rule-based detection schemes using a multiobjective approach. IEEE Trans Med Imaging 1998; 17:1089–1093.
13. Beam C, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996; 156:209–213.
14. Bornefalk H, Hermansson A. On the comparison of FROC curves in mammography CAD systems. Med Phys 2005; 32:412–417.
15. te Brake G, Karssemeijer N. Detection criteria for evaluation of computer diagnosis systems. Proc IEEE 1996; 3:1157–1158.
16. Nishikawa RM, Yarusso LM. Variations in measured performance of CAD schemes due to database composition and scoring protocol. Proc SPIE 1998; 3338:840–844.
17. Metz CE. Evaluation of digital mammography by ROC analysis. In: Doi K, Giger ML, Nishikawa RM, Schmidt RA, eds. Digital Mammography '96. Amsterdam: Elsevier Science B.V., 1996; 61–68.
18. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996; 23:1709–1725.
19. Edwards DC, Kupinski MA, Metz CE, Nishikawa RM. Maximum likelihood fitting of FROC curves under an initial-detection-and-candidate-analysis model. Med Phys 2002; 29:2861–2870.
20. Yoshida H, Anastasio MA, Nagel RH, Nishikawa RM, Doi K. Computer-aided diagnosis for detection of clustered microcalcifications in mammograms: automated optimization of performance based on genetic algorithm. In: Doi K, MacMahon H, Giger ML, Hoffmann KR, eds. Computer-Aided Diagnosis in Medical Imaging. Amsterdam: Elsevier Science B.V., 1999; 247–252.
21. Ries LAG, Eisner MP, Kosary CL, et al, eds. SEER Cancer Statistics Review, 1975–2001. Bethesda, Md: National Cancer Institute, 2004. http://seer.cancer.gov/csr/1975_2001/
22. Chakraborty DP. Proposed solution to the FROC problem and an invitation to collaborate. Proc SPIE Medical Imaging 2003; 5034:204–212.

APPENDIX

Experimental Verification

To validate the spread-out distribution of possible clinical outcomes f_{xy|y=y*} presented in Figure 4, we use the XFROC model for simulating FROC data (8,22). Let n_{i,j} denote the classification machine output from the jth noise site in the ith image. This is a potential false positive marking if n_{i,j} is larger than some cutoff. Similarly, s_{i,j} denotes the output value from the jth signal site (cancer) in the ith image. n_{i,j} and s_{i,j} are drawn from 2 normal distributions,

n_{i,j} \sim N(\xi_i, \sigma_{LN})    (14)

s_{i,j} \sim N(\psi_i, \sigma_{LS})    (15)

where the means ξ_i and ψ_i are themselves random variables:

(\xi_i, \psi_i) \sim N(0, \mu, \sigma_{CN}, \sigma_{CS}, \rho_{SN}).    (16)

ξ_i and ψ_i are possibly correlated in the case ρ_SN ≠ 0, and this allows for modeling intra-image correlation. We use σ_LN = σ_LS = σ_CN = σ_CS = 1, μ = 3, and ρ_SN = −0.2, since the resulting FROC curve location is then similar to those of real mammography CAD systems, for instance the system designed for finding spiculated lesions described briefly in reference (14).

Now assume that the system has been trained on n pairs of cancer mammograms. It is assumed that the evaluation is performed pair-wise, ie, that a cancer is considered detected if it is correctly marked in one projection. This gives us a total of 2n images containing one cancer each. For simplicity, we make the assumption that the CAD system produces a limited (19) and fixed (14) number of candidate noise sites per image, T = 9. The error in the Poisson assumption introduced by this limitation is negligible for operating points around one false positive per image.
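For concreteness, a minimal sketch of the data-generating model of Equations 14-16 (ours, not the author's simulation code; the cutoff value used in the example is an arbitrary assumption):

```python
import numpy as np

# Model parameters from the text (Equations 14-16)
sigma_LN = sigma_LS = sigma_CN = sigma_CS = 1.0
mu, rho_SN, T = 3.0, -0.2, 9           # T candidate noise sites per image

rng = np.random.default_rng(3)

def simulate_case_pair():
    """One cancer case: two projections of the same breast, each with T
    noise-site outputs (Equation 14) and one signal-site output (Equation 15).
    The same (xi, psi), drawn from the bivariate normal of Equation 16, is
    used for both projections, as in the appendix."""
    cov = [[sigma_CN ** 2, rho_SN * sigma_CN * sigma_CS],
           [rho_SN * sigma_CN * sigma_CS, sigma_CS ** 2]]
    xi, psi = rng.multivariate_normal([0.0, mu], cov)
    noise = rng.normal(xi, sigma_LN, size=(2, T))     # both projections
    signal = rng.normal(psi, sigma_LS, size=2)        # one signal site per projection
    return noise, signal

noise, signal = simulate_case_pair()
cutoff = 4.0                                          # example cutoff nu (arbitrary value)
fp_marks = int((noise > cutoff).sum())                # false positive marks in the 2 images
detected = bool((signal > cutoff).any())              # detected if marked in either projection
print(fp_marks, detected)
```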

The subsequent steps of the CAD algorithm classify these candidate sites as malignant or non-malignant. It is assumed that the same values of ξ_i and ψ_i are used for both projections of the breast. We thus get j = 1, …, T in Equation 14, j = 1 in Equation 15, and i = 1, …, n in the above equations.

For each of B = 10,000 simulation runs, the following is done:

● Training data consisting of n = 100 pairs of images with cancer is simulated as described above.
● The system is evaluated with a FROC curve parameterized by the cutoff level ν.
● The cutoff ν* that yielded 90% sensitivity on the training set is extracted.
● 75 pairs of abnormal images and 60,000 normal images are simulated using the above model.
● The true positive fraction is calculated on the 75 simulated cancers, and the number of false positives per image is calculated using the threshold ν*.

To be able to compare the simulated distribution of combinations of false positives and sensitivities with the confidence intervals (CIs) predicted in Figure 4, an unbiased estimate of the FROC curve of the system defined by Equations 14-16 must be obtained. This can be done by assuming that the functional form of the FROC curve is given by Equations 1 and 2, and then solving the minimization problem:

(a^*, b^*, \beta^*) = \arg\min_{a, b, \beta} \sum_{x_i > 0} \left( e^{-1/(a(x_i + \beta))^b} - y_i \right)^2,    (17)

where the set {x_i, y_i} is the unbiased empirical FROC curve derived from a simulation with a large value of n. With the estimated values of a*, b*, and β*, the framework of the article can be used to estimate the confidence intervals for f_{xy|y=y*}, assuming that the system has been trained on n = 100 cases. In Figure 7 these estimated CIs are superimposed on the result of B = 500 (for clarity) simulated combinations of false positives and sensitivities. For a particular run with B = 10,000, the number of simulated points falling inside the 90% CI was 8,999, with 9,500 inside the 95% CI and 9,883 inside the 99% CI. This indicates that the proposed method for capturing the effect of the sampling error on determining the threshold is correct.

Figure 7. Small scale (B = 500) simulation of clinical outcomes of a modeled CAD system. The solid lines are theoretically calculated confidence areas obtained with the method presented in this article.
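Finally, a hedged sketch of the least-squares fit in Equation 17 using scipy.optimize.curve_fit (our choice of optimizer; the article only specifies the minimization problem). The synthetic data points here stand in for the empirical FROC curve {x_i, y_i}.

```python
import numpy as np
from scipy.optimize import curve_fit

def froc_model(x, a, b, beta):
    """Functional form of Equations 1-2 expressed as sensitivity y as a
    function of false positives per image x, the fitting model in Equation 17."""
    return np.exp(-1.0 / (a * (x + beta)) ** b)

# Synthetic stand-ins for the empirical FROC points {x_i, y_i} with x_i > 0;
# in the appendix these come from a large simulation of the model above.
rng = np.random.default_rng(4)
x_emp = np.linspace(0.05, 2.0, 40)
y_emp = froc_model(x_emp, 7 / 3, 3, 0.3) + rng.normal(0, 0.01, x_emp.size)

popt, _ = curve_fit(froc_model, x_emp, y_emp, p0=[2.0, 2.0, 0.2],
                    bounds=([0.1, 0.5, 0.0], [10.0, 10.0, 1.0]))
a_fit, b_fit, beta_fit = popt
print(a_fit, b_fit, beta_fit)          # should land near a = 7/3, b = 3, beta = 0.3
```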