Adequacy testing of training set sample sizes in the development of a computer-assisted diagnosis scheme


Bin Zheng, PhD, Yuan-Hsiang Chang, PhD, Walter F. Good, PhD, David Gur, ScD

Rationale and Objectives. The authors assessed the performance changes of a computer-assisted diagnosis (CAD) scheme as a function of the number of regions used for training (rule-setting).

Materials and Methods. One hundred twenty regions depicting actual masses and 400 suspicious but actually negative regions were selected as a testing data set from a database of 2,146 regions identified as suspicious on 618 mammograms. An artificial neural network using 24 and 16 region-based features as input neurons was applied to classify the regions as positive or negative for the presence of a mass. CAD scheme performance was evaluated on the testing data set as the number of regions used for training increased from 60 to 496.

Results. As the number of regions in the training sets increased, performance on the training data decreased and plateaued beyond a sample size of approximately 200 regions. Performance on the testing data set continued to improve as the training data set increased in size.

Conclusion. A trend in a system's performance as a function of training set size can be used to assess adequacy of the training data set in the development of a CAD scheme.

Key Words. Breast radiography; computers, diagnostic aid; computers, neural network; images, processing.

Acad Radiol 1997; 4:497-502. From A449 Scaife Hall, Department of Radiology, University of Pittsburgh, 3550 Terrace St, Pittsburgh, PA 15261-0001. Received December 16, 1996, and accepted for publication after revision March 24, 1997. Supported in part by the Eastman Kodak Company, Rochester, NY. Address reprint requests to B.Z. © AUR, 1997.

Evaluation and validation of the performance of diagnostic imaging systems have been the topic of extensive research during the past 2 decades (1). Despite substantial progress in all aspects related to study design and analysis, many questions remain (2,3). These are often related to, but not limited to, the comparison between two or more imaging systems or techniques (3,4). One fundamental problem that often becomes the limiting factor in validating new techniques and objectively assessing their potential effect on the clinical environment is case-selection adequacy for optimization during development and later validation of the system's performance or robustness (5). With the rapid increase in the number of computer-assisted diagnosis (CAD) schemes that are being developed, similar questions concerning case-selection and sampling adequacy during validation and comparisons have arisen (6). Many studies reported to date performed a single measurement based on an available data set, where some of the images are used for training and the remainder for testing (7,8). To improve error assessment, a recent tendency is to use well-established cross-validation techniques, such as jackknifing (9) or the round-robin approach (10). Others have reported results of both training and testing experiments, where the two were performed on different data sets (11,12). Although all of these methods are valid when used appropriately, none of these efforts attempted to assess the adequacy of the training set in terms of both sample size and its ability to cover adequately the variable domain that is used to identify and/or characterize the abnormality in question. Hence, generalizability of the results, or inferences about the scheme's performance with other image sets, may not be valid.


In this preliminary study, we investigated the relationship between the testing performance of a simple three-level feed-forward artificial neural network (ANN) and the size of the training data set. The general purpose of this experiment was to develop and evaluate a method for assessing the adequacy of a training set sample size in a sparsely sampled environment.


MATERIALS AND METHODS

The image database from which regions of interest were selected for this project contained 618 digitized mammograms; of these, 545 were acquired during breast examinations of 226 patients at the University of Pittsburgh Medical Center or its affiliated hospitals and clinics, and 73 images were selected from a set of mammograms provided by a research group at Washington University, St Louis, Mo. In this project, 2,146 regions of interest depicting 368 true-positive and 1,778 true-negative regions that were suspicious for mass were used. All positive regions were verified by pathologic examination. One hundred eighty-one of the positive regions were different views of 122 malignant masses, and 187 regions were different views of 127 benign masses. All regions used in this study were identified as suspicious by means of a rule-based CAD scheme after several stages of analysis (13). With the exception of the suspicious regions that matched the verified masses, all other regions were identified by this independent CAD scheme as suspicious but were "determined" by a radiologist's review of current and follow-up mammograms to be negative. No pathologic verification was available for the negative regions. The CAD scheme used to identify these regions has three distinct stages for the identification of suspicious regions. The initial stage of dual-kernel filtering, subtraction, thresholding, and labeling resulted in the identification of 16.5 false-positive regions per image when applied to our database. The second stage includes thresholding of individual topographic layer-based features (region growth and shape) and resulted in the identification of 2.88 false-positive regions per image (or 1,778 total). The third stage includes nonlinear, multifeature, topographic (multilayer) thresholding and yielded final results of 0.86 false-positive regions per image. The 368 true-positive and 1,778 true-negative regions identified during the second stage of the scheme were included in a data set from which regions were selected for the following experiment.


Figure 1. Distribution of effective size of mass regions used in this study (x-axis: effective size, mm). As defined here, effective size is the square root of the product of the mass dimensions (maximum axis × minimum axis).


Figure 2. Distribution of contrast values of mass regions used in this study (x-axis: digital value contrast). As defined here, contrast is the difference between the average digital value inside the mass region and that in a square window surrounding the mass region, excluding the mass region itself. (The dimension of the window was 5 mm larger than the maximum axis of the mass.)

All film mammograms in the data set were digitized by using a laser-film digitizer (Lumisys, Sunnyvale, Calif) with a pixel size of 100 × 100 µm and 12-bit gray-level resolution. After digitization, these images were subsampled by a factor of four in both dimensions to make the size of the digitized mammograms approximately 600 × 450 pixels. Thus, the effective pixel size in the subsampled images was 400 × 400 µm.

Table. Mass and Background-related Input Features Used in This Study

Feature No. — Description
1. Size growth ratio between layers 1 and 2
2. Size growth ratio between layers 2 and 3
3. Central position shift from layer 1 to layer 2
4. Central position shift from layer 2 to layer 3
5. Size in layer 1 (pixel number)
6. Size in layer 2
7. Size in layer 3
8. Circularity in layer 2
9. Circularity in layer 3
10. Shape factor in layer 3
11. Perimeter ratio in layer 3
12. Mean radial length in layer 3
13. Standard deviation of radial length in layer 3
14. Ratio of the longest and shortest radial lengths in layer 3
15. Longest axis in layer 3
16. Distance from local minimum point to region center in layer 3
17. Digital value contrast in layer 1
18. Digital value contrast in layer 3
19. Digital value standard deviation in layer 3
20. Digital value skewness in layer 3
21. Standard deviation of digital value in the background
22. Rank of local minimum in the original image
23. Rank of region's depth in the subtracted image
24. Distance from region's center to the skin boundary

As per the method described elsewhere (6,14), the effective size and digital value contrast of the positive mass regions used in this experiment are provided in Figures 1 and 2, respectively. A fraction (about 70%) of this image database, including 220 true-positive mass regions, had been used previously (12). Because we were attempting to investigate a methodologic issue, the specific scheme used is of secondary importance. A rule-based system that is optimized by iterative, but largely empirical, selection of rules (or general shapes of boundary conditions) could introduce substantial biases in the results. Therefore, a relatively simple ANN-based scheme was used. The ANN was a feed-forward, three-layer scheme with 24 input neurons, eight hidden neurons, and one output. The relationship between the input neurons (x_i) and the output neuron (Y) is determined with the following equation:

$$Y = g\left[\sum_{j=1}^{m} w_j \, g\!\left(\sum_{i=1}^{n} w_{ji} x_i + \theta_{\mathrm{in}}\right) + \theta_{\mathrm{hid}}\right],$$

where $g(z) = 1/(1 + e^{-z})$, $w_j$ is the weight from the $j$th hidden neuron to the output neuron, $w_{ji}$ is the weight from the $i$th input neuron to the $j$th hidden neuron, $\theta_{\mathrm{in}}$ is a bias neuron in the input layer, and $\theta_{\mathrm{hid}}$ is another bias neuron in the hidden layer. A nonlinear sigmoid function,

$$O_{pj} = \frac{1}{1 + e^{-\left(\sum_i w_{ji} O_{pi} + \theta_j\right)}},$$

is used as the activation function for each of the processing units (neurons) in the ANN, where $O_{pj}$ is the $j$th element of the output pattern produced by the input pattern $O_p$. Then, with use of the back-propagation concept and the Widrow-Hoff learning law (15), the weights between every two neurons are adjusted iteratively so that the difference between the actual output values and the desired output values is minimized. The weights are initially assigned randomly (16). Then, the adjusted weights are calculated as follows:

$$\Delta w_{ji}(k + 1) = \eta\, \delta_{pj} O_{pi} + \alpha\, \Delta w_{ji}(k),$$

where $\eta$ is the learning rate, $\alpha$ is a momentum term used to determine the effect of past weight changes on the current changes, $k$ is the number of iterations, and $\delta_{pj}$ is the error between the desired and actual ANN output values. In the training mode, output digital values of 1 and 0 are assigned to true-positive and true-negative regions, respectively. All the final weights in the ANN are determined when either the error $\delta$ is smaller than a predetermined value (eg, 0.01) or the number of iterations $k$ has reached a predetermined number (eg, 3,000).

Twenty-four features were computed for each region. These were used as input neurons of the ANN. These features are listed in the Table and have been described elsewhere (12,13). In brief, they can be divided into four general categories. Seven neurons (features 1-7) represented the region growth features between adjacent growth layers or contrast levels (eg, size, growth ratio, and central position shift). Eight neurons (features 8-15) were features related to the shape and boundary condition of each suspicious region at different levels. Circularity is defined as the fraction of a region covered by a circle that has the same size as the region. Shape factor is defined as the ratio between the longest axis and the shortest axis of a region. Perimeter ratio is computed as the perimeter length divided by the area (number of pixels) of a region. Mean radial length is the average radial length from the region center to each point on the perimeter of the region.


Five input neurons (features 16-20) were computed from digital value distributions at different levels (eg, contrast, standard deviation, and skewness). Four input neurons (features 21-24) were related to the surrounding background of each suspicious region and its ranking relationship with other suspicious regions in the same image. The distributions of feature values from the total ensemble of regions were used to normalize the input values for each feature to between 0 and 1.

To evaluate scheme performance as a function of the number of regions used in the training data set, the scheme was trained with each training parameter set at a constant value. The number of iterations was 500, and the momentum and learning rates were 0.6 and 0.02, respectively. From the total of 368 positive and 1,778 negative regions, 120 positive and 400 negative regions were randomly selected as a testing data set. This set was used solely for testing purposes and was not used in any of the training protocols. The remaining 248 positive and 1,378 negative regions made up the database from which the appropriate numbers of positive and negative regions were randomly selected for each training experiment. The training sample sizes were 60, 100, 150, 200, 280, 360, and 496. Each training experiment included a 1:1 ratio between true-positive and true-negative regions. For example, in the case of 280 training regions, 140 were positive regions (randomly selected from the 248 available regions) and 140 were negative regions (randomly selected from the 1,378 available regions). At the completion of each training cycle, performance was evaluated by using the same testing data set of 520 regions (120 positive and 400 negative). With use of the ANN output as a summary index, the area under the receiver operating characteristic (ROC) curve (Az) was computed for each test at varying training sample sizes. Performance indexes for the training and testing sets were plotted as a function of the size of the training data set, and the results were compared. To evaluate the consistency of the results, the experiment was repeated after the number of iterations was increased from 500 to 3,000, as well as with the number of input features reduced from 24 to 16.
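The sketch below is a minimal, illustrative reimplementation (not the authors' code) of the procedure just described: a three-layer feed-forward network with sigmoid activations, trained by per-sample back-propagation with a momentum term, and evaluated at the same series of training-set sizes. The synthetic feature vectors, the nonparametric (Mann-Whitney) estimate of Az, and all names and parameter values other than those stated above are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeLayerANN:
    """Feed-forward network with one hidden layer, trained by
    back-propagation with a momentum term (see the equations above)."""

    def __init__(self, n_in=24, n_hidden=8, lr=0.02, momentum=0.6):
        # Small random initial weights; biases play the role of theta_in / theta_hid.
        self.w_ji = rng.normal(0.0, 0.1, (n_hidden, n_in))  # input -> hidden weights
        self.w_j = rng.normal(0.0, 0.1, n_hidden)            # hidden -> output weights
        self.b_hid = np.zeros(n_hidden)
        self.b_out = 0.0
        self.lr, self.momentum = lr, momentum
        self.dw_ji = np.zeros_like(self.w_ji)                # previous weight changes
        self.dw_j = np.zeros_like(self.w_j)

    def forward(self, x):
        h = sigmoid(self.w_ji @ x + self.b_hid)
        return h, sigmoid(self.w_j @ h + self.b_out)

    def train(self, X, t, n_iter=500):
        # Per-sample updates: dw(k+1) = lr * delta * o + momentum * dw(k).
        # (Naive Python loop; slow but transparent.)
        for _ in range(n_iter):
            for x, target in zip(X, t):
                h, y = self.forward(x)
                delta_out = (target - y) * y * (1.0 - y)          # output-layer error term
                delta_hid = self.w_j * delta_out * h * (1.0 - h)  # hidden-layer error terms
                self.dw_j = self.lr * delta_out * h + self.momentum * self.dw_j
                self.dw_ji = self.lr * np.outer(delta_hid, x) + self.momentum * self.dw_ji
                self.w_j += self.dw_j
                self.w_ji += self.dw_ji
                self.b_out += self.lr * delta_out
                self.b_hid += self.lr * delta_hid

    def scores(self, X):
        return np.array([self.forward(x)[1] for x in X])

def az_mann_whitney(pos_scores, neg_scores):
    # Nonparametric (Mann-Whitney) estimate of the area under the ROC curve.
    wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
    ties = (pos_scores[:, None] == neg_scores[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# Illustrative experiment loop with synthetic stand-ins for the 24 normalized features.
pos = rng.beta(3, 2, (368, 24))    # stand-ins for the 368 true-positive regions
neg = rng.beta(2, 3, (1778, 24))   # stand-ins for the 1,778 true-negative regions
test_X = np.vstack([pos[:120], neg[:400]])
test_y = np.r_[np.ones(120), np.zeros(400)]

for n_train in (60, 100, 150, 200, 280, 360, 496):
    half = n_train // 2            # 1:1 positive-to-negative ratio in each training set
    train_X = np.vstack([pos[120:120 + half], neg[400:400 + half]])
    train_y = np.r_[np.ones(half), np.zeros(half)]
    ann = ThreeLayerANN()
    ann.train(train_X, train_y, n_iter=500)
    s = ann.scores(test_X)
    print(n_train, round(az_mann_whitney(s[test_y == 1], s[test_y == 0]), 3))
```

Note that the study reports fitted ROC areas, whereas this sketch uses the simpler Mann-Whitney estimate; the choice does not affect the qualitative behavior being illustrated.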

RESULTS

Figure 1 shows the distribution of the effective sizes of true-positive mass regions used in this study. Effective size was defined as the square root of the product of the maximum and minimum dimensions of each mass. Figure 2 shows the distribution of the digital value contrast of the mass regions used in this study.



Figure 3. Areas under the ROC curves (Az) at different training data set sizes for the training and testing data sets (x-axis: number of regions in the training data set). At each measurement point, 50% of the training regions were positive. Results after 500 and 3,000 iterations with 24 input features are shown.


Figure 4. Comparison of the areas under the ROC curves (Az) at different training data set sizes with 24 and 16 input neurons, for the training and testing data sets (x-axis: number of regions in the training data set). At each measurement point, 50% of the training regions were positive. Results after 500 iterations are shown.

These regions are not dissimilar to other data sets used in the past for training and testing different schemes (12). Figure 3 shows the computed area under the ROC curve (Az) for the different training sample sizes and 24 input neurons. This figure demonstrates that training performance decreased as the number of regions included in the training data set increased. However, there was no substantial change when the training set increased beyond approximately 200 regions (100 of which were true-positive).


Figure 5. False-positive detection rates at 80% sensitivity as a function of training data set size (x-axis: number of regions in the training data set). Results shown are for 500 iterations with 24 input neurons.

The scheme's performance on the testing data set, however, continued to improve when the number of training regions increased. When the number of training iterations was increased to 3,000, the training results improved at all sample sizes (Fig 3). With minor variations, scheme performance on the testing database remained virtually the same. If anything, the performance on the testing data set decreased slightly with the increased number of training iterations, highlighting the overfitting issues associated with this approach. As shown in Figure 4, similar results were obtained when the number of input features was reduced from 24 to 16. Figure 5 demonstrates similar characteristics by using the fraction of false-positive regions remaining at a given sensitivity of 80%. The number of false-positive regions in the testing data set decreased from 194 to 130 as the training data set increased from 60 to 496 regions.
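As an aside, a false-positive count at a fixed sensitivity of the kind reported above can be read directly off the ANN output scores. The helper below is a minimal sketch of one way to do so; the function name and the quantile-based threshold convention are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def false_positives_at_sensitivity(pos_scores, neg_scores, sensitivity=0.80):
    """Count negative regions scoring at or above the threshold that keeps
    the requested fraction of positive regions at or above it."""
    # Threshold at the (1 - sensitivity) quantile of the positive scores,
    # so roughly 80% of true-positive regions score at or above it.
    threshold = np.quantile(pos_scores, 1.0 - sensitivity)
    return int(np.sum(neg_scores >= threshold))
```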

DISCUSSION

There are many theoretically sound techniques for optimizing and validating CAD schemes (17). However, most are based on the assumption that the training (rule-setting) database covers the entire sample space sufficiently well. When the case domain is adequately sampled (for the true-positive as well as the true-negative regions), and the investigator takes great care not to overtrain, these are valid approaches. This is typically the case when the feature domain is reasonably limited and well defined (eg, optical character recognition), but, unfortunately, it is not the case in many clinical applications, including mammographic CAD techniques.

Given the large number of independent variables needed to characterize the abnormalities on mammograms and the fact that many of these features are continuous and span a wide range of values, a very large, carefully selected sample of cases may be required to ensure that the entire variable domain is adequately covered. There is no a priori way to know whether a particular finite training set is sufficient in this regard. In reality, current prospective studies in mammographic CAD, with "cases never seen," demonstrate a reduction in performance compared with that on the data set used to develop and optimize the scheme, despite the fact that great care and valid sampling methods had been used during the development phase (18). The reason may be that the set from which training data are selected, segmented, and used to optimize CAD techniques is very sparse in relationship to the feature space. As a result, one is more likely to "stress" the system during testing by using cases that were "never seen," since at least some of these may cover areas in the feature space that had not been (or at best had been sparsely) represented in the training set. This is a somewhat different aspect of the classic overtraining or overfitting phenomenon. When the number of input neurons was reduced from 24 to 16, the similarity of the testing results clearly demonstrated dependence within our feature set (Fig 4).

The concept presented in this preliminary and limited study offers one potential solution to a sample-size adequacy test of the training set in sparsely sampled environments. Although many of the details must be further investigated and simulations are currently under way to explore some of the methodologic issues associated with this approach, we believe the concept may prove both sound and unique, particularly when both the training and testing sample sizes available are not too small. This general approach to system validation could ultimately provide two types of inferences. First, the adequacy of the training set sample size can be assessed from the number of training cases required to reach asymptotic performance on the testing data set. It should be remembered, however, that the adequacy of the testing set sample size has similar associated issues that are beyond the scope of this preliminary report. Second, the gap (difference) between the asymptotic performances on the training and testing databases for a large number of training and testing cases could potentially be used to infer the adequacy of the domain description by the parameters (features) used in the model.


The latter may prove to be an important tool for assessing potential improvements in CAD schemes as a function of the input parameters used in the model. These concepts clearly need substantial additional support, not only in validating them for different training and testing sets to demonstrate consistency and generalizability, but also in better understanding the limitations of this approach, particularly with limited data sets. All of these issues are clearly beyond the scope of this preliminary report. We do not wish to diminish the role or challenge the validity of other proven assessment methods. However, unless these techniques are appropriately adapted to the conditions that apply to many CAD schemes, they may overestimate actual performance in the clinical environment. The issues discussed herein are not unique to CAD development.
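As one hedged illustration of the two inferences described above, the sketch below shows how the plateau point of the testing curve and the asymptotic training-testing gap might be read off measured Az values. The tolerance value, the function names, and the example Az numbers are all hypothetical and are not part of the authors' method or results.

```python
import numpy as np

def plateau_size(train_sizes, test_az, tolerance=0.005):
    """Smallest training-set size beyond which the testing Az never improves
    by more than `tolerance` per step; returns None if no plateau is seen."""
    gains = np.diff(np.asarray(test_az))
    for i in range(len(gains)):
        if np.all(gains[i:] <= tolerance):
            return train_sizes[i]
    return None

def asymptotic_gap(train_az, test_az):
    """Training-minus-testing Az at the largest training-set size measured."""
    return train_az[-1] - test_az[-1]

# Example with hypothetical Az values at the sample sizes used in this study:
sizes = [60, 100, 150, 200, 280, 360, 496]
test_curve = [0.78, 0.81, 0.83, 0.84, 0.85, 0.855, 0.86]
train_curve = [0.97, 0.95, 0.93, 0.91, 0.905, 0.90, 0.90]
print(plateau_size(sizes, test_curve), asymptotic_gap(train_curve, test_curve))
```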

ACKNOWLEDGMENTS

The authors thank William Reinus, MD, and the research group at Washington University, St Louis, Mo, for providing us with some of the images used in this study.

REFERENCES

1. Swets JA. ROC analysis applied to the evaluation of medical imaging techniques. Invest Radiol 1979; 14:109-120.
2. International Commission on Radiation Units. Medical imaging: the assessment of image quality. International Commission on Radiation Units and Measurements Report no. 54. Bethesda, Md: International Commission on Radiation Units, 1996.
3. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989; 24:234-245.
4. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88-94.
5. Gur D, King JL, Rockette HE, Britton CA, Thaete FL, Hoy RJ. Practical issues of experimental ROC: selection of controls. Invest Radiol 1990; 25:583-586.
6. Nishikawa RM, Giger ML, Doi K, et al. Effect of case selection on the performance of computer-aided detection schemes. Med Phys 1994; 21:265-269.
7. Ema T, Doi K, Nishikawa RM, Jiang Y, Papaioannou J. Image feature analysis and computer-aided diagnosis in mammography: reduction of false-positive clustered microcalcifications using local edge-gradient analysis. Med Phys 1995; 22:161-169.
8. Wei D, Chan HP, Helvie MA, et al. Classification of mass and normal breast tissue on digital mammograms: multiresolution texture analysis. Med Phys 1995; 22:1501-1513.
9. Zhang W, Doi K, Giger ML, Wu Y, Nishikawa RM, Schmidt RA. Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network. Med Phys 1995; 22:1555-1567.
10. Wu Y, Doi K, Giger ML, et al. Detection of lung nodules in digital chest radiographs using artificial neural networks: a pilot study. J Digital Imaging 1995; 8:88-94.
11. Chan HP, Lo SB, Sahiner B, Lam KL, Helvie MA. Computer-aided detection of mammographic microcalcifications: pattern recognition with an artificial neural network. Med Phys 1995; 22:1555-1567.
12. Zheng B, Chang YH, Gur D. An adaptive computer-aided diagnosis scheme of digitized mammograms. Acad Radiol 1996; 3:806-814.
13. Zheng B, Chang YH, Gur D. Computerized detection of masses in digitized mammograms using single-image segmentation and a multilayer topographic feature analysis. Acad Radiol 1995; 2:959-966.
14. Li HD, Kallergi M, Clark LP, Jain VK, Clark RA. Markov random field for tumor detection in digital mammography. IEEE Trans Med Imaging 1995; 14:565-576.
15. Hecht-Nielsen R. Neurocomputing. New York, NY: Addison-Wesley, 1989; 59-63.
16. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C. New York, NY: Cambridge University Press, 1992; 274-286.
17. Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Statistician 1983; 37:36-48.
18. Nishikawa RM, Schmidt RA, Osnis RB, Giger ML, Doi K, Wolverton DE. Two-year evaluation of a prototype clinical mammography workstation for computer-aided diagnosis (abstr). Radiology 1996; 201(P):256.