Use of prior mammograms in the classification of benign and malignant masses


European Journal of Radiology 56 (2005) 248–255

Celia Varela a, Nico Karssemeijer a,∗, Jan H.C.L. Hendriks b, Roland Holland b

a Radboud University Medical Centre Nijmegen, Department of Radiology, Geert Grooteplein 18, 6525 GA Nijmegen, The Netherlands
b Radboud University Medical Centre Nijmegen, National Expert and Training Center for Breast Cancer Screening, P.O. Box 9101, 6500 HB Nijmegen, The Netherlands

Received 8 February 2005; received in revised form 8 April 2005; accepted 11 April 2005

Abstract

The purpose of this study was to determine the importance of using prior mammograms for classification of benign and malignant masses. Five radiologists and one resident classified mass lesions in 198 mammograms obtained from a population-based screening program. Cases were interpreted twice, once without and once with comparison of previous mammograms, in a sequential reading order using soft copy image display. The radiologists’ performances in classifying benign and malignant masses without and with previous mammograms were evaluated with receiver operating characteristic (ROC) analysis. The statistical significance of the difference in performances was calculated using analysis of variance. The use of prior mammograms improved the classification performance of all participants in the study. The mean area under the ROC curve of the readers increased from 0.763 to 0.796. This difference in performance was statistically significant (P = 0.008).

© 2005 Elsevier Ireland Ltd. All rights reserved.

Keywords: Breast neoplasm; Diagnosis; Cancer screening; Diagnostic radiology; Observer performance

∗ Corresponding author. Tel.: +31 24 3614548; fax: +31 24 3540866. E-mail address: [email protected] (N. Karssemeijer).

0720-048X/$ – see front matter © 2005 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.ejrad.2005.04.007

1. Introduction

In screening mammography, the use of prior mammograms is recommended when interpreting newly obtained mammograms [1]. Nevertheless, reading of priors is not always practiced because of the additional cost and logistical obstacles of retrieving prior mammograms. With the upcoming conversion of breast cancer screening to a digital environment, retrieval and viewing of priors will become easier to organize. Initially, however, there will be a transition period in which prior and current mammograms will be on different media. Reading of films on a lightbox in combination with a soft-copy workstation for digital images is known to be difficult. Digitization of prior films would be a solution. However, this would be a considerable effort that should be supported by convincing evidence for the benefit of using priors.

Comparison with previous mammograms is believed to increase cancer detection by adding the possibility to perceive changes in appearance between examinations, and to reduce unnecessary recall to assessment for long-standing benign lesions [1]. However, there is not much scientific evidence that supports this belief. Existing studies show that specificity of mammography increases with the use of priors [2–5], but no conclusive evidence was found that sensitivity is influenced by having older films available. A problem is that most existing studies provide sensitivity and specificity values for a single confidence level. Nowadays, it is widely accepted that observer experiments should provide results independent of the decision threshold of the observer. Receiver operating characteristic (ROC) methodology is considered the most complete way to measure the diagnostic accuracy of radiologists in two-group classification tasks [6]. To our knowledge, only one study has been reported in which the influence of prior mammograms on mammographic screening was evaluated by means of ROC analysis [3]. This study had low power, and it could not be concluded that the presence of previous mammograms led to an improvement in diagnostic accuracy.

Reading of mammograms in breast cancer screening involves both detection of suspect areas and their classification. In this study we focus on classification, motivated by the fact that misinterpretation has been identified as a major cause of missing breast cancer [7,8]. The purpose of the study was to evaluate the use of prior mammograms to discriminate benign and malignant mass lesions. An experiment was conducted in which six radiologists interpreted a series of screening cases twice, without and with comparison of prior mammograms. All mammograms were displayed in a soft copy reading environment. The results were evaluated using ROC methodology.

2. Material and methods

2.1. Case collection and database

All cases used in the study were collected from the Dutch breast cancer screening program. This is a nation-wide program that offers all women aged 50 to 75 years a biennial screening examination. In initial screens, both medio-lateral oblique (MLO) and craniocaudal (CC) views are obtained; in subsequent screens generally only MLO views are taken. Upon indications of high breast density or suspect abnormalities both MLO and CC views are taken in about 20–30% of subsequent screenings. For subsequent screening rounds, it is obligatory to read mammograms in comparison with previous films. In subsequent screenings the referral rate is currently around 1.0% [9].

In our institution, we have a collection of consecutive cases that were referred between 1996 and 2000. These cases contained mammograms at referral and mammograms of all previous screening rounds. In each case abnormalities were annotated by an experienced radiologist (JH) using all information available, such as pathology results when a biopsy had been performed. In a number of cases, lesions could also be annotated on mammograms taken in previous screening rounds. These were cases with visible lesions that were not referred at the previous screenings.

In the study, for each case two examinations corresponding to two subsequent screening rounds were used. The more recent mammogram will be referred to as “current”, and for the older (previous) mammogram we will use the term “prior”. The selection of mammograms for this study was governed by the following criteria: (1) current mammograms should always have both MLO and CC views available and (2) at least one view of the current mammogram should have a mass, asymmetry, or architectural distortion (referred to as masses in the rest of the paper). Of all referral mammograms that met these criteria we randomly selected 171 cases, of which 87 were malignant and 84 were benign.
In addition, we selected all cases in which the last mammogram before referral met the criteria to be used as current. These were 27 cases, of which 12 were malignant (i.e. false negatives of screening) and 15 were benign. Combining the two selections we had 198 study cases, 99 malignant and 99 benign. Because of the selection criteria, current mammograms always had MLO and CC views. Prior mammograms always had MLO views, while CC views were available in 21.7% of the cases (43/198, 22 malignant and 21 benign).

All 99 malignant cases were biopsy proven. Of the benign cases 39 were histologically confirmed, while the remaining 60 cases had at least 6-month follow-up with mammography without suspicion for malignancy. Table 1 shows histopathologic and mammographic characteristics of findings of the malignant cases. There were two cases of ductal carcinoma in situ (DCIS) that were classified mammographically as mass with microcalcifications. Such cases are commonly referred to as tumor-forming DCIS. Two other cases were intracystic in situ carcinomas. Table 2 shows histopathologic and mammographic characteristics of findings of the non-malignant cases. There were 14 cases that also had microcalcifications in the annotated mass; 10 were malignant and 4 benign.

The average time between prior and current screening mammograms was 2.51 years (range 1.26–6.40 years). The images were digitized with a laser scanner (CFS300, Canon) at a pixel size of 50 µm and 12-bit gray levels, and subsequently down-sampled to a final resolution of 100 µm. Digitized mammograms were anonymized by removing all patient identification fields before display. Figs. 1 and 2 show two typical cases included in the study. Institutional review board approval was not required.

Table 1
Histopathologic and mammographic characteristics of findings of malignant cases

Characteristics                               No. of lesions
Biopsied                                      99
  Invasive ductal carcinoma                   71
  Invasive lobular carcinoma                  18
  Tubular carcinoma                           3
  Mucinous/colloid carcinoma                  2
  Intracystic carcinoma with invasion         1
  Intracystic carcinoma without invasion      2
  Ductal carcinoma in situ                    2
Mammographic lesion size                      99
  <11 mm                                      14
  11–20 mm                                    59
  >20 mm                                      26
Lesion type                                   99
  Mass                                        90
  Architectural distortion                    7
  Asymmetry                                   2

Lesion size corresponds to the mammographic annotation made by the radiologist.

Table 2
Histopathologic and mammographic characteristics of findings of benign cases

Characteristics                               No. of lesions
Biopsied                                      39
  Solitary cyst                               17
  Fibroadenoma                                4
  Fibrocystic change                          2
  Atypical ductal hyperplasia                 5
  Other benign lesion                         8
  Normal tissue                               3
Mammographic lesion size                      99
  <11 mm                                      20
  11–20 mm                                    54
  >20 mm                                      25
Lesion type                                   99
  Mass                                        90
  Architectural distortion                    7
  Asymmetry                                   2

Lesion size corresponds to the mammographic annotation made by the radiologist.

Fig. 1. Sixty-two-year-old woman with invasive ductal carcinoma (11 mm size). Left medio-lateral oblique mammograms. The bottom row corresponds to the current round and the top row to the prior mammogram 23 months earlier. In the current mammogram the square together with the magnification view enclose the malignant lesion recalled in the screening. The corresponding region in the prior is also shown. When using only the current mammograms the average score of the radiologists was 51.7 (range 20–80). Comparing with prior mammograms the average score increased to 72.5 (range 30–90).

Fig. 2. Sixty-six-year-old woman with mucinous carcinoma (12 mm size). Left medio-lateral oblique mammograms. The bottom row corresponds to the current round and the top row to the prior mammogram 21 months earlier. In the current mammogram the square together with the magnification view enclose the malignant lesion recalled in the screening. The corresponding region in the prior is also shown. When using only the current mammograms the average score of the radiologists was 50 (range 30–60). Comparing with prior mammograms the average score increased to 60 (range 40–80).

2.2. Experimental set-up

Six observers were asked to classify mass lesions in the selected series of mammograms, using a dedicated mammography workstation. In the reading sessions, for each case initially only the current mammogram was shown. After characterization of the lesion the prior mammogram was added, and the radiologist was asked to report any changes of opinion. Cases were presented in a randomized order and by pressing a key the contour of the abnormality to be judged was displayed in each view for which an annotation was available. The radiologists were instructed to check the location of lesions by pressing this key, to ensure that the correct mass was evaluated. Other possible abnormalities in the images could be ignored. The time for reading the images was not limited.

For reporting, radiologists were asked to rate the likelihood of malignancy of each case on a continuous scale from 0 to 100. In most cases, lesions were present in both views of the current mammograms. In 35.8% of the cases (71/198, 36 malignant and 35 benign) the abnormalities were also visible in the prior mammograms. In any case, radiologists were asked to give a single rating of the lesion.

The mammograms were displayed on a dedicated mammography workstation (MBC-SCR1, Mevis BreastCare). The workstation was equipped with two high-resolution CRT monitors (BARCO, MGD 521, 300 cd/m²), each with a spatial resolution of 2048 × 2560 pixels, which is sufficient to display one image at 100 µm. Initially, images were displayed at low spatial resolution (200 µm), in such a way that all images included in a case could be displayed simultaneously. Full spatial resolution (100 µm) images could subsequently be displayed by pressing a key of a dedicated keypad. Different views from the same breast, or prior and current images, could also be displayed on both monitors at the same time. This made it possible to analyze temporal changes in a simple and user-friendly way. Images were preprocessed using an unsharp-masking technique [10] to compensate for the decrease of sharpness with respect to the original films due to digitization and electronic display.

Five of the six readers in this study were attending radiologists with breast cancer screening experience. The other observer was a radiology resident in her last year who had specialized in mammography. All participants received a training session (25 cases) before the observer study started, in order to become familiar with the soft-copy reading system and the design of the experiment. The true diagnosis was given immediately after each training case. General information about the data set was provided to the radiologists. They were informed that the number of benign and malignant cases was approximately the same, and that all cases were referrals from a screening program.

2.3. Data analysis

The confidence ratings of the observers for each reading condition were analyzed using ROC methodology [11,12] and classification performance was quantified by means of the area under the ROC curve, Az. Statistical analysis was performed using the Dorfman–Berbaum–Metz (DBM) approach [13]. This method has been widely adopted in recent years for analyzing experimental data obtained in a multireader multi-case (MRMC) study design [14,15].
It has the advantage that both reader and case variability are properly taken into account, so that the results can be generalized to both the population of readers and the population of cases. The publicly available LABMRMC software [16] was used for computations. Statistical significance of the difference between the two reading conditions was determined for the whole group of observers and for the individual readers. Average ROC curves for all radiologists for both reading conditions were also calculated by averaging the parameters of the individual curves.

In our series of 198 cases we included 27 screening mammograms with visible abnormalities that did not lead to referral. These cases were, however, referred in the subsequent screening round, and 12 of them turned out to have breast cancer. For these 27 cases, radiologists read the last mammogram before referral as the current mammogram and the preceding one as the prior. The effect of reading with prior mammograms was determined separately for this subset of cases. As the sample of 27 cases was too small to allow statistical analysis with the MRMC approach, we pooled the data of the observers to allow computation of ROC curves [17].
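To make the step from confidence ratings to an area under the ROC curve concrete, the following is a minimal sketch rather than the analysis used in the study. The study fitted curves by maximum likelihood with the LABMRMC software; the sketch swaps in the simpler empirical (Mann–Whitney) area, and the function names and the ratings are hypothetical. It also shows what pooling the ratings of several observers could look like for a small subset.

```python
from typing import List, Sequence


def empirical_auc(ratings: Sequence[float], is_malignant: Sequence[bool]) -> float:
    """Area under the empirical ROC curve (Mann-Whitney statistic).

    Over all malignant/benign pairs, counts how often the malignant case
    received the higher rating; tied ratings count as one half.
    """
    pos = [r for r, m in zip(ratings, is_malignant) if m]
    neg = [r for r, m in zip(ratings, is_malignant) if not m]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pooled_auc(ratings_by_reader: Sequence[Sequence[float]],
               is_malignant: Sequence[bool]) -> float:
    """Concatenate the ratings of all readers, then compute one AUC."""
    pooled_ratings: List[float] = []
    pooled_truth: List[bool] = []
    for reader_ratings in ratings_by_reader:
        pooled_ratings.extend(reader_ratings)
        pooled_truth.extend(is_malignant)
    return empirical_auc(pooled_ratings, pooled_truth)


# Hypothetical 0-100 ratings for four cases; these are not data from the study.
truth = [True, True, False, False]
reader_a = [80, 40, 60, 20]
reader_b = [90, 70, 30, 10]

print(empirical_auc(reader_a, truth))           # 0.75
print(pooled_auc([reader_a, reader_b], truth))  # 0.9375
```

Pooling treats every reader-case pair as an independent observation, so it yields a curve but no valid reader-to-reader variance estimate, which is consistent with the subset being analyzed descriptively rather than with the MRMC approach.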

3. Results

The Az values of the six radiologists for the reading of the 198 cases without and with prior mammograms are listed in Table 3. The performance of all observers was improved when prior mammograms were used to classify the lesions. The improvement in their Az values ranged from 0.014 to 0.050. The average reading performance of the radiologists increased from Az = 0.763 to Az = 0.796 when prior mammograms were available. The improvement in the radiologists’ classification accuracy by using prior images was statistically significant (P = 0.008, DBM method). Fig. 3 shows the average ROC curves. There was also a statistically significant difference for two of the six observers (Table 3). Of the six radiologists in the study, one was a radiology resident with extensive training in mammography. Her performance was in agreement with that of the other radiologists.

Fig. 4 shows pooled ROC curves for the subset of 27 cases with mass lesions that were regarded as negative at one screening round but were referred at the subsequent screening exam 2 years later. For these cases, too, radiologists performed better when they had prior mammograms available. The areas under the ROC curves were 0.559 when classification was performed using only current mammograms and 0.638 when using both current and prior mammograms.
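The DBM significance test quoted above starts from jackknife pseudovalues of the accuracy index, computed per reader, reading condition, and case [13]. The sketch below illustrates only that first step, under stated assumptions: it uses the empirical area under the curve instead of the fitted Az, a single hypothetical reader, and invented ratings. The mixed-effects analysis of variance that DBM then applies to the pseudovalues is omitted.

```python
def auc(ratings, truth):
    """Empirical AUC: share of malignant/benign pairs ranked correctly (ties = 0.5)."""
    pos = [r for r, t in zip(ratings, truth) if t]
    neg = [r for r, t in zip(ratings, truth) if not t]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return score / (len(pos) * len(neg))


def jackknife_pseudovalues(ratings, truth):
    """Pseudovalue of case k: n * AUC(all cases) - (n - 1) * AUC(all cases but k)."""
    n = len(ratings)
    full = auc(ratings, truth)
    values = []
    for k in range(n):
        ratings_without_k = ratings[:k] + ratings[k + 1:]
        truth_without_k = truth[:k] + truth[k + 1:]
        values.append(n * full - (n - 1) * auc(ratings_without_k, truth_without_k))
    return values


# Hypothetical ratings of one reader for four cases; not data from the study.
truth = [True, True, False, False]
current_only = [80, 40, 60, 20]
with_prior = [90, 70, 60, 20]

pv_current = jackknife_pseudovalues(current_only, truth)
pv_prior = jackknife_pseudovalues(with_prior, truth)
per_case_gain = [p - c for c, p in zip(pv_current, pv_prior)]

print(pv_current)     # [1.5, 0.0, 0.0, 1.5]
print(pv_prior)       # [1.0, 1.0, 1.0, 1.0]
print(per_case_gain)  # [-0.5, 1.0, 1.0, -0.5]
```

In this toy example the mean per-case gain (0.25) equals the difference between the two AUC values (1.0 and 0.75); the case-to-case spread of such gains is what an analysis of variance would weigh against that mean.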

Table 3
Az values of individual radiologists without and with prior mammograms

Observer no.                      Current mammograms (S.E.)   Current + prior mammograms (S.E.)   95% CI for the population mean difference
1                                 0.773 (0.033)               0.792 (0.034)                       (−0.0618, 0.0206)
2                                 0.791 (0.032)               0.825 (0.029)                       (−0.0650, −0.0019)*
3                                 0.735 (0.035)               0.749 (0.034)                       (−0.0420, 0.0148)
4                                 0.751 (0.034)               0.801 (0.031)                       (−0.0866, −0.0104)*
5                                 0.768 (0.035)               0.813 (0.031)                       (−0.1096, 0.0247)
6                                 0.751 (0.034)               0.790 (0.032)                       (−0.0817, 0.0026)
Az from average a, b parameters   0.763                       0.796

S.E.: standard error; CI: confidence interval.
* Statistically significant.
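As a quick arithmetic check on Table 3, the per-observer gains and simple means can be recomputed from the tabulated values. This sketch only re-derives numbers already reported; the variable names are illustrative.

```python
# Az values transcribed from Table 3 (observers 1-6).
az_current = [0.773, 0.791, 0.735, 0.751, 0.768, 0.751]
az_with_prior = [0.792, 0.825, 0.749, 0.801, 0.813, 0.790]

# Gain per observer, rounded to the three decimals used in the table.
gains = [round(p - c, 3) for c, p in zip(az_current, az_with_prior)]

# Plain arithmetic means of the individual Az values.
mean_current = sum(az_current) / len(az_current)
mean_with_prior = sum(az_with_prior) / len(az_with_prior)

print(gains)                       # [0.019, 0.034, 0.014, 0.05, 0.045, 0.039]
print(min(gains), max(gains))      # 0.014 0.05
print(round(mean_current, 4))      # 0.7615
print(round(mean_with_prior, 4))   # 0.795
```

The gains reproduce the reported range of 0.014 to 0.050. The arithmetic means (0.7615 and 0.795) are slightly lower than the 0.763 and 0.796 in the last row of the table, which is expected because that row is computed from averaged curve parameters rather than by averaging the individual areas.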


Fig. 3. Average ROC curves of the six radiologists for each reading condition using the 198 cases in the data set.
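The average curves in Fig. 3 are obtained by averaging per-reader curve parameters (the "a, b parameters" of Table 3). Assuming these are the parameters of the conventional binormal ROC model, which the paper does not state explicitly, the curve and its area follow from the standard normal distribution as sketched below. The function names and the (a, b) pairs are hypothetical; the paper does not report the fitted parameters.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def binormal_tpf(fpf: float, a: float, b: float) -> float:
    """True-positive fraction of the binormal curve at a false-positive fraction.

    fpf must lie strictly between 0 and 1, because the inverse normal CDF
    is undefined at the end points.
    """
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def binormal_az(a: float, b: float) -> float:
    """Area under the binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    return _STD_NORMAL.cdf(a / sqrt(1.0 + b * b))


def average_parameters(params):
    """Average the (a, b) pairs of several readers into a single pair."""
    a_mean = sum(a for a, _ in params) / len(params)
    b_mean = sum(b for _, b in params) / len(params)
    return a_mean, b_mean


# Hypothetical (a, b) pairs for three readers; not values from the study.
readers = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
a_avg, b_avg = average_parameters(readers)

print(round(binormal_az(a_avg, b_avg), 3))
print([round(binormal_tpf(f, a_avg, b_avg), 3) for f in (0.1, 0.3, 0.5)])
```

Because Az is a nonlinear function of a and b, the area of the averaged curve generally differs a little from the mean of the individual areas.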

Fig. 4. ROC curve for pooled data of the six radiologists using 27 cases in which we used the last mammogram before referral as current mammogram, and the preceding one as prior.

4. Discussion

Results of the observer study show that the performance of radiologists in classification of benign and malignant mammographic masses is improved significantly when radiologists have access to prior mammograms for temporal comparison. The statistical analysis performed allows generalizing these results beyond the specific case sample and readers in our study. The beneficial effect of priors was very consistent, as individual performance was improved for all six observers in the study.

In our experiment we did not try to mimic a real screening situation, in which inclusion of a high number of normal mammograms would be required. Our goal was to study the use of prior mammograms in the classification of suspect regions and not in the initial perceptual process of detecting potential lesion locations. By focusing on this aspect we were able to design an ROC study with high statistical power. Classification of lesions is crucial in screening, as the best possible decision for further assessment must be made to maintain both high sensitivity and specificity. Moreover, misinterpretation of lesions has been identified as a major cause of screening errors [7,8,18]. We argue that the additional use of priors to interpret current mammograms in classification of breast lesions is particularly relevant in screening, as in that setting generally no other diagnostic methods can be used to improve the decision making process.

Since results obtained in a screening and in a diagnostic situation might differ greatly [19] we only used referred cases from a screening program. As a consequence, the criteria for referral used in the screening program have affected the distribution of cases. In particular, as the screening program from which we took our cases has a very low referral rate, one might expect that subtle abnormalities that would have warranted referral in other programs may not have been well represented in our case sample. However, in 27 cases we used the last “negative” screening exam before referral as current mammogram and the preceding mammogram taken 4 years before referral as prior. In those cases a subtle abnormality could already be identified in the mammogram 2 years prior to referral.
These 27 cases made an important contribution to the series of cases used in the experiment, as they include 12 cases in which a lesion that was not seen or was misinterpreted on the current mammogram was malignant. By including these initially “negative” cases we obtained a case sample that was more representative for studying the problem of lesion classification in screening. For this subset of cases we computed a pooled ROC for all the radiologists. Results in Fig. 4 show the same trend in the data as for the whole data set, and it seems that the effect of using priors is even stronger. The areas under the ROC curves of the pooled data for this subset were lower than the average performances for the total number of cases. This is understandable as these cases were more subtle.

Several studies have assessed the value of prior mammograms in screening with no conclusive results. In some studies the use of prior mammograms showed a benefit in decreasing the number of false positive detections [2–5]. However, these studies did not demonstrate an improvement in detection accuracy [2,3]. This may be due to the low statistical power of those studies, resulting from inclusion of only a few abnormal cases, far fewer than the 198 abnormal mammograms we used [2,3]. In particular, Callaway et al. [3] included only nine cancer cases in the 100-case set they used to perform their ROC study. Burnside et al. [4] showed an increase in true positive findings when performing diagnostic mammography with priors, but this outcome was not shown when performing screening mammography. One of the limitations of their study was that the sets of cases read with and without priors were different. Because they were not able to randomize case selection there may have been a bias. In our study, all cases were read twice, first without and next with comparison of previous mammograms. Sumkin et al. [5] reported a statistically significant improvement in the specificity due to priors but not in the sensitivity. Their goal was to determine whether mammograms from 1 or 2 years earlier should be used as reference. Bearing this in mind, they excluded from their data set those cases where the comparison films were deemed to be of limited or no value to the final disposition of the case [5]. Therefore, from their results no general conclusions about the use of previous mammograms can be drawn.

All previous studies conducted to evaluate the impact of prior mammograms on the detection of breast cancer in a screening program were screen-film based. In this study we used digitized mammograms in a soft-copy reading environment. Studies have shown that soft-copy reading can be as effective as hard copy reading in detecting breast cancer [20–22], provided that the digital reading environment is of high quality. We used a dedicated mammography workstation equipped with two high-resolution monitors. A previous study showed that the hardware and display software used in this system were suited to perform soft-copy reading of mammograms without loss of quality [21].

In this study we used a sequential reading order design. We consider that this study design was appropriate, as it closely resembles the way radiologists work in practice. In this situation the potential benefit of the first reading to interpretations made from the second reading becomes not a bias, but instead a factor of realistic experimental design [6].
The sequential reading design was chosen, because it has been shown to be much more sensitive for finding differences and less demanding on logistics and reader time investment [14]. Independent reading studies have the advantage that possible bias due to reading order effects can be eliminated. However, experimental evidence suggests that this does not outweigh the benefits of sequential reading order design, in particular not in a case where sequential reading mimics clinical practice [14,15]. The results of our study suggest that priors should be routinely used in breast cancer screening and that a proper solution should be found to make priors available in the transition phase to digital mammographic screening. From our results, we can conclude that comparison with previous mammograms should be strongly recommended whenever a suspect abnormality is perceived.

Acknowledgments

The authors wish to thank S. van Engeland MSc and S. Timp MD for their assistance in preparation of this experiment and their valuable comments. The authors are grateful to Dr. F. van der Horst of the National Expert and Training Center for Breast Cancer Screening, Nijmegen, The Netherlands for his valuable comments on the preparation of the study.

Funding: Supported by grant IST-2001-33439 from the European Community in the Information Society Technologies program in the 5th Framework.

References

[1] European commission quality control guidelines for mammographic screening, 3rd ed. European Commission; 2001.
[2] Thurfjell MG, Vitak B, Azavedo E, Svane G, Thurfjell E. Effect on sensitivity and specificity of mammography screening with or without comparison of old mammograms. Acta Radiol 2000;41:52–6.
[3] Callaway MP, Boggis CRM, Astley SA, Hutt I. The influence of previous films on screening mammographic interpretation and detection of breast carcinoma. Clin Radiol 1997;52:527–9.
[4] Burnside ES, Sickles EA, Sohlich RE, Dee KE. Differential value of comparison with previous examinations in diagnostic versus screening mammography. AJR 2002;179:1173–7.
[5] Sumkin JH, Holbert BL, Herrmann JS, et al. Optimal reference mammography: a comparison of mammograms obtained 1 and 2 years before the present examination. AJR 2003;180:343–6.
[6] Metz CE. Fundamental ROC analysis. In: Beutel J, Kundel H, Van Metter R, editors. Handbook of medical imaging: physics and psychophysics, vol 1. Bellingham, WA: SPIE Press; 2000. p. 751–69.
[7] Vitak B. Invasive interval cancers in the Östergötland mammographic screening programme: radiological analysis. Eur Radiol 1998;8:639–46.
[8] Karssemeijer N, Otten JDM, Verbeek ALM, et al. Computer-aided detection versus independent double reading of masses on mammograms. Radiology 2003;227:192–200.
[9] Francheboud J, Otto S, van Ineveld B, et al. National evaluation of breast cancer screening in the Netherlands. National Evaluation Team for Breast cancer screening (NETB) 2002.
[10] Roelofs AJ, van Woudenberg S, Hendriks JHCL, Karssemeijer N. Optimized soft-copy display of digitized mammograms. Proc SPIE 2003;5034:10–9.
[11] Metz CE. ROC methodology in radiographic imaging. Invest Radiol 1986;21:720–33.
[12] Metz CE, Herman BA, Shen JH. Maximum-likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Statist Med 1998;17:1033–53.
[13] Dorfman DD, Berbaum KS, Metz CE. ROC rating analysis: generalization to the population of readers and cases with the jackknife method. Invest Radiol 1992;27:723–31.
[14] Beiden SV, Wagner RF, Doi K, et al. Independent versus sequential reading in ROC studies of computer-assist modalities: analysis of components of variance. Acad Radiol 2002;9:1036–43.
[15] Hadjiiski L, Chan H-P, Sahiner B, et al. Improvement in radiologists’ characterization of malignant and benign breast masses on serial mammograms with computer-aided diagnosis: an ROC study. Radiology 2004;233:255–65.
[16] http://www-radiology.uchicago.edu/krl/roc soft.htm/.
[17] Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989;24:234–45.
[18] Manning D, Ethell S, Donovan T. Categories of observer error from eye-tracking and AFROC data. Proc SPIE 2004;5372:90–9.
[19] Dee KE, Sickles EA. Medical audit of diagnostic mammography examinations: comparison with screening outcomes obtained concurrently. AJR 2001;176:729–33.
[20] Pisano ED, Cole EB, Kistner EO, et al. Interpretation of digital mammograms: comparison of speed and accuracy of soft-copy versus printed-film display. Radiology 2002;223:483–8.
[21] Roelofs AJ, van Woudenberg S, Hendriks JH, Bödicker A, Evertsz CJ, Karssemeijer N. Performance evaluation of a digital reading station for screening mammography. In: Peitgen HO, editor. Digital mammography. Berlin: Springer-Verlag; 2003. p. 455–9.
[22] Lewin JM, Hendrick RE, D’Orsi CJ, et al. Comparison of full-field digital mammography with screen-film mammography for cancer detection: results of 4945 paired examinations. Radiology 2001;218:873–80.