Observer variability of Breast Imaging Reporting and Data System (BI-RADS) for breast ultrasound


European Journal of Radiology 65 (2008) 293–298

Hye-Jeong Lee a, Eun-Kyung Kim a,∗, Min Jung Kim a, Ji Hyun Youk a, Ji Young Lee a, Dae Ryong Kang b, Ki Keun Oh a

a Department of Diagnostic Radiology, College of Medicine, Yonsei University, Seodaemun-ku Shinchon-dong 134, Seoul 120-752, South Korea
b Department of Preventive Medicine, College of Medicine, Yonsei University, Seodaemun-ku Shinchon-dong 134, Seoul 120-752, South Korea

Received 26 December 2006; received in revised form 23 February 2007; accepted 4 April 2007

Abstract

Purpose: To evaluate inter- and intra-observer variabilities in breast sonographic feature analysis and management, using the fourth edition of the Breast Imaging Reporting and Data System (BI-RADS).

Materials and methods: We included 136 patients with 150 breast lesions who underwent breast ultrasound (US) and ultrasound-guided core needle biopsy. A pathological diagnosis was available for all 150 lesions: 77 (51%) malignant and 73 (49%) benign. Four radiologists retrospectively reviewed sonographic images of the lesions twice within an 8-week interval. The observers described each lesion using BI-RADS descriptors and a final assessment. Inter- and intra-observer variabilities were assessed with Cohen's kappa statistic. Positive predictive value (PPV) and negative predictive value (NPV) for final assessment were also calculated.

Results: For inter-observer agreement on sonographic descriptors, agreement was substantial for lesion calcification and final assessment (κ = 0.61 for both), moderate for lesion shape, orientation, boundary, and posterior acoustic features (κ = 0.49, 0.56, 0.59, and 0.49, respectively), and fair for lesion margin and echo pattern (κ = 0.33 and 0.37, respectively). For intra-observer agreement, substantial to perfect agreement was found for almost all lesion descriptors and final assessments. NPV for final assessment category 3 was 95%. PPVs for final assessment categories 4 and 5 were as follows: category 4a, 26%; category 4b, 89%; category 4c, 90%; and category 5, 97%.

Conclusion: Because inter- and intra-observer agreement with the BI-RADS lexicon for US is good, the use of the BI-RADS lexicon can provide accurate and consistent description and assessment for breast US.

© 2007 Elsevier Ireland Ltd. All rights reserved.

Keywords: Breast ultrasound; BI-RADS lexicon; Inter-observer variability; Intra-observer variability

∗ Corresponding author: E.-K. Kim. Tel.: +82 2 2228 7400; fax: +82 2 393 3035.
0720-048X/$ – see front matter © 2007 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.ejrad.2007.04.008

1. Introduction

The American College of Radiology (ACR) developed the Breast Imaging Reporting and Data System (BI-RADS) lexicon for mammography to standardize mammographic reporting [1,2]. Until recently, BI-RADS had been applied only to mammography and not to other breast imaging techniques. Although mammography is recognized as the best method of screening for breast cancer, breast ultrasound (US) has become well established as a valuable imaging technique [3]. With the broad use of US, the ACR recently provided a BI-RADS lexicon for breast US to standardize sonographic descriptors of lesions [4,5]. This lexicon includes descriptors of features such as mass shape, orientation, margin, posterior acoustic features, vascularity, and other sonographic features. Although there have been some controversies regarding the utility of US for determining the likelihood of malignancy of solid breast masses [6,7], several studies have suggested that sonographic appearance can be useful in differentiating benign from malignant solid breast masses [8–10]. Observer variability in the sonographic interpretation of breast masses is inherent, and this variability is attributable to differences in lesion detection and to variation in lesion characterization and subsequent management. However, to our knowledge, variability in the use of BI-RADS terminology for breast US has not been widely studied [11].


Therefore, the purpose of this study was to retrospectively assess inter- and intra-observer variabilities in sonographic feature analysis and management of breast lesions.

2. Materials and methods

2.1. Patient population

Institutional review board approval was obtained for this retrospective study, and informed patient consent was not required. Between January 2003 and May 2003, 295 consecutive women with 324 breast lesions underwent bilateral whole breast US and US-guided core needle biopsy. Sonograms were obtained by three radiologists using ATL HDI 5000 and 3000 ultrasound units (Philips-Advanced Technology Laboratories, Bothell, WA, USA) with 10 MHz linear-array transducers. Compound imaging was performed in all cases examined with the HDI 5000 unit. All sonography-guided biopsies were performed with a freehand technique using automated guns (Pro-Mag 2.2, Manan Medical Products, Northbrook, IL) and 14-gauge Tru-cut needles with a 22 mm throw. Of these 324 lesions, we included only those that were pathologically proven to be malignant and those that were benign and remained stable for at least 2 years of follow-up. High-risk lesions such as atypical ductal hyperplasia, papilloma, phyllodes tumor, and lobular carcinoma in situ were not included. Thus, a total of 150 lesions in 136 patients met the selection criteria and were included in this study. In 14 patients, more than one lesion had been biopsied. The 136 patients ranged in age from 21 to 76 years (mean, 46.7 years). Of the 150 lesions, 72 had undergone US-guided core needle biopsy only, 7 had undergone both US-guided core needle biopsy and directional vacuum-assisted removal, and 71 had undergone US-guided core needle biopsy and surgical excision. As a rule, category 3 and 4a lesions that proved to be benign at US-guided core biopsy were followed up for at least 2 years; however, surgery was performed if the patient or physician wanted the lesion removed.
If a category 4b, 4c, or 5 lesion proved to be benign at core biopsy, repeat biopsy or excision was recommended. The lesions ranged in size from 3 to 45 mm (mean, 14.1 mm) in maximal dimension. Seventy-seven lesions were malignant: infiltrating ductal carcinoma (IDC) in 62 lesions, ductal carcinoma in situ (DCIS) in 12, and medullary carcinoma in 3. Seventy-three lesions were benign: fibroadenoma in 19 lesions, fibrocystic change in 18, fibroadenomatous hyperplasia in 15, stromal fibrosis in 12, fat necrosis in 3, adenosis in 3, ductal epithelial hyperplasia in 2, and papillary epithelial hyperplasia in 1.

2.2. Retrospective reviews of breast sonography

One investigator (L.H.J.) selected two representative sonographic images, transverse and longitudinal, for each lesion and converted them into TIFF (Tagged Image File Format) files at 300 dots per inch (dpi). She then arranged them in random order in Microsoft PowerPoint XP. Four breast radiologists (Y.J.H., K.M.J., K.E.K., and L.J.Y.) with 1, 4, 10, and 1 years of breast imaging experience, respectively, reviewed the sonographic images. To evaluate inter-observer agreement, each radiologist independently evaluated the images. To assess intra-observer agreement, each radiologist re-evaluated the sonographic images in random order, on average 6 weeks after the first review. All observers were blinded to the clinical information and pathologic results of each case, as well as to the ratio of malignant to benign lesions included in the study. The radiologists evaluated the images according to the BI-RADS lexicon for US. The contents of the BI-RADS descriptors are shown in Table 1. Each observer was instructed to choose the most appropriate term to describe each lesion. In this study, the agreement among observers for the presence of associated findings and special cases could not be evaluated. Although there are seven categories (from 0 to 6) in the BI-RADS lexicon, we used only BI-RADS category 3 (probably benign), 4a (suspicious malignancy with low likelihood), 4b (suspicious malignancy with intermediate likelihood), 4c (suspicious malignancy with moderate likelihood), and 5 (highly suggestive of malignancy), because lesions of BI-RADS categories 0, 1, 2, and 6 had already been excluded from this study.

2.3. Statistical analysis

Kappa statistics were calculated using the SAS system (MAGREE SAS Macro program) to assess the proportion of inter- and intra-observer agreement beyond that expected by chance [12]. The overall kappa value for multiple observers and multiple categories was estimated according to the work of Landis and Koch [13]. A value of κ = 1.0 corresponds to complete agreement, 0 to agreement no better than chance, and a value below 0 to disagreement.
Landis and Koch [13] have suggested that a kappa value (κ) equal to or less than 0.20 indicates slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement. Additionally, diagnostic indices including sensitivity, specificity, positive predictive value (PPV) for categories 4 and 5, and negative predictive value (NPV) for category 3 were calculated on the basis of the first review.

Table 1
The contents of BI-RADS descriptors for breast sonography

Shape                         Oval; round; irregular
Orientation                   Parallel; non-parallel
Margin                        Circumscribed
                              Not circumscribed: indistinct, angular, microlobulated, spiculated
Lesion boundary               Abrupt interface; echogenic halo
Echo pattern                  Anechoic; hyperechoic; complex; hypoechoic; isoechoic
Posterior acoustic features   Absent; enhancement; shadowing; combined
Calcification                 Macrocalcification; microcalcification in a mass; microcalcification out of a mass; microcalcification in and out of a mass
Final assessment              Category 3; category 4 (a, b, c); category 5
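The agreement statistic described in Section 2.3 can be illustrated with a short sketch. This is not the MAGREE macro used in the study: it assumes a Fleiss-type kappa for multiple raters, which is the usual reading of reference [12], and the function names `fleiss_kappa` and `landis_koch` are hypothetical. The labelling function simply encodes the Landis and Koch thresholds quoted above, applied here to four overall κ values reported in the Abstract.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss-type kappa for several raters (illustrative sketch).

    counts[i][j] is the number of raters who placed lesion i in category j.
    Every lesion must be rated by the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each lesion needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    total = n_subjects * n_raters
    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-lesion agreement: fraction of rater pairs that chose the same category.
    p_les = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_les) / n_subjects
    expected = sum(p * p for p in p_cat)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa: float) -> str:
    """Map a kappa value to the agreement label quoted in Section 2.3."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Overall inter-observer kappa values as reported in the Abstract.
reported = {"margin": 0.33, "echo pattern": 0.37, "shape": 0.49, "calcification": 0.61}
labels = {name: landis_koch(value) for name, value in reported.items()}
```

With four observers, each row of `counts` sums to 4; a row such as `[4, 0, 0]` represents a lesion on which all observers chose the same descriptor.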


Table 2
Inter-observer variability in BI-RADS descriptor and assessment category

BI-RADS lexicon for US        Descriptors                              κ-value

Shape                         Oval                                     0.47
                              Round                                    0.32
                              Irregular                                0.58
                              Overall                                  0.49

Orientation                   Parallel                                 0.56
                              Not parallel                             0.56
                              Overall                                  0.56

Margin                        Circumscribed                            0.42
                              Indistinct                               0.2
                              Angular                                  0.21
                              Microlobulated                           0.25
                              Spiculated                               0.66
                              Overall                                  0.33

Boundary                      Abrupt interface                         0.59
                              Echogenic halo                           0.59
                              Overall                                  0.59

Echo pattern                  Anechoic                                 0.00
                              Hyperechoic                              0.00
                              Complex                                  0.13
                              Hypoechoic                               0.41
                              Isoechoic                                0.38
                              Overall                                  0.37

Posterior acoustic features   Absent                                   0.47
                              Enhancement                              0.5
                              Shadowing                                0.59
                              Combined                                 0.14
                              Overall                                  0.49

Calcification                 Absent                                   0.64
                              Macrocalcification                       0.00
                              Microcalcification in a mass             0.6
                              Microcalcification out of a mass         0.00
                              Microcalcification in and out of a mass  0.53
                              Overall                                  0.61

Final assessment              Category 3                               0.58
                              Category 4a                              0.57
                              Category 4b                              0.09
                              Category 4c                              0.38
                              Category 5                               0.71
                              Overall                                  0.53

Fig. 1. A 30-year-old woman with a palpable breast mass. This is a representative case that showed perfect inter-observer agreement for the BI-RADS lexicon: all observers described the mass as having oval shape, parallel orientation, circumscribed margin, abrupt interface for lesion boundary, hypoechoic echogenicity, no posterior acoustic feature, and no calcification. The mass was finally categorized as "category 3" by all observers and was pathologically confirmed as a fibroadenoma. (a) Transverse image; (b) Longitudinal image.

3. Results

3.1. Inter-observer and intra-observer variabilities

The inter-observer variability for each descriptor and final assessment category is summarized in Table 2. We obtained a relatively high degree of inter-observer agreement for the BI-RADS lexicon (Figs. 1 and 2). Moderate agreement among our radiologists was seen in describing shape, orientation, boundary, and posterior acoustic features (κ = 0.49, 0.56, 0.59, and 0.49, respectively). Overall agreement for the margin of the mass was fair (κ = 0.33) (Fig. 3), even though agreement was moderate for circumscribed margins (κ = 0.42) and substantial for spiculated margins (κ = 0.66). When we classified the margin as circumscribed or not circumscribed, overall agreement increased to moderate (κ = 0.48). In assessing the echo pattern, overall agreement was fair (Fig. 4). For calcification, overall agreement was substantial. The kappa value was zero for the descriptors "anechoic", "hyperechoic", "macrocalcification", and "microcalcification out of a mass" because these terms were used rarely or not at all (anechoic, n = 0; hyperechoic, n = 1; macrocalcification, n = 0; microcalcification out of a mass, n = 3). When the final category included the sub-classification of category 4 into 4a, 4b, and 4c, overall agreement was moderate (κ = 0.53). However, overall agreement increased to substantial (κ = 0.62) when we classified the final category as 3, 4, or 5.

Intra-observer variability for each descriptor and final assessment category is summarized in Table 3. Moderate to substantial agreement was achieved in describing shape, orientation, boundary, echo pattern, posterior acoustic features, calcification, and final assessment, whereas agreement was fair in describing margin.
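The effect of merging subcategories before computing agreement, as done above for the final assessment (4a, 4b, 4c merged into 4), can be sketched with a two-reading kappa of the kind used for intra-observer comparison. The readings below are hypothetical and are not study data; `cohen_kappa` and `collapse` are illustrative names, and the sketch assumes the plain unweighted Cohen's kappa named in the Abstract.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(first: Sequence[str], second: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two readings of the same lesions."""
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("both readings must cover the same, non-empty set of lesions")
    # Observed agreement: share of lesions given the same category twice.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal category frequencies of each reading.
    freq_first, freq_second = Counter(first), Counter(second)
    expected = sum(freq_first[c] * freq_second[c] for c in freq_first) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1.0 - expected)


def collapse(category: str) -> str:
    """Merge subcategories 4a, 4b, and 4c into a single category 4."""
    return "4" if category in {"4a", "4b", "4c"} else category


# Hypothetical first and second readings of eight lesions (not study data).
read_1 = ["3", "4a", "4b", "4c", "5", "4a", "3", "5"]
read_2 = ["3", "4b", "4a", "4c", "5", "4a", "3", "5"]

kappa_fine = cohen_kappa(read_1, read_2)
kappa_collapsed = cohen_kappa(
    [collapse(c) for c in read_1], [collapse(c) for c in read_2]
)
```

In this toy example the only disagreements are between 4a and 4b, so they vanish once the subcategories are merged and the collapsed kappa is higher. Merging also changes the chance-agreement term, so the direction of the change is not guaranteed in general.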


Fig. 2. A 39-year-old woman with a palpable breast mass. This is a representative case that showed perfect inter-observer agreement for the BI-RADS lexicon: all observers described the mass as having irregular shape, non-parallel orientation, spiculated margin, echogenic halo for lesion boundary, hypoechoic echogenicity, posterior acoustic shadowing, and no calcification. The mass was finally categorized as "category 5" by all observers and was pathologically confirmed as a malignant mass. (a) Transverse image; (b) Longitudinal image.

Table 3
Intra-observer variability in BI-RADS descriptors and assessment category

BI-RADS descriptor            κ-value
                              Observer 1   Observer 2   Observer 3   Observer 4

Shape                         0.71         0.72         0.66         0.56
Orientation                   0.83         0.8          0.81         0.75
Margin                        0.59         0.53         0.53         0.61
Boundary                      0.85         0.56         0.8          0.75
Echo pattern                  0.67         0.81         0.74         0.72
Posterior acoustic features   0.82         0.68         0.69         0.67
Calcification                 0.8          0.9          0.84         0.73
Final assessment              0.77         0.72         0.79         0.73

Fig. 3. A 54-year-old woman with palpable mass. This is a representative case that showed fair inter-observer agreement for the mass margin. Each of the four observers used a different descriptor for the margin. The descriptive terms that were used were spiculated, angular, indistinct, and microlobulated. However, no one used the descriptor “circumscribed”. The mass was pathologically confirmed as malignant. (a) Transverse image; (b) Longitudinal image.

3.2. Sensitivity, specificity, positive predictive value, and negative predictive values

The diagnostic indices are summarized in Table 4. Sensitivity was 97–98% and specificity was 26–40%. With the results of all observers combined, PPVs for final assessment categories 4 and 5 were as follows: category 4a, 26% (63 of 240); category 4b, 83% (24 of 29); category 4c, 91% (108 of 119); and category 5, 96% (106 of 110). Overall PPV for category 4 was 50%. NPV was 95% (97 of 102).
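The pooled percentages quoted above follow directly from the stated counts. The short sketch below recomputes them; the variable and function names are illustrative, and the category 4 figure is obtained here by summing the three subcategory counts given in the text.

```python
# Pooled counts quoted in Section 3.2: (malignant lesions, assessments in category).
pooled_ppv_counts = {
    "4a": (63, 240),
    "4b": (24, 29),
    "4c": (108, 119),
    "5": (106, 110),
}
# Category 3: (benign lesions, assessments in category).
pooled_npv_counts = (97, 102)


def percent(numerator: int, denominator: int) -> int:
    """Proportion expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


ppv = {cat: percent(m, n) for cat, (m, n) in pooled_ppv_counts.items()}
npv_category_3 = percent(*pooled_npv_counts)

# Category 4 as a whole, summing the 4a, 4b, and 4c counts.
malignant_4 = sum(pooled_ppv_counts[c][0] for c in ("4a", "4b", "4c"))
assessed_4 = sum(pooled_ppv_counts[c][1] for c in ("4a", "4b", "4c"))
ppv_category_4 = percent(malignant_4, assessed_4)
```

PPV here is the share of assessments in a category that were malignant at pathology, and NPV is the share of category 3 assessments that were benign.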


Table 4
Sensitivity, specificity, positive predictive value, and negative predictive values

Diagnostic index    Observer 1     Observer 2     Observer 3     Observer 4     Overall

Sensitivity (%)     73/75 (97)     76/77 (99)     76/77 (99)     76/77 (99)     301/306 (98)
Specificity (%)     30/75 (40)     21/73 (29)     27/73 (37)     19/73 (26)     97/294 (33)

PPV (%)
  4                 38/81 (47)     56/108 (52)    51/96 (53)     57/110 (52)    202/395 (51)
  4a                10/49 (20)     16/66 (24)     18/60 (30)     19/65 (29)     63/240 (26)
  4b                2/2 (100)      2/2 (100)      13/14 (93)     7/11 (67)      24/29 (83)
  4c                26/30 (87)     38/40 (95)     20/22 (91)     24/27 (89)     108/119 (91)
  5                 35/37 (95)     20/20 (100)    25/26 (96)     26/27 (96)     106/110 (96)

NPV (%)             30/32 (94)     21/22 (95)     27/28 (96)     19/20 (95)     97/102 (95)

Note: PPV, positive predictive value. NPV, negative predictive value.
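How the four indices in Table 4 relate to one another can be shown with a 2 × 2 table for a single observer. The sketch below uses the Observer 2 counts from Table 4 and assumes that a final assessment of category 4 or 5 counts as a positive call and category 3 as a negative call; the paper does not state this threshold explicitly, but the reported fractions are consistent with it. The class name `TwoByTwo` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts for one observer; 'positive' is assumed to mean category 4 or 5."""

    tp: int  # malignant lesions assessed as category 4 or 5
    fn: int  # malignant lesions assessed as category 3
    tn: int  # benign lesions assessed as category 3
    fp: int  # benign lesions assessed as category 4 or 5

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Observer 2 in Table 4: sensitivity 76/77, specificity 21/73, NPV 21/22.
observer_2 = TwoByTwo(tp=76, fn=1, tn=21, fp=73 - 21)

sensitivity_pct = round(100 * observer_2.sensitivity)
specificity_pct = round(100 * observer_2.specificity)
npv_pct = round(100 * observer_2.npv)
```

The `ppv` property in this sketch combines categories 4 and 5 into a single positive call, which is a figure Table 4 does not report directly, since it lists PPV per category.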

Fig. 4. A 57-year-old woman with a palpable mass. This is a representative case that showed fair inter-observer agreement for the echo pattern of the mass. The observers disagreed in describing the echo pattern of this mass: the descriptive terms used were complex, hypoechoic, and isoechoic. The mass was pathologically confirmed as malignant. (a) Transverse image; (b) Longitudinal image.

4. Discussion

Our results showed a relatively high degree of inter- and intra-observer agreement in the use of BI-RADS descriptors and final assessment categories. Moderate agreement was achieved for assessment of lesion shape, orientation, boundary, and posterior acoustic features. However, only fair agreement was found for the mass margin. Lazarus et al. [11] also reported fair agreement in the evaluation of the lesion margin with sonographic descriptors, which they attributed to the multitude of terms available to describe the margin. In this study, we also evaluated inter-observer agreement after simplifying the mass margin to "circumscribed" or "not circumscribed"; the overall agreement, the agreement for "circumscribed", and the agreement for "not circumscribed" were all moderate. Observers could decide without difficulty whether a margin was circumscribed. However, they found it hard to choose one of the four "not circumscribed" descriptors, because these terms may overlap within a single mass. We do not consider this a problem, because any of the four descriptors is a suspicious finding, so the final assessment would not be affected.

For echo pattern, agreement among radiologists was also fair. Lazarus et al. [11] speculated that the low rate of agreement for lesion echo pattern on sonograms suggests that observers had difficulty in choosing a single descriptor. Indeed, many solid breast lesions, especially large ones, can be described by more than one echo pattern descriptor. In such cases, it is difficult to describe the echo pattern with the existing BI-RADS lexicon (Fig. 4). We suggest adding the term "combined" for heterogeneous solid breast masses to the BI-RADS lexicon for mass echo pattern. This may improve inter-observer agreement for echo pattern, although echo pattern is not very useful in differentiating between benign and malignant masses [10,14].

We achieved substantial agreement for the final assessment category in this study. This is important because high agreement on the final assessment category is a major factor supporting the use of the BI-RADS lexicon for breast US.
Interestingly, substantial agreement was seen for the evaluation of calcification. US is known to be less sensitive than mammography for the detection of microcalcifications. However, most of the calcifications in our cases were within a mass, so they were more likely to be seen on US with current equipment. For this reason, substantial agreement for calcification could be achieved in this study [15].

No report on intra-observer variability in the sonographic BI-RADS lexicon has been available. When lesions were re-evaluated 8 weeks later by the same radiologists, generally substantial intra-observer agreement was achieved. This suggests that each observer has a clear concept of the BI-RADS descriptors. The high degree of intra-observer agreement in our study also supports the utility of the sonographic BI-RADS lexicon for interpreting breast US images.

Orel et al. [16] reported that mammographic BI-RADS categories have a PPV of 30% for category 4 lesions and 97% for category 5 lesions. In this study, we had a PPV of 51% for category 4 lesions and 96% for category 5 lesions. Thus, breast US with the BI-RADS lexicon may be as useful as mammography for predicting malignancy, although our case number was small (n = 150). For the subcategories, the PPV was 26% for category 4a lesions, 83% for category 4b lesions, and 91% for category 4c lesions. Lazarus et al. [11] suggested that the use of subcategories is helpful in communicating the level of suspicion to referring physicians and patients. However, in this study, the four radiologists infrequently categorized breast lesions as category 4b: only 29 (4.8%) of the 600 assessments (150 lesions × 4 observers) were category 4b. This may be because the observers had not undergone intensive training in the new version of the BI-RADS lexicon prior to this study. Another possible reason is the ambiguous concept of category 4b. Since the management of a category 4b lesion is indeterminate, radiologists may hesitate to assign a lesion to category 4b. In clinical practice, dividing category 4 into two subcategories (i.e. 4a and 4b) instead of three may be simpler and more useful. Stavros has also proposed this idea in his textbook [17].

We evaluated the NPV for malignancy in category 3 as well as the PPV for malignancy in categories 4 and 5. Lazarus et al. [11] reported only the PPV of categories 4a and 4b in their results. However, we think that NPV is an important diagnostic index of the usefulness of breast US with the BI-RADS lexicon. Graf et al. [18] reported that the NPV of category 3 was 100%.
However, their study group included only palpable, probably benign lesions that had undergone both mammography and sonography. The mean NPV of category 3 in our study was 95%. A total of 107 of 600 assessments (150 lesions × 4 observers) were interpreted as category 3 by the four observers in the first review, and of these, 5 proved malignant on pathology. This NPV is somewhat lower than the ACR recommendation. However, in our study the four observers assigned a final BI-RADS category to each lesion using only sonographic features. In practice, the final category is assigned according to the more suspicious of the mammographic and US findings. For instance, for a lesion that is categorized as benign by US but is suspicious on mammography, the final assessment is based on the most suspicious feature. Because we evaluated only US images, the NPV would likely be higher if both US and mammography were evaluated.

There are several limitations in our study. First, this study included only benign masses that underwent core needle biopsy and showed no change in size on 2-year follow-up US after biopsy. As a result, typical probably benign lesions on US were excluded from this study. This exclusion results in low specificity. In clinical practice, the specificity may be much higher. The second limitation is that two of the radiologists had performed US with core needle biopsy in some of the cases prior to this retrospective study. Therefore, there may be a learning effect associated with memorable cases. However, because there were only a few such cases in our study, this probably had little effect on our results. Third, the agreement among observers for the presence of associated findings and special cases could not be evaluated because observers evaluated static images focused on a mass. Although these associated findings can raise the PPV, the effect would be minimal.

In conclusion, because inter- and intra-observer agreement with the BI-RADS lexicon for US is good, the use of the BI-RADS lexicon can provide accurate and consistent description and assessment for breast US.

References

[1] American College of Radiology. Breast imaging reporting and data system. 2nd ed. Reston, VA: American College of Radiology; 1995.
[2] American College of Radiology. Breast imaging reporting and data system (BI-RADS). 3rd ed. Reston, VA: American College of Radiology; 1998.
[3] Hong AS, Rosen EL, Soo MS, Baker JA. BI-RADS for sonography: positive and negative predictive values of sonographic features. AJR Am J Roentgenol 2005;184(4):1260–5.
[4] American College of Radiology. BI-RADS: ultrasound. 1st ed. In: Breast imaging reporting and data system: BI-RADS atlas. 4th ed. Reston, VA: American College of Radiology; 2003.
[5] Mendelson EB, Berg WA, Merritt CR. Toward a standardized breast ultrasound lexicon, BI-RADS: ultrasound. Semin Roentgenol 2001;36(3):217–25.
[6] Hall FM. Sonography of the breast: controversies and opinions. AJR Am J Roentgenol 1997;169(6):1635–6.
[7] Jackson VP. Management of solid breast nodules: what is the role of sonography? Radiology 1995;196(1):14–5.
[8] Harper AP, Kelly-Fry E, Noe JS, Bies JR, Jackson VP. Ultrasound in the evaluation of solid breast masses. Radiology 1983;146(3):731–6.
[9] Rahbar G, Sie AC, Hansen GC, et al. Benign versus malignant solid breast masses: US differentiation. Radiology 1999;213(3):889–94.
[10] Stavros AT, Thickman D, Rapp CL, Dennis MA, Parker SH, Sisney GA. Solid breast nodules: use of sonography to distinguish between benign and malignant lesions. Radiology 1995;196(1):123–34.
[11] Lazarus E, Mainiero MB, Schepps B, Koelliker SL, Livingston LS. BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology 2006;239(2):385–91.
[12] Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley & Sons Inc.; 1981.
[13] Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159–74.
[14] Baker JA, Kornguth PJ, Soo MS, Walsh R, Mengoni P. Sonography of solid breast lesions: observer variability of lesion description and assessment. AJR Am J Roentgenol 1999;172(6):1621–5.
[15] Moon WK, Im JG, Koh YH, Noh DY, Park IA. US of mammographically detected clustered microcalcifications. Radiology 2000;217(3):849–54.
[16] Orel SG, Kay N, Reynolds C, Sullivan DC. BI-RADS categorization as a predictor of malignancy. Radiology 1999;211(3):845–50.
[17] Stavros AT. Introduction to breast ultrasound. In: Stavros AT, editor. Breast ultrasound. 1st ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2004. p. 1–15.
[18] Graf O, Helbich TH, Fuchsjaeger MH, et al. Follow-up of palpable circumscribed noncalcified solid breast masses at mammography and US: can biopsy be averted? Radiology 2004;233(3):850–6.