Spatial resolution requirements for acquisition of the virtual screening slide for digital whole-specimen breast histopathology

Human Pathology (2007) 38, 1764–1771

www.elsevier.com/locate/humpath

Original contribution

Spatial resolution requirements for acquisition of the virtual screening slide for digital whole-specimen breast histopathology☆

Gina M. Clarke MSc a,b,⁎, Judit T. Zubovits MD c, Marko Katic BA d, Chris Peressotti BASc a, Martin J. Yaffe PhD a,b

a Imaging Research, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada M4N 3M5
b Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada M4N 3M5
c Department of Pathology, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada M4N 3M5
d Research Design and Biostatistics, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada M4N 3M5

Received 28 November 2006; revised 27 March 2007; accepted 5 April 2007

Keywords: Virtual screening slide; Breast cancer; Pathology screening; Digitizing resolution; Whole-slide imaging

Summary We examined the effect of lateral spatial resolution and reader specialty on the accuracy of detection of breast cancer. The motivation for this pilot study was the need to acquire and display very large data sets in whole-specimen 3D digital breast histopathology imaging. The ultimate goal is to determine the minimum resolution adequate for detection of malignancy. Twenty-three histologic slides were selected from breast pathology cases and digitized at 2 sampling distances (3.2 and 1.9 μm pixels). Images were viewed by 14 pathologists, of whom 5 had breast pathology as their primary specialty. The readers assessed the likelihood of malignancy on a 5-point Likert scale, and provided a provisional diagnosis. For the detection task, sensitivity, specificity, overall accuracy of detection, and area under the receiver-operator curve were calculated. An overall diagnostic score, and scores grouped by malignancy type, were also computed. Outcome measures were examined for significant resolution and specialty effects. Increasing the lateral resolution significantly improved accuracy in diagnosis (P = .004) but no effect was found for detection. Breast specialists achieved significantly higher scores for all outcome measures except specificity. Differences in performance between the 2 groups of readers tended to be greater for the diagnostic task compared to detection, especially at the higher resolution. However, specimen coverage may also be a significant factor. Factors related to the readers may have also affected performance in this study. Based on these results, a more comprehensive study should examine pixel sizes between 0.7 and 1.9 μm. © 2007 Elsevier Inc. All rights reserved.

☆ This work received funding from the Canada Foundation for Innovation and the Ontario Research and Development Challenge Fund, and was supported in part by a Terry Fox Program Project Grant through the National Cancer Institute of Canada.
⁎ Corresponding author. Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada M4N 3M5. E-mail address: [email protected] (G. M. Clarke).

0046-8177/$ – see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.humpath.2007.04.006

1. Introduction

Economic and staffing constraints make the use of telepathology for diagnosis and intraoperative consultation a necessity in many centers. Telepathology has been a primary motivator for the development of digital pathology imaging systems. Various approaches to telepathology provide different degrees of coverage, meaning the fraction of tissue that is digitized and viewed. In static telepathology, selected “snapshots” of slides are electronically sent for consultation, and in the dynamic approach, the consulting pathologist navigates the slide on a remote microscope, still viewing one field-of-view at a time [1]. Whole-slide imaging systems that can digitize an entire slide into a high-resolution, composite image or “virtual slide” are replacing the static, as well as dynamic and hybrid approaches [1-6]. Coverage is usually lowest in static telepathology, and diagnostic accuracy (concordance with glass-slide diagnosis) can vary widely (76%-100%) depending in part on how the images are selected [1,7-9]. On the other hand, the virtual slide yields a more reliable diagnosis [1,6,10,11].

Digital pathology imaging makes it possible to view and navigate over large volumes of data, which are typically captured from conventional histology slides. To further enhance specimen coverage, we are developing a system for 3-dimensional (3D) breast histopathology imaging of whole specimens using large slides instead. The specimen is prepared into whole-mount serial sections (about 150 sections of 70 × 70 mm from an average specimen) while it is maintained as close as possible to the in vivo conformation [12]. The imaging hardware consists of a modified transmission microscope with a camera and large-area translation stage [13]. Compared to digital imaging of conventional, small-format histology, the 3D, whole-specimen approach offers far greater coverage of the specimen. Increased coverage is desirable because undersampling is associated with failure to detect malignancy [14,15].
This, in turn, may be associated with underestimating total tumor area, missing a close or involved margin that requires secondary treatment, or failing to correctly identify disease as multifocal or unifocal.

Increasing either coverage or the lateral digitizing resolution increases the amount of image data that must be processed, stored, and displayed. The best resolution available at a diagnostic magnification (20×) is 0.7 μm.1 To image the entire specimen in the 3D approach at this resolution would produce hundreds of trillions of bytes of image data. In conventional small-format histopathology, work has been done to reduce the amount of image data using sophisticated image compression and other computational methods [16,17].

Breast cancer is commonly initially detected with screening mammography. Suspicious lesions are subsequently characterized by a “diagnostic” imaging workup. We are investigating a similar approach for histopathology to manage the enormous amount of image data. Our approach is to digitize at a reduced sampling rate (number of samples/mm), but one that still permits detection of abnormalities in a 3D presentation of the entire tissue sample. Then, for definitive diagnosis, suspicious areas can be revisited, if necessary, by automatic stage repositioning for viewing at a higher, diagnostic magnification. This strategy, which separates the processes of screening for abnormalities and diagnosis, is based on the assumption that abnormalities can be detected by using architecture-based patterns alone with limited cellular or nuclear detail, and that a level of resolution that is subdiagnostic is adequate to capture such features. Unlike commercially available virtual slide processors that can digitize conventional size slides at a diagnostic resolution, the “virtual screening slide” is, at present, a research tool that is optimized for screening serial whole-mount sections using a lower resolution.

We examine the effect of lateral resolution while considering the effect of pathologist specialty by using 2 categories of readers: those whose primary specialty is breast, and those who specialize in other sites. Concentration of expertise and the effect of volume have been studied in detail for other specialties, and have been associated with improved outcome after liver and kidney transplantation, for example [18]. For breast, there is some evidence for improved accuracy with specialization [19]. However, this association has not been extensively studied and most hospitals continue to operate with generalists.

The lateral resolution available from the digital image is determined by 2 major factors: the optical resolution of the microscope and the sampling performed during the digitization process. The optical resolution depends on the objective NA and the wavelength of light. The limiting factor for resolution in a digital image is usually the sampling performed in digitization, which is determined by the camera pixel size (pixelation). The central question in this approach, which this study begins to explore, is the choice of (optimal) pixelation, ie, what is the coarsest pixelation that achieves detection sensitivity equivalent to that of conventional histopathologic evaluation under a microscope? In this pilot study, we test for differences in the rate of detection as well as diagnosis due to lateral resolution and reader specialty effects, using 2 categories for each, and using standard-sized slides.

1 This assumes that a plan achromat objective is used (numerical aperture [NA] = 0.40) and is calculated by using the Rayleigh criterion for the minimum distance between 2 adjacent dots that appear distinct: $0.61\lambda/\mathrm{NA}$, where λ is an average wavelength of light.
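The relationship between the optical limit and the sampling limit can be illustrated with a short sketch. It is not part of the original study: the wavelength of 0.46 μm is an assumption chosen only because it reproduces the quoted 0.7 μm at NA = 0.40, and the rule that 2 pixels are needed per resolvable distance is the usual Nyquist assumption rather than a figure from the paper. The NA and pixel-size pairs are those reported in the Methods.

```python
def rayleigh_limit_um(numerical_aperture, wavelength_um=0.46):
    """Minimum resolvable separation (micrometres) by the Rayleigh criterion.

    wavelength_um is an assumed average wavelength, not a value from the paper.
    """
    return 0.61 * wavelength_um / numerical_aperture


def sampling_limit_um(pixel_size_um):
    """Finest separation preserved by sampling, assuming 2 pixels per resolvable distance."""
    return 2.0 * pixel_size_um


def limiting_factor(numerical_aperture, pixel_size_um, wavelength_um=0.46):
    """Return 'sampling' or 'optics', whichever imposes the coarser limit."""
    optical = rayleigh_limit_um(numerical_aperture, wavelength_um)
    sampled = sampling_limit_um(pixel_size_um)
    return "sampling" if sampled > optical else "optics"


if __name__ == "__main__":
    # Objective NA and effective pixel size pairs reported in the Methods.
    for na, pixel_um in [(0.07, 3.2), (0.12, 1.9)]:
        print(na, pixel_um,
              round(rayleigh_limit_um(na), 2),
              sampling_limit_um(pixel_um),
              limiting_factor(na, pixel_um))
```

Under these assumptions both study configurations come out sampling-limited, which is consistent with the statement that pixelation is usually the limiting factor.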

2. Methods

The study was performed using readers and cases from Sunnybrook and Women's College Health Sciences Centre.2 A retrospective set of 23 breast cases from 2002 to 2004 was collected, and 14 pathologists performed readings of the case set from February 2005 to May 2005.

2 Now Sunnybrook Health Sciences Centre.

Fig. 1 The multiresolution viewing program. The entire virtual screening slide is displayed at 1/32 of the digitizing resolution (A). A feature of interest is “zoomed,” or displayed at a higher resolution (1/8 of full resolution) (B). A portion of the slide is shown at the highest resolution available (ie, the digitizing resolution) (C). One field-of-view covers 1 × 1 mm² of tissue. Images can be annotated in the viewing program (D).
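The three display levels in Fig. 1 can be thought of as an image pyramid. The sketch below is only an illustration of that idea and makes no claim about how the viewing program is implemented: it assumes that 1/32 and 1/8 are linear downsampling factors, uses simple block averaging on a grayscale image stored as a list of rows, and the function names are hypothetical.

```python
def downsample(image, factor):
    """Block-average a grayscale image (list of equal-length rows) by an integer factor."""
    height, width = len(image), len(image[0])
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect the pixels in this block, clipping at the image edge.
            block = [image[y][x]
                     for y in range(top, min(top + factor, height))
                     for x in range(left, min(left + factor, width))]
            row.append(sum(block) / len(block))
        reduced.append(row)
    return reduced


def build_pyramid(image, factors=(32, 8, 1)):
    """Return one reduced image per factor: overview, intermediate zoom, full resolution."""
    return {factor: downsample(image, factor) for factor in factors}


if __name__ == "__main__":
    # Synthetic 64 x 64 test image, not study data.
    test_image = [[(x + y) % 256 for x in range(64)] for y in range(64)]
    pyramid = build_pyramid(test_image)
    for factor, level in pyramid.items():
        print(factor, len(level), len(level[0]))
```

A viewer built this way only needs to load the coarse level to show the whole slide and fetches finer levels for the region being zoomed, which is what makes the approach computationally efficient.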

2.1. Case selection and digitization

The cases were selected by a “reference” pathologist specializing in breast cancer to represent a test of the digital image quality rather than of reader expertise. Therefore, the selection included only cases for which it was deemed that a pathologist who does not specialize in the breast could readily formulate the correct diagnosis when viewing the slide in the conventional manner, without any assistance from special stains or consultation. The “truth” diagnoses were established by microscopic reading in the conventional manner, by the reference pathologist. A second reference pathologist, also specializing in breast cancer, established consensus for the truth diagnoses for the cases.3

3 Only minor discrepancies in the truth diagnoses occurred (eg, “other fibrocystic changes”) between the 2 reference pathologists. When these occurred, the feature was not included in the computation of diagnostic scores.

Because reader time was limited, the case selection was heavily enriched with cancers: 15 cases contained at least 1 malignant feature (some with additional benign lesions), 3 had only benign lesions, and 5 showed no pathologic abnormality. The malignant features consisted of in situ carcinoma (10 occurrences), invasive disease (9 occurrences), and lymphovascular invasion (LVI) (3 occurrences). There were a total of 20 benign lesions present in the case set.

The slides were digitized at 2 levels: a “lower resolution” image was produced using an effective pixel size of 3.2 μm, and a “higher resolution” image using 1.9 μm pixels. In an initial reading session with the reference pathologist, these levels were estimated as being adequate for screening. The 2 pixelations were produced using the 2.5× and 5× microscope objectives, respectively (NA = 0.07 and 0.12). The overall magnification was then reduced by a 0.63× coupler lens used to mount the digitizing camera and increase its field-of-view. Each image was constructed by scanning the slide over the optical path, 1 field-of-view at a time, and assembling the set of “tiles” into a virtual slide. Artifacts such as vignetting and discontinuity at the boundaries between tiles were reduced to an imperceptible level [13]. Automatic refocusing during scanning was usually not required at the lower resolution, and this reduced acquisition time considerably. The images were acquired in 24-bit RGB tagged image format (.tif), and converted to portable network graphics (.png) to take advantage of lossless compression (factor of 1.5:1 for a typical histology image).
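To give a feel for how pixel size drives data volume, the following sketch applies the acquisition parameters above (24-bit RGB, 1.5:1 lossless compression) to the whole-mount geometry quoted in the Introduction (about 150 sections of 70 × 70 mm). It is a rough, illustrative estimate only: it ignores tile overlap, multiple focal planes, and file-format overhead, it assumes the 1.5:1 factor carries over to whole-mount sections, and the function names are hypothetical.

```python
def pixels_per_side(length_mm, pixel_size_um):
    """Number of pixels needed to span a length at a given pixel size."""
    return round(length_mm * 1000.0 / pixel_size_um)


def image_bytes(width_mm, height_mm, pixel_size_um, bytes_per_pixel=3):
    """Uncompressed size of one image; 3 bytes per pixel corresponds to 24-bit RGB."""
    return (pixels_per_side(width_mm, pixel_size_um)
            * pixels_per_side(height_mm, pixel_size_um)
            * bytes_per_pixel)


def specimen_gigabytes(pixel_size_um, sections=150, side_mm=70.0, compression=1.5):
    """Approximate compressed size (GB, 1e9 bytes) of one whole specimen."""
    raw_bytes = sections * image_bytes(side_mm, side_mm, pixel_size_um)
    return raw_bytes / compression / 1e9


if __name__ == "__main__":
    for pixel_um in (3.2, 1.9):
        print(pixel_um, round(specimen_gigabytes(pixel_um), 1))
```

Because the pixel count scales with the inverse square of the pixel size, moving from 3.2 to 1.9 μm pixels increases the estimate by a factor of roughly (3.2/1.9)², or about 2.8.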

2.2. Reader selection and presentation of cases

The set of readers consisted of 12 staff pathologists, 1 research fellow, and 1 visiting pathologist. All readers were board-certified pathologists. The pathology department at Sunnybrook Health Sciences Centre is site-specific in that each pathologist has a “primary” and “secondary” site of expertise. Five of the readers specialized in breast pathology and devoted at least 50% of their time to this area. The department processes approximately 2100 breast cancer cases in a year from the surgical and radiology departments within the hospital, as well as another 1700 cases in consultation for outside hospitals.

Each pathologist participated in 2 interpretation sessions that were separated by a “washout” period of at least 4 weeks to reduce bias due to recollection of a previously read case. In each session, the whole set of 23 virtual slides was presented at either the higher or lower resolution. The order of higher-lower sessions, as well as the order of images, was randomized using a computer-generated random permutation.

Images were viewed on a high-resolution 9-megapixel IBM T221 monitor. This monitor displays up to 9 times the area of tissue compared to a conventional computer monitor. Because each image cannot be displayed in its entirety by any monitor, a display program “TileView” was written to display the images in the computationally efficient, multiresolution approach illustrated in Fig. 1. The viewer can navigate over an image by using “pan” and “zoom” operations, and annotate the images by using a region of interest tool that stores selected coordinates in a database.

Each reader completed the questionnaire shown in Table 1 for each image viewed. For screening, the likelihood of malignancy was indicated on the 5-point Likert scale in the leftmost column. For those cases where the likelihood was not “negative for malignancy,” the suspected diagnosis was indicated in the rightmost column. Any benign features believed to be present were indicated in the middle column.

2.3. Statistical analysis

A 2-factor experimental design was used, with 2 levels for each factor [20]. The first factor, lateral resolution, was a “within” factor because the 2 resolution levels, “lower” and “higher,” were applied to all readers. The second factor, specialty, was a “between” factor, because each reader was classified as either a breast specialist or a nonspecialist. An initial overall test of the 2 factors on all outcome measures or test variables was achieved using a multivariate analysis of variance (ANOVA). Subsequent, univariate repeated-measures ANOVA were performed for each outcome measure to examine resolution effects, specialty effects, and interaction effects between them (eg, if a significant resolution effect exists for 1 specialty but not the other) [20]. A level of 0.95 was used to report significance. Four outcome measures were calculated for performance in screening or detection, for each reader and session. Sensitivity was calculated as the fraction of cases in which malignancy was detected, out of the 15 cases containing 1 or more malignant feature. Specificity was calculated as the fraction of cases, of the 8 with no malignancy, in which no malignancy was detected, and accuracy is the fraction of all 23 cases accurately called as malignant or normal/benign. The tradeoff between sensitivity and specificity for the set of thresholds for calling malignancy was investigated using standard receiver operating characteristic (ROC) analysis [21-23]. The area under the ROC curve (Az), frequently used as an index of performance for all diagnostic thresholds or degrees of uncertainty, was calculated by using trapezoidal integration [24]. For an ideal system, this index equals unity. Each of the 4 test variables provides a useful descriptor of an aspect of the system performance, although sensitivity is the most relevant in selecting a digitizing resolution for screening. 
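The four screening measures described in this section can be sketched in a few lines. This is an illustration only, not the analysis code used in the study (which was run in SAS, NCSS and PASS). The numeric coding of the Likert scale (1 = least suspicious, 5 = most suspicious) is an assumption, the default threshold of 2 corresponds to calling “probably negative” or anything more suspicious a positive finding, and the example ratings are made up rather than taken from the study.

```python
def confusion_counts(ratings, truth, threshold):
    """Count TP, FP, TN, FN when ratings >= threshold are called malignant."""
    tp = fp = tn = fn = 0
    for rating, malignant in zip(ratings, truth):
        called_positive = rating >= threshold
        if malignant and called_positive:
            tp += 1
        elif malignant:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def screening_metrics(ratings, truth, threshold=2):
    """Sensitivity, specificity and accuracy for one reader and one session."""
    tp, fp, tn, fn = confusion_counts(ratings, truth, threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(truth),
    }


def roc_area(ratings, truth, levels=(1, 2, 3, 4, 5)):
    """Area under the empirical ROC curve by trapezoidal integration."""
    points = [(0.0, 0.0)]
    # Sweep the threshold from strictest to most lenient to trace the curve.
    for threshold in sorted(levels, reverse=True):
        tp, fp, tn, fn = confusion_counts(ratings, truth, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up example: 4 malignant and 4 benign/normal cases (not study data).
EXAMPLE_TRUTH = [True, True, True, True, False, False, False, False]
EXAMPLE_RATINGS = [5, 4, 2, 1, 1, 1, 3, 2]

if __name__ == "__main__":
    print(screening_metrics(EXAMPLE_RATINGS, EXAMPLE_TRUTH))
    print(roc_area(EXAMPLE_RATINGS, EXAMPLE_TRUTH))
```

In the study design these quantities would be computed once per reader and per session, giving the values that enter the repeated-measures ANOVA.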
The ROC approach, along with the Likert scale, is commonly used in radiology to assess the ability to detect malignancy from images. The threshold for calling malignancy was set to “probably negative for malignancy”; a response at this level and anything more suspicious was considered a positive finding. The use of such an aggressive threshold was justified because this was a screening technique, where positive findings would trigger subsequent inspection at higher spatial resolution, but would impose no further invasive procedures on the patient.

For performance in diagnosis, the effect of resolution as well as reader specialty was assessed using 5 response variables. An overall diagnostic score (“diagnostic accuracy”) was calculated per reader, per session, as the proportion of the total number of malignant features in the case set that were accurately identified. Pathology-dependent diagnostic scores were also calculated from the diagnostic scores for occurrences of LVI, invasive cancer, in situ cancer, and benign lesions separately. All statistical calculations were performed using SAS version 8.2 (SAS Institute Inc, Cary, NC), NCSS 2004, and PASS 2002 (NCSS and PASS Number Cruncher Statistical Systems, Kaysville, UT).

We performed a simple failure analysis in an attempt to understand the reasons for false positive and false negative interpretations, using the specialist group of readers. Categories of likely causes for failure were identified by the reference pathologist and used to classify the images that were incorrectly identified as malignant or benign by at least one reader. Any images that were correctly identified by fewer than half of the readers were further examined by reading in the conventional manner, to determine whether failure was due to inadequate resolution or other digitization effects rather than reader expertise.

Table 1 The study instrument

Table 2 Summary of significant effects (P < .05) on performance in screening and diagnosis

Column headings: Lateral resolution; Specialty
Screening: Sensitivity (P = .006, Δ = 0.14); Accuracy (P = .009, Δ = 0.081); ROC area (P = .009, Δ = 0.080)
Diagnosis: Overall (P = .004, Δ = 0.12); Overall (P = .001, Δ = 0.22); In situ a (P = .057, Δ = 0.09); LVI (P = .02, Δ = 0.12); Invasive (P = .003, Δ = 0.18)

The difference (Δ) in the test statistic between higher and lower resolution or specialist and nonspecialist is also shown. Interaction effects could not be detected (ie, the resolution effects consider both specialties together, and the specialty effects are reported for both resolutions together).
a Denotes a borderline effect.

Fig. 2 Box-and-whisker plots for the test variables for screening, for specialists (Sp) and nonspecialists (NSp) at lower (Lo) and higher resolution (Hi). Horizontal lines appear at the median, upper, and lower quartile values, notches indicate a confidence interval about the median, and whiskers show the extent of the data. Outliers, which are more than 1.5 times the interquartile range above or below the box, are indicated by a square. For each pair of boxes, notches that do not overlap indicate that the medians differ significantly (http://www.mathworks.com/access/helpdesk/help/toolbox/stats/boxplot.html).

Fig. 3 ROC curves for specialists (A), and nonspecialists (B) at higher (Hi) and lower (Lo) resolution. The true positive fraction (TPF) is the proportion of cases with malignancy that were called correctly, and the true negative fraction (TNF) is the proportion of normal or benign cases that were called correctly. The mean area (±mean error) under the curves at (lower resolution, higher resolution) are (0.86 ± 0.03, 0.86 ± 0.02) averaged over the specialist readers, and (0.78 ± 0.02, 0.78 ± 0.02) for nonspecialists.

3. Results

The significant effects that were detected are summarized in Table 2. Here, Δ is the difference in the outcome measure or test statistic (higher resolution minus lower resolution, or breast specialist minus other specialty). This pilot study is insufficiently powered to detect any true negative result, and there may be other significant effects that do not appear in the table. For all factors that were statistically significant, the improvement was in favor of the conditions of increased lateral resolution and breast-specific experience.

The box-and-whisker plots in Fig. 2 provide a graphical representation of the data (using nonparametric statistics) and some descriptive statistics for screening variables. The ROC curves are shown in Fig. 3 for each resolution, averaged over all readers within an experience category. Results for the diagnostic response variables are shown in Fig. 4.

The failure analysis suggests that image features and reader issues are comparable contributors to errors (Table 3). Two broad categories of image features were associated with increased failure: a benign pattern mimicking a malignancy or vice versa (eg, epithelial hyperplasia versus DCIS), and subtle architectural features.

Fig. 4 Box-and-whisker plots for overall diagnostic score (A), and for diagnostic scores for the following pathologic features: benign features (B), in situ carcinoma (C), invasive carcinoma (D), and LVI (E).

Table 3 Factors attributed to cases being incorrectly identified as malignant or benign/normal

1. Mimickers (No. of images: 5). Examples:
   Micropapillary carcinoma vs tubular carcinoma
   Sclerosing adenosis or radial scar vs IDC (invasive duct carcinoma)
   Retraction artifact a vs LVI
   Epithelial hyperplasia vs DCIS (ductal carcinoma in situ)
2. Other subtle architecture (No. of images: 3). Examples:
   Absence of fibrotic reaction
   Small lesion
   Preoperative chemotherapy alters architecture; tumor in tiny nests
3. Reader issues (No. of images: 7). Examples:
   Missing obvious cases (eg, DCIS with comedo necrosis)
   Tendency to select interior, solid areas for “zooming” hence missing tumor at periphery
   Overcalling based on a priori knowledge (florid hyperplasia, radial scar)

The number of cases that were incorrectly called due to each category of factors is shown for the breast specialist group and higher resolution images. Generally, the lower resolution image was also called incorrectly. Some cases may appear in multiple categories. Failure in one additional case was attributed to image unsharpness, which confounded the appearance of both IDC and LVI.
a The appearance of LVI created by collagen pulling away from epithelium due to overdrying during histotechnical processing.

4. Discussion

Reader specialty has a highly significant effect on performance in both screening and diagnostic tasks. For detection, breast specialists are accustomed to a wide variety of cases and are thus more likely to recognize subtle features of malignancy in the images. Specificity was the only outcome measure for detection in which the specialist group did not demonstrate statistically significant superiority. For specificity, the tendency for improved performance by the specialist group may be offset by more extensive experience at detecting subtle abnormalities that ultimately prove to represent cancer. The stronger effect for diagnosis suggests that this task draws even more heavily on reader expertise than simple detection. This tends to be most apparent in the higher resolution images, which contain more detail to arouse suspicion. The points along the ROC curves in Fig. 3 are clustered toward low false positive values (<0.25) for both groups of readers; however, the specialist group (Fig. 3A) appears to operate at a slightly higher point on the curve. This suggests that they are willing to sacrifice some specificity for improved sensitivity.

No effect of lateral resolution was detected for the screening task although resolution significantly affected the overall diagnostic score. This may simply be due to the increased resolution requirement for diagnosis compared to detection. From the pathology-dependent diagnostic scores, it appears that finer resolution is required to diagnose carcinoma in situ and especially invasive disease. These are typically more difficult to discern than LVI, which architecturally stands out with its “bull's-eye” appearance (lymphovascular channel and white “space” surrounding a group of malignant cells) in the selected cases. The lowest diagnostic score was achieved for benign lesions and this is probably because these are not of primary interest to pathologists, especially when there are malignant features present in the same slide (as was the case in all but 3 of the 20 occurrences of benign lesions).
For the specialist group, the absolute values for sensitivity we report are 82% ± 3% and 86% ± 3% at the lower and higher resolutions, respectively. The diagnostic accuracies were 59% ± 3% and 73% ± 4%. Reader performance appears to be lower than that reported in the literature for routine surgical pathology assessment with glass slides (error rate of 0.26%-5.7%) [25]. There are 3 factors that may have substantially reduced performance in this study: the case selection is heavily enriched with cancers, coverage is very limited, and there may be influence from factors associated with the readers such as motivation and time commitment. The most important of these 3 effects is probably limited coverage, because in this study readers were shown only one small-format slide per case.

There has been some investigation of coverage effects in static telepathology. In one study, the authors observed a significant effect of coverage, along with a borderline resolution effect (at diagnostic resolutions) on diagnostic accuracy [26]. Limited coverage, which includes misrepresentative field selection for imaging, is a limitation of static telepathology and may be a factor in the wide range of diagnostic accuracies observed [1,7-9]. The diagnostic accuracies achieved with dynamic and hybrid approaches (81%-100%) are typically higher than for static telepathology even though digitizing resolution is generally lower [1,2,8]. The diagnostic accuracy that we report for the virtual screening slide (73% ± 4% for the specialist group at higher resolution) falls just outside of the observed range for static telepathology, despite our use of subdiagnostic resolution.

Factors related to the readers, rather than the cases or the technical aspects of the system, may also be important. These include motivation, time commitment, and comfort in working with computer images. Some evidence for this is provided by the fact that for all but one “difficult” case, most readers were able to detect the cancer or benign finding, or perform a correct diagnosis. Many of the cases were correctly identified by all readers (Table 4). Even in the difficult case, the cancer was detected by 2 of 5 specialist readers. In addition, a (breast specialist) reader who committed about twice the average length of time to complete each session (average time was about 40 minutes) achieved 100% detection sensitivity at both higher and lower resolutions. Nevertheless, some readers performed markedly less well, possibly because of lack of familiarity or comfort in this different reading environment. Readers may benefit from further training and the opportunity to gain experience with the display program.

Table 4 Proportion of cases that were correctly called (true positive or true negative) by all breast specialists

Truth | n | Lower resolution (%) | Higher resolution (%)
Benign/normal | 8 | 37.5 | 50.0
Malignant | 15 | 66.7 | 86.7
Total | 23 | 56.5 | 73.9

These proportions also show that sensitivity is higher than specificity, and performance is improved at the higher resolution.

Modality-specific experience has been shown to affect reader performance significantly [27]. In a study using 226 static telepathology images of breast tissue, the reader, who had modality-specific experience, achieved an Az of 0.98 [27]. This is comparable to Az = 0.97, the highest ROC performance achieved by a breast specialist in our study.

Finally, the case selection was heavily enriched with cancers, emphasizing those that are considered more subtle or unusual.
Although it was necessary to introduce this bias to most effectively use the limited reader time, the case set is not representative of a typical clinical caseload. Although cases were selected to pose no diagnostic challenge with conventional microscopic viewing, their unusual nature might render them more susceptible to the challenges of limited resolution and a novel viewing methodology.

Failure in the difficult case may be attributed to a malignant architecture which is very subtle, but which could be resolved by using a finer pixelation than those used in this study. The lesion was a very small, micropapillary carcinoma, and lacked the stellate appearance that normally results from fibrotic reactions typical of other types of cancers. In addition, its unusually apocrine or eosinophilic appearance mimicked a (benign) tubular adenoma. A retrospective examination of this case by a nonspecialist reader supports the view that the case selection successfully tests the imaging system and resolution effects as opposed to reader expertise. This reader made a benign diagnosis from the glass slide with a resolution comparable to those available from the study images (4× magnification objective), but detected the malignancy using the 32× objective. This case, which was correctly identified by only 1 of 10 nonspecialist readers, also suggests that breast specialists are more adept at detecting “soft” signs of cancer in a digital image such as directionality of the interface between fibrous and fatty tissues.

The NA that was varied to control the lateral resolution also affects the axial resolution. These NAs theoretically provide adequate lateral resolution to view a 5-μm nucleus [13]. It is shown in Fig. 5, however, that for the range of NAs used in this study the loss of axial resolution predominates over lateral resolution [28]. Therefore, the poorer performance observed at the lower NA may also be an effect of axial resolution.
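The different NA dependence of the two resolutions can be made concrete with the expressions quoted in the Fig. 5 caption (lateral = 0.61λ/NA, axial ≈ λ/NA²). The sketch below is illustrative only; the wavelength value is an assumption, although the ratios it reports do not depend on that choice.

```python
def lateral_resolution_um(numerical_aperture, wavelength_um=0.46):
    """Lateral (Rayleigh) resolution, 0.61 * wavelength / NA, in micrometres."""
    return 0.61 * wavelength_um / numerical_aperture


def axial_resolution_um(numerical_aperture, wavelength_um=0.46):
    """Approximate axial resolution, wavelength / NA**2, in micrometres."""
    return wavelength_um / numerical_aperture ** 2


def degradation_factors(na_high, na_low, wavelength_um=0.46):
    """How much each resolution coarsens when moving from na_high to na_low."""
    lateral = (lateral_resolution_um(na_low, wavelength_um)
               / lateral_resolution_um(na_high, wavelength_um))
    axial = (axial_resolution_um(na_low, wavelength_um)
             / axial_resolution_um(na_high, wavelength_um))
    return lateral, axial


if __name__ == "__main__":
    # NA values for the higher and lower resolution settings in this study.
    lateral, axial = degradation_factors(na_high=0.12, na_low=0.07)
    print(round(lateral, 2), round(axial, 2))
```

Moving from NA = 0.12 to NA = 0.07 coarsens the lateral resolution by the ratio of the NAs, whereas the axial resolution coarsens by the square of that ratio, which is the sense in which the loss of axial resolution predominates.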

Fig. 5 Comparison of axial and planar resolution. Axial resolution (≈ λ/NA²) depends more strongly on NA than lateral resolution (= 0.61λ/NA) (NA = 0.07 and 0.12 for lower and higher resolution, respectively).

4.1. Conclusions and future directions

Motivated by the goal of displaying whole-specimen digital breast pathology images, we have begun to investigate the decoupling of detection of abnormalities from their diagnosis. We considered the effects of both lateral resolution and reader specialization on performance in these 2 tasks. We found that breast specialists achieve significantly higher rates of detection and diagnosis of breast cancer than nonspecialists. Differences between the 2 groups tend to be most apparent at the higher resolution, especially for diagnostic tasks. We observed a significant effect of resolution on diagnosis. As expected, the resolution effect tends to be stronger for diagnosis of features that have finer detail (eg, IDC) compared to simple patterns (eg, LVI) for the cases selected in this study. The study did not have adequate power to reliably test the effect of resolution on detection.

Based on our experiences in this pilot study, we believe that a larger study is required to definitively identify the minimum resolution required for reliable detection of abnormality. Using effect sizes estimated from the results of this study, a post hoc sample size calculation shows that 297 readers are required in each group to achieve a power of 0.8 in detecting interaction effects using the same study design with 23 cases. However, we recommend that a comprehensive study use specialist readers only, implement methods to ensure a high measure of reader compliance, increase case coverage, and use a more balanced case set that is more representative of a typical caseload. Moreover, a greater number and variety of cases will reduce the confidence bands and allow for a more informative failure analysis. The study design should calculate the scores for each reader against the “truth” diagnosis rendered by the same reader using the glass slide. These data should be applied to calculation of the kappa statistic to establish interobserver variability. Intrareader variability should also be established, by reader retesting, to provide a reliability estimate for the data. Finally, this design should be applied to pixelations between 1.9 and 0.7 μm, although we believe that a resolution on the order of 1.9 μm may be adequate if these measures are in place.

Acknowledgments The authors gratefully acknowledge the readers from the Department of Pathology, Sunnybrook Health Sciences Centre, Toronto, for their participation in this study.

References
[1] Cross SS, Dennis T, Start RD. Telepathology: current status and future prospects in diagnostic histopathology. Histopathology 2002;41:91-109.
[2] Weinstein RS, Descour MR, Liang C, et al. Telepathology overview: from concept to implementation. Hum Pathol 2001;32:1283-99.
[3] Zhou J, Hogarth MA, Walters RF, et al. Hybrid system for telepathology. Hum Pathol 2000;31:829-33.
[4] Leong FJW-M, McGee JO'D. Automated complete slide digitization: a medium for simultaneous viewing for multiple pathologists. J Pathol 2001;195:508-14.
[5] Demichelis F, Barbareschi M, Dalla Palma P, et al. The virtual case: a new method to completely digitize cytological and histological slides. Virchows Arch 2002;441:159-64.


[6] Gilbertson JR, Ho J, Anthony L, et al. Primary histologic diagnosis using automated whole slide imaging: a validation study. BMC Clin Pathol 2006;6:4.
[7] Weinberg DS, Allaert FA, Dusserre P, et al. Telepathology diagnosis by means of digital still images: an international validation study. Hum Pathol 1996;27:111-8.
[8] Halliday BE, Bhattacharyya AK, Graham AR, et al. Diagnostic accuracy of an international static-imaging telepathology consultation service. Hum Pathol 1997;28:17-21.
[9] Weinstein RS, Bhattacharyya AK, Graham AR, et al. Telepathology: a ten-year progress report. Hum Pathol 1997;28:1-7.
[10] Weinstein RS, Descour MR, Liang C, et al. An array microscope for ultrarapid virtual slide processing and telepathology. Design, fabrication, and validation study. Hum Pathol 2004;35:1303-13.
[11] Costello SSP, Johnston DJ, Dervan PA. Development and evaluation of the virtual pathology slide: a new tool in telepathology. J Med Internet Res 2003;5:e11.
[12] Clarke G, Eidt S, Sun L, et al. Whole-specimen histopathology: a method to produce whole-mount breast serial sections for 3D digital histopathology imaging. Histopathology 2007;50:232-42.
[13] Clarke GM, Peressotti C, Mawdsley GE, et al. Design and characterization of a digital image acquisition system for whole-specimen breast histopathology. Phys Med Biol 2006;51:5089-103.
[14] Gupta D, Nath M, Layfield LJ. Utility of four-quadrant random sections in mastectomy specimens. Breast J 2003;9:307-11.
[15] Wiley EL, Keh P. Diagnostic discrepancies in breast specimens subjected to gross reexamination. Am J Surg Pathol 1999;23:876-91.
[16] Lundin M, Lundin J, Helin H, et al. A digital atlas of breast histopathology: an application of web based virtual microscopy. J Clin Pathol 2004;57:1288-91.
[17] Gombas P, Skepper JN, Hegyi L. The image pyramid system—an unbiased, inexpensive and broadly accessible method of telepathology. Pathol Oncol Res 2002;8:68-73.
[18] Axelrod DA, Guidinger MK, McCullough KP, et al. Association of center volume with outcome after liver and kidney transplantation. Am J Transplant 2004;4:920-7.
[19] Verkooijen HM, Peterse JL, Schipper ME, et al. Interobserver variability between general and expert pathologists during the histopathological assessment of large-core needle and open biopsies of non-palpable breast lesions. Eur J Cancer 2003;39:2187-91.
[20] Montgomery D. Design and analysis of experiments. 5th ed. New York, NY: Wiley; 2001.
[21] Obuchowski N. Receiver operating characteristic curves and their use in radiology. Radiology 2003;229:3-8.
[22] Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-98.
[23] Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-77.
[24] Burden RL. Numerical analysis. 5th ed. Boston (Mass): PWS; 1993.
[25] Frable WJ. Surgical pathology—second reviews, institutional reviews, audits, and correlations: what's out there? Error or diagnostic variation? Arch Pathol Lab Med 2006;130:620-5.
[26] Williams BH, Hong IS, Mullick FG, et al. Image quality issues in a static image-based telepathology consultation practice. Hum Pathol 2003;34:1228-34.
[27] Roca OF, Pitti S, Diaz Cardama A, et al. Factors influencing distant tele-evaluation in cytology, pathology, conventional radiology and mammography. Anal Cell Pathol 1996;10:13-23.
[28] Martin LC. The theory of the microscope. 3rd ed. New York, NY: American Elsevier; 1978.