Accepted Manuscript

Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts

Guy Nir, Soheil Hor, Davood Karimi, Ladan Fazli, Brian F. Skinnider, Peyman Tavassoli, Dmitry Turbin, Carlos F. Villamil, Gang Wang, R. Storey Wilson, Kenneth A. Iczkowski, M. Scott Lucia, Peter C. Black, Purang Abolmaesumi, S. Larry Goldenberg, Septimiu E. Salcudean

PII: S1361-8415(18)30749-7
DOI: https://doi.org/10.1016/j.media.2018.09.005
Reference: MEDIMA 1410

To appear in: Medical Image Analysis

Received date: 11 January 2018
Revised date: 11 July 2018
Accepted date: 21 September 2018

Please cite this article as: Guy Nir, Soheil Hor, Davood Karimi, Ladan Fazli, Brian F. Skinnider, Peyman Tavassoli, Dmitry Turbin, Carlos F. Villamil, Gang Wang, R. Storey Wilson, Kenneth A. Iczkowski, M. Scott Lucia, Peter C. Black, Purang Abolmaesumi, S. Larry Goldenberg, Septimiu E. Salcudean, Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts, Medical Image Analysis (2018), doi: https://doi.org/10.1016/j.media.2018.09.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

• A system for automatic histological grading of prostate cancer was developed.
• The classifier was trained based on detailed annotations of six pathologists.
• The inter-observer variability was addressed by adapting a crowdsourcing approach.
• Novel features based on spatial statistics of the nuclei were proposed.
• The performance of the classifier was within agreement levels among pathologists.
Graphical Abstract

[Workflow overview figure: (a) individual cores are segmented on TMA slides; (b) each core is divided into image patches for multi-scale processing; (c) cores are annotated in detail by pathologists using a tablet, and corresponding label maps are constructed; (d) patches are segmented for tissue types, glands and nuclei for feature extraction; (e) extracted features for each patch are quantized and assigned label probabilities; (f1) a supervised classifier (regressor) is trained based on the label probabilities; (f2) label probabilities are updated based on training results; (g) label probabilities of left-out patches are predicted, their labels are estimated, and TMA slides are reconstructed; (h) statistics and agreement levels are measured.]
Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts

Guy Nir (a,i,∗), Soheil Hor (b,∗∗), Davood Karimi (i,∗∗), Ladan Fazli (a), Brian F. Skinnider (e,c,∗∗∗), Peyman Tavassoli (a,d,∗∗∗), Dmitry Turbin (f,∗∗∗), Carlos F. Villamil (e,∗∗∗), Gang Wang (e,∗∗∗), R. Storey Wilson (g), Kenneth A. Iczkowski (h), M. Scott Lucia (g), Peter C. Black (a), Purang Abolmaesumi (i), S. Larry Goldenberg (a), Septimiu E. Salcudean (i,a,∗)

a Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
b Department of Electrical Engineering, Stanford University, Stanford, CA, USA
c Department of Pathology, Vancouver General Hospital, Vancouver, BC, Canada
d Department of Pathology, Richmond Hospital, Richmond, BC, Canada
e BC Cancer Agency, Vancouver, BC, Canada
f Anatomical Pathology, St. Paul's Hospital, Vancouver, BC, Canada
g Department of Pathology, University of Colorado School of Medicine, Aurora, CO, USA
h Medical College of Wisconsin, Milwaukee, WI, USA
i Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada
Abstract

Prostate cancer (PCa) is a heterogeneous disease that manifests in a diverse range of histologic patterns; its grading is therefore associated with inter-observer variability among pathologists, which may lead to under- or over-treatment of patients. In this work, we develop a computer-aided diagnosis system for automatic grading of PCa in digitized histopathology images using supervised learning methods. Our pipeline comprises the extraction of multi-scale features that include glandular, cellular, and image-based features. A number of novel features are proposed based on intra- and inter-nuclei properties; these features are shown to be among the most important ones for classification. We train our classifiers on 333 tissue microarray (TMA) cores that were sampled from 231 radical prostatectomy patients and annotated in detail by six pathologists for different Gleason grades. We also demonstrate the TMA-trained classifier's performance on an additional 230 whole-mount slides of 56 patients, independent of the training dataset, by examining the automatic grading on manually marked lesions and on a randomly sampled 10% of the benign tissue. For the first time, we incorporate a probabilistic approach for supervised learning by multiple experts to account for the inter-observer grading variability. Through cross-validation experiments, the overall grading agreement of the classifier with the pathologists was found to be an unweighted kappa of 0.51, while the overall agreements between each pathologist and the others ranged from 0.45 to 0.62. These results suggest that our classifier's performance is within the inter-observer grading variability levels across the pathologists in our study, which are also consistent with those reported in the literature.

Keywords: Digital pathology, Prostate cancer, Computer aided diagnosis, Machine learning

∗ Corresponding author(s)
∗∗, ∗∗∗ These authors contributed equally to this work

Preprint submitted to Medical Image Analysis, September 24, 2018
Conflicts of interest: None

1. Introduction

1.1. Prostate Cancer Grading

With an estimated 161,360 new cases and 26,730 deaths in 2017, prostate cancer (PCa) is the second most commonly diagnosed cancer and the third most common cause of cancer death among American men (Siegel et al., 2017). PCa is a heterogeneous disease that manifests in a diverse range of histologic patterns. The Gleason score (Gleason, 1966) is currently the most common grading system for prostate adenocarcinoma, and has proven to be a clinically relevant prognostic marker. The grade is determined by pathologists based on the glandular architectural features observed in haematoxylin and eosin (H&E) stained samples. First, PCa patterns in a tissue slide are graded from Gleason grade 1, with well-differentiated and uniform glands associated with less aggressive disease, to Gleason grade 5, with only occasional gland formation associated with poorer outcomes. Then, the primary (most common) and secondary (second most common) grades are added to yield the Gleason score.
The Gleason system has undergone a few modifications since its inception, including deeming Gleason grades 1 and 2 as clinically irrelevant (Epstein et al., 2005), and redefining cancerous glandular patterns to correlate better with patients' prognosis (Epstein et al., 2016a). The most recent grade grouping system (Epstein et al., 2016b) defines Gleason scores ≤ 6 as grade group 1, score 3+4 = 7 as group 2, score 4+3 = 7 as group 3, score 8 as group 4, and scores 9 and 10 as group 5. This grade grouping has been shown to provide better stratification and improved prognostic accuracy (in part by differentiating between scores 3+4 = 7 and 4+3 = 7), and has a lowest grade of 1 (as opposed to 6) to reduce over-treatment. However, the patterns of grading (mainly grades 3 and 4) that constitute the scores and groups are still susceptible to inter- and intra-observer variability, which is affected by pathologists' training, experience, and institutional consensus.
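The score-to-group mapping just described is mechanical and can be written down directly; a minimal sketch (the function name is ours, not from the paper):

```python
def gleason_grade_group(primary: int, secondary: int) -> int:
    """Map a Gleason primary + secondary pattern pair to the 2016 grade group
    (Epstein et al., 2016b)."""
    score = primary + secondary
    if score <= 6:
        return 1                            # grade group 1: scores <= 6
    if score == 7:
        return 2 if primary == 3 else 3     # 3+4=7 -> group 2, 4+3=7 -> group 3
    if score == 8:
        return 4                            # grade group 4: score 8
    return 5                                # grade group 5: scores 9 and 10
```

Note that the mapping is not a function of the total score alone: the primary pattern breaks the tie at score 7, which is exactly the stratification improvement mentioned above.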
The inter-observer Gleason grading variability ranges from barely moderate agreement between general pathologists (Allsbrook et al., 2001b) (overall unweighted kappa (κ) coefficient of 0.435) to substantial agreement between urologic pathologists (Allsbrook et al., 2001a) (overall weighted κ range of 0.56–0.70). The heterogeneity of PCa patterns and grading is demonstrated in Figure 1. The existing variance in PCa grading among pathologists (Berney et al., 2014) may lead to under- or over-treatment of patients, and therefore impact their survival rates, quality of life, and the costs to the healthcare system.

Figure 1: Heterogeneity of PCa patterns and grading leads to classification challenges. (a) Inter-observer variability (pathologists are not in alphabetical order); (b) intra-class variability and inter-class similarity.

Tissue microarray (TMA) technology (Kononen et al., 1998) allows many tissue samples to be arrayed in a grid and simultaneously processed for sectioning and staining on a single histologic slide. TMA slides present less tissue than typical biopsy samples, and much less than whole-mount (WM) slides, each of which contains a cross-section of the entire specimen. However, TMAs provide the means for high-throughput analysis and improved experimental control by placing a diverse range of samples from multiple patients and disease types on a single slide. In (De La Taille et al., 2003), the inter-observer reproducibility of Gleason grading between ten genitourinary pathologists was evaluated on TMAs. It was concluded that the approach is useful for resident training and teaching because it allows trainees to focus on small areas of tissue. However, the grading variability between the observers was high, with agreement percentages ranging from 21% to 82.5%. In (Bova et al., 2001), the authors designed a web-based system for Gleason grading of images of TMA cores to assess inter- and intra-observer variability between two expert genitourinary pathologists. The images were graded at the core level (grading of the entire core) by answering text-based questions, and the pathologists had exact agreement on the Gleason score in 66% of the cases (98% when the scores differed by at most one Gleason grade).
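The unweighted κ statistics quoted above compare observed agreement with the agreement expected by chance given each rater's label frequencies; a small self-contained sketch of the computation (our illustration, not the authors' evaluation code):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters' label sequences
    (e.g., per-core grades such as 'benign', 'G3', 'G4', 'G5')."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from the raters' marginal label frequencies
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

The weighted variants reported for urologic pathologists additionally discount disagreements by their distance on the grade scale, so near-misses are penalized less.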
1.2. Machine Learning in Digital Pathology

In recent years, the field known as digital pathology has grown dramatically, largely due to technological advancements in whole-slide imaging, image processing and machine learning algorithms, and an increase in computational power (Pantanowitz, 2010). As part of this field, many methods have been proposed in the literature for automatic histopathological image analysis and classification (Madabhushi and Lee, 2016). Such automation can be used in developing computer-aided diagnosis (CAD) systems to increase accuracy, reproducibility and throughput in the diagnosis of cancer in general, and of PCa in particular.

A typical CAD system first employs image processing for staining standardization, tissue segmentation and histologic feature extraction, followed by machine learning classification of the extracted features. The features can be largely classified into architectural, morphological, and textural features. Recent survey papers (Arevalo et al., 2014; Bhargava and Madabhushi, 2016; Jothi and Rajam, 2017; Mosquera-Lopez et al., 2015) describe the different approaches towards histopathological CAD systems; below we discuss the works most relevant to ours.

In (Tabesh et al., 2007), the authors extracted features that describe color, texture, and morphologic cues from 367 and 268 H&E image patches, acquired from TMA and WM datasets, respectively, and employed them in a supervised learning framework. They achieved accuracies of 96.7% and 81.0% for cancer detection (benign vs. cancer) and low- vs. high-grade classification, respectively, in a five-fold cross-validation on each dataset.
In (Doyle et al., 2012b), the authors proposed a cascade approach for the multi-class grading problem. Rather than the conventional one-shot classification and one-versus-all approaches to multi-class classification, they grouped samples from different classes to maximize intra-group homogeneity and inter-group heterogeneity. They proposed architectural and textural features, and also proposed features based on the minimum spanning tree between nuclei. They used their approach to classify 2000 biopsy image patches, taken from 214 patients, into seven classes (Gleason 3–5, epithelium, stroma, atrophy, and prostatic intraepithelial neoplasia (PIN)), and reported a positive predictive value (PPV) of 86%.

In (Gorelick et al., 2013), the authors partitioned image patches into superpixels, namely, non-overlapping regions that are similar in color and texture. These superpixels were then classified into different tissue labels based on extracted morphological, geometric and intensity features. The geometric features used for tissue labeling include histograms of scale-invariant feature transform (SIFT) descriptors (Lowe, 1999). The image patches are then classified into benign vs. cancer, and low- vs. high-grade, based on the histogram of their corresponding superpixels' tissue labels, with reported accuracies of 90% and 85%, respectively, on 991 image patches from 15 patients.

In (Nguyen et al., 2014), the authors developed a method to segment glands based on the spatial arrangement of nuclei. Such a nuclei-based method is useful for detecting glands without lumen, as is the case in higher grades of PCa. They proposed using minimum spanning trees, showed that a fused lumen- and nuclei-based gland segmentation improves the classifier performance in differentiating Gleason 3 from 4 in comparison with a lumen-based approach alone, and reported a 3-class (benign, Gleason 3, and Gleason 4) grading accuracy of 88% on 317 image patches taken from 29 patients. In (Lee et al., 2014), the authors looked at the disorder in gland orientations for predicting the degree of PCa malignancy, and reported an accuracy of 73% in predicting biochemical recurrence using such features.
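Minimum-spanning-tree features of the kind used in (Doyle et al., 2012b; Nguyen et al., 2014) reduce the nuclei centroids to the edge lengths of their Euclidean MST, whose statistics summarize glandular architecture. An illustrative Prim's-algorithm sketch (ours, not code from those papers):

```python
import math

def mst_edge_lengths(points):
    """Edge lengths of the Euclidean minimum spanning tree over nuclei
    centroids (Prim's algorithm on the complete graph); statistics of the
    returned lengths (mean, std, ...) can serve as architectural features."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest connection of each node to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)
```

The O(n²) complete-graph form is adequate for the few hundred nuclei in a patch; production code would typically build the MST on a Delaunay triangulation instead.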
Recently, deep learning (DL) approaches have been proposed in medical image analysis (Litjens et al., 2017), including for digital pathology images, e.g., (Janowczyk and Madabhushi, 2016; Litjens et al., 2016) and references therein. These approaches use the raw digital images as input, with minimal preprocessing, and have the advantage of avoiding the traditional extraction of handcrafted features, which may be inaccurate or insufficient. A few recent works applied DL for detection and grading of PCa directly (Källén et al., 2016; Rezaeilouyeh et al., 2016; Zhou et al., 2017), while other works have used DL for lower-level tasks such as gland and nuclei segmentation (Chen et al., 2016; Cireşan et al., 2013; Ronneberger et al., 2015; Xu et al., 2017).

The main disadvantage of any DL approach is the requirement of a large training set. Fortunately, histopathology patterns are visible at a small scale; therefore, a single slide can be used to generate a large number of training patches. On the other hand, as discussed above, the heterogeneity of the data presents a challenge, and patches extracted from a single slide are typically correlated. Another disadvantage of DL is the lack of feature interpretation, which may hinder the clinical deployment of CAD systems (Niazi et al., 2017; Shen et al., 2017). In (Niazi et al., 2017), the authors proposed a classifier based on luminal and architectural features that are visually meaningful. They trained and tested their method on independent datasets of 43 and 88 image patches that were extracted from 25 and 30 WM slides (unspecified number of patients), respectively, and reported a 2-class (low- vs. high-grade) classification accuracy of 90.9%.

In (Källén et al., 2016), the authors used a pre-trained network for PCa grading by feeding the output of a mid-layer to random forest (RF) (Breiman, 2001) and support vector machine (SVM) (Bishop, 2006) classifiers. They achieved patch-wise and image-wise accuracies of 81% and 89%, respectively, using 10-fold cross-validation. In (Rezaeilouyeh et al., 2016), the authors proposed a DL-based pipeline combined with handcrafted features (the magnitude and phase of a shearlet transform) for PCa grading, and reported accuracy, sensitivity and specificity of 88%, 89% and 94%, respectively. Additional hybrid methods of DL and handcrafted features were proposed in (Zhou et al., 2017) for a two-class (Gleason grade group 2 vs. group 3) classification, with an accuracy of 75%, and in (Cruz-Roa et al., 2013) for basal-cell carcinoma detection in skin histopathology slides.
1.3. Multi-label Classification

In this work, we have developed a CAD system for automatic grading of PCa in digitized histopathology images. Our pipeline consists of processing multi-scale image patches, which are then used as a training dataset for supervised learning. To address the inter-observer variability issue in PCa grading, the training dataset was labeled by multiple pathologists. This leads to a multi-class (four classes: benign, Gleason grades 3, 4 and 5) and multi-label (also known as multi-output) classification problem, where, given the inter-observer variability, it is unclear how to choose the ground-truth label on which the classifier should be trained. One approach is to train a separate classifier on each label. However, such an approach would not utilize the underlying correlations among the labels, and leaves the same dilemma of how to choose the correct output from the different classifiers.

Another approach is to consider the ground truth as the "consensus label"¹ based on the majority vote among the labels of each sample (and a tiebreaker policy). Such a voting approach was applied, e.g., in (Veta et al., 2015) for mitosis detection in breast cancer images. However, as discussed in (Dekel and Shamir, 2009; Raykar et al., 2010), this approach implicitly assumes that the experts' annotations are weighted equally, while in practice annotator performance may be affected by known and unknown factors, as mentioned above. Other approaches include ensemble methods that train many classifiers on random subsets of the labels (Tsoumakas and Vlahavas, 2007). Additional approaches to tackling multiple labels are reviewed in (Tsoumakas et al., 2009; Zhang and Zhou, 2014).

¹ Due to the high inter-observer variability, considering the actual consensus label, i.e., only samples that were assigned the same class by all experts, significantly reduces the size of the training set.

In (Warfield et al., 2004), the authors presented the simultaneous truth and performance level estimation (STAPLE) method to compute a probabilistic estimate of the true segmentation of medical images from a collection of segmentations provided by multiple experts and/or automatic algorithms. The method tackles the inter-observer variability by estimating the sensitivity and specificity of each expert/algorithm using an iterative expectation-maximization (EM) algorithm, and then computes the maximum-a-posteriori (MAP) estimate of the true segmentation from a weighted combination of the different segmentations. The algorithm is used to assess the performance of both automated and human segmentations.

In our work, to utilize the multi-label data, we follow (Raykar et al., 2010) and propose a probabilistic approach for supervised learning by multiple experts. The method has been shown to be useful in similar scenarios that arise in radiology, as well as in crowdsourcing applications (Ipeirotis et al., 2010). It may be seen as a generalization of the STAPLE algorithm that updates the estimation of the annotators' performance and the ground truth as part of the learning process, and thus takes into account image information (or features) rather than the labels alone (Chatelain et al., 2013). Similar to STAPLE, it also provides the annotators' performance as a byproduct.

1.4. Contributions

In achieving our goals, we present the following main contributions: i) acquisition and utilization of multi-expert annotations in the training and evaluation of the classifier; ii) a probabilistic approach to automatic grading of PCa to contend with the inter-observer variability; and iii) development of new multi-scale features based on spatial point-process statistics of nuclei (Poisson's intensity, Moran's I, Ripley's functions and the pair correlation function), speeded-up robust features (SURF) of nuclei, and intra-nuclei properties such as the nuclei perimeter spectrum.

To the best of our knowledge, this work is the first to utilize multiple experts for training and evaluation of a CAD system for PCa grading. Given the inter-observer variability reported in the literature on PCa grading, we hypothesize that such utilization is necessary to capture the heterogeneity both in the data and across pathologists. In addition, our proposed grading CAD system, classifying for benign and Gleason grades 3, 4, and 5, was evaluated in a rigorous leave-one-patient-out manner on an image set that is on the order of 100 times larger than those of most comparable works in the literature, and involved a multi-institutional study. Nevertheless, as discussed in Section 4, the automatic classifier's performance is within the agreement range of human pathologists, and on par with the most comparable works in the literature.

The rest of the manuscript is organized as follows. In Section 2 we describe the data, the annotation and label probability assignment, and the training and evaluation methods of the classifier. In Section 3 we summarize the results of the classification experiments, and in Section 4 we discuss the significance of the results, the limitations of the study, and its potential impact and future research directions. We conclude with a summary of the results and contributions in Section 5.
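To make the idea behind the multi-expert learning scheme of Section 1.3 concrete, the following is a deliberately stripped-down, feature-free EM sketch for binary labels only (our illustration; the actual method is multi-class and couples these steps with classifier training on image features): annotator sensitivities/specificities and posterior label probabilities are estimated alternately, starting from the majority vote.

```python
def em_multiple_annotators(votes, n_iter=50):
    """votes[i][j] in {0, 1}: label given to sample i by annotator j.
    Returns (posterior P(y_i=1), per-annotator sensitivity, specificity)."""
    n, m = len(votes), len(votes[0])
    mu = [sum(row) / m for row in votes]            # init: mean (majority) vote
    alpha = beta = None
    for _ in range(n_iter):
        pos = sum(mu)
        p = pos / n                                  # class prior
        # M-step: annotator performance given the current soft labels
        alpha = [sum(mu[i] * votes[i][j] for i in range(n)) / max(pos, 1e-9)
                 for j in range(m)]
        beta = [sum((1 - mu[i]) * (1 - votes[i][j]) for i in range(n))
                / max(n - pos, 1e-9) for j in range(m)]
        # E-step: posterior label probability for each sample
        for i in range(n):
            a, b = p, 1 - p
            for j in range(m):
                if votes[i][j]:
                    a *= alpha[j]
                    b *= 1 - beta[j]
                else:
                    a *= 1 - alpha[j]
                    b *= beta[j]
            mu[i] = a / (a + b) if a + b > 0 else 0.5
    return mu, alpha, beta
```

Unreliable annotators end up with sensitivity/specificity near chance, so their votes are automatically down-weighted in the posterior, which is the key difference from plain majority voting.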
Figure 2: Illustration of the TMA image processing and classification workflow. [(a) Individual cores are segmented on TMA slides; (b) each core is divided into image patches for multi-scale processing; (c) patches are segmented for tissue types, glands and nuclei for feature extraction; (d) cores are annotated in detail by pathologists; (e) extracted features for each patch are quantized and assigned label probabilities; (f1) a supervised classifier (regressor) is trained based on the label probabilities; (f2) label probabilities are updated based on training results; (g) label probabilities of left-out patches are predicted and their labels estimated.]

2. Materials and Methods

2.1. Tissue Microarrays Dataset and Processing

Our histopathology data were acquired from seven TMA blocks that were constructed and processed at the Vancouver Prostate Centre; the study was approved by the institutional Clinical Research Ethics Board (CREB #H15-01064). Each block contains 160 tissue cores of 1.1 mm in diameter that were extracted from 502 radical prostatectomy specimens (two to three cores per specimen, 1120 cores in total). A single level from each of the seven blocks was sectioned, stained in H&E, and digitized as virtual slides at 40x magnification using a SCN400 Slide Scanner (Leica Microsystems, Wetzlar, Germany). The digital images of the seven slides were processed as described below for the development and evaluation of our CAD system. The processing and classification workflow is further illustrated in Figure 2.
For systematic processing, each core in the TMA slides was automatically segmented and labeled. The segmentation consisted of thresholding the grayscale-converted slide image, at ∼1/116 of its native resolution, followed by morphological operations on the binary image to remove noise and to complete partial cores. The binary image was convolved with a disk kernel mask, and the centroids of the cores were detected by another thresholding. To associate each core with its pixels, k-means clustering (Lloyd, 1982) was performed, with the spatial coordinates of the binary image pixels as features and the detected centroids from the previous step as initialization. The cores were then indexed based on their respective row and column.

2.2. Whole-Mount Slides Dataset

For evaluation, we also used a second dataset that was acquired at the University of Colorado, Denver, from which digitized images and annotations were shared through an inter-institutional Material Transfer Agreement (MTA #M15-00267). The dataset comprises 230 WM slides of 56 radical prostatectomy patients. The slides were stained in H&E and scanned at 20x magnification using an Aperio ScanScope XT (Leica Biosystems, Vista, CA, USA). The virtual slides were annotated by uropathologists using the Aperio ImageScope software for different patterns of PCa tissue with which a Gleason grade could be associated, and for different patterns of benign tissue.

The annotated patterns, as described in (Iczkowski et al., 2011), included single, small separate acini (corresponding to Gleason grade 3); luminal blue mucin-containing single, separate acini (Gleason grade 3); undulating, branched, or angulated larger acini that are not truly papillary, with no bridging or stromal cores (Gleason grade 3); mucinous/colloid carcinoma without fusion or individual cells (Gleason grade 3); fused small acini (Gleason grade 4); true papillary with stromal cores or bridging across acinar spaces (Gleason grade 4); small and large cribriform (Gleason grade 4-5); individual cells (corresponding to Gleason grade 5); high-grade PIN (benign); and atrophy with and without inflammation (benign).

2.3. Detailed Annotation of TMA Cores

We developed an Android-based application (named "PathMarker") for quick and intuitive annotation of individual core images using a tablet with a stylus. To allow fast loading between images, the images were downsampled to a resolution equivalent to 10x magnification, which still allows proper diagnosis (Gordetsky and Epstein, 2016).

Due to the time-consuming task of grading 1120 cores in detail, a subset of 333 cores (231 patients) was selected as follows. Based on initial core grading by one pathologist, the 333-core subset was skewed towards high grades, in the expectation that the co-occurrence of benign and low-grade regions in such cores would yield an equalized distribution of grades in the detailed annotation. All cores in grade group 5 (57 cores) and grade group 4 (106 cores) were selected, followed by a random selection of 10% of the benign cores (20 cores) and a random selection of an equal number of cores from grade groups 1-3 (50 cores from each group). The PathMarker app was loaded with the subset of 333 cores, randomly shuffled such that cores from the same patient or slide are not necessarily consecutive.

To date, six pathologists have used PathMarker to annotate the 333 H&E cores for benign, Gleason grade 3, grade 4, and grade 5 tissue. The pathologists were asked to circle regions on the cores with the corresponding grade as they see fit, based on their experience. Similar to other studies, e.g., (De La Taille et al., 2003), there was no initial consensus meeting to agree on how to apply the Gleason grading system for this project. The pathologists' experience in PCa diagnosis on histopathology ranged from early- and mid-career to veteran (LF: 27, BFS: 15, PT: 1, DT: 24, CFV: 17, and GW: 5 years of experience); one was a clinical general pathologist, one was a research genitourinary pathologist, and the others were clinical genitourinary pathologists. To keep the annotators' performance anonymized, the figures and results reported throughout this paper are not in alphabetical order.

Table 1: Summary of the TMA dataset based on the detailed annotations by the pathologists.

    Number of patients: 231
    Mean (STD) age at surgery: 63.2 (6.3)

    Grade area distribution (consensus, in mm²):
      Benign: 25.9 (8.4%) | Grade 3: 102.0 (33.1%) | Grade 4: 173.4 (56.3%) | Grade 5: 6.6 (2.1%) | Overall: 307.9 (100%)

    Grade group distribution (consensus, in cores):
      Benign/Other: 19 (5.7%) | Group 1: 61 (18.3%) | Group 2: 58 (17.4%) | Group 3: 62 (18.6%) | Group 4: 114 (34.2%) | Group 5: 19 (5.7%) | Overall: 333 (100%)

Based on the color-coded annotations of each pathologist², a set of label maps for each core was constructed by assigning the pixels encircled by the contours the corresponding class. The resulting distribution of the data based on the generated maps is summarized in Table 1.

² For two of the pathologists, annotations of subsets of 191 and 92 cores were recovered.
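Constructing a label map from circled contours amounts to a point-in-polygon rasterization; a toy sketch using even-odd ray casting (our illustration — the paper does not describe its rasterization at this level of detail):

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray-casting test; poly is a closed contour as (x, y) vertices."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):                      # edge crosses the scanline
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def rasterize_annotation(shape, contours):
    """Build a label map for one core: contours is a list of
    (grade, vertex_list) pairs; 0 marks unannotated pixels, and later
    contours overwrite earlier ones."""
    h, w = shape
    label_map = [[0] * w for _ in range(h)]
    for grade, poly in contours:
        for r in range(h):
            for c in range(w):
                if point_in_polygon(c, r, poly):
                    label_map[r][c] = grade
    return label_map
```

A production implementation would rasterize by scanline or with an imaging library rather than testing every pixel against every contour, but the semantics are the same.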
2.4. Extraction of Explicit Multi-Scale Features

Our feature extraction is based on patches that are cropped from the digitized slide images. As suggested in (Doyle et al., 2012a), histological features may be meaningful at different resolutions. Thus, we process concentric image patches at different magnification scales. A data sample, x_i = (x_i^1, ..., x_i^S), therefore comprises the S × M features, x_i^s ∈ R^M, extracted from the s-th scale around the center of the i-th patch.

On the one hand, the larger the patches and the lower the scale, the less localized the features. On the other hand, smaller patches at higher scales may not contain enough information to allow the extraction of descriptive features. Also, the smaller the sampling interval of the patches, the more repetitive the dataset, and the more computation time is required. Thus, the patch size, scales and sampling rate were selected such that most histologic structures (e.g., glands) are captured, while maintaining enough data samples for a fine representation of the annotated labels.
Table 2: Summary of the proposed features with their most important rank (lower is better), class and scale.

Type            | Description                              | Importance | Class     | Scale | Notes
Glandular-based | number of glands                         | 18         | Gleason 5 | 20x   |
Glandular-based | gland area                               | 1          | Gleason 3 | 10x   | mean and std over glands
Glandular-based | gland perimeter                          | 2          | Benign    | 40x   | mean and std over glands
Glandular-based | gland diameter                           | 4          | Benign    | 40x   | mean and std over glands
Glandular-based | gland eccentricity                       | 4          | Gleason 4 | 40x   | mean and std over glands
Glandular-based | gland orientation variance               | 1          | Gleason 4 | 20x   | (Lee et al., 2014)
Glandular-based | lumen to gland area ratio                | 22         | Benign    | 20x   | mean and std over glands
Glandular-based | epithelial to gland area ratio           | 1          | Benign    | 10x   | mean and std over glands
Nuclei-based    | number of nuclei                         | 59         | Benign    | 10x   |
Nuclei-based    | nuclei area                              | 2          | Gleason 4 | 20x   | mean and std over nuclei
Nuclei-based    | nuclei distance                          | 2          | Gleason 3 | 10x   | mean, std, min and max over nuclei
Nuclei-based    | nuclei coordinates distribution          | 2          | Gleason 5 | 40x   | entropy, covariance matrix
Nuclei-based    | nuclei coordinates distribution spectrum | 1          | Benign    | 40x   | mean, std, covariance, eigenvalues
Nuclei-based    | intra-nuclei intensity distribution      | 33         | Gleason 3 | 10x   | mean and std
Nuclei-based    | minimum spanning tree between nuclei     | 26         | Gleason 5 | 10x   | (Doyle et al., 2012b; Nguyen et al., 2014)
Nuclei-based    | complete spatial randomness of nuclei    | 10         | Gleason 5 | 40x   | one-sided and two-sided p-value
Nuclei-based    | Poisson's λ intensity map of nuclei      | 20         | Gleason 4 | 20x   | see text
Nuclei-based    | Moran's I autocorrelation of nuclei      | 33         | Gleason 5 | 10x   | see text
Nuclei-based    | Ripley's and Besag's functions of nuclei | 4          | Gleason 4 | 20x   | see text
Nuclei-based    | pair correlation function of nuclei      | 1          | Gleason 4 | 10x   | see text
Nuclei-based    | nuclei perimeter Fourier coefficients    | 2          | Gleason 3 | 40x   | see text
Image-based     | haematoxylin to eosin pixel ratio        | 3          | Gleason 5 | 40x   |
Image-based     | SURF descriptors                         | 1          | Gleason 3 | 40x   | see text

(The spatial point-process statistics and SURF rows marked "see text" are the novel features proposed in this work.)
Specifically, we used three scales, 10x, 20x, and 40x, and a patch size of 512×512 pixels (thus, the scales correspond to 512×512 µm², 256×256 µm² and 128×128 µm², respectively). The patches are cropped from the native 40x resolution of the core image, and lower scales are downsampled into patches of size 512×512 pixels. The patches are sampled in half-patch intervals (at 40x) along each axis.
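The patch geometry above can be sketched in a few lines (a minimal illustration, assuming the core image is available as a numpy array at its native 40x resolution; `concentric_patches` and `patch_centers` are hypothetical helper names, and plain decimation stands in for proper anti-aliased downsampling):

```python
import numpy as np

def concentric_patches(core_40x, center, patch=512):
    """Crop concentric regions around one center: 512 px at 40x, 1024 px at
    20x and 2048 px at 10x, then bring the larger crops back to 512x512 so
    all three scales share the same output size."""
    out = {}
    r, c = center
    for name, f in {"40x": 1, "20x": 2, "10x": 4}.items():
        half = patch * f // 2
        crop = core_40x[r - half:r + half, c - half:c + half]
        out[name] = crop[::f, ::f]  # naive decimation in place of proper downsampling
    return out

def patch_centers(shape, patch=512):
    """Half-patch sampling intervals (at 40x) along each axis, with a margin
    so that the largest (10x) crop stays inside the image."""
    margin, step = 2 * patch, patch // 2
    return [(r, c) for r in range(margin, shape[0] - margin, step)
                   for c in range(margin, shape[1] - margin, step)]
```

Each returned patch is 512×512 pixels, matching the three-scale input described above.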
For feature extraction, we processed each image patch for gland and nuclei segmentation using methods developed by our group (Rashid et al., 2013; Rashid, 2014), which are briefly summarized here. As a preprocessing step, if the pixels in a patch have a mean gray intensity value higher than 180 or lower than 120 (in the range 0–255), we considered the patch to contain mostly glass, image noise or other staining artifacts, and assigned it an invalid label. Then, based on their color intensities, the pixels are clustered into different histologic structures, namely, lumen, nuclei, epithelial and stroma (Nguyen et al., 2012).

The gland segmentation in each patch is performed by grouping connected lumen pixels as the initial gland boundary, and iteratively updating this boundary by consolidating surrounding epithelial and nuclei pixels until the boundary reaches stroma pixels. The method was reported to achieve 89% recall and 71% precision. The nuclei segmentation in each patch is performed by applying a modified watershed transformation to the gradient image of the patch's negated 'R' channel. The method was reported to achieve 84% recall and 90% precision. The parameters for these methods were tuned and evaluated in (Rashid et al., 2013; Rashid, 2014) based on a dataset of 10 WM slides (8 patients), independent from the datasets used in this study.
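The validity test in the preprocessing step can be expressed as a short check (a sketch; the text does not specify the gray conversion, so a plain mean over the color channels is assumed here):

```python
import numpy as np

def patch_is_valid(patch_rgb, lo=120, hi=180):
    """A patch whose mean gray intensity falls outside [120, 180] (on a
    0-255 scale) is treated as mostly glass, noise or staining artifact
    and receives an invalid label."""
    gray = np.asarray(patch_rgb, dtype=float).mean(axis=2)
    return lo <= gray.mean() <= hi
```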
A list of the proposed features from each category that were found to be meaningful, based on a feature importance analysis of our classification experiments, is summarized in Table 2. In total, we extract 276 features per scale. The extracted features consist of architectural and morphological features that include glandular-based features (their number, contents and shape statistics) and nuclei-based features (number, size, color content, spatial and spatial-frequency statistics). Motivation for using these features arose from discussions with pathologists, as well as from common features used in previous work (Jothi and Rajam, 2017; Mosquera-Lopez et al., 2015).
Among the new features designed by us are image-based features that are based on the computation of SURF descriptors (Bay et al., 2008) around the centers of the segmented nuclei in the patch. Similar to SIFT (Lowe, 1999), its predecessor, the SURF algorithm detects salient points in images and characterizes them in a scale-, intensity- and rotation-invariant manner using a vector of features known as a descriptor. Here, we force the salient points to be the nuclei centers, and employ the mean and standard deviation (STD) vectors of their descriptors as features for our classifier. The motivation is that the SURF descriptors, which are based on Haar wavelet responses, can capture and discriminate both the nucleus interior contents and the statistical intensity distribution of its surroundings (depending on the image patch scale).

Other extracted nuclei features include the mean, covariance matrix elements and eigenvalues of the spatial Fourier transform of the nuclei coordinates distribution. In addition, we propose the spectrum of the nuclei perimeter as a novel feature. The distance from the centroid of each nucleus to its perimeter is
Figure 3: Example of the computation of the nuclei perimeter and its spectrum. (a) Examples of segmented nuclei. (b) The distance from the center of mass of the nucleus to points on its perimeter, as a function of the angle θ (a perfect circular perimeter would yield a constant function). (c) The spectrum of the center-to-perimeter distance.
computed for angles in the range [−π, π), and the first five coefficients of the Fourier transform of this distance function are recorded (these were found to be sufficient to describe the spectrum, capturing over 80% of the energy). The mean values of these coefficients over all nuclei in the image patch are used as features. The computation of these features is illustrated with an example in Figure 3.
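A minimal sketch of the perimeter-spectrum feature for a single nucleus (`perimeter_spectrum` is a hypothetical name; the angular grid size and the use of coefficient magnitudes are assumptions not fixed by the text):

```python
import numpy as np

def perimeter_spectrum(boundary_xy, n_angles=64, n_coeffs=5):
    """Express the centroid-to-perimeter distance as a function of the polar
    angle in [-pi, pi), resample it on a uniform angular grid, and keep the
    magnitudes of the first five Fourier coefficients (a perfect circle
    leaves energy only in the DC term)."""
    xy = np.asarray(boundary_xy, dtype=float)
    d = xy - xy.mean(axis=0)                      # offsets from the centroid
    theta = np.arctan2(d[:, 1], d[:, 0])
    radius = np.hypot(d[:, 0], d[:, 1])
    order = np.argsort(theta)
    grid = np.linspace(-np.pi, np.pi, n_angles, endpoint=False)
    r_grid = np.interp(grid, theta[order], radius[order], period=2 * np.pi)
    return np.abs(np.fft.rfft(r_grid))[:n_coeffs]
```

Averaging these vectors over all nuclei in a patch yields the per-patch feature described above.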
We introduce new features based on spatial statistics (Ripley, 2005). The motivation is to improve the classification of high-grade cancers that do not have well-defined glands and are mainly characterized by the distribution of nuclei. We estimate the Poisson point process intensity (Poisson's λ), which is a measure of how densely the points are distributed. Given a distribution, the maximum likelihood estimator of λ is given by the density λ̂ = n/‖W‖, where n is the number of points in a region W of area ‖W‖ (Sheskin, 2003). We estimate λ̂(p) at each pixel p in the image patch using its 101×101-pixel neighborhood, and use the mean, median and STD of λ̂ over the patch as features. The size of the neighborhood was chosen to be large enough to sample sufficiently many nuclei for the intensity estimation, yet small enough to capture its inhomogeneity over the patch (equivalent to choosing the bandwidth of a density estimation window).
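The per-pixel intensity map can be computed efficiently with an integral image; the following is a sketch (hypothetical helper name) of λ̂ = n/‖W‖ evaluated over a sliding window and summarized by its mean, median and STD:

```python
import numpy as np

def local_intensity_features(nuclei_mask, win=101):
    """nuclei_mask: binary image with 1 at detected nucleus centers.
    Returns mean, median and STD of the windowed density n/|W|."""
    m = np.asarray(nuclei_mask, dtype=float)
    pad = win // 2
    s = np.pad(m, pad).cumsum(0).cumsum(1)      # integral image of the padded mask
    s = np.pad(s, ((1, 0), (1, 0)))             # zero row/column for inclusive sums
    H, W = m.shape
    counts = (s[win:win + H, win:win + W] - s[:H, win:win + W]
              - s[win:win + H, :W] + s[:H, :W])
    lam = counts / float(win * win)             # lambda-hat per pixel
    return lam.mean(), np.median(lam), lam.std()
```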
We compute the Moran's I statistic (Moran, 1950), which is a second-order measure of spatial autocorrelation that captures how dispersed or clustered the points are, and take the maximum value of the calculated Moran's I over the patch as a feature. We also compute Ripley's K-function K(r), its Besag's L-function transformation L(r), and the pair correlation function (pcf) pcf(r) (Baddeley et al., 2015), which are based on the average number of nuclei within a search radius r over the image. Specifically, the pcf, as the derivative of the expected number of nuclei with respect to the radius, can be a measure of nuclei "rings" that are associated with glands.
The mean and STD of r − L(r), as well as the number of its zero crossings and the distances between them, were calculated as features. For the pcf, the magnitudes and frequencies of its spectrum peaks were used as features. In addition, inspired by the step-response characteristics of a second-order control system (Dorf and Bishop, 2011), the approximated damping ratio and natural frequency of the pcf were also computed as features. A border edge correction was used in the computation of the spatial point functions. Figure 4 illustrates the spatial statistics functions for two different high-grade patches.
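A naive estimate of the r − L(r) curve can be sketched as follows (for illustration only: self-pairs are excluded but the border edge correction used above is omitted):

```python
import numpy as np

def r_minus_l(points, radii, area):
    """Ripley's K(r) = |W| / (n(n-1)) * #{ordered pairs with distance <= r},
    transformed to Besag's L(r) = sqrt(K(r)/pi). Under complete spatial
    randomness r - L(r) is approximately zero, so its mean, STD and zero
    crossings indicate clustered or dispersed nuclei."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d[np.diag_indices(n)] = np.inf              # drop self-pairs
    k = np.array([(d <= r).sum() for r in radii]) * area / (n * (n - 1))
    return np.asarray(radii) - np.sqrt(k / np.pi)
```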
2.5. Label Probability and Ground Truth Estimation

We formulate our multi-scale, multi-class and multi-label training data as D = {(x_i, y_i)}_{i=1}^N, where N is the total number of samples (image patches) with features x_i, defined in Section 2.4, and y_i = (y_i^1, . . . , y_i^L) comprises the L experts' labels associated with the i-th patch, with y_i^l ∈ {0, . . . , C − 1}. In our case, we have a set of C = 4 classes, namely, class 0 for benign, and classes 1–3 for Gleason grades 3–5, respectively. A null class is assigned to patches an annotator left unmarked, so that these patches are still taken into account if other annotators marked them.

We convert the multi-class classification into a set of C binary classifications, D_c = {(x_i, ỹ_{i,c})}_{i=1}^N, for c = 0, . . . , C − 1, with

    \tilde{y}_{i,c}^l = \begin{cases} 1 & \text{if } y_i^l = c \\ 0 & \text{otherwise} \end{cases}    (1)
Figure 4: Spatial statistics of nuclei. (a) Image patch. (b) r − L(r). (c) pcf(r). The consensus for the patches in the top and bottom rows was Gleason 4 and Gleason 5, respectively. Features extracted from the L-function and pair correlation function can indicate dispersed or clustered patterns. The dashed lines are the theoretical plots for a complete random distribution.
Following (Raykar et al., 2010), we first initialize the sensitivity and specificity of each annotator l for each class c as α_c^l = 1 and β_c^l = 1, ∀c, l, respectively. We consider ŷ_i to be the unknown ground truth of patch i, and compute the initial estimate of its posterior probability for each class c based on the original label distribution,

    \mu_{i,c} = \Pr(\hat{y}_i = c \mid D_c, \alpha_c^l, \beta_c^l) = \frac{1}{L} \sum_{l=1}^{L} \tilde{y}_{i,c}^l.    (2)

A MAP estimate of the ground truth label itself for patch i can then be computed by ŷ_i^MAP = arg max_{0 ≤ c ≤ C−1} μ_{i,c}. Note that the MAP estimate of the ground truth using the initial μ_{i,c} is simply the majority-vote label.

For training, rather than using the estimated ground truth as labels, we set the continuous posterior probabilities μ_{i,c} as regression targets, and therefore train C regressors. The motivation is to provide a prediction of the label probability, from which any estimate (not necessarily MAP) can be deduced, as well
as to have a confidence measure associated with the predicted labels that is based on the inter-observer agreement.

In each iteration of the training process, we update the estimate of the ground truth probability (the E-step) by

    \mu_{i,c} = \frac{a_{i,c}\, p_{i,c}}{a_{i,c}\, p_{i,c} + b_{i,c} (1 - p_{i,c})},    (3)

where

    a_{i,c} = \prod_{l=1}^{L} (\alpha_c^l)^{\tilde{y}_{i,c}^l} (1 - \alpha_c^l)^{1 - \tilde{y}_{i,c}^l}, \quad b_{i,c} = \prod_{l=1}^{L} (\beta_c^l)^{1 - \tilde{y}_{i,c}^l} (1 - \beta_c^l)^{\tilde{y}_{i,c}^l},    (4)

and p_{i,c} in our case is taken as the output score of regressor c in the current iteration.
We then compute estimates for the sensitivity and specificity of the annotators (the M-step) by

    \alpha_c^l = \frac{\sum_{i \in \text{training}} \mu_{i,c}\, \tilde{y}_{i,c}^l}{\sum_{i \in \text{training}} \mu_{i,c}}, \quad \beta_c^l = \frac{\sum_{i \in \text{training}} (1 - \mu_{i,c}) (1 - \tilde{y}_{i,c}^l)}{\sum_{i \in \text{training}} (1 - \mu_{i,c})}.    (5)
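The iteration of Eqs. (2)–(5) can be sketched in numpy for a single class c (a minimal version in which the classifier score p_{i,c} is held fixed, whereas the full procedure re-fits regressor c in every iteration; the M-step is applied before the E-step so that the nominal α = β = 1 initialization does not degenerate; `em_ground_truth` is a hypothetical name):

```python
import numpy as np

def em_ground_truth(y_tilde, p, n_iters=25):
    """y_tilde: (N, L) binary labels of L annotators for one class c.
    p: (N,) classifier scores p_{i,c}. Returns the soft labels mu and the
    per-annotator sensitivity (alpha) and specificity (beta)."""
    mu = y_tilde.mean(axis=1)                     # Eq. (2): initial posterior
    for _ in range(n_iters):
        # M-step, Eq. (5)
        alpha = (mu @ y_tilde) / mu.sum()
        beta = ((1 - mu) @ (1 - y_tilde)) / (1 - mu).sum()
        # E-step, Eqs. (3)-(4)
        a = np.prod(alpha ** y_tilde * (1 - alpha) ** (1 - y_tilde), axis=1)
        b = np.prod(beta ** (1 - y_tilde) * (1 - beta) ** y_tilde, axis=1)
        mu = a * p / (a * p + b * (1 - p) + 1e-12)
    return mu, alpha, beta
```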
Typically, the EM algorithm should be run until convergence. However, to shorten the simulation time we set a fixed number of 25 iterations, which was found to be a good tradeoff between convergence and running time.

2.6. Classification and Evaluation Methods
After the feature extraction and initial ground truth estimation, we experimented with different types of regression methods and models to test our classification framework. The experiments were conducted on subsets of 171 and 60 patients (247 and 86 cores) as the training and testing sets, respectively, which were partitioned to attain a similar intra-set class distribution (stratified sampling). Specifically, we evaluated the performance of linear discriminant analysis (LDA), logistic regression (LR), SVM, and RF (with different numbers of trees). From the RF results, we also estimated the importance of the features for each class based on its corresponding regressor.

We also evaluated the performance of a DL approach on the same training and testing subsets. The convolutional neural network (CNN) used in this work
follows the general architecture of U-net (Ronneberger et al., 2015). We made a few changes to the basic U-net. First, we feed the network with the original patches of size 512×512 pixels, as well as patches downsampled by factors of 2 and 4. Second, as suggested by previous studies (Drozdzal et al., 2016), we used short and long skip connections in both the contracting and expanding sections to improve the gradient flow and ease the training. Finally, to reduce the model size, we forward the features computed at each level of the contracting path to all the coarser levels, similar to the approach suggested in (Huang et al., 2016). During training, we used PCA-based color jitter (Krizhevsky et al., 2012) and elastic image deformation (Ronneberger et al., 2015) for data augmentation. We trained the network by minimizing the cross-entropy between its output multi-class probability map and the true label map using Nesterov's accelerated gradient descent method.
Based on the comparison analysis results, we evaluated our framework with LR in a thorough leave-one-patient-out cross-validation process over the 231 patients. In each experiment, we selected patches from all patients but one as the training set, trained a supervised classifier with the multi-expert probabilistic approach described above, and applied it to the patches of the left-out patient (1–3 cores). The predicted labels of the left-out patches were recorded for validation, and the classification process was repeated for each patient.
We evaluated the classification performance in several ways. The analysis was performed on the left-out patches by aggregating the classification results of the leave-one-out experiments. First, we plotted the receiver operating characteristic (ROC) curves, computed by varying the decision threshold for each of the labels against the others. For each curve we calculated the area under the curve (AUC), and estimated the optimal threshold and its corresponding accuracy, sensitivity and specificity.

In order to assess the agreement level of the automatic classification, in comparison to the inter-observer agreement, we used the κ statistic (Cohen's kappa coefficient) (Sheskin, 2003; Viera et al., 2005). The κ coefficient ranges from
Table 3: Comparison of the classification results between methods on a leave-out set of patients.

 | | LDA | LR | SVM | RF100 | RF250 | RF500 | DL
Cancer detection (benign vs. all grades) | Accuracy (%) | 91.6 | 90.8 | 88.5 | 89.7 | 89.7 | 89.8 | 91.9
 | Sensitivity (%) | 94.5 | 91.9 | 90.8 | 93.0 | 92.8 | 92.8 | 92.7
 | Specificity (%) | 76.0 | 84.5 | 75.5 | 70.8 | 72.2 | 72.8 | 82.7
Low- vs. high-grade (Grade 3 vs. Grades 4, 5) | Accuracy (%) | 79.1 | 79.0 | 73.8 | 78.8 | 79.4 | 79.4 | 77.8
 | Sensitivity (%) | 80.2 | 75.6 | 69.3 | 79.2 | 79.1 | 79.3 | 79.0
 | Specificity (%) | 77.1 | 84.7 | 81.6 | 78.2 | 79.7 | 79.6 | 75.2
Overall agreement | Unweighted κ | .49 | .52 | .45 | .49 | .50 | .50 | .37
−1 to 1. Non-positive values indicate an accidental agreement; positive values in the intervals 0.1–0.2, 0.21–0.4, 0.41–0.6, 0.61–0.8 and 0.81–1.0 reflect slight, fair, moderate, substantial and (near) perfect agreement, respectively. To be consistent with (Allsbrook et al., 2001b), we computed the unweighted κ for the patch-wise grades of each pathologist compared to the other pathologists, as well as to the automatic classification results. The overall unweighted κ's were also computed based on the corresponding pathologist or automatic classifier vs. all human pathologists as predictors.
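For two graders, the unweighted κ reduces to a few lines (a generic sketch, not the authors' implementation):

```python
from collections import Counter

def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e), with observed agreement p_o and
    chance agreement p_e derived from the two graders' label frequencies."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```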
3. Results
3.1. TMA Classification
The results of the different classification methods we tested on the leave-out set of patients are summarized in Table 3, with the best performance achieved by the LR classifier. For the leave-one-patient-out experiment using LR, the patch-wise ROC curves of the automatic classifier, as computed based on the left-out patch probabilities, are plotted in Figure 5. The weighted mean ROC curve is computed by taking the occurrence of each label as weights, and reflects the overall performance of the classifier with an AUC of 0.85.

When considering the MAP estimation, by assigning each patch the label with the highest probability, the classification accuracy, sensitivity and specificity for cancer detection (benign vs. all grades) are 90.5%, 91.5% and 85.2%,
Table 4: Inter-observer (unweighted) κ for patch-wise grading on leave-one-out patients (pathologists are not in alphabetical order).

 | Path. 1 | Path. 2 | Path. 3 | Path. 4 | Path. 5 | Path. 6 | Automatic
Path. 2 | .36 | | | | | |
Path. 3 | .41 | .46 | | | | |
Path. 4 | .48 | .53 | .69 | | | |
Path. 5 | .45 | .53 | .72 | .67 | | |
Path. 6 | .48 | .64 | .60 | .58 | .65 | |
Automatic | .40 | .40 | .58 | .57 | .53 | .47 |
Overall | .45 | .50 | .61 | .61 | .62 | .59 | .51
respectively. For low- vs. high-grade classification (grade 3 vs. grades 4 and 5), accuracy, sensitivity and specificity are 79.2%, 79.2% and 79.1%, respectively. For comparison, the manual grading by the pathologists in this study leads to inter-observer mean accuracy, sensitivity and specificity of 97.2%, 98.4% and 86.2%, respectively, for cancer detection, and 78.8%, 79.2% and 79.3%, respectively, for low- vs. high-grade.

Further inter-observer analysis of the patch-wise agreement using the unweighted κ coefficient is provided in Table 4. The overall agreement κ of each pathologist with the others (last row in the table, not including the last value) ranged from 0.45 to 0.62 (0.56±0.07), while the automatic classifier's overall agreement with the pathologists was 0.51 (last value in the table). Examples of cores, as annotated by the pathologists, and their consensus and classification label maps are shown in Figures 6–8.
In terms of running times, on a machine with two Intel Xeon E5-2650 v4 CPUs with 12 cores each @ 2.2 GHz and 256 GB of RAM, the processing of each TMA core (∼400 image patches) for feature extraction using a Matlab (MathWorks, Natick, MA, USA) implementation took ∼2 hours using parallel computing (∼14 hours for all 333 cores). Similarly, the training and classification took ∼4.5 hours for the 231 leave-one-patient-out simulations.
Figure 5: The ROC curves of the LR classifier evaluated on left-out patients in comparison with the consensus labels (Benign, AUC = 0.95; Gleason 3, AUC = 0.85; Gleason 4, AUC = 0.82; Gleason 5, AUC = 0.77; Weighted, AUC = 0.85).

Figure 6: Example of the TMA classification results. The automatic classifier is in good agreement with the consensus and captures the detailed grade map.
3.2. Feature Importance
Each feature was ranked by its importance based on the RF experiments, with ranks one and 276 being the highest and lowest importance, respectively. Since our framework consists of a multi-class and multi-scale regression, the features are ranked by multiple importance values, and Table 2 summarizes the highest importance achieved by each group of features, with the corresponding class and scale at which it was attained. Below we provide a summary of the
Figure 7: Example of the TMA classification results. In this example, the automatic classifier has upgraded the consensus grade.

Figure 8: Example of the TMA classification results. Here, the automatic classifier has downgraded the consensus grade, in agreement with some pathologists (when there is a tie, the consensus is taken as the higher grade).
importance results for the novel features proposed in this paper.
The highest importance ranks for the SURF descriptors' mean ranged 2–169 (mean±STD of 34±32), differentiating mostly benign at 10x, while the descriptors' STD ranged 1–69 (18±16), mostly for Gleason grade 4 at 20x. The nuclei distribution spectrum mean was ranked first for classifying benign tissue at 20x, its covariance matrix diagonal elements were ranked second and fourth for classifying Gleason 5 at 40x, while the covariance matrix eigenvalues were ranked third and 30th for classifying benign at 40x and Gleason 3 at 20x, respectively. The perimeter spectrum importance ranged 2–59, with its fourth coefficient ranked the highest in classification of Gleason grade 3 at 40x.
For the nuclei spatial-statistics-based features, Poisson's λ mean and STD were ranked 52 and 20, in classification of Gleason 5 at 40x and Gleason 4 at 20x, respectively. Moran's I maximum value was ranked 33 in classification of Gleason 5 at 10x. The mean and STD of the Ripley's/Besag's-based function r − L(r) were ranked 32 and 7, respectively, both in classification of Gleason 3 at 40x. The number of its zero crossings and their intermediate distances reached ranks 38 and 4, for benign and Gleason 4 at 20x, respectively. The damping ratio and natural frequency of the pcf ranked 7 and 5, respectively, in classification of benign at 40x, while its spectrum peak magnitudes and frequencies achieved best ranks of first and second in classification of Gleason 4 and benign at 10x, respectively.
3.3. Whole-Mount Classification using a TMA-trained Classifier

Here, we applied the classifier that was trained exclusively on the TMA dataset to a WM dataset from another institution in order to evaluate its translation across laboratories. We employed a similar patch-wise multi-scale feature extraction on the WM slides. Due to the lower magnification of the WM dataset compared to the TMA dataset, we extracted image patches within the annotated regions and processed them at two scales (20x and 10x). Due to the significant amount of benign tissue compared to cancer in WM slides, and so as to expedite the processing, we extracted features from a random selection of 10% of the benign patches in each slide.

We then employed a TMA-based classifier to classify the WM patches. The classifier was trained on all TMA cores while using only the features that correspond to the two relevant scales in order to be compatible with the WM data.
Figure 9: An example of a WM slide classified by the TMA-trained classifier. (a) True. (b) Automatic. (Label legend: Grade 5, Grade 4, Grade 3, Benign, Invalid.)

Figure 10: The ROC curves of the TMA-trained classifier evaluated on the WM slides (Benign, AUC = 0.85; Gleason 3, AUC = 0.63; Gleason 4, AUC = 0.81; Gleason 5, AUC = 0.85; Weighted, AUC = 0.75).
An example of a classified slide (with all of its patches processed for illustration) is shown in Figure 9. The overall patch-wise ROC curves are plotted in Figure 10. The resulting weighted mean ROC curve has an AUC of 0.75.
4. Discussion

The results on the TMA dataset suggest that the classifier's performance in detecting cancer (90.5% accuracy) is on par with the most comparable works described in Section 1.2, while its performance in differentiating between low- and high-grade cancer (79.2% accuracy) is lower than the results reported by some comparable works (e.g., 81.0% by (Tabesh et al., 2007), and 85% by (Gorelick et al., 2013)). We note, however, that most previous works have trained and evaluated their classifiers on datasets on the order of ∼100 image patches sampled from a single slide and/or patient. Such datasets are likely to be insufficient for capturing the heterogeneity in histological patterns of PCa across populations. Indeed, the performances of the works reviewed in (Mosquera-Lopez et al., 2015) seem to be inversely correlated with the size of the data. In contrast, our multi-class classifier for benign and Gleason grades 3–5 patterns was developed based on a dataset of over 16,000 patch samples (not including the half-patch and multi-scale sampling) from different slides and over 200 patients, and was evaluated in a leave-patients-out cross-validation.
We also note that, given the inter-observer grading variability of PCa, a perfect classifier that was trained and evaluated by a single pathologist is expected to perform only at the mean inter-observer agreement rate when evaluated against another pathologist. Thus, even when evaluated in a cross-validation manner, the classification results of previous works are likely to overfit the training data to some extent. In our study, the classifier was trained to predict the multi-label probabilities of each grade class based on six pathologists, and its overall agreement with the pathologists (unweighted κ of 0.51) fits within the range of agreement between the pathologists. The inter-observer agreements between the pathologists (unweighted κ of 0.56±0.07) were also consistent with the levels reported in the literature (De La Taille et al., 2003), despite the small amount of tissue in each TMA core (in practice the sample is larger and at least three levels are provided).

In general, the more annotations by more pathologists we add to our data, the better and more robustly our classifier is expected to perform. Information regarding the annotation quality, e.g., based on experience, can be utilized through the probabilistic framework by assigning a prior probability to each
expert. Such priors can be determined based on a screening questionnaire or a preliminary evaluation on examples prior to the annotation task. Annotation variability may also be associated with different grading guidelines between cen29
ACCEPTED MANUSCRIPT
ters or settings (e.g., research vs. clinical), rather than experience. Therefore, the framework also allows to tune the classifier to a specific center by changing 575
the priors in favour of annotators from that center.
CR IP T
The motivation for not having a consensus meeting among the pathologists prior to their annotation of the cores was to expose the classifier to different grading “doctrines,” and therefore to increase its robustness to datasets
acquired, e.g., by different centers. However, a consensus meeting might have 580
been useful in some ambiguous cases for which pathologists would normally look
into additional information. For example, regions with high-grade PIN, which
AN US
is usually confirmed with immunohistochemistry staining, were marked by some of the pathologists as benign or left unmarked, while others have assumed the “worst case scenario” and marked them as cancer.
The TMA dataset provided a consistent processing of the samples and a
585
diversity of patients in the dataset. However, this may have hindered the generality of the classifier since there are typically differences between laboratories in,
M
e.g., preparation, staining and scanning of histology slides (Yagi, 2011), which could be a reason for the lower performance of the classifier on the WM dataset that was completely excluded from the training phase. In the future, we in-
ED
590
tend to apply stain normalization (Bautista et al., 2014; Bejnordi et al., 2016; Janowczyk et al., 2017; Khan et al., 2014), data and color augmentation, or neu-
PT
ral networks approaches (Lafarge et al., 2017) to improve the generalization of the classifier to staining differences and other artifacts of histological processing. In addition, although we assumed that skewing the sampling of the 333 TMA
CE
595
cores towards higher grades would also produce low-grade and benign patches (since they are present in many high-grade cores), there were still far less benign
AC
(and grade 5) patches in the training set than grades 3 and 4 patches. Training on more benign patches is therefore expected to increase the classification
600
performance on the WM slides. Due to the large amount of benign tissue in WM (compared to cancer), it was assumed that the randomly sampled 10% benign patches provided an adequate representation. However, for full analysis, processing of the entire benign tissue on the WM slides is required. We also 30
ACCEPTED MANUSCRIPT
note that while four pathologists have annotated the WM data, they did not 605
overlap and only jointly discussed and reached a consensus regarding regions for which the assigned pathologist had a doubt (Iczkowski et al., 2011). Thus, the
CR IP T
performance on the WM data was effectively evaluated against one pathologist and therefore the results may suffer from the overfitting issues above.
The feature-based classification performance was found to be similar across 610
the different methods we tested, with the best results in terms of overall agree-
ment achieved by LR and the worst by SVM. We note that we did not tune parameters for these methods, and the classification performance may be im-
AN US
proved by using, e.g., kernel-based methods and regularization (Ginsburg et al., 2016). The general performance may be further improved by improving the 615
accuracy of the extracted features by, e.g., updating the gland and nuclei segmentation methods, which is an active field of research (Irshad et al., 2014). We note that, as suggested by (Nguyen et al., 2014), even a perfect segmentation of every gland and nuclei may not dramatically improve the grading performance.
620
M
In our multi-class, multi-scale approach, we found that most of the features we used to be important, at least in distinguishing one class at a certain scale.
ED
Specifically, the newly proposed features that are based on spatial point statistics, shape spectrum, and SURF descriptors of the nuclei were found to be ranked highly. While the combination of features seems to therefore contribute
625
PT
to the overall performance, a dynamic feature selection could be performed for each regressor based on the corresponding importance, such that different fea-
CE
tures at different scales are used for each classification task in order to optimize computational efficiency and avoid potential overfitting. The feature-based methods were found to perform better than the DL ap-
AC
proach we tested. However, we expect the DL performance to improve with
630
additional training data, which will also allow the training of deeper architectures. Still, since handcrafted features have the advantage of providing clinicians with insight into the automatic decision process, the lack of feature interpretation in DL, even if performing well, may hinder the clinical deployment of CAD systems (Niazi et al., 2017; Shen et al., 2017). Thus, as future work, we plan to 31
ACCEPTED MANUSCRIPT
635
investigate the correlation of DL-based features with the handcrafted ones and employ a hybrid approach similar to (Cruz-Roa et al., 2013; Rezaeilouyeh et al., 2016; Zhou et al., 2017).
CR IP T
We note that our performance analysis was performed by comparing the labels of corresponding pairs of image patches in order to provide detailed clas640
sification of the tissue. Based on the primary and secondary grades, a single score could be assigned for the entire TMA core or biopsy core (or WM slide),
as typically done for clinical purposes. The score can be estimated by setting thresholds for the number of patches required to constitute primary and sec-
645
AN US
ondary grades. In such case, the classification results are expected to improve
since the core-wise scores would be less “noisy” than the raw patch-wise grades. Also, to measure scoring agreement, a weighted κ coefficient can be used so as to mitigate grading differences, e.g., within ±1 Gleason, since these disagreements between neighbouring scores are often not clinically important.
The developed PathMarker app's interface on the tablet was received positively by the pathologists involved in our study and allowed for quick and intuitive, yet detailed, annotations. To make the tool available for use by others, we intend to add more features and resolve technical issues, such as the current downsampling of the images to 10x magnification. While PCa grading can typically be performed at 4x magnification, and in certain instances at 10x (Gordetsky and Epstein, 2016), and the pathologists confirmed that the image resolution in the app was sufficient for grading, it might still be advantageous to allow for higher magnifications to avoid any potential impact on the inter-observer variability. We also intend to incorporate the app as part of the CAD system to “close the loop” by presenting pathologists with the suggested automatic grading annotations in real time. Such a feature will also be useful in training and teaching
pathology residents, for whom digital pathology has already been shown to be an effective educational tool (Pantanowitz et al., 2012).

Since histologic patterns that yield the same Gleason grade are heterogeneous, explicit classification of patterns, such as those described in (Iczkowski et al., 2011; McKenney et al., 2016), from which the Gleason grades can be deduced and with which disease outcomes may be better correlated, is likely to provide a more consistent classifier. We therefore plan to include additional classes of patterns for developing such a classifier. This, however, will require increasing the training dataset. In addition, some tissue may contain a mixture of two or more intermixed patterns that cannot be annotated as separate regions. Pathologists would then typically assign an estimate of the percentage of the patterns to the entire region. Another advantage of the probabilistic approach is that it inherently supports such cases by simply allowing the experts' labels to take real values.
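For instance, labels from several experts, including percentage estimates for intermixed patterns, can be averaged into a single real-valued training target. This is a minimal sketch with a hypothetical class layout and annotations:

```python
import numpy as np

# One region annotated by three experts over four hypothetical classes:
# [benign, Gleason 3, Gleason 4, Gleason 5].
expert_labels = np.array([
    [0.0, 1.0, 0.0, 0.0],  # expert 1: pure Gleason 3
    [0.0, 1.0, 0.0, 0.0],  # expert 2: pure Gleason 3
    [0.0, 0.6, 0.4, 0.0],  # expert 3: estimates a 60/40 mix of patterns 3 and 4
])

# Real-valued target for the probabilistic classifier; still sums to 1.
soft_target = expert_labels.mean(axis=0)  # ~[0.0, 0.867, 0.133, 0.0]
```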
We are also looking into classification of our features with respect to clinical outcomes, e.g., biochemical recurrence. The classifier may allow a less subjective prediction of outcomes than the Gleason score, as it is based on a multitude of extracted features, many of which are not explicitly noted by pathologists. However, the single level and small amount of tissue in the TMA cores may not have been sampled from the lesions that led to a poor outcome. Biopsy or WM datasets, in which the samples are larger and more levels are available, are more suitable for such a study.
5. Conclusion
We developed a CAD system for automatic grading of PCa in digitized histopathology images using supervised learning methods. We proposed novel features that include spatial inter-nuclei statistics and intra-nuclei properties, which were found to be important at different scales, mainly for distinguishing high-grade PCa patterns. Our approach was evaluated in leave-patients-out experiments with consistent results across different classification methods. The classifiers were trained to predict the probabilities of each grade class based on labels provided by multiple pathologists, and their overall agreement with the pathologists was found to be consistent with the agreement levels reported in the literature.

We consider the utilization of data annotated by multiple pathologists in a probabilistic framework as the main contribution of this work. The approach tackles the inter-observer variability in PCa grading and can lead to consensus-based training that improves both classification and feedback to pathologists as more annotations are accumulated. The CAD system can eventually be translated to the clinic as a tool for pathologists, allowing rapid automatic screening of vast amounts of tissue and producing quantitative measures based on the entire tissue, e.g., for determining the pathological stage of postoperative specimens.

Acknowledgments
We would like to thank Dr. Martin Gleave and his group at the Vancouver Prostate Centre for the preparation, staining and scanning of the TMA data. This work was funded by a Prostate Cancer Canada Grant #D2016-1352, by the CIHR Grant #MOP-142439, and by Professor Salcudean's C.A. Laszlo Chair. Guy Nir is a recipient of a Prostate Cancer Canada Post-Doctoral Research Fellowship Award #PDF2016-1338.
References
Allsbrook, W.C., Mangold, K.A., Johnson, M.H., Lane, R.B., Lane, C.G., Amin, M.B., Bostwick, D.G., Humphrey, P.A., Jones, E.C., Reuter, V.E., et al., 2001a. Interobserver reproducibility of Gleason grading of prostatic carcinoma: urologic pathologists. Human Pathology 32, 74–80.

Allsbrook, W.C., Mangold, K.A., Johnson, M.H., Lane, R.B., Lane, C.G., Epstein, J.I., 2001b. Interobserver reproducibility of Gleason grading of prostatic carcinoma: general pathologists. Human Pathology 32, 81–88.

Arevalo, J., Cruz-Roa, A., et al., 2014. Histopathology image representation for automatic analysis: a state-of-the-art review. Revista Med 22, 79–91.

Baddeley, A., Rubak, E., Turner, R., 2015. Spatial point patterns: methodology and applications with R. CRC Press.
Bautista, P.A., Hashimoto, N., Yagi, Y., et al., 2014. Color standardization in whole slide imaging using a color calibration slide. Journal of Pathology Informatics 5, 4.

Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359.

Bejnordi, B.E., Litjens, G., Timofeeva, N., Otte-Höller, I., Homeyer, A., Karssemeijer, N., van der Laak, J.A., 2016. Stain specific standardization of whole-slide histopathological images. IEEE Transactions on Medical Imaging 35, 404–415.

Berney, D.M., Algaba, F., Camparo, P., Compérat, E., Griffiths, D., Kristiansen, G., Lopez-Beltran, A., Montironi, R., Varma, M., Egevad, L., 2014. The reasons behind variation in Gleason grading of prostatic biopsies: areas of agreement and misconception among 266 European pathologists. Histopathology 64, 405–411.

Bhargava, R., Madabhushi, A., 2016. Emerging themes in image informatics and molecular analysis for digital pathology. Annual Review of Biomedical Engineering 18, 387–412.

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer-Verlag New York.

Bova, G.S., Parmigiani, G., Epstein, J.I., Wheeler, T., Mucci, N.R., Rubin, M.A., 2001. Web-based tissue microarray image data analysis: initial validation testing through prostate cancer Gleason grading. Human Pathology 32, 417–427.

Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

Chatelain, P., Pauly, O., Peter, L., Ahmadi, S.A., Plate, A., Bötzel, K., Navab, N., 2013. Learning from multiple experts with random forests: Application to the segmentation of the midbrain in 3D ultrasound, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 230–237.

Chen, H., Qi, X., Yu, L., Heng, P.A., 2016. DCAN: Deep contour-aware networks for accurate gland segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2487–2496.
Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J., 2013. Mitosis detection in breast cancer histology images with deep neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 411–418.

Cruz-Roa, A.A., Ovalle, J.E.A., Madabhushi, A., Osorio, F.A.G., 2013. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 403–410.

De La Taille, A., Viellefond, A., Berger, N., Boucher, E., De Fromont, M., Fondimare, A., Molinié, V., Piron, D., Sibony, M., Staroz, F., et al., 2003. Evaluation of the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma using tissue microarrays. Human Pathology 34, 444–449.

Dekel, O., Shamir, O., 2009. Vox populi: Collecting high-quality labels from a crowd, in: Proceedings of the 22nd Annual Conference on Learning Theory.

Dorf, R.C., Bishop, R.H., 2011. Modern control systems. Pearson.

Doyle, S., Feldman, M., Tomaszewski, J., Madabhushi, A., 2012a. A boosted Bayesian multiresolution classifier for prostate cancer detection from digitized needle biopsies. IEEE Transactions on Biomedical Engineering 59, 1205–1218.

Doyle, S., Feldman, M.D., Shih, N., Tomaszewski, J., Madabhushi, A., 2012b. Cascaded discrimination of normal, abnormal, and confounder classes in histopathology: Gleason grading of prostate cancer. BMC Bioinformatics 13, 282.
Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The importance of skip connections in biomedical image segmentation, in: Deep Learning and Data Labeling for Medical Applications. Springer, pp. 179–187.

Epstein, J.I., Allsbrook Jr, W.C., Amin, M.B., Egevad, L.L., Committee, I.G., et al., 2005. The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma. The American Journal of Surgical Pathology 29, 1228–1242.

Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., Srigley, J.R., Humphrey, P.A., Committee, G., et al., 2016a. The 2014 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system. The American Journal of Surgical Pathology 40, 244–252.

Epstein, J.I., Zelefsky, M.J., Sjoberg, D.D., Nelson, J.B., Egevad, L., Magi-Galluzzi, C., Vickers, A.J., Parwani, A.V., Reuter, V.E., Fine, S.W., et al., 2016b. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. European Urology 69, 428–435.

Ginsburg, S.B., Lee, G., Ali, S., Madabhushi, A., 2016. Feature importance in nonlinear embeddings (FINE): applications in digital pathology. IEEE Transactions on Medical Imaging 35, 76–88.

Gleason, D.F., 1966. Classification of prostatic carcinomas. Cancer Chemotherapy Reports 50, 125–128.

Gordetsky, J., Epstein, J., 2016. Grading of prostatic adenocarcinoma: current state and prognostic implications. Diagnostic Pathology 11, 25.

Gorelick, L., Veksler, O., Gaed, M., Gómez, J.A., Moussa, M., Bauman, G., Fenster, A., Ward, A.D., 2013. Prostate histopathology: Learning tissue component histograms for cancer detection and classification. IEEE Transactions on Medical Imaging 32, 1804–1818.

Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
Iczkowski, K.A., Torkko, K.C., Kotnis, G.R., Storey Wilson, R., Huang, W., Wheeler, T.M., Abeyta, A.M., La Rosa, F.G., Cook, S., Werahera, P.N., et al., 2011. Digital quantification of five high-grade prostate cancer patterns, including the cribriform pattern, and their association with adverse outcome. American Journal of Clinical Pathology 136, 98–107.

Ipeirotis, P.G., Provost, F., Wang, J., 2010. Quality management on Amazon Mechanical Turk, in: Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM. pp. 64–67.

Irshad, H., Veillard, A., Roux, L., Racoceanu, D., 2014. Methods for nuclei detection, segmentation, and classification in digital histopathology: a review – current status and future potential. IEEE Reviews in Biomedical Engineering 7, 97–114.

Janowczyk, A., Basavanhally, A., Madabhushi, A., 2017. Stain normalization using sparse autoencoders (StaNoSA): Application to digital pathology. Computerized Medical Imaging and Graphics 57, 50–61.

Janowczyk, A., Madabhushi, A., 2016. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics 7.

Jothi, J.A.A., Rajam, V.M.A., 2017. A survey on automated cancer diagnosis from histopathology images. Artificial Intelligence Review 48, 31–81.

Källén, H., Molin, J., Heyden, A., Lundström, C., Åström, K., 2016. Towards grading Gleason score using generically trained deep convolutional neural networks, in: Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, IEEE. pp. 1163–1167.
Khan, A.M., Rajpoot, N., Treanor, D., Magee, D., 2014. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Transactions on Biomedical Engineering 61, 1729–1738.

Kononen, J., Bubendorf, L., Kallionimeni, A., Bärlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G., Kallionimeni, O.P., 1998. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Medicine 4, 844–847.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097–1105.

Lafarge, M.W., Pluim, J.P., Eppenhof, K.A., Moeskops, P., Veta, M., 2017. Domain-adversarial neural networks to address the appearance variability of histopathology images, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 83–91.

Lee, G., Sparks, R., Ali, S., Shih, N.N., Feldman, M.D., Spangler, E., Rebbeck, T., Tomaszewski, J.E., Madabhushi, A., 2014. Co-occurring gland angularity in localized subgraphs: predicting biochemical recurrence in intermediate-risk prostate cancer patients. PLoS One 9, e97954.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.

Litjens, G., Sánchez, C.I., Timofeeva, N., Hermsen, M., Nagtegaal, I., Kovacs, I., Hulsbergen-Van De Kaa, C., Bult, P., Van Ginneken, B., Van Der Laak, J., 2016. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific Reports 6, 26286.

Lloyd, S., 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 129–137.
Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, IEEE. pp. 1150–1157.

Madabhushi, A., Lee, G., 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Analysis 33, 170–175.

McKenney, J.K., Wei, W., Hawley, S., Auman, H., Newcomb, L.F., Boyer, H.D., Fazli, L., Simko, J., Hurtado-Coll, A., Troyer, D.A., et al., 2016. Histologic grading of prostatic adenocarcinoma can be further optimized: analysis of the relative prognostic strength of individual architectural patterns in 1275 patients from the Canary retrospective cohort. The American Journal of Surgical Pathology 40, 1439–1456.

Moran, P.A., 1950. Notes on continuous stochastic phenomena. Biometrika 37, 17–23.

Mosquera-Lopez, C., Agaian, S., Velez-Hoyos, A., Thompson, I., 2015. Computer-aided prostate cancer diagnosis from digitized histopathology: a review on texture-based systems. IEEE Reviews in Biomedical Engineering 8, 98–113.

Nguyen, K., Sarkar, A., Jain, A.K., 2012. Structure and context in prostatic gland segmentation and classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 115–123.

Nguyen, K., Sarkar, A., Jain, A.K., 2014. Prostate cancer grading: Use of graph cut and spatial arrangement of nuclei. IEEE Transactions on Medical Imaging 33, 2254–2270.

Niazi, M.K.K., Yao, K., Zynger, D., Clinton, S., Chen, J., Koyuturk, M., LaFramboise, T., Gurcan, M., 2017. Visually meaningful histopathological features for automatic grading of prostate cancer. IEEE Journal of Biomedical and Health Informatics 21, 1027–1038.
Pantanowitz, L., 2010. Digital images and the future of digital pathology. Journal of Pathology Informatics 1.

Pantanowitz, L., Szymas, J., Yagi, Y., Wilbur, D., 2012. Whole slide imaging for educational purposes. Journal of Pathology Informatics 3.

Rashid, S., 2014. Automatic pathology of prostate cancer in whole mount slides incorporating individual gland classification. Ph.D. thesis. University of British Columbia.

Rashid, S., Fazli, L., Boag, A., Siemens, R., Abolmaesumi, P., Salcudean, S.E., 2013. Separation of benign and malignant glands in prostatic adenocarcinoma, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 461–468.

Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L., 2010. Learning from crowds. Journal of Machine Learning Research 11, 1297–1322.

Rezaeilouyeh, H., Mollahosseini, A., Mahoor, M.H., 2016. Microscopic medical image classification framework via deep learning and shearlet transform. Journal of Medical Imaging 3, 044501.

Ripley, B.D., 2005. Spatial statistics. volume 575. John Wiley & Sons.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.

Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. Annual Review of Biomedical Engineering.
Sheskin, D.J., 2003. Handbook of parametric and nonparametric statistical procedures. CRC Press.

Siegel, R.L., Miller, K.D., Jemal, A., 2017. Cancer statistics, 2017. CA: A Cancer Journal for Clinicians 67, 7–30.

Tabesh, A., Teverovskiy, M., Pang, H.Y., Kumar, V.P., Verbel, D., Kotsianti, A., Saidi, O., 2007. Multifeature prostate cancer diagnosis and Gleason grading of histological images. IEEE Transactions on Medical Imaging 26, 1366–1378.

Tsoumakas, G., Katakis, I., Vlahavas, I., 2009. Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook. Springer, pp. 667–685.

Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification. Machine Learning: ECML 2007, 406–417.

Veta, M., Van Diest, P.J., Willems, S.M., Wang, H., Madabhushi, A., Cruz-Roa, A., Gonzalez, F., Larsen, A.B., Vestergaard, J.S., Dahl, A.B., et al., 2015. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Medical Image Analysis 20, 237–248.

Viera, A.J., Garrett, J.M., et al., 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine 37, 360–363.

Warfield, S., Zou, K., Wells, W., 2004. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging 23, 903–921.

Xu, Y., Li, Y., Wang, Y., Liu, M., Fan, Y., Lai, M., Chang, E., 2017. Gland instance segmentation using deep multichannel neural networks. IEEE Transactions on Biomedical Engineering.

Yagi, Y., 2011. Color standardization and optimization in whole slide imaging. Diagnostic Pathology 6, S15.

Zhang, M.L., Zhou, Z.H., 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 1819–1837.

Zhou, N., Fedorov, A., Fennessy, F., Kikinis, R., Gao, Y., 2017. Large scale digital prostate pathology image analysis combining feature extraction and deep neural network. arXiv preprint arXiv:1705.02678.