Automatic grading of prostate cancer in digitized histopathology images: Learning from multiple experts


Accepted Manuscript

Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts

Guy Nir, Soheil Hor, Davood Karimi, Ladan Fazli, Brian F. Skinnider, Peyman Tavassoli, Dmitry Turbin, Carlos F. Villamil, Gang Wang, R. Storey Wilson, Kenneth A. Iczkowski, M. Scott Lucia, Peter C. Black, Purang Abolmaesumi, S. Larry Goldenberg, Septimiu E. Salcudean

PII: S1361-8415(18)30749-7
DOI: https://doi.org/10.1016/j.media.2018.09.005
Reference: MEDIMA 1410

To appear in: Medical Image Analysis

Received date: 11 January 2018
Revised date: 11 July 2018
Accepted date: 21 September 2018

Please cite this article as: Guy Nir, Soheil Hor , Davood Karimi , Ladan Fazli, Brian F. Skinnider, Peyman Tavassoli, Dmitry Turbin, Carlos F. Villamil, Gang Wang, R. Storey Wilson, Kenneth A. Iczkowski, M. Scott Lucia, Peter C. Black, Purang Abolmaesumi, S. Larry Goldenberg, Septimiu E. Salcudean, Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts, Medical Image Analysis (2018), doi: https://doi.org/10.1016/j.media.2018.09.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• A system for automatic histological grading of prostate cancer was developed.
• The classifier was trained based on detailed annotations of six pathologists.
• The inter-observer variability was addressed by adapting a crowdsourcing approach.
• Novel features based on spatial statistics of the nuclei were proposed.
• The performance of the classifier was within agreement levels among pathologists.

Graphical Abstract: (a) Individual cores are segmented on TMA slides. (b) Each core is divided into image patches for multi-scale processing. (c) Cores are annotated in detail by pathologists using a tablet, and corresponding label maps are constructed. (d) Patches are segmented for tissue types, glands and nuclei for feature extraction. (e) Extracted features for each patch are quantized and assigned label probabilities. (f1) Train a supervised classifier (regressor) based on label probabilities. (f2) Update label probabilities based on training results. (g) Predict label probabilities of left-out patches, estimate their labels and reconstruct TMA slides. (h) Measure statistics and agreement levels.

Automatic Grading of Prostate Cancer in Digitized Histopathology Images: Learning from Multiple Experts

Guy Nir a,i,∗, Soheil Hor b,∗∗, Davood Karimi i,∗∗, Ladan Fazli a, Brian F. Skinnider ∗∗∗,e,c, Peyman Tavassoli ∗∗∗,a,d, Dmitry Turbin ∗∗∗,f, Carlos F. Villamil ∗∗∗,e, Gang Wang ∗∗∗,e, R. Storey Wilson g, Kenneth A. Iczkowski h, M. Scott Lucia g, Peter C. Black a, Purang Abolmaesumi i, S. Larry Goldenberg a, Septimiu E. Salcudean i,a,∗

a Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada
b Department of Electrical Engineering, Stanford University, Stanford, CA, USA
c Department of Pathology, Vancouver General Hospital, Vancouver, BC, Canada
d Department of Pathology, Richmond Hospital, Richmond, BC, Canada
e BC Cancer Agency, Vancouver, BC, Canada
f Anatomical Pathology, St. Paul's Hospital, Vancouver, BC, Canada
g Department of Pathology, University of Colorado School of Medicine, Aurora, CO, USA
h Medical College of Wisconsin, Milwaukee, WI, USA
i Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada

Abstract

Prostate cancer (PCa) is a heterogeneous disease that is manifested in a diverse range of histologic patterns, and its grading is therefore associated with an inter-observer variability among pathologists, which may lead to an under- or over-treatment of patients. In this work, we develop a computer aided diagnosis system for automatic grading of PCa in digitized histopathology images using supervised learning methods. Our pipeline comprises extraction of multi-scale features that include glandular, cellular, and image-based features. A number of novel features are proposed based on intra- and inter-nuclei properties; these features are shown to be among the most important ones for classification. We train our classifiers on 333 tissue microarray (TMA) cores that were sampled from 231 radical prostatectomy patients and annotated in detail by six pathologists for different Gleason grades. We also demonstrate the TMA-trained classifier's performance on an additional 230 whole-mount slides of 56 patients, independent of the training dataset, by examining the automatic grading on manually marked lesions and a randomly sampled 10% of the benign tissue. For the first time, we incorporate a probabilistic approach for supervised learning by multiple experts to account for the inter-observer grading variability. Through cross-validation experiments, the overall grading agreement of the classifier with the pathologists was found to be an unweighted kappa of 0.51, while the overall agreements between each pathologist and the others ranged from 0.45 to 0.62. These results suggest that our classifier's performance is within the inter-observer grading variability levels across the pathologists in our study, which are also consistent with those reported in the literature.

∗ Corresponding author(s)
∗∗, ∗∗∗ These authors contributed equally to this work

Keywords: Digital pathology, Prostate cancer, Computer aided diagnosis, Machine learning

Conflicts of interest: None

1. Introduction

1.1. Prostate Cancer Grading

With an estimated number of 161,360 new cases and 26,730 deaths in 2017, prostate cancer (PCa) is the second most commonly diagnosed cancer, and the third most common cause of cancer death, among American men (Siegel et al., 2017). PCa is a heterogeneous disease and is manifested in a diverse range of histologic patterns. The Gleason score (Gleason, 1966) is currently the most common grading system for prostate adenocarcinoma, and has proven to be a clinically relevant prognostic marker. The grade is determined by pathologists based on the glandular architectural features observed in haematoxylin and eosin (H&E) stained samples. Firstly, PCa patterns in a tissue slide are graded from Gleason grade 1, with well differentiated and uniform glands associated with less aggressive disease, to Gleason grade 5, with only occasional gland formation associated with poorer outcomes. Then, the primary (most common) and secondary (second most common) grades are added to yield the Gleason score.

The Gleason system has undergone a few modifications since its inception, including deeming Gleason grades 1 and 2 clinically irrelevant (Epstein et al., 2005) and reassigning certain cancerous glandular patterns to grades that correlate better with patients' prognosis (Epstein et al., 2016a). The most recent grade grouping system (Epstein et al., 2016b) defines Gleason scores ≤ 6 as grade group 1, score 3+4 = 7 as group 2, score 4+3 = 7 as group 3, score 8 as group 4, and scores 9 and 10 as group 5. This grade grouping has been shown to provide better stratification and improved prognostic accuracy (in part by differentiating between scores 3+4 = 7 and 4+3 = 7), and has a lowest grade of 1 (as opposed to 6) to reduce over-treatment. However, the grading of the patterns (mainly grades 3 and 4) that constitute the scores and groups is still susceptible to inter- and intra-observer variability, which is affected by pathologists' training, experience, and institutional consensus.
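For clarity, the score-to-group mapping described above can be written as a small helper. This is an illustrative sketch only; the function name and structure are ours and are not part of the proposed system:

```python
def gleason_grade_group(primary: int, secondary: int) -> int:
    """Map a (primary, secondary) Gleason pattern pair to its grade group,
    following the grouping of Epstein et al. (2016b)."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        # Groups 2 and 3 differ only in which pattern is dominant.
        return 2 if primary == 3 else 3
    if score == 8:
        return 4
    return 5  # scores 9 and 10


assert gleason_grade_group(3, 4) == 2
assert gleason_grade_group(4, 3) == 3
```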

The inter-observer Gleason grading variability ranges from barely moderate agreement between general pathologists (Allsbrook et al., 2001b) (overall unweighted kappa (κ) coefficient of 0.435), to substantial inter-observer agreement between urologic pathologists (Allsbrook et al., 2001a) (overall weighted κ range of 0.56–0.70). The heterogeneity of PCa patterns and grading is demonstrated in Figure 1. The existing variance in PCa grading among pathologists (Berney et al., 2014) may lead to an under- or over-treatment of patients, and therefore impact their survival rates, quality of life, and the costs to the healthcare system.

Tissue microarray (TMA) technology (Kononen et al., 1998) allows for many samples of tissue to be arrayed in a grid and simultaneously processed for sectioning and staining on a single histologic slide. TMA slides present less tissue than typical biopsy samples, and much less than whole-mount (WM) slides, each containing a cross-section of the entire specimen. However, they provide the means for a high-throughput analysis and improved experimental control by having a diverse range of samples from multiple patients and disease types on a single slide. In (De La Taille et al., 2003), the inter-observer reproducibility of Gleason grading between ten genitourinary pathologists was evaluated on TMAs.


Figure 1: Heterogeneity of PCa patterns and grading leads to classification challenges. (a) Inter-observer variability (pathologists are not in alphabetical order). (b) Intra-class variability and inter-class similarity.

It was concluded that the approach is useful for resident training and teaching because it allows trainees to focus on small areas of tissue. However, the grading variability between the observers was high, with agreement percentages ranging from 21% to 82.5%. In (Bova et al., 2001), the authors designed a web-based system for Gleason grading of images of TMA cores for assessing inter- and intra-observer variability between two expert genitourinary pathologists. The images were graded at the core level (grading of the entire core) by answering text-based questions, and the pathologists had an exact agreement on Gleason score in 66% of the cases (98% when separated by only one Gleason grade).

1.2. Machine Learning in Digital Pathology

In recent years, the field known as digital pathology has grown dramatically,

largely due to technological advancements in whole slide imaging, image pro-

60

cessing and machine learning algorithms, and increase in computational power (Pantanowitz, 2010). As part of this field, many methods have been proposed in the literature for automatic histopathological image analysis and classification (Madabhushi and Lee, 2016). Such automation can be used in developing computer aided diagnosis (CAD) systems to increase accuracy, reproducibility 6

ACCEPTED MANUSCRIPT

65

and throughput in diagnosis of cancer in general, and of PCa in particular. A typical CAD system first employs image processing for staining standardization, tissue segmentation and histologic feature extraction, followed by

CR IP T

employment of machine learning classification of the extracted features. The features can be largely classified into architectural, morphological, and textural 70

features. Recent survey papers (Arevalo et al., 2014; Bhargava and Madabhushi, 2016; Jothi and Rajam, 2017; Mosquera-Lopez et al., 2015) describe the different approaches towards histopathological CAD systems, and below we discuss the works most relevant to ours.

75

AN US

In (Tabesh et al., 2007), the authors extracted features that describe the

color, texture, and morphologic cues from 367 and 268 H&E image patches, acquired from TMA and WM datasets, respectively, and employed them in a supervised learning framework. They achieved accuracy of 96.7% and 81.0% for cancer detection (benign vs. cancer) and low- vs. high-grade classification, respectively, in a five-fold cross-validation on each dataset.

In (Doyle et al., 2012b), the authors proposed a cascade approach for the

M

80

multi-class grading problem. Rather than the conventional one-shot classifica-

ED

tion and one-versus-all approaches to multi-class classification, they grouped samples from different classes to maximize intra-group homogeneity and intergroup heterogeneity. They proposed architectural and textural features, and also proposed features based on minimum spanning tree between nuclei. They

PT

85

used their approach to classify 2000 biopsy image patches, taken from 214 pa-

CE

tients, of seven classes (Gleason 3–5, epithelium, stroma, atrophy, and prostatic intraepithelial neoplasia (PIN)), and reported a positive predictive value (PPV) of 86%.

AC

90

In (Gorelick et al., 2013), the authors partitioned image patches into super-

pixels, namely, non-overlapping regions that are similar in color and texture. These superpixels were then classified into different tissue labels based on extracted morphological, geometric and intensity features. The geometric features used for tissue labeling include histograms of scale invariant feature transform

95

(SIFT) descriptors (Lowe, 1999). The image patches are then classified into 7

ACCEPTED MANUSCRIPT

benign vs. cancer, and low- vs. high-grade based on histogram of their corresponding superpixels’ tissue labels with reported accuracies of 90% and 85%, respectively, on 991 image patches from 15 patients.

100

CR IP T

In (Nguyen et al., 2014), the authors developed a method to segment glands based on the spatial arrangement of nuclei. Such a nuclei-based method is

useful in detection of glands without lumen, as is the case in higher grades of PCa. They proposed using minimum spanning trees and show that a fused

lumen- and nuclei-based gland segmentation improves the classifier performance in differentiating Gleason 3 from 4, in comparison with a lumen-based approach alone, and report a 3-class (benign, Gleason 3, and Gleason 4) grading accuracy

AN US

105

of 88% on 317 image patches taken from 29 patients. In (Lee et al., 2014), the authors looked at the disorder in gland orientations for predicting the degree of PCa malignancy. They reported an accuracy of 73% in predicting biochemical recurrence using such features.

Recently, deep learning (DL) approaches were proposed in medical image

110

M

analysis (Litjens et al., 2017), including digital pathology images, e.g., (Janowczyk and Madabhushi, 2016; Litjens et al., 2016) and references therein. These ap-

ED

proaches use the raw digital images as input, with minimal preprocessing, and have the advantage of avoiding the traditional extraction of handcrafted fea115

tures, which may be inaccurate or insufficient. A few recent works applied DL

PT

for detection and grading of PCa directly (K¨ all´en et al., 2016; Rezaeilouyeh et al., 2016; Zhou et al., 2017), while other works have used DL for more low-

CE

level tasks such as gland and nuclei segmentation (Chen et al., 2016; Cire¸san et al., 2013; Ronneberger et al., 2015; Xu et al., 2017). The main disadvantage of any DL approach is the requirement of a large

120

AC

training set. Fortunately, histopathology patterns are visible on a small scale. Therefore, a single slide can be used to generate a large number of training patches. On the other hand, as discussed above, the heterogeneity of the data presents a challenge and patches extracted from a single slide are typically cor-

125

related. Another disadvantage in DL is the lack of feature interpretation, which may hinder the clinical deployment of CAD systems (Niazi et al., 2017; Shen 8

ACCEPTED MANUSCRIPT

et al., 2017). In (Niazi et al., 2017), the authors proposed a classifier based on luminal and architectural features that are visually meaningful. They trained and tested their method on independent datasets of 43 and 88 image patches that were extracted from 25 and 30 WM slides (unspecified number of patients),

CR IP T

130

respectively, and reported a 2-class (low- vs. high-grade) classification accuracy of 90.9%.

In (K¨ all´en et al., 2016), the authors used a pre-trained network for PCa

grading by feeding the output of a mid-layer to random forests (RF) (Breiman, 135

2001) and support vector machines (SVMs) (Bishop, 2006) classifiers. They

AN US

achieved a patch-wise and image-wise accuracy of 81% and 89%, respectively,

using a 10-fold cross validation. In (Rezaeilouyeh et al., 2016), the authors proposed a combined DL-based pipeline with handcrafted features (magnitude and phase of a shearlet transform) for PCa grading and reported accuracy, 140

sensitivity and specificity of 88%, 89% and 94%, respectively. Additional hybrid methods of DL and handcrafted features were proposed in (Zhou et al., 2017), for

M

a two-class (Gleason grade group 2 vs. group 3) classification with an accuracy of 75%, and in (Cruz-Roa et al., 2013) for basal-cell carcinoma detection (in

145

ED

skin histopathology slides).

1.3. Multi-label Classification

PT

In this work, we have developed a CAD system for automatic grading of PCa in digitized histopathology images. Our pipeline consisted of processing multiscale image patches, which were then used as a training dataset for supervised

CE

learning. To address the inter-observer variability issue in PCa grading, the

150

training dataset was labeled by multiple pathologists. This leads to a multi-

AC

class (four classes – benign, Gleason grades 3, 4 and 5) and multi-label (also known as multi-output) classification problem, where, given the inter-observer variability, it is unclear how to choose the ground truth label on which the classifier should be trained. One approach is to train a separate classifier on each

155

label. However, such an approach would not utilize the underlying correlations among the labels and leaves the same dilemma of how to choose the correct 9

ACCEPTED MANUSCRIPT

output from the different classifiers. Another approach is to consider the ground truth as the “consensus label”1 based on the majority vote among the labels of each sample (and a tiebreaker policy). Such voting approach was applied, e.g., in (Veta et al., 2015) for mitosis

CR IP T

160

detection in breast cancer images. However, as discussed in (Dekel and Shamir, 2009; Raykar et al., 2010) this approach implicitly assumes that the experts’ an-

notations are weighted equally, while in practice annotators performance may be affected by known and unknown factors as mentioned above. Other approaches 165

include ensemble methods that train many classifiers on random subsets of the

AN US

labels (Tsoumakas and Vlahavas, 2007). Additional approaches to tackle multilabels are also reviewed in (Tsoumakas et al., 2009; Zhang and Zhou, 2014).

In (Warfield et al., 2004), the authors presented the simultaneous truth and performance level estimation (STAPLE) method to compute a probabilistic es170

timate of the true segmentation of medical images from a collection of segmentations that were provided by multiple experts and/or automatic algorithms. The

M

method tackles the inter-observer variability by estimating the sensitivity and specificity of each expert/algorithm using an iterative expectation-maximization

175

ED

(EM) algorithm, and then computes the maximum-a-posteriori (MAP) estimate of the true segmentation from a weighted combination of the different segmentations. The algorithm is used to assess the performance of both automated

PT

and human segmentations.

In our work, to utilize the multi-label data, we follow (Raykar et al., 2010)

CE

and proposed a probabilistic approach for supervised learning by multiple ex180

perts. The method was shown to be useful in similar scenarios that arise in radiology, as well as in crowdsourcing applications (Ipeirotis et al., 2010). It

AC

may be seen as a generalization of the STAPLE algorithm that updates the estimation of the annotators’ performance and ground truth as part of the learning 1 Due

to the high inter-observer variability, considering the actual consensus label, i.e., only

samples that were assigned the same class by all experts, significantly reduces the size of the training set.

10

ACCEPTED MANUSCRIPT

process, and thus takes into account image information (or features) rather than 185

the labels alone (Chatelain et al., 2013). Similar to STAPLE, it also provides

CR IP T

the annotators’ performance as a byproduct. 1.4. Contributions

In achieving our goals, we present the following main contributions: i) acquisition and utilization of multi-expert annotations in the training and evaluation 190

of the classifier, ii) a probabilistic approach to automatic grading of PCa to con-

tend with the inter-observer variability, and iii) development of new multi-scale

AN US

features based on spatial point processes statistics of nuclei (Poisson’s intensity, Moran’s I, Ripley’s functions and pair correlation function), speeded up robust features (SURF) of nuclei, and intra-nuclei properties such as the nuclei 195

perimeter spectrum.

To the best of our knowledge, this work is the first to utilize multiple experts for training and evaluation of a CAD system for PCa grading. Due to

M

the reported inter-observer variability in the literature on PCa grading, we hypothesize that such utilization is necessary to capture the heterogeneity in both data and across pathologists. In addition, our proposed grading CAD sys-

ED

200

tem, classifying for benign and Gleason grades 3, 4, and 5, was evaluated in a rigorous leave-one-patient-out manner on an image set that is in the order of

PT

100 times larger than most comparable works in the literature, and involved a multi-institutional study. Nevertheless, as discussed in Section 4, the automatic 205

classifier’s performance is within the agreement range of human pathologists,

CE

and on par with the most comparable works in the literature. The rest of the manuscript is organized as follows. In Section 2 we describe

AC

the data, the annotation and label probability assignment, and the training and evaluation methods of the classifier. In Section 3 we summarize the results of

210

the classification experiments, and in Section 4 we discuss the significance of the results, the limitations of the study, potential impact and future research direction of the study. We conclude with a summary of the results and contributions in Section 5. 11

ACCEPTED MANUSCRIPT

Figure 2: Illustration of TMA image processing and classification workflow. (a) Individual cores are segmented on TMA slides. (b) Each core is divided into image patches for multi-scale processing. (c) Patches are segmented for tissue types, glands and nuclei for feature extraction. (d) Cores are annotated in detail by pathologists. (e) Extracted features for each patch are quantized and assigned label probabilities. (f1) Train a supervised classifier (regressor) based on label probabilities. (f2) Update label probabilities based on training results. (g) Predict label probabilities of left-out patches and estimate their labels.
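The training and prediction stages of the workflow in Figure 2 can be summarized by a minimal sketch. The feature matrix and per-class label probabilities are assumed to have been produced by the steps detailed in the following subsections; the function names below are our own placeholders, and the logistic-regression stand-in is only illustrative of the per-class regressors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_class_models(features, label_probs):
    """Fit one model per class on the expert-derived soft targets.

    features    : (n_patches, n_features) array of extracted features
    label_probs : (n_patches, n_classes) array of per-class probabilities
                  derived from the experts' annotations (Section 2.5).
    """
    models = []
    for c in range(label_probs.shape[1]):
        # Thresholding the soft target is a simplification for this sketch;
        # the paper regresses on the probabilities themselves.
        y = (label_probs[:, c] > 0.5).astype(int)
        models.append(LogisticRegression(max_iter=1000).fit(features, y))
    return models

def predict_label_probs(models, features):
    """Return an (n_patches, n_classes) matrix of predicted class scores."""
    return np.column_stack([m.predict_proba(features)[:, 1] for m in models])
```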

215

ED

2. Materials and Methods

2.1. Tissue Microarrays Dataset and Processing

PT

Our histopathology data were acquired from seven TMA blocks that were constructed and processed at the Vancouver Prostate Centre and the study was

CE

approved by the institutional Clinical Research Ethics Board (CREB #H1501064). Each block contains 160 tissue cores of 1.1 mm in diameter that were

220

extracted from 502 radical prostatectomy specimens (two to three cores per

AC

specimen, 1120 cores in total). A single level from each of the seven blocks was sectioned, stained in H&E, and digitized as virtual slides at 40x magnification using a SCN400 Slide Scanner (Leica Microsystems, Wetzlar, Germany). The digital images of the seven slides were processed as described below for the de-

225

velopment and evaluation of our CAD system. The processing and classification workflow is further illustrated in Figure 2. 12

ACCEPTED MANUSCRIPT

For systematic processing, each core in the TMA slides was automatically segmented and labeled. The segmentation consisted of thresholding the grayscale converted slide image, at ∼1/116 of its native resolution, followed by morphological operations on the binary image to remove noise and to complete partial

CR IP T

230

cores. The binary image was convolved with a disk kernel mask and the centroids of the cores were detected by another thresholding. To associate each

core with its pixels, a k-means clustering (Lloyd, 1982) was performed, with

the spatial coordinates of the binary image pixels as features and the detected 235

centroids from the previous step as initialization. The cores were then indexed

2.2. Whole-Mount Slides Dataset

AN US

based on their respected row and column.
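The core segmentation and pixel-to-core assignment described above can be sketched as follows. The threshold value, structuring elements, and helper names here are illustrative assumptions, not the exact implementation:

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing
from skimage.color import rgb2gray
from skimage.transform import rescale
from sklearn.cluster import KMeans

def assign_core_pixels(slide_rgb, init_centroids, downscale=1 / 116, thresh=0.8):
    """Rough sketch of the TMA core-to-pixel assignment.

    A low-resolution grayscale copy of the slide is thresholded into a tissue
    mask, cleaned with morphological operations, and each foreground pixel is
    assigned to a core by k-means clustering on its (row, col) coordinates,
    initialized with the detected core centroids.
    """
    gray = rescale(rgb2gray(slide_rgb), downscale, anti_aliasing=True)
    mask = gray < thresh                       # tissue is darker than background
    mask = binary_closing(binary_opening(mask, np.ones((3, 3))), np.ones((5, 5)))

    coords = np.column_stack(np.nonzero(mask))            # (N, 2) pixel coords
    km = KMeans(n_clusters=len(init_centroids),
                init=np.asarray(init_centroids, dtype=float), n_init=1)
    labels = km.fit_predict(coords)
    return coords, labels                       # each pixel gets a core index
```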

For evaluation, we have also used a second dataset that was acquired at the University of Colorado, Denver, from which digitized images and annotations 240

were shared through an inter-institutional Material Transfer Agreement (MTA

M

#M15-00267). The dataset comprises 230 WM slides of 56 radical prostatectomy patients. The slides were stained in H&E and scanned at 20x magnifica-

ED

tion using an Aperio ScanScope XT (Leica Biosystems, Vista, CA, USA). The virtual slides were annotated by uropathologists using the Aperio ImageScope 245

software for different patterns of PCa tissue with which a Gleason grade could

PT

be associated, and different patterns of benign tissue. The annotated patterns, as described in (Iczkowski et al., 2011), included single, small separate acini (corresponding to Gleason grade 3), luminal blue

CE

mucincontaining single, separate acini (Gleason grade 3), undulating, branched,

250

or angulated larger acini that are not truly papillary - no bridging or stromal

AC

cores (Gleason grade 3), mucinous/colloid carcinoma without fusion or individual cells (Gleason grade 3), fused small acini (Gleason grade 4), true papillary with stromal cores or bridging across acinar spaces (Gleason grade 4), small and large cribriform (Gleason grade 4-5), individual cells (corresponding to Gleason

255

grade 5), high grade PIN (benign), and atrophy with and without inflammation (benign). 13

ACCEPTED MANUSCRIPT

2.3. Detailed Annotation of TMA Cores We developed an Android-based application (named “PathMarker”) for quick and intuitive annotation of individual core images using a tablet with a stylus. In order to allow fast loading between images, the images were downsampled to

CR IP T

260

a resolution equivalent to 10x magnification that still allows proper diagnosis (Gordetsky and Epstein, 2016).

Due to the time consuming task of grading 1120 cores in detail, a subset of

333 cores (231 patients) was selected as follows. Based on initial core grading by 265

one pathologist, the 333 cores subset was skewed towards high-grades in expec-

AN US

tation that co-occurrence of benign and low-grade regions in such cores would

yield an equalized distribution of grades in the detailed annotation. All cores in grade group 5 (57 cores) and grade group 4 (106 cores) were selected, followed by a random selection of 10% of benign cores (20 cores), and random selection 270

of equal number of cores from grade groups 1-3 (50 cores from each group). The PathMarker app was loaded with the subset of 333 cores, randomly shuffled

M

such that cores from the same patient or slide are not necessarily consecutive. To date, six pathologists have used PathMarker to annotate the 333 H&E

275

ED

cores for benign, Gleason grade 3, grade 4, and grade 5 tissue. The pathologists were asked to circle regions on the cores with the corresponding grade as they see fit based on their experience. Similar to other studies, e.g., (De La Taille

PT

et al., 2003), there was no initial consensus meeting to agree on how to apply the Gleason grading system for this project. The pathologists’ experience in PCa

CE

diagnosis on histopathology ranged from early-, mid-career to veteran (LF: 27, 280

BFS: 15, PT: 1, DT: 24, CFV: 17, and GW: 5 years of experience); one was a clinical general pathologist, one was a research genitourinary pathologist, and

AC

the others were clinical genitourinary pathologists. To keep the annotators’ performance anonymized, the figures and results reported throughout this paper are not in alphabetical order.

285

Based on the color-coded annotations of each pathologist2 , a set of label 2 For

two of the pathologists, annotations of subsets of 191 and 92 cores were recovered.

14

ACCEPTED MANUSCRIPT

Table 1: Summary of the TMA dataset based on the detailed annotations by the pathologists.

231 63.2 (6.3)

CR IP T

25.9 (8.4%) 102.0 (33.1%) 173.4 (56.3%) 6.6 (2.1%) 307.9 (100%) 19 (5.7%) 61 (18.3%) 58 (17.4%) 62 (18.6%) 114 (34.2%) 19 (5.7%) 333 (100%)

AN US

Number of patients Mean (STD) age at surgery Grade area distribution (consensus, in mm2) Benign Grade 3 Grade 4 Grade 5 Overall Grade group distribution (consensus, in cores) Benign/Other Group 1 Group 2 Group 3 Group 4 Group 5 Overall

maps for each core was constructed by assigning pixels encircled by the contours with the corresponding class. The resulting distribution of the data based on

M

the generated maps is summarized in Table 1.
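A minimal sketch of how such label maps can be rasterized from the annotated contours is given below; the annotation data structure is an assumption made for illustration:

```python
import numpy as np
from skimage.draw import polygon

# Sketch: rasterize one pathologist's contour annotations into a per-pixel
# label map for a core image. Classes: 0 benign, 1-3 for Gleason 3-5;
# pixels outside any contour keep the "unmarked" value.
# `annotations` is an assumed list of (contour_vertices, class_id) pairs,
# where contour_vertices is an (N, 2) array of (row, col) points.

def build_label_map(image_shape, annotations, unmarked=-1):
    label_map = np.full(image_shape[:2], unmarked, dtype=np.int8)
    for vertices, class_id in annotations:
        rr, cc = polygon(vertices[:, 0], vertices[:, 1], shape=image_shape[:2])
        label_map[rr, cc] = class_id
    return label_map
```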

2.4. Extraction of Explicit Multi-Scale Features Our feature extraction is based on patches that are cropped from the digi-

290

ED

tized slide images. As suggested in (Doyle et al., 2012a), histological features

PT

may be meaningful at different resolutions. Thus, we process concentric image  patches at different magnification scales. A data sample, xi = x1i , . . . , xSi ,

therefore comprises the S × M features, xsi ∈ RM , extracted from the sth scale 295

around the center of the ith patch.

CE

On the one hand, the larger the patches are, and the lower the scale is, the

less localized are the features. On the other hand, smaller patches in higher

AC

scales may not contain enough information to allow extraction of descriptive features. Also, the smaller the sampling rate of the patches is, the more repeti-

300

tive is the data set, and the more computation time is required. Thus, the patch size, scales and sampling rate were selected such that most histologic structures (e.g., glands) are captured, while maintaining enough data samples for a fine representation of the annotated labels.

15

ACCEPTED MANUSCRIPT

Table 2: Summary of the proposed features with their most important rank (lower is better), class and scale.

Nucleibased

Imagebased

Importance 18 1 2 4 4 1 22 1 59 2 2 2 1 33 26 10 20 33 4 1 2 3 1

Class Gleason 5 Gleason 3 Benign Benign Gleason 4 Gleason 4 Benign Benign Benign Gleason 4 Gleason 3 Gleason 5 Benign Gleason 3 Gleason 5 Gleason 5 Gleason 4 Gleason 5 Gleason 4 Gleason 4 Gleason 3 Gleason 5 Gleason 3

Scale 20x 10x 40x 40x 40x 20x 20x 10x 10x 20x 10x 40x 40x 10x 10x 40x 20x 10x 20x 10x 40x 40x 40x

Notes mean and std over glands mean and std over glands mean and std over glands mean and std over glands (Lee, et al., 2014) mean and std over glands mean and std over glands

CR IP T

Glandularbased

Description number of glands gland area gland perimeter gland diameter gland eccentricity gland orientation variance lumen to gland area ratio epithelial to gland area ratio number of nuclei nuclei area nuclei distance nuclei coordinates distribution nuclei coordinates distribution spectrum intra-nuclei intensity distribution minimum spanning tree between nuclei complete spatial randomness of nuclei Poisson’s λ intensity map of nuclei Moran’s I autocorrelation of nuclei Ripley’s andBesag's functions of nuclei pair correlation function of nuclei nuclei perimeter Fourier coefficients haematoxylin to eosin pixel ratio SURF descriptors

mean and std over nuclei mean, std, min and max over nuclei entropy, covariance matrix mean, std, covariance, eigenvalues mean and std (Doyle et al., 2012b; Nguyen et al., 2014) one-sided and two-sided p-value see text see text see text see text see text

AN US

Type

proposed novel features, see text

Specifically, we used three scales, 10x, 20x, and 40x, and a patch size of 512×512 pixels (thus, the scales correspond to 512×512 µm2 , 256×256 µm2

M

305

and 128×128 µm2 , respectively). The patches are cropped from the native 40x

ED

resolution of the core image, and lower scales are downsampled into patches of size 512×512 pixels. The patches are sampled in half-patch intervals (at 40x) along each axis.

For feature extraction, we processed each image patch for gland and nu-

PT

310

clei segmentation using methods developed by our group (Rashid et al., 2013; Rashid, 2014) and are briefly summarized here. As a preprocessing step, if the

CE

pixels in a patch have a mean gray intensity value higher than 180 or lower than 120 (in the range of 0-255), we considered this patch to contain mostly glass, image noise or other staining artifact, and assigned it an invalid label. Then,

AC

315

based on their color intensities, the pixels are clustered into different histologic structures, namely, lumen, nuclei, epithelial and stroma (Nguyen et al., 2012). The gland segmentation in each patch is performed by grouping connected

lumen pixels as the initial gland boundary, and iteratively updating this bound320

ary by consolidating surrounding epithelial and nuclei pixels until the boundary 16

ACCEPTED MANUSCRIPT

reaches stroma pixels. The method was reported to achieve 89% recall and 71% precision. The nuclei segmentation in each patch is performed by applying a modified watershed transformation to the gradient image of the patch’s negated

325

CR IP T

‘R’ channel. The method was reported to achieve 84% recall and 90% precision. The parameters for these methods were tuned and evaluated in (Rashid

et al., 2013; Rashid, 2014) based on a dataset of 10 WM slides (8 patients), independent from the datasets used in this study.

A list of the proposed features from each category that were found to be

meaningful, based on a feature importance analysis of our classification experiments, is summarized in Table 2. In total, we extract 276 features per scale.

AN US

330

The extracted features consist of architectural and morphological features that include glandular-based features (their number, contents and shape statistics), and nuclei-based features (number, size, color content, spatial and spatial frequency statistics). Motivation for using these features arose from discussions 335

with pathologists, as well as common features used in previous work (Jothi and

M

Rajam, 2017; Mosquera-Lopez et al., 2015).

Among the new features designed by us are image-based features that are

ED

based on the computation of SURF descriptors (Bay et al., 2008) around the centers of the segmented nuclei in the patch. Similar to SIFT (Lowe, 1999), its 340

predecessor, the SURF algorithm detects salient points in images and character-

PT

ize them in a scale, intensity and rotational invariance using a vector of features, known as a descriptor. Here, we force the salient points to be the nuclei centers,

CE

and employ the mean and standard deviation (STD) vectors of their descriptors as features for our classifier. The motivation is that the SURF descriptors,

345

which are based on the Haar wavelet responses, can capture and discriminate

AC

both the nucleus interior contents and the statistical intensity distribution of its surrounding (depending on the image patch scale). Other extracted nuclei features include the mean, covariance matrix ele-

ments and eigenvalues of the spatial Fourier transform of the nuclei coordinates

350

distribution. In addition, we propose the spectrum of the nuclei perimeter as a novel feature. The distance from the centroid of each nucleus to its perimeter is 17

ACCEPTED MANUSCRIPT

3.5

3

2.5

3

2

[ m]

2.5

1.5 2

θ 1.5

1 -

0.5

- /2

0

/2

-

- /2

0

/2

[rad]

[rad]

(a) Segmented nuclei

CR IP T

1

θ

(b) Perimeter

(c) Spectrum

Figure 3: Example of the computation of the nuclei perimeter and its spectrum. (a) Examples of segmented nuclei. (b) The distance from the center of mass of the nucleus to points on its

AN US

perimeter, as a function of the angle θ (a perfect circular perimeter would yield a constant function). (c) The spectrum of the center-to-perimeter distance.

computed for angles in the range of [−π, π), and the first five coefficients of the Fourier transform of this distance function are recorded (these were found to be sufficient to describe the spectrum, with over 80% of the energy). The mean values of these coefficients over all nuclei in the image patch are used as features.

M

355

The computation of these features is illustrated on an example in Figure 3.
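A compact sketch of the nuclei perimeter spectrum computation is given below, assuming a binary mask of a single segmented nucleus; the angular sampling density is our own choice:

```python
import numpy as np

def perimeter_spectrum(nucleus_mask, n_coeffs=5, n_angles=128):
    """Magnitudes of the first Fourier coefficients of the centroid-to-perimeter
    distance, sampled as a function of the polar angle in [-pi, pi)."""
    rows, cols = np.nonzero(nucleus_mask)
    cy, cx = rows.mean(), cols.mean()

    # Boundary pixels: mask pixels with at least one background 4-neighbour.
    padded = np.pad(nucleus_mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    br, bc = np.nonzero(nucleus_mask & ~interior)

    theta = np.arctan2(br - cy, bc - cx)
    radius = np.hypot(br - cy, bc - cx)

    # Resample the radius onto a uniform angular grid before the FFT.
    order = np.argsort(theta)
    grid = np.linspace(-np.pi, np.pi, n_angles, endpoint=False)
    r_of_theta = np.interp(grid, theta[order], radius[order], period=2 * np.pi)

    return np.abs(np.fft.rfft(r_of_theta))[:n_coeffs]
```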

ED

We introduce new features based on spatial statistics (Ripley, 2005). The motivation is to improve classification of high-grade cancers that do not have well-defined glands and are mainly characterized by the distribution of nuclei. We estimate the Poisson point process intensity (Poisson’s λ), which is a measure

PT

360

of how densely the points are distributed. Given a distribution, the maximum ˆ= likelihood estimator of λ is given by the density λ

n kW k ,

where n is the number

CE

ˆ of points in a region W of area kW k (Sheskin, 2003). We estimate λ(p) on each pixel p in the image patch using its 101 × 101 pixels neighborhood, and use

ˆ over the patch as features. The size of the the mean, median and STD of λ

AC

365

neighborhood was chosen to be large enough to sample sufficiently many nuclei for the intensity estimation, yet small enough to capture its inhomogeneity over the patch (equivalent to choosing the bandwidth of a density estimation window).

370

We compute the Moran’s I statistic (Moran, 1950), which is a second order 18

ACCEPTED MANUSCRIPT

measure of spatial autocorrelation that captures how dispersed or clustered the points are, and take the maximum value of the calculated Moran’s I over the patch as a feature. We also compute Ripley’s K-function K(r), its Besag’s

375

CR IP T

L-function transformation L(r), and the pair correlation function (pcf) pcf (r) (Baddeley et al., 2015), which are based on the average number of nuclei within a search radius r over the image. Specifically, the pcf, as the derivative of the

nuclei with respect to the radius, can be a measure of nuclei “rings” that are associated with glands.

The mean and STD of r − L(r), as well as the number of its zero crossings

and the distance between them were calculated as features. For the pcf, the

AN US

380

magnitude and frequency of its spectrum peaks were used as features. In addition, inspired by the step response characteristics of a second-order control system (Dorf and Bishop, 2011), the approximated damping ratio and natural frequency of the pcf were also computed as features. A border edge correction 385

was used in the computation of the spatial point functions. Figure 4 illustrates

M

the spatial statistics functions for two different high-grade patches.
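As an illustration of the spatial statistics above, a naive estimate of Ripley's K-function and Besag's L-function over nuclei centroids can be sketched as follows; edge correction is omitted here for brevity, whereas the paper applies a border correction:

```python
import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate for nuclei centroids in one image patch.

    points : (n, 2) array of nuclei coordinates
    radii  : 1-D array of search radii r
    area   : area of the observation window (e.g., patch width * height)
    """
    n = len(points)
    d = pdist(points)                  # all pairwise distances
    lam = n / area                     # point-process intensity estimate
    # K(r): average number of neighbours within r, normalized by intensity.
    return np.array([2.0 * np.sum(d <= r) / (n * lam) for r in radii])

def besag_l(points, radii, area):
    """Besag's L-function, L(r) = sqrt(K(r) / pi); r - L(r) is used as a feature."""
    return np.sqrt(ripley_k(points, radii, area) / np.pi)
```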

ED

2.5. Label Probability and Ground Truth Estimation We formulate our multi-scale, multi-class and multi-label training data as

PT

390

D = {(xi , yi )}N i=1 , where N is the total number of samples (image patches)  with features xi , defined in Section 2.4, and yi = yi1 , . . . , yiL comprises the L

experts’ labels associated with the ith patch, with yil ∈ {0, . . . , C − 1}. In our case, we have a set of C = 4 classes, namely, class 0 for benign, and classes

CE

1 − 3 for Gleason grades 3 − 5, respectively. A null class is assigned to unmarked

patches of annotators, so that these patches are still taken into account if other annotators marked it.

AC

395

We convert the multi-class classification into a set of C binary classifications,

Dc = {(xi , y˜i,c )}N i=1 , for c = 0, . . . , C − 1, with   1 if y l = c i l y˜i,c =  0 Otherwise 19

(1)

ACCEPTED MANUSCRIPT

4

1.4 1.2

3

1 2 0.8 1 0.6 0.4 -1

-2

0.2

0

10

20

30

40

50

60

0

0

10

r [ m] 1.4 1.2

3

1 2 0.8 1 0.6 0

30

AN US

0.4

-2

20

40

50

60

40

50

60

r [ m]

4

-1

CR IP T

0

0.2

0

10

20

30

40

50

60

0

0

10

20

r [ m]

(b) r − L(r)

(a) Image patch

30

r [ m]

(c) pcf (r)

Figure 4: Spatial statistics of nuclei. The consensus for the patches in the top and bottom rows was Gleason 4 and Gleason 5, respectively. Features extracted from the L-function and pair correlation function can indicate dispersed or clustered patterns. The dashed lines are

M

the theoretical plots for a complete random distribution.

ED

Following (Raykar et al., 2010), we first initialize the sensitivity and specificity of each annotator l for each class c as αcl = 1, and βcl = 1, ∀c, l, respectively. We consider yˆi to be the unknown ground truth of patch i, and compute the

PT

initial estimate of its posterior probability for each class c based on the original

CE

label distribution,

L

µi,c = P r(ˆ yi = c|Dc , αcl , βcl ) =

1X l y˜i,c . L

(2)

l=1

A MAP estimate of the ground truth label itself for patch i can be then com-

AC

puted by yˆiMAP = arg max0≤c≤C−1 µi,c . Note that the MAP estimate of the

ground truth using the initial µi,c is simply the majority vote label.
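A NumPy sketch of the probability initialization and of a single EM iteration under the notation above is given below; the array layout, handling of unmarked patches, and the small constant added to the denominator are our own illustrative choices:

```python
import numpy as np

def init_label_probs(expert_labels, n_classes):
    """Initial mu[i, c]: fraction of experts assigning class c to patch i (Eq. 2).
    Unmarked patches are assumed to be encoded as -1 and excluded from the count."""
    labels = np.asarray(expert_labels)              # (n_patches, n_experts)
    counts = np.stack([(labels == c).sum(axis=1) for c in range(n_classes)], axis=1)
    valid = (labels >= 0).sum(axis=1, keepdims=True)
    return counts / np.maximum(valid, 1)

def em_update(y_c, p_c, alpha, beta):
    """One EM iteration for a single class c (Eqs. 3-5).

    y_c   : (n_patches, n_experts) binary expert labels for class c
    p_c   : (n_patches,) current classifier (regressor) score for class c
    alpha, beta : (n_experts,) current sensitivity / specificity estimates
    """
    # E-step (Eqs. 3-4): update the posterior probability of the true label.
    a = np.prod(alpha ** y_c * (1 - alpha) ** (1 - y_c), axis=1)
    b = np.prod(beta ** (1 - y_c) * (1 - beta) ** y_c, axis=1)
    mu = a * p_c / (a * p_c + b * (1 - p_c) + 1e-12)
    # M-step (Eq. 5): re-estimate each expert's sensitivity and specificity.
    alpha = (mu[:, None] * y_c).sum(0) / mu.sum()
    beta = ((1 - mu)[:, None] * (1 - y_c)).sum(0) / (1 - mu).sum()
    return mu, alpha, beta
```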

400

For training, rather than using the estimated ground truth as labels, we set

the continuous posterior probabilities µi,c as regression targets, and therefore

train C regressors. The motivation is to provide a prediction of the label probability, from which any estimate (not necessarily MAP) can be deduced, as well 20

ACCEPTED MANUSCRIPT

as to have a confidence measure associated with the predicted labels that is based on the inter-observer agreement. In each iteration of the training process, we update the estimate of the

µi,c =

ai,c pi,c , ai,c pi,c + bi,c (1 − pi,c )

where ai,c =

L Y l l (αcl )y˜i,c (1 − αcl )1−˜yi,c

bi,c =

l=1

(3)

L Y l l (βcl )1−˜yi,c (1 − βcl )y˜i,c , l=1

(4)

and pi,c in our case is taken as the output score of regressor c in the current iteration.

AN US

405

CR IP T

ground truth probability (the E-step) by

We then compute estimates for the sensitivity and specificity of the annota-

αcl

=

i∈training P

l µi,c y˜i,c

i∈training

µi,c

βcl

=

P

M

tors (the M-step) by P

i∈training (1

P

l − µi,c )(1 − y˜i,c )

i∈training (1

− µi,c )

(5)

Typically, the EM algorithm should be run until convergence. However, to shorten the simulation time we set a fixed number of 25 iterations, which was

410

ED

found as a good tradeoff between convergence and running time. 2.6. Classification and Evaluation Methods

PT

After the feature extraction and initial ground truth estimation, we have experimented with different types of regression methods and models to test our

CE

classification framework. The experiments were conducted on subsets of 171 and 60 patients (247 and 86 cores) as the training and testing sets, respectively,

415

that were partitioned to attain a similar intra-set class distribution (stratified

AC

sampling). Specifically, we have evaluated the performance of linear discriminant analysis (LDA), logistic regression (LR), SVM, RF (with different number of trees). From the RF results, we have also estimated the importance of the features for each class based on its corresponding regressor.

420

We also evaluated the performance of a DL approach on the same training and testing subsets. The convolutional neural network (CNN) used in this work 21

ACCEPTED MANUSCRIPT

follows the general architecture of U-net (Ronneberger et al., 2015). We made a few changes to the basic U-net. First, we feed the network with the original patches of size 512×512 pixels as well as patches downsampled by factors of 2 and 4. Second, as it has been suggested by previous studies (Drozdzal et al., 2016),

CR IP T

425

we used short and long skip connections in both contracting and expanding

sections to improve the gradient flow and ease the training. Finally, to reduce

the model size, we forward the features computed at each level of the contracting path to all the coarser levels, similar to the approach suggested in (Huang et al., 430

2016). During training, we used PCA-based color jitter (Krizhevsky et al., 2012)

AN US

and elastic image deformation (Ronneberger et al., 2015) for data augmentation.

We trained the network by minimizing the cross-entropy of its output multi-class probability map and the true label map using Netsterov’s accelerated gradient descent method.

Based on the comparison analysis results, we have evaluated our framework

435

with LR in a thorough leave-one-patient-out cross-validation process over the

M

231 patients. In each experiment, we selected patches from all patients but one as the training set, trained a supervised classifier with the multi-expert

440

ED

probabilistic approach described above, and applied it to the patches of the left-out patient (1–3 cores). The predicted labels of the left-out patches were recorded for validation, and the classification process was repeated for each

PT

patient.

We evaluated the classification performance in several manners. The analysis

CE

was performed on the left-out patches, by aggregating the classifications results 445

of the left-one-out experiments. First, we plotted the receiver operating characteristic (ROC) curves, computed by varying the decision threshold for each

AC

of the labels against the others. For each curve we calculated the area under the curve (AUC), and estimated the optimal threshold and its corresponding accuracy, sensitivity and specificity.

450

In order to assess the agreement level of the automatic classification, in com-

parison to the inter-observer agreement, we used the κ statistic (Cohen’s kappa coefficient) (Sheskin, 2003; Viera et al., 2005). The κ coefficient ranges from 22

ACCEPTED MANUSCRIPT

Table 3: Comparison of the classification results between methods on a leave-out set of patients.

LR

SVM

91.6 94.5 76.0 79.1 80.2 77.1 .49

90.8 91.9 84.5 79.0 75.6 84.7 .52

88.5 90.8 75.5 73.8 69.3 81.6 .45

RF100 89.7 93.0 70.8 78.8 79.2 78.2 .49

RF250 89.7 92.8 72.2 79.4 79.1 79.7 .50

RF500 89.8 92.8 72.8 79.4 79.3 79.6 .50

DL 91.9 92.7 82.7 77.8 79.0 75.2 .37

CR IP T

Cancer detection (benign vs. all grades) Low- vs. High grade (Grade 3 vs. Grades 4, 5) Overall agreement

Accuracy (%) Sensitivity (%) Specificity (%) Accuracy (%) Sensitivity (%) Specificity (%) Unweighted κ

LDA

AN US

−1 to 1. Non-positive values indicate an accidental agreement; positive values

in the intervals 0.1-0.2, 0.21-0.4, 0.41-0.6, 0.61-0.8 and 0.81-1.0 reflect slight, 455

fair, moderate, substantial and (near) perfect agreements, respectively. To be consistent with (Allsbrook et al., 2001b), we computed the unweighted κ for the patch-wise grades of each pathologist compared to the others pathologists, as

M

well as to the automatic classification results. The overall unweighted κ’s were also computed based on the corresponding pathologist or automatic classifier vs. all human pathologists as predictors.

3. Results

ED

460

PT

3.1. TMA Classification

The results of the different classification methods we tested on the leave-out

CE

set of patients are summarized in Table 3, with the best performance achieved 465

by the LR classifier. For the leave-one-patient-out experiment using LR, the patch-wise ROC curves of the automatic classifier, as computed based on the

AC

left-out patches probabilities, are plotted in Figure 5. The weighted mean ROC curve is computed by taking the occurrence of each label as weights, and reflects the overall performance of the classifier with an AUC of 0.85.

470

When considering the MAP estimation, by assigning each patch the label

with the highest probability, the classification accuracy, sensitivity and specificity for cancer detection (benign vs. all grades) are 90.5%, 91.5% and 85.2%, 23

ACCEPTED MANUSCRIPT

Table 4: Inter-observer (unweighted) κ for patch-wise grading on leave-one-out patients (pathologists are not in alphabetical order).

Path. 2

Path. 3

Path. 4

Path. 5

.36 .41 .48 .45 .48 .40 .45

.46 .53 .53 .64 .40 .50

.69 .72 .60 .58 .61

.67 .58 .57 .61

.65 .53 .62

Path. 6

Automatic

CR IP T

Path. 1

.47 .59

.51

AN US

Path. 1 Path. 2 Path. 3 Path. 4 Path. 5 Path. 6 Automatic Overall

respectively. For low- vs. high-grade classification (grade 3 vs. grades 4 and 5) accuracy, sensitivity and specificity are 79.2%, 79.2% and 79.1%, respectively. 475

For comparison, the manual grading by the pathologists in this study leads to inter-observer mean accuracy, sensitivity and specificity of 97.2%, 98.4% and 86.2%, respectively, for cancer detection, and 78.8%, 79.2% and 79.3%, respec-

M

tively, for low- vs. high-grade.

Further inter-observer analysis of the patch-wise agreement using the unweighted κ coefficient is provided in Table 4. The overall agreement κ of each

ED

480

pathologist with the others (last row in the table, not including the last value) ranged between 0.45 to 0.62 (0.56±0.07), while the automatic classifier’s overall

PT

agreement with the pathologists was 0.51 (last value in the table). Examples of cores, as annotated by the pathologists, and their consensus and classification label maps are shown in Figures 6-8.

CE

485

In terms of running times, on a machine with two Intel Xeon E5-2650 v4

CPUs with 12 cores each @ 2.2Ghz and 256GB of RAM, the processing of

AC

each TMA core (∼400 image patches) for feature extraction using a Matlab (MathWorks, Natick, MA, USA) implementation took ∼2 hours using parallel

490

computing (∼14 hours for all 333 cores). Similarly, the training and classification took ∼4.5 hours for the 231 leave-one-patient-out simulations.

24

ACCEPTED MANUSCRIPT

1

0.8 0.7 0.6 0.5 0.4 0.3

Benign, AUC = 0.95 Gleason 3, AUC = 0.85 Gleason 4, AUC = 0.82 Gleason 5, AUC = 0.77 Weighted, AUC = 0.85

0.2 0.1 0

CR IP T

True positive rate (sensitivity)

0.9

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

with the consensus labels.

Pathologist 2

Pathologist 5

Pathologist 6

Pathologist 3

M

Pathologist 1

AN US

Figure 5: The ROC curves of the LR classifier evaluated on left-one-out patients in comparison

Automatic

PT

ED

Consensus

Pathologist 4

Figure 6: Example of the TMA classification results. The automatic classifier is in good

CE

agreement with the consensus and captures the detailed grade map.

3.2. Feature Importance

AC

Each feature was ranked by its importance based on the RF experiments,

with ranks one and 276 being the highest and lowest importance, respectively.

495

Since our framework consists of a multi-class and multi-scale regression, the features are ranked by multiple importance values, and Table 2 summarizes the highest importance achieved by each group of features, with the corresponding class and scale it was attained at. Below we provide a summary of the 25

ACCEPTED MANUSCRIPT

Pathologist 3

Pathologist 5

Pathologist 6

Consensus

Pathologist 4

CR IP T

Pathologist 2

Automatic

AN US

Pathologist 1

Figure 7: Example of the TMA classification results. In this example, the automatic classifier has upgraded the consensus grade.

Pathologist 2

Pathologist 5

Pathologist 6

Pathologist 3

Consensus

Pathologist 4

Automatic

PT

ED

M

Pathologist 1

Figure 8: Example of the TMA classification results. Here, the automatic classifier has

CE

downgraded the consensus grade, in agreement with some pathologists (when there is a tie, the consensus is taken as the higher grade).

AC

importance results for the novel features proposed in this paper.

500

The highest importance for the SURF descriptors’ mean was ranged 2–169

(mean±STD of 34±32) for differentiating mostly benign at 10x, while the descriptors’s STD was ranged 1–69 (18±16) for mostly Gleason grade 4 at 20x.

The nuclei distribution spectrum mean was ranked first for classifying benign

26

ACCEPTED MANUSCRIPT

tissue at 20x, its covariance matrix diagonal elements were ranked second and 505

fourth for classifying Gleason 5 at 40x, while the covariance matrix eigenvalues were ranked third and 30 for classifying benign at 40x and Gleason 3 at 20x,

CR IP T

respectively. The perimeter spectrum importance was ranged between 2–59, with its fourth coefficient ranked the highest in classification of Gleason grade 3 at 40x.

For the nuclei spatial statistics based features, Poisson’s λ mean and STD

510

were ranked 52 and 20, in classification of Gleason 5 at 40x and Gleason 4 at

20x, respectively. Moran’s I maximum value was ranked 33 in classification of

AN US

Gleason 5 at 10x. The mean and STD of the Ripley’s/Besag’s-based function,

r −L(r) were ranked 32 and 7, respectively, both in classification of Gleason 3 at 515

40x. The number of its zero crossings and their intermediate distances reached ranks 38 and 4, for benign and Gleason 4 at 20x, respectively. The damping ratio, and natural frequency of the pcf ranked 7 and 5, respectively, in classification of benign at 40x, while its spectrum peak magnitudes and frequencies

respectively.

ED

520

M

best ranks were first and second in classification of Gleason 4 and benign at 10x,

3.3. Whole-Mount Classification using TMA-trained Classifier Here, we applied the classifier that was trained exclusively on the TMA

PT

dataset to a WM dataset from another institution in order to evaluate its translation across laboratories. We have employed a similar patch-wise multi-scale 525

feature extraction to the WM slides. Due to the lower magnification of the WM

CE

dataset compared to the TMA dataset, we extracted image patches within the annotated regions and processed them in two scales (20x and 10x). Due to the

AC

significant amount of benign tissue compared to cancer in WM slides, so as to expedite the processing, we extracted features from a random selection of 10%

530

of the benign patches in each slide. We then employed a TMA-based classifier to classify the WM patches. The

classifier was trained on all TMA cores while using only the features that correspond to the two relevant scales in order to be compatible with the WM data. 27

ACCEPTED MANUSCRIPT

Grade 5

Grade 4

CR IP T

Grade 3

Benign

Invalid

(a) True

(b) Automatic

AN US

Figure 9: An example of a WM slide classified by the TMA trained classifier. 1 0.9

0.7 0.6 0.5 0.4

M

True positive rate (sensitivity)

0.8

0.3 0.2

Benign, AUC = 0.85 Gleason 3, AUC = 0.63 Gleason 4, AUC = 0.81 Gleason 5, AUC = 0.85 Weighted, AUC = 0.75

ED

0.1 0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

PT

Figure 10: The ROC curves of the TMA trained classifier evaluated on the WM slides.

An example of a classified slide (with all of its patches processed for illustration) is shown in Figure 9. The overall patch-wise ROC curves are plotted in

CE

535

Figure 10. The resulting weighted mean ROC curve has an AUC of 0.75.

4. Discussion

The results on the TMA dataset suggest that the classifier's performance in

detecting cancer (90.5% accuracy) is on par with the most comparable works

described in Section 1.2, while its performance in differentiating between low- and high-grade cancer (79.2% accuracy) is lower than the results reported by some comparable works (e.g., 81.0% by Tabesh et al. (2007), and 85% by Gorelick et al. (2013)). We note, however, that most previous works have trained and evaluated their classifiers on datasets on the order of ∼100 image patches sampled from a single slide and/or patient. Such datasets are likely to be insufficient for capturing the heterogeneity in histological patterns of PCa across populations. Indeed, the performances of the works reviewed in (Mosquera-

Lopez et al., 2015) seem to be inversely correlated with the size of the data. In

contrast, our multi-class classifier for benign and Gleason grades 3–5 patterns was developed based on a dataset of over 16,000 patch samples (not includ-

ing the half-patch and multi-scale sampling) from different slides and over 200 patients, and was evaluated in a leave-patients-out cross-validation.
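The leave-patients-out protocol groups patches by patient so that no patient contributes to both the training and test folds; the sketch below illustrates such a split with scikit-learn's GroupKFold on dummy data, and is not a reproduction of our experimental setup.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1600, 30))               # dummy patch features
y = rng.integers(0, 4, size=1600)             # dummy grade labels (benign, G3, G4, G5)
patient_id = rng.integers(0, 200, size=1600)  # patient of origin for each patch

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=patient_id):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))   # patches of held-out patients only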

We also note that, given the inter-observer grading variability of PCa, a perfect classifier that was trained and evaluated by a single pathologist is expected to perform only at the mean inter-observer agreement rate when evaluated against another pathologist. Thus, even when evaluated in a cross-validation

manner, the classification results of previous works are likely to overfit the training data to some extent. In our study, the classifier was trained to predict the multi-label probabilities of each grade class based on six pathologists, and its overall agreement with the pathologists (unweighted κ of 0.51) fits within the range of agreement between the pathologists. The inter-observer agreements

between the pathologists (unweighted κ of 0.56±0.07) were also consistent with the levels reported in the literature (De La Taille et al., 2003), despite the small amount of tissue in each TMA core (in practice, the sample is larger and at least three levels are provided).
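Patch-wise agreement of this kind can be quantified per pathologist pair with Cohen's unweighted κ, as in the short sketch below; the label vectors are dummy placeholders for illustration.

import numpy as np
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Dummy patch-wise labels from six pathologists (0 = benign, 1-3 = Gleason 3-5)
labels = {f"pathologist_{i}": rng.integers(0, 4, size=500) for i in range(6)}

kappas = [cohen_kappa_score(labels[a], labels[b])          # unweighted kappa per pair
          for a, b in combinations(labels, 2)]
print(f"kappa = {np.mean(kappas):.2f} +/- {np.std(kappas):.2f}")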

In general, the more annotations by more pathologists we add to our data, the better and more robust our classifier is expected to perform. Information regarding the annotation quality, e.g., based on experience, can be utilized through the probabilistic framework by assigning a prior probability to each expert. Such priors can be determined based on a screening questionnaire or a preliminary evaluation on examples prior to the annotation task. Annotation variability may also be associated with different grading guidelines between

centers or settings (e.g., research vs. clinical), rather than experience. Therefore, the framework also allows tuning the classifier to a specific center by changing

the priors in favour of annotators from that center.
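As an illustration only (this is one simple possibility, not necessarily the exact update used in our probabilistic framework), per-expert priors could enter the label probabilities as normalized weights on each expert's annotation of a patch:

import numpy as np

def weighted_label_probabilities(expert_labels, expert_priors, n_classes=4):
    """Combine one-hot expert labels into per-class probabilities using expert priors.

    expert_labels : dict mapping expert name -> class index assigned to the patch
    expert_priors : dict mapping expert name -> prior weight (e.g., from experience)
    """
    w = np.zeros(n_classes)
    for expert, label in expert_labels.items():
        w[label] += expert_priors.get(expert, 1.0)
    return w / w.sum()

# Hypothetical patch annotated by three experts with unequal priors
probs = weighted_label_probabilities({"A": 1, "B": 1, "C": 2},
                                     {"A": 1.0, "B": 0.8, "C": 0.5})
# -> higher probability for class 1 (e.g., Gleason 3) than for class 2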


The motivation for not having a consensus meeting among the pathologists prior to their annotation of the cores was to expose the classifier to different grading “doctrines,” and therefore to increase its robustness to datasets

acquired, e.g., by different centers. However, a consensus meeting might have been useful in some ambiguous cases for which pathologists would normally look

into additional information. For example, regions with high-grade PIN, which

is usually confirmed with immunohistochemistry staining, were marked by some of the pathologists as benign or left unmarked, while others assumed the “worst case scenario” and marked them as cancer.

The TMA dataset provided consistent processing of the samples and a diversity of patients. However, this may have hindered the generality of the classifier, since there are typically differences between laboratories in,

e.g., preparation, staining and scanning of histology slides (Yagi, 2011), which could be a reason for the lower performance of the classifier on the WM dataset that was completely excluded from the training phase. In the future, we in-

tend to apply stain normalization (Bautista et al., 2014; Bejnordi et al., 2016; Janowczyk et al., 2017; Khan et al., 2014), data and color augmentation, or neural network approaches (Lafarge et al., 2017) to improve the generalization of the classifier to staining differences and other artifacts of histological processing. In addition, although we assumed that skewing the sampling of the 333 TMA

cores towards higher grades would also produce low-grade and benign patches (since they are present in many high-grade cores), there were still far fewer benign

(and grade 5) patches in the training set than grades 3 and 4 patches. Training on more benign patches is therefore expected to increase the classification

performance on the WM slides. Due to the large amount of benign tissue in WM slides (compared to cancer), it was assumed that the randomly sampled 10% of benign patches provided an adequate representation. However, for a full analysis, processing of the entire benign tissue on the WM slides is required. We also

note that while four pathologists annotated the WM data, they did not overlap, and only jointly discussed and reached a consensus regarding regions for which the assigned pathologist had doubt (Iczkowski et al., 2011). Thus, the

performance on the WM data was effectively evaluated against one pathologist, and therefore the results may suffer from the overfitting issues discussed above.

The feature-based classification performance was found to be similar across the different methods we tested, with the best results in terms of overall agreement achieved by LR and the worst by SVM. We note that we did not tune parameters for these methods, and the classification performance may be im-

proved by using, e.g., kernel-based methods and regularization (Ginsburg et al., 2016). The general performance may be further enhanced by improving the accuracy of the extracted features, e.g., by updating the gland and nuclei segmentation methods, which is an active field of research (Irshad et al., 2014). We note that, as suggested by Nguyen et al. (2014), even a perfect segmentation of every gland and nucleus may not dramatically improve the grading performance.

In our multi-class, multi-scale approach, we found most of the features we used to be important, at least in distinguishing one class at a certain scale.

Specifically, the newly proposed features based on spatial point statistics, shape spectra, and SURF descriptors of the nuclei were found to be ranked highly. While the combination of features therefore seems to contribute to the overall performance, a dynamic feature selection could be performed for each regressor based on the corresponding importance, such that different features at different scales are used for each classification task in order to optimize computational efficiency and avoid potential overfitting.
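One way such an importance-driven selection could be realized is sketched below, keeping only features whose random-forest importance exceeds the mean importance for a given one-vs-rest task; the data, threshold, and choice of selector are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))            # dummy multi-scale feature matrix
y = (rng.random(1000) < 0.3).astype(int)    # dummy one-vs-rest target (e.g., Gleason 4 vs. rest)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
selector = SelectFromModel(forest, threshold="mean", prefit=True)   # keep above-mean importance
X_reduced = selector.transform(X)           # reduced feature set for this task and scale
ranking = np.argsort(forest.feature_importances_)[::-1]             # per-feature importance ranks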

The feature-based methods were found to perform better than the DL approach we tested. However, we expect the DL performance to improve with

additional training data, which will also allow the training of deeper architectures. Still, since handcrafted features have the advantage of providing clinicians with insight into the automatic decision process, the lack of feature interpretation in DL, even if performing well, may hinder the clinical deployment of CAD systems (Niazi et al., 2017; Shen et al., 2017). Thus, as future work, we plan to

investigate the correlation of DL-based features with the handcrafted ones and employ a hybrid approach similar to those of Cruz-Roa et al. (2013), Rezaeilouyeh et al. (2016), and Zhou et al. (2017).

We note that our performance analysis was carried out by comparing the labels of corresponding pairs of image patches in order to provide a detailed classification of the tissue. Based on the primary and secondary grades, a single score could be assigned to the entire TMA core or biopsy core (or WM slide), as typically done for clinical purposes. The score can be estimated by setting thresholds on the number of patches required to constitute the primary and sec-

ondary grades. In such a case, the classification results are expected to improve, since the core-wise scores would be less “noisy” than the raw patch-wise grades. Also, to measure scoring agreement, a weighted κ coefficient can be used so as to mitigate grading differences, e.g., within ±1 Gleason grade, since these disagreements between neighbouring scores are often not clinically important.
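A minimal sketch of this core-level scoring and weighted-κ agreement is given below; the patch-count threshold, grade encoding, and choice of a linear-weighted κ are illustrative assumptions rather than a validated scoring rule.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def core_score(patch_grades, min_patches=5):
    """Derive a Gleason score (primary + secondary) from patch-wise grades 3-5 (0 = benign)."""
    counts = {g: int(np.sum(patch_grades == g)) for g in (3, 4, 5)}
    graded = [g for g, c in sorted(counts.items(), key=lambda kv: -kv[1]) if c >= min_patches]
    if not graded:
        return 0                               # benign core
    primary = graded[0]
    secondary = graded[1] if len(graded) > 1 else primary
    return primary + secondary

# Dummy example: per-core scores from two observers, compared with a weighted kappa
rng = np.random.default_rng(0)
cores_a = [core_score(rng.choice([0, 3, 4, 5], size=100)) for _ in range(50)]
cores_b = [core_score(rng.choice([0, 3, 4, 5], size=100)) for _ in range(50)]
kappa_w = cohen_kappa_score(cores_a, cores_b, weights="linear")   # mitigates +/-1 score differences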

The developed PathMarker app’s interface on the tablet was received positively by the pathologists involved in our study and allowed for quick and


intuitive, yet detailed annotations. To make the tool available for use by others,

we intend to add more features and resolve technical issues, such as the current downsampling of the images to 10x magnification. While PCa grading can typically be performed at 4x, and in certain instances at 10x magnification (Gordetsky and Epstein, 2016), and the pathologists confirmed that the image resolution in the app was sufficient for grading, it might still be advantageous to allow for higher

magnifications to avoid any potential impact on the inter-observer variability. We also intend to incorporate the app as part of the CAD system to “close the loop” by presenting pathologists with the suggested automatic grading annotations in real-time. Such a feature will also be useful in training and teaching

pathology residents, for whom digital pathology has already been shown to be an effective educational tool (Pantanowitz et al., 2012). Since histologic patterns that yield the same Gleason grade are heterogeneous,

explicit classification of patterns, such as those described in (Iczkowski et al., 2011; McKenney et al., 2016), from which the Gleason grades can be deduced

and with which disease outcomes may be better correlated, is likely to provide a more consistent classifier. We therefore plan to include additional classes of patterns for developing such a classifier. This, however, will require increasing

the training dataset. In addition, some tissue may contain a mixture of two or more intermixed patterns that cannot be annotated as separate regions.

Pathologists would then typically assign an estimate of the percentage of the

patterns to the entire region. Another advantage of the probabilistic approach is that it inherently supports such cases by simply allowing the experts’ labels to take real values.
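For example, a region estimated as an intermixed pattern could contribute a fractional label vector instead of a one-hot one, as in this brief illustrative sketch (class order and percentages are hypothetical):

import numpy as np

CLASSES = ["benign", "gleason3", "gleason4", "gleason5"]

def soft_label(percentages):
    """Convert a pathologist's percentage estimate of intermixed patterns into a label vector."""
    vec = np.array([percentages.get(c, 0.0) for c in CLASSES], dtype=float)
    return vec / vec.sum()

# A region estimated as ~60% Gleason 3 and ~40% Gleason 4
print(soft_label({"gleason3": 60, "gleason4": 40}))   # -> [0.  0.6 0.4 0. ]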

We are also looking into classification of our features with respect to clinical

outcomes, e.g., biochemical recurrence. The classifier may allow a less subjective prediction of outcomes than the Gleason score, as it is based on a multitude of extracted features, many of which are not explicitly noted by pathologists. However, the single level and small amount of tissue in the TMA cores may not

have been sampled from the lesions that led to a poor outcome. Biopsy or

WM datasets, in which the samples are larger and more levels are available, are

more suitable for such a study.

5. Conclusion


We developed a CAD system for automatic grading of PCa in digitized histopathology images using supervised learning methods. We proposed novel features that include spatial inter-nuclei statistics and intra-nuclei properties,

which were found to be important at different scales, mainly for distinguishing high-grade PCa patterns. Our approach was evaluated in leave-patients-out experiments, with consistent results across different classification methods. The classifiers were trained to predict the probabilities of each grade class based on

labels provided by multiple pathologists, and their overall agreement with the pathologists was found to be consistent with the agreement levels reported in the literature. We consider the utilization of annotated data by multiple pathologists in a

probabilistic framework as the main contribution of this work. The approach tackles the inter-observer variability in PCa grading and can lead to consensus-based training that improves both classification and feedback to pathologists,

as more annotations are accumulated. The CAD system can eventually be translated to the clinic as a tool for pathologists, allowing rapid automatic screening of vast amounts of tissue and producing quantitative measures based on the entire tissue, e.g., for determining the pathological stage of postoperative specimens.

Acknowledgments

We would like to thank Dr. Martin Gleave and his group at the Vancouver Prostate Centre for the preparation, staining and scanning of the TMA data. This work was funded by a Prostate Cancer Canada Grant #D2016-1352, by the CIHR Grant #MOP-142439, and by Professor Salcudean's C.A. Laszlo Chair. Guy Nir is a recipient of a Prostate Cancer Canada Post-Doctoral Research Fellowship Award #PDF2016-1338.

References

Allsbrook, W.C., Mangold, K.A., Johnson, M.H., Lane, R.B., Lane, C.G., Amin, M.B., Bostwick, D.G., Humphrey, P.A., Jones, E.C., Reuter, V.E., et al., 2001a. Interobserver reproducibility of Gleason grading of prostatic carcinoma: urologic pathologists. Human Pathology 32, 74–80.

Allsbrook, W.C., Mangold, K.A., Johnson, M.H., Lane, R.B., Lane, C.G., Epstein, J.I., 2001b. Interobserver reproducibility of Gleason grading of prostatic carcinoma: general pathologists. Human Pathology 32, 81–88.

Arevalo, J., Cruz-Roa, A., et al., 2014. Histopathology image representation for automatic analysis: a state-of-the-art review. Revista Med 22, 79–91.

Baddeley, A., Rubak, E., Turner, R., 2015. Spatial point patterns: methodology and applications with R. CRC Press.

Bautista, P.A., Hashimoto, N., Yagi, Y., et al., 2014. Color standardization in whole slide imaging using a color calibration slide. Journal of Pathology Informatics 5, 4.

Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359.

Bejnordi, B.E., Litjens, G., Timofeeva, N., Otte-Höller, I., Homeyer, A., Karssemeijer, N., van der Laak, J.A., 2016. Stain specific standardization of whole-slide histopathological images. IEEE Transactions on Medical Imaging 35, 404–415.

Berney, D.M., Algaba, F., Camparo, P., Compérat, E., Griffiths, D., Kristiansen, G., Lopez-Beltran, A., Montironi, R., Varma, M., Egevad, L., 2014. The reasons behind variation in Gleason grading of prostatic biopsies: areas of agreement and misconception among 266 European pathologists. Histopathology 64, 405–411.

Bhargava, R., Madabhushi, A., 2016. Emerging themes in image informatics and molecular analysis for digital pathology. Annual Review of Biomedical Engineering 18, 387–412.

Bishop, C.M., 2006. Pattern recognition and machine learning. Springer-Verlag New York.

Bova, G.S., Parmigiani, G., Epstein, J.I., Wheeler, T., Mucci, N.R., Rubin, M.A., 2001. Web-based tissue microarray image data analysis: initial validation testing through prostate cancer Gleason grading. Human Pathology 32, 417–427.

Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

Chatelain, P., Pauly, O., Peter, L., Ahmadi, S.A., Plate, A., Bötzel, K., Navab, N., 2013. Learning from multiple experts with random forests: Application to the segmentation of the midbrain in 3D ultrasound, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 230–237.

Chen, H., Qi, X., Yu, L., Heng, P.A., 2016. DCAN: Deep contour-aware networks for accurate gland segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2487–2496.

Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J., 2013. Mitosis detection in breast cancer histology images with deep neural networks, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 411–418.

Cruz-Roa, A.A., Ovalle, J.E.A., Madabhushi, A., Osorio, F.A.G., 2013. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 403–410.

De La Taille, A., Viellefond, A., Berger, N., Boucher, E., De Fromont, M., Fondimare, A., Molinié, V., Piron, D., Sibony, M., Staroz, F., et al., 2003. Evaluation of the interobserver reproducibility of Gleason grading of prostatic adenocarcinoma using tissue microarrays. Human Pathology 34, 444–449.

Dekel, O., Shamir, O., 2009. Vox populi: Collecting high-quality labels from a crowd, in: Proceedings of the 22nd Annual Conference on Learning Theory.

Dorf, R.C., Bishop, R.H., 2011. Modern control systems. Pearson.

Doyle, S., Feldman, M., Tomaszewski, J., Madabhushi, A., 2012a. A boosted Bayesian multiresolution classifier for prostate cancer detection from digitized needle biopsies. IEEE Transactions on Biomedical Engineering 59, 1205–1218.

Doyle, S., Feldman, M.D., Shih, N., Tomaszewski, J., Madabhushi, A., 2012b. Cascaded discrimination of normal, abnormal, and confounder classes in histopathology: Gleason grading of prostate cancer. BMC Bioinformatics 13, 282.

Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The importance of skip connections in biomedical image segmentation, in: Deep Learning and Data Labeling for Medical Applications. Springer, pp. 179–187.

Epstein, J.I., Allsbrook Jr, W.C., Amin, M.B., Egevad, L.L., Committee, I.G., et al., 2005. The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma. The American Journal of Surgical Pathology 29, 1228–1242.

Epstein, J.I., Egevad, L., Amin, M.B., Delahunt, B., Srigley, J.R., Humphrey, P.A., Committee, G., et al., 2016a. The 2014 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma: definition of grading patterns and proposal for a new grading system. The American Journal of Surgical Pathology 40, 244–252.

Epstein, J.I., Zelefsky, M.J., Sjoberg, D.D., Nelson, J.B., Egevad, L., Magi-Galluzzi, C., Vickers, A.J., Parwani, A.V., Reuter, V.E., Fine, S.W., et al., 2016b. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. European Urology 69, 428–435.

Ginsburg, S.B., Lee, G., Ali, S., Madabhushi, A., 2016. Feature importance in nonlinear embeddings (FINE): applications in digital pathology. IEEE Transactions on Medical Imaging 35, 76–88.

Gleason, D.F., 1966. Classification of prostatic carcinomas. Cancer Chemotherapy Reports 50, 125–128.

Gordetsky, J., Epstein, J., 2016. Grading of prostatic adenocarcinoma: current state and prognostic implications. Diagnostic Pathology 11, 25.

Gorelick, L., Veksler, O., Gaed, M., Gómez, J.A., Moussa, M., Bauman, G., Fenster, A., Ward, A.D., 2013. Prostate histopathology: Learning tissue component histograms for cancer detection and classification. IEEE Transactions on Medical Imaging 32, 1804–1818.

Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L., 2016. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.

Iczkowski, K.A., Torkko, K.C., Kotnis, G.R., Storey Wilson, R., Huang, W., Wheeler, T.M., Abeyta, A.M., La Rosa, F.G., Cook, S., Werahera, P.N., et al., 2011. Digital quantification of five high-grade prostate cancer patterns, including the cribriform pattern, and their association with adverse outcome. American Journal of Clinical Pathology 136, 98–107.

Ipeirotis, P.G., Provost, F., Wang, J., 2010. Quality management on Amazon Mechanical Turk, in: Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM. pp. 64–67.

Irshad, H., Veillard, A., Roux, L., Racoceanu, D., 2014. Methods for nuclei detection, segmentation, and classification in digital histopathology: a review–current status and future potential. IEEE Reviews in Biomedical Engineering 7, 97–114.

Janowczyk, A., Basavanhally, A., Madabhushi, A., 2017. Stain normalization using sparse autoencoders (StaNoSA): Application to digital pathology. Computerized Medical Imaging and Graphics 57, 50–61.

Janowczyk, A., Madabhushi, A., 2016. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics 7.

Jothi, J.A.A., Rajam, V.M.A., 2017. A survey on automated cancer diagnosis from histopathology images. Artificial Intelligence Review 48, 31–81.

Källén, H., Molin, J., Heyden, A., Lundström, C., Åström, K., 2016. Towards grading Gleason score using generically trained deep convolutional neural networks, in: Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on, IEEE. pp. 1163–1167.

Khan, A.M., Rajpoot, N., Treanor, D., Magee, D., 2014. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Transactions on Biomedical Engineering 61, 1729–1738.

Kononen, J., Bubendorf, L., Kallionimeni, A., Bärlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G., Kallionimeni, O.P., 1998. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Medicine 4, 844–847.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097–1105.

Lafarge, M.W., Pluim, J.P., Eppenhof, K.A., Moeskops, P., Veta, M., 2017. Domain-adversarial neural networks to address the appearance variability of histopathology images, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 83–91.

Lee, G., Sparks, R., Ali, S., Shih, N.N., Feldman, M.D., Spangler, E., Rebbeck, T., Tomaszewski, J.E., Madabhushi, A., 2014. Co-occurring gland angularity in localized subgraphs: predicting biochemical recurrence in intermediate-risk prostate cancer patients. PLoS ONE 9, e97954.

Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88.

Litjens, G., Sánchez, C.I., Timofeeva, N., Hermsen, M., Nagtegaal, I., Kovacs, I., Hulsbergen-Van De Kaa, C., Bult, P., Van Ginneken, B., Van Der Laak, J., 2016. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific Reports 6, 26286.

Lloyd, S., 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 129–137.

Lowe, D.G., 1999. Object recognition from local scale-invariant features, in: Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, IEEE. pp. 1150–1157.

Madabhushi, A., Lee, G., 2016. Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Analysis 33, 170–175.

McKenney, J.K., Wei, W., Hawley, S., Auman, H., Newcomb, L.F., Boyer, H.D., Fazli, L., Simko, J., Hurtado-Coll, A., Troyer, D.A., et al., 2016. Histologic grading of prostatic adenocarcinoma can be further optimized: analysis of the relative prognostic strength of individual architectural patterns in 1275 patients from the Canary retrospective cohort. The American Journal of Surgical Pathology 40, 1439–1456.

Moran, P.A., 1950. Notes on continuous stochastic phenomena. Biometrika 37, 17–23.

Mosquera-Lopez, C., Agaian, S., Velez-Hoyos, A., Thompson, I., 2015. Computer-aided prostate cancer diagnosis from digitized histopathology: a review on texture-based systems. IEEE Reviews in Biomedical Engineering 8, 98–113.

Nguyen, K., Sarkar, A., Jain, A.K., 2012. Structure and context in prostatic gland segmentation and classification, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 115–123.

Nguyen, K., Sarkar, A., Jain, A.K., 2014. Prostate cancer grading: Use of graph cut and spatial arrangement of nuclei. IEEE Transactions on Medical Imaging 33, 2254–2270.

Niazi, M.K.K., Yao, K., Zynger, D., Clinton, S., Chen, J., Koyuturk, M., LaFramboise, T., Gurcan, M., 2017. Visually meaningful histopathological features for automatic grading of prostate cancer. IEEE Journal of Biomedical and Health Informatics 21, 1027–1038.

Pantanowitz, L., 2010. Digital images and the future of digital pathology. Journal of Pathology Informatics 1.

Pantanowitz, L., Szymas, J., Yagi, Y., Wilbur, D., 2012. Whole slide imaging for educational purposes. Journal of Pathology Informatics 3.

Rashid, S., 2014. Automatic pathology of prostate cancer in whole mount slides incorporating individual gland classification. Ph.D. thesis. University of British Columbia.

Rashid, S., Fazli, L., Boag, A., Siemens, R., Abolmaesumi, P., Salcudean, S.E., 2013. Separation of benign and malignant glands in prostatic adenocarcinoma, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 461–468.

Raykar, V.C., Yu, S., Zhao, L.H., Valadez, G.H., Florin, C., Bogoni, L., Moy, L., 2010. Learning from crowds. Journal of Machine Learning Research 11, 1297–1322.

Rezaeilouyeh, H., Mollahosseini, A., Mahoor, M.H., 2016. Microscopic medical image classification framework via deep learning and shearlet transform. Journal of Medical Imaging 3, 044501.

Ripley, B.D., 2005. Spatial statistics. Volume 575. John Wiley & Sons.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.

Shen, D., Wu, G., Suk, H.I., 2017. Deep learning in medical image analysis. Annual Review of Biomedical Engineering.

Sheskin, D.J., 2003. Handbook of parametric and nonparametric statistical procedures. CRC Press.

Siegel, R.L., Miller, K.D., Jemal, A., 2017. Cancer statistics, 2017. CA: A Cancer Journal for Clinicians 67, 7–30.

Tabesh, A., Teverovskiy, M., Pang, H.Y., Kumar, V.P., Verbel, D., Kotsianti, A., Saidi, O., 2007. Multifeature prostate cancer diagnosis and Gleason grading of histological images. IEEE Transactions on Medical Imaging 26, 1366–1378.

Tsoumakas, G., Katakis, I., Vlahavas, I., 2009. Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook. Springer, pp. 667–685.

Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification. Machine Learning: ECML 2007, 406–417.

Veta, M., Van Diest, P.J., Willems, S.M., Wang, H., Madabhushi, A., Cruz-Roa, A., Gonzalez, F., Larsen, A.B., Vestergaard, J.S., Dahl, A.B., et al., 2015. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Medical Image Analysis 20, 237–248.

Viera, A.J., Garrett, J.M., et al., 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine 37, 360–363.

Warfield, S., Zou, K., Wells, W., 2004. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging 23, 903–921.

Xu, Y., Li, Y., Wang, Y., Liu, M., Fan, Y., Lai, M., Chang, E., 2017. Gland instance segmentation using deep multichannel neural networks. IEEE Transactions on Biomedical Engineering.

Yagi, Y., 2011. Color standardization and optimization in whole slide imaging. Diagnostic Pathology 6, S15.

Zhang, M.L., Zhou, Z.H., 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 1819–1837.

Zhou, N., Fedorov, A., Fennessy, F., Kikinis, R., Gao, Y., 2017. Large scale digital prostate pathology image analysis combining feature extraction and deep neural network. arXiv preprint arXiv:1705.02678.