Voxel-Wise Logistic Regression and Leave-One-Source-Out Cross Validation for white matter hyperintensity segmentation





Jesse Knight, Graham W. Taylor, April Khademi

PII: S0730-725X(18)30225-X
DOI: doi:10.1016/j.mri.2018.06.009
Reference: MRI 8982

To appear in: Magnetic Resonance Imaging

Received date: 12 April 2018
Revised date: 11 June 2018
Accepted date: 13 June 2018

Please cite this article as: Jesse Knight, Graham W. Taylor, April Khademi, Voxel-Wise Logistic Regression and Leave-One-Source-Out Cross Validation for white matter hyperintensity segmentation. Magnetic Resonance Imaging (2018), doi:10.1016/j.mri.2018.06.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Voxel-Wise Logistic Regression and Leave-One-Source-Out Cross Validation for White Matter Hyperintensity Segmentation

Jesse Knight a,∗, Graham W. Taylor a,c, April Khademi b

a University of Guelph, 50 Stone Rd E, Guelph, Canada
b Ryerson University, 350 Victoria St, Toronto, Canada
c Vector Institute, 101 College Street, Suite HL30B, Toronto, Canada

Abstract

Many algorithms have been proposed for automated segmentation of white matter hyperintensities (WMH) in brain MRI. Yet, broad uptake of any particular algorithm has not been observed. In this work, we argue that this may be due to variable and suboptimal validation data and frameworks, precluding direct comparison of methods on heterogeneous data. As a solution, we present Leave-One-Source-Out Cross Validation (LOSO-CV), which leverages all available data for performance estimation, and show that this gives more realistic (lower) estimates of segmentation algorithm performance on data from different scanners. We also develop a FLAIR-only WMH segmentation algorithm: Voxel-Wise Logistic Regression (VLR), inspired by the open-source Lesion Prediction Algorithm (LPA). Our variant facilitates more accurate parameter estimation, and permits intuitive interpretation of model parameters. We illustrate the performance of the VLR algorithm using the LOSO-CV framework with a dataset comprising freely available data from several recent competitions (96 images from 7 scanners). The performance of the VLR algorithm (median Similarity Index of 0.69) is compared to its LPA predecessor (0.58), and the results of the VLR algorithm in the 2017 WMH Segmentation Competition are also presented.

Keywords: Brain, Fluid-attenuated inversion recovery, White matter hyperintensities, Multiple sclerosis, Segmentation, Logistic regression

∗ Corresponding author. Email address: [email protected] (Jesse Knight)
Abbreviations: WMH, white matter hyperintensity; FLAIR, fluid-attenuated inversion recovery; MS, multiple sclerosis; CV, cross validation; LOSO, leave-one-source-out; LL, lesion load.

Preprint submitted to Magnetic Resonance Imaging, June 14, 2018


1. Introduction

Several white matter pathologies present similarly on T2-weighted MRI, including Multiple Sclerosis (MS) plaques and white matter lesions of presumed vascular origin. These so-called white matter hyperintensities (WMH)¹ manifest due to some combination of local macroscopic tissue structure erosion and increased water content due to inflammation [1, 2]. Presentation and progression of MS is highly variable, with demyelination injury possibly precipitated by autoimmune activation or viral factors [3, 4, 5]. Inflammatory edema contributes to initial T2 hyperintensity, while subsequent demyelination injury and axonal loss yield persistent T2 lesions [5, 6]. WMH are also implicated in small vessel disease leading to dementia, where focal lesions are thought to derive from chronic or acute ischemic injury [7]. As in MS, exact etiologies are unclear and likely heterogeneous, but correlations with cognitive decline and dementia are strong [8, 9, 10].

1.1. Motivation

MR imaging of WMH plays several roles in management of MS and dementia, including diagnosis [11, 12] and evaluation of new treatment efficacies [9, 13, 14, 15]. In a 2010 meta-analysis, WMH were also associated with significantly increased risk of stroke and dementia [8].

Typically, analysis of WMH on MRI is performed manually, using specific criteria [11, 12], visual scales [16], or manual delineation [17]. Manual delineation is most informative, providing both volumetric lesion load (LL) and exact spatial distribution of lesions. However, this approach is laborious, and is subject to large inter- and intra-rater variability. Such variability has been quantified in several works using similarity index (SI) and interclass correlation coefficient (ICC) (see § 3.3.1 for definitions); Table 1 gives a summary of these results. Such discrepancies have motivated the development of automated tools in order to provide more consistent segmentations. These efforts have been further stimulated by the collection of large imaging databases (e.g. CAIN & ADNI) for which manual delineation of lesions would be too time consuming or variable across sites.

Table 1: Reported mean inter-rater agreement measures for manual WMH segmentation. SI is a measure of voxel-wise agreement ∈ [0, 1]; ICC is a measure of total volume agreement ∈ [0, 1].

Ref    Year   Authors             Raters   Data         SI     ICC
[17]   2016   Egger et al.        3        50 images    0.66   0.97
[18]   2006   Harmouche et al.    5        10 images    0.64   —
[19]   2009   de Boer et al.      2        6 images     0.75   —
[20]   2013   Steenwijk et al.    2        120 slices   0.83   0.96

1 We use the term WMH to describe T2 hyperintensities in purely the imaging context, distinguished from one source of such findings: leukoaraiosis, or white matter lesions of presumed cerebrovascular origin.
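As a concrete reference for the SI values in Table 1: assuming the usual Dice-style definition of the similarity index, SI = 2|A ∩ B| / (|A| + |B|) for binary lesion masks A and B (the paper's exact definition appears in § 3.3.1), a minimal NumPy sketch is:

```python
import numpy as np

def similarity_index(a, b):
    """Dice-style similarity index SI = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    a = np.asarray(a).astype(bool)
    b = np.asarray(b).astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# two raters' lesion masks on a toy 2x3 "image"
rater1 = np.array([[1, 1, 0], [0, 1, 0]])
rater2 = np.array([[1, 0, 0], [0, 1, 1]])
print(similarity_index(rater1, rater2))  # 2·2 / (3 + 3) ≈ 0.667
```

Values around 0.6–0.8, as in Table 1, thus indicate substantial but imperfect voxel-wise overlap between raters.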



1.2. Automatic WMH Segmentation

Many researchers have sought to fully automate segmentation of WMH. Reviews by Lladó et al. [21], García-Lorenzo et al. [22], and Caligiuri et al. [23] give an excellent overview of many of the proposed methods. Here, we introduce the specific challenges to automated WMH segmentation, review the most popular proposed methods, and focus particularly on their limitations with respect to validation.


1.2.1. Challenges to Segmentation

There are several challenges to automating segmentation of WMH. Whereas human observers can consider and account for many of these challenges, they are often difficult to model computationally. These include:


1. tissue graylevel distribution overlap due to partial volume effect and noise [24];


2. differentiating WMH from cerebrospinal fluid flow artifacts, especially periventricular lesions [2];
3. delineating the boundaries of focal lesions surrounded by so-called dirty appearing white matter [25];


4. identifying small WMH in MRI with significant partial volume effect due to large slice thickness [24];
5. variable and uncertain disease etiology [5, 26];
6. bias field artifact [27];
7. image variability.

We will take “image” to mean one subject-scan, which may include several MRI sequences, and we will take “image variability” to comprise:

• differences in image contrasts (i.e. tissue intensity distributions), due to selection of MRI parameters;


• differences in MRI scanner, including field strength and proprietary image reconstruction;
• differences in image resolution (voxel size);
• inter-subject anatomical variability and lesion heterogeneity.

Most of the major challenges to segmentation have been addressed with varying degrees of success in previous works. However, image variability remains a persistent obstacle, which has not been well characterized or addressed by previous works, particularly since it modulates the severity of the other six challenges. For example, images from one scanner may have little graylevel distribution overlap between tissues, but poor resolution, while images from another scanner may have high resolution, but strong bias field effect, et cetera. As a result, models designed or trained using images from one scanner often perform poorly on images from another. For this reason, we give special attention to image variability throughout this work, using a large and heterogeneous database to highlight the challenges outlined above.


1.2.2. Prior Work

A selection of the most popular proposed methods for WMH segmentation was derived from the reviews above [21, 22, 23]. These methods are summarized in Table 2, which also highlights the validation conditions used in each work. As in any classification problem, algorithms can be broadly grouped into supervised and unsupervised methods, the former requiring manually labelled example images, and the latter deriving from task-specific knowledge. In either case, most methods employ some combination of intensity features and spatial information.


1.2.2.1. Thresholding Techniques. As the most discriminative modality, intensities from FLAIR images have been used in many works as the primary feature for WMH classification. In unsupervised approaches, histogram statistics have been used to determine a suitable intensity threshold for distinguishing the lesion class, such as in the works by Jack et al. [29], de Boer et al. [19], and Smart et al. [51]. The approach by Yoo et al. [59] is similar, except that a naive estimate of total lesion load is used to help define the optimal FLAIR threshold from the histogram. In [52], regions derived from watershed segmentation are classified using a global FLAIR threshold, while in [46], a conservative threshold is applied to select obvious lesions, before the remaining voxels are classified using Fuzzy C Means. Khademi et al. [57] derive a model of partial volume averaging using the conditional probability of edge content on graylevel, and use this for unsupervised WMH segmentation in subjects with ischemic and MS diseases [57, 70, 78].


1.2.2.2. Mixture Models. Perhaps the most popular approach to neuroimage segmentation (not just for WMH) is the unified mixture model. This approach underpins the popular segmentation modules in both FSL [79] and SPM [80]. In such models, the distribution of intensities for each tissue class is approximated by a Gaussian distribution, permitting expectation maximization estimation of model parameters and the segmentation. Van Leemput et al. [28] adapt this approach for WMH segmentation by defining outlier classes using heuristic rules and ensuring that other model parameters are estimated robustly with respect to outliers. In many of these works, a Markov Random Field energy is used to encourage spatial smoothness in the label images.

Attempting to improve generalization, Khayati et al. [37] and Subbanna et al. [81] employ the same Markov Random Field-mixture model framework, but model WMH as a unique Gaussian-distributed tissue class. Later works by Bricq et al. [40], Schmidt et al. [53], Jain et al. [62], and Roura et al. [69], however, return to outlier-based strategies for lesion detection. Other variants explore lognormal distributions instead of Gaussians [45], additional partial volume classes [42], and supervised improvement of the WMH outlier class definition [39]. Graph-Cuts segmentation of mixture models has also gained recent popularity, as in the works by García-Lorenzo et al. [43], Tomas-Fernandez and Warfield [63], and Strumia et al. [72]. Finally, Harmouche et al. [18, 60] have employed separate mixture models for different brain regions, in order to reflect lobe heterogeneity.


Table 2: Summary of previous approaches to WMH segmentation with respect to image variability and reported performance (SI).

#    Ref.   Year   Authors                        MRI Sequences            U/S   I     S    SI
1    [28]   2001   Van Leemput et al.             T1, T2, PD               U     20    1    0.51
2    [29]   2001   Jack et al.                    FLAIR                    U     39    1    —
3    [30]   2002   Zijdenbos et al.               T1, T2                   S     10    1    0.6
4    [31]   2004   Anbeek et al.                  T1, T2, PD, FLAIR, IR    S     20    1    0.61
5    [32]   2005   Anbeek et al.                  T1, T2, PD, FLAIR, IR    S     10    1    0.78
6    [33]   2005   Admiraal-Behloul et al.        T2, PD, FLAIR            U     100   1    0.75
7    [34]   2006   Lao et al.                     T1, T2, PD, FLAIR        S     45    1    —
8    [35]   2006   Wu et al.                      T1, T2, PD               S     12    1    —
9    [36]   2006   Sajja et al.                   T2, PD, FLAIR            S     23    1    0.78
10   [18]   2006   Harmouche et al.               T1, T2, PD               U     10    1    0.61
11   [37]   2008   Khayati et al.                 FLAIR                    U     20    1    0.75
12   [38]   2008   Wels et al.                    T1, T2, FLAIR            S     6     1    0.57
13   [39]   2008   Herskovits et al.              T1, T2, PD, FLAIR        S     42    2    0.6
14   [40]   2008   Bricq et al.                   T2, FLAIR                U     25    2    —
15   [41]   2008   Dyrby et al.                   T1, T2, FLAIR            S     362   10   0.56
16   [42]   2008   Souplet et al.                 T1, T2, FLAIR            U     25    2    —
17   [19]   2009   de Boer et al.                 T1, PD, FLAIR            S     20    2    0.72
18   [43]   2009   García-Lorenzo et al.          T1, T2, PD               S     10    1    0.63
19   [44]   2009   Akselrod-Ballin et al.         T1, T2, PD, FLAIR        S     41    1    0.53
20   [45]   2009   Schwarz et al.                 T1, T2, PD               S     165   2    —
21   [46]   2010   Gibson et al.                  T1, T2, FLAIR            U     18    1    0.81
22   [47]   2010   Shiee et al.                   T1, FLAIR                U     10    1    0.63
23   [48]   2010   Scully et al.                  T1, T2, FLAIR            S     17    1    —
24   [49]   2011   García-Lorenzo et al.          T1, T2, FLAIR            U     10    1    0.65
25   [50]   2011   Geremia et al.                 T1, T2, FLAIR            S     20    2    —
26   [51]   2011   Smart et al.                   T1, FLAIR                U     30    1    —
27   [52]   2012   Samaille et al.                T1, FLAIR                U     67    6    0.72
28   [24]   2012   Khademi et al.                 FLAIR                    U     24    1    0.83
29   [53]   2012   Schmidt et al.                 T1, FLAIR                U     53    1    0.75
30   [54]   2012   Abdullah et al.                T1, T2, FLAIR            S     61    3    —
31   [55]   2013   Sweeney et al.                 T1, T2, PD, FLAIR        S     111   1    0.61
32   [56]   2013   Datta and Narayana             T1, T2, FLAIR            S     90    3    —
33   [20]   2013   Steenwijk et al.               T1, FLAIR                S     40    2    0.8
34   [57]   2014   Khademi et al.                 FLAIR                    U     25    1    0.78
35   [58]   2014   Ithapu et al.                  T1, FLAIR                S     38    1    0.67
36   [59]   2014   Yoo et al.                     FLAIR                    S     32    2    0.76
37   [60]   2015   Harmouche et al.               T1, T2, PD, FLAIR        U     100   35   0.56
38   [61]   2015   Guizard et al.                 T1, T2, PD, FLAIR        S     108   32   0.6
39   [62]   2015   Jain et al.                    T1, FLAIR                U     20    1    0.67
40   [63]   2015   Tomas-Fernandez and Warfield   T1, T2, FLAIR            U     51    2    —
41   [64]   2015   Wang et al.                    T1, T2, FLAIR            U     70    2    0.84
42   [65]   2015   Roy et al.                     FLAIR                    S     38    3    0.56
43   [66]   2015   Brosch et al.                  T1, T2, FLAIR            S     20    2    0.36
44   [67]   2015   Fartaria et al.                FLAIR                    S     39    1    0.55
45   [68]   2015   Deshpande et al.               T1, T2, PD, FLAIR        S     52    1    0.5
46   [69]   2015   Roura et al.                   T1, FLAIR                U     20    2    0.34
47   [70]   2016   Knight and Khademi             FLAIR                    U     15    3    0.7
48   [71]   2016   Mechrez et al.                 T1, T2, FLAIR            S     20    2    0.31
49   [72]   2016   Strumia et al.                 T1, FLAIR                U     20    3    0.52
50   [73]   2016   Griffanti et al.               T1, FLAIR                S     130   2    0.76
51   [74]   2016   Brosch et al.                  T1, T2, PD, FLAIR        S     77    67   0.64
52   [75]   2017   Valverde et al.                T1, FLAIR                U     33    2    —
53   [76]   2017   Dadar et al.                   T1, FLAIR                S     80    3    0.62
54   [77]   2017   Zhan et al.                    T1, T2, FLAIR            S     50    2    0.76

Abbreviations. U/S: unsupervised/supervised; I: number of MR image sets used for validation; S: number of MRI scanners used for validation; SI: reported validation similarity index.



Other unsupervised paradigms for segmentation of WMH are not very common, though the works by Admiraal-Behloul et al. [33], Gibson et al. [46], and Valverde et al. [75] use Fuzzy C Means, while a novel topology-driven framework for brain segmentation called TOADS was proposed in [82], and subsequently adapted for lesion segmentation by Shiee et al. [47].

1.2.2.3. Classical Supervised Methods. WMH segmentation has also been approached using classical supervised classification methods like K-Nearest Neighbours [20, 31, 32, 35, 67, 73], Support Vector Machines [34, 48, 54, 58], and ensembles such as Random Forests and AdaBoost [38, 44, 50, 58, 65]. In such models, the employed features always include voxel intensities, in addition to some combination of spatial coordinates [31, 32, 73], tissue priors [20, 35, 50, 67, 65], and / or patch-based features [38, 44, 48, 54, 58, 73].


In recent years, logistic regression has also become more popular. Several combinations of MRI sequences are used as predictors in isolation, multiplication, and with smoothing in the OASIS model by Sweeney et al. [55]. In the past year, Zhan et al. [77] proposed a simplification of this model, using only the raw T1, T2, and FLAIR graylevels, while a variant using spatial features and linear regression was described by Dadar et al. [76].


1.2.2.4. Deep Learning. Until recently, only a few WMH segmentation models have used deep learning. Traditional fully-connected Neural Networks are used by Zijdenbos et al. [30] and Dyrby et al. [41] to predict the lesion class voxel-wise, using a combination of intensity, spatial, and tissue prior features, whereas broader contextual features are considered in the deep convolutional Neural Network proposed by Brosch et al. [66]. Several deep learning approaches were employed, however, in the 2017 WMH segmentation competition [83]. These methods will be further discussed below.

1.2.2.5. External Toolboxes. Finally, several freely available toolboxes for general neuroimage processing provide well validated modules for tasks like image registration, brain extraction / de-skulling, bias field correction, and normal brain segmentation. These have been incorporated into many of the WMH segmentation algorithms, including the SPM² toolkit [36, 41, 44, 51, 53, 58, 59, 65, 75] and the FSL³ toolkit [20, 39, 46, 55, 56, 64, 65, 73, 77], as well as bias correction by the N3/4⁴ [84] algorithm [18, 30, 60, 61, 67, 71, 75, 76, 77].

1.2.3. Limitations

As tools for processing images from a consistent source, all of the proposed methods are reasonable. However, if we assume a goal of multi-centre use, then most of the algorithms are inadequately validated, particularly with respect to image variability. Both supervised and unsupervised methods designed for images from one source may perform poorly on images from another, as noted by several authors [21, 55, 76].

2 http://www.fil.ion.ucl.ac.uk/spm/
3 https://fsl.fmrib.ox.ac.uk/fsl/
4 https://www.slicer.org/wiki/Documentation/4.6/Modules/N4ITKBiasFieldCorrection



Therefore, it is wise to validate algorithms on images from different scanners. Yet, of the 54 works reviewed (Table 2), only half use more than one scanner for validation, and only 5 (9%) use more than three.

In supervised models, discriminative features like graylevels are subject to image variability, in spite of efforts to standardize them. For example, in the work by Steenwijk et al. [20], the authors present results for a supervised method using same-scanner training and testing on data from two scanners independently (SI = 0.75, 0.84), but note a significant performance drop when training on one scanner and testing on the other (SI = 0.50), despite the use of variance scaling for intensity standardization [20]. Such a discrepancy might be attributed to the features of image variability which are difficult to model, including different scanner, resolution, sequence contrasts, and in this case, disease.

Unsupervised models also often have hyperparameters which can be over-tuned to images from a single source. For example, most of the mixture model approaches classifying lesions as outliers [28, 40, 49, 62, 69] employ an empirically determined definition of outlier using model parameters (e.g. a Mahalanobis distance).


Yet such mixture model parameters, and therefore the optimal outlier definition, are known to be affected by MR slice thickness, noise level, and tissue contrasts, as noted by several of the authors [28, 49, 69].

While not all works have proven multi-centre performance, we would like to highlight four works which have demonstrated strong validation of their algorithms (Table 3).⁵ Perhaps not surprisingly, reported performance in these works is lower than reported elsewhere (SI ≤ 0.64).


Table 3: Works demonstrating better validation of a WMH segmentation algorithm.

Ref.   Year   Authors            I     S    SI
[41]   2008   Dyrby et al.       362   10   0.56
[60]   2015   Harmouche et al.   100   35   0.56
[61]   2015   Guizard et al.     108   32   0.60
[74]   2016   Brosch et al.      77    67   0.64

Abbreviations. I: number of MR image sets used for validation; S: number of MRI scanners used for validation; SI: reported validation similarity index.

Perhaps the reason for weak validation in many works is the lack of publicly available datasets for WMH segmentation. The MSSEG 2008 Challenge data [85], which provide only 20 complete training image sets (in addition to a withheld testing database), were the standard validation data for almost a decade. However, these data consist of images from only 2 scanners, have significant MR artifacts [22], and it has been noted that the manual segmentations provided have large inter-rater variability (median inter-rater SI = 23.7 ± 13.5% [61]). We therefore welcome recent revivals of the competition framework, such as the Longitudinal Multiple Sclerosis Lesion Segmentation (ISBI 2015) [86], the MS Segmentation Challenge (MICCAI 2016) [87] and

5 The work by Samaille et al. [52] is also a good candidate, having used 6 scanners for validation; however, as the authors note, 43 of the 67 images (64%) come from only one scanner, reducing the robustness of generalization results.



the WMH Segmentation Challenge (MICCAI 2017) [83], for providing additional open-source images for training, as well as algorithm testing on withheld data.

1.2.4. Contributions

In this paper, we make two major contributions towards the goal of automated multi-centre WMH segmentation. First, we propose a new FLAIR-only model for WMH segmentation, Voxel-Wise Logistic Regression (VLR), which employs FLAIR graylevels in a spatially-parametrized logistic regression framework. A FLAIR-only approach has several advantages: inter-sequence registration, which is challenging among sequences acquired at different resolutions, is not required, and both prospective and retrospective studies can have fewer sequence requirements, thus potentially reducing acquisition time. The FLAIR sequence specifically is much better than conventional T2 images at distinguishing periventricular lesions, since the CSF signal in FLAIR images is suppressed [88]. We also show that, unlike many previously proposed machine learning methods, the inner workings of the proposed method are intelligible, and that it outperforms a similar method, the Lesion Prediction Algorithm (LPA).

We also propose a cross validation (CV) framework, Leave-One-Source-Out (LOSO-CV), which is especially relevant to MR image analysis. We show that this framework gives more realistic (lower) estimates of model performance compared with other typical frameworks. We characterize the performance of the VLR model using this framework and, in the spirit of open-source research, we employ only WMH imaging data from freely available sources. We hope that this will facilitate direct comparison of future methods with the current work.
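Mechanically, LOSO-CV amounts to grouping images by their source scanner and holding out one source per fold, training on the remainder. A minimal sketch (the scanner labels and fold structure below are illustrative, not the paper's actual data):

```python
from collections import defaultdict

def loso_folds(sources):
    """Yield (held_out_source, train_indices, test_indices) per fold.

    Each fold withholds every image from one source (scanner) for testing,
    and trains on images from all remaining sources.
    """
    by_source = defaultdict(list)
    for i, s in enumerate(sources):
        by_source[s].append(i)
    for held_out, test_idx in sorted(by_source.items()):
        train_idx = [i for i, s in enumerate(sources) if s != held_out]
        yield held_out, train_idx, test_idx

# e.g. 6 images from 3 scanners
sources = ["scannerA", "scannerA", "scannerB", "scannerC", "scannerC", "scannerC"]
for held_out, train, test in loso_folds(sources):
    print(held_out, train, test)
# scannerA [2, 3, 4, 5] [0, 1]
# scannerB [0, 1, 3, 4, 5] [2]
# scannerC [0, 1, 2] [3, 4, 5]
```

Because the tested source never contributes training data, per-fold performance estimates reflect generalization to an unseen scanner, which is why LOSO-CV yields lower (more realistic) SI than same-scanner validation.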


2. Proposed Segmentation Method

We now describe the proposed lesion prediction model, Voxel-Wise Logistic Regression (VLR). The method is inspired by the open-source WMH segmentation tool LPA [89, 90], which uses a combination of FLAIR graylevel and spatial features in a logistic regression model to predict the lesion class.⁶ Specifically, the conditional probability of the lesion class c = 1, given the image-derived features y, is modelled by a logistic function, parameterized by feature weights β:

P(c = 1 | y; β) = 1 / (1 + e^{−β^T y}),    β^T y = ∑_{k=0}^{K} β^k y^k.    (1)

Typically, the image features are functions of space x = [x1, x2, x3], as in y(x) = [1, y(x)^1, . . . , y(x)^K]^T. In the LPA model, the logistic intercept β^0(x) additionally considers the effect of spatial location, while

6 Our understanding of the method was derived from both section 6.1 of the thesis by Schmidt [90] and the open-source MATLAB code at http://www.applied-statistics.de/lst.html.



the remaining feature weights are estimated globally. In the proposed VLR model, all elements of β are permitted to vary spatially, yielding a set of “parameter images” β(x) = [β(x)^0, β(x)^1, . . . , β(x)^K]^T, and the following revision to Eq. (1):

P(c(x) = 1 | y(x); β(x)) = 1 / (1 + e^{−β^T(x) y(x)}).    (2)

We denote this probability as the estimated lesion class label ĉ(x) = P(c(x) = 1 | y(x); β(x)) ∈ [0, 1], and denote complete images using capital letters, as in Y(x), Ĉ(x), etc.
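Inference under Eq. (2) is simply an element-wise logistic function of the voxel-wise linear predictor β^T(x)y(x). A minimal NumPy sketch, with illustrative array names and a toy volume (here K = 1, a single FLAIR feature, matching our implementation):

```python
import numpy as np

def vlr_predict(beta, Y):
    """Voxel-wise logistic prediction, Eq. (2).

    beta : array of shape (K+1, *vol) -- parameter images [beta^0, ..., beta^K]
    Y    : array of shape (K,   *vol) -- feature images   [y^1, ..., y^K] (y^0 = 1 implicit)
    Returns the lesion probability image c_hat(x) in [0, 1].
    """
    lin = beta[0] + np.sum(beta[1:] * Y, axis=0)  # beta^T(x) y(x) per voxel
    return 1.0 / (1.0 + np.exp(-lin))

# toy 2x2x1 "volume" with a single FLAIR feature (values are illustrative)
flair = np.array([[[0.2], [0.9]], [[0.5], [0.8]]])
beta = np.stack([np.full_like(flair, -6.0),   # beta^0(x): intercept image
                 np.full_like(flair, 10.0)])  # beta^1(x): FLAIR weight image
c_hat = vlr_predict(beta, flair[None])
print(c_hat.round(3))  # brightest voxels receive the highest lesion probability
```

Because each spatial location has its own parameters, the same function applies unchanged whether β is constant over the volume (global logistic regression) or fully voxel-wise, as in VLR.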


There are several advantages to the proposed spatial parameterization of β(x). First, since no feature weights are tied across x – each spatial location is implicitly modelled independently, hence the name Voxel-Wise Logistic Regression – estimation of β across spatial locations can occur in parallel. As such, practically complete convergence can be achieved during estimation of β, and reliance on models of spatial connectivity (e.g. MRF) can be avoided. This is in contrast to the LPA parameter estimation, which relies on random spatial sampling procedures and interpolation, and is not guaranteed to reach convergence [90]. Second, in this work, we highlight the transformation of variables, from β(x) to a threshold image T(x) and sensitivity image S(x), which succinctly summarizes the decision characteristics of the model. As such, the VLR model is not a “black box”, and its fitted parameters can be verified by expert humans. Finally, the current work also proposes significantly different pre- and post-processing procedures from LPA, which aim to improve reliability of segmentation on images from different sources.

The recent popularity of logistic regression models might be attributed to their simplicity and strong prior, making them easier to train with less data. By contrast, more complex supervised models like Random Forests and Neural Networks have weaker priors, making them susceptible to overfitting, particularly for tasks with limited training data, like this one. Furthermore, evidence suggesting that lesion segmentation requires a complex model is scarce, and radiologist assessment primarily considers only T2/FLAIR hyperintensity in the white matter. Finally, logistic regression models can give probabilistic outputs, which may be helpful in quantifying marginally pathological tissues like dirty-appearing white matter.

The voxel-wise parameterization of the logistic regression model in this work is perhaps surprising. However, it parallels work by Harmouche et al. [60] in recognizing the potential for regional heterogeneity in the appearance of pathological and healthy tissues in brain MRI. Furthermore, many supervised models employ spatially normalized coordinates as features [31, 32, 41, 73, 76], thereby learning a discriminant in a feature space composed at least in part by these coordinates. The VLR discriminant is in fact similar, except that spatial and graylevel features are treated differently when constructing the class discrimination boundary in feature space. In particular, we enforce monotonicity in the probability of lesion given the graylevel features, but allow arbitrary complexity in the spatial features. Smoothness in the spatial domains is encouraged, however, by a variety of regularization techniques, discussed below.
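For the single-feature (FLAIR-only) case, one natural reading of the T(x), S(x) transformation mentioned above — assumed here for illustration, since the precise definitions are given later in the paper — rewrites the per-voxel linear predictor β(x)^0 + β(x)^1 y as S(x)(y − T(x)), i.e. a per-voxel graylevel threshold T(x) at which P = 0.5 and a per-voxel decision sharpness S(x):

```python
import numpy as np

# Assumed reading (for illustration only): with a single FLAIR feature,
#   beta0(x) + beta1(x) * y  =  S(x) * (y - T(x)),
# where T(x) = -beta0(x) / beta1(x) is the graylevel at which P(lesion) = 0.5,
# and S(x) = beta1(x) controls how sharply probability rises past the threshold.
beta0 = np.array([[-6.0, -8.0], [-5.0, -7.0]])  # toy intercept image
beta1 = np.array([[10.0, 10.0], [12.5, 14.0]])  # toy FLAIR weight image

T = -beta0 / beta1  # threshold image: per-voxel thresholds 0.6, 0.8, 0.4, 0.5
S = beta1           # sensitivity image
print(T)
```

Under this reading, a fitted T(x) image can be inspected directly as "the FLAIR graylevel above which this voxel is called lesion", which is what makes the model's parameters human-verifiable.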


2.1. Overview

The VLR model employs logistic regression to predict the lesion class on a per-voxel basis, after graylevel standardization and image registration. This model is trained in a standardized space, but model parameters are warped to the native image space for test prediction. Training the model yields one set of parameters per voxel in the standardized space, or equivalently one image for each parameter. These parameter images are then readily interpretable, which allows them to be regularized intuitively. In our implementation, we use only one feature, the FLAIR image graylevel, in order to avoid inter-sequence image registration. However, in developing the model, we maintain generality for any number of features.

For discriminative models using MRI graylevels as features, it is necessary to standardize the graylevels. Our model also requires that images are standardized in space, at least during training. These are the goals of our preprocessing steps, which include image registration, histogram matching, and brain masking. Standardized images are then used to train the logistic model, yielding a set of parameter images, which are subsequently smoothed. These parameter images can then be used to predict the lesion class for new images, following an inverse registration from the standardized space. Initial predictions are thresholded at a value π to yield binary segmentations Ĉ_π(x), and small, isolated estimated lesions are removed. An overview of the algorithm is given in Figure 1.


Figure 1: Overview of the proposed segmentation method. Typefaces – upright Roman: images in native space; italic Roman: images in standard (MNI) space; calligraphic: a set of images from several patients; bold: a set of images corresponding to different features. Variables – C(x): manual segmentation; Y(x): FLAIR image; β(x): parameter image; Ĉ(x): estimated lesion segmentation. Best viewed in colour.

2.2. Preprocessing

For image registration and bias field correction, we use the “Segment” tool available in the SPM12 toolbox for MATLAB [80]. In the VLR model, bias correction is critically important, and a 2016 comparison found that the unified segmentation model employed in SPM12 outperformed the N3 approach (N4 precursor), among others [91]. While perfect registration was not essential due to the benefits of noisy data when training the VLR model, the transformations estimated by SPM12 “Segment” for real data are also known to be good [92]. Since many of the FLAIR images in our database had good white matter / grey matter contrast, we used SPM12 “Segment” directly on these images. This additionally avoids compounding errors of two-step registration, since the available T1 images were not acquired at the same resolution as the FLAIR images.

For training, this step produces images which are bias-corrected, warped (affine and nonlinear) [93], and resampled (trilinear) to 1.5 mm isotropic voxel resolution in the Montreal Neurological Institute (MNI) brain space [94, 95]. During testing, we use this same step for bias correction, but do not resample the source image. Instead we use the inverse transform to warp the model parameter images to the subject space for


inference, since the smooth parameter images are less affected by interpolation, and since estimated lesion class images are desired in the original space. Source images (training and testing) are also masked by a brain mask, derived from the MNI-space ICBM brain tissue probability maps [96].

For graylevel standardization, we define a synthetic histogram with the shape (1 − y)^5 − (1 − y)^6, which smoothly increases contrast for the upper range of image intensities. Masked images are then histogram


matched to this profile [97], and clipped to the range [0, 1]. After exploring a range of transformations, we found that this nonlinear graylevel transformation outperformed several linear transformations involving histogram statistics (e.g. ỹ = (y − µ)/σ), including those using regional statistics, as suggested by Shinohara et al. [98]. Performance was quantified using a heuristic class graylevel overlap metric, and using segmentation performance of the full model with each technique. As noted in [99], the choice of target histogram does not significantly affect the graylevel agreement between source images after histogram matching; the histogram used here was simply defined to increase contrast in the upper range of graylevels.

2.3. Voxel-Wise Logistic Regression

We now turn to the discriminative model. We would like to predict the lesion class probability image Ĉ(x), given a set of feature images Y(x). As discussed above, we model this probability for each voxel independently, using a logistic function with feature weights β(x) – Eq. (2). It is therefore sufficient to derive the model here for only one spatial location. Furthermore, we omit the (x) notation in the following sections for clarity, since all subsequent derivations are for a single voxel.

2.3.1. Model Fitting

Fitting the model involves estimating β (for each voxel x). To do so, we require some training data: feature vectors from a population of N observations, Y = {y_1, …, y_N}, and the true labels, C = {c_1, …, c_N}. We would like to find the β which maximizes the likelihood of the model given the training data (the maximum likelihood estimate) or, equivalently, the log-likelihood L, which is more numerically stable. The log-likelihood of the model is defined as

    L(β) = log P(C | Y; β)
         = log ∏_{n=1}^{N} P(c_n | y_n; β)
         = Σ_{n=1}^{N} log P(c_n | y_n; β).                        (3)

Therefore, we can define the optimal β as

    β* = arg max_β Σ_{n=1}^{N} log P(c_n | y_n; β).                (4)

2.3.2. Parameter Updates

We estimate the optimal β* with iterative optimization, using an initial estimate β^(0), an update term ∆β^(t), and a learning rate parameter α,

    β^(t+1) ← β^(t) + α ∆β^(t).                                    (5)

There are many possible definitions of ∆β, including simply the gradient of L(β), denoted ∇_β L, as in gradient descent [100]. However, it can be shown that the negative log-likelihood is convex, so higher-order update equations can be used for faster convergence [100]. We use Newton's updates, which also employ the Hessian matrix, denoted ∇²_β L. If ∇_β L and ∇²_β L are defined as

    ∇_β L = [ ∂L/∂β_1, …, ∂L/∂β_k ]ᵀ,                              (6)

    ∇²_β L = [ ∂²L/∂β_1∂β_1  ⋯  ∂²L/∂β_1∂β_k
                     ⋮        ⋱        ⋮
               ∂²L/∂β_k∂β_1  ⋯  ∂²L/∂β_k∂β_k ],                    (7)

then the Newton update is given by

    ∆β = −(∇²_β L)⁻¹ ∇_β L.                                        (8)

For logistic regression, the log-likelihood of a single observation n is given by

    log P(c_n | y_n; β) = log[ (ĉ_n)^{c_n} (1 − ĉ_n)^{1−c_n} ]
                        = c_n log ĉ_n + (1 − c_n) log(1 − ĉ_n)
                        = c_n βᵀy_n − log(1 + e^{βᵀy_n}),          (9)

so the gradient can be defined as

    ∇_β L = Σ_{n=1}^{N} y_n (c_n − ĉ_n),                           (10)

and the Hessian as

    ∇²_β L = −Σ_{n=1}^{N} y_n y_nᵀ ĉ_n (1 − ĉ_n).                  (11)

Substituting (10) and (11) into (8), we obtain the explicit update quantity ∆β for (5).

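To make the updates concrete, the single-voxel fit of Eqs. (5)–(11) can be sketched as follows. This is an illustrative NumPy translation, not the authors' MATLAB implementation; the function and variable names are ours.

```python
import numpy as np

def fit_voxel_logistic(y, c, alpha=1.0, n_iter=25):
    """Newton's-method logistic regression for one voxel (sketch of Eqs. (5)-(11)).

    y : (N,) FLAIR graylevels; c : (N,) binary lesion labels.
    Returns beta = [beta_0, beta_1].
    """
    Y = np.column_stack([np.ones_like(y), y])      # y_0 = 1 (intercept), y_1 = graylevel
    beta = np.zeros(2)
    for _ in range(n_iter):
        c_hat = 1.0 / (1.0 + np.exp(-Y @ beta))    # logistic prediction
        grad = Y.T @ (c - c_hat)                   # gradient, Eq. (10)
        hess = -(Y * (c_hat * (1.0 - c_hat))[:, None]).T @ Y  # Hessian, Eq. (11)
        beta = beta - alpha * np.linalg.solve(hess, grad)     # Eqs. (8) and (5)
    return beta
```

With the fitted β, the reparameterization introduced below (Eq. (13)) gives the threshold τ = −β_0/β_1 and the sensitivity s = β_1 directly.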

2.3.3. FLAIR Feature and Implementation

At this point, the model is still generalizable to any choice of features y. However, we use only K = 1 feature: the FLAIR image intensity y_1 (the intercept β_0, corresponding to y_0 = 1, is also fitted). Our motivation is to minimize the need for additional scans and intra-subject registration, since it has been shown that WMH contrast and detection are better in FLAIR than in conventional T2 or double inversion recovery (DIR) sequences [101, 102], and the anatomical information conferred by T1 or T2 images is not used by our model. Moreover, it is necessary that each feature employed in the VLR model have a monotonic relationship with the lesion class, but typical image contrasts in T1 and T2 MRI do not yield such a relationship.

The use of only one feature has two additional benefits. First, estimating β* for every voxel is computationally expensive, but highly parallelizable, since each estimation is independent. In fact, for each voxel, explicit expressions for the 2 × 1 gradient (10), the inverse of the 2 × 2 Hessian (11), and therefore the 2 × 1 update (8) can be defined easily. Concatenating these quantities along a vectorized index of x, we obtain a single update matrix for the entire model, which can be computed in parallel,

    ∆β(x) = [ ∆β_0(x_1)  ∆β_1(x_1)
                  ⋮           ⋮
              ∆β_0(x_M)  ∆β_1(x_M) ],                              (12)

where M is the number of voxels.
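A parallel update of the form (12) can be sketched using the explicit 2 × 2 Hessian inverse, computed for all voxels at once. This is an illustrative sketch; the array layout (one row of graylevels and labels per voxel) is our assumption.

```python
import numpy as np

def parallel_newton_update(beta, Y, C):
    """One Newton update for all M voxels at once (sketch of Eq. (12)).

    beta : (M, 2) per-voxel parameters; Y : (M, N) graylevel features;
    C : (M, N) binary labels.
    """
    z = beta[:, 0:1] + beta[:, 1:2] * Y                # beta_0 + beta_1 * y
    c_hat = 1.0 / (1.0 + np.exp(-z))
    r = C - c_hat                                      # residuals
    g0, g1 = r.sum(axis=1), (Y * r).sum(axis=1)        # gradient entries, Eq. (10)
    w = c_hat * (1.0 - c_hat)
    h00 = -w.sum(axis=1)                               # Hessian entries, Eq. (11)
    h01 = -(w * Y).sum(axis=1)
    h11 = -(w * Y ** 2).sum(axis=1)
    det = h00 * h11 - h01 ** 2                         # explicit 2x2 inverse
    d0 = -(h11 * g0 - h01 * g1) / det                  # delta beta_0, Eq. (8)
    d1 = -(-h01 * g0 + h00 * g1) / det                 # delta beta_1, Eq. (8)
    return beta + np.column_stack([d0, d1])            # Eq. (5), alpha = 1
```

The closed-form inverse avoids a per-voxel linear solve, so the whole update is a handful of elementwise array operations.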

Second, with the FLAIR graylevel as the only feature, y = y_1, it is possible to reparameterize the sigmoid argument as

    βᵀy = β_0(1) + β_1(y) = s(y − τ),    where s = β_1,  τ = −β_0/β_1.    (13)

In this form, it is easy to see that the parameter τ gives a threshold for the graylevel y corresponding to a probability of lesion ĉ = 1/(1 + e⁰) = 0.5, which can be readily interpreted. Similarly, the s parameter

corresponds to the slope of the logistic function with respect to the FLAIR graylevel – i.e. the sensitivity near τ. Considering all image locations once again, T(x) is therefore a threshold image, and S(x) is a sensitivity image.

2.4. Regularization

With limited training data, three challenges emerge for this model. First, it is possible that the intensities of the observed lesion and non-lesion classes in a given location are perfectly separable. In this scenario, the maximum likelihood-fitted logistic function approaches a step function (β_1 → +∞), which is falsely "confident" in its class discrimination. We therefore require some regularization on β_1. Second, in many locations, the lesion class has never been observed – i.e. C(x) = {0, …, 0}. Again, the maximum likelihood-fitted logistic contradicts our prior knowledge by predicting ĉ ≈ 0⁺ regardless of graylevel y (β_0 → −∞). Finally, treating voxels independently is not ideal, and may lead to "noisy" parameter images. These should be smoothed to ensure spatial regularity.

The first two challenges are well known. The first is solved by incorporating a prior on the distribution of β. Gaussian priors with variance λ⁻¹ are often used, yielding L2 regularization of β with weight λ. Including this prior, the estimate becomes maximum a posteriori, and Equation (4) becomes

    β* = arg max_β [ log P(β) + Σ_{n=1}^{N} log P(c_n | y_n; β) ],          (14)

yielding the following revision to (8):

    ∆β = −(∇²_β L − λI)⁻¹ (∇_β L − λβ).                                      (15)

We find that λ = 10⁻² works well in the current implementation.

The second problem is more difficult to solve in this context. We would like to maintain the ability to predict the lesion class in locations where no lesions are observed in the training data. Considering the model reparameterization in (13), this is equivalent to enforcing τ < y_max ⇒ ĉ(y_max) > 0.5. Regularization of β_0 and β_1 independently, or with linear Tikhonov matrices, will have no reliable effect on τ = −β_0/β_1, and so cannot achieve this goal. Similarly, an exploration of nonlinear regularization terms which might be more capable revealed nontrivial convergence issues with these approaches.

Therefore, we employ "pseudo-lesion" regularization: we append V synthetic observations, 𝒱 = {γ_1, …, γ_V}, with labels {1, …, 1}, to the training data {Y, C} in each location. During training, this has the effect of increasing the predicted probability of lesion given greylevels near 𝒱, and is equivalent to synthetic dataset balancing. Hyperparameters associated with this approach include the synthetic lesion greylevels 𝒱, which can be thought of as a prior on the lesion class greylevel, and the number of pseudo-lesions, V, controlling the strength of the effect. The greylevels 𝒱 do not have to be equal, and V can be very large; however, we find it is sufficient to use V = 1 and γ = y_max. We also find it is helpful to include pseudo-lesions outside the SPM-predicted white matter (i.e. throughout the brain), to account for small registration errors and misclassification of WML as grey matter. Additional data augmentation using plausible spatial sampling is explored in § 3.1.1.

Finally, we smooth the parameter images β(x) in MNI space using a 3D Gaussian kernel with σ = 2 voxels (3 mm), to account for registration errors. Exploration of several linear and non-linear smoothing operations revealed that Gaussian smoothing gave the best results, and that performance was not significantly affected by different values of σ.

2.5. Post Processing

Following model training, the parameter images can be used to estimate the voxel-wise probability of lesion. After bias correction of a test image, and inverse registration of the parameter images to the test image space by SPM12, equation (2) gives the initial prediction Ĉ(x). This probabilistic image of the lesion class is then thresholded to yield a binary estimate. The threshold π is selected through maximization⁷ of the SI across all training data. Finally, since small, isolated groups of estimated lesion voxels may arise due to image artifacts, and these are not typically considered by manual raters, they are removed from the segmentation, yielding the final estimate Ĉ_π(x). The minimum volume of connected voxels was chosen as 5 mm³, in rough agreement with other works [59, 67]; the actual number of voxels changes with the resolution of the native image space.
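The post-processing step can be sketched as thresholding followed by removal of small connected components. Below is an illustrative NumPy version with 6-connectivity; the 5 mm³ minimum is converted to a voxel count via the native voxel volume. (This is a sketch of the procedure, not the authors' code.)

```python
import numpy as np
from collections import deque

def postprocess(c_prob, pi, voxel_vol_mm3, min_vol_mm3=5.0):
    """Threshold lesion probabilities, then drop small isolated components."""
    mask = c_prob > pi
    min_vox = int(np.ceil(min_vol_mm3 / voxel_vol_mm3))
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        comp, queue = [start], deque([start])      # BFS over one component
        seen[start] = True
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                p = (x + dx, y + dy, z + dz)
                if all(0 <= p[i] < mask.shape[i] for i in range(3)) \
                        and mask[p] and not seen[p]:
                    seen[p] = True
                    comp.append(p)
                    queue.append(p)
        if len(comp) >= min_vox:                   # keep only large-enough lesions
            for p in comp:
                out[p] = True
    return out
```

In practice a library labeller (e.g. scipy.ndimage.label) would replace the hand-rolled BFS; it is written out here only to keep the sketch self-contained.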

3. Experimentation

3.1. Image Database

We use 96 FLAIR images from 7 different scanners to train and validate our model. The term "scanner" is used to mean a unique combination of scanner model, field strength, MRI sequence parameter set, and resolution. The number of images and scan parameters are summarized in Table 4. All image sets (FLAIR and manual segmentation) are from freely-available WMH segmentation competitions (two involving MS lesions, and one involving leukoaraiosis), from which we use only the training data, since they include manual segmentations. In addition to in-house validation results, we include the test results from the MICCAI 2017 WMH Segmentation Challenge [83], which are evaluated on 110 withheld training and testing images from 5 different scanners, three of which are also represented in the training data.⁸

3.1.1. Data Augmentation

After image registration to the MNI brain space, we increase the number of images available for training by mirroring every image across the sagittal plane and shifting every image by one voxel in each direction (positive and negative), yielding a net 2 × (2 + 2 + 2 + 1) = 14-fold increase. These augmentations are surely plausible, considering the likelihood of registration errors on the scale of the MNI voxel size (1.5 mm cubed), and they confer a number of benefits. First, increasing the number of observations per voxel reduces logistic overfitting. Second, model symmetry and smoothness are enforced; in fact, the shifting augmentations have a similar effect to Markov Random Field regularization of the estimated parameter images.

⁷ Optimization uses the MATLAB function fminsearch, an implementation of the simplex method by [103].
⁸ More information on these can be found at wmh.isi.uu.nl/data
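The 14-fold augmentation can be sketched as follows. This is illustrative only: we assume axis 0 is the left–right axis in MNI space, and use wrap-around shifts (np.roll) for brevity, whereas a real implementation would pad or crop at the borders.

```python
import numpy as np

def augment_14(vol):
    """Sagittal mirror plus +/-1-voxel shifts per axis: 2 x (1 + 6) = 14 volumes."""
    variants = []
    for v in (vol, vol[::-1, :, :]):   # original, then mirrored across the sagittal plane
        variants.append(v)             # unshifted copy
        for axis in range(3):
            variants.append(np.roll(v, 1, axis=axis))    # +1 voxel shift
            variants.append(np.roll(v, -1, axis=axis))   # -1 voxel shift
    return variants
```

The same transformations must of course be applied to the manual label volumes so that observations and labels stay aligned.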

Table 4: Summary of image database.

Dataset        Ref.   Scanner               Img (#)   TE (ms)   TR (ms)   TI (ms)   Voxel Size (mm)      Manuals (#)
WMH 2017 (1)   [83]   3T Philips Achieva    20        125       11000     2800      0.96 × 0.96 × 3.00   1 (a)
WMH 2017 (2)   [83]   3T Siemens TrioTim    20        82        9000      2500      1.00 × 1.00 × 3.00   1 (a)
WMH 2017 (3)   [83]   3T GE Signa HDxt      20        126       8000      2340      0.98 × 1.20 × 3.00   1 (a)
MS 2016 (1)    [87]   3T Philips Ingenia    5         360       5400      1800      0.50 × 1.10 × 0.50   7 (b)
MS 2016 (2)    [87]   1.5T Siemens Aera     5         336       5400      1800      1.04 × 1.25 × 1.04   7 (b)
MS 2016 (3)    [87]   3T Siemens Verio      5         399       5000      1800      0.74 × 0.70 × 0.74   7 (b)
MS 2015 ISBI   [86]   3T Philips            21        68        11000     2800      0.43 × 0.43 × 3.00   2 (c)

(a) Manuals were generated following the standards outlined in [26], and were subsequently reviewed by a second rater; only WMH labels were included. (b) Manuals were fused using the LOP-STAPLE method [104]. (c) Manuals were fused using logical 'and'.

3.2. Leave-One-Source-Out Cross Validation (LOSO-CV)

If an MRI segmentation algorithm is to be used outside its original setting, it is likely that the new images will have significantly different characteristics – scanner manufacturer, field strength, sequence parameters, and resolution. In order to characterize the expected performance of the model on such data, it is important to test the model on images whose characteristics have not been perceived during optimization. This procedure should also be repeated on images from several sources, to ensure robust results.

To this end, and following from the concerns outlined in § 1.2.3, we propose Leave-One-Source-Out Cross Validation (LOSO-CV). For a database comprising images from S unique MRI scanners, LOSO-CV involves training the model S times, in each case omitting all data from one scanner, and then measuring the performance on this withheld data. This approach is contrasted with single-scanner training and testing (most common), and with recent competition frameworks [85, 86, 87], which employ a single, fixed test set comprising data from both seen and unseen scanners.⁹ While the unseen data in these test sets do provide an estimate of generalization performance, the LOSO-CV approach leverages all available data for both model training and performance evaluation, reducing the potential bias associated with selection of specific images and scanners for training versus testing. The LOSO-CV approach also represents a specific instance of the "multi-source cross validation" framework proposed by Geras and Sutton [105] for arbitrary tasks. We further recommend that as many scanners and images as possible be used to validate model performance in segmentation tasks, and that the scan parameters be reported as in Table 4 for transparency. Again, "scanner" is used here to describe scanner-parameter combinations, since the same scanner model may be used with different imaging parameters at different centres.

⁹ Granted, a fixed, restricted-access test set is perhaps necessary in the competition setting, to prevent over-fitting by competitors.

In order to illustrate the conservatism of this approach, the performance of the VLR model will be estimated under several other CV frameworks for comparison:

• LOSO – Leave-One-Source-Out: Withhold all examples from one source from the training set; use these as the test set; repeat S times.

• LOO – Leave-One-Out: Use all images except one as the training set; use it as the test case; repeat N times.

• KF-CV – K-Fold Cross Validation: Use all images except a random batch of images as the training set; use these as the test set; repeat K times (without replacement).

• OSAAT – One-Scanner-At-A-Time: Use all images from a single scanner except one as the training set; use it as the test case; repeat N times.

• No-CV – No Cross Validation: Train and test the model on all available data; no repetition.

The No-CV framework is obviously not a valid cross validation method, since it involves training and testing on the same data. However, it lends insight into the maximum possible model performance.

3.3. Performance Evaluation

MA

To establish a baseline of human performance in conjunction with Table 1, we quantify agreement between manual raters in the two of our datasets which provide them. The 15 images in the MS 2016 dataset [87] are each segmented by 7 different raters, while the 21 images in the MS 2015 ISBI dataset [86] are segmented by two raters. We then evaluate the performance of the proposed method, Voxel-Wise Logistic Regression (VLR), based on voxel-wise and segmented LL agreement with manual segmentations. Finally, we compare our method to one other freely available FLAIR-only WMH segmentation tool (LPA).

3.3.1. Evaluation Metrics

Voxel-wise agreement is quantified using the following measures, in terms of the numbers of true positive (TP), false positive (FP), and false negative (FN) voxels:

• Similarity Index (SI) (aka Dice Similarity Coefficient, F1-Score):

      SI = 2TP / (2TP + FP + FN)

• Precision (Pr) (aka Overlap Fraction, Positive Predictive Value):

      Pr = TP / (TP + FP)

• Recall (Re) (aka Sensitivity, True Positive Rate):

      Re = TP / (TP + FN)
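These three measures can be computed directly from binary masks, e.g. (an illustrative sketch):

```python
import numpy as np

def voxelwise_metrics(pred, truth):
    """Similarity Index (Dice), Precision, and Recall from binary masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)    # true positive voxels
    fp = np.sum(pred & ~truth)   # false positives
    fn = np.sum(~pred & truth)   # false negatives
    si = 2 * tp / (2 * tp + fp + fn)
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return si, pr, re
```

Note that all three are undefined when their denominators are zero (e.g. an empty prediction against an empty reference); a real pipeline must decide how to score those edge cases.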


Volume agreement between segmentations is characterized using the 2-way mixed-effects single-rater absolute intraclass correlation coefficient (ICC)¹⁰ [106]. Trends in over/undersegmentation with LL are illustrated using a Bland-Altman plot [107].

3.3.2. Comparison with Available Methods

We commend the authors of several freely available WMH segmentation tools for deploying their algorithms in usable form. Of the available tools [28, 42, 47, 49, 53, 55, 89], only the method by Schmidt [89] is designed for FLAIR-only segmentation. The older method by Schmidt et al. [53] requires a T1 sequence, and the method by Sweeney et al. [55] requires T1, T2, FLAIR, and PD modalities. The methods by Van Leemput et al. [28], Souplet et al. [42], Shiee et al. [47], and García-Lorenzo et al. [49] were designed for some combination of T1, T2, FLAIR, and PD, and are flexible to the selected inputs; however, they were not designed for FLAIR-only use, and it is unclear how they will perform under such conditions. Therefore, we compare only with the LPA method by Schmidt [89].

The LPA algorithm produces probabilistic lesion segmentations. For a fair comparison, we optimize the threshold π for binarization of these images using the same LOSO-CV folds, similar to the post-processing optimization of our method. No other aspect of the LPA algorithm permitted user tuning, so the remaining components were left unchanged.
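The threshold selection used for both methods can be sketched as a search over π maximizing the mean SI on training data. Here a simple grid search stands in for the paper's simplex optimization (MATLAB fminsearch); names are ours.

```python
import numpy as np

def tune_threshold(prob_images, truth_images, grid=np.linspace(0.05, 0.95, 91)):
    """Pick the binarization threshold pi maximizing mean SI over training pairs.

    prob_images : list of probability volumes; truth_images : list of boolean
    reference masks. Grid-search stand-in for a simplex optimizer.
    """
    def mean_si(pi):
        vals = []
        for p, t in zip(prob_images, truth_images):
            pred = p > pi
            tp = np.sum(pred & t)
            fp = np.sum(pred & ~t)
            fn = np.sum(~pred & t)
            denom = 2 * tp + fp + fn
            vals.append(2 * tp / denom if denom else 1.0)  # empty-vs-empty scores 1
        return np.mean(vals)
    return max(grid, key=mean_si)   # first grid value attaining the best mean SI
```

A derivative-free optimizer converges faster than a grid over many images, but the grid makes the objective explicit and is trivially reproducible.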


4. Results and Discussion

This section presents the results of experimentation, including analysis of inter-rater variability in manual segmentations, performance of the proposed method, and comparison with the LPA tool.

4.1. Inter-Rater Variability

Inter-rater SI was calculated among the 7 reviewers (7-choose-2 = 21 comparisons) for all 15 images in the MS 2016 dataset, and among the 2 reviewers (1 comparison) for all 21 images in the MS ISBI 2015 dataset. Mean ± standard deviation SI for all images and rater pairs, as well as the ICC between segmented lesion volumes, are shown in Table 5. These results are roughly consistent with the other reports in Table 1, and serve as a baseline for judging the performance of the proposed method. Unfortunately, only one manual segmentation per subject was available for the WMH 2017 dataset, precluding comparisons of automated and human performance for segmenting leukoaraiosis.

¹⁰ Option 'A-1' in the MATLAB function ICC from https://www.mathworks.com/matlabcentral/fileexchange/22099

Table 5: Calculated mean inter-rater agreement measures for manual WMH segmentation.

Dataset        Ref.   Raters   Data        SI          ICC
MSSEG 2016     [87]   7        15 images   0.63±0.16   0.91
MS 2015 ISBI   [86]   2        21 images   0.73±0.10   0.98

Figure 2: Example FLAIR image and parameter images from one LOSO-CV fold. Slice number in isotropic 1.5 mm MNI space is shown top left in each panel.

4.2. Proposed Method

Optimization of each regularization technique, including the values of λ and V, the smoothing kernel for β(x), and the various kinds and amounts of data augmentation, was explored using both single-voxel toy examples and overall segmentation performance in a user-guided quasi-minorize-maximization search [108]. Segmentation performance was defined as the LOSO-CV estimated median SI, in which the proposed method is trained and tested 7 times, in each case withholding and then testing on all images from one scanner. The results, already noted above, can be summarized as: λ = 10⁻², V = 1, Gaussian smoothing of β(x) using σ = 2 voxels (3 mm), and data augmentation comprising a reflection about the midline and shifts by 1 voxel in all 6 directions.

Using these definitions, this section presents the fitted VLR parameter images, as well as voxel-wise and LL-based performance metrics.

4.2.1. Parameter Images

Training the model yields a set of parameter images β(x). In our implementation, these are reparameterized via (13) to yield a threshold image T(x) and a sensitivity image S(x). One set of parameter images from a random CV fold is shown in Figure 2. These do not vary appreciably between folds, but inclusion of pseudo-lesions was important for convergence in voxels which saw no lesions in the training set.

The parameter images concisely illustrate a probabilistic decision boundary between lesion and non-lesion classes. FLAIR graylevels which are brighter than T(x) in the corresponding location are predicted to have Ĉ(x) > 0.5. Few other segmentation models can provide such a clear illustration of the discriminant behaviour, and the lack of artifacts (which might manifest in the popular Random Forest models) is worth noting.

The regions of low threshold appear to correspond to the typical distribution of WMH, whereas high thresholds depict areas of common false positives, or of rare lesion appearance. For example, the superficial tissues, insular ribbon, and midline all contain high thresholds, while the periventricular tissues contain much lower thresholds. Similarly, regions of low sensitivity S(x) include areas which commonly contain both lesions and false positives. These include the septum pellucidum, which is often hyperintense but inconsistently included in manual segmentations by different raters, and the peripheries of the ventricles, which represent the possible overlap of periventricular lesions and cerebrospinal fluid flow through artifacts, due to imperfect registration. In regions with higher sensitivity, classes have greater separability, such as the tissues posterior and cranial to the occipital horns of the lateral ventricles. We note that segmentation performance is much less affected by the sensitivity image than by the threshold image, and was not significantly affected by different degrees of smoothing (σ).

Table 6: Mean lesion load performance measures by scanner under the LOSO-CV regime.

Scanner         LL     SI     Pr     Re
WMH 2017 (1)    24     0.69   0.87   0.65
WMH 2017 (2)    17     0.81   0.82   0.78
WMH 2017 (3)    6      0.68   0.70   0.77
MS 2016 (1)     29     0.55   0.89   0.47
MS 2016 (2)     5      0.41   0.60   0.32
MS 2016 (3)     10     0.61   0.85   0.47
ISBI MS 2015    5      0.70   0.71   0.78
ALL             12     0.69   0.75   0.71

Figure 3: Example images from inference and performance analysis. Best viewed in colour.

4.2.2. Segmentation Performance

Median performance metrics for each scanner under LOSO-CV are summarized in Table 6, and an example segmentation is shown in Figure 3. Overall, the VLR algorithm achieves a median Similarity Index of 0.69, higher than other models reporting performance on several scanners (Table 1), though only 7 scanners are used in this work, compared to 10–67 in the other works. Figure 4 illustrates the distribution of SI, Pr, and Re versus LL, with trend line and confidence intervals. As is usually the case, performance is correlated with LL, though this trend is not significant for Recall. Relative to a preliminary investigation using FLAIR thresholding alone, the VLR model improved the median SI from 0.36 to 0.69. This suggests that spatial context is an important feature for lesion identification. For example, Figure 3 highlights the exclusion of septum pellucidum hyperintensity despite a graylevel similar to that of the surrounding lesions.

Performance was highest on the WMH 2017 data, and lowest on the MSSEG 2016 data (Table 6). This can be attributed to two factors. First, the LLs of the WMH 2017 subjects are significantly higher than those of the MSSEG 2016 subjects, conferring the performance improvements suggested above. Second, we note that the images from all three sources in these two datasets have similar scan parameters (Table 4) and more consistent disease characteristics. When testing images from each WMH 2017 source in LOSO-CV, there are 40 training images from the other two WMH 2017 sources, while there are only 10 training images from the two MSSEG 2016 sources when testing images from each MSSEG 2016 source. This means the characteristics of the WMH 2017 database are more easily perceived during training under LOSO-CV, facilitating better performance on these data.

Figure 4: Performance metrics versus lesion load. Scatter plot data are colored by scanner and show a 3rd-order trend line (dark grey) with 90% confidence intervals (light grey).

Figure 5: Bland-Altman plot (log space) showing volume agreement between manually and automatically segmented lesion loads.

Next, we tested whether the VLR algorithm has performance comparable to humans. If the agreement (SI) between automatic and manual segmentations is statistically indistinguishable from the agreement between two manual segmentations, then the performance can be considered similar. If the same group of subjects is used for both comparisons, then the non-parametric paired Wilcoxon signed-rank test can be used. This applies to the datasets with more than one manual segmentation provided (MSSEG 2016, MS 2015 ISBI). Using this test, smaller differences are more significant. However, it is also possible to compare all subjects in an unpaired test, like the Mann-Whitney rank-sum test. This permits the inclusion of the WMH 2017 data in the comparison, since these data have no measurement of human agreement. In this case, we concatenate all manual vs. manual SI (n = 15 + 21 = 36)¹¹ and compare with all automatic vs. manual SI (n = 96).
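For paired comparisons of this kind, the Wilcoxon statistic is just the sum of the ranks of the positive paired differences. A minimal sketch (no tie or zero-difference handling, and no p-value; real analyses should use a statistics package such as scipy.stats.wilcoxon, or MATLAB's signrank as in this paper):

```python
import numpy as np

def signed_rank_statistic(a, b):
    """Wilcoxon signed-rank statistic W+ for paired samples (illustrative only)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0]                                    # drop zero differences
    ranks = np.argsort(np.argsort(np.abs(d))) + 1    # 1-based ranks of |d|
    return float(np.sum(ranks[d > 0]))               # sum ranks of positive diffs
```

The paired construction is what makes the test sensitive to small but consistent differences between conditions evaluated on the same subjects.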

Differences were significant for the MS 2015 ISBI comparison (p = 0.037), but not for the MSSEG 2016 comparison (p = 0.086), implying that the VLR algorithm is comparable to human performance on the MSSEG 2016 data.

¹¹ For the MSSEG 2016 data, with 21 (7-choose-2) comparisons per subject, the median SI is used for each of the 15 subjects.

4.2.3. Volume Analysis

The ICC between automatically estimated and manually assessed total LL over all images was 0.71, lower than the manual inter-rater ICCs reported in Tables 1 and 5. Trends in volume agreement are shown in the Bland-Altman plot (Figure 5), drawn in log space to better illustrate effects at small LL. Note that this requires the approximation of LL = 0 with LL = 10⁰ = 1 (mL). The VLR algorithm tends to underestimate the total LL, especially for large loads, resulting in a fitted slope of 0.857 mL/mL. This is likely attributable to the histogram matching step, which acts to equalize the proportion of hyperintensity in the image, diminishing large LL. Thus, we believe it could be overcome by using more advanced graylevel standardization techniques.

It is worth noting that the volume of voxels across CV folds which observed at least one lesion during training was 938 [917 − 940] mL; the complement volume was 739 [738 − 761] mL – i.e. both classes are observed in just over half the brain using the augmented training data. However, the volume of voxels in which there were no lesions during training, but which subsequently contained lesions in the test set – the target of pseudo-lesion regularization – was only 21 [19 − 41] mL. This makes the design of such regularizations risky, since the number of potential FP far outweighs the number of possible FN → TP conversions.

Figure 6: Comparison of the estimated model performance using different cross validation methods. Box plots show median (centre line), 25th and 75th percentiles (box), extreme values (whiskers), and outliers (+). Best viewed in colour.

4.2.4. Comparison of Cross Validation Frameworks

In order to demonstrate the overestimation of segmentation performance by conventional CV frameworks, the VLR model was cross validated using several different methods. The results for each performance metric

SC

are summarized in Figure 6, and a paired non-parametric statistical test12 was used to test for significant differences among the conditions. Paired tests are more sensitive to smaller differences than unpaired tests, but can be used here because the samples (MRI and manual segmentations) are identical across test

NU

conditions (CV frameworks). This is why some comparisons are significant despite box overlap in Figure 6. The No-CV condition yielded the highest median SI, at 0.72, and was significantly higher in SI and P r

MA

comparisons with LOO, KF, and LOSO frameworks. The LOSO-CV condition consistently gave the lowest SI, at 0.69, and statistical comparisons showed that these differences were significant for all other conditions: LOO, KF, OSAAT, and No-CV.

These results suggest the potential for overestimation of generalization performance using conventional

ED

CV procedures, since characteristics of the test scanner are perceived during model optimization. For example, when training a supervised classifier under LOSO-CV, test image features may exist in a subspace

PT

of the feature domain which is sparsely observed during training (under conventional CV frameworks, this is not the case). Classification of these data would therefore be informed by only a few outlier samples, and

AC CE

so performance would likely be worse. This scenario parallels segmentation of data from truly unseen distributions, which is realistic for algorithm use by the general community. Finally, we note that performance estimation differences between CV frameworks, which were small but significant here, would likely be even higher in models with less regularization, since more over-fitting would likely occur. We note that LOSO-CV is also applicable to unsupervised algorithms with tunable parameters. At the risk of contradicting the “unsupervised” label, optimization of such parameters should be both automated and cross validated when demonstrating the performance of these models, or else heuristic selection during model development based on available data may prove suboptimal for data from other sources [28]. Now, it is possible that even under LOSO-CV, certain model hyperparameters can be over-tuned to the validation data, since algorithm development inevitably includes iterations informed by validation performance. However, the more heterogeneous the validation data, the more representative they become of the 12 The

paired non-parametric test was signrank in MATLAB.

22

ACCEPTED MANUSCRIPT

Figure 7: Performance measures of the proposed method (VLR) vs. the LPA algorithm, stratified by lesion load. See Figure 6 for description of the box plots. Best viewed in colour.

true gamut of expected images, and the more confidence can be assigned to the predicted generalization performance. This is why large, multi-scanner databases are critical for MRI analysis algorithm validation. Finally, we emphasize that this framework is presented and demonstrated for the WMH segmentation task,

PT

but LOSO-CV can be adopted in a variety imaging modalities and tasks.

RI

4.3. Comparison with LPA

The probabilistic lesion images produced by the LPA segmentation tool were binarized using the opti-

SC

mized thresholds for each LOSO-CV fold (0.22 [0.20 − 0.23]), in order to give a fair comparison with the proposed method. Binary lesion masks were then compared to those produced by the VLR algorithm, and

large: > 22 mL, n = 32) as shown in Figure 7.

NU

performance metrics were stratified by LL tertiles (small: < 4 mL, n = 32; medium: 4 − 22 mL, n = 32;

Overall, the proposed VLR algorithm outperformed the LPA algorithm in voxel-wise performance (me-

MA

dian VLR SI = 0.69 vs. LPA SI = 0.58, Wilcoxon signed-rank test p < 0.001). The LPA algorithm was less precise but had higher recall (fewer false negatives), especially for large LL (Figure ??). This resulted in better volumetric agreement by LPA for large LL. However, in all LL groupings, VLR had higher precision

ED

(fewer false positives), as shown in Figure ??. This is likely attributable to a combination of two factors. First, the current model has increased flexibility to model spatial relationships in the distributions of lesion and healthy tissues (Figure 2). since it is not subject to artifacts of the estimation procedure used by LPA,

PT

as described in [90]. Second, the optimization of the VLR model using data from 6 scanners during each LOSO-CV fold likely improves its generalization performance, relative to the LPA algorithm, for which the


single spatial parameter image is fixed, and was derived using images from only a single source. Again, the importance of multi-source data is emphasized for both model training and validation.

4.4. 2017 WMH Segmentation Challenge Results

The proposed method was submitted to the 2017 WMH Challenge. The challenge training data are described above (Table 4), while the testing data comprised 110 images from 5 different scanners (30 + 30 + 30 + 10 + 10), the last two being unseen during training. Teams were ranked using a combination of 5 performance metrics.13 The VLR method achieved a mean SI of 0.70 on the test data, or a scaled challenge score of 82.3%, ranking VLR 8th on this metric. Other metrics considered individual lesion identification, for which the VLR method did not perform as well, since these metrics were not considered during model development.

13 For more information on WMH 2017 Challenge evaluation see: http://wmh.isi.uu.nl/evaluation/.


Figure 8: Results report for the submitted method provided by the WMH Segmentation Competition.


Overall, the VLR method ranked 15th of the 20 teams, with a score of 0.4159 (lower is better). The performance report provided by the challenge organizers is given in Figure 8. The 2017 WMH Challenge was significantly more competitive than the 2016 MSSEG Challenge, with the top-performing methods achieving SI of 0.80 and 0.59, respectively. This is likely attributable to some combination of three factors. First, there are differences in the image and disease characteristics, which may make WMH segmentation easier in the 2017 data. It is difficult to decorrelate this, however, from a second factor: a four-fold increase in the training set size, from 15 images to 60, which facilitates supervised learning of highly parametrized models. The third factor concerns the methods employed. The 2016 Challenge saw only 4 of 15 methods use deep learning models, whereas the 2017 Challenge saw 15 of 20 methods use deep learning, and in the latter competition, the top 13 methods all used this type of approach. Practically all of these methods use fully convolutional neural networks, specifically variants on the U-Net architecture [109]. It is worth noting that the VLR method was the best non-deep method in Similarity Index performance,


achieving an SI 10% higher than the next best non-deep method (0.70 vs 0.60), and it was the second best performing non-deep method overall (using all 5 metrics). While deep learning methods clearly show quantitative gains in performance metrics, they have been criticized for lack of interpretability, and for limitations to reliability due to their capacity to over-fit, as illustrated by Szegedy et al. [110]. In fact, there is a significant difference in SI performance (0.105, p < 0.005) on data from “seen” versus “unseen” scanners across the average deep learning model in the competition. Conversely, this difference for the VLR model was not significant (0.007, p > 0.05). As such, the VLR model provides a reliable and easily interpretable segmentation algorithm (cf. Figure 2), which still has room for improvement, as discussed in the following sections.

4.5. Summary

The proposed method, inspired by the LPA algorithm [89], can be seen as a hybrid between the works by Sweeney et al. [55] and Harmouche et al. [60], in that lesions are predicted by logistic regression, but the fitted parameters vary spatially. VLR overcomes the major challenges to WMH segmentation outlined in § 1.2.1. In particular, robust management of image variability emerged as a central goal of our work, after recognizing the multiplicative challenge posed by this aspect during model development. For example, the overlap of WMH greylevel distributions with healthy tissue classes and image artifacts is managed robustly in the VLR model through expansion of the feature space to include standardized spatial dimensions. Confounding hyperintensities such as flow-through artifacts and bright grey matter are then excluded by spatial location. Several previously proposed approaches and our preliminary work explored the feasibility of graylevel thresholding techniques.

14 Detailed WMH 2017 results and competitor methods descriptions are available at: http://wmh.isi.uu.nl/results/.
15 A summary of the MSSEG 2016 results is available at: https://portal.fli-iam.irisa.fr/documents/20182/33769/Results+MSSEG+Challenge+2016/.


However, without spatial context, we found that these methods could not easily adapt to images with different contrasts and noise characteristics, since threshold definitions were too sensitive to this variability. The challenge of segmenting lesions surrounded by ambiguously pathological dirty-appearing white matter is managed by probabilistic estimation of the lesion class, which can be subsequently thresholded at a level reflecting the desired specificity. Small lesions with diminished hyperintensity due to partial volume averaging can be similarly segmented, regardless of the variability in slice thickness, and do not contribute to inaccuracies in estimated global parametric graylevel distributions required by other models. While bias field is not specifically addressed by the VLR method, any of several stand-alone tools could be applied as preprocessing, such as the N4 algorithm or SPM Segment. Similarly, a variety of MR sequence non-specific tools for image registration are available, including FSL FLIRT and SPM Coregister.
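To make the voxel-wise formulation concrete, the following is a toy sketch of logistic regression with spatially varying coefficients; the parameter values and feature values are invented for illustration and are not the fitted VLR parameter images:

```python
import math

# Voxel-wise logistic regression sketch: every voxel i carries its own
# intercept b0[i] and graylevel weight b1[i] (the "parameter images"),
# applied to the standardized FLAIR graylevel at that voxel.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def vlr_predict(b0, b1, flair):
    """Return P(lesion) per voxel for flattened, same-length lists."""
    return [sigmoid(b0[i] + b1[i] * flair[i]) for i in range(len(flair))]

b0 = [-4.0, -4.0, -1.0]   # spatially varying intercepts (toy values)
b1 = [ 2.0,  0.5,  3.0]   # spatially varying graylevel weights (toy values)
flair = [2.5, 2.5, 2.5]   # the same graylevel observed at three locations

probs = vlr_predict(b0, b1, flair)
# the same graylevel maps to different lesion probabilities by location
```

The resulting probability map can then be thresholded at a level reflecting the desired specificity, as described above.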


As evidenced by the LOSO-CV results, the VLR method makes strides towards our goal of consistent performance on images from different sources. Differences in tissue contrasts due to scanner characteristics and sequence parameters are standardized through histogram matching, in which only the ranking of WMH graylevels relative to healthy tissues is considered. This is contrasted with algorithms involving parametric mixture models, which are subject to corruption by image artifacts, partial volume averaging, and errors of distribution approximation, which likely makes robust generalization more difficult. Issues of variable anatomy and image resolution are also overcome by voxel-wise inference, following registration and transformation of the parameter images to the native space. Overall, the VLR algorithm demonstrates strong performance under challenging validation conditions (median SI = 0.69, Pr = 0.83, Re = 0.63), similar to the works noted in Table 3 [41, 60, 61, 74]. This performance was indistinguishable from human performance


in an unpaired test.
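A minimal rank-based histogram matching sketch (toy 1-D intensities; the actual implementation operates on full image histograms) illustrates how only the ranking of graylevels is preserved when mapping a source image onto a reference distribution:

```python
# Map each source intensity to the reference intensity at the same
# proportional rank, so graylevel ORDER is preserved while the intensity
# scale is taken from the reference.

def match_histogram(source, reference):
    ref_sorted = sorted(reference)
    n_src, n_ref = len(source), len(reference)
    order = sorted(range(n_src), key=lambda i: source[i])
    out = [0.0] * n_src
    for rank, i in enumerate(order):
        j = rank * (n_ref - 1) // (n_src - 1)  # proportional reference rank
        out[i] = ref_sorted[j]
    return out

src = [3.0, 9.0, 5.0, 7.0]        # toy source intensities
ref = [10.0, 20.0, 30.0, 40.0]    # toy reference intensities

print(match_histogram(src, ref))  # ranks preserved: [10.0, 40.0, 20.0, 30.0]
```

Because only the ranking matters, the mapping is insensitive to the absolute contrast of the source scanner, which is the property exploited by the standardization step described above.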

Several other results also lend insight into the task of WMH segmentation in general. The model parameter images illustrate a probabilistic decision boundary in the feature space composed of spatial coordinates and FLAIR graylevel features. Additionally, we demonstrate how the use of synthetic lesions can be helpful for improving generalization performance. This regularization trick can be easily incorporated into the training data of any discriminative model, provided the expected features of each class are known a priori.

4.6. Future Work

The VLR algorithm is not without fault. Foremost, the NoCV results (median SI = 0.72) suggest that there is a limit to model performance using these features and pre/postprocessing. In fact, the reduction in median SI from this result under LOSO-CV (median SI = 0.69) is only 3%. This implies that current over-fitting is minimal, and that improved pseudo-lesion and parameter regularizations are unlikely to yield large gains. Moreover, the volume of lesion which might be missed without regularization is only a tiny fraction of the region in which lesions could be expected to be found. This happy disparity is afforded by a large (freely available) training dataset, which contains example lesions in most of the common spatial


locations. Therefore, the most promising targets for improving model performance are the preprocessing steps and feature definition. In particular, we observe a central tendency of the estimated lesion load (high for small LL, low for high LL). This is likely attributable to the histogram matching step, since this nonlinear transform acts to equalize the proportion of hyperintensity in the image. Thus, the impact of LL on this step should be minimized. One solution could involve two iterations of the segmentation, in which estimated lesion voxels from the first pass are masked before histogram matching in the second pass. Also, the SPM12 “Segment” image registration employed here constrained the maximum permissible deformation for stability reasons,


resulting in some misalignment of brain structures between subjects, notably the ventricles. Exploration of other tools or cascaded T1-based registration may yield improved precision of the fitted parameter images, and in turn segmentation performance gains.
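The proposed two-pass refinement can be sketched as follows; here `standardize` is a toy stand-in (rescaling by the maximum over non-masked voxels) for the actual histogram matching step, and the cutoff and intensities are arbitrary illustrative values:

```python
# Two-pass idea: estimated lesion voxels from a first pass are masked out
# before the intensity standardization is recomputed, so a large lesion
# load no longer skews the reference used for matching.

def standardize(intensities, lesion_mask):
    """Rescale by the brightest NON-masked (presumed healthy) voxel."""
    healthy_peak = max(v for v, m in zip(intensities, lesion_mask) if not m)
    return [v / healthy_peak for v in intensities]

raw = [1.0, 1.1, 3.0, 3.2]  # two healthy voxels, two bright lesion voxels

# crude first-pass lesion estimate (fixed relative cutoff, illustrative only)
first_pass = [v / max(raw) > 0.8 for v in raw]

# second pass: the reference now excludes the estimated lesion voxels
second = standardize(raw, first_pass)
print([round(v, 2) for v in second])  # → [0.91, 1.0, 2.73, 2.91]
```

With the lesions masked, the standardized lesion-to-healthy contrast is restored rather than being compressed by the lesions' own contribution to the reference.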

Furthermore, while we maintain generality in deriving the model for any number of features, we have not yet explored the use of additional features besides FLAIR graylevel. Though single-sequence segmentation is attractive, most clinical protocols include a T1 sequence, and all the datasets used here provide it. Inclusion of features derived from T1, T2, or other sequences could be explored, though the extracted features must be transformed in order to ensure they are monotonically related to the lesion class for use in the VLR model. Such transformations could employ a so-called “kernel trick” [111]; these investigations will be the subject of future work.


Finally, further investigations will explore the application of the LOSO-CV framework to the available WMH segmentation toolboxes, noted previously [28, 42, 47, 49, 53, 55, 89]. This characterization of scanner-wise biases and overall performance can help inform algorithm selection by non-experts in the field for related research workflows.

5. Conclusion

This work was motivated by the lack of a widely accepted algorithm for WMH segmentation, which stands out among other neuroimage analysis tasks. The specific challenges to this task and limitations of previous approaches were outlined, particularly considering image variability, and used to inform development of a novel WMH segmentation algorithm: Voxel-Wise Logistic Regression. In demonstrating the performance of the new model in anticipation of open-source release, special consideration was given to estimating model generalization performance on images from unseen sources. This yielded a rediscovery of the multi-source cross validation framework [105], which we call LOSO-CV (Leave-One-Source-Out Cross Validation), and which gives more realistic (lower) estimates of expected generalization performance. The LOSO-CV framework also reduces potential bias associated with selection of fixed training and testing sets by leveraging all available data for both roles. In the spirit of open-source research, and to facilitate direct comparisons by


future works, all performance analysis in this work is done using freely available data from several WMH segmentation competitions [83, 86, 87]. The VLR algorithm is also available on GitHub.16

Acknowledgements

The research was supported in part by the Natural Sciences and Engineering Research Council of Canada


(NSERC CGS-M) and by the Ontario Ministry of Advanced Education and Skills Development (OGS-M). We are also grateful to Dhanesh Ramachandram, Thor Jonsson, Colin Brennan, Terrance DeVries, Dylan


Drover, Griffin Lacey, Carolyn Augusta, and Brittany Reiche for their insights and discussions throughout


the project.

16 github.com/uoguelph-mlrg/vlr


References

[1] R. Bakshi, A. Minagar, Z. Jaisani, J. S. Wolinsky, Imaging of multiple sclerosis: role in neurotherapeutics, NeuroRx: the journal of the American Society for Experimental NeuroTherapeutics 2 (2) (2005) 277–303, ISSN 1545-5343, DOI 10.1602/neurorx.2.2.277, URL http://www.ncbi.nlm.nih.gov/pubmed/15897951.
[2] J. M. Wardlaw, M. C. Valdés Hernández, S. Muñoz-Maniega, What are white matter hyperintensities made of? Relevance to vascular cognitive impairment, Journal of the American Heart Association 4 (6) (2015) 001140, ISSN 2047-9980, DOI 10.1161/JAHA.114.001140.


[3] A. Kutzelnigg, C. F. Lucchinetti, C. Stadelmann, W. Br¨ uck, H. Rauschka, M. Bergmann, M. Schmidbauer, J. E. Parisi, H. Lassmann, Cortical demyelination and diffuse white matter injury in multiple sclerosis., Brain : a journal of neurology


128 (Pt 11) (2005) 2705–12, ISSN 1460-2156, DOI 10.1093/brain/awh641, URL http://brain.oxfordjournals.org/ content/128/11/2705.long.

[4] H. Lassmann, Mechanisms of inflammation induced tissue injury in multiple sclerosis, Journal of the Neurological Sciences 274 (1-2) (2008) 45–47, ISSN 0022510X, DOI 10.1016/j.jns.2008.04.003.


[5] D. H. Mahad, B. D. Trapp, H. Lassmann, Pathological mechanisms in progressive multiple sclerosis, The Lancet Neurology 14 (2) (2015) 183–193, ISSN 14744465, DOI 10.1016/S1474-4422(14)70256-X, URL http://www.sciencedirect.com/ science/article/pii/S147444221470256X.


[6] M. A. A. van Walderveen, W. Kamphorst, P. Scheltens, J. H. T. M. van Waesberghe, R. Ravid, J. Valk, C. H. Polman, F. Barkhof, Histopathologic correlate of hypointense lesions on T1-weighted spin-echo MRI in multiple sclerosis, Neurology 50 (5) (1998) 1282–1288, ISSN 0028-3878, 1526-632X, DOI 10.1212/WNL.50.5.1282, URL

http://www.ncbi.nlm.nih.gov/pubmed/9595975http://www.neurology.org/content/50/5/1282{%}5Cnhttp:


//www.ncbi.nlm.nih.gov/pubmed/9595975{%}5Cnhttp://www.neurology.org/content/50/5/1282.full.pdf. [7] L. Pantoni, Cerebral small vessel disease: from pathogenesis and clinical characteristics to therapeutic challenges, The Lancet Neurology 9 (7) (2010) 689–701, ISSN 14744422, DOI 10.1016/S1474-4422(10)70104-6.


[8] S. Debette, H. S. Markus, The clinical importance of white matter hyperintensities on brain magnetic resonance imaging: systematic review and meta-analysis, BMJ 341, ISSN 0959-8138, DOI 10.1136/bmj.c3666, URL http://www.bmj.com/ content/341/bmj.c3666.


[9] S. Debette, A. Beiser, C. DeCarli, R. Au, J. J. Himali, M. Kelly-Hayes, J. R. Romero, C. S. Kase, P. A. Wolf, S. Seshadri, Association of MRI markers of vascular brain injury with mild cognitive impairment, incident stroke and mortality: The framingham offspring study, Alzheimer’s and Dementia 1) (4) (2009) 128, ISSN 1552-5260, DOI http://dx.doi.org/10.1016/j.jalz.2009.05.432, URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847685/. [10] H. M. Snyder, R. A. Corriveau, S. Craft, J. E. Faber, S. M. Greenberg, D. Knopman, B. T. Lamb, T. J. Montine, M. Nedergaard, C. B. Schaffer, J. A. Schneider, C. Wellington, D. M. Wilcock, G. J. Zipfel, B. Zlokovic, L. J. Bain, F. Bosetti, Z. S. Galis, W. Koroshetz, M. C. Carrillo, Vascular contributions to cognitive impairment and dementia including Alzheimer’s disease, Alzheimer’s and Dementia 11 (6) (2015) 710–717, ISSN 15525279, DOI 10.1016/j.jalz.2014.10.008, URL http://www.sciencedirect.com.subzero.lib.uoguelph.ca/science/article/pii/S1552526014028623. [11] C. H. Polman, S. C. Reingold, B. Banwell, M. Clanet, J. A. Cohen, M. Filippi, K. Fujihara, E. Havrdova, M. Hutchinson, L. Kappos, F. D. Lublin, X. Montalban, P. O’Connor, M. Sandberg-Wollheim, A. J. Thompson, E. Waubant, B. Weinshenker, J. S. Wolinsky, Diagnostic criteria for multiple sclerosis: 2010 revisions to the McDonald criteria., Annals of neurology 69 (2) (2011) 292–302, ISSN 1531-8249, DOI 10.1002/ana.22366, URL http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3084507{&}tool=pmcentrez{&}rendertype=abstract. [12] G. M. McKhann, D. S. Knopman, H. Chertkow, B. T. Hyman, C. R. Jack, C. H. Kawas, W. E. Klunk, W. J. Koroshetz,


J. J. Manly, R. Mayeux, R. C. Mohs, J. C. Morris, M. N. Rossor, P. Scheltens, M. C. Carrillo, B. Thies, S. Weintraub, C. H. Phelps, The diagnosis of dementia due to Alzheimer’s disease: Recommendations from the National Institute on AgingAlzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease, Alzheimer’s and Dementia 7 (3) (2011) 263–269, ISSN 15525260, DOI 10.1016/j.jalz.2011.03.005, URL http://linkinghub.elsevier.com/retrieve/ pii/S1552526011001014http://www.sciencedirect.com/science/article/pii/S1552526011001014. [13] D. H. Miller, P. S. Albert, F. Barkhof, G. Francis, J. A. Frank, S. Hodgkinson, F. D. Lublin, D. W. Paty, S. C. Reingold, J. Simon, Guidelines for the use of magnetic resonance techniques in monitoring the treatment of multiple sclerosis,


Annals of Neurology 39 (1) (1996) 6–16, ISSN 03645134, DOI 10.1002/ana.410390104, URL http://www.ncbi.nlm.nih. gov/pubmed/8572668.

[14] K. Fahrbach, R. Huelin, A. Martin, E. Kim, H. Dastani, M. Mahlhotra, S. Rao, Relating Relapse and T2 Lesion


Changes to Disability Progression in MS: A Systematic Literature Review and Regression Analysis (P05.090), Neurology 78 (Meeting Abstracts 1) (2012) P05.090–P05.090, ISSN 0028-3878, DOI 10.1212/WNL.78.1_MeetingAbstracts.P05.090,


URL http://www.ncbi.nlm.nih.gov/pubmed/24245966.

[15] M. P. Sormani, P. Bruzzi, MRI lesions as a surrogate for relapses in multiple sclerosis: A meta-analysis of randomised trials, The Lancet Neurology 12 (7) (2013) 669–676, ISSN 14744422, DOI 10.1016/S1474-4422(13)70103-0, URL http:


//www.sciencedirect.com/science/article/pii/S1474442213701030.

[16] L. Pantoni, M. Simoni, G. Pracucci, R. Schmidt, F. Barkhof, D. Inzitari, Visual Rating Scales for Age-Related White Matter Changes (Leukoaraiosis): Can the Heterogeneity Be Reduced?, Stroke 33 (12) (2002) 2827–2833, ISSN 0039-2499, DOI 10.1161/01.STR.0000038424.70926.5E, URL http://stroke.ahajournals.org/content/33/12/2827.full.


[17] C. Egger, R. Opfer, C. Wang, T. Kepp, M. P. Sormani, L. Spies, M. Barnett, S. Schippling, MRI FLAIR lesion segmentation in Multiple Sclerosis: Does automated segmentation hold up with manual annotation?, NeuroImage: Clinical 13 (2016) 264–270, ISSN 22131582, DOI 10.1016/j.nicl.2016.11.020, URL http://www.sciencedirect.com/science/ article/pii/S2213158216302285.


[18] R. Harmouche, L. Collins, D. Arnold, S. Francis, T. Arbel, Bayesian MS lesion classification modeling regional and local spatial information, in: Proceedings - International Conference on Pattern Recognition, vol. 3, IEEE, ISBN 0769525210, ISSN 10514651, 984–987, DOI 10.1109/ICPR.2006.318, URL http://ieeexplore.ieee.org/document/1699691/, 2006.


[19] R. de Boer, H. A. Vrooman, F. van der Lijn, M. W. Vernooij, M. A. Ikram, A. van der Lugt, M. M. B. Breteler, W. J. Niessen, White matter lesion extension to automatic brain tissue segmentation on MRI., NeuroImage 45 (4) (2009) 1151– 61, ISSN 1095-9572, DOI 10.1016/j.neuroimage.2009.01.011, URL http://www.sciencedirect.com/science/article/


pii/S1053811909000561.

[20] M. D. Steenwijk, P. J. W. Pouwels, M. Daams, J. W. van Dalen, M. W. A. Caan, E. Richard, F. Barkhof, H. Vrenken, Accurate white matter lesion segmentation by k nearest neighbor classification with tissue type priors (kNN-TTPs), NeuroImage: Clinical 3 (2013) 462–9, ISSN 2213-1582, DOI 10.1016/j.nicl.2013.10.003, URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3830067&tool=pmcentrez&rendertype=abstract.
[21] X. Lladó, A. Oliver, M. Cabezas, J. Freixenet, J. C. Vilanova, A. Quiles, L. Valls, L. Ramió-Torrentà, À. Rovira, Segmentation of multiple sclerosis lesions in brain MRI: A review of automated approaches, Information Sciences 186 (1) (2012) 164–185, ISSN 0020-0255, DOI 10.1016/j.ins.2011.10.011.
[22] D. Garcia-Lorenzo, S. Francis, S. Narayanan, D. L. Arnold, D. L. Collins, Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging, Medical Image Analysis 17 (1) (2013) 1–18, ISSN 1361-8423 (Electronic), DOI 10.1016/j.media.2012.09.004.
[23] M. E. Caligiuri, P. Perrotta, A. Augimeri, F. Rocca, A. Quattrone, A. Cherubini, Automatic Detection of White Matter Hyperintensities in Healthy Aging and Pathology Using Magnetic Resonance Imaging: A Review, Neuroinformatics 13 (3) (2015) 261–276, ISSN 15392791, DOI 10.1007/s12021-015-9260-y, URL http://link.springer.com/10.1007/


s12021-015-9260-y. [24] A. Khademi, A. Venetsanopoulos, A. R. Moody, Robust white matter lesion segmentation in FLAIR MRI., IEEE transactions on bio-medical engineering 59 (3) (2012) 860–871, ISSN 1558-2531 (Electronic), DOI 10.1109/TBME.2011.2181167. [25] Y. Ge, R. I. Grossman, J. S. Babb, J. He, L. J. Mannon, Dirty-Appearing White Matter in Multiple Sclerosis: Volumetric MR Imaging and Magnetization Transfer Ratio Histogram Analysis, American Journal of Neuroradiology 24 (10) (2003) 1935–1940, ISSN 01956108, URL http://www.ncbi.nlm.nih.gov/pubmed/14625213. [26] J. M. Wardlaw, E. E. Smith, G. J. Biessels, C. Cordonnier, F. Fazekas, R. Frayne, R. I. Lindley, J. T. O’Brien, F. Barkhof,


O. R. Benavente, S. E. Black, C. Brayne, M. Breteler, H. Chabriat, C. DeCarli, F. E. de Leeuw, F. Doubal, M. Duering, N. C. Fox, S. Greenberg, V. Hachinski, I. Kilimann, V. Mok, R. van Oostenbrugge, L. Pantoni, O. Speck, B. C. Stephan, S. Teipel, A. Viswanathan, D. Werring, C. Chen, C. Smith, , Neuroimaging standards for research into small vessel disease


and its contribution to ageing and neurodegeneration, The Lancet Neurology 12 (8) (2013) 822–838, ISSN 14744422, DOI 10.1016/S1474-4422(13)70124-8, URL http://www.ncbi.nlm.nih.gov/pubmed/23867200.


[27] L. G. Nyul, J. K. Udupa, On standardizing the MR image intensity scale., Magnetic resonance in medicine 42 (6) (1999) 1072–1081, ISSN 0740-3194 (Print).

[28] K. Van Leemput, F. Maes, D. Vandermeulen, A. Colchester, P. Suetens, Automated segmentation of multiple sclerosis le-


sions by model outlier detection, IEEE Transactions on Medical Imaging 20 (8) (2001) 677–688, DOI 10.1109/42.938237, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=938237. [29] C. R. Jack, P. C. O’Brien, D. W. Rettman, M. M. Shiung, Y. Xu, R. Muthupillai, A. Manduca, R. Avula, B. J. Erickson, FLAIR histogram segmentation for measurement of leukoaraiosis volume, Journal of Magnetic Resonance Imaging 14 (6)


(2001) 668–676, ISSN 10531807, DOI 10.1002/jmri.10011, URL http://www.ncbi.nlm.nih.gov/pubmed/11747022http: //www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2755497. [30] A. P. Zijdenbos, R. Forghani, A. C. Evans, Automatic pipeline analysis of 3-D MRI data for clinical trials: application to multiple sclerosis., IEEE transactions on medical imaging 21 (10) (2002) 1280–91, DOI 10.1109/TMI.2002.806283, URL


http://www.ncbi.nlm.nih.gov/pubmed/12585710.

[31] P. Anbeek, K. L. Vincken, M. J. P. van Osch, R. H. C. Bisschops, J. van der Grond, Probabilistic segmentation of white matter lesions in MR imaging., NeuroImage 21 (3) (2004) 1037–44, ISSN 1053-8119, DOI


10.1016/j.neuroimage.2003.10.012, URL http://www.sciencedirect.com/science/article/pii/S1053811903006621. [32] P. Anbeek, K. L. Vincken, G. S. van Bochove, M. J. P. van Osch, J. van der Grond, Probabilistic segmentation of brain tissue in MR imaging., NeuroImage 27 (4) (2005) 795–804, ISSN 1053-8119, DOI 10.1016/j.neuroimage.2005.05.046,


URL http://www.sciencedirect.com/science/article/pii/S1053811905003228. [33] F. Admiraal-Behloul, D. van den Heuvel, H. Olofsen, M. van Osch, J. van der Grond, M. van Buchem, J. Reiber, Fully automatic segmentation of white matter hyperintensities in MR images of the elderly, NeuroImage 28 (3) (2005) 607–617, ISSN 10538119, DOI 10.1016/j.neuroimage.2005.06.061. [34] Z. Lao, D. Shen, A. Jawad, B. Karacali, E. Melhem, R. Bryan, C. Davatziko, Automated Segmentation of White Matter Lesions in 3D Brain MR Images, using Multivariate Pattern Classification, in: 3rd IEEE International Symposium on Biomedical Imaging: Macro to Nano, 2006., IEEE, ISBN 0-7803-9576-X, 307–310, DOI 10.1109/ISBI.2006.1624914, URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1624914, 2006. [35] Y. Wu, S. K. Warfield, I. L. Tan, W. M. Wells, D. S. Meier, R. A. van Schijndel, F. Barkhof, C. R. G. Guttmann, Automated segmentation of multiple sclerosis lesion subtypes with multichannel MRI, NeuroImage 32 (3) (2006) 1205– 1215, ISSN 10538119, DOI 10.1016/j.neuroimage.2006.04.211, URL http://www.ncbi.nlm.nih.gov/pubmed/16797188. [36] B. R. Sajja, S. Datta, R. He, M. Mehta, R. K. Gupta, J. S. Wolinsky, P. A. Narayana, Unified approach for multiple sclerosis lesion segmentation on brain MRI., Annals of biomedical engineering 34 (1) (2006) 142–51, ISSN 0090-6964, DOI 10.1007/s10439-005-9009-0, URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1463248{&}tool=


pmcentrez{&}rendertype=abstract. [37] R. Khayati, M. Vafadust, F. Towhidkhah, M. Nabavi, Fully automatic segmentation of multiple sclerosis lesions in brain MR FLAIR images using adaptive mixtures method and Markov random field model., Computers in biology and medicine 38 (3) (2008) 379–390, ISSN 0010-4825 (Print), DOI 10.1016/j.compbiomed.2007.12.005. [38] M. Wels, M. Huber, J. Hornegger, Fully automated segmentation of multiple sclerosis lesions in multispectral MRI, Pattern Recognition and Image Analysis 18 (2) (2008) 347–350, DOI 10.1134/S1054661808020235, URL http://link. springer.com/10.1134/S1054661808020235.


[39] E. H. Herskovits, R. N. Bryan, F. Yang, Automated Bayesian segmentation of microvascular white-matter lesions in the ACCORD-MIND study., Advances in medical sciences 53 (2) (2008) 182–190, ISSN 1896-1126 (Print), DOI 10.2478/v10039-008-0039-3.


[40] S. Bricq, C. Collet, J.-P. Armspach, MS Lesion Segmentation based on Hidden Markov Chains, in: The MIDAS Journal - MS Lesion Segmentation (MICCAI 2008 Workshop), URL http://hdl.handle.net/10380/1450, 2008.


[41] T. B. Dyrby, E. Rostrup, W. F. C. Baare, E. C. W. van Straaten, F. Barkhof, H. Vrenken, S. Ropele, R. Schmidt, T. Erkinjuntti, L.-O. Wahlund, L. Pantoni, D. Inzitari, O. B. Paulson, L. K. Hansen, G. Waldemar, Segmentation of age-related white matter changes in a clinical multi-center study., NeuroImage 41 (2) (2008) 335–345, ISSN 1053-8119


(Print), DOI 10.1016/j.neuroimage.2008.02.024.

[42] J. Souplet, C. Lebrun, N. Ayache, G. Malandain, An Automatic Segmentation of T2-FLAIR Multiple Sclerosis Lesions, MICCAI Grand Challenge Workshop: Multiple Sclerosis Lesion Segmentation Challenge (2008) 1–11URL http://grand-challenge2008.bigr.nl/.

MA

[43] D. Garc´ıa-Lorenzo, J. Lecoeur, D. L. Arnold, D. L. Collins, C. Barillot, Multiple sclerosis lesion segmentation using an automatic multimodal graph cuts, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5762 LNCS, ISBN 3642042708, ISSN 03029743, 584–591, DOI 10.1007/978-3-642-04271-3_71, URL http://www.ncbi.nlm.nih.gov/pubmed/20426159, 2009.

ED

[44] A. Akselrod-Ballin, M. Galun, J. M. Gomori, M. Filippi, P. Valsasina, R. Basri, A. Bradnt, Automatic Segmentation and Classification of Multiple Sclerosis in Multichannel MRI, IEEE Transactions on Biomedical Engineering

2461{_}asacomsimm.xml.

PT

56 (10) (2009) 2461 – 2469, ISSN 00189294, URL http://resolver.scholarsportal.info/resolve/00189294/v56i0010/

[45] C. Schwarz, E. Fletcher, C. Decarli, O. Carmichael, Fully-automated white matter hyperintensity detection with anatomical prior knowledge and without FLAIR, in: Lecture Notes in Computer Science (including subseries Lecture Notes in

AC CE

Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5636 LNCS, NIH Public Access, ISBN 3642024971, ISSN 03029743, 239–251, DOI 10.1007/978-3-642-02498-6_20, URL http://www.ncbi.nlm.nih.gov/pubmed/19694267, 2009. [46] E. Gibson, F. Gao, S. E. Black, N. J. Lobaugh, Automatic segmentation of white matter hyperintensities in the elderly using FLAIR images at 3T, Journal of Magnetic Resonance Imaging 31 (6) (2010) 1311–1322, ISSN 15222586, DOI 10.1002/jmri.22004, URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2905619{&}tool=pmcentrez{&}rendertype=abstract. [47] N. Shiee, P.-L. Bazin, A. Ozturk, D. S. Reich, P. A. Calabresi, D. L. Pham, A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions., NeuroImage 49 (2) (2010) 1524–1535, ISSN 1095-9572, DOI 10.1016/j.neuroimage.2009.09.005, URL http://www.ncbi.nlm.nih.gov/pubmed/19766196http: //www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2806481http://dx.doi.org/10.1016/j.neuroimage. 2009.09.005. [48] M. Scully, B. Anderson, T. Lane, C. Gasparovic, V. Magnotta, W. Sibbitt, C. Roldan, R. Kikinis, H. J. Bockholt, An Automated Method for Segmenting White Matter Lesions through Multi-Level Morphometric Feature Classification with Application to Lupus., Frontiers in human neuroscience 4 (April) (2010) 27, ISSN 1662-5161, DOI

32

ACCEPTED MANUSCRIPT

10.3389/fnhum.2010.00027, URL http://www.ncbi.nlm.nih.gov/pubmed/20428508. [49] D. Garc´ıa-Lorenzo, S. Prima, D. L. Arnold, D. L. Collins, C. Barillot, Trimmed-likelihood estimation for focal lesions and tissue segmentation in multisequence MRI for multiple sclerosis., IEEE Transactions on Medical Imaging 30 (8) (2011) 1455–67, ISSN 1558-254X, DOI 10.1109/TMI.2011.2114671, URL http://www.ncbi.nlm.nih.gov/pubmed/ 21324773http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3326634. [50] E. Geremia, O. Clatz, B. H. Menze, E. Konukoglu, A. Criminisi, N. Ayache, Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images., NeuroImage 57 (2) (2011) 378–390, ISSN 1095-9572 (Electronic),

PT

DOI 10.1016/j.neuroimage.2011.03.080.

[51] S. D. Smart, M. J. Firbank, J. T. O’Brien, Validation of automated white matter hyperintensity segmentation., Journal of aging research 2011 (2011) 391783, ISSN 2090-2212, DOI 10.4061/2011/391783, URL http://www.pubmedcentral.

RI

nih.gov/articlerender.fcgi?artid=3167190{&}tool=pmcentrez{&}rendertype=abstract.

[52] T. Samaille, L. Fillon, R. Cuingnet, E. Jouvent, H. Chabriat, D. Dormont, O. Colliot, M. Chupin, Contrast-based fully

1932-6203 (Electronic), DOI 10.1371/journal.pone.0048953.

SC

automatic segmentation of white matter hyperintensities: method and validation., PloS one 7 (11) (2012) e48953, ISSN

[53] P. Schmidt, C. Gaser, M. Arsic, D. Buck, A. Förschler, A. Berthele, M. Hoshi, R. Ilg, V. J. Schmid, C. Zimmer, B. Hemmer, M. Mühlau, An automated tool for detection of FLAIR-hyperintense white-matter lesions in multiple sclerosis, NeuroImage 59 (4) (2012) 3774–3783, DOI 10.1016/j.neuroimage.2011.11.032.
[54] B. A. Abdullah, A. A. Younis, N. M. John, Multi-sectional views textural based SVM for MS lesion segmentation in multi-channels MRIs, The Open Biomedical Engineering Journal 6 (2012) 56–72, DOI 10.2174/1874230001206010056.
[55] E. M. Sweeney, R. T. Shinohara, N. Shiee, F. J. Mateen, A. A. Chudgar, J. L. Cuzzocreo, P. A. Calabresi, D. L. Pham, D. S. Reich, C. M. Crainiceanu, OASIS is Automated Statistical Inference for Segmentation, with applications to multiple sclerosis lesion segmentation in MRI, NeuroImage: Clinical 2 (1) (2013) 402–413, DOI 10.1016/j.nicl.2013.03.002.
[56] S. Datta, P. A. Narayana, A comprehensive approach to the segmentation of multichannel three-dimensional MR brain images in multiple sclerosis, NeuroImage: Clinical 2 (2013) 184–196, DOI 10.1016/j.nicl.2012.12.007.
[57] A. Khademi, A. Venetsanopoulos, A. R. Moody, Generalized method for partial volume estimation and tissue segmentation in cerebral magnetic resonance images, Journal of Medical Imaging 1 (1) (2014) 14002, DOI 10.1117/1.JMI.1.1.014002.
[58] V. Ithapu, V. Singh, C. Lindner, B. P. Austin, C. Hinrichs, C. M. Carlsson, B. B. Bendlin, S. C. Johnson, Extracting and summarizing white matter hyperintensities using supervised segmentation methods in Alzheimer's disease risk and aging studies, Human Brain Mapping 35 (8) (2014) 4219–4235, DOI 10.1002/hbm.22472.
[59] B. I. Yoo, J. J. Lee, J. W. Han, S. Y. W. Oh, E. Y. Lee, J. R. MacFall, M. E. Payne, T. H. Kim, J. H. Kim, K. W. Kim, Application of variable threshold intensity to segmentation for white matter hyperintensities in fluid attenuated inversion recovery magnetic resonance images, Neuroradiology 56 (4) (2014) 265–281, DOI 10.1007/s00234-014-1322-6.
[60] R. Harmouche, N. K. Subbanna, D. L. Collins, D. L. Arnold, T. Arbel, Probabilistic multiple sclerosis lesion classification based on modeling regional intensity variability and local neighborhood information, IEEE Transactions on Biomedical Engineering 62 (5) (2015) 1281–1292, DOI 10.1109/TBME.2014.2385635.

[61] N. Guizard, P. Coupé, V. S. Fonov, J. V. Manjón, D. L. Arnold, D. L. Collins, Rotation-invariant multicontrast non-local means for MS lesion segmentation, NeuroImage: Clinical 8 (2015) 376–389, DOI 10.1016/j.nicl.2015.05.001.
[62] S. Jain, D. M. Sima, A. Ribbens, M. Cambron, A. Maertens, W. Van Hecke, J. De Mey, F. Barkhof, M. D. Steenwijk, M. Daams, F. Maes, S. Van Huffel, H. Vrenken, D. Smeets, Automatic segmentation and volumetry of multiple sclerosis brain lesions from MR images, NeuroImage: Clinical 8 (2015) 367–375, DOI 10.1016/j.nicl.2015.05.003.
[63] X. Tomas-Fernandez, S. K. Warfield, A model of population and subject (MOPS) intensities with application to multiple sclerosis lesion segmentation, IEEE Transactions on Medical Imaging 34 (6) (2015) 1349–1361, DOI 10.1109/TMI.2015.2393853.
[64] R. Wang, C. Li, J. Wang, X. Wei, Y. Li, Y. Zhu, S. Zhang, Automatic segmentation and volumetric quantification of white matter hyperintensities on fluid-attenuated inversion recovery images using the extreme value distribution, Neuroradiology 57 (3) (2015) 307–320, DOI 10.1007/s00234-014-1466-4.
[65] P. K. Roy, A. Bhuiyan, A. Janke, P. M. Desmond, T. Y. Wong, W. P. Abhayaratna, E. Storey, K. Ramamohanarao, Automatic white matter lesion segmentation using contrast enhanced FLAIR intensity and Markov Random Field, Computerized Medical Imaging and Graphics 45 (2015) 102–111, DOI 10.1016/j.compmedimag.2015.08.005.
[66] T. Brosch, Y. Yoo, L. Y. W. Tang, D. K. B. Li, A. Traboulsee, R. Tam, Deep Convolutional Encoder Networks for Multiple Sclerosis Lesion Segmentation, vol. 9556, Springer International Publishing, ISBN 978-3-319-30857-9, DOI 10.1007/978-3-319-30858-6, 2015.
[67] M. J. Fartaria, G. Bonnier, A. Roche, T. Kober, R. Meuli, D. Rotzinger, R. Frackowiak, M. Schluep, R. Du Pasquier, J.-P. Thiran, G. Krueger, M. Bach Cuadra, C. Granziera, Automated detection of white matter and cortical lesions in early stages of multiple sclerosis, Journal of Magnetic Resonance Imaging, DOI 10.1002/jmri.25095.
[68] H. Deshpande, P. Maurel, C. Barillot, Classification of multiple sclerosis lesions using adaptive dictionary learning, Computerized Medical Imaging and Graphics 46 (2015) 2–10, DOI 10.1016/j.compmedimag.2015.05.003.
[69] E. Roura, A. Oliver, M. Cabezas, S. Valverde, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira, X. Lladó, A toolbox for multiple sclerosis lesion segmentation, Neuroradiology 57 (10) (2015) 1031–1043, DOI 10.1007/s00234-015-1552-2.
[70] J. Knight, A. Khademi, MS Lesion Segmentation Using FLAIR MRI Only, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI, Athens, Greece, TBD, 2016.
[71] R. Mechrez, J. Goldberger, H. Greenspan, Patch-based segmentation with spatial consistency: application to MS lesions in brain MRI, International Journal of Biomedical Imaging 2016 (2016) 1–13, DOI 10.1155/2016/7952541.
[72] M. Strumia, F. Schmidt, C. Anastasopoulos, C. Granziera, G. Krueger, T. Brox, White matter MS-lesion segmentation using a geometric brain model, IEEE Transactions on Medical Imaging PP (99) (2016) 1, DOI 10.1109/TMI.2016.2522178.
[73] L. Griffanti, G. Zamboni, A. Khan, L. Li, G. Bonifacio, V. Sundaresan, U. G. Schulz, W. Kuker, M. Battaglini, P. M. Rothwell, M. Jenkinson, BIANCA (Brain Intensity AbNormality Classification Algorithm): a new tool for automated segmentation of white matter hyperintensities, NeuroImage 141 (2016) 191–205, DOI 10.1016/j.neuroimage.2016.07.018.

[74] T. Brosch, L. Y. W. Tang, Y. Yoo, D. K. B. Li, A. Traboulsee, R. Tam, Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation, IEEE Transactions on Medical Imaging 35 (5) (2016) 1229–1239, DOI 10.1109/TMI.2016.2528821.
[75] S. Valverde, A. Oliver, E. Roura, S. González-Villà, D. Pareto, J. C. Vilanova, L. Ramió-Torrentà, À. Rovira, X. Lladó, Automated tissue segmentation of MR brain images in the presence of white matter lesions, Medical Image Analysis 35 (2017) 446–457, DOI 10.1016/j.media.2016.08.014.
[76] M. Dadar, T. Pascoal, S. Manitsirikul, K. Misquitta, C. Tartaglia, J. Brietner, P. Rosa-Neto, O. Carmichael, C. DeCarli, D. L. Collins, Validation of a regression technique for segmentation of white matter hyperintensities in Alzheimer's disease, IEEE Transactions on Medical Imaging PP (99) (2017) 1–1, DOI 10.1109/TMI.2017.2693978.
[77] T. Zhan, R. Yu, Y. Zheng, Y. Zhan, L. Xiao, Z. Wei, Multimodal spatial-based segmentation framework for white matter lesions in multi-sequence magnetic resonance images, Biomedical Signal Processing and Control 31 (2017) 52–62, DOI 10.1016/j.bspc.2016.06.016.
[78] A. Khademi, A. R. Moody, Multiscale partial volume estimation for segmentation of white matter lesions using FLAIR MRI, in: Proceedings – International Symposium on Biomedical Imaging, IEEE, 568–571, DOI 10.1109/ISBI.2015.7163937, 2015.

[79] Y. Zhang, M. Brady, S. Smith, Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm, IEEE Transactions on Medical Imaging 20 (1) (2001) 45–57, DOI 10.1109/42.906424.
[80] J. Ashburner, K. J. Friston, Unified segmentation, NeuroImage 26 (3) (2005) 839–851, DOI 10.1016/j.neuroimage.2005.02.018.
[81] N. K. Subbanna, M. Shah, S. J. Francis, D. L. Collins, D. L. Arnold, T. Arbel, MS lesion segmentation using Markov random fields, in: MICCAI Workshop on Medical Image Analysis on Multiple Sclerosis, 37–48, 2009.
[82] P. L. Bazin, D. L. Pham, Homeomorphic brain image segmentation with topological and statistical atlases, Medical Image Analysis 12 (5) (2008) 616–625, DOI 10.1016/j.media.2008.06.008.
[83] H. J. Kuijf, G. J. Biessels, M. A. Viergever, C. Chen, W. van der Flier, F. Barkhof, J. De Bresser, M. Biesbroek, R. Heinen, WMH Segmentation Challenge, URL http://wmh.isi.uu.nl/, 2017.
[84] N. J. Tustison, B. B. Avants, P. A. Cook, Y. Zheng, A. Egan, P. A. Yushkevich, J. C. Gee, N4ITK: improved N3 bias correction, IEEE Transactions on Medical Imaging 29 (6) (2010) 1310–1320, DOI 10.1109/TMI.2010.2046908.
[85] MS Lesion Segmentation Challenge 2008, URL http://www.ia.unc.edu/MSseg/, 2008.
[86] A. Carass, S. Roy, A. Jog, J. L. Cuzzocreo, E. Magrath, A. Gherman, J. Button, J. Nguyen, P. L. Bazin, P. A. Calabresi, C. M. Crainiceanu, L. M. Ellingsen, D. S. Reich, J. L. Prince, D. L. Pham, Longitudinal multiple sclerosis lesion segmentation data resource, Data in Brief 12 (2017) 346–350, DOI 10.1016/j.dib.2017.04.004.
[87] MS Lesion Segmentation Challenge 2016, URL https://portal.fli-iam.irisa.fr/msseg-challenge, 2016.

[88] F. Barkhof, P. Scheltens, Imaging of white matter lesions, Cerebrovascular Diseases 13 (suppl 2) (2002) 21–30, DOI 10.1159/000049146.
[89] P. Schmidt, LST: a lesion segmentation tool for SPM, URL http://www.applied-statistics.de/lst.html, 2015.
[90] P. Schmidt, Bayesian inference for structured additive regression models for large-scale problems with applications to medical imaging, Ph.D. thesis, Ludwig-Maximilians-Universität München, URL https://edoc.ub.uni-muenchen.de/20373/, 2017.
[91] M. Ganzetti, N. Wenderoth, D. Mantini, Quantitative evaluation of intensity inhomogeneity correction methods for structural MR brain images, Neuroinformatics 14 (1) (2016) 5–21, DOI 10.1007/s12021-015-9277-2.
[92] A. Klein, J. Andersson, B. A. Ardekani, J. Ashburner, B. Avants, M.-C. Chiang, G. E. Christensen, D. L. Collins, J. Gee, P. Hellier, J. H. Song, M. Jenkinson, C. Lepage, D. Rueckert, P. Thompson, T. Vercauteren, R. P. Woods, J. J. Mann, R. V. Parsey, Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration, NeuroImage 46 (3) (2009) 786–802, DOI 10.1016/j.neuroimage.2008.12.037.
[93] J. Ashburner, K. J. Friston, Multimodal image coregistration and partitioning – a unified framework, NeuroImage 6 (3) (1997) 209–217, DOI 10.1006/nimg.1997.0290.
[94] A. Evans, D. Collins, S. Mills, E. Brown, R. Kelly, T. Peters, 3D statistical neuroanatomical models from 305 MRI volumes, 1993 IEEE Conference Record Nuclear Science Symposium and Medical Imaging Conference 3 (1993) 1813–1817, DOI 10.1109/NSSMIC.1993.373602.
[95] A. C. Evans, A. L. Janke, D. L. Collins, S. Baillet, Brain templates and atlases, NeuroImage 62 (2) (2012) 911–922, DOI 10.1016/j.neuroimage.2012.01.024.
[96] J. Mazziotta, A. Toga, A. Evans, P. Fox, J. Lancaster, K. Zilles, R. Woods, T. Paus, G. Simpson, B. Pike, C. Holmes, L. Collins, P. Thompson, D. MacDonald, M. Iacoboni, T. Schormann, K. Amunts, N. Palomero-Gallagher, S. Geyer, L. Parsons, K. Narr, N. Kabani, G. Le Goualher, D. Boomsma, T. Cannon, R. Kawashima, B. Mazoyer, A probabilistic atlas and reference system for the human brain: International Consortium for Brain Mapping (ICBM), Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 356 (1412) (2001) 1293–1322, DOI 10.1098/rstb.2001.0915.
[97] R. C. Gonzalez, R. E. Woods, Digital Image Processing, 3rd edn., Prentice Hall, Upper Saddle River, New Jersey, ISBN 013168728X, 2006.
[98] R. T. Shinohara, E. M. Sweeney, J. Goldsmith, N. Shiee, F. J. Mateen, P. A. Calabresi, S. Jarso, D. L. Pham, D. S. Reich, C. M. Crainiceanu, Statistical normalization techniques for magnetic resonance imaging, NeuroImage: Clinical 6 (2014) 9–19, DOI 10.1016/j.nicl.2014.08.008.
[99] J. Knight, G. Taylor, A. Khademi, Equivalence of histogram equalization, histogram matching and the Nyul algorithm for intensity standardization in MRI, in: 3rd Annual Conference on Vision and Imaging Systems, Waterloo, 2017.
[100] R. Fletcher, M. J. D. Powell, A rapidly convergent descent method for minimization, The Computer Journal 6 (2) (1963) 163–168, DOI 10.1093/comjnl/6.2.163.

[101] B. Moraal, S. D. Roosendaal, P. J. W. Pouwels, H. Vrenken, R. A. van Schijndel, D. S. Meier, C. R. G. Guttmann, J. J. G. Geurts, F. Barkhof, Multi-contrast, isotropic, single-slab 3D MR imaging in multiple sclerosis, European Radiology 18 (10) (2008) 2311–2320, DOI 10.1007/s00330-008-1009-7.
[102] J. Maranzano, D. A. Rudko, D. L. Arnold, S. Narayanan, Manual segmentation of MS cortical lesions using MRI: a comparison of 3 MRI reading protocols, American Journal of Neuroradiology 37 (9) (2016) 1623–1628, DOI 10.3174/ajnr.A4799.
[103] J. C. Lagarias, J. A. Reeds, M. H. Wright, P. E. Wright, Convergence properties of the Nelder–Mead simplex method in low dimensions, SIAM Journal on Optimization 9 (1) (1998) 112–147, DOI 10.1137/S1052623496303470.
[104] A. Akhondi-Asl, L. Hoyte, M. E. Lockhart, S. K. Warfield, A logarithmic opinion pool based STAPLE algorithm for the fusion of segmentations with associated reliability weights, IEEE Transactions on Medical Imaging 33 (10) (2014) 1997–2009, DOI 10.1109/TMI.2014.2329603.
[105] K. Geras, C. Sutton, Multiple-source cross-validation, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, vol. 28 of Proceedings of Machine Learning Research, PMLR, Atlanta, Georgia, USA, 1292–1300, URL http://proceedings.mlr.press/v28/geras13.html, 2013.
[106] T. K. Koo, M. Y. Li, A guideline of selecting and reporting intraclass correlation coefficients for reliability research, Journal of Chiropractic Medicine 15 (2) (2016) 155–163, DOI 10.1016/j.jcm.2016.02.012.
[107] D. G. Altman, J. M. Bland, Measurement in medicine: the analysis of method comparison studies, The Statistician 32 (1983) 307–317, DOI 10.2307/2987937.
[108] D. R. Hunter, K. Lange, A tutorial on MM algorithms, The American Statistician 58 (1) (2004) 30–37, DOI 10.1198/0003130042836.
[109] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI, 2015, 234–241, DOI 10.1007/978-3-319-24574-4_28.
[110] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint, URL http://arxiv.org/abs/1312.6199.
[111] B. Schölkopf, The kernel trick for distances, Advances in Neural Information Processing Systems 13 (2001) 301–307.

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8