Accepted Manuscript Title: Deep learning reveals Alzheimer’s disease onset in MCI subjects: results from an international challenge Authors: Nicola Amoroso Domenico Diacono Annarita Fanizzi Marianna La Rocca Alfonso Monaco Angela Lombardi Cataldo Guaragnella Roberto Bellotti Sabina Tangaro, for the Alzheimer’s Disease Neuroimaging Initiative PII: DOI: Reference:
S0165-0270(17)30429-6 https://doi.org/doi:10.1016/j.jneumeth.2017.12.011 NSM 7917
To appear in:
Journal of Neuroscience Methods
Received date: Revised date: Accepted date:
31-7-2017 18-12-2017 20-12-2017
Please cite this article as: Nicola Amoroso, Domenico Diacono, Annarita Fanizzi, Marianna La Rocca, Alfonso Monaco, Angela Lombardi, Cataldo Guaragnella, Roberto Bellotti, Sabina Tangaro,
for the Alzheimer’s Disease Neuroimaging Initiative, Deep learning reveals Alzheimer’s disease onset in MCI subjects: results from an international challenge, (2017), https://doi.org/10.1016/j.jneumeth.2017.12.011 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
A classification strategy based on Random Forest feature selection and Deep Neural Network is proposed.
•
The work demonstrated the accuracy of deep learning strategies for early detection of Alzheimer’s disease.
•
This work ranked third for overall accuracy over all participating teams of the “International challenge for automated prediction of MCI from MRI data” hosted by the Kaggle platform.
Ac ce p
te
d
M
an
us
cr
ip t
•
Page 1 of 22
cr
Number of text pages of the whole manuscript: 22 Number of figures: 5 Number of tables: 4
ip t
Deep learning reveals Alzheimer’s disease onset in MCI subjects: results from an international challenge
an
us
Nicola Amoroso,
[email protected], Universit`a degli studi di Bari “A. Moro”, Dipartimento Interateneo di Fisica “M. Merlin”, Italy and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy Domenico Diacono,
[email protected], Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
M
Annarita Fanizzi,
[email protected], Istituto Tumori “Giovanni Paolo II” - I.R.C.C.S., Bari, Italy
te
d
Marianna La Rocca,
[email protected], Universit`a degli studi di Bari “A. Moro”, Dipartimento Interateneo di Fisica “M. Merlin”, Italy and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
Ac ce p
Alfonso Monaco,
[email protected], Istituto Nazionale di Fisica Nucleare, sez. di Bari, Italy Angela Lombardi,
[email protected], Dipartimento di Ingegneria Elettrica e dell’Informazione, Politecnico di Bari, Bari, Italy Cataldo Guaragnella,
[email protected], Dipartimento di Ingegneria Elettrica e dell’Informazione, Politecnico di Bari, Bari, Italy Roberto Bellotti,
[email protected], Universit`a degli studi di Bari “A. Moro”, Dipartimento Interateneo di Fisica “M. Merlin”, Italy and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy Sabina Tangaro,
[email protected], Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy
1
Page 2 of 22
ip t
Deep learning reveals Alzheimer’s disease onset in MCI subjects: results from an international challenge
cr
Nicola Amorosoa,b , Domenico Diaconob , Annarita Fanizzic,∗, Marianna La Roccaa,b , Alfonso Monacob , Angela Lombardid , Cataldo Guaragnellad , Roberto Bellottia,b,1 , Sabina Tangarob,1 , and for the Alzheimer’s Disease Neuroimaging Initiative2 a
M
an
us
Dipartimento Interateneo di Fisica “M. Merlin”, Universit` a degli studi di Bari “A. Moro”, Bari, Italy b Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy c Istituto Tumori “Giovanni Paolo II” - I.R.C.C.S., Bari, Italy d Dipartimento di Ingegneria Elettrica e dell’Informazione, Politecnico di Bari, Bari, Italy
Abstract
Ac ce p
te
d
Background Early diagnosis of Alzheimer’s disease (AD) and its onset in subjects affected by mild cognitive impairment (MCI) based on structural MRI features is one of the most important open issues in neuroimaging. Accordingly, a scientific challenge has been promoted, on the international Kaggle platform, to assess the performance of different classification methods for prediction of MCI and its conversion to AD. New Method This work presents a classification strategy based on Random Forest feature selection and Deep Neural Network classification using a mixed cohort ∗
Corresponding author at: Via Giovanni Amendola, 173, 70125 Bari BA, +390805442391 Email address:
[email protected] (Annarita Fanizzi) 1 Equal last author contribution 2 Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wpcontent/uploads/how to apply/ADNI Acknowledgement List.pdf. Preprint submitted to Journal of Neuroscience Methods
December 21, 2017
Page 3 of 22
ip t
including the four classes of classification problem, that is HC, AD, MCI, cMCI, to train the model. Moreover, we compare this approach with a novel classification strategy based on fuzzy logic learned on a mixed cohort including only HC and AD.
an
us
cr
Experiments A training set of 240 subjects and a test set including mixed cohort of 500 real and simulated subjects were used. The data included AD patients, MCI subjects converting to AD (cMCI), MCI subjects and healthy controls (HC). This work ranked third for overall accuracy (38.8%) over 19 participating teams.
M
Comparison with Existing Method(s) The “International challenge for automated prediction of MCI from MRI data” hosted by the Kaggle platform has been promoted to validate different methodologies with a common set of data and evaluation procedures.
te
d
Conclusion DNNs reach a classification accuracy significantly higher than other machine learning strategies; on the other hand, fuzzy logic is particularly accurate with cMCI, suggesting a combination of these approaches could lead to interesting future perspectives.
1
2 3 4 5 6 7 8 9 10 11
Ac ce p
Keywords: Deep learning, Fuzzy logic, MRI, Alzheimer’s disease, MCI 1. Introduction
Alzheimer’s disease (AD) is the most common form of dementia, but a reliable in vivo diagnosis is not at sight, yet. In recent years, novel strategies of analysis have been proposed to tackle this issue; in particular, several studies have addressed the challenging task of measuring the atrophy of specific regions, such as the hippocampus [1, 2], and eventually distinguishing healthy controls (HC) from patients suffering from AD and its prodromal stage: mild cognitive impairment (MCI) [3, 4]. A number of works have already established the effectiveness of neuroimaging techniques in outlining structural differences between AD patients and controls; nonetheless, these studies are difficult to compare and their 4
Page 4 of 22
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
ip t
cr
18
us
17
an
16
M
15
d
14
te
13
findings hard to generalize as the adopted algorithms are different as the data. These are the main motivations for international challenges [5, 6], with transparent procedures and data sharing policies, to investigate algorithms and classification strategies. In particular, MCI condition remains elusive, especially because only a fraction of subjects with MCI turns into AD patients (cMCI), whereas another fraction do not convert into AD (MCI). Some works suggested to address the classification of MCI and cMCI classes learning their models on mixed cohorts including all available classes: HC, AD, cMCI and MCI [7, 8]. Another option is to learn a classification model directly using a mixed cohort of MCI and cMCI subjects. The main drawback of these approaches is that MCI is highly heterogeneous as it can include subjects affected by different or multiple pathological conditions; besides, cMCI and MCI classes can hardly be considered independent. Other studies have therefore suggested a different approach, learning the classification model on mixed cohort of HC and AD subjects and then predicting the conversion or not of MCI subjects [9, 10]. With this work we aimed at evaluating both strategies on the data shared by the challenge and, consequently, transparently investigating differences and similarities. In particular, we compared a pure machine learning approach, exploiting a Random Forest (RF) for feature selection and a Deep Neural Network (DNN) for classification, using a mixed cohort including all the four classes HC, AD, MCI and cMCI, with a fuzzy approach learned on a mixed cohort including only HC and AD subjects. The latter approach used the hippocampal volume to segregate the different classes and determine a proper set of features for each classification task. Recent works demonstrated that models trained with HC and AD subjects can also be effective when attempting to distinguish MCI and cMCI. In fact, the differences among MCI and cMCI subjects are significantly subtler than those among AD and HC, thus a classification model learned on these latter two classes can achieve more accurate results [10, 11, 12]. Several classification studies have investigated the early diagnosis of AD, the characterization of MCI and, eventually, its conversion into AD; structural MRI features have been mostly adopted. Nevertheless, an international competition assessing different algorithms on the same training and test sets have not been performed yet. Accordingly, a novel international challenge
Ac ce p
12
5
Page 5 of 22
56
2. Materials and Methods
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
78 79 80
cr
us
54
2.1. International challenge for automated prediction of MCI from MRI data The adoption of machine learning techniques for the analysis of neuroimaging data has deeply influenced recent neuroscience studies, specifically concerning AD and its prodromal phase (MCI). In fact, machine learning can support the estimation of disease risk along with the estimation of therapy success or the effects of genotype-phenotype associations. Recent studies exists about the early diagnosis of AD or the conversion of MCI subjects into AD. However, it is worth noting that an international competition among existing algorithms assessing the use of structural MRI markers has never been performed on the same training and test sets. Accordingly, a set of T1-weighted MRI scans for patients of four categories, stable AD, MCI, MCI who converted to AD and healthy controls were analyzed using FreeSurfer v.5.3. The obtained features were released on Kaggle platform to allow research team from all around the world to compare their algorithms within an established framework and using the same data. The participants were allowed to download specific training and test sets, in order to make results from different algorithms comparable, and submit their solutions in a text format. Kaggle tools allowed the visualization of intermediate scores by public/private leaderboard in terms of classification accuracy.
an
53
M
52
d
51
te
50
Ac ce p
49
ip t
55
was performed and hosted on the Kaggle platform3 in order to assess existing and novel classification algorithms with a common framework of analysis and a shared base of knowledge. The results of this work were publicly released on the Kaggle platform which hosted the competition of 19 international participating teams. The authors submitted their predictions within the Bari Medical Physics Group (BMPG) team flag. The main goal of this work is to describe the methodology designed by the BMPG to rank among the three best performing teams.
48
2.2. Challenge Data Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu). The challenge organizers prepared a training 3
https://inclass.kaggle.com/c/mci-prediction
6
Page 6 of 22
88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
ip t
cr
87
us
86
an
85
M
84
d
83
te
82
set Dtrain consisting of 240 subjects, equally composed of 60 AD patients, 60 HC, 60 cMCI and 60 MCI subjects. An anonymous id-code for identification, diagnosis, gender, age and baseline Mini Mental State Examination total score were provided for each subject; in addition, subcortical volumes, cortical thickness, cortical volume, cortical surface area, cortical thickness standard deviation, cortical curvature and hippocampal subfields’ volumes resulted by FreeSurfer analysis along with several other structural measures accounting for an overall feature representation of N = 431 variables. Thus, Dtrain was represented as a matrix with 240 rows and 431 columns. An analogous test set Dtest , including 500 observations was provided by the organizing committee, except for the diagnosis: so that Dtest was represented as a matrix with 500 rows and 430 columns. The Dtest set included 340 simulated samples and 160 real subjects. For what concerns the sampling, a three-fold procedure was adopted by the challenge organizers: (i) the whole cohort was downloaded from ADNI and randomly divided in training and test; (ii) the test sample was used to calculate the probability density function (pdf) of the features, accordingly the simulated data were obtained inverting the pdf previously calculated; (iii) finally, they randomly labeled each simulated with a clinical status. The size of both training and test sets were established from the challenge organizers, a random collection of 100 subjects for each clinical class was selected to keep these classes balanced and considering that MCI subjects are difficult to recruit. Further data could have been downloaded from ADNI, however we preferred use only data provided by the challenge organizers in order to keep our results comparable with those obtained by other teams. Challenge data was firstly cleaned; we identified and removed zero variance features. Besides, we investigated the presence of highly correlated features. In fact, we found 26 features correlated more than 90%. To further reduce the feature representation, we searched for linear combinations among the features and removed them from the data. After removing these variables, the resulting Dtrain matrix representation consisted of 242 features. Finally, training data was centered and scaled, in order to standardize the features with null average and unitary variance. We performed the same operations for the Dtest .
Ac ce p
81
7
Page 7 of 22
120 121 122 123 124 125 126 127 128 129
Ac ce p
te
d
M
130
ip t
119
cr
118
us
117
2.3. Feature selection We used a Random Forest classifier [13] to evaluate the feature importance in terms of mean decrease of accuracy. Random Forests are an ensemble of tree classifiers which can provide a direct multi-class model. A standard configuration was adopted with 500 trees and 20 features (as prescribed in [13]) randomly selected at each split. We performed 100 5-fold cross-validation rounds and for each round we selected the 20 most important features. We observed in fact that no significant improvement occurred in training classification accuracy using more features. As expected, hippocampal volumes were found among the best performing features. Other AD related features such as the the cerebrospinal fluid volume, the lateral ventricle volume and the enthorinal cortex thickness were selected. Interestingly, right hippocampus rank resulted higher than the left one; in addition the best performing feature was by far the baseline MMSE, see Figure 1 for an overview.
an
116
Figure 1: The feature importance measured in term of mean decrease in classification accuracy with a Random Forest classifier. When appropriate, the left (L) or the right (R) hemisphere is reported.
131 132 133 134
Structural measures of both right and left hemispheres were selected; their number was almost equal showing no manifest asymmetry. In fact, the selected left hemisphere measures were 9 against 10 from the right. The selected features were then used to train a Deep Neural Network. 8
Page 8 of 22
138 139 140 141 142 143
Ac ce p
te
d
M
an
144
ip t
137
cr
136
2.4. Deep Neural Networks A deep neural network (DNN) is a neural network with more than two hidden layers of neurons [14, 15]. In particular, we used a feedforward DNN, which learned to map the feature representation of each patient in the four aforementioned classes. Each unit within a layer computes a weighted sum of the inputs from the previous layer, then its result becomes the argument of a nonlinear transformation textcolor (relu or tanh), then it feeds the subsequent layers until the final layer is reached and a final score is computed. In particular, a softmax nonlinear function is used to obtain the final score [16], see Figure 2 for a pictorial representation.
us
135
Figure 2: The optimal network configuration was obtained in cross-validation. The number of neurons within each layer decreases according to the flow of information from features to output labels. 145
DNNs have shown promising results in several applications ranging from 9
Page 9 of 22
153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
170 171 172 173 174 175 176 177 178 179 180 181 182
ip t
cr
152
us
151
an
150
M
149
d
148
te
147
image classification to natural language applications [14, 15]. More recently, they showed excellent results for the development of computer-aided detection applications [17, 18, 19]. Accordingly, we decided to explore their effectiveness. The optimal size of a training set depends in general on several factors, especially on how separate are the classes to be investigated. In particular, when the performance on the training set remains stable whatever configuration of learning is considered, an option to be considered is that gathering more data would not be helpful. In order to improve the learning algorithm, we preferred to increase the size of the model by adding more hidden units to each layer, as well as tuning the learning rate hyperparameter. Specifically, after several cross-validation analysis the best DNN configuration was determined with 11 layers and a decreasing number of neurons: any further increase in the number of layers or neurons did not yield any improvement. The input layer consisted of 2056 units while the last one just included 4 (one for each class); a mixed optimal configuration of activation functions (relu and tanh) was determined with a cross-validation analysis [20]. The DNN loss function was the categorical cross entropy; it was minimized with the Adam optimizer [21]. For better robustness, the network weights were randomly initialized from a uniform distribution, thus resulting in different models. We performed 30 different initializations. The classification scores from each model were then averaged to obtain a final prediction, each test subject was assigned the class with the resulting higher score.
Ac ce p
146
2.5. Fuzzy classes of hippocampal volume Hippocampal atrophy is a well-known biomarker for AD progression state [22, 23]. In a previous work, we proposed the use of homogeneous classes of subjects to perform different feature selection on MRI data [4]. The highly variable nature of the hippocampal morphology suggested the use of a fuzzy logic to generalize the definition of set by redefining the concept of membership. Specifically, let X be a space of points, with a generic element of X denoted by x. A fuzzy set A ∈ X is characterized by a membership function fA (x) that associates with each point in X a real number in the interval [0, 1], with the values of fA (x) representing the grade of membership of x in A. Thus, the nearer the value of fA (x) to unity, the higher the grade of membership of x in A [24]. The core of the fuzzy set A is the subset of values of x that belong to the class A with maximum degree. 10
Page 10 of 22
190 191 192 193 194 195 196 197 198 199 200 201 202 203
204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219
ip t
cr
189
us
188
an
187
M
186
d
185
te
184
We individuated a statistically consistent subdivision of the training cohort in three fuzzy classes according to the subject’s left hippocampal volume. Firstly, we performed classification of training subjects and looking at Receiving Operating Characteristic (ROC) curve we found the optimal threshold value t separating HC and AD classes. Thus, the fuzzy set L including subjects whit the left hippocampal volume V satisfying the relationship V ≤ t was defined; analogously, we defined the fuzzy set H when V ≥ t. These two sets were mutually exclusive. However, there are subjects belonging to H which are AD and HC subjects in L. To take this effect into account, we defined a third fuzzy set M. We computed the median values l1 and l2 of left hippocampal volumes of AD ad HC subjects, respectively; the median value of left hippocampal volumes of HC subjects was l2 . These values satisfied the inequality l1 ≤ t ≤ l2 so that a third fuzzy set was suitably introduced including subjects whose left hippocampal volume satisfied l1 ≤ V ≤ l2 . Within each fuzzy set, the probability for a subject to belong to a specific class was reasonably different; for example, subjects belonging to the L set were more likely to be AD, having a small hippocampal volume, than HC. Thus, a specific feature selection was performed for each set in order to obtain a more adherent representation of its population. In particular, a Kruskal-Wallis test was performed and only those structural features exceeding the 5% significance were used to learn the model. 2.6. Classification model A schematic overview of the framework we adopt here is presented in Figure 3. We trained a standard RF classifier for each set L, M and H, then we measured the subject’s membership to each set and finally obtained the subject’s classification score S by averaging the scores obtained from each model according to the membership. Finally, four diagnostic groups were obtained with a Bayesian approach. We computed three thresholds t1 , t2 , and t3 , maximizing the training classification accuracy for the classification of HC, AD, cMCI, and MCI classes. Specifically, subjects with the final score S ≥ t3 were classified AD, if S ≤ t1 HC, if t1 ≤ S ≤ t2 MCI and if t2 ≤ S ≤ t3 cMCI. In order to evaluate the performance of the model, we used a 10-fold crossvalidation. Specifically, we measured the overall accuracy (acc) and adopted the multi-class definition for recall and precision. The overall accuracy was the number of correctly classified subjects (true positive, TP) by the sample size N :
Ac ce p
183
11
Page 11 of 22
ip t cr us
an
Figure 3: A schematic overview of the proposed fuzzy approach. Once the fuzzy sets L, M and H have been defined, feature selection is performed and a RF model is trained for each set. The final classification score is obtained by averaging the models’ predicitons.
TP (1) N Recall Rc of a class was the number of subjects belonging to the class correctly classified (T Pc ) by the overall class size Nc :
220
d
221
M
acc =
T Pc (2) Nc Analogously, precision Pc of a class was the number of correctly classified subjects (T Pc ) divided by the number of predictions within that class Np,c :
223
224 225
226
227 228 229 230 231 232
Ac ce p
222
te
Rc =
Pc =
T Pc Np,c
(3)
For all reported results, average and standard deviations were computed when possible. 3. Results
3.1. Public and Private Leaderboards The performance of all challenge competitors was publicly released in terms of classification accuracy; each one of the 19 participating teams could make a submission once a day. The Public Leaderboard kept record of the team name, the classification score, the number of entries and the best submission time, along with the ranking position. The BMPG best submission 12
Page 12 of 22
235 236
Score accuaracy 0.424 0.416 0.412 0.412 0.408 0.396 0.388 0.384 0.384 0.384
M
an
us
Placement Team name 1 SiPBA-UGR 2 Salvatore C. - Castiglioni I. 3 Loris Nanni 4 Stavros Dimitriadis - Dimitris LIparas 5 Sorensen 6 GRAAL 7 BMPG 8 ChaseCowart 9 Jean-Baptiste SCHIRATTI 10 BrainE
ip t
234
was obtained with the DNN model; the organizing committee adopted the previously defined accuracy metric, see Eq. 1, to evaluate the submissions. In particular, the BMPG submission obtained a 38.8% accuracy, resulting the seventh performance of the roaster, see Table 1.
cr
233
239 240 241 242 243 244 245 246 247 248 249
250 251 252 253 254
The leaderboard performance was computed on approximately 50% of test data. Thus, it provided just an estimation of test results. In this case, the BMPG best submission reached the seventh position. After the deadline for challenge submissions, the organizing committee released the Private Leaderboard performance computed on the entire Dtest . The Private Leaderboard performance resulted overoptimistic, with a 4% loss in accuracy; in fact, the BMPG final result was 34.8%. It is worth noting that training performances exceeded 50% accuracy (for all presented models). Nevertheless, the BMPG prediction reached the third position, see Table 2. After the submission deadline, the organizing committee released the test labels so that it was possible to investigate in detail the classification performance especially investigating the model accuracy on real subjects. We perform such evaluation in the following section.
te
238
Ac ce p
237
d
Table 1: BMPG accuracy was 38.8% and resulted among the top ten algorithms presented in the Public Leaderboard including half of testing subjects.
3.2. Model comparison The classification performance on training was evaluated on 100 rounds of 10-fold cross validation; the results are summarized in Table 3. The DNN model resulted the one with higher overall accuracy acc = 53.7 ± 1.9%, while for the fuzzy model acc = 50.9 ± 1.1%. In particular, the DNN model had 13
Page 13 of 22
cr
ip t
Score accuaracy 0.360 0.356 0.348 0.344 0.336 0.336 0.332 0.328 0.328 0.324
us
Placement Team name 1 Stavros Dimitriadis - Dimitris LIparas 2 GRAAL 3 BMPG 4 BrainE 5 gogogo 6 DevinAnnaWilley 7 fengxy 8 BoyX 9 ChaseCowart 10 Jean-Baptiste SCHIRATTI
257 258 259
te
260
higher recall values for HC, AD and cMCI, while the fuzzy model showed a slightly higher precision for all classes, except AD. Also, we compared DNN results with two state-of-the-art machine learning methods, Random Forests and Support Vector Machines; in both cases accuracies resulted significantly lower thus we decided to not submit their prediction to the leaderboard.
M
256
d
255
an
Table 2: MPG submission ranked third in the Private Leaderboard including all of testing subjects. The BMPG submission obtained a 34.8% accuracy.
Ac ce p
DNN Fuzzy recall precision recall precision AD 80.9± 3.2 % 79.3 ± 1.7% 66.2 ± 2.4% 66.8 ± 1.1% HC 53.8 ± 3.4% 53.4 ± 2.4% 47.7 ± 3.5% 54.8 ± 2.0% MCI 35.9 ± 4.3% 36.4 ± 3.5% 46.1 ± 1.5% 37.9 ± 1.4% cMCI 44.8 ± 3.0% 45.0 ± 3.0% 43.8 ± 1.2% 47.5 ± 1.8%
Table 3: Recall and precision for the DNN and Fuzzy models on training subjects (average and standard deviation computed in cross-validation). Best performances are reported in bold.
261 262 263 264 265 266
The classification performance on real test subjects confirmed the DNN as the best model. In fact, the DNN model acquired a 56.3% accuracy while the fuzzy reached 52.2% accuracy, see Figure 4 for a comprehensive review of the classification results. Besides, recall and precision of both models are reported in Table 4. The DNN model resulted the one with higher recall for all classes, except 14
Page 14 of 22
ip t cr us an M
Ac ce p
te
d
Figure 4: The confusion matrices of the DNN best submission (left) and the fuzzy (right) models on real test subjects.
AD HC MCI cMCI
DNN recall precision 87.5% 74.5% 52.5% 52.5% 27.5% 34.3% 52.5% 51.2%
Fuzzy recall precision 82.5% 78.6% 50.0% 55.9% 20.0% 22.9% 57.5% 47.9%
Table 4: Recall and precision for the DNN, Fuzzy and RF models on real test subjects. Best performances are reported in bold.
15
Page 15 of 22
269 270 271 272 273
d
M
an
us
274
ip t
268
cMCI; in this latter case, the fuzzy model had higher recall. For what concerned precision, the DNN model was the more precise with respect of MCI and cMCI while the fuzzy model was more precise with AD and HC. Moreover, a direct comparison of predictions obtained by the two models was performed to evaluate their agreement, see Figure 5, on the 160 real test subjects. The two models agreed on 103 predictions, 64.3% of cases; a major difference was found for the MCI class where a 9.3% agreement was found, to be compared with the best agreement of AD class which was 22.5%.
cr
267
275
276 277 278 279 280 281 282 283 284 285 286 287
Ac ce p
te
Figure 5: The comparison matrix of DNN and fuzzy predictions on real test subjects. The overall agreement of predictions reached the 64.3%.
4. Discussion
This study confirms the effectiveness of DNN models for computer-aided detection systems. In fact, the classification results obtained by DNN allowed the BMPG group to obtain one of the most accurate prediction in the participants’ roaster. Nevertheless, the multi-class classification accuracy is still far from reaching satisfactory results for clinical applicability. However, these findings compare well with state-of-the-art results; for example in [5] the best performing method reached a 63% accuracy for a three-class classification problem. However, other works consider this as an intrinsically four-class classification problem [25], of course in this case reported accuracies, 46.3%, can reach lower values. Interestingly, classification accuracies reported in challenges is lower than those reported in literature. Discrimination between MCI-c and MCI is challenging because of an unavoidable bias: 16
Page 16 of 22
295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325
ip t
cr
294
us
293
an
292
M
291
d
290
te
289
under the label of MCI, various pathologies share common clinical features but can have different causes [26]. Moreover, some people with MCI will progress to dementia, but others we remain stable or fully recover [27]. From this perspective, this study also confirmed that the development of accurate algorithms for distinguishing MCI subjects who convert to AD from those who will not is yet a challenging problem. The MMSE feature was by far the most informative one; it is a well-known AD biomarker and often used as a short screening tool for providing an overall measure of cognitive impairment in clinical, research and community settings [28]. The diagnosis of AD is primarily based on multiple variables and factors, nevertheless structural MRI fits in the model of dynamic biomarkers of the Alzheimer’s pathological progression, as it captures atrophy effects yielded by amyloid cascade. Indeed, it should be emphasized that despite multi-modal data are often acquired for AD diagnosis, several studies were performed using only one imaging modality [29] because, among the common neuroimaging tools, structural MRI plays a pivotal role for several reasons: it is the most widespread tool in the clinical practice, the examination is relatively fast and cheap, it allows longitudinal in vivo studies, the technique is noninvasive and the image interpretation is fairly straightforward [30]. Recent studies have shown that different biomarkers may provide complementary information which is very useful for AD diagnosis and, in particular, for MCI classification [31, 32, 33, 34]. Multimodal neuroimaging is widely used in neuroscience researches, as it overcomes the limitations of individual modalities and it may be very useful for future works to investigate their application to solve the complex 4-class classification problem. This result suggests the need for further feature engineering studies to improve diagnostic accuracy. However, novel insight has been gained thanks to this competition. In particular, this study demonstrated that pure machine learning algorithms are generally those with higher recall but they can have poor precision. The use of alternative approaches, such as the fuzzy logic for the present case, can in principle improve classification. In fact, the fuzzy approach adopted here was the most precise model when dealing with AD and HC. We also observed that the two approaches had a relatively small overlap, with an agreement of 64.3%. Another interesting aspect concerns the synthetic data. The organizing committee provided a test set including real subjects and simulated data. Interestingly, the test performances obtained with all the three models on the entire test set were significantly lower than those obtaining during training.
Ac ce p
288
17
Page 17 of 22
331
5. Conclusion
328 329
cr
327
ip t
330
Specifically, for both models we found a ∼ 50% accuracy while in test we found a significant decrease (∼ 34%). On the contrary, when dealing with real subjects, test performances resulted in good agreement with training cross-validation accuracy: an important clue about the poor conformity of simulated data to real subjects.
326
346
References
338 339 340 341 342 343 344
347 348 349 350
351 352 353 354 355
356 357 358 359
an
337
M
336
d
335
te
334
Ac ce p
333
us
345
The presented work demonstrated the accuracy of deep learning strategies for early detection of Alzheimer’s disease and, in particular, of its onset in MCI subjects. In fact, the strategy described here was presented at the “International challenge for automated prediction of MCI from MRI data” which hosted 19 research teams from all over the world and ranked third in terms of overall multi-class classification accuracy. We evaluated the robustness of the proposed approaches also in terms precision and recall. These evaluations were performed on a training and an independent test set provided by the organizing committee. Our work confirmed the reliability of MMSE score for AD classification and investigated the effectiveness of a novel approach based on fuzzy, thus we hope that our work will stimulate further studies and novel strategies to combine these approaches. In particular, we intend to investigate if DNN accuracy can be improved adding non-MRI features or combining different imaging modalities.
332
[1] S. Tangaro, N. Amoroso, M. Boccardi, S. Bruno, A. Chincarini, G. Ferraro, G. B. Frisoni, R. Maglietta, A. Redolfi, L. Rei, et al., Automated voxel-by-voxel tissue classification for hippocampal segmentation: methods and validation, Physica Medica 30 (2014) 878–887. [2] N. Amoroso, R. Errico, S. Bruno, A. Chincarini, E. Garuccio, F. Sensi, S. Tangaro, A. Tateo, R. Bellotti, A. D. N. Initiative, et al., Hippocampal unified multi-atlas network (HUMAN): protocol and scale validation of a novel segmentation tool, Physics in medicine and biology 60 (2015) 8851. [3] A. Chincarini, F. Sensi, L. Rei, G. Gemme, S. Squarcia, R. Longo, F. Brun, S. Tangaro, R. Bellotti, N. Amoroso, et al., Integrating longitudinal information in hippocampal volume measurements for the early detection of Alzheimer’s disease, Neuroimage 125 (2016) 834–847. 18
Page 18 of 22
367
368 369 370 371
372 373 374
375 376 377 378
379 380 381
382 383 384
385 386 387 388
389 390 391
ip t
cr
366
us
365
[6] G. I. Allen, N. Amoroso, C. Anghel, V. Balagurusamy, C. J. Bare, D. Beaton, R. Bellotti, D. A. Bennett, K. L. Boehme, P. C. Boutros, et al., Crowdsourced estimation of cognitive decline and resilience in Alzheimer’s disease, Alzheimer’s & Dementia 12 (2016) 645–653.
an
364
[5] E. E. Bron, M. Smits, W. M. Van Der Flier, H. Vrenken, F. Barkhof, P. Scheltens, J. M. Papma, R. M. Steketee, C. M. Orellana, R. Meijboom, et al., Standardized evaluation of algorithms for computer-aided diagnosis of dementia based on structural MRI: The CADDementia challenge, NeuroImage 111 (2015) 562–579.
[7] C. Davatzikos, P. Bhatt, L. M. Shaw, K. N. Batmanghelich, J. Q. Trojanowski, Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification, Neurobiology of aging 32 (2011) 2322–e19.
M
363
[8] Y. Cho, J.-K. Seong, Y. Jeong, S. Y. Shin, A. D. N. Initiative, et al., Individual subject classification for Alzheimer’s disease based on incremental learning using a spatial frequency representation of cortical thickness data, Neuroimage 59 (2012) 2217–2230.
d
362
te
361
[4] S. Tangaro, A. Fanizzi, N. Amoroso, R. Bellotti, A. D. N. Initiative, et al., A fuzzy-based system reveals Alzheimers Disease onset in subjects with Mild Cognitive Impairment, Physica Medica 38 (2017) 36–44.
[9] D. Zhang, D. Shen, A. D. N. Initiative, et al., Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers, PloS one 7 (2012) e33182.
Ac ce p
360
[10] E. Moradi, A. Pepe, C. Gaser, H. Huttunen, J. Tohka, A. D. N. Initiative, et al., Machine learning framework for early MRI-based Alzheimer’s conversion prediction in MCI subjects, Neuroimage 104 (2015) 398–412. [11] J. Young, M. Modat, M. Cardoso, A. Mendelson, D. Cash, S. Ourselin, et al., Accurate multimodal probabilistic prediction of conversion to Alzheimer’s disease in patients with mild cognitive impairment, NeuroImage: Clinical 2 (2013) 735–45. [12] R. Filipovych, C. Davatzikos, A. D. N. Initiative, et al., Semi-supervised pattern classification of medical images: application to mild cognitive impairment (MCI), NeuroImage 55 (2011) 1109–19. 19
Page 19 of 22
399
400 401 402 403 404
405 406 407 408
409 410 411
412 413 414
415 416
417 418 419
420 421 422 423
ip t
cr
398
[16] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. Http://www.deeplearningbook.org.
us
397
[17] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, R. M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE transactions on medical imaging 35 (2016) 1285–1298.
an
396
[15] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al., Greedy layerwise training of deep networks, Advances in neural information processing systems 19 (2007) 153.
M
395
[14] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation 18 (2006) 1527–1554.
[18] T. Kooi, G. Litjens, B. van Ginneken, A. Gubern-M´erida, C. I. S´anchez, R. Mann, A. den Heeten, N. Karssemeijer, Large scale deep learning for computer aided detection of mammographic lesions, Medical image analysis 35 (2017) 303–312.
d
394
[19] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, Nature 542 (2017) 115–118.
te
393
[13] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
Ac ce p
392
[20] R. Arora, A. Basu, P. Mianjy, A. Mukherjee, Deep Neural Networks with Rectified Linear Units, arXiv:1611.01491 (2016).
Understanding arXiv preprint
[21] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). [22] D. Devanand, R. Bansal, J. Liu, X. Hao, G. Pradhaban, B. S. Peterson, MRI hippocampal and entorhinal cortex mapping in predicting conversion to Alzheimer’s disease, Neuroimage 60 (2012) 1622–29. [23] K. Leung, J. Barnes, G. Ridgway, J. Bartlett, M. Clarkson, K. Macdonald, et al., Automated cross-sectional and longitudinal hippocampal volume measurement in mild cognitive impairment and Alzheimer’s disease, Neuroimage 51 (2010) 1345–59. 20
Page 20 of 22
431 432
433 434 435 436 437 438
439 440 441 442 443
444 445
446 447 448 449
450 451 452
453 454 455 456
ip t
cr
430
[26] B. Dubois, M. Albert, Amnestic MCI or prodromal Alzheimer’s disease?, The Lancet Neurology 3 (2004) 246–48.
us
429
[27] M. S. Albert, S. T. DeKosky, D. Dickson, B. Dubois, H. H. Feldman, N. C. Fox, A. Gamst, D. M. Holtzman, W. J. Jagust, R. C. Petersen, et al., The diagnosis of mild cognitive impairment due to Alzheimers disease: Recommendations from the National Institute on Aging-Alzheimers Association workgroups on diagnostic guidelines for Alzheimer’s disease, Alzheimer’s & dementia 7 (2011) 270–279.
an
428
[25] S. Liu, S. Liu, W. Cai, H. Che, S. Pujol, R. Kikinis, D. Feng, M. J. Fulham, et al., Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer’s disease, IEEE Transactions on Biomedical Engineering 62 (2015) 1132–1140.
M
427
[28] I. Arevalo-Rodriguez, N. Smailagic, M. R. i Figuls, A. Ciapponi, E. Sanchez-Perez, A. Giannakou, O. L. Pedraza, X. B. Cosp, S. Cullum, Mini-Mental State Examination (MMSE) for the detection of Alzheimers disease and other dementias in people with mild cognitive impairment (MCI), BJPsych Advances 21 (2015) 362–362.
d
426
te
425
[24] L. Zadeh, From computing with numbers to computing with wordsfrom manipulation of measurements to manipulation of perceptions, in: Logic, Thought and Action, Springer, 1999, pp. 507–44.
[29] Q. Zhao, X. Chen, Y. Zhou, Quantitative multimodal multiparametric imaging in Alzheimers disease, Brain informatics 3 (2016) 29–37.
Ac ce p
424
[30] A. Chincarini, P. Bosco, G. Gemme, S. Morbelli, D. Arnaldi, F. Sensi, I. Solano, N. Amoroso, S. Tangaro, R. Longo, et al., Alzheimers disease markers from structural MRI and FDG-PET brain images, The European Physical Journal Plus 127 (2012) 135. [31] L. Nanni, C. Salvatore, A. Cerasa, I. Castiglioni, A. D. N. Initiative, et al., Combining multiple approaches for the early diagnosis of Alzheimer’s Disease, Pattern Recognition Letters 84 (2016) 259–266. [32] K. Walhovd, A. Fjell, A. Dale, L. McEvoy, J. Brewer, D. Karow, D. Salmon, C. Fennema-Notestine, A. D. N. Initiative, et al., Multimodal imaging predicts memory performance in normal aging and cognitive decline, Neurobiology of aging 31 (2010) 1107–1121. 21
Page 21 of 22
ip t
cr
463
us
462
an
461
[34] C. Hinrichs, V. Singh, G. Xu, S. C. Johnson, A. D. N. Initiative, et al., Predictive markers for AD in a multi-modality framework: an analysis of MCI progression in the ADNI population, Neuroimage 55 (2011) 574–589.
M
460
d
459
te
458
[33] D. Zhang, Y. Wang, L. Zhou, H. Yuan, D. Shen, A. D. N. Initiative, et al., Multimodal classification of Alzheimer’s disease and mild cognitive impairment, Neuroimage 55 (2011) 856–867.
Ac ce p
457
22
Page 22 of 22