NeuroImage 60 (2012) 59–70


Technical Note

Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images

Carlton Chu a,1, Ai-Ling Hsu b,1, Kun-Hsien Chou c, Peter Bandettini a, Ching-Po Lin b,c,⁎, and for the Alzheimer's Disease Neuroimaging Initiative 2

a Section on Functional Imaging Methods, Laboratory of Brain and Cognition, NIMH, NIH, Bethesda, USA
b Institute of Brain Science, National Yang Ming University, Taipei, Taiwan, ROC
c Brain Connectivity Laboratory, Institute of Neuroscience, National Yang Ming University, Taipei, Taiwan, ROC

Article info

Article history: Received 1 August 2011; Revised 15 November 2011; Accepted 21 November 2011; Available online 1 December 2011.

Keywords: Feature selection; Support vector machine (SVM); Disease classification; Recursive feature elimination (RFE)

Abstract

There are growing numbers of studies using machine learning approaches to characterize patterns of anatomical difference discernible from neuroimaging data. The high dimensionality of image data often raises a concern that feature selection is needed to obtain optimal accuracy. Among previous studies, mostly using fixed sample sizes, some show greater predictive accuracies with feature selection, whereas others do not. In this study, we compared four common feature selection methods: 1) pre-selected regions of interest (ROIs) based on prior knowledge; 2) univariate t-test filtering; 3) recursive feature elimination (RFE); and 4) t-test filtering constrained by ROIs. The predictive accuracies achieved from different sample sizes, with and without feature selection, were compared statistically. To demonstrate the effect, we used grey matter segmented from the T1-weighted anatomical scans collected by the Alzheimer's Disease Neuroimaging Initiative (ADNI) as the input features to a linear support vector machine classifier. The objective was to characterize the patterns of difference between Alzheimer's disease (AD) patients and cognitively normal subjects, and also to characterize the difference between mild cognitive impairment (MCI) patients and normal subjects. In addition, we compared the classification accuracies between MCI patients who converted to AD and MCI patients who did not convert within a period of 12 months. Predictive accuracies from the two data-driven feature selection methods (t-test filtering and RFE) were no better than those achieved using whole-brain data. We showed that we could achieve the most accurate characterizations by using prior knowledge of where to expect neurodegeneration (hippocampus and parahippocampal gyrus). Therefore, feature selection does improve the classification accuracies, but it depends on the method adopted. In general, larger sample sizes yielded higher accuracies with less advantage obtained by using knowledge from the existing literature. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

Common questions that investigators have about analyzing image data often concern how best to represent it. For morphometric analysis, it is common for researchers to ask whether they should use voxel-based morphometry, cortical thickness mapping, or any one of a vast number of other possible representations.

⁎ Corresponding author at: Institute of Neuroscience, National Yang-Ming University, 155, Nong St., Sec. 2, Peitou, Taipei, Taiwan, ROC. Fax: +886 2 28262285. E-mail address: [email protected] (C. Lin).
1 These authors contributed equally to this work.
2 Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators is available at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/ADNI_Authorship_List.pdf.

No one approach is optimal in all situations. One that works well for differentiating among some populations may not work so well for others. The empirical way to determine which is “better” involves assessing how well the various representations are able to actually discriminate among the populations. Studies that apply predictive and classification models are judged on their predictive validity (Forster, 2002). If the classification accuracy is higher than chance, it implies that the input features (which could be the whole brain or a small brain region) contain some information to discriminate among the populations (Ashburner and Kloppel, 2011). There is increasing interest in applying multivariate pattern analysis (MVPA) to study the patterns of neurodegenerative diseases and mental disorders using anatomical MRI, including Alzheimer's disease (AD), Huntington's disease (HD), major depressive disorder (MDD), schizophrenia, autism spectrum disorder (ASD) (Chu, 2009; Costafreda et al., 2009; Eger et al., 2008; Fan et al., 2005; Kloppel et al., 2009; Kloppel et al., 2008), and many other disorders and diseases.


Many of these applications build on voxel-based morphometry (VBM), and studies on the classification of AD in particular have grown significantly in recent years. The high dimensionality of the input feature space in comparison with the relatively small number of subjects (the curse of dimensionality) is a widespread concern, so some form of feature selection is often applied (Costafreda et al., 2009; Fan et al., 2007; Guyon and Elisseeff, 2003; Pelaez-Coca et al., 2010). However, many studies have used the support vector machine (SVM) and other kernel methods. These approaches search for a solution in the kernel space, so the number of optimized parameters is bounded by the number of training samples rather than by the dimension of the input features. In other words, kernel methods implicitly reduce the dimensionality of the problem to the number of samples (subjects in this case). Moreover, neuroimaging data exhibit high correlations among features, implying a much lower “effective dimensionality” than the actual number of voxels, and samples (images) are likely to lie on a lower-dimensional manifold.

Nevertheless, feature selection does have benefits other than alleviating the curse of dimensionality. One practical benefit is that it speeds up the testing process. Some also claim that it makes interpretation easier: it is often simpler to explain results from those subsets of the data that are sufficient to maintain equivalent classification accuracies (Marquand et al., 2011). The main benefit claimed for feature selection, and the main focus of this manuscript, is that it increases classification accuracy. It is believed that removing non-informative signal can reduce noise and increase the contrast between labelled groups. Some researchers consider feature selection a necessity, and some reviewers make papers difficult to publish if no feature selection has been applied. However, there is little evidence to support the claim that feature selection is always necessary.

Feature selection approaches have shown many promising results in the field of bioinformatics (Saeys et al., 2007), and many feature selection methods and techniques originated from, or were specifically developed for, bioinformatics. However, the properties of input features in bioinformatics are very different from those of neuroimaging data, especially as there are often more independent features in bioinformatics data. Among previous studies of disease classification using imaging data, mostly using a fixed sample size, some show higher classification accuracies with feature selection. However, a recent study (Cuingnet et al., 2010) using 509 subjects showed no accuracy improvement with feature selection. We hypothesize that any improvement due to feature selection depends on the sample size and on the prior knowledge that the feature selection method utilizes. A priori, we believed that when the training sample is sufficiently large, feature selection would have a tiny, or even negative, impact on performance. Based on Occam's razor (the principle of parsimony), we should prefer the simpler method when its performance is equivalent to that of more sophisticated ones. That is to say, it is perhaps preferable not to apply feature selection when it yields no improvement in classification performance. To test our hypothesis, we compared the accuracy of discriminating AD from normal controls (NC) and of discriminating mild cognitive impairment (MCI) from NC using four common feature selection methods and different training set sizes.
The imaging data came from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and have been used by many studies of AD classification (Filipovych and Davatzikos, 2011; Gerardin et al., 2009; Pelaez-Coca et al., 2011; Stonnington et al., 2010; Yang et al., 2011; Zhang et al., 2011). To obtain the maximum amount of training data of the same modality, we used a single T1-weighted image from each subject. The pre-processing pipeline was the same as is often used in voxel-based morphometry (VBM) studies, and involved using Jacobian-scaled, spatially normalized grey matter (GM) as the input features. Cross-validation and resampling techniques were applied to estimate the accuracies. The performance, with and without feature selection, was compared statistically.

2. Methods

This section describes the methods of feature selection and machine learning applied to the ADNI data, and the pre-processing of the image data.

2.1. ADNI dataset

Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI), which was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of amnestic MCI and early probable AD. The determination of sensitive and specific biomarkers of early AD and disease progression is intended to aid researchers and clinicians in developing new treatments and monitoring their effectiveness, as well as to lessen the time and cost of clinical trials.

Although ADNI includes various image modalities, we only used the more abundant T1-weighted images obtained using MPRAGE or equivalent protocols with varying resolutions (typically 1.25 × 1.25 mm in-plane spatial resolution and 1.2 mm thick sagittal slices). Only images obtained using 1.5 T scanners were used in this study, and we used the first time point if there were multiple images of the same subject acquired at different times. The sagittal images were pre-processed according to a number of steps detailed on the ADNI website, which corrected for field inhomogeneities and image distortion, and were resliced to axial orientation. We visually inspected the grey matter (GM) maps of all subjects (in native space) after tissue segmentation. Images that could not be segmented correctly were removed (the list of removed subjects is in the Supplementary material). The dataset, which we used in the comparisons of feature selection, contained 260 MCI (86 females, 174 males), 188 NC (96 females, 92 males), and 131 AD (63 females, 68 males) (Table 1). The list of subjects used in this manuscript can be found in the Supplementary material.

In addition, a subset of MCI patients who had at least three longitudinal scans (a baseline image, and two subsequent images six and twelve months later) was selected. The converters were defined as subjects who had MMSE scores between 20 and 26 (inclusive), and a CDR of 0.5 or 1.0 in the second or the third scan. The non-converters were defined as subjects who had MMSE scores between 24 and 30 (inclusive), and a CDR of 0.5 in all three scans. These criteria are consistent with the criteria for AD and MCI in ADNI.

2.2. Image pre-processing

All individual T1-weighted images were pre-processed with SPM8 (http://www.fil.ion.ucl.ac.uk/spm) running under a Linux MATLAB 7.10 platform (The MathWorks, Natick, MA, USA).

Table 1. Demographics of the dataset.

                       AD             MCI            NC
Size                   131            261            188
Female/male            63/68          86/174         96/92
Age (mean ± SD)        78.53 ± 4.91   77.92 ± 4.97   76.51 ± 4.58
MMSE* (mean ± SD)      23.24 ± 2.08   26.98 ± 1.81   29.16 ± 0.96
CDR*                   0.5, 1.0       0.5            0
Memory box score       1.0            0.5            0

*MMSE: Mini-Mental State Examination (scores 0–30); AD (20–26); MCI, NC (24–30).
*CDR: Clinical Dementia Rating (0.5 = very mild, 1 = mild, 2 = moderate, 3 = severe).


Before tissue segmentation, the images were re-oriented such that the origin of their coordinate systems was manually set to be close to the anterior commissure. These images were then segmented into grey matter (GM) and white matter (WM) maps in the “realigned” space using the “new segment” toolbox with default settings. “New segmentation” is an extension of the conventional “unified segmentation” (Ashburner and Friston, 2005). It provides more accurate tissue segmentation and deals with voxels outside the brain by including extra tissue probability maps (skull, scalp, etc.). The realigned GM and WM maps were also resampled to 1.5 mm isotropic resolution. Next, the inter-subject alignment was refined using the Dartel toolbox (Ashburner, 2007). Briefly, this involves creating averages of the GM and WM images of all subjects, which are used as an initial template. Images were then warped into closer alignment with this template, and a new template was constructed by averaging the warped images (Ashburner and Friston, 2009). This procedure was repeated for a fixed number of iterations, using all the default settings. The GM images were warped into alignment with the average-shaped template using the estimated deformations. Total GM volume was preserved by multiplying each voxel of the warped GM by the Jacobian determinant of the deformation field (a.k.a. “modulation”). No smoothing was applied, as we found no significant improvement in classification accuracy using whole-brain GM with smoothing; this confirms the findings of a previous study (Kloppel et al., 2008). To avoid possible edge effects between tissue boundaries, we excluded voxels in the GM population template that had values less than 0.2. These masked, warped and modulated images, with 299,477 voxels per image, served as the input features to the following feature selection and classification procedures.

2.3. Support vector machine (SVM)

The support vector machine (SVM) is one of the more popular supervised learning algorithms applied to neuroimaging data in recent years (Craddock et al., 2009; Fan et al., 2005; Kloppel et al., 2008; Mourao-Miranda et al., 2005). The SVM demonstrates good classification performance and is computationally efficient to train with high-dimensional data. It is also known as the maximum margin classifier (Cristianini and Shawe-Taylor, 2000). The framework of the SVM was motivated by statistical learning theory (Vapnik, 1995), in which the decision boundary is selected to maximize the separation between two classes rather than to minimize the empirical loss. It is a sparse method, meaning the decision boundary may be determined by a subset of the training samples, called the support vectors (SVs). Intuitively, the SVM ignores trivial training samples that are far away from the opposite class, and the SVs are the samples closest to the decision boundary (for the hard-margin SVM). The equations of the SVM can be found in many sources, including Wikipedia, so we do not repeat them here. In this study, we used both the hard-margin and the soft-margin SVM implementations in LIBSVM (Chang and Lin, 2001). The difference is that the soft-margin SVM penalizes the training error, and its optimization in the primal form tries to maximize the margin while minimizing the training errors. The free parameter C controls the weighting of the errors; it can also be viewed as the inverse of the regularization strength (smaller C, more regularization, and vice versa).
In the dual form, the objective functions of the soft-margin and hard-margin SVM are the same, except that C is the upper bound of the Lagrange multipliers for the soft-margin SVM. In other words, if the C in the soft-margin SVM is larger than or equal to the largest Lagrange multiplier in the hard-margin SVM, both formulations yield the same solution. This is an important issue, as some studies claim to use a soft-margin SVM although the C value may be large enough to result in hard-margin SVM training. The effect of the C value also depends on the number of features and their scale: when more features are used, or when the scales of the features increase, the C value must decrease to achieve the same regularization. In Table 2, we show the C values used in this study and the corresponding numbers of voxels. The C values are all powers of 2, and the power of the maximum C value (C0) for each corresponding number of voxels was defined by rounding down the base-two logarithm of the maximum Lagrange multiplier obtained from training AD vs. NC on the whole dataset, using the ROI with the corresponding number of voxels. Each subsequent C value was computed by dividing the previous one by four (i.e. C1 = C0/4, C2 = C1/4, etc.).

The SVM is a kernel algorithm, which means the input of the algorithm is a kernel rather than the original input features. For an input matrix X = [x1, x2, …, xn]^T, where each row of X is one sample with d features, the linear kernel matrix is defined by K = XX^T. Alternatively, we can formulate the kernel as a linear combination of sub-kernels, K = K1 + K2 + … + Kd, where each sub-kernel is computed from one feature (voxel) individually. From this perspective, feature selection can be viewed as finding the optimal combination of kernels, i.e. a binary version of multi-kernel learning (MKL).
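To make the kernel formulation concrete, the sketch below (a minimal illustration with toy data, not the authors' code; it uses scikit-learn's SVC in place of LIBSVM) computes the linear kernel K = XX^T, shows that restricting the kernel to a subset of voxels is the same as summing the corresponding sub-kernels, and recovers the largest Lagrange multiplier that anchors the C0 values in Table 2.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))            # toy n x d matrix of GM features
y = np.repeat([1, -1], 20)                # toy labels (patients vs. controls)

K = X @ X.T                               # linear kernel, K = X X^T

# K equals the sum of one sub-kernel per voxel, K = K_1 + ... + K_d, so
# selecting voxels amounts to summing a subset of the sub-kernels:
mask = np.zeros(X.shape[1], dtype=bool)
mask[:100] = True                         # hypothetical selected voxels
K_sel = X[:, mask] @ X[:, mask].T         # = sum of the masked sub-kernels

# Soft-margin SVM on the precomputed kernel; a C above the largest Lagrange
# multiplier makes the soft-margin solution identical to the hard-margin one.
clf = SVC(kernel='precomputed', C=1e6).fit(K, y)
alphas = np.abs(clf.dual_coef_).ravel()   # the alpha_i of the support vectors
print(alphas.max())                       # basis for choosing C0 in Table 2
```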

Table 2. C values and the corresponding feature size.

Number of voxels    C1      C2      C3      C4      C5      C6
Whole brain         2^-12   2^-14   2^-16   2^-18   2^-20   2^-22
82,260              2^-9    2^-11   2^-13   2^-15   2^-17   2^-19
71,229              2^-9    2^-11   2^-13   2^-15   2^-17   2^-19
25,355              2^-7    2^-9    2^-11   2^-13   2^-15   2^-17
12,754              2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
11,031              2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
9,542               2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
7,522               2^-5    2^-7    2^-9    2^-11   2^-13   2^-15
6,463               2^-4    2^-6    2^-8    2^-10   2^-12   2^-14
4,568               2^-3    2^-5    2^-7    2^-9    2^-11   2^-13
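Under the rule described above, the C grid of Table 2 can be reproduced from the largest Lagrange multiplier; the snippet below is a sketch with a hypothetical alpha_max value.

```python
import numpy as np

alpha_max = 13.7                              # hypothetical largest Lagrange multiplier
C0 = 2.0 ** np.floor(np.log2(alpha_max))      # round log2 down to get the maximum C
C_grid = [C0 / 4 ** k for k in range(1, 7)]   # C1..C6, each a quarter of the previous
```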

2.4. Feature selection methods

The goal of feature selection is to select a subset of features (voxels) from the original input features. We applied four different feature selection methods. One is based on prior knowledge about regions of brain atrophy found in previous studies, and two are purely data-driven approaches. We also created a hybrid method, which combines the known parcellation of anatomical regions with the data-driven approach. To compare the feature selection methods, we selected equal numbers of features in all methods at each level of selection, except for the hybrid method. The methods are described in the following:

1. Atlas-based region of interest (ROI). The ROIs are defined by the LONI Probabilistic Brain Atlas (LPBA40) (http://www.loni.ucla.edu/Atlases/LPBA40) (Shattuck et al., 2008), which was generated from anatomical regions delineated manually on the MRI scans of 40 subjects. The delineated structures were all transformed into the atlas space.

Table 3. ROIs selected.

ROI   Full name                                             Abbreviation    Number of voxels
1     Cingulate gyrus                                       CG              12,754
2     Hippocampus                                           HC              4,568
3     Parahippocampal gyrus                                 PHG             6,463
4     Fusiform gyrus                                        FG              9,542
5     Precuneus                                             PCu             7,522
6     Middle temporal gyrus                                 MTG             25,355
7     Temporal lobe                                         TL              71,229
8     Hippocampus + parahippocampal gyrus                   HC + PHG        11,031
9     Temporal lobe + hippocampus + parahippocampal gyrus   TL + HC + PHG   82,260
10    Whole brain                                           All             299,477


Fig. 1. Overlay of the selected ROIs from the affine-registered maximum likelihood map in LPBA40 onto the population grey matter average.

We used the SPM5 space, which is the template space in SPM5. There are a total of 56 structures in the atlas. Most regions are left/right pairs (e.g. left cuneus, right cuneus), except for the cerebellum and brainstem. To generate the parcellation, we converted the LPBA40 atlas into an image of discrete regions, where each voxel was assigned a label corresponding to the most probable of the 56 structures. We combined each left and right structure into one, hence reducing the total number of ROIs to 27. Because the data were spatially normalized into the population average space (see Image pre-processing for details), the maximum likelihood map in LPBA40 was additionally affine-registered to the population average and resampled using nearest-neighbour interpolation. Based on findings from previous studies (Bastos Leite et al., 2004; Chetelat et al., 2002; Frisoni et al., 2002; Pennanen et al., 2005), we selected the nine ROIs listed in Table 3 (see also Fig. 1).

Using knowledge from the literature, we would expect the hippocampus and parahippocampal gyrus to achieve the best performance. Note that the hippocampus defined in LPBA40 includes the amygdala.

2. Univariate t-test filtering (t-test). This is one of the most commonly used approaches. It is equivalent to performing a simple mass-univariate analysis (a statistical parametric map) of the data and thresholding at some value to obtain a mask. A two-sample t-test at each voxel of the training set was performed. Because training should not be informed by the labels of the test set in any way, the test set was excluded from this mass-univariate analysis. We then determined various thresholds for the absolute t-value, based on the number of voxels we intended to select at each level (shown in Table 3). In other words, a smaller set (higher threshold) was always a subset of a larger set (lower threshold). This can also be seen as a combination of kernels computed from voxels in different ranges of absolute t-values, K = K_th1 + K_th2 + … + K_thk; as the threshold is lowered, more kernels are added.


Fig. 2. Classification accuracies vs. sample size for classifying AD & NC and classifying MCI & NC using the whole brain. We also fitted a Von Bertalanffy growth function, f(x) = a × (1 − exp(−b × x)).


3. Averaged absolute t-value in ROIs (t-test + ROI). This method combines the standard univariate t-test filtering with the pre-defined ROIs from the atlas. We first performed mass-univariate t-tests on the training samples, from which we computed the averaged absolute t-value in each of the 27 ROIs defined by LPBA40 (combining left and right structures). The ROIs were ranked according to their averaged absolute t-value. We selected the top 1, 2, 3, 4, and 5 ROIs, giving five levels of inclusion: top one means using the 1st-ranked ROI, top two means the combination of the 1st- and 2nd-ranked ROIs, top three means the combination of the 1st- to 3rd-ranked ROIs, and so on. This was the only method that did not have a fixed number of voxels at each level of selection, because different ROIs have different numbers of voxels. For this method, we only varied the number of combined ROIs.

4. Recursive feature elimination (RFE). Recursive feature elimination is an iterative feature selection algorithm developed specifically for the SVM (Guyon et al., 2002). It is also a popular feature selection method for fMRI data (Craddock et al., 2009; De Martino et al., 2008; Marquand et al., 2010). The original algorithm removes, at each iteration, the one feature whose removal would potentially yield the largest margin increase; for a linear kernel, this is the feature with the lowest absolute feature weight. An approximation is usually made for high-dimensional data, whereby a small set of features is removed at each iteration. In our implementation, we removed 3000 voxels (around 1%) at each iteration, sometimes adjusting this number to obtain the same numbers of features as used by the other methods (i.e. the numbers of voxels in Table 3).
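The two data-driven methods can be sketched as follows (an illustration under our own assumptions, using scipy/scikit-learn equivalents rather than the authors' implementation; X_train and y_train stand in for one training fold):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 2000))   # toy stand-in for the training-fold GM features
y_train = np.repeat([0, 1], 30)

# Method 2: univariate t-test filtering. Rank voxels by |t| on the training
# fold only and keep the n_keep highest, so a smaller set (higher threshold)
# is always nested inside a larger set (lower threshold).
t_vals, _ = ttest_ind(X_train[y_train == 0], X_train[y_train == 1], axis=0)
n_keep = 500
ttest_keep = np.argsort(-np.abs(t_vals))[:n_keep]

# Method 4: recursive feature elimination. Repeatedly refit a linear SVM and
# drop roughly 1% of the features with the smallest absolute weights.
selector = RFE(LinearSVC(C=1.0), n_features_to_select=n_keep, step=0.01)
selector.fit(X_train, y_train)
rfe_keep = np.flatnonzero(selector.support_)
```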


2.5. Cross-validation and resampling

To estimate the classification accuracies with different training sizes, we first selected a subset of subjects from the whole dataset. To reduce the gender confound, we balanced the numbers of males and females in the control and patient groups, such that both groups had equal proportions of males and females. To reduce the age confound, we further matched the ages within each gender group (e.g. matching the ages of male controls and male patients, etc.). We not only matched the mean age, we also tried to match the age distributions of the two populations. To achieve this, we first randomly selected a subset from the group with fewer samples (e.g. there were fewer samples of female NC than of female MCI). We then estimated the distribution of age by smoothing the histogram (bin size of 1 year) with a Gaussian of 3-year FWHM. The new samples were then drawn from the group with more samples using a variation of rejection sampling (Bishop, 2006).
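A sketch of this matching step (our reading of the procedure, with hypothetical helper names and toy ages) smooths the 1-year age histogram of the smaller group with a 3-year-FWHM Gaussian and accepts subjects from the larger group in proportion to that density:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def matched_subset(ages_small, ages_large, n_draw, rng):
    # 1-year bins spanning the larger group's age range
    edges = np.arange(ages_large.min() - 0.5, ages_large.max() + 1.5)
    hist, _ = np.histogram(ages_small, bins=edges, density=True)
    sigma = 3.0 / 2.355                        # FWHM = 2.355 * sigma (bins are 1 year)
    density = gaussian_filter1d(hist, sigma)   # smoothed target age distribution
    # Rejection sampling: accept subject i with probability proportional
    # to the target density at that subject's age.
    idx = np.clip(np.digitize(ages_large, edges) - 1, 0, len(density) - 1)
    p = density[idx] / density.max()
    chosen = []
    while len(chosen) < n_draw:                # assumes n_draw << len(ages_large)
        i = rng.integers(len(ages_large))
        if i not in chosen and rng.random() < p[i]:
            chosen.append(i)
    return np.array(chosen)

rng = np.random.default_rng(2)
idx = matched_subset(rng.normal(76, 5, 80), rng.normal(78, 5, 150), 80, rng)
```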


Fig. 3. Classification accuracies obtained from data-driven feature selection. The dark line is the accuracy obtained using the whole brain (no feature selection). Neither t-test nor RFE achieved significantly better classification than the whole brain. The table of p-values can be found in the Supplementary material. For the t-tests, the accuracies were averaged over 200 resamples; for RFE, over 50 resamples. For comparability, we also averaged the corresponding numbers of resamples for the whole brain, which is why the whole-brain (black) lines differ between (a) and (b), and between (c) and (d).



Fig. 4. The graph on the left shows classification accuracies obtained using different ROIs for classifying AD and NC with different sample sizes (200 resamples). The dark line is the accuracy obtained using the whole brain (no feature selection). The table on the right shows the results of paired t-tests (p < 0.001) of classification accuracies between each ROI and the whole brain: 1 indicates the ROI is significantly better than the whole brain, 0 not significantly better, and −1 that the whole brain is significantly better than that ROI. The rows are different sample sizes and the columns are different ROIs. The tables of the original p-values are in the Supplementary material.

After the subset was selected, we applied leave-one-out cross-validation (loocv) to the subset for the hard-margin SVM. In each cross-validation trial, one sample was left out, and we applied feature selection to the remaining samples. After feature selection, we trained a linear hard-margin SVM and used it to predict the label of the left-out sample. The procedure was repeated until each sample had been left out and tested once, and the loocv accuracy was calculated. To reduce the variance of the estimated accuracy, we resampled the subsets multiple times for each subset size: 200 times for ROI, t-test, and t-test + ROI, and, due to the longer computation times required, only 50 times for RFE. We also varied the size of the subset.


Fig. 5. The graph on the left shows classification accuracies obtained using different ROIs for classifying MCI and NC with different sample sizes (200 resamples). The dark line is the accuracy obtained using the whole brain (no feature selection). The table on the right shows the results of paired t-tests (p < 0.001) between each ROI and the whole brain: 1 indicates the ROI is significantly better than the whole brain, 0 not significantly better, and −1 that the whole brain is significantly better than that ROI. The rows are different sample sizes and the columns are different ROIs. The tables of the original p-values are in the Supplementary material.


Fig. 6. Classification accuracies and the results of paired t-tests (p < 0.001) for the hybrid method (t-test + ROI) for classifying AD & NC. The dark line is the accuracy obtained using the whole brain (no feature selection). ROI ~ 1 means the top ROI, ROI ~ 2 the combination of the top two ROIs, etc. The rows in the table are different sample sizes, and the columns are different numbers of top-ranked ROIs combined.

For classification between AD and NC, the minimum size of the subset was 14 (seven in each group: three female and four male), and the maximum size was 262. The increment between successive subset sizes was eight subjects (two male AD, two female AD, two male NC, and two female NC). However, different feature selection methods had different computational costs, and for methods that required longer computation we omitted some subset sizes. For discrimination between MCI and NC, the subsets ranged from 20 to 356 subjects, with increments of eight; again, the increments varied according to how computationally intensive the feature selection approach was. We also compared the difference between loocv and 10-fold cross-validation: when averaging more than 50 resamples, the results were not significantly different between the two, especially when the sample size was large (details are in the Supplementary material). Because 10-fold cross-validation and loocv differ minimally with large numbers of resamples, we applied 10-fold cross-validation to the soft-margin SVM to accelerate the computation.
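The protocol can be summarized in a short sketch (illustrative names, with t-test filtering as the example selection method): the essential point is that feature selection is re-estimated inside every cross-validation fold, so the held-out subject never informs the selection.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_accuracy(X, y, n_keep=500, C=1e6):      # a very large C ~ hard margin
    correct = 0
    for train, test in LeaveOneOut().split(X):
        # Feature selection uses the training fold only (here: t-test filter).
        t_vals, _ = ttest_ind(X[train][y[train] == 0],
                              X[train][y[train] == 1], axis=0)
        keep = np.argsort(-np.abs(t_vals))[:n_keep]
        clf = SVC(kernel='linear', C=C).fit(X[train][:, keep], y[train])
        correct += int(clf.predict(X[test][:, keep])[0] == y[test][0])
    return correct / len(y)
```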


2.6. Statistical analysis

One of the objectives of this study is to examine whether feature selection can improve the classification accuracies. Because we used the same resampled sets for each sample size across the different feature selection methods, we could compare the loocv accuracies, with and without the use of feature selection, using paired t-tests.
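For example (a sketch with hypothetical per-resample accuracies, using scipy's paired t-test):

```python
from scipy.stats import ttest_rel

acc_whole_brain = [0.81, 0.84, 0.79, 0.83, 0.82]   # hypothetical loocv accuracy per resample
acc_hc_phg      = [0.86, 0.88, 0.84, 0.87, 0.85]   # same resampled subsets, HC + PHG ROI
t_stat, p_value = ttest_rel(acc_hc_phg, acc_whole_brain)
print(t_stat, p_value)
```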


Fig. 7. Classification accuracies and the results of paired t-tests (p < 0.001) for the hybrid method (t-test + ROI) for classifying MCI & NC. The dark line is the accuracy obtained using the whole brain (no feature selection). ROI ~ 1 means the top ROI, ROI ~ 2 the combination of the top-ranking two ROIs, etc. The rows in the table are different sample sizes, and the columns are different numbers of top-ranked ROIs combined.


3. Results

Regardless of the feature selection method, larger sample sizes yielded better performance. For classifying AD and NC, the whole brain without feature selection achieved an average accuracy of 84.3% at the maximum training size of 262, with no sign of reaching a plateau. We fitted a Von Bertalanffy growth function, f(x) = a × (1 − exp(−b × x)), to the accuracy-vs-sample-size data points and found that, at larger sample sizes, the actual accuracies were higher than the fitted curve (see Fig. 2). For classifying MCI and NC, the whole brain without feature selection achieved an average accuracy of 67.3% at the maximum training size of 356. We also fitted this accuracy curve with a Von Bertalanffy function and found, as for AD vs. NC, that the actual accuracies were higher than the fitted curve at large training sizes (see Fig. 2).
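The growth-function fit described above can be reproduced with a sketch like the following (illustrative data points, not the study's values):

```python
import numpy as np
from scipy.optimize import curve_fit

def von_bertalanffy(x, a, b):
    return a * (1.0 - np.exp(-b * x))

sizes = np.array([14.0, 62, 110, 158, 206, 262])       # training set sizes
acc = np.array([0.62, 0.74, 0.78, 0.81, 0.83, 0.843])  # illustrative accuracies
(a, b), _ = curve_fit(von_bertalanffy, sizes, acc, p0=(0.9, 0.01))
print(a, b, von_bertalanffy(500.0, a, b))              # extrapolation to n = 500
```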


None of the data-driven feature selection methods (t-test filtering and RFE) achieved significantly superior performance to the whole brain (no feature selection) for classifying MCI and NC or AD and NC (see Fig. 3). On the other hand, the atlas-based ROI method showed significantly (p < 0.001) better accuracies than the whole brain using the parahippocampal gyrus (PHG) ROI or the parahippocampus + hippocampus (PHG + HC) ROIs, for classifying both MCI vs. NC and AD vs. NC. On average, the use of PHG + HC was better than PHG alone (see Fig. 4). For classifying AD and NC, HC alone was also significantly better than the whole brain when the training size was less than or equal to 166. The results were similar for classifying MCI and NC: when the training size was less than or equal to 76, HC was significantly better (see Fig. 5). Interestingly, temporal lobe + HC + PHG resulted in the best accuracies (p < 0.001) when the training size was larger than 300, and its advantage over no feature selection increased further as the training size increased. Although other regions yielded accuracies higher than chance, their accuracies were either the same as or worse than those obtained using the whole brain.

For classifying AD and NC, combining t-test filtering and atlas-based ROIs resulted in significantly better accuracies for all five levels of inclusion when the sample sizes were small (see Fig. 6). As the sample size grew, smaller numbers of ROIs achieved better accuracies: when the sample size was over 94, using only the top-ranked ROI achieved better accuracies, and including more ROIs reduced them. The most frequently top-ranked ROI was the hippocampus. The results were the opposite for classifying MCI and NC, where the classification accuracies were only significantly better when the sample size was over 260 and five levels of inclusion were employed (see Fig. 7). The tables of the original p-values are in the Supplementary material. Versions of these figures with error bars may also be found in the Supplementary material.

We also used the soft-margin SVM with different levels of regularization (i.e. different C) with feature selection. In general, lowering C (i.e. increasing regularization) did not improve the accuracies for most ROIs, except for those that already yielded better results than the whole brain, e.g. HC, PHG, and HC + PHG (see Fig. 8). The improvement in accuracy was most prominent for MCI vs. NC with larger sample sizes; HC became significantly better than the whole brain using the optimal C value. For t-test and RFE, the best accuracies with the optimal C were still not significantly better than the whole brain with the hard-margin SVM (see Fig. 9).

The ROIs that yielded better classification accuracy than the whole brain for classifying MCI converters vs. MCI non-converters were similar to those for AD vs. NC, except that FG was also significantly better than the whole brain when the sample sizes were small (see Fig. 10).


Fig. 8. Classification accuracies obtained using different ROIs for classifying AD & NC and MCI & NC with different sample sizes (200 resamples), using the soft-margin SVM. The dark line is the accuracy obtained using the whole brain (no feature selection). C* indicates a C value large enough to be identical to a hard-margin SVM. The exact C values for the corresponding feature sizes can be found in Table 2.



Fig. 9. Classification accuracies obtained using data-driven feature selection (RFE and t-test) for classifying AD & NC and MCI & NC with different sample sizes (200 resamples for t-test, 50 for RFE), using the soft-margin SVM. The dark line is the accuracy obtained using the whole brain (no feature selection). C* indicates a C value large enough to be identical to a hard-margin SVM; the exact C values for the corresponding feature sizes can be found in Table 2. Different lines show different feature selections: the prefix R indicates RFE, the prefix t indicates t-test, and the number is the number of features selected.

Overall, the classification accuracies were very low.

In addition, we compared the features selected by the two data-driven techniques (t-test and RFE). The features selected by the two methods were very different. For illustrative purposes, we generated the frequency maps of the features selected across the resamples when classifying AD and NC with 11,031 features for both methods (see Fig. 11). The correlation between the frequency maps is corr(t-test, RFE) = 0.31. The t-test selected features consistently across resampling trials, with most features in the HC, PHG, and TL. RFE produced a more distributed selection, with the selected features scattered across the whole brain, although features in the HC were also consistently selected.
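The frequency-map comparison can be sketched as follows (the selection masks here are random placeholders standing in for the per-resample t-test and RFE outputs):

```python
import numpy as np

n_resamples, n_voxels = 50, 2000
rng = np.random.default_rng(3)
masks_ttest = rng.random((n_resamples, n_voxels)) < 0.1   # placeholder selection masks
masks_rfe = rng.random((n_resamples, n_voxels)) < 0.1

freq_ttest = masks_ttest.mean(axis=0)                     # per-voxel selection frequency
freq_rfe = masks_rfe.mean(axis=0)
r = np.corrcoef(freq_ttest, freq_rfe)[0, 1]               # the paper reports r = 0.31
print(r)
```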

4. Discussion

Generally speaking, the classification accuracies at larger sample sizes were higher than the estimates from fitting a Von Bertalanffy function. This implies that the improvement in classification accuracy with increasing sample size plateaus more slowly than the exponential growth function. If the accuracies keep following the projection of the linear fit, prediction accuracies of 90% for separating AD and NC may be achieved with around 500 training samples. However, the main focus of this study was comparing the performance with and without the use of feature selection. From the results, we found that appropriate methods of feature selection improved the classification accuracies regardless of the sample size. However, data-driven methods without prior knowledge either did not improve the accuracies or made them worse. We were surprised to see such results, and speculate that the training samples contained no additional information that those feature selection methods could extract but the SVM could not, especially when the features were highly correlated. In our dataset, there may be hardly any non-informative voxels, only less-informative ones. The information encoded in the multivariate patterns also increases the difficulty of ranking features. Nevertheless, data-driven approaches could still remove the less-informative features and retain the most-informative voxels without any prior knowledge. This may be useful for those who rely on feature-weight maps for interpretation, because feature selection reduces the chance of interpreting unreliable voxels in the weight map.

In general, as the training size increased, the difference between the accuracies with and without feature selection was reduced. This implies that no method of feature selection will improve the accuracies when the training size is sufficient. In the case of classifying AD and NC, because of the higher contrast between the two classes, the required number of samples is small. On the other hand, the contrast between MCI and NC is weaker and the patterns are less homogeneous (i.e. a larger effective dimensionality), so more samples are required before the whole brain can achieve the same accuracies as specific ROIs; in this study, even 356 samples were not sufficient.

Making use of prior knowledge seemed to yield the best accuracies. However, this approach depends on the reliability of the information used.



Fig. 10. The graph on the top shows classification accuracies obtained using different ROIs for classifying MCI converters and non-converters with different sample sizes (200 resamples). The dark line is the accuracy obtained using the whole brain (no feature selection). The table on the bottom shows the results of paired t-tests (p < 0.001) between each ROI and the whole brain: 1 indicates the ROI is significantly better than the whole brain, 0 not significantly better, and −1 that the whole brain is significantly better. The rows are different sample sizes and the columns are different ROIs.

For example, although functional studies (Ikonomovic et al., 2011) show a strong difference between patterns of brain activity in AD and NC in the precuneus, the use of the precuneus ROI (GM density) resulted in very poor classification accuracies. Also, the best-known marker of brain atrophy in AD is the HC, but our results showed that combining HC and PHG resulted in higher accuracies than either of them alone. This implies that covariance between the information encoded in different ROIs may help classification. This effect makes selecting the optimal combination of ROIs difficult, and we can only claim that the selected ROIs are sub-optimal. Perhaps the best method of feature selection is cross-validation with pre-defined ROIs. From our results, we found that the ranking of the classification accuracies of the ROIs was quite consistent across different sample sizes, especially for classifying AD and NC. Although we only used a small validation set, we should still have a high chance of identifying the best ROI through cross-validation. Alternatively, we can use the hybrid method (t-test + ROI), which uses the ROIs as spatial constraints and the t-test to rank features. The hybrid method did show significant, but small, improvements in classification accuracy for NC vs. AD and NC vs. MCI (Figs. 6 and 7). However, the ROIs selected in each resample were not always consistent, although the hippocampus ROI was often ranked top.

The choice of hard-margin or soft-margin SVM did not affect our conclusions, as feature selection methods and ROIs that did not yield better solutions still did not perform better after optimizing the soft-margin SVM. In practice, people often use three-way-split cross-validation to estimate the test accuracies, determining C with cross-validation in the inner loops. This procedure would greatly increase the computation. We took a pragmatic approach and presented the accuracies for all the C values we tested. The best accuracy obtained from one of these C values is an optimistic measure, i.e. higher than the testing accuracy of three-way-split CV. Because these optimistic accuracies, which make our comparison more conservative, did not change our conclusions, we did not use the time-consuming three-way-split CV.

Fig. 11. The frequency maps of features selected across the resamples for classifying AD and NC with 11,031 features, for both t-test and RFE.


In this study, we applied feature selection as a procedure separate from the classifier. There are also classifiers with a built-in capability for feature selection during training. These include lasso regression (Tibshirani, 1996), the elastic net (Zou and Hastie, 2005), automatic relevance determination (Chu et al., 2010; MacKay, 1995; Tipping, 2001), and other sparse models (Carroll et al., 2009; Grosenick et al., 2008; Ryali et al., 2010; Yamashita et al., 2008). However, those methods would require a great deal of memory to select features from 300,000 voxels, which is often impractical. In practice, efficient methods of feature selection are often applied first to eliminate the majority of less-informative features, before applying the classifiers that can select features during training.

In conclusion, our results confirmed the previous finding (Cuingnet et al., 2010) that feature selection did not improve performance when no prior knowledge was used. That is to say, feature selection does improve classification performance, but only if the right methods and the right prior knowledge are used. Although we demonstrated the effect using anatomical MRI, our experience in the Pittsburgh brain activity interpretation competition (PBAIC 2007) (Chu et al., 2011) led to the same conclusion: teams using data-driven feature selection to select fMRI features did not perform as well in the competition as teams that applied prior knowledge in their feature selection.

Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH, NIMH, the National Science Council of Taiwan (NSC 100-2628-E-010-002-MY3, NSC 100-2627-B-010-004-) and the National Health Research Institute (NHRI-EX100-9813EC). JA is funded by the Wellcome Trust. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and through generous contributions from the following: Abbott, AstraZeneca AB, Bayer Schering Pharma AG, Bristol-Myers Squibb, Eisai Global Clinical Development, Elan Corporation, Genentech, GE Healthcare, GlaxoSmithKline, Innogenetics, Johnson and Johnson, Eli Lilly and Co., Medpace, Inc., Merck and Co., Inc., Novartis AG, Pfizer Inc., F. Hoffman-La Roche, Schering-Plough, Synarc, Inc., as well as non-profit partners the Alzheimer's Association and the Alzheimer's Drug Discovery Foundation, with participation from the U.S. Food and Drug Administration. Private sector contributions to ADNI are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.neuroimage.2011.11.066.

References

Ashburner, J., 2007. A fast diffeomorphic image registration algorithm. NeuroImage 38, 95–113.
Ashburner, J., Friston, K.J., 2005. Unified segmentation. NeuroImage 26, 839–851.
Ashburner, J., Friston, K.J., 2009. Computing average shaped tissue probability templates. NeuroImage 45, 333–341.


Ashburner, J., Kloppel, S., 2011. Multivariate models of inter-subject anatomical variability. NeuroImage 56, 422–439.
Bastos Leite, A.J., Scheltens, P., Barkhof, F., 2004. Pathological aging of the brain: an overview. Top. Magn. Reson. Imaging 15, 369–389.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Carroll, M.K., Cecchi, G.A., Rish, I., Garg, R., Rao, A.R., 2009. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage 44, 112–122.
Chang, C.C., Lin, C.J., 2001. LIBSVM: a library for support vector machines.
Chetelat, G., Desgranges, B., De La Sayette, V., Viader, F., Eustache, F., Baron, J.C., 2002. Mapping gray matter loss with voxel-based morphometry in mild cognitive impairment. Neuroreport 13, 1939–1943.
Chu, C., Bandettini, P., Ashburner, J., Marquand, A., Kloeppel, S., 2010. Classification of neurodegenerative diseases using Gaussian process classification with automatic feature determination. ICPR 2010 First Workshop on Brain Decoding. IEEE, pp. 17–20.
Chu, C., Ni, Y., Tan, G., Saunders, C.J., Ashburner, J., 2011. Kernel regression for fMRI pattern prediction. NeuroImage 56, 662–673.
Chu, C.Y.C., 2009. PhD Thesis: Pattern recognition and machine learning for magnetic resonance images with kernel methods. Wellcome Trust Centre for Neuroimaging, University College London, London.
Costafreda, S.G., Chu, C., Ashburner, J., Fu, C.H., 2009. Prognostic and diagnostic potential of the structural neuroanatomy of depression. PLoS One 4, e6353.
Craddock, R.C., Holtzheimer III, P.E., Hu, X.P., Mayberg, H.S., 2009. Disease state prediction from resting state functional connectivity. Magn. Reson. Med. 62, 1619–1628.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
Cuingnet, R., Gerardin, E., Tessieras, J., Auzias, G., Lehericy, S., Habert, M.O., Chupin, M., Benali, H., Colliot, O., and the Alzheimer's Disease Neuroimaging Initiative, 2010. Automatic classification of patients with Alzheimer's disease from structural MRI: a comparison of ten methods using the ADNI database. NeuroImage 56 (2), 776–781.
De Martino, F., Valente, G., Staeren, N., Ashburner, J., Goebel, R., Formisano, E., 2008. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns. NeuroImage 43, 44–58.
Eger, E., Ashburner, J., Haynes, J.D., Dolan, R.J., Rees, G., 2008. fMRI activity patterns in human LOC carry information about object exemplars within category. J. Cogn. Neurosci. 20, 356–370.
Fan, Y., Rao, H., Hurt, H., Giannetta, J., Korczykowski, M., Shera, D., Avants, B.B., Gee, J.C., Wang, J., Shen, D., 2007. Multivariate examination of brain abnormality using both structural and functional MRI. NeuroImage 36, 1189–1199.
Fan, Y., Shen, D., Davatzikos, C., 2005. Classification of structural images via high-dimensional image warping, robust feature extraction, and SVM. Med. Image Comput. Comput. Assist. Interv. 8, 1–8.
Filipovych, R., Davatzikos, C., 2011. Semi-supervised pattern classification of medical images: application to mild cognitive impairment (MCI). NeuroImage 55, 1109–1119.
Forster, M.R., 2002. Predictive accuracy as an achievable goal of science. Philos. Sci. 69, S124–S134.
Frisoni, G.B., Testa, C., Zorzan, A., Sabattoli, F., Beltramello, A., Soininen, H., Laakso, M.P., 2002. Detection of grey matter loss in mild Alzheimer's disease with voxel based morphometry. J. Neurol. Neurosurg. Psychiatry 73, 657–664.
Gerardin, E., Chetelat, G., Chupin, M., Cuingnet, R., Desgranges, B., Kim, H.S., Niethammer, M., Dubois, B., Lehericy, S., Garnero, L., Eustache, F., Colliot, O., 2009. Multidimensional classification of hippocampal shape features discriminates Alzheimer's disease and mild cognitive impairment from normal aging. NeuroImage 47, 1476–1486.
Grosenick, L., Greer, S., Knutson, B., 2008. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Trans. Neural Syst. Rehabil. Eng. 16, 539–548.
Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.
Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422.
Ikonomovic, M.D., Klunk, W.E., Abrahamson, E.E., Wuu, J., Mathis, C.A., Scheff, S.W., Mufson, E.J., Dekosky, S.T., 2011. Precuneus amyloid burden is associated with reduced cholinergic activity in Alzheimer disease. Neurology 77, 39–47.
Kloppel, S., Chu, C., Tan, G.C., Draganski, B., Johnson, H., Paulsen, J.S., Kienzle, W., Tabrizi, S.J., Ashburner, J., Frackowiak, R.S., 2009. Automatic detection of preclinical neurodegeneration: presymptomatic Huntington disease. Neurology 72, 426–431.
Kloppel, S., Stonnington, C.M., Chu, C., Draganski, B., Scahill, R.I., Rohrer, J.D., Fox, N.C., Jack Jr., C.R., Ashburner, J., Frackowiak, R.S., 2008. Automatic classification of MR scans in Alzheimer's disease. Brain 131, 681–689.
MacKay, D.J.C., 1995. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6, 469–505.
Marquand, A.F., De Simoni, S., O'Daly, O.G., Mehta, M.A., Mourao-Miranda, J., 2010. Quantifying the information content of brain voxels using target information. ICPR First Workshop on Brain Decoding. IEEE, Istanbul, pp. 13–16.
Marquand, A.F., De Simoni, S., O'Daly, O.G., Williams, S.C., Mourao-Miranda, J., Mehta, M.A., 2011. Pattern classification of working memory networks reveals differential effects of methylphenidate, atomoxetine, and placebo in healthy volunteers. Neuropsychopharmacology 36, 1237–1247.
Mourao-Miranda, J., Bokde, A.L., Born, C., Hampel, H., Stetter, M., 2005. Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. NeuroImage 28, 980–995.
Pelaez-Coca, M., Bossa, M., Olmos, S., 2010. Discrimination of AD and normal subjects from MRI: anatomical versus statistical regions. Neurosci. Lett. 487, 113–117.


Pelaez-Coca, M., Bossa, M., Olmos, S., 2011. Discrimination of AD and normal subjects from MRI: anatomical versus statistical regions. Neurosci. Lett. 487, 113–117.
Pennanen, C., Testa, C., Laakso, M.P., Hallikainen, M., Helkala, E.L., Hanninen, T., Kivipelto, M., Kononen, M., Nissinen, A., Tervo, S., Vanhanen, M., Vanninen, R., Frisoni, G.B., Soininen, H., 2005. A voxel based morphometry study on mild cognitive impairment. J. Neurol. Neurosurg. Psychiatry 76, 11–14.
Ryali, S., Supekar, K., Abrams, D.A., Menon, V., 2010. Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage 51, 752–764.
Saeys, Y., Inza, I., Larranaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507.
Shattuck, D.W., Mirza, M., Adisetiyo, V., Hojatkashani, C., Salamon, G., Narr, K.L., Poldrack, R.A., Bilder, R.M., Toga, A.W., 2008. Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage 39, 1064–1080.
Stonnington, C.M., Chu, C., Kloppel, S., Jack Jr., C.R., Ashburner, J., Frackowiak, R.S., 2010. Predicting clinical scores from magnetic resonance scans in Alzheimer's disease. NeuroImage 51, 1405–1413.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological) 58, 267–288.
Tipping, M.E., 2001. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, NY.
Yamashita, O., Sato, M., Yoshioka, T., Tong, F., Kamitani, Y., 2008. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage 42, 1414–1429.
Yang, W., Lui, R.L., Gao, J.H., Chan, T.F., Yau, S.T., Sperling, R.A., Huang, X., 2011. Independent component analysis-based classification of Alzheimer's disease MRI data. J. Alzheimers Dis.
Zhang, D., Wang, Y., Zhou, L., Yuan, H., Shen, D., 2011. Multimodal classification of Alzheimer's disease and mild cognitive impairment. NeuroImage 55, 856–867.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Statistical Methodology) 67, 301–320.