Journal of Neuroscience Methods 212 (2013) 259–268
Computational Neuroscience
Decoding brain states using backward edge elimination and graph kernels in fMRI connectivity networks

Fatemeh Mokhtari a, Gholam-Ali Hossein-Zadeh a,b,∗

a Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, University College of Engineering, University of Tehran, Tehran 14395-515, Iran
b School of Cognitive Science, Institute for Fundamental Research (IPM), Tehran, Iran
Highlights

• We design a classifier to decode brain states from fMRI connectivity graphs.
• The classifier is able to perform classification on graphs using the shortest-path kernel.
• The classifier is able to discover discriminant connectivities in classification.
• The shortest-path kernel outperforms other kernels in terms of prediction accuracy.
Article info

Article history: Received 26 February 2012; Received in revised form 24 October 2012; Accepted 25 October 2012

Keywords: Decoding brain states; fMRI; Connectivity graphs; Graph kernels; Backward edge elimination
Abstract

In the current study, we present a new approach for decoding brain states based on connectivity graphs extracted from functional magnetic resonance imaging (fMRI) data. fMRI connectivity graphs are constructed in different brain states and fed into an iterative support vector classifier that is enriched by the shortest-path kernel. The classifier prunes insignificant edges from the graphs via a backward edge elimination procedure. The iteration in which maximum classification performance occurs is considered the optimum iteration; the edges and nodes that survive in the optimum iteration form the discriminant networks between states. We apply the "one-versus-one" approach to extend the proposed method into a multi-class classifier. This classifier is used to distinguish between five cognitive brain states from blocked-design fMRI data: (1) fixation, (2) detection of a single stimulus, (3) perceptual matching, (4) attentional cueing, and (5) delayed match-to-sample. The proposed method achieves a multi-class classification accuracy of 86.32%. The posterior cingulate cortex is identified as a hub in the networks that separate fixation from the tasks; the superior parietal lobe has the same role in distinguishing between different tasks. Connectivity between the right retrosplenial cortex and the superior parietal lobe contributes to the discrimination in both the fixation-task and task-task cases.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Functional magnetic resonance imaging (fMRI) is widely used to explore brain function, and it has recently been used for decoding brain states (Ethofer et al., 2009; Kühn et al., 2010; LaConte et al., 2007; Richiardi et al., 2011; Sitaram et al., 2011). Brain decoding approaches try to gain insight into the encoding of cognitive states through an inverse approach. If information about cognitive states or external stimuli can be uncovered from fMRI
∗ Corresponding author at: School of Electrical and Computer Engineering (ECE), University College of Engineering, University of Tehran, North Kargar Avenue, P.O. Box 14395-515, Tehran, Iran. Tel.: +98 21 8208 4178; fax: +98 21 88778690. E-mail address: [email protected] (G.-A. Hossein-Zadeh).

0165-0270/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.jneumeth.2012.10.012
time series measurements in an interpretable manner, a model of information encoding in the brain may be reconstructed. In decoding techniques, classification approaches are employed to predict brain states (or stimuli) from fMRI data (Kühn et al., 2010; LaConte et al., 2007; Richiardi et al., 2011, 2010; Sitaram et al., 2011).

fMRI studies of brain decoding pose a high-dimensional learning problem: the number of dimensions (number of regions/connections) is much higher than the number of training samples (number of scanning sessions). Hence, an adequate feature selection approach and classifier are required to handle this situation. Although various classifiers have been applied in brain decoding studies, including naive Bayes, the multilayer perceptron (MLP) (Richiardi et al., 2010), and decision trees (Richiardi et al., 2011), these studies have primarily relied on support vector machines (SVMs) (Ethofer et al., 2009; Kühn et al., 2010; LaConte et al., 2007;
Sitaram et al., 2011), which use a soft-margin hyperplane to separate the classes. The SVM is a kernel-based technique; consequently, it can learn adequately from a small training set and has good generalization performance (Boser et al., 1992; Cortes and Vapnik, 1995; Vapnik, 1963).

As a common part of a classification problem, feature selection approaches aim to select relevant features and eliminate irrelevant ones in order to improve the generalization ability of a learning algorithm. These approaches are mainly categorized into three classes: (1) filter techniques, such as correlation-based approaches; (2) wrapper techniques, such as sequential backward elimination; and (3) embedded techniques, such as approaches that use the SVM to eliminate irrelevant features. Embedded techniques select features in interaction with classifier training and are less computationally intensive than wrapper methods (Saeys et al., 2007).

In addition to their application in pattern classification, there has recently been growing interest in using SVMs for feature selection. Relief is a filter method used in conjunction with the SVM for feature selection (Robnik-Šikonja and Kononenko, 2003). The method in Weston et al. (2001) is based upon finding a subset of features which minimizes the generalization error using a gradient descent approach. In Guyon et al. (2002), a wrapper algorithm called SVM recursive feature elimination (SVM-RFE) is designed and used for gene selection. As a generalization of the SVM-RFE algorithm, Rakotomamonjy (2003) investigates the feature selection problem using various SVM-based criteria. A feature selection approach based on a Bayesian interpretation of SVMs is presented by Gold et al. (2005), in which relevant input features are selected via automatic relevance determination. Chapelle and Keerthi (2008) develop an embedded method based on scaling factors that addresses the simultaneous feature selection problem for a multiclass SVM.
Among these techniques, we use the SVM-RFE algorithm, which achieved an error rate of zero on benchmark gene collections and selected genes biologically relevant to cancer in Guyon et al. (2002). The algorithm has been extended to kernel methods, and it is a straightforward technique that offers significant advantages for identifying discriminant features of brain states.

In most fMRI studies of brain decoding, fluctuations in the intensity of fMR images have been used as informative features for classification. The temporal correlation between the neuronal activities of anatomically remote brain regions, namely functional connectivity, has become an important measure of information integration in the neuronal system. Based on the communication-through-coherence (CTC) hypothesis, neuronal groups oscillating in synchrony are likely to interact more efficiently than uncorrelated neuronal populations (Fries, 2005). Accordingly, depending on the computations required for a specific cognitive state or for processing an external stimulus, different patterns of coherent activity emerge in a dynamic neuronal system. In this study, therefore, we hypothesize that the pattern of functional interactions in the brain may be more informative than the level of activation for decoding cognitive states.

In order to apply multivariate pattern analysis (MVPA) methods to graphs, one of the most important issues is to obtain an overall similarity score between sample graphs. This issue is referred to as the "graph comparison" problem in the literature. Traditional graph comparison methods suffer from inherent problems such as computational complexity (Bunke and Allermann, 1983; Chung, 1997). Some of them use summarized representations of graphs, which leads to a loss of topological information (Todeschini and Consonni, 2008; Wiener, 1947).
Graph kernel methods are among the most recent approaches introduced into the graph comparison field to overcome the above-mentioned problems of traditional algorithms. In these methods, graphs are represented in a higher-dimensional feature space of graph substructures; the methods then check for isomorphic substructures and obtain a similarity score between graphs. Graph kernel methods reduce the computational cost while considering the information of the topology as well as the node and edge labels of the graphs. Their importance, especially in our case, originates from the fact that defining graph kernels makes a large number of MVPA methods, such as support vector classification, applicable to graphs (Borgwardt, 2007).

The capability of fMRI connectivity graphs for distinguishing between various cognitive states has recently been examined (Richiardi et al., 2011, 2010; Shirer et al., 2012), with promising results on predicting brain states from connectivity graphs. The approach of Richiardi et al. (2010) uses two-dimensional singular value decomposition (2DSVD) to represent fMRI connectivity graphs in a low-dimensional space, and then embeds the dimension-reduced graphs into vectors which are used as feature vectors for distinguishing the resting state from a cognitive task. Since the features (principal components) extracted by 2DSVD are naturally different from fMRI connectivities, this method cannot identify the connections that produce the discrimination between different states. Also, embedding from graph space into a vector space may ignore topological features of the connectivity graphs. Richiardi et al. (2011) used the wavelet transform to decompose fMRI time series into different frequency subbands. Functional connectivity graphs are constructed at the different subbands and then embedded into vectors that are fed into decision tree classifiers to distinguish the resting state from a cognitive task. The authors reported that the classification performance of the low-frequency subbands surpasses that of the high-frequency subbands; such findings were seen as hints at the presence of consistent networks in the low-frequency subbands. In Shirer et al.
(2012), a group-level connectivity matrix (called the state matrix) is obtained for each brain state. In the state matrix, all connections which are significant among subjects are kept and the others are set to zero. Connectivity matrices are matched to each of the state matrices through a fit score, and decoding is performed based on the best fit score. This approach thus uses a univariate procedure to identify significant connections and performs it in isolation from classifier training, so it does not guarantee that the optimum separation of the data can be found.

In the present study, we propose an approach that has two notable characteristics compared to the previous studies which use fMRI connectivity graphs for brain decoding (Richiardi et al., 2011, 2010; Shirer et al., 2012). The proposed classifier is an iterative support vector classifier that is able to perform classification on fMRI connectivity graphs using graph kernels. As previously mentioned, graph kernels respect the information represented by the topology as well as the node and edge labels in order to calculate a similarity score between two graphs. As the second characteristic, the classifier takes advantage of the SVM to reveal the fMRI connectivities which create the discrimination between different states. Therefore, our approach provides the ability to interpret the results, which is an important issue in functional imaging studies. Our classification approach, which we refer to as backward edge elimination (BEE), removes insignificant edges based on the weights of the SVM hyperplane. In other words, classifier training includes a specific objective function that selects a subset of features minimizing the prediction error. Our approach also goes beyond 2-state decoding: we use the "one-versus-one" approach to construct a multi-class classifier from the proposed classifier.
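As an illustration of the one-versus-one scheme, the voting step can be sketched in a few lines. The stub decision functions below are purely hypothetical stand-ins for trained 2-class classifiers, used only to show the mechanics:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(binary_classifiers, sample):
    """One-versus-one voting: each 2-class classifier casts one vote,
    and the class with the most votes wins.

    binary_classifiers maps each class pair (a, b) to a decision
    function returning either a or b for the given sample.
    """
    votes = Counter(clf(sample) for clf in binary_classifiers.values())
    return votes.most_common(1)[0][0]

# Hypothetical stubs for C = 3 classes: C(C - 1)/2 = 3 pairwise classifiers
# that vote for class `a` when the (toy scalar) sample is below 3.
clfs = {(a, b): (lambda s, a=a, b=b: a if s < 3 else b)
        for a, b in combinations([0, 1, 2], 2)}
```

For a toy sample of 5, the pairs (0, 1), (0, 2) and (1, 2) vote 1, 2 and 2, respectively, so class 2 receives the most votes.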
The results and the proposed algorithm can be helpful for studying brain states and the abnormalities which disrupt the normal function of the brain during certain cognitive states.
In the next section, the theory of SVM and graph kernels is explained. In Section 3, characteristics of the fMRI data and the proposed method are described in detail. In Section 4, the results of our method are presented. A discussion and conclusion are given in Sections 5 and 6, respectively.
2. Theory

2.1. Support vector machines

Assume that we deal with a set of data points X = {x_1, x_2, ..., x_l}, x_i ∈ R^m, in order to distinguish the class y = 1 from the class y = −1. Considering Fig. 1(a), the classification decision function can be written as

f(x_i) = sign(⟨w, x_i⟩ + b), ∀i ∈ {1, 2, ..., l}    (1)

where w ∈ R^m is the vector of weights assigned to the features, b ∈ R is the bias term, and ⟨w, x_i⟩ denotes the dot product between the vectors w and x_i. The main idea of the hard-margin SVM is to fit a maximum-margin hyperplane, which has the maximum distance to the nearest points of either class (Vapnik, 1963). Based on Fig. 1(a), for such a hyperplane to exist between the two classes, the following inequality has to be satisfied:

y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ {1, 2, ..., l}    (2)

Therefore, the problem of fitting the maximum-margin hyperplane can be represented as

minimize_{w,b} ‖w‖²/2  subject to  g(x_i) = y_i(⟨w, x_i⟩ + b) − 1 ≥ 0, ∀i ∈ {1, 2, ..., l}    (3)

Optimization with the Lagrangian method leads to the following relation for w:

w = Σ_{i=1}^{l} α_i y_i x_i    (4)

where α = (α_1, α_2, ..., α_l) is the set of Lagrange multipliers of the training patterns, with α_i ≥ 0.

Measurement errors and noise in experimental data usually lead to low generalizability of the designed classifier. To overcome this limitation and enhance the generalization performance of the primary SVM, soft-margin SVMs have been introduced (Cortes and Vapnik, 1995). They fit the maximum-margin hyperplane while allowing a classification error for some data points that lie close to the margin (Fig. 1(b)). For example, the C-SVM introduces non-negative slack variables ξ_i and a penalty factor C into the optimization problem, changing it into

minimize_{w,b,ξ} ‖w‖²/2 + C Σ_{i=1}^{l} ξ_i  subject to  g(x_i) = y_i(⟨w, x_i⟩ + b) − 1 ≥ −ξ_i, ∀i ∈ {1, 2, ..., l}    (5)

Therefore, the C-SVM allows for margin errors by penalizing them in proportion to their violation of the condition y_i(⟨w, x_i⟩ + b) ≥ 1.

In classification problems containing non-linearly separable classes, even soft-margin SVMs cannot separate the classes accurately (Fig. 1(c)). The kernel trick has been proposed to overcome this difficulty (Boser et al., 1992). The basic idea of the kernel trick is to find a transformation φ: R^m → H from the input space R^m into a usually higher-dimensional feature space H (a Hilbert space of finite or infinite dimension) where the two classes can be separated linearly. The kernel function is then defined as

k(x, x_i) = ⟨φ(x), φ(x_i)⟩    (6)

Using the kernel trick converts the classification decision function into the following form:

f(x) = sign(⟨w, φ(x)⟩ + b) = sign(Σ_{i=1}^{l} y_i α_i ⟨φ(x), φ(x_i)⟩ + b) = sign(Σ_{i=1}^{l} y_i α_i k(x, x_i) + b)    (7)

In the classification decision function, every dot product is thus replaced by a nonlinear kernel function. This provides the SVM with the ability to construct the maximum-margin hyperplane in a transformed feature space. Hence, although the SVM fits a hyperplane in the high-dimensional feature space, the decision boundary may be nonlinear in the original input space.

2.2. SVM recursive feature elimination

In Guyon et al. (2002), a feature selection algorithm is devised that eliminates irrelevant features using a recursive SVM classifier (SVM-RFE). After training the SVM and obtaining the weight vector w, SVM-RFE removes the single feature with the smallest value of the weight (w_i)². This procedure proceeds iteratively until all features are ranked. Hence, the ranking criterion R_c(i) for the ith feature of the data points is defined as

R_c(i) = (w_i)²    (8)

In Guyon et al. (2002), a generalization of the SVM-RFE algorithm to kernel methods has been introduced as well. In the generalized approach, the candidate feature for removal is the feature whose removal causes the smallest change in the magnitude of the weight vector (‖w‖²). Therefore, in this case, the ranking criterion R_c(i) for the ith feature of the data points is redefined as

R_c(i) = | ‖w‖² − ‖w^(i)‖² | = (1/2) | Σ_{k,j} α_k α_j y_k y_j K(x_k, x_j) − Σ_{k,j} α_k^(i) α_j^(i) y_k y_j K^(i)(x_k, x_j) |    (9)

where K^(i) is the Gram matrix of the training data when feature i is removed, α_k^(i) is the corresponding solution of Eq. (3), and w^(i) is the weight vector in that situation. In Guyon et al. (2002), the authors state that, in order to reduce the computational runtime of the algorithm, α_k^(i) is assumed to be equal to α_k.

2.3. Graph kernels
In many real-world problems, objects can be naturally described as networks of interacting components. Graph representation is a general approach for modeling such objects. However, in pattern recognition studies, there have always been difficulties with comparing graphs. Graph kernels are one of the most recent methods introduced into the graph comparison area to overcome such difficulties. Graph kernels can be mainly categorized into three classes: (1) kernels based on common random walks and paths between two graphs (Borgwardt and Kriegel, 2005; Gärtner et al., 2003; Kashima et al., 2003), (2) kernels based on common limited-size subgraphs between two graphs (Horváth et al., 2004;
Fig. 1. Schematic representation of the different kinds of hyperplanes that SVMs fit between two classes: (a) the case where a hard-margin SVM can handle the problem, i.e. a maximum-margin hyperplane exists that correctly separates the two classes; (b) the case where a soft-margin SVM solves the classification problem better than a hard-margin SVM in terms of generalization performance for classifying an unseen data point; and (c) the case where the kernel trick solves the classification problem by transforming the data points into a linearly separable feature space.
Shervashidze et al., 2009), and (3) kernels based on common subtree patterns between two graphs (Gaüzère et al., 2012; Ramon and Gärtner, 2003; Shervashidze and Borgwardt, 2009). In our study, we deal with labeled graphs in which nodes and edges are labeled with numerical values in ℕ and ℝ, respectively. The shortest-path kernel is one of the kernels that can be applied to such graphs. It has been shown that the shortest-path kernel outperforms state-of-the-art graph kernels on several benchmark datasets in terms of classification accuracy and runtime (Borgwardt and Kriegel, 2005). For the sake of clarity, let us first explain some essential concepts and then describe the shortest-path kernel.
2.3.1. Shortest-path

A path in a graph is a sequence of nodes such that consecutive nodes in the sequence are connected by an edge in the graph and no node is repeated in the sequence. Path length is the number of edges that must be traversed along the path to go from one node to the other. In a weighted graph, path length is instead defined as the sum of the weights of the traversed edges. The shortest path is the path with minimum length.
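As a concrete illustration (not part of the original method description), all-pairs shortest-path lengths in a weighted graph can be computed with the Floyd–Warshall algorithm. The weight matrix `w` below is an assumed toy example, with `np.inf` marking absent edges:

```python
import numpy as np

def floyd_warshall(w):
    """All-pairs shortest-path lengths for a weighted graph.

    w: (n, n) matrix of non-negative edge weights, with np.inf where
    no edge exists and 0 on the diagonal.
    """
    d = w.astype(float).copy()
    n = d.shape[0]
    for k in range(n):
        # Allow node k as an intermediate step on every path.
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

# Toy graph: the two-edge path 0-1-2 (length 1 + 2 = 3) is shorter
# than the direct edge 0-2 (weight 5).
w = np.array([[0., 1., 5.],
              [1., 0., 2.],
              [5., 2., 0.]])
d = floyd_warshall(w)
# d[0, 2] == 3.0
```

The resulting matrix `d` contains the shortest-path length between every pair of nodes, which is exactly the edge labeling used by the shortest-path graph described in the next subsection.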
2.3.2. Shortest-path kernel

A graph is represented as G(V, E), in which V and E are the sets of nodes and edges, respectively. The shortest-path kernel transforms each graph G(V, E_G) into a so-called shortest-path graph S(V, E_S). The shortest-path graph S includes the same set of nodes as the original graph G, and there is an edge between two nodes in S if they are connected by a path in G. Every edge in S between nodes v_i and v_j is labeled with the length of the shortest path between these two nodes in G. The shortest-path kernel uses the Floyd algorithm to compute the shortest-path length between every pair of nodes in a graph. The shortest-path kernel between two graphs is thereby converted into a simple comparison between the corresponding shortest-path graphs (Borgwardt, 2007).

For a mathematical expression of the shortest-path kernel, assume that we deal with two graphs G(V, E_G) and G′(V′, E′_G), which are transformed into the shortest-path graphs S(V, E_S) and S′(V′, E′_S), respectively. The shortest-path kernel can then be defined over S(V, E_S) and S′(V′, E′_S) as

K_shortest-path(S, S′) = Σ_{e ∈ E_S} Σ_{e′ ∈ E′_S} K¹_walk(e, e′)    (10)

where K¹_walk is a positive definite kernel on walks of length 1, i.e. a kernel on edges. The shortest-path kernel is therefore a walk kernel on the Floyd-transformed graphs that considers walks of length 1 only. K¹_walk is defined based on the approach of Kashima et al. (2003), which chooses a positive definite kernel on nodes and a positive definite kernel on edges, and then defines a kernel on pairs of walks of length 1, K¹_walk, as the product of the kernels on the nodes and edges encountered along the walk. For example, when node labels are numerical values in ℕ, the kernel on nodes can be defined as a Dirac kernel, and when edge labels are numerical values in ℝ, the kernel on edges can be defined as a Gaussian kernel. Further explanation can be found in Borgwardt (2007), which includes a comprehensive review of graph kernels.

In the present study, we build an iterative support vector classifier enriched by the shortest-path kernel in order to classify the fMRI connectivity graphs of different brain states.
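A minimal sketch of Eq. (10) under simplifying assumptions (Dirac kernel on integer node labels, Gaussian kernel on shortest-path lengths, and only ordered endpoint pairs compared); a full implementation would also match swapped endpoints:

```python
import numpy as np

def shortest_path_kernel(d1, labels1, d2, labels2, sigma=1.0):
    """Simplified shortest-path kernel between two labeled graphs.

    d1, d2: all-pairs shortest-path length matrices (e.g. from the
    Floyd algorithm); labels1, labels2: integer node labels.
    Edge pairs are compared with a Dirac kernel on the endpoint
    labels and a Gaussian kernel on the shortest-path lengths.
    """
    k = 0.0
    n1, n2 = len(labels1), len(labels2)
    for i in range(n1):
        for j in range(i + 1, n1):          # edges of S
            if not np.isfinite(d1[i, j]):
                continue                    # no path, hence no edge in S
            for p in range(n2):
                for q in range(p + 1, n2):  # edges of S'
                    if not np.isfinite(d2[p, q]):
                        continue
                    # Dirac kernel on both endpoint labels ...
                    if labels1[i] == labels2[p] and labels1[j] == labels2[q]:
                        # ... times a Gaussian kernel on the path lengths.
                        k += np.exp(-(d1[i, j] - d2[p, q]) ** 2
                                    / (2 * sigma ** 2))
    return k

# Toy shortest-path matrix of a 3-node graph, compared with itself.
d = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
labels = [1, 2, 3]
k_self = shortest_path_kernel(d, labels, d, labels)
```

With distinct node labels, each of the three shortest-path edges matches only itself and contributes exp(0) = 1, so `k_self` equals 3.0. Evaluating this kernel over all pairs of sample graphs yields a Gram matrix that can be handed to a kernel-based classifier such as an SVM.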
3. Materials and methods

3.1. Subjects and data acquisition

Nineteen healthy subjects (mean age: 25 ± 3, range: 20–30; 9 male, 10 female) with no history of medication use or health problems participated in the data acquisition. Three individuals in the group were left-handed and the others were right-handed. The structural magnetic resonance images were also examined carefully to exclude subjects with severe white matter changes or any abnormalities. All subjects were educated (18.0 ± 2.1 years of education). They gave written informed consent to participate in the study, which was performed in accordance with the guidelines of the Research Ethics Board at Baycrest and the University of Toronto.

Scanning was performed on a Siemens Trio 3 Tesla magnet at the Rotman Research Institute, University of Toronto, Toronto, Canada. T2*-weighted images were obtained using an echo planar imaging acquisition (echo time (TE) = 30 ms, repetition time (TR) = 2000 ms, flip angle = 70°, field of view (FOV) = 200 mm, matrix = 40 × 48, 33 contiguous slices). For each subject, a T1-weighted anatomical volume was acquired using an SPGR sequence to coregister with the functional images (TE = 2.6 ms, TR = 2000 ms, FOV = 256 mm, slice thickness = 1 mm). For each participant, the experiment was repeated in four scanning sessions. The length of the time series of each session was 300 points.
3.2. Experimental design

The experiment was a blocked design containing five conditions: (1) fixation, (2) detection of a single stimulus, (3) perceptual matching, (4) attentional cueing, and (5) a delayed match-to-sample test of working memory. The visual stimuli used in the task blocks were designed to avoid semantic processing in the brain. These semantically devoid stimuli were white-noise patches filtered by band-pass filters with different center frequencies (Protzner and McIntosh, 2007). Examples of the stimuli used in the experiment are shown in Fig. 2. During a fixation block, the subject stared at a dot projected on the middle of the screen and no response was required. Each run session consisted of eight 20-s fixation blocks interleaved with the task blocks. The order of the four cognitive tasks was counterbalanced among subjects and sessions.

During a detection trial, a visual stimulus was randomly projected at the left, central or right position at the bottom of the screen. The stimulus was presented for 1000 ms, and the subject was required to press the left, central or right button to indicate the location of the stimulus. Twelve such trials were performed in each detection block, which was repeated two times in each run session.

During a perceptual matching trial, a sample stimulus was projected at the center of the upper part of the screen while three stimuli simultaneously appeared at the lower portion of the screen; the subject was required to press one of the buttons to indicate which of the three stimuli corresponded to the sample (Fig. 2(a)). Each perceptual matching trial lasted 4000 ms. There were six such trials in each perceptual matching block, which was repeated two times in each run session.

In an attentional cueing trial, a sample stimulus was projected at the center of the upper portion of the screen for 1500 ms. Then an arrow pointing either to the right or to the left appeared in the lower portion of the screen for 1500 ms while the sample stimulus was still displayed. The arrow then vanished and, 500 ms later, two stimuli were projected at the right and left positions of the screen for 3000 ms. Subjects had to attend to the location cued by the arrow and indicate the similarity or dissimilarity between the target stimulus and the sample. Four such trials were performed in each attentional cueing block, which was repeated two times in each run session. Four different kinds of attentional cueing trials were possible. In the first kind, the stimulus cued by the arrow was similar to the sample whereas the uncued stimulus was not; in the second kind, the cued stimulus was not similar to the sample whereas the uncued stimulus was; and so on for the two other kinds. The total number of attentional cueing trials was divided equally among these four kinds.

During a delayed match-to-sample trial, a sample stimulus was projected at the center of the upper portion of the screen for 1500 ms; after a 2500-ms delay accompanied by a blank screen, three stimuli were displayed on the lower portion of the screen for 3000 ms. The subject's task was to press one of the buttons to indicate which of the three stimuli corresponded to the sample (Fig. 2(b)). There were four such trials in each delayed match-to-sample block, which was repeated two times in each run session.

The general procedure of the proposed method is illustrated schematically in Fig. 3 and explained in detail in the following sections.

3.3. fMRI data preprocessing

First, an unbiased nonlinear group-average anatomical image is created in order to register each subject's scans into a common space (Chen et al., 2006; Kovačević et al., 2005; Levine et al., 2008). Slice-timing correction is performed on the functional data using the AFNI software package (http://afni.nimh.nih.gov/afni).
Three transformations are then obtained using AIR
(http://bishopw.loni.ucla.edu/AIR5/) to register each fMR image into the common space. The first is the transformation from each functional volume to the mean volume within a run; the functional data are motion corrected through this registration step. The second is the transformation from the mean volume of each run to the corresponding subject's structural volume, and the third is the transformation from each subject's structural volume to the common atlas. These three transformations are then concatenated to obtain a nonlinear transformation from each functional image into the common template. After the above-mentioned preprocessing steps, the fMRI data are spatially smoothed using a Gaussian filter with a full width at half maximum of 7 mm. Afterwards, six head-motion parameters and the white matter time series are used as regressors in a general linear model (GLM) and are regressed out from the time series of each voxel in order to remove confounds. The head-motion parameters are obtained in the motion correction step; to obtain the white matter regressor, the white matter map from the ICBM probabilistic tissue atlas (http://www.loni.ucla.edu/ICBM/Downloads/Downloads ICBMprobabilistic.shtml) is used. This map is first registered into the common atlas. The registered map is then converted into a binary white matter mask with a threshold value of 0.8 (the chance of containing white matter is ≥80%). Afterwards, this binary mask is eroded using a 3 × 3 square morphological structuring element. Finally, the white matter voxels are identified for each subject, and their time series are spatially averaged. This spatially averaged time series is used as the white matter regressor in the GLM. It should be mentioned that the preprocessing steps from creating the common template to spatial smoothing and creating the white matter regressor were done in Grady et al. (2010).

3.4.
Selecting ROIs and extracting ROI time series

To identify the ROIs, we take advantage of the results obtained in Grady et al. (2010), which uses PLS to analyze activity changes across conditions. PLS is a multivariate approach that identifies spatial patterns of activity revealing the optimal relation between the functional images and each condition of the experiment. In other words, PLS can identify regions with significant changes of activity across the conditions of the experiment; for more details see McIntosh et al. (1996). Most of the regions identified using PLS in Grady et al. (2010) are included in the default mode network (DMN) and the task positive network (TPN); there is only a slight difference between them and the regions typically thought to be part of the DMN and TPN. Table 1 lists the ROIs, their corresponding labels in the connectivity graphs, and the coordinates of the ROIs' central voxels in the common template.

After identifying the ROIs, their time series are extracted from the fMRI data. The representative time series of each ROI is the spatially averaged time series of the corresponding central voxel and its 6 face-connected neighbors. To create longer time series, we concatenate the time series of same-condition blocks (Fair et al., 2007; Freeman et al., 2011; Richiardi et al., 2011, 2010). Considering that (1) the blocks are not short (40 s in length) and (2) we use a cross correlation measure for estimating the functional connections, concatenation contributes to correlations that are not due to chance. The exact moments at which each block began and ended are identified as the beginning and the end of the time series of that block, respectively. As previously mentioned, data were acquired in 4 sessions for each subject. For each subject and each condition, we concatenate the time series across two randomly selected scanning sessions. The same concatenation procedure is then applied to the time series of the two remaining sessions. Finally, for each subject and each
Fig. 2. Examples of the visual stimuli used in the experiment: (a) a visual stimulus used in the perceptual matching task, and (b) a visual stimulus used in the delayed match-to-sample task.
ROI, we have ten time series with a length of 80 time points, each pair of which corresponds to one of the states.

3.5. Creating fMRI connectivity graphs

Several methods have been proposed for fMRI connectivity analysis, including independent component analysis (ICA) (Hyvärinen and Oja, 2000), principal component analysis (PCA) (Baumgartner et al., 2000; Worsley et al., 2005) and cross correlation analysis (CCA) (Cao and Worsley, 1999). In the current study, CCA is used to construct the functional connectivity graphs. This method has been frequently used to construct large brain networks (Richiardi et al., 2011; Shirer et al., 2012; Van den Heuvel et al., 2008). It is easy to implement and produces functional networks with the same regions (graphs with the same nodes) in all states/subjects, whereas ICA and PCA do not have such properties. These properties simplify the comparison of the connectivity graphs of different brain states.

To construct the weighted connectivity graphs, each ROI is considered as a node. For each subject and each set of ROI time series, a correlation matrix is formed. The correlation matrix R = [r_ij] is an n × n symmetric matrix, where n is the number of graph nodes (ROIs) and r_ij is the Pearson cross correlation coefficient between the time series of the nodes v_i and v_j, describing the weight of the functional connectivity between these nodes. The graph adjacency matrix is then calculated as A = R − I, which removes self-connections from the graphs. In these graphs, nodes are labeled with numerical values from ℕ and edges with their corresponding values from ℝ. At the end of this step, a set containing 38 connectivity graphs with 276 connections is built for each brain state. Since we deal with relatively dense connectivity graphs, an appropriate method is required to identify the informative edges which contribute to higher discrimination between states.

3.6. Backward edge elimination

To handle the problem of dense connectivity graphs, we design a backward edge elimination (BEE) algorithm using the SVM-RFE approach that is proposed in Guyon et al. (2002) and described in detail in Section 2.2. In each iteration of the BEE algorithm, the SVM is trained using the sample graphs of the training set. The classifier is then validated using the sample graphs of the validation set, and the correct classification rate (CCR) is calculated. The ranking criterion R_c(i) is calculated for each edge using the Lagrange multipliers α = (α_1, α_2, ..., α_l) obtained in the training step (Eq. (9)). The edge with the smallest ranking criterion is removed from the training and validation sample graphs. Nodes which become completely isolated are eliminated as well. The pruned graphs are fed into the next iteration. The iteration in which maximum
Fig. 3. Block diagram of the proposed method. First, the fMRI data are preprocessed and the ROI time series are extracted. The connectivity graphs are then constructed. Using the one-versus-one approach for classifying C classes, the multi-class BEE constructs C(C − 1)/2 separate 2-class BEEs on all possible pairs of classes. Each test graph is fed into all these 2-class BEEs; every BEE assigns a class label to the test graph, and the class with the most votes determines the final classification decision.
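The graph-construction step of Section 3.5 can be sketched in a few lines of Python. This is an illustrative fragment (the function name and toy data are ours, not the paper's): it computes the Pearson correlation matrix R of the ROI time series and the adjacency matrix A = R − I.

```python
import numpy as np

def connectivity_graph(roi_ts):
    """Weighted connectivity graph from ROI time series.

    roi_ts : array of shape (n_rois, n_timepoints), one row per region.
    Returns the adjacency matrix A = R - I, i.e. the Pearson
    correlation matrix with self-connections removed.
    """
    r = np.corrcoef(roi_ts)            # n x n symmetric correlation matrix R
    return r - np.eye(r.shape[0])      # zero the diagonal (A = R - I)

# Toy usage mirroring the paper's setting: 24 ROIs, 80 time points
rng = np.random.default_rng(0)
ts = rng.standard_normal((24, 80))
A = connectivity_graph(ts)
```

The resulting A is symmetric with a zero diagonal, so its upper triangular part fully describes the undirected graph.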
Table 1
Regions identified using PLS in Grady et al. (2010), along with their numerical labels in the connectivity graphs and the coordinates [X, Y, Z] of the regions' central voxels. Regions marked (a) and (b) are part of the DMN and TPN, respectively.

Regions of interest                              Label   [X, Y, Z]
Caudate                                          1       [−14, 22, 6]
Cerebellum (a)                                   2       [−6, −32, −54]
Left middle inferior temporal gyrus (a)          3       [−57, 3, −20]
Left middle superior frontal gyrus (a)           4       [−30, 47, 26]
Left ventromedial prefrontal (a)                 5       [−2, 62, −6]
Medial prefrontal cortex (a)                     6       [2, 61, 23]
Medial frontal gyrus (a)                         7       [−2, 6, 42]
Posterior cingulate cortex (a)                   8       [−2, −41, 11]
Right middle temporal gyrus (a)                  9       [58, 2, −22]
Right parahippocampal gyrus (a)                  10      [22, −38, −30]
Right retrosplenial (a)                          11      [15, −46, −10]
Right IPL (angular gyrus) (a)                    12      [42, −50, 14]
Left IPL (angular gyrus) (a)                     13      [−38, −59, 18]
Precuneus (a)                                    14      [−9, −35, 35]
Insula (b)                                       15      [−34, 18, 2]
Left inferior occipital gyrus (b)                16      [−46, −50, −14]
Left postcentral gyrus, parietal lobe (b)        17      [−37, −26, 33]
Left precentral gyrus (b)                        18      [−41, −2, 33]
Left superior parietal lobe (b)                  19      [−19, −46, 39]
Right fusiform gyrus (b)                         20      [46, −47, −17]
Left middle frontal gyrus (b)                    21      [−30, 49, 24]
Right middle frontal gyrus (b)                   22      [22, −38, −30]
Thalamus                                         23      [−13, −6, −7]
Putamen                                          24      [22, 18, 0]
classification performance occurs is considered as the "optimum iteration" (if maximum classification performance is obtained in several iterations, the iteration with the minimum number of surviving edges is considered the "optimum iteration"). The edges and nodes that survive in the "optimum iteration" form the discriminant network between the two states. The general procedure of the BEE algorithm, for a training set of l sample graphs with m edges and a validation set of l′ sample graphs with m edges, is given in the following pseudo-code.

Algorithm: 2-class BEE
Input:
  Training set: X = {x1, x2, ..., xl}
  Validation set: X′ = {x′1, x′2, ..., x′l′}
  Labels: Y = {y1, y2, ..., yl}, yi ∈ {−1, 1}
Initialization:
  Set of surviving edges: s ← {1, 2, ..., m}
  Sequence of ranked edges: r ← []
  k ← 1
while s ≠ ∅
  α ← train SVM using X(s, :)
  CCR(k) ← validate SVM using X′(s, :)
  Rc(i) ← compute ranking criterion for all i ∈ s
  f ← argmin over i ∈ s of Rc(i)
  r ← [f, r]
  s ← s − {f}
  k ← k + 1
end while
Output (optimum results): the SVM classifier with maximum CCR and minimum size of s (the discriminant network)
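The pseudo-code above can be rendered as runnable Python. The sketch below is illustrative only: it uses a plain linear SVM on vectorized graphs, for which the SVM-RFE ranking criterion reduces to the squared weight w_i² (Guyon et al., 2002), rather than the paper's shortest-path-kernel criterion of Eq. (8).

```python
import numpy as np
from sklearn.svm import SVC

def bee_linear(X_train, y_train, X_val, y_val):
    """Backward edge elimination in the SVM-RFE style of the pseudo-code.

    X_* : (n_samples, n_edges) arrays of vectorized graphs.
    y_* : labels in {-1, 1}.
    Returns the best CCR and the surviving-edge set of the iteration
    with maximal CCR (ties broken toward fewer edges, as in the paper).
    """
    s = list(range(X_train.shape[1]))        # surviving edges
    best_ccr, best_s = -1.0, s[:]
    while s:                                 # while s is not empty
        clf = SVC(kernel="linear").fit(X_train[:, s], y_train)
        ccr = clf.score(X_val[:, s], y_val)  # CCR(k) on the validation set
        if ccr >= best_ccr:                  # >= keeps the smaller set on ties
            best_ccr, best_s = ccr, s[:]
        w = clf.coef_.ravel()
        s.pop(int(np.argmin(w ** 2)))        # drop the lowest-ranked edge f
    return best_ccr, best_s
```

The loop structure (train, validate, rank, prune) matches the pseudo-code; only the ranking criterion differs from the kernelized version.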
3.7. Extending BEE into a multi-class classification algorithm

SVM is inherently a 2-class classifier. We use the "one-versus-one" approach to extend the proposed algorithm into a multi-class classifier. Using this approach, and assuming we have C classes, the 2-class BEE algorithm is carried out and optimum results are obtained for every possible pair of classes. The multi-class classifier is then constructed from the resulting C(C − 1)/2 SVMs. Validation points are mapped into each of the discriminant networks, fed into the corresponding SVM, and tested. They are finally assigned to the class with the majority of votes (Fig. 3). This approach often outperforms other
Table 2
CCR obtained using the BEE algorithm enriched by various kernels. Parameters are defined at the beginning of Section 4.

Kernel          Optimal parameter    CCR% (mean ± St. Dev.)
Linear          C = 64               75.26 ± 1.51
Polynomial      C = 64, d = 4        77.68 ± 0.97
RBF             C = 64, σ = 3.6      77.37 ± 1.34
Random walk     C = 64               81.26 ± 1.47
Fast subtree    C = 64               81.47 ± 0.74
Shortest-path   C = 64               86.32 ± 1.57
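The shortest-path kernel that performs best in Table 2 follows Borgwardt and Kriegel (2005): each graph is Floyd-Warshall-transformed into its matrix of shortest-path distances, and the kernel sums a base kernel over all pairs of path lengths from the two graphs. The sketch below is a minimal illustration, assuming positive edge costs (e.g., a distance such as 1 − |rij|) and a Gaussian base kernel; the paper's exact variant may differ.

```python
import numpy as np
from scipy.sparse.csgraph import floyd_warshall

def shortest_path_kernel(a1, a2, gamma=1.0):
    """Shortest-path kernel between two weighted graphs.

    a1, a2 : dense adjacency matrices of positive edge costs
    (scipy treats zero entries of a dense matrix as missing edges).
    """
    # Floyd-Warshall transform: all-pairs shortest-path distances,
    # keeping only the upper triangle of each undirected graph
    d1 = floyd_warshall(a1)[np.triu_indices_from(a1, k=1)]
    d2 = floyd_warshall(a2)[np.triu_indices_from(a2, k=1)]
    # Gaussian base kernel summed over all path-length pairs
    diff = d1[:, None] - d2[None, :]
    return float(np.exp(-gamma * diff ** 2).sum())
```

Because the kernel depends only on shortest-path distances, it captures topological structure that a plain vector embedding of the edge weights ignores.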
approaches for developing SVM into a multi-class classifier, such as "one-versus-the-rest" (Bishop, 2006).

4. Results

From the 190 sample graphs, 95 graphs (19 samples from each state) are selected randomly for training, and the remaining 95 graphs (19 samples from each state) are used for validation. To exclude random effects of the samples selected for training the classifier, this partitioning of the data and the classification procedure are repeated 10 times. We report the average multi-class classification accuracy and its standard deviation in Table 2. To examine the performance of the shortest-path kernel, we implement the BEE algorithm enriched by conventional kernels (linear, polynomial, and radial-basis function (RBF)) and by two other state-of-the-art graph kernels: the fast geometric random walk kernel (Borgwardt, 2007) and the fast subtree kernel (Shervashidze and Borgwardt, 2009). The polynomial kernel for two graphs x and x′ has the form k(x, x′) = (1 + x·x′)^d, with the polynomial order d as its parameter. The RBF kernel has the form k(x, x′) = exp(−‖x − x′‖²/2σ²), with the scaling factor σ as its parameter. Although we use fixed values of the parameters C (penalty factor), d, and σ for the whole BEE algorithm (all iterations), in an outer loop we use a grid search to find the values of these parameters that produce the optimum CCR. The grid-search sets for C, d, and σ are {0.5, 1, 2, ..., 128}, {2, 3, 4, ..., 10}, and {0.1, 0.6, 1.1, ..., 10.1}, respectively.

It should be mentioned that, using the conventional kernels, classification cannot be performed directly on graphs. Since we deal with undirected graphs in this study, the upper triangular part of the adjacency matrix is sufficient to describe such graphs completely. Thus the upper triangular part of each adjacency matrix is reshaped into a vector, and the resulting vectors are fed into the BEE algorithm. This embedding method has been applied to fMRI connectivity graphs in Richiardi et al. (2011). To compare the classification performance of the proposed method with that of previous approaches (Richiardi et al., 2010; Shirer et al., 2012), these are also implemented and applied to the connectivity graphs of the fMRI data; they achieve classification performances of 77.58% and 81.58%, respectively, via a leave-one-out cross-validation procedure.

To provide more exact insight into how well the classifier discriminates between each pair of states, we present the confusion matrices obtained using the different kernels in Table 3. As mentioned above, we repeat the whole multi-class algorithm 10 times; the confusion matrices reported in the table are averages over the 10 repetitions. In these matrices, rows and columns stand for the real states of the graphs R and the class labels predicted by the classifier ω, respectively. For the different 2-class BEEs enriched by the shortest-path kernel, Fig. 4 shows the classification performance obtained during the edge elimination procedure.
In other words, the figure demonstrates how changes in the number of edges affect the CCR. Investigation of the networks resulting from the 10 repetitions of the classification algorithm shows no remarkable differences between them. Fig. 5 illustrates the discriminant networks
Table 3
Confusion matrices obtained using the BEE algorithm enriched by various kernels, (a) linear, (b) polynomial, (c) RBF, (d) random walk, (e) fast subtree, and (f) shortest-path. FIX, DET, PMT, ATT, and DMS stand for the fixation, detection, perceptual matching, attentional cueing and delayed match-to-sample conditions, respectively. Rows: real state R; columns: predicted label ω.

(a) Linear
R\ω   FIX     DET     PMT     ATT     DMS
FIX   91.05   0.00    8.95    0.00    0.00
DET   0.00    58.42   15.26   13.68   12.63
PMT   0.00    14.21   77.37   8.42    0.00
ATT   0.00    0.00    13.68   75.79   10.53
DMS   0.00    14.21   0.00    12.11   73.68

(b) Polynomial
R\ω   FIX     DET     PMT     ATT     DMS
FIX   89.47   6.32    4.21    0.00    0.00
DET   0.00    65.79   12.11   11.58   10.53
PMT   0.00    9.47    78.95   11.58   0.00
ATT   5.26    0.00    10.53   75.26   8.95
DMS   0.00    14.74   6.32    0.00    78.95

(c) RBF
R\ω   FIX     DET     PMT     ATT     DMS
FIX   89.47   6.84    3.68    0.00    0.00
DET   0.00    63.16   14.21   12.11   10.53
PMT   0.00    12.63   78.95   6.84    1.58
ATT   0.00    0.00    14.21   78.95   6.84
DMS   0.00    12.11   3.68    7.89    76.32

(d) Random walk
R\ω   FIX     DET     PMT     ATT     DMS
FIX   91.58   4.74    3.68    0.00    0.00
DET   0.00    63.16   15.26   11.05   10.53
PMT   0.00    10.00   85.79   4.21    0.00
ATT   5.26    0.00    4.74    84.21   5.79
DMS   0.00    8.42    10.00   0.00    81.58

(e) Fast subtree
R\ω   FIX     DET     PMT     ATT     DMS
FIX   90.00   8.95    0.00    0.00    1.05
DET   0.00    63.16   12.11   14.21   10.53
PMT   0.00    5.26    85.79   8.95    0.00
ATT   5.26    0.00    5.26    84.21   5.26
DMS   0.00    10.53   5.26    0.00    84.21

(f) Shortest-path
R\ω   FIX     DET     PMT     ATT     DMS
FIX   95.79   4.21    0.00    0.00    0.00
DET   0.00    68.42   11.58   11.58   8.42
PMT   0.00    3.16    89.47   4.21    3.16
ATT   7.37    0.00    0.00    88.42   4.21
DMS   0.00    6.32    4.21    0.00    89.47
for any pair of states. These results show some consistent patterns of connectivity in the fixation-task and task-task cases, which are discussed in more detail in Section 5. Fig. 5 also shows the rank of the edges in the discriminant networks: darker and thicker lines correspond to higher-ranked edges in the feature elimination procedure.

5. Discussion

Brain decoding studies aim to explore brain function by applying multivariate pattern analysis to discern brain states from fMRI data. In the current study, we design an iterative support vector classifier that can perform classification directly on graphs using the shortest-path kernel. Our classifier is also able to discover the networks that separate any pair of brain states, using a backward edge elimination procedure.

Table 2 shows that the shortest-path kernel surpasses the other kernels in terms of classification performance. In general, we can see that graph kernels outperform the conventional kernels. One important advantage of graph kernels compared to other kernels is that they enable the BEE algorithm to perform classification on graphs. Measuring the similarity between graphs using graph kernels causes topological information to be considered in addition to the information represented by node and edge labels. Recent studies have shown inherent topological features of brain networks (Achard et al., 2006; Hayasaka and Laurienti, 2010; Salvador et al., 2005; Van den Heuvel et al., 2008). Therefore, this property of graph kernels is worthwhile for obtaining an improved similarity score between fMRI connectivity graphs.

In Table 3, we can observe that the separation rates between fixation and task states are comparatively higher than those between pairs of tasks, suggesting that some of the evaluated connections change significantly between the resting state and cognitive task states, as indicated in Richiardi et al. (2011). On the other hand, there are high confusion rates between detection and the other task states. This may indicate that our measurements do not include appropriate features for accurately separating detection from the other task states.

According to Fig. 4, in the last iteration of the elimination procedure, in which classification is performed based only on the highest-ranked connection, the CCR is lower than in other iterations. This suggests that cognitive brain states may be classified more efficiently by a network than by a single connection. Based on Fig. 5, the presence of the connectivity between the medial prefrontal cortex and the posterior cingulate cortex as the highest-ranked edge is noteworthy in the networks that discriminate between fixation and tasks (including detection, perceptual matching, and delayed match-to-sample). We can also
Fig. 4. CCR obtained during edge elimination procedure using shortest-path kernel, (a) results of classifying fixation from one of the tasks and (b) results of classifying a task from another task.
Fig. 5. Discriminant networks of any pair of states obtained using BEE algorithm enriched by shortest-path kernel, (a) FIX-DET, (b) FIX-PMT, (c) FIX-ATT, (d) FIX-DMS, (e) DET-PMT, (f) DET-ATT, (g) DET-DMS, (h) PMT-ATT, (i) PMT-DMS, and (j) ATT-DMS. The node labels are defined in Table 1.
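A node's role as a hub in the discriminant networks of Fig. 5 can be quantified simply by its degree, i.e. its number of surviving connections. A minimal sketch (the function name is ours, for illustration):

```python
import numpy as np

def hub_nodes(adj, top_k=1):
    """Rank nodes of a discriminant network by degree (hub-ness).

    adj : adjacency matrix of the surviving edges (weighted or binary).
    Returns the top_k node indices with the most connections.
    """
    degree = np.count_nonzero(adj, axis=0)  # connections per node
    order = np.argsort(degree)[::-1]        # highest degree first
    return [int(i) for i in order[:top_k]]
```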
observe the connectivity between the right retrosplenial cortex and the superior parietal lobe as the second-ranked edge in the discriminant networks of FIX-DET, FIX-PMT, and FIX-ATT.

Another important pattern that emerges from the fixation-task networks is the role of the posterior cingulate cortex as a hub. In a graph, a hub is a node that has a large number of connections and plays an important role in building short and efficient paths between other nodes. The posterior cingulate cortex is therefore a region that contributes strongly to interactions in the discriminant networks. It is one of the most prominent regions of the DMN, and functional imaging studies have recently revealed significant activation of the posterior cingulate in the retrieval of past events, planning for the future, spatial navigation, and theory of mind (Buckner and Carroll, 2007; Grady et al., 2010; Spreng et al., 2009).

We can also observe a few consistent patterns in the task-task networks. The connectivity between the right retrosplenial cortex and the superior parietal lobe is the highest-ranked edge in these networks, except in those of DET-ATT and ATT-DMS. The superior parietal lobe is also a highly connected region in these cases; it can thus be seen as a hub with an important role in connecting other regions. In addition, the superior parietal lobe is the only region present in all 10 discriminant networks. It has been indicated that the superior parietal lobe is involved in spatial orientation and in receiving visual and sensory inputs (Joseph, 1996).

The proposed method does not perform a simultaneous feature selection for the multi-class classification. The multi-class SVM works based on the features selected for each pair of states. With this feature selection approach, optimum accuracy may not be obtained for the multi-class classification, but it provides us with the discriminant network of each pair of states.

As previously mentioned, three of the subjects were left-handed. Potential differences in brain lateralization between left- and right-handed subjects may influence the results; since the number of left-handed subjects is small, we cannot draw a reliable conclusion in this regard.

Finally, one aspect of the proposed method that may have an influential role in classification performance should be explained. As described in Section 3.4, we select the sets of regions obtained by PLS. PLS discovers the patterns of activity that present the optimal association between the images and each condition of the experiment. In the analysis of blocked fMRI data, such activity patterns can be helpful for identifying informative regions. In the current study, the ROIs mainly belong to the DMN and TPN, two sets of brain regions that show anticorrelated patterns of activation during rest and cognitive tasks. It has recently been shown that there is a correspondence between the difficulty of a cognitive task and the level of anticorrelation of the DMN and TPN (Kelly et al., 2008). Therefore, the time series of the regions included in these two networks are thought to contain informative features for discriminating between different cognitive tasks.
6. Conclusion

We designed an iterative support vector classifier enriched by the shortest-path graph kernel in order to decode five cognitive states from fMRI connectivity graphs. Our classifier has two notable characteristics: it performs classification directly on graphs, relying on a graph-based similarity score calculated by the shortest-path kernel, and it uncovers discriminant networks between different brain states via a backward edge elimination procedure. We identified several important patterns in the resulting networks. The posterior cingulate cortex acts as a hub in the discriminant networks of the fixation-task cases, and the superior parietal lobe plays the same role in the task-task cases. The connectivity between the right retrosplenial cortex and the superior parietal lobe is an informative connection in both fixation-task and task-task cases. These findings can be investigated further in neuroscience studies of brain cognitive states and of disorders that change activity and/or connectivity patterns among regions.

Acknowledgements

We wish to thank Prof. Cheryl Grady and Prof. Stephen Strother for providing us with the fMRI data, Prof. Hamid Soltanian-Zadeh and Dr. Jonas Richiardi for fruitful advice, and Shahab Kadkhodaeian Bakhtiari for helpful discussions. The original experiment of this work was supported by the Canadian Institutes of Health Research (Grant MOP14036).

References

Achard S, Salvador R, Whitcher B, Suckling J, Bullmore E. A resilient, low-frequency, small-world human brain functional network with highly connected association cortical hubs. J Neurosci 2006;26:63–72. Baumgartner R, Ryner L, Richter W, Summers R, Jarmasz M, Somorjai R. Comparison of two exploratory data analysis methods for fMRI: fuzzy clustering vs. principal component analysis. Magn Reson Imaging 2000;18:89–94. Bishop CM. Pattern recognition and machine learning. New York: Springer; 2006. Borgwardt KM. Graph kernels.
Ludwig Maximilian University of Munich; 2007. Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In: Proc IEEE Int Conf on Data Mining; 2005. p. 74–81. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proc ACM Workshop on Computational Learning Theory; 1992. p. 144–52. Buckner RL, Carroll DC. Self-projection and the brain. Trends Cogn Sci 2007;11:49–57. Bunke H, Allermann G. Inexact graph matching for structural pattern recognition. Pattern Recogn Lett 1983;1:245–53. Cao J, Worsley K. The geometry of correlation fields with an application to functional connectivity of the brain. Ann Appl Probab 1999;9:1021–57. Chapelle O, Keerthi S. Multi-class feature selection with support vector machines. Proc Am Stat Assoc 2008. Chen XJ, Kovacevic N, Lobaugh NJ, Sled JG, Henkelman RM, Henderson JT. Neuroanatomical differences between mouse strains as shown by high-resolution 3D MRI. Neuroimage 2006;29:99–105. Chung FRK. Spectral graph theory. Am Math Soc 1997. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97. Ethofer T, Van De Ville D, Scherer K, Vuilleumier P. Decoding of emotional information in voice-sensitive cortices. Curr Biol 2009;19:1028–33. Fair DA, Schlaggar BL, Cohen AL, Miezin FM, Dosenbach NUF, Wenger KK, et al. A method for using blocked and event-related fMRI data to study “resting state” functional connectivity. Neuroimage 2007;35:396–405. Freeman J, Donner TH, Heeger DJ. Inter-area correlations in the ventral visual pathway reflect feature integration. J Vis 2011:11. Fries P. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence. Trends Cogn Sci 2005;9:474–80. Gärtner T, Flach P, Wrobel S. On graph kernels: Hardness results and efficient alternatives. Lect Notes Artif Int 2003:129–43. Gaüzère B, Brun L, Villemin D. Two new graphs kernels in chemoinformatics. Pattern Recogn Lett 2012;33:2038–47. Gold C, Holub A, Sollich P. 
Bayesian approach to feature selection and parameter tuning for support vector machine classifiers. Neural Netw 2005;18:693–701.
Grady CL, Protzner AB, Kovacevic N, Strother SC, Afshin-Pour B, Wojtowicz M, et al. A multivariate analysis of age-related differences in default mode and task-positive networks across multiple cognitive domains. Cereb Cortex 2010;20:1432–47. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn 2002;46:389–422. Hayasaka S, Laurienti PJ. Comparison of characteristics between region- and voxel-based network analyses in resting-state fMRI data. Neuroimage 2010;50:499–508. Horváth T, Gärtner T, Wrobel S. Cyclic pattern kernels for predictive graph mining. In: Proc ACM Int Conf on Knowledge Discovery and Data Mining; 2004. p. 158–67. Hyvärinen A, Oja E. Independent component analysis: algorithms and applications. Neural Netw 2000;13:411–30. Joseph R. Neuropsychiatry, neuropsychology, and clinical neuroscience: emotion, evolution, cognition, language, memory, brain damage, and abnormal behavior. Baltimore: Williams & Wilkins Co; 1996. Kashima H, Tsuda K, Inokuchi A. Marginalized kernels between labeled graphs. In: Proc Int Conf on Machine Learning; 2003. p. 321–8. Kelly A, Uddin LQ, Biswal BB, Castellanos FX, Milham MP. Competition between functional brain networks mediates behavioral variability. Neuroimage 2008;39:527–37. Kovačević N, Henderson J, Chan E, Lifshitz N, Bishop J, Evans A, et al. A three-dimensional MRI atlas of the mouse brain with estimates of the average and variability. Cereb Cortex 2005;15:639–45. Kühn S, Bodammer NC, Brass M. Dissociating mental states related to doing nothing by means of fMRI pattern classification. Neuroimage 2010;53:1294–300. LaConte SM, Peltier SJ, Hu XP. Real-time fMRI using brain-state classification. Hum Brain Mapp 2007;28:1033–44. Levine B, Kovacevic N, Nica E, Cheung G, Gao F, Schwartz M, et al. The Toronto traumatic brain injury study: injury severity and quantified MRI. Neurology 2008;70:771–8. McIntosh A, Bookstein F, Haxby JV, Grady C.
Spatial pattern analysis of functional brain images using partial least squares. Neuroimage 1996;3:143–57. Protzner AB, McIntosh AR. The interplay of stimulus modality and response latency in neural network organization for simple working memory tasks. J Neurosci 2007;27:3187–97. Rakotomamonjy A. Variable selection using SVM based criteria. J Mach Learn Res 2003;3:1357–70. Ramon J, Gärtner T. Expressivity versus efficiency of graph kernels. In: Proc Int Workshop on Mining Graphs, Trees and Sequences; 2003. p. 65–74. Richiardi J, Eryilmaz H, Schwartz S, Vuilleumier P, Van De Ville D. Decoding brain states from fMRI connectivity graphs. Neuroimage 2011;56:616–26. Richiardi J, Van De Ville D, Eryilmaz H. Low-dimensional embedding of functional connectivity graphs for brain state decoding. In: Proc IEEE Workshop on Brain Decoding; 2010. p. 21–4. Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn 2003;53:23–69. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507–17. Salvador R, Suckling J, Coleman MR, Pickard JD, Menon D, Bullmore E. Neurophysiological architecture of functional magnetic resonance images of human brain. Cereb Cortex 2005;15:1332–42. Shervashidze N, Borgwardt KM. Fast subtree kernels on graphs. Adv Neural Inf Proc Syst 2009;22:1660–8. Shervashidze N, Vishwanathan S, Petri T, Mehlhorn K, Borgwardt K. Efficient graphlet kernels for large graph comparison. In: Proc Int Conf on Artificial Intelligence and Statistics; 2009. Shirer W, Ryali S, Rykhlevskaia E, Menon V, Greicius M. Decoding subject-driven cognitive states with whole-brain connectivity patterns. Cereb Cortex 2012;22:158–65. Sitaram R, Lee S, Ruiz S, Rana M, Veit R, Birbaumer N. Real-time support vector classification and feedback of multiple emotional brain states. Neuroimage 2011;56:753–65. Spreng RN, Mar RA, Kim ASN.
The common neural basis of autobiographical memory, prospection, navigation, theory of mind, and the default mode: a quantitative meta-analysis. J Cogn Neurosci 2009;21:489–510. Todeschini R, Consonni V. Handbook of molecular descriptors. Weinheim, Germany: Wiley-VCH; 2008. Van den Heuvel M, Stam C, Boersma M, Hulshoff Pol H. Small-world and scale-free organization of voxel-based resting-state functional connectivity in the human brain. Neuroimage 2008;43:528–39. Vapnik V. Pattern recognition using generalized portrait method. Automat Rem Contr 1963;24:774–80. Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V. Feature selection for SVMs. Adv Neural Inf Proc Syst 2001:668–74. Wiener H. Structural determination of paraffin boiling points. J Am Chem Soc 1947;69:17–20. Worsley KJ, Chen JI, Lerch J, Evans AC. Comparing functional connectivity via thresholding correlations and singular value decomposition. Philos Trans R Soc Lond B Biol Sci 2005;360:913–20.