NeuroImage 57 (2011) 362–377
Contents lists available at ScienceDirect
NeuroImage j o u r n a l h o m e p a g e : w w w. e l s e v i e r. c o m / l o c a t e / y n i m g
Spatio-temporal models of mental processes from fMRI Firdaus Janoos a, b, c,⁎, Raghu Machiraju a, Shantanu Singh a, Istvan Ákos Morocz b, c a b c
Dept. of Computer Science and Engineering, Ohio State University, Columbus, USA Brigham and Women's Hospital, Boston, USA Harvard Medical School, Boston, USA
a r t i c l e
i n f o
Article history: Received 1 October 2010 Revised 15 March 2011 Accepted 17 March 2011 Available online 24 March 2011 Keywords: fMRI Spatio-temporal Multivariate Markov Brain-state Semi-supervised Mean field
a b s t r a c t Understanding the highly complex, spatially distributed and temporally organized phenomena entailed by mental processes using functional MRI is an important research problem in cognitive and clinical neuroscience. Conventional analysis methods focus on the spatial dimension of the data discarding the information about brain function contained in the temporal dimension. This paper presents a fully spatio-temporal multivariate analysis method using a state-space model (SSM) for brain function that yields not only spatial maps of activity but also its temporal structure along with spatially varying estimates of the hemodynamic response. Efficient algorithms for estimating the parameters along with quantitative validations are given. A novel low-dimensional feature-space for representing the data, based on a formal definition of functional similarity, is derived. Quantitative validation of the model and the estimation algorithms is provided with a simulation study. Using a real fMRI study for mental arithmetic, the ability of this neurophysiologically inspired model to represent the spatio-temporal information corresponding to mental processes is demonstrated. Moreover, by comparing the models across multiple subjects, natural patterns in mental processes organized according to different mental abilities are revealed. © 2011 Elsevier Inc. All rights reserved.
Introduction Motivation Classically, the analysis of fMRI has focussed either on the creation of static maps localizing the metabolic fingerprints of neural processes or on studying their temporal evolution in a few pre-selected regions in the human brain. However, cognition recruits the entire brain and the underlying mental processes are fundamentally spatio-temporal in nature. By neglecting either the temporal dimension or the spatial entirety of brain function, such methods must necessarily compromise on extracting and representing all the information contained in the data (Haynes and Rees, 2006). Moreover, the high anatomical variation between subjects and the lack of rigid structure–function mapping makes group-level inference of spatial activity-maps difficult (Thirion et al., 2006). This is especially relevant for neurological and psychiatric disorders like dementia, schizophrenia, autism, multiple sclerosis, etc. (Buckner et al., 2008; Cecchi et al., 2009) or common learning disabilities like dyslexia and dyscalculia (Shalev, 2004), where it is widely acknowledged that the similarities and differences between the brain function of subjects may not only reside in the spatial layout of the mental activity but equally so in its temporal organization (Hari et al., 2000). ⁎ Corresponding author at: Harvard Medical School, Boston, USA. E-mail addresses:
[email protected] (F. Janoos),
[email protected] (R. Machiraju),
[email protected] (S. Singh),
[email protected] (I.Á. Morocz). 1053-8119/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.neuroimage.2011.03.047
In contrast to localization based methods in fMRI, Lehmann et al. (1998) demonstrated the presence of intrinsic microstates in EEG recordings corresponding to characteristic distributions of electric activity in the brain, each ranging from 70 ms to 150 ms. Hypothesized to reflect the activation of different neuro-cognitive networks, these may be the “atoms of thought” that constitute the seemingly continual “stream of consciousness”. Inspired by these findings, we demonstrated the presence of characteristic distributions of neural activity in fMRI data (Janoos et al., 2010a) by defining a metric for the functional distance between two activity patterns and then clustering functionally similar fMRI volumes. Identified intrinsically without reference to experimental variables, these spatially distributed and temporally varying signatures were shown to correspond to relatively longer lasting and more highlevel internal mental states of the subject, such as visual perception, motor planning, conflict-resolution, arithmetical processing, etc. However, one of the drawbacks of this purely unsupervised methods used there was the problem of selecting spatio-temporal patterns related to the mental task. In addition to task-related mental processes, fMRI data contain traces of background (i.e. default-state) mental processes along with confounds such as respiration, heart-beat, head-motion and scanner drift (Logothetis, 2008). Identifying task-related patterns amongst this multitude involves determining the correct solution in a non-convex optimization landscape with multiple local minima. In this paper, we present a fully multivariate spatio-temporal model that represents the brain transitioning through an abstract state-space as it performs mental task. This representation delineates not only the
F. Janoos et al. / NeuroImage 57 (2011) 362–377
363
spatial distribution of activity but also its temporal ordering. Here, information about the experimental task is used in a semi-supervised fashion to stabilize estimation and select a model of interest to the investigator, without precluding discovery of new and un-modeled patterns. Importantly, the use of a neurophysiologically inspired model allows comparison of the spatio-temporal patterns of mental processes of subjects, in their entirety. Road map The state-space model (SSM), represented by a first order Markov chain (Bishop, 2007), assumes the presence of a set of abstract mental states that are revisited during the performance of a mental task. Given the goal-oriented and directed nature of human thought these brainstates not only exhibit a temporal ordering but also respond to external stimuli. Each state is associated with a characteristic spatial distribution of neural/metabolic activity and an occurrence of the state corresponds with an activation pattern based on this signature. The observed fMRI data arise from an unknown and spatially varying hemodynamic response to these activation patterns. The model is elaborated further in The StateSpace Model (SSM) section. The Markov chain of brain-states serves two purposes: a) To enforce a temporal ordering on the states, and b) To decouple the stimulus from the fMRI signal, thereby obviating specification of the exact mathematical relationship between the two. This second aspect makes it a semi-supervised method as it uses the stimuli to guide estimation but does not preclude discovery of new patterns in the data and investigation of effects not explicitly encoded in the experimental variables. The SSM can predict the value of experimental stimuli at new frames and is able to estimate a spatially varying hemodynamic response function (HRF) from the data. In the Feature-space section, a novel linear low-dimensional featurespace is derived from a definition of the functional distance (Janoos et al., 2010a) between the activation patterns present in the fMRI data at two time-points. This functional distance is measured by the amount of change between their activity distributions over the functional networks of the brain. The feature-space, derived from this measure of similarity between brain-states, allows an exploration of hidden patterns in the data. Then in the Estimation section, an efficient algorithm based on a mean field approximation of expectation maximization (EM) (Bishop, 2007) is presented to estimate the model parameters, the activation maps, the hemodynamic filter and the optimal state sequence, along with predictions of unobserved stimuli. A method to determine the correct model size and other hyper-parameters in an automated fashion that best describe the task being performed by the subject is described in the SSM hyper-parameter section. The outline of the different processing steps of the method presented in this paper is summarized in Fig. 1. A quantitative validation of the estimation algorithms is given in the Simulation section. The results of the method applied to multi-subject fMRI study for arithmetical processing of healthy, dyslexic and dyscalculic subjects are presented in the fMRI data-set: mental arithmetic task section. This section begins with a comparative evaluation of the proposed method against MVPR classifiers, using different featurespace embeddings of the data, followed by a detailed investigation of the parameters and a neurophysiological interpretation of the results. The advantages of using the state-space model representation to compare the mental processes of different subjects in their entirety is also demonstrated. Finally, we conclude with some remarks and observations on the method in the Conclusion section. Contribution In contrast to other methods for studying the mental state of the subject from fMRI data (cf. Related work section), this paper present a
Fig. 1. Outline of method. Basis vectors of the feature-space Φ are computed from the functional connectivity of the brain estimated from the fMRI data Y (cf. Feature-space section). Dimensionality reduction is performed by retaining stable basis vectors using bootstrapping (cf. Feature-selection section). The data (y), after projecting into the low dimensional feature-space are used to estimate model parameters θ through a generalized EM algorithm (cf. Estimation section). Model hyper-parameters K and λw are selected to minimize the error of predicting the stimulus s (cf. SSM hyper-parameter selection section). Given a set of model parameters, the optimal state-sequence x⁎ is estimated using EM (cf. Estimating the optimal state-sequence section).
novel data-driven multivariate method for the dynamical analysis of mental processes in a time-resolved fashion using a phenomenological model for brain function that: (a) reveals not only the spatial distribution but also the temporal ordering of activity during a task; (b) can be applied to arbitrarily complex paradigms, not just experiments with fixed alternatives; (c) does not require specification of the mathematical relationship between experimental variables and fMRI signal; (d) does not assume a known and spatially fixed hemodynamic response; (e) uses experimental information to guide estimation towards of patterns of interest, unlike unsupervised methods; (f) permits testing for the effects of experimental variables against which the model was not trained, unlike supervised methods; and (g) allows neurophysiological interpretation of the parameters and comparison of the spatio-temporal patterns between subjects in their entirety. Related work Multivariate methods Multivariate methods for fMRI analysis have made major contributions in studying the representation of information in the brain i.e. how distributed neuronal responses encode the sensorial or cognitive state of the subject. The main advantage of the multivariate approach is its ability to integrate information in groups of voxels that individually are weakly activated, but jointly may be highly structured
364
F. Janoos et al. / NeuroImage 57 (2011) 362–377
with respect to the task. Communication among neurons as well as larger functional units is the main basis of neural computation, and by not disregarding their interactions, multivariate methods are able to reveal more about the “neural code” (O'Toole et al., 2007). Supervised or confirmatory approaches such as multivariate linear models (MVLM) (Friston et al., 2008) and multivariate pattern recognition (MVPR) (Haynes and Rees, 2006) learn the mathematical relationship from the distributed patterns of activation contained in the data to the experimental variables. Multivariate unsupervised or exploratory approaches such as principal components analysis (PCA) (Multiple, 2007), independent components analysis (ICA) (Calhoun and Adali, 2006) and cluster analysis (Baumgartner et al., 2000) have also been widely applied to the analysis of fMRI, especially restingstate and non task-related data. Semi-supervised multivariate analysis come in two flavors: (a) where unlabeled data (e.g. resting-state data) are used to regularize the regression of data with labels (e.g. task related data) (Blaschko et al., 2009) or (b) where information about the task is used to guide a clustering process (Friman et al., 2003) or a matrix-based decomposition of the data (e.g. partial least squares (PLS) (McIntosh and Lobaugh, 2004), functional PCA (Ghebreab and Smeulders, 2010) or constrained ICA (Lin et al., 2010)). Despite the success of these multivariate methods, there are nevertheless many open challenges. Because supervised methods learn a fixed mapping from fMRI data to regressors/labels describing stimuli or subject behavior, their ability to explain the cognitive state of the subject is limited to behavioral correlates and they cannot discover intrinsic patterns that might be present in the data. MVLM methods such as canonical correlation analysis (CCA) (Multiple, 2007) and multivariate Bayesian decoding (MVB) (Friston et al., 2008) model a linear relationship between regressors, formed by convolving the stimuli with a hemodynamic response function (HRF), and the observed data. This requires that the mathematical relationship between experimental variables and fMRI signal be known a priori, which may be hard to define, especially in experiments for higher level cognition. Equally problematic is the assumption of spatially and temporally constant hemodynamics in these models. Multiple studies (Logothetis, 2008) have shown a large variation in the HRF across subjects, across brain sites within the same subject, and even at the same brain site of the same subject across time. Furthermore, these methods are extremely sensitive to the design of the experiment in terms of eventtiming, orthogonality, estimability and the layout of the design matrix. On the other hand, MVPR methods such as linear discriminant analysis (LDA), Gaussian naïve Bayes (GNB), neural-networks, support vector machines (SVMs) and other types of classifiers (Haxby et al., 2001; Haynes and Rees, 2006; O'Toole et al., 2007; Mitchell et al., 2008) do not specify a neurophysiological model and instead treat the data as an abstract representation of mental activity. Therefore, neuro-scientific interpretation of results and comparison across subjects becomes difficult. A significant limitation of MVPR classifiers is their applicability only to studies where subjects are presented with a fixed number of alternatives (e.g. faces vs. objects (Haxby et al., 2001)). Also, most methods make the assumption that all fMRI scans with the same label (i.e. behavioral state) are equivalent, thereby neglecting the temporal structure in mental processes. By using all the scans in one block as a feature vector through temporal embedding (Mourão-Miranda et al., 2007), the temporal structure of only block design experiments with fixed length blocks has been incorporated in a MVPR framework. Generalization to complex cognitive paradigms with interval-valued parameters, event related designs and further on to real world situations poses a fundamental challenge for MVPR (Haynes and Rees, 2006). Similarly, many unsupervised and semi-supervised approaches also suffer from the problem of interpretability due to lack of a neurophysiological model. More importantly, purely unsupervised methods are blind to the structure and parameters of the experiment and fail to provide quantifiable links to experimental variables (O'Toole et al., 2007).
The reader is referred to the excellent article by Friston et al. (2008) on MVLMs and the reviews by Haynes and Rees (2006) and O'Toole et al. (2007) on MVPR for a better understanding of these various alternatives and their trade-offs. State-space methods Hidden Markov models (HMM) have been previously used in fMRI for determining the activation state of individual voxels in a univariate fashion from their time-series data (Højen-Sørensen et al., 2000). Activation detection associated with known stimuli has also been done with hidden Markov multiple event sequence models (HMMESM) (Faisan et al., 2007), that pre-process the raw time-series into a series of spikes to infer neural events at each voxel. A hidden process model (HPM) (Hutchinson et al., 2009) was used for univariate testing of each voxel's time-series data for the occurrence of small set of pre-specified “neural processes” relative to some external event. Bayesian spatio-temporal models that estimate the activation state of voxels using Markov random fields as a spatial prior have also been proposed (Quirós et al., 2010). Dynamic Bayesian networks (Zhang et al., 2006) and dynamic causal models (Stephan and Friston, 2010) have been applied to study the time-varying functional integration of a few preselected functional modules, from the interdependency structure of their average timeseries. The state-space model (SSM) The concept of the functional brain transition through a mental statespace is depicted in Fig. 2. The probability Pr[xt = k] that the brain-state xt at time t = 1 … T (in TR units) is k = 1 … K depends on not only on the previous state of the brain but also on the current experimental stimulus described by the vector st. The multinomial transition probability from xt − 1 = i to xt = j is (Bishop, 2007):
πxt−1 ;xt ðst Þ≜pð xt = jjxt−1
n o ⊤ exp st ωj + wi;j n o = i; st ; wÞ = ∑Kk = 1 exp s⊤t ωk + wi;k ð1Þ
The ⊤ symbol indicates matrix transpose throughout the paper. The probability of being in state j at any instant is parameterized by the vector ωj. The probability of transitioning from state i at time t − 1 to state j at time t is parameterized by wi, j, which has a normal prior N 0; λ−1 w I with precision hyper-parameter λw. All these transitions are driven by the stimulus vector st. Introducing an additional element in the stimulus vector st set to 1 allows modifying the transition probability to include a term independent of the current stimulus. Though the experiment maybe have combination of interval and categorical valued stimuli, they are converted into standardized normal variables st through a probit transformation of their cumulative distribution functions. The hyper-parameter λw controls the trade-off between the influence of the current stimulus st and the previous state xt − 1 on the probability of the current state xt. A low value biases the estimates of wi, j towards its mean value ωj reducing the influence of the previous state xt − 1 = i on p(Xt = j|Xt − 1 = i) and increasing the influence of the st on the transition. The SSM allows estimating the value of unobserved or missing stimuli at a subset of the time-points U ≜ {t1…tU} ⊂ {1…T}, represented by the hidden variables ut, t ∈ U. This feature enables prediction of stimuli from data at these time-points t ∈ U. Each abstract brain-state is realized by a characteristic distribution of neural (metabolic) activity in the cerebral cortex, denoted by Zt in voxel-space and by zt = Φ[Zt] in the D-dimensional (D ≪ N, the number of voxels) feature-space Φ (cf. Feature-space section). If xt = k, k = 1…K, then zt is modeled N ðμ k ; Σk Þ in feature-space
F. Janoos et al. / NeuroImage 57 (2011) 362–377
365
Feature-space
Fig. 2. The state-space model (SSM). The experimental parameters are represented by st, while the corresponding brain-state is xt, and the instantaneous activation pattern is zt. The activation pattern is observed in the fMRI data yt… yt + L − 1 after convolution with the HRF h.
coordinates. Let ϑ ≜ {ϑ1…ϑk}, where ϑk ≜ (μk, Σk), the emission parameters for state k. Each element d = 1…D of the D-dimensional feature-space is associated with an independent but unknown HRF, modeled as an finite impulse response (FIR) filter h[d] ≜ (h0[d] … hL[d]) of length L + 1. Each h[d] has a normal prior N ðμ h ; Σh Þ constructed by varying the delay, dispersion and onset parameters of the canonical HRF of SPM8 (Multiple, 2007) and computing their mean and variance. The length L + 1 is typically set to 32 s. The set of HRF parameters is then the D × L matrix h ≜ (h[1] ⊤… h[D] ⊤) ⊤. The fMRI data Yt in voxel-space and yt = Φ[Yt] in feature-space arises through an element-wise convolution yt = ∑l Hlzt − l + t of the metabolic activity zt with the HRF h. Here, Hl ≜ diag{(hl[1]…hl[D]) ⊤} and t ∼N ð0; Σ Þ is temporally i.i.d. noise. The HRF L + 1 TRs long induces a correlation in the scans yt…t + L based on the activation zt corresponding to the brain-state xt at time t. Note that the convolution is actually in voxel-space, but since the linear feature-space transform is commutative with convolution, it can be performed directly in feature-space. Therefore, denoting the set of parameters θ ≜ {u, w, ϑ, h, Σ}, the full probability model is (cf. Fig. 2): pθ ðy; z; xÞ≜pðy; hjz; Σ Þpðzjx; ϑÞpðx; w js; uÞ;
ð2Þ
where " pðx; w js; uÞ = pðwÞ
# ∏ πxt−1 ;xt ðst Þ ∏ πxt−1 ;xt ðut Þ ;
t∈T∖U
t∈U
T pðzj x; ϑÞ = ∏ p zt jxt ; ϑxt ; t=1
and T pðy; hjz; Σ Þ = pðhÞ ∏ pðyt jzt−L…t ; h; Σ Þ : t =1
The SSM hyperparameters K and λw are selected using an automatic data-driven procedure described in the SSM hyper-parameter selection section.
The SSM effectively provides a temporally coherent formalism for clustering the fMRI scans based on their patterns of spatially distributed metabolic activity. Like any clustering algorithm, its efficacy depends on the metric quantifying the difference/similarity between the activity patterns at two time-points (Jain, 2010). In the Functional distance section, we describe a novel functional distance based on the concept of the cost of minimum “transport” of activity between two time-points measured over the functional networks of the brain. Functional networks are routinely defined by the temporal correlations between the fMRI time-series of the voxels (Li et al., 2009). Technical ?? presents an algorithm of computing the functional connectivity (i.e. correlations) F[i, j] ∈ [− 1, 1] between all pairs of voxels i, j that is consistent, sparse and computationally efficient, although any equivalent method can be used. In previous work (Janoos et al., 2010a), we had demonstrated the efficacy of this method to determine functionally meaningful clusters in the space of activation patterns. Because this metric, related to the earth mover's distance (EMD) (Shirdhonkar and Jacobs, 2008) between distributions, is posed as an optimization problem, it does not have a well-understood metric structure. For example, it does not provide a closed-form expression for the mean or variance of a cluster. As a result, determining the statistical properties of clusters obtained under this metric is not straightforward. A strategy for circumventing this problem is to embed the fMRI scans in an Euclidean space obtained from the pair-wise distance matrix through an embedding such as the graph Laplacian (Chung, 1997), and then performing analysis in this space (Janoos et al., 2010a). This, however, requires an estimation of the functional distance between all pairs of volumes in an fMRI session, which with a worst-case running time of O N3 logN per pair (Shirdhonkar and Jacobs, 2008), is prohibitively expensive. Therefore, in the Orthogonal basis construction section we adopt an alternative strategy, wherein each fMRI scan is embedded in a linear space obtained by an orthogonalization of the matrix of voxelwise functional connectivities. Similar to the approximation of EMD by a wavelet transform (Shirdhonkar and Jacobs, 2008), the Euclidean metric in this feature-space provides a good approximation of the functional distance. Since the number ofvoxels and therefore the number of linear 5 basis vectors N∼O 10 is orders of magnitude larger than the number of scans T∼O 102 , dimensionality reduction is required to prevent over-fitting. Using a bootstrap analysis of stability (Bellec et al., 2010), basis vectors that are not “stable” across resamples of the data are discarded, causing a significant reduction in dimensionality. This procedure is laid out in the Feature-selection section. Other feature-spaces in fMRI include purely unsupervised ones like PCA (Multiple, 2007), ICA (Calhoun et al., 2009) and functional parcellations of the cortex (Thirion et al., 2006), or supervised ones like PLS and regions-of-interest (ROIs) identified either by a univariate GLM analysis or through manual demarcation (Haynes and Rees, 2006). Dimensionality reduction is achieved either in an unsupervised fashion, for example by selecting principle components (PCs) that explain a certain percentage of variance or in a supervised fashion by selecting features correlated with or predictive of the experimental variables (Mitchell et al., 2008). For the aims of the methodology presented here, supervised feature-spaces are unsuitable as they are inherently biased towards the experimental variables against which they were selected and may not capture other patterns in the data. At the other extreme, unsupervised feature-spaces do not necessarily describe patterns of interest. For example, in our data-sets, we observed that the largest variance principal components corresponded to motion and physiological noise such as respiration and pulsatile activity. In contrast to these, the feature-space developed here is motivated by an intuitive definition of the functional distance between activation patterns and feature selection is performed using a criterion of stability. A
366
F. Janoos et al. / NeuroImage 57 (2011) 362–377
quantitative evaluation of these various alternatives is reported in the fMRI data-set: mental arithmetic task section.
Theorem 1. Let δz½l; m be coefficients of δZ = Zt1 −Zt2 in the basis Φ. Then, there exist constants M0;0 ≥^ M 0;0 N 0, such that
Functional distance ^ M0;0 The functional distance FD(Zt1, Zt2) between the activation patterns Zt1 and Zt2 (in voxel-space) at two time-points is quantified by the transportation distance, i.e. the minimal “transport” f : N × N→R of activity over the functional circuits to convert Zt1 into Zt2 (Janoos et al., 2010a). Specifically,
f
∑
m=0
ð3Þ
i=1 j=1
subject to the constraints: f [i, j] ≥ 0, ∑ j f [i, j] ≤ Zt1[i], ∑i f [i, j] ≤ Zt2[ j], and ∑i, j f [i, j] = min{∑i Z t1[i], ∑i Zt2[i]}. The cost of the transport of f [i, j] from voxel i to j will depend on a measure of the “disconnectivity” dF : N × N→Rþ between the voxels complementary to F, defined in the Cost metric section. This definition captures the intuitive notion that two activity patterns are functionally more similar if the differences between them are mainly in voxels that are functionally related to each other indicating the activation of a shared functional network.
m=0
l=0
l=0
ð5Þ and the tightness of this bound is:
N N FD Zt1 ; Zt2 = min ∑ ∑ f ½i; jdF ½i; j;
log2 N−1 2m −1 ∑ jδz½l; mj≤ FD Zt1 ; Zt2 ≤ M0;0 ∑ ∑ jδz½l; mj
log2 N−1 2m −1
sup jj δz jj 2 = 1
M0;0 −^ M 0;0 ^ pffiffiffi M0;0 ∑ ∑ jδz½l; mj− M0;0 ∑ ∑ jδz½l; mj ≈ m m 2 l l ð6Þ
Therefore, the functional distance FD Zt1 ; Zt2 is approximated (up to a multiplicative constant) by the ‘2 distance metric in this space, δ zt1 ; zt2 =
2 ∑ zt1 ½l; m−zt2 ½l; m l;m
!1 2
:
ð7Þ
Cost metric
M0;0 is examined at the The quality of the approximation M0;0 −^ end of the next section.
Treating F as the adjacency matrix of a graph with N vertices, a cost metric dF[i, j] = |ϕ*[i] − ϕ*[ j]| is induced via a distortion minimizing embedding ϕT : N→R of the graph (Chung, 1997):
Feature-selection
ϕ = arg inf
ϕ⊥DF 1
∑i ∑j ðϕ½i−ϕ½ jÞ2 F½i; j
ð4Þ
∑i ϕ½i2 DF ½i; i
where DF is the diagonal degree matrix 1 of the adjacency matrix F. Here, the embedding ϕ* will take similar values at voxels that have high functional connectivity and the functional distance between them is dF[i, j] = |ϕ*[i] − ϕ*[ j]|. The constraint ϕ ⊥ DF1 is to prevent ϕ* from taking a value at each vertex proportional to its degree, which is the trivial minimizer of Eq. (4). It can be shown (Chung, 1997) that ϕ* is the solution to the generalized eigenvalue problem (DF − F)ϕ = λDFϕ subject to the constraint ϕ ⊤DF1 = 0. If η1 is the eigenvector −1
−1
η1 = λ1 Lη1 of the normalized graph Laplacian L = DF 2 ðDF −FÞDF 2 −1 corresponding to second smallest eigenvalue λ1 N 0, then ϕ = DF 2 η1 . Orthogonal basis construction Through a recursive partitioning of the voxel-grid based on its emn o bedding ϕ*, we construct an orthogonal basis Φ = ϕðl;mÞ : N→R where the index m =0 … log2N − 1 gives the level of decomposition, while l= 0…2m − 1 indexes the basis vectors at level m. The first basis 1 vector ϕð0;0Þ = D−2 η1 , where η1 is the eigenvector of Lð0;0Þ = L corresponding to the second smallest eigenvalue. Next, the graph is partitioned into two sub-graphs based on the sign of ϕ(0, 0), and their graph Laplacians Lð1;1Þ and Lð2;1Þ are computed. The next two basis vectors ϕ(1, 1) and ϕ (2, 1) are the second smallest eigenvectors of the Lð1;1Þ and Lð2;1Þ , respectively. The process is repeated until only one voxel is left in the partition. The details of this algorithm are provided in Technical ??. Let the projection of Zt on this orthogonal basis be denoted as zt ½l; m; m = 0… log 2 N−1; l = 0…2m −1 , where zt ½l; m≜ 2−m hZt ; ϕðl;mÞ i. The following theorem asserts that the representation in this basis Φ of the difference δZ = zt1 −zt2 between the activity patterns at two time-points t1 and t2 provides a tight bound on the functional distance.
1
DF ½i; i = ∑j≠i F½i; j and DF[i, j] = 0, ∀i ≠ j.
Basis vectors for the feature-space are selected using a bootstrap analysis of stability (Bellec et al., 2010). The bootstrap generates a nonparametric estimate of the sampling distribution of a statistic (i.e. bootstrap distribution) from a single sample of the data by creating multiple surrogate samples of same size as the original sample, by resampling with replacement from the original sample. Bootstrap estimates of the functional connectivity matrix F are obtained by resampling entire blocks of fMRI scans from a session Y = fY1 …YT g to create a surrogate session. The presence of serial correlations in the time-series data necessitates a block bootstrap method, wherein the T scans are divided into M-blocks and a resample is created by randomly selecting M-blocks from this set, with replacement. Although the block length T/M needs to be adapted to the range of temporal dependencies present in the data, the correlation structure in the data is faithfully reproduced over a fairly wide range of lengths (Bellec et al., 2010). We found T/M ≈ 5TRs to be adequate for our data-sets. The stability of a particular basis vector ϕ(l, m) is defined by its (l,m) (l, m) correlation coefficient ρ(l,m)(r1, r2)≜|〈ϕ(r1) , ϕ(r2) 〉| across resamples r1, r2 of the data. Given the bootstrap distribution of correlations Prboot [ρ(l,m)(r1, r2)], a vector ϕ(l, m) is said to be τΦ-stable if Prboot [ρ(l, m)(r1, r2)≥τΦ] ≥ 0.75, i.e. the correlation between at least 75% of the resamples of ϕ(l, m) is greater than the threshold 0≤τΦ ≤1. If ϕ(l,m) is not τΦ-stable, then it is discarded, which also removes all the vectors obtained from the subdivision of ϕ(l,m). Therefore, increasing the value of τΦ causes an exponential increase in the number of vectors that are removed. The effect of τΦ on the dimensionality D is shown in Fig. 3. Initially there is a steep reduction in dimensionality from O 105 to O 102 .
However, after a certain value of τΦ, the reduction slows down significantly. This knee-point usually occurred at D ≈ 500 corresponding to τΦ = 0.4 – 0.5 in our data-sets. Therefore, τΦ was adaptively set for each fMRI session such that D = 500. The figure also shows the largest index m of the vectors ϕ (l,m) retained for a given τΦ, indicating that the stability of basis vectors reduces as the level of decomposition m increases. This observation, along with the 2− m decay of the coefficients in Eq. (5), implies that the effect the of reduced dimensionality of Φ on the approximation error is small, as most of the discarded vectors have a large index m.
F. Janoos et al. / NeuroImage 57 (2011) 362–377
x 105
20 Dimension of Φ Maximum m
1
10
0 0
0.2
0.4
0.6
0.8
1
Maximum m
Dimension of Φ
2
0
Fig. 3. Dimensionality reduction. The effect of threshold τΦ on the average dimensionality of Φ and on the maximum index m (level of decomposition) of the basis vectors ϕ(l,m) that was retained. Results are for the 42 subject data-set of the fMRI data-set: mental arithmetic task section.
2 A comparison of the relative logarithmic error in the approximation of FD zt1 ; zt2 using the reduced Φ versus the full basis set is shown in Fig. 4. It can be observed that the linear approximation δ zt1 ; zt2 provided by the full basis Φ is typically within 2.5 × the transportation distance FD zt1 ; zt2 , while the distance in reduced dimensionality basis is within 3 × FD zt1 ; zt2 , providing empirical validation of Eq. (6). We see that reducing the dimensionality by an order of O 103 increases the approximation error by less than 20%, on average.
Estimation In this section, a generalized expectation–maximization (GEM) algorithm (Bishop, 2007) to estimate the parameters θ = arg maxθ pθ ðyÞ is presented. Introducing a variational density qðz; xÞ over the latent variables z; x, the log-probability of Eq. (2) is decomposed into a freeenergy and a KL-divergence as ln pθ ðyÞ = Qðq; θÞ + KLðqjj pθ Þ, where, pθ ðy; z; xÞ dz and qðz; xÞ p ðz; x jyÞ dz: = − ∑ ∫z qðz; xÞ ln θ qðz; xÞ x
Qðq; θÞ = ∑ ∫z qðz; xÞ ln x
KLðqjj pθ Þ
Starting with an initial estimate θ (0), the GEM algorithm finds a local maxima of ln pθ ðyÞ by iterating the following two steps: E−step M−step
ðnÞ
q ← arg min K Lðq jj pθ ðnÞÞ q ðn + 1 Þ ðnÞ ðnÞ ðnÞ : θ ← θ such that Q q ; θ N Q q ; θ
ð8Þ
The iterations are terminated when the updates to θ fall below a pre-specified tolerance (adaptively set at 1% of the absolute value of the parameter in the n th iteration), yielding a locally optimal solution θ*. The expressions and detailed derivations for the E-Step estimates of all the parameters in the M-Step are provided in Technical ?? and ?? respectively. E-step Although the minimizer of KL(q||pθ(n)) is qðz; xÞ = pθðnÞ ðz; x jyÞ, the HRF introduces a dependency structure between xt − L…xt + L and
2 The relative logarithmic error of the linear approximation is given as: j log10 Δ zt1 ; zt2 − log10 FD zt1 ; zt2 j/ log10 FD zt1 ; zt2 .
367
zt−L …zt + L when conditioned on the measurement yt . Therefore, evaluation of Q pθðnÞ ðz; x jyÞ; θ in the M-step would require marginalization over sequences of2L + 1 variables, resulting in a computational complexity of O T × K 2L for parameter estimation (as compared to T × K 2 for first-order HMMs). To avoid this expensive computation, we restrict q to the family of factorizable distributions qðz; xÞ = ∏Tt = 1 qt ðzt ; xt Þ = ∏Tt = 1 qt ðzt jxt Þqt ðxt Þ. This is known as the mean field approximation in statistical physics and it can be shown (Bishop, 2007) that if q⁎ ðz; xÞ = ∏Tt = 1 q⁎t ðzt ; xt Þ = arg min q KLðqjj pθ ðz; x jyÞÞ = n
o arg min q KLðq jj pθ ðy; z; xÞÞ then q⁎t ðzt ; xt Þ∝exp E∏t′ ≠t q ⁎ ln pθ ðy; z; xÞ . t′ As shown in Technical ??, each factor of the mean field approximation is a product q⁎t ðzt ; xt Þ = q⁎t ðzt j xt Þq⁎t ðxt Þ of a multinomial logistic probability qt*(xt) and a normal density q⁎t ðzt jxt Þ. Therefore, under this approximation, the n th iteration of E-step involves computing the ðnÞ factorizable density qðnÞ ðz; xÞ = ∏Tt = 1 qt ðzt ; xt Þ by first initializing qt with the posterior distribution of pθðn−1Þ ðzt ; xt Þ, and then updating q*t until convergence. As these iterations are a coordinate descent of the KLdivergence term KL(q||pθ(n)), the solution obtained is only a local optimum and depends on the initializations and the order of the updates to q*. t M-step Since the maximization of the state transition parameter w is coupled with that of the missing stimuli u, and Q qðnÞ ; θ is not jointly concave in w and u, we decouple the problem into two concave problems, by first maximizing Q with respect to w setting u←uðnÞ, and then maximizing with respect to u setting w←wðn + 1Þ. Although, w can be estimated using the iteratively re-weighted least squares (IRLS) method, it involves an expensive inversion of the Hessian ∇2w Q at every iteration. This inversion is avoided using a bound optimization method (Krishnapuram et al., 2005) that iteratively ′ ′ maximizes a surrogate function w n + 1;n = arg maxw Q′ w jw n ;n . The index n′ marks the iterations of the bound maximization of Q′ with respect to w, during one iteration of the M-step indexed by n. ð0;nÞ This inner maximization loop is initialized with ←wðnÞ and w n′ + 1;n n′ ;n jj 2 falls below a terminates when the update jj w −w n′ + 1;n ðn + 1 Þ certain tolerance, and w is the new value for the M←w step update. This tolerance can be fairly loose (typically 10% of the n′ ;n absolute value jj w jj 2 ), as the GEM algorithm only requires an increase in the value of Q with respect to its parameters and not necessarily the maximization of Q. The surrogate function used here is the quadratic function with constant Hessian B such that ∇2w Q−B is negative-definite. Although bound optimization takes more iterations to converge than IRLS, on the whole it is much faster since it precludes inverting the Hessian (of the order of the size of w) at each step (Krishnapuram et al., 2005). Detailed proofs for the gradient and Hessian of Q with respect to w are elaborated in Technical ?? followed by the bound optimization procedure in Technical ??. After estimating wðn + 1Þ , Q qðnÞ ; θ is then maximized with respect ðn + 1Þ to ut for all t ∈ U by setting w←w . Again, the Hessian ∇2ut Q is negative-definite and therefore Q is concave in ut with a unique global maximum. Since ∇2ut Q is of the dimension of the stimulus vector and is usually easily invertible, this maximization is done using IRLS because of its faster convergence. The expressions for the gradient and Hessian of Q with respect to u are given in Technical ??. The estimates of emission parameters ϑk = (μk, Σk) in the nth iteration of the M-step can be computed in closed-form, as per the formulae ⊤ detailed in Technical ??. Let h½d≜ðh0 ½d…hL ½dÞ denote the L + 1-tap th HRF FIR filter corresponding to the d element of the D-dimension feature-space. As shown in Technical ??, the gradient ∂Q = ∂h½d for the FIR filter at the d-th element depends on the values of h½d′ at all the other d′ ≠ d of the D-dimensional space. Setting ∂Q = ∂h½d = 0, for
368
F. Janoos et al. / NeuroImage 57 (2011) 362–377
0.8
Rel. log error (reduced Φ)
all d = 1 … D, results in a linear system of D × L equations in D × L ðn + 1Þ unknowns. The unique solution h is computed using conjugate ðnÞ gradient descent (Golub and Van Loan, 1996) initialized at h , and its ðn + 1Þ ðnÞ iterations are terminated when the update jj h −h jj 2 falls below ðnÞ a pre-specified tolerance (set adaptively at 10% of jj h jj 2 ). The closedform estimation of the noise variance Σ is given in Technical ??. Spatial activation maps The activation pattern for a specific value of the experimental variables st is obtained by first computing the invariant distribution pðxt jw; st Þ as the first eigenvector of the state-transition matrix πðst Þ (cf. Eq. (1)), and then computing the mean activation K
pattern K as μ st = ∑k = 1 pθ ðxt = kjw; st Þμ k and its variance as Σst = ∑k = 1 pθ ðxt = k jw; st ÞΣk . The z-score map for the activation pattern −1 = 2 corresponding to sˆ is given by Σst μ s t in feature-space, which can then be transformed back into a voxel-wise spatial map of activity.
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
Rel. log error (full Φ) Estimating the optimal state-sequence Direct estimation of the most probable sequence of states x* = arg maxx pθ ðx jyÞ = argmaxx pθ ðy; xÞ, given an estimate of SSM parameters θ requires joint maximization over all T state variables x1…xT, since the hidden layer z layer introduces a dependency between all the y and x variables preventing factorization of the graphical model. As the size of the search space increases exponentially with T with a complexity of O T K for the whole chain, exhaustive search soon becomes infeasible and an approximation such as iterated conditional modes (ICM) is required (Bishop, 2007). To avoid this computational burden, we developed an EM algorithm for optimal state-sequence estimation under a mean field approximation that iteratively transforms the problem into a series of first order HMMs.Each M-step can then be computed using the Viterbi algorithm with O T × K 2 complexity (Bishop, 2007). The EM iterations terminate when the increments jln pθ y; xðn + 1Þ − ln pθ y; xðnÞ j fall below a prespecified tolerance, typically set to 0.0099 corresponding to b 1% increase in the probability. Technical ?? contains the details of this state-sequence estimation procedure. SSM hyper-parameter selection The hyper-parameters of the SSM are the number of hidden states K, the precision λw of the prior distribution of the transition weights w, and the parameters μ h ; Σh of the prior model of the HRF h. The values of μ h and Σh , determined from the canonical HRF of SPM8, are used to enforce domain knowledge by restricting the HRF to the space of physiologically plausible shapes. This provides an optimal trade-off between allowing a spatially varying and unknown HRF against overfitting the FIR filter to the data. The hyper-parameter λw determines the variance in the weights wi;j , and implements a trade-off between the effect of the stimulus versus the previous state on the current state probability and mediates a complex set of interactions between the temporal structure of the fMRI data and of the stimulus sequence. A very high value of λw causes the statetransitions to be driven mostly by the current stimulus, while a low value increases the contribution of the previous state to the transition probability. It therefore cannot be practically provided as a user-tunable parameter. On the other hand, model-size (i.e. K) selection is typically done using Bayes factors (Kass and Raftery, 1995), information theoretic criteria (McGrory and Titterington, 2009) or reversible jump MCMC based methods (Scott, 2002). Implicitly these methods require an a priori notion about the complexity of a given model. Here instead, we adopt an automated method for selecting both K and λw based on a maximally predictive criterion leveraging the ability of the SSM to predict missing stimuli u. From the stimulus time-series, blocks of T′ consecutive time-points (in TR units) totalling to 25% of
Fig. 4. Approximation quality. A scatter plot the relative logarithmic (base-10) error in the approximation of FD(zt1, zt2) using the reduced (y-axis) versus the full basis-set (x-axis). Each point represents the pair-wise distance between the fMRI scans at time-points t1 and t2 of the same subject, for the data-set described in the fMRI data-set: mental arithmetic task section.
the total number of scans, are removed at random to serve as missing stimuli U ≜ {t1…t1 + T′ − 1,…,tM… tM + T′ − 1} and the optimal SSM parameters θ* are estimated for a given K and λw . The prediction error is then measured as ERRmissing ≜∑t∈U jj ut −st jj 2 : between the predicted ut and their true values st . The hyper-parameters are then selected to minimize this error-rate. The optimal value of K is obtained by first stepping through different values of K with large step-sizes and then iteratively refining the step-size. The advantage of this procedure is that it allows selecting a model most relevant to the experiment being conducted. For each setting of K, the optimal λw is determined by searching over the range log10 λw = −3… + 3, and selecting the valued that minimizes ERRmissing. This allows setting the parameter to effect an optimal compromise between stimulus driven and previous state driven transitions. We observed that the prediction error is relatively insensitive to λw (cf. Results section), and therefore a common value can be selected across a multi-subject data-set for one study. The reader will observe that prediction error is used merely as a statistic (Friston et al., 2008) to select hyper-parameters. The parameters themselves, unlike MVPR classifiers, are not estimated to minimize prediction error but rather to fit a model of brain function to the data. It is this distinction that allows interpretation of the estimated parameters (in terms of the underlying neurophysiological model) in contrast to MVPR classifiers. The effect of these hyperparameters and the length T′ of a missing-stimulus block on the model estimation is evaluated in the next section. Results This section starts off with a quantitative validation of the model and estimation algorithms using a synthetic data-set. Then the results of a method applied to a multi-subject fMRI study for mental arithmetic processing are presented, including the ability of the method to discover new patterns in the data, a comparative evaluation with respect to other analysis methods and feature-spaces and followed by the group-level analysis of the data-set made using this spatio-temporal generative model. The algorithms were implemented in MATLAB ® with Star-P ® on an 2.6 Hz Opteron cluster with 16 processors and 32 GB RAM.
F. Janoos et al. / NeuroImage 57 (2011) 362–377
Notation Table 1 summarizes the notation introduced in the earlier text and used in the discussion that follows in this section. Simulation Methods and materials The algorithms were validated on a synthetic data-set created as follows. For all simulations, the length of the session was T = 600 TRs, the dimension of the feature-space was D = 500 and the dimension of the stimulus vector st was set to 5, to reflect a typical fMRI data-set. Model size K was varied from 5 to 50, while the precision hyperparameter λw was varied from 10 − 3 to 10 3. The state transition parameters ωj were initialized from a uniform distribution over [0,1], while wi;j were sampled from N 0; λ−1 w I5 , where In is the n × n identity matrix. The HRF FIR coefficients h½d for each element of the d feature-space were obtained by sampling from N ðμ h ; Σh Þ. The emission parameters (μ k, Σk) for each state k were obtained by sampling Σk from a Wishart distribution W ðT;ID Þ and μk from N ð0; Σk Þ. The noise variance Σ was sampled from W T; β−1 ID . The parameter β effectively controls the SNR of the data, by controlling the ratio of the noise variance to that of the activation patterns zt . For each time-point t = 1…T, the stimuli st were generated from a normal distribution N ð0; I5 Þ and then smoothed along the time dimension with a Gaussian filter of different full-width-at-half-maximum (i.e. FWHM), in order to impose a temporal structure on the simulated data. The values of xt ; zt and yt were then sampled from their generative distributions as per The state–space model (SSM) section. We compared the results of the generalized EM (GEM) algorithm under the mean field approximation (MF-GEM) presented here (cf. Estimation section) against those obtained by an MCMC based estimation algorithm (Janoos et al., 2010b). The number of MCMC samples were varied to match the MF-GEM algorithm in terms of equal running time (MCMC:RT) and equal estimation error (MCMC:EE). As the MCMC method can produce exact estimates, given sufficient number of samples, it was also run until convergence (MCMC:CNV) in order to establish baseline accuracy. The experiments were repeated with β = 10, 100 and 1000 corresponding to SNR of 10, 20 and 30 dB respectively. Discussion The relative error ERRestimate in the parameter estimates θ *, the relative error ERRK of model-size estimates K * and the prediction Table 1 Summary of notation. Symbol
Definition
T N Φ D K st ut xt ∈ [1…K] w zt ∈RD h μh, Σ h yt ∈RD Σ λw θ ERRmissing
Total number of time-points in an fMRI session Total number of (cortical) voxels in an fMRI volume Basis vectors of the feature-space (cf. Feature-space section) The dimension of the FS Φ Total number of hidden brain-states The stimulus vector at time t Unobserved (hidden) stimulus vector at time t The brain-state at 1 ≤ t ≤ T State-transition probability parameters The (pre-HR) brain activation pattern at 1 ≤ t ≤ T in feature-space The linear convolutional HRF of length L The mean and variance of the prior distribution of h The fMRI scan in feature-space at 1 ≤ t ≤ T Variance of the normally distributed noise Hyper-parameter for state-transition probabilities The set of model parameters The error in estimating the missing stimuli ut during the model estimation step (cf. SSM hyper-parameter selection section) The length of a block of missing stimuli (in TR units) when computing ERRmissing
T′
369
error ERRmissing (cf. SSM hyper-parameter selection section) for the various experiments are charted in Fig. 5. One of the main observations is that the MCMC algorithm requires almost thrice the total running-time (including searching for the optimal hyper-parameter values) for the same estimation error ERRestimate as the mean field EM method (MF-GEM), while the prediction error of MF-GEM is within 20% of the best ERRmissing as measured by MCMC: CNV. While reducing SNR does not affect running-time significantly, its effect on the errors is large. Reducing the SNR from 30 to 20 dB caused prediction error to increase from b 10% to ≈ 30%. Furthermore, for SNR ≤ 20dB the estimate for model-size using the maximally predictive criteria is within 10% of the true K. Although all the parameters θ are important in determining the accuracy of the model, of special interest are the μk, k = 1…K, as they correspond to the spatial distribution of activity representative of each state. error, defined as ERRspatial = The⊤average estimation 1 = K∑k μ ⁎k −μ k Σk−1 μ ⁎k −μ k , of these parameters for the MF-GEM and MCMC:CNV cases are listed in Table 2. It can be observed that for the 20 dB case, the estimated spatial patterns are within ≈ 0.25 standard deviations (given by Σk) of the true μk. Examining effect of model-size K on prediction-error in Fig. 6 (black solid line), the classical bias-variance trade-off can be observed. For K much less than the true value, the estimated model is too simple and may fail to account for all the patterns in the data, while for K much greater than the true value, the model may start to fit noise (over-fitting) thereby compromising its ability to predict the driving stimulus. The plot of ERRmissing versus precision hyper-parameter λw in Fig. 6 (red dashed line) shows how it determines the balance between the temporal structure of the stimulus-sequence and the statesequence. A model with transitions highly determined by current stimulus (i.e. high λw ) fails to learn the temporal structure of the state-sequence and hence fails to estimate the correct state for a timepoint with missing stimulus. Similarly, one with transitions driven mainly by the previous state (i.e. low λw ) fails to leverage any taskrelated structure in the data, thereby reducing its prediction power for missing stimuli. However, ERRmissing achieves near optimal levels over a large range of values of λw (e.g. 10−1:5 ≤λw ≤10 + 1 for true λw = 100 ) indicating a high degree of robustness with respect to it. The effect of the block size T′ (cf. SSM hyper-parameter selection section) on prediction error, for different FWHMs of the Gaussian filter applied to s, is shown in Fig. 7. The prediction error for small values of T′ is extremely low, increases as the length of the missing stimulus block increases and then stabilizes. This increase followed by stabilization happens at block lengths proportional to the FWHMs, typically 3 ×FWHM. For block lengths T′ ⪅ 3 ×FWHM, the high accuracy is due to the strong temporal regularity in the stimulus sequence over small durations. However, as block length increases, this temporal regularity becomes weaker and prediction is driven primarily by the spatiotemporal patterns in the data after T′ ⪆ 3 ×FWHM resulting in stable, albeit higher, error-rates. fMRI data-set: mental arithmetic task The method was applied to an fMRI study (Morocz et al., 2003; Shalev, 2004) for number processing capabilities in healthy, dyscalculic and dyslexic individuals. Using this data-set, this section provides quantitative comparisons of the SSM and the feature-space Φ, an investigation of the various parameters of the model and a neurophysiological discussion of the results. Also the ability of the SSM to compare and contrast the spatio-temporal patterns across groups of subjects is demonstrated. Background Dyscalculia (DC) is a specific learning disability affecting the acquisition of mathematical skills in children with otherwise normal
370
F. Janoos et al. / NeuroImage 57 (2011) 362–377
(b) ERRestimate
80 70 60 50 40 30 20 10 0
ERRestimate
mins
(a) Running time
(c) ERRpredict
(d) ERRK
1 0.8
ERRK
ERRpredict
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.6 0.4 0.2 0
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Legend MF-GEM 30dB 20dB 10dB
GEM with mean field approximation
MCMC:RT
MCMC with running time equal to MF-GEM
MCMC:PE
MCMC with ERRestimate equal to MF-GEM
MCMC:CNV MCMC with convergence
Fig. 5. Simulation results. The GEM method under mean field approximation (MF-GEM) is compared against an MCMC based estimation algorithm matched in terms of equal running time (MCMC:RT), equal estimation error (MCMC:EE) and MCMC run until convergence of the estimates (MCMC:CONV). The experiments were repeated for SNR = 10, 20 and 30 dB. Plotted are total running time (Fig. (a)), relative estimation error (Fig. (b)), prediction error (Fig. (c)) and relative error in estimating the correct K (Fig. (d)) for the different experiments. Error bars indicate ± 1 standard deviations.
general intelligence and has been attributed to either abnormalities in the numerical (non-verbal) quantity processing system localized in the left and right intraparietal regions, or to impairments of the verbally encoded mathematical-fact retrieval system localized in the frontal and temporal (language) regions (Dehaene et al., 2003). Dyslexia (DL) is a reading disorder defined as a selective inability to build a visual representation of a word used in subsequent language processing, in the absence of general visual impairment or speech disorders. Certain but not all types of DL may be caused by infarcts in the left middle cerebral hemisphere affecting the temporo-parietal and frontal regions or by atrophy in the anterolateral temporal lobe or by damage to the left occipito-temporal regions (Price and Mechelli, 2005). Neuroimaging studies of both developmental disabilities have failed to pin-point their neuronal substrates. In the case of DC, difficulties in accurate diagnosis, heterogeneity of arithmetic difficulties and the frequent association with DL and attention disorders have impeded spatial localization. Indeed, a great variety of nonspecific problems including slow speed of processing, poor working memory Table 2 Effect of SNR on error. Tabulated is ERRspatial (± 1 std.dev.) in μk estimated by the MF-GEM algorithm and MCMC run to convergence (MCMC:CNV). SNR (dB)
MF-GEM
MCMC:CNV
30 20 10
0.151±0.06 0.223±0.09 0.361±0.13
0.126±0.05 0.212±0.09 0.357±0.14
span, attentional disorders and deficits in the long-term retrieval of arithmetic facts, may influence arithmetic performance (Molko et al., 2003). While, it has been proposed that DL is more generally characterized by a disconnection syndrome of the reading network, the neural correlates of these putative reading pathways are not well understood (Price and Mechelli, 2005). Methods and materials In each trial of the self-paced, irregular paradigm, two single-digit numbers (e.g. 4 × 5) were displayed visually for 2.5 s. After an interval of 0.3 s an incorrect solution (e.g. 27,23,12) was displayed for 0.8 s. Subjects had up to 4 s to decide, with a button press, if the answer was (a) close (within ± 25% of the correct answer), (b) too small or (c) too big. The next trial started after a rest of 1 s, and each trial lasted 4–8.6 s. For each t = 1 … T, the experimental conditions were described by: • [Ph] Indicates if t is (1) multiplication or (2) subtraction/judgement or (3) decision-making phase of the experiment • [LogPs] Quantifies the product size of the multiplication problem (1 ≤ LogDiff ≤ 10) • [LogDiff] Quantifies the expected difficulty in judging the right answer 3 (1 ≤ LogDiff ≤ 5). 3 If Rc = a × b is the correct product for the multiplication problem a × b and Rd is the displayed incorrect result, then the product size is scored as LogPs = log(Rc). The score LogDiff is log(|(1.25Rc) − (Rc + |Rc − Rd|)|/|1.25Rc|), which measures the closeness of the incorrect result to the ± 25% mark and represents the difficulty in judging the correct answer.
ERRmissing
log10 λw −3 0.8 0.6 0.4 0.2 0
−2
10
−1
20
0
30
Κ
1
40
2
50
3
0.8 0.6 0.4 0.2 0 60
Fig. 6. Prediction error versus hyper-parameters. The error ERRmissing (±1 std. dev.) with respect to the model-size K (bottom axis) and hyper-parameter log10λw (top-axis). Results are for the simulations of the 20 dB case with true value for K = 30 and for λ = 1. The trends for the other two SNR values are similar. Legend. Black solid line: ERRmissing vs. K; Red dashed line: ERRmissing vs. log10λw.
Twenty control subjects, thirteen high-performing (full-scale IQN95) individuals with pure dyscalculia (DC) and nine with dyslexia (DL) participated. 4 All subjects were free of neurological and psychiatric illnesses and attention-deficit disorder. All controls denied a history of calculation and reading difficulties. In order to balance the group sizes, group-level analysis was done by selecting 8 subjects at random from each group and computing the statistics over multiple resamples. Each subject participated in two fMRI sessions each of 12.19 min, with 240 trials and 552 scans in total. The data were acquired with a GE 3 T MRI scanner with quadrature head coil, using a BOLD sensitized 3D PRESTO pulse sequence with volume scan time of 2.64s and resolution of 3.75 × 3.75 × 3.75 mm 3. High resolution anatomical scans were also acquired, bias-field corrected, normalized to an MNI atlas space and segmented into gray and white matter regions. The fMRI scans were motion corrected using linear registration and coregistered with the structural scans using SPM8 (Multiple, 2007). Next, the time-series data were high-pass filtered (0.5Hz) to remove artifacts due to breathing, blood pressure changes and scanner drift. The mean volume of the time-series was subtracted, white matter masked out and all further processing was performed on gray matter voxels. Comparative analysis One SSM Mj = fθ; K; λw g was trained per subject j = 1 … 42 with three different encodings of the stimulus vector st : • [SSM:NONE] Optimal SSM estimated with st = ð1Þ • [SSM:PH] Optimal SSM estimated with st = ðPh; 1Þ • [SSM:FULL] Optimal SSM estimated with st = ðPh; LogPs; LogDiff; 1Þ As the models SSM:PH and SSM:NONE do not encode LogPs, LogDiff (and Ph) in the stimulus vector, they cannot estimate them as missing stimulus. Therefore, to assess the ability of the SSMs to predict these variables from the optimal state-sequence x* we trained three simple multinomial classifiers: one to predict the probability Pr[Ph|xt*] of the phase Ph = 1, 2, 3, one to predict Pr[LogPs|x*] t of LogPs quantized to two levels at a value of 5, and one to predict Pr[LogDiff|xt] of LogDiff quantized to two levels at a value of 2.5. The error-rates across the three classifiers were accumulated into a single ERRSSM:NONE, ERRSSM:PH and ERRSSM:FULL, for each SSM trained. Also, as ERRmissing is not defined for SSM:NONE, its hyper-parameters are selected so as to minimize ERRSSM:NONE. For comparative evaluation of the SSM with MVPR methods, we trained three linear SVM classifiers per subject: one to predict Ph = 1, 2, 3, one for LogPs = 0, 1 (quantized) and one for LogDiff = 0, 1 (quantized) and accumulated their error-rates into ERRSVM. The SVMs were trained to predict the stimuli from the fMRI data deconvolved 4 Controls: 10 females, 10 males, age 21–34 yrs, mean age 25.6 yrs ± 3.0 yrs; DC: 6 females, 7 males, age 22–23 yrs; DL 2 females, 7 males, age 18–32.
ERRpredict
F. Janoos et al. / NeuroImage 57 (2011) 362–377
0.3 0.2 0.1 0
371
10
20
30
40
50
T’ TRs Fig. 7. Prediction error versus missing stimulus block length. ERRmissing (±1 std. dev.) with respect to the block length of missing stimulus T′ for different FWHM settings of the Gaussian filter applied to s are shown. Results correspond to the 20 dB case. The trends for the other two SNR values are similar. Legend. Black solid line: FWHM= 3 ×TR; Red dashed line: FWHM= 8 ×TR; Blue dotted line: FWHM= 15 ×TR.
with the canonical HRF. Among the other classifiers evaluated (viz. GNB, LDA, quadratic and exponential SVM) none significantly outperformed the linear SVM. For each of the SSMs and SVMs trained, the following feature-spaces were evaluated: • [FS:Φ] The basis vectors of Φ, with D = 500 (cf. Feature-space section). • [FS:PCA-NONE] A 110 basis vectors obtained from a PCA of the fMRI data, retained using a bootstrap analysis of stability (Bellec et al., 2010) at a 75% confidence level, to match the feature selection criterion for Φ. This confidence level yields 112.92 ± 7.80 principal components (PC) for each subject. • [FS:PCA-PH] The set of PCs maximally correlated with the HRFconvolved regressor for Ph. For each subject 65 PCs were retained at a confidence level of 75% (64.27 ± 8.56 PCs). • [FS:PCA-FULL] The set of PCs maximally correlated with the design matrix containing HRF-convolved regressors for Ph, LogPs and LogDiff identified using multiple regression. For each subject 80 PCs were retained at a confidence level of 75% (79.14 ± 9.57 PCs). The prediction error of the different model and feature-space combinations for the control group are complied in Fig. 8. The other two groups showed similar trends and are omitted for conciseness. It can be observed that the error for SSM:FULL is consistently lower (N 3 SEM for FS:Φ) than that of the SVM. This is noteworthy especially since the parameters of the SVM were specifically optimized for prediction error. By treating each fMRI scan independently, MVPR classifiers are unable to leverage the temporal structure in the data and rely only on spatial patterns for prediction. Moreover, the SVM uses a point-estimate of the neural activation through deconvolution,
Fig. 8. Comparative analysis. The mean prediction error (± 1 standard error of mean (SEM)) for the control group using a linear SVM classifier, and the SSM with three different encoding of the stimulus vector st viz. SSM:FULL, SSM:PH and SSM:NONE. The error-rates are measured for four different feature-spaces: FS-Φ, FS:PCA-NONE, FS:PCA-PH and FS: PCA-FULL. The last bar in the SVM column shows the SVM prediction error for only Ph against which the PCs of FS:PCA-PH were selected. The “chance-level” prediction error is ≈ 0.87 calculated through a permutation test.
372
F. Janoos et al. / NeuroImage 57 (2011) 362–377
whereas the SSM accounts for spatially varying and unknown hemodynamics in a probabilistic fashion which contributes to its ability to predict the mental state of the subject. Using only information about the phase Ph to train SSM:PH increased the error as compared to SSM:FULL, but only slightly (≈ 1 SEM). But removing all experimental information (SSM:NONE) caused a dramatic increase in error (ERRSSM:NONE ≈ 0.48). This implies that the semi-supervised SSM can detect the effect of experimental variables from patterns in the data (namely LogPs and LogDiff in the case of SSM:PH) against which it was not explicitly trained. Including some cues (namely Ph) about the experiment guides this discovery, stabilizes estimation and precludes the model from learning spatiotemporal patterns that are not relevant to the task and which may be due to artifacts (unlike SSM:NONE). It is nevertheless interesting that, despite not using any stimulus information, SSM:NONE had a prediction error much better than chance (ERRchance ≈ 0.87) which implies that it has discovered the mental states of the subject in a purely unsupervised fashion, validating some of the neurophysiological assumptions behind the SSM. The unsupervised PCA feature-space (FS:PCA-NONE) exhibited performance worse than FS:Φ in all cases. The poor accuracy of PCA has also been documented in other fMRI studies (O'Toole et al., 2007) and can be attributed to the lack of a specific relationship between the task selectivity of a PC and its variance. In contrast, as FS:Φ is obtained from the correlation matrix, it describes the structure of the interrelationships between the voxel time-series and not their magnitudes. Using PCs selected against the stimuli (FS:PCA-FULL) deteriorates the performance of the SSMs even further. This is because the exact coupling between stimuli and fMRI signal (and therefore PC timecourses) is unknown and may be non-linear, and selecting PCs linearly correlated with HRF-convolved stimuli may not preserve a large proportion of the spatio-temporal patterns in the data. In contrast, the SVM, based on optimizing prediction error, has best overall performance with this feature-selection strategy. The limitations, however, of supervised feature-selection are apparent in the case of PCA:PH. Although the SVM predicts Ph with high accuracy Ph (ERRSVM ≈ 0.08 for FS:PCA-PH vs. ≈ 0.11% for FS:PCA-FULL), its ability to predict any other stimulus is severely degraded with an overall
error of ≈ 0.50. The SSMs have similarly poor performance, due to the loss of information about spatio-temporal patterns in this basis.
SSM parameter estimates This section further investigates the parameters as estimated by SSM:PH trained with FS Φ and st = ðPh; 1Þ. The prediction error ERRSSH:PH exhibits a relatively shallow basin with respect to modelsize K for all three groups in Fig. 9(a), with minima occurring in the range K = 18 … 28. This points to the robustness of the SSM estimation with respect to K for each subject. The robustness of ERRSSM:PH with respect to λw , shown in Fig. 9(b) was comparable to that of the simulation study, with the curve of ERRSSM:PH almost flat for 10−2:2 ≤λw ≤100:7 . From the plot of ERRSSM:PH versus the dimensionality D of the feature-space Φ in Fig. 9(c), it can be noticed that initially ERRSSM:PH drops as D increases with Φ explaining more of the information in the data and bottoms out at 400 ≤ D ≤ 600, across the three groups. It then begins to slowly rise as a larger number of unstable basis vectors are included, capturing an increasing percentage of the noise in the data. Fig. 9(d) graphs ERRSSM:PH versus the length T′ of the missing stimulus block used in the assessment of ERRmissing (cf. SSM hyperparameter selection section). Here, we observe a very low error for small T′ as prediction is driven primarily by the strong temporal regularity of the stimulus presentation sequence over short durations. However, as in the case of the simulation, it increases with T′ and stabilizes at a block length of 2 trials (T′ ≈ 5 TRs) after which point there is no structure in the stimulus sequence and prediction is driven mainly by the patterns in the data. Table 3 compares the prediction error of SSMs with: [HRF:NONE] no HRF FIR filter; [HRF:CONST] a spatially constant HRF of length L + 1 = 32 s; [HRF:UNCON] spatially varying HRF of length L + 1 = 32 s without any constraints; and [HRF:PRIOR] the SSM with spatially varying and unknown HRF of length L + 1 = 32 s but constrained by the prior density N ð μ h ; Σh Þ (cf. The state-space model (SSM) section). Here, the advantage of the spatially-varying but physiologically constrained HRF (HRF:PRIOR) in dealing with the variable hemodynamics of the brain and accurately predicting the mental state can be seen.
(b) Error with respect to λw
(a) Error with respect to K
0.6
ERRSSM:PH
ERRSSM:PH
0.8 0.6 0.4 0.2
10
20
30
40
0.4
0.2
50
−2
(c) Error with respect to D
(d) Error with respect to T’ 0.6
ERRSSM:PH
ERRSSM:PH
0.8 0.6 0.4 0.2
2
0
log10 λw
K
1000
2000
3000
4000
5000
0.4 0.2 2
4
6
8
10
T’ (TRs) Fig. 9. Effect of hyper-parameters on ERRSSM:PH. Fig.(a): ERRSSM:PH with respect to model-size K. Fig.(b): ERRSSM:PH with respect to precision hyper-parameter λw. Fig.(c): ERRSSM:PH with respect to dimension D of Φ. Fig.(d): ERRSSM:PH with respect to missing stimulus block length T′. Error bars indicate ± 1 SEM. Legend. Blue solid line: control group, Red dashed line: DC group, Green dot-dashed line: DL group.
F. Janoos et al. / NeuroImage 57 (2011) 362–377 Table 3 Prediction error versus different HRF models. ERRSSM:PH (± 1 SEM) for the SSM model with no HRF (HRF:NONE), spatially constant HRF (HRF:CONST), spatially varying and unconstrained HRF (HRF:UNCON) and the spatially varying HRF with a prior obtained from canonical HRF of SPM8 (HRF:PRIOR).
HRF:NONE HRF:CONST HRF:UNCON HRF:PRIOR
Control
DC
DL
0.62±0.11 0.36±0.08 0.54±0.13 0.31±0.05
0.68±0.13 0.46±0.11 0.59±0.15 0.40±0.09
0.64±0.10 0.40±0.09 0.55±0.16 0.33±0.05
Removing the HRF altogether from the model (HRF:NONE), thereby not accounting for the lag in the fMRI signal due to hemodynamics, leads to the largest deterioration in performance. Although the inclusion of a spatially constant HRF (HRF:CONST) causes some reduction in accuracy, allowing too much variability by putting no constraints on the shape of the HRF (HRF:UNCON) results in even worse performance due to overfitting of noise. Fig. 10 shows the estimates of the spatially varying but constrained HRF FIR filter (HRF:PRIOR) for each group , averaged in regions-of-interest (ROI) selected in the left primary motor cortex (BA3, BA4) and the bilateral intraparietal sulcus (IPS) (BA40). A qualitative difference in the estimated HRFs is apparent in terms of their rise-time, peak value and dispersion. The prolonged and repeated recruitment of the IPS in this task may explain the dispersed shape of its HRF as compared to the motor cortex. No significant differences in HRF estimates were observed between the three groups. The group-wise spatial maps corresponding to the three phases of each trial are shown in Fig. 11.
Discussion The average optimal model size K * and prediction error ERRSSM:PH for the three groups are shown in Table 4. Here, we notice the variation in model-sizes for the DC group is larger than the controls while that for DL group is almost of the same order. This points to a greater heterogeneity in the DC data necessitating models with different sizes.
373
Also, the consistently higher error-rate of the DC population indicates the relative inaccuracy of the models for their mental processes, as compared to the other two groups. These observations concur with the theory (Dehaene et al., 2003) that not only are dyscalculics different from each other in their arithmetical strategies, their lack of an intuitive notion of numerical size maybe compensated for by shifting mental strategies resulting in the poor fit of a single model for a subject. The effect of Ph, LogPs and LogDiff on the error ERRSSM:PH is shown in Fig. 12. Note that LogPs and LogDiff were not used to train the model, and therefore the influence of these parameters on the mental patterns of the subjects was effectively discovered by the method. To measure the similarity between the SSMs Mi and Mj of two subjects i and j, the mutual information (MI) between the optimal state-sequences estimated by the two SSMs was used. Specifically, the MI was derived from the joint histogram of X (i) and X (j), the optimal state-sequences for the same fMRI data Y as labeled by models Mi and Mj respectively. In general the mapping between the state labels of two different SSMs is unknown and by comparing the state-sequences for the same data, this correspondence can be determined. A higher MI indicates a higher level of correspondence, while an MI of zero indicates no agreement. This procedure applied to all ( 422) pairs of subjects yielded a similarity matrix of pair-wise MI that was then visualized with multidimensional scaling (MDS) (Edelman et al., 1999) in Fig. 13. The specification of the SSM in terms of abstract mental-states allows comparing the spatiotemporal patterns between subjects in their entirety in this abstract representation (Kriegeskorte et al., 2008). Fig. 13(a) shows a clustering of subjects in the MDS space with respect to their group (control, DL or DC) along the vertical axis, while along the horizontal axis we see a slight, but not significant, organization dictated by gender. Since this labeling is applied after plotting all the subjects in the MDS space, an intrinsic organization in the spatiotemporal patterns of the subjects in each group has been identified. Interestingly, there are a few DC subjects that cluster along with the DL group, at the top of Fig. 13(a). This is not surprising, given that oftentimes dyscalculia is comorbid with dyslexia (Molko et al., 2003) and these DC subjects may exhibit dyslexic deficits during this task.
Fig. 10. Estimated HRF FIR filter h. Fig.(a): The locations of the ROIs (in the left hemisphere). Fig.(b–d): The estimated FIR filter coefficients (± 1 std.dev.) for each group averaged in the ROI in the left motor cortex and the left and right IPS. Legend. Blue solid line: control group, Red dashed line: DC group, Green dot-dashed line: DL group.
374
F. Janoos et al. / NeuroImage 57 (2011) 362–377 Table 4 Overall results. The mean optimal model size K ⁎ and prediction error ERRSSM:PH (± 1 SEM) for the control, dyscalculic (DC) and dyslexic (DL) subjects. The chance-level error-rate for the data-set is ≈ 0.83, computed by permuting the stimuli with respect to the scans.
K⁎ ERR
Control
DC
DL
22.57±2.19 0.31±0.05
26.23±3.95 0.40±0.09
23.14±2.25 0.33±0.05
the effect for the DC and DL groups is less pronounced (N1 SEM). Also, there is a clear separation between the DL and control groups in the MDS space and product-size increases the separation between the DC and control groups. For the control subjects high values are seen in the bilateral occipital extra-striate cortices, the left postcentral area, the left angular gyrus (lAG), the medial frontal gyri (MFG), and the left intra-parietal sulcus (IPS). The DC subjects show lower activation in the bilateral IPS, while the DL subjects show increased activation in their left fronto-parietal and left medial frontal gyral (lMFG) regions as compared to controls. These results may be due to the greater difficulty and conflict experienced by the DL subjects and multiplicity of mental strategies adopted during the reading phase of the task. The higher error of the DC subjects may be due to irregular patterns in accessing the verbally encoded rote multiplication tables located in the lAG. The reduction in error-rates of all subjects with increase in product-size may be due to increased organization of their mental processes as their multiplication memory is stressed, while the increased separation between the groups could indicate greater divergence of the mental patterns of the DC individuals from the controls. Judgement phase. ERRSSM:PH for the DC group increases drastically, while that for the DL and control groups match up. The DC subjects may experience difficulty in judging the difference between the size of the correct and incorrect results and may resort to a greater variety of mental strategies. Not surprisingly, as the reading phase of the experiment has ended, the patterns of the DL individuals begin to resemble that of the controls and the separation between these groups reduces in MDS space, while the separation of the DC group increases. The control and DL subjects exhibit high values in the left and right IPS, both pallida, caudate heads (CdH), left anterior insula (aIn), lMFG, the supplementary motor area (SMA) and the left fronto-parietal operculum, while the map for the DC group activates in both aIn, both MFG, left IPS, the anterior rostral cingulate zone (aRCZ), and the right supramarginal gyrus (SMG). Although LogDiff reduces the error-rate of the control and DL subjects, it has the opposite effect on the DC group as increased conflict may recruit new functional circuits. The effect of LogPs is consistent with strong activation of the working verbal (mute rehearsal) and visual memories. Fig. 11. Spatial maps for mental arithmetic. The group-wise t-score maps on an inflated brain-surface are shown with columns for the left lateral-posterior, left medialposterior, right medial-posterior and right lateral-posterior views. Values t b 3 have masked out for clarity and the color-map shows values ranging from t = 3 to t = 14. Each row shows the activation maps corresponding to three phases within a single trial of the task.
The separation between the MDS clusters for each group can be quantified using Cramér test (Baringhaus and Franz, 2004) which provides a non-parametric measure of the p-value of the distance between the means of two samples through a permutation method. The p-values of the group-wise differences are compiled in Table 5. From the results in Figs. 11–13 and Table 5 the following observations can be made. Multiplication phase. The error rate for the DL group is much higher than that for the controls (cf. Fig. 12). An increase in product-size causes a large (N1.5 SEM) reduction in ERRSSM:PH for controls, while
Fig. 12. ERRSSM:PH with respect to Ph, LogPs and LogDiff. For each group, the overall error-rate (first bar) followed by the effect for LogPs (second bar) and LogDiff (third bar) are displayed with respect to trial phase Ph. The effects are calculated as the difference in ERRSSM:PH at the high minus that at the low level of the quantized LogPs and LogDiff. Error bars indicate ± 1 SEM.
F. Janoos et al. / NeuroImage 57 (2011) 362–377
375
(a) Overall Control Male Control Female Dyslexic Male Dyslexic Female Dyscalculic Male Dyscalculic Female
(b) Phase 1
(c) Phase 1: Product Size Effect
(d) Phase 2
(e) Phase 2: Problem Difficulty Effect
Fig. 13. MDS plots of MI between all pairs of subjects. Fig.(a) shows the MDS plots of the subjects based on their overall MI, while Figs.(b) and (d) show the relative arrangement of the subjects based on their MI during the first two phases of the trial. The effect of product-size in Ph = 1 and problem-difficulty in Ph = 2 on the MI is plotted in Figs.(c) and (e).
Third phase. This phase involves decision-making and conflictresolution and is highly variable between repetitions and subjects, causing increased inaccuracy during this phase. Also, due to the selfpaced nature of the task, it very often contained the button-press and inter-trial rest interval. The spatial-maps for the three groups show increased foci in the pre-frontal and motor areas. The left IPS region in the DC group is also strongly activated during this phase, which may point to irregular storage and retrieval of the number size using spatial attributes typically processed in this region (Morocz et al., 2003). Conclusion This paper has described a time-resolved analysis of the spatiotemporal patterns contained in fMRI data that correspond to the metabolic traces of neural processes using a state-space formalism. The model incorporated information about the experiment to guide the estimation towards patterns of relevance to the task. The data were represented in a low-dimensional feature-space derived from a physiologically motivated definition of functional distance. Efficient estimation algorithms using a variational formulation of generalized EM under the mean field approximation were developed and quantified with a simulation study. The HRF of the brain is known to be highly variable (Logothetis, 2008) and by using a spatially varying but unknown FIR filter, the state-space model (SSM) was able to compensate for this variability. Model hyper-parameters were selected in an automated fashion using a maximally predictive criterion.
The hidden layers in the SSM decouple the stimulus from the data, and therefore neither does the stimulus need to be convolved with an HRF nor does the exact mathematical relationship between the stimulus and the fMRI signal need to be specified. This allows flexibility in choosing which experimental variables to include and their encoding, without having to worry about statistical issues like the orthogonality of the experiment, the estimability of the design matrix and omitted variable bias. But classical issues like confounding variables will still affect inference and must be addressed through appropriate experimental designs. As demonstrated by the mental arithmetic study, this method can be used with arbitrarily complex paradigms, where the investigator can decide which stimuli to provide as input thereby choosing a trade-off between data driven and model (i.e. stimulus) driven estimation of parameters. The effects of other un-modeled experimental variables on the model can then be tested, post hoc. This is in contrast to supervised methods that cannot, by design, capture the effects of experimental variable against which they have not been modeled. However, with simple block design paradigms where the effect of hemodynamics and the temporal structure within a block are insignificant, we observed that MVPR classifiers tended to outperform the SSM in predicting the mental state. Also, its application to default-state and non-task related fMRI studies would require an alternative model-size selection procedure that does use prediction error as a criterion. The SSM parameters are estimated through a fitting criterion and consequently have a well-defined interpretation implied by the underlying neurophysiological model. Here prediction error is used as a
376
F. Janoos et al. / NeuroImage 57 (2011) 362–377
Table 5 The separation between the three groups in the MDS plots assessed using Cramér non-parametric test. Tabulated are the p-values for the overall distance between the means of the groups in the MDS plot (cf. Fig. 13(a)) along with the Ph-wise changes of the p-values. Each column also includes the effects of LogPs and LogDiff on the p-value in brackets. The bold typeface was intended to visually highlight p-values greater than 80% – showing a significant separation between the groups.
Ctrl vs. DC Ctrl vs. DL DC vs. DL
Overall
Ph 1
Ph 2
Ph 3
0.78 (+0.04,+0.01) 0.74 (− 0.02,+0.02) 0.77 (+0.03,−0.01)
0.80 (+0.10,−0.02) 0.86 (− 0.01,−0.00) 0.79 (+0.07,+0.03)
0.84 (+ 0.02,+0.06) 0.73 (+ 0.01,+0.01) 0.85 (+ 0.01,+0.08)
0.71 (+ 0.02,−0.03) 0.65 (− 0.00,+0.01) 0.72 (− 0.01,+0.02)
statistic to select between models and to infer an effect of the experimental variables on the data, which implicitly involves selecting between alternative hypotheses (Friston et al., 2008). For example, the ability to predict mental states at “much better than chance” levels adduces evidence against the null-hypothesis that the SSM does not explain the data. A similar argument applies for the predictability of experimental variables that were not included during the training of the SSM. The SSM, however, due the lack of a parametric form of the null distribution of the prediction error and the prohibitively high cost of a non-parametric permutation test, cannot measure the confidence level (i.e. a p-value) in a hypothesis test. Comparing brain-function in abstract representation spaces rather than the spatial-maps directly has been shown to be a very powerful principle in psychology and neuroscience (Kriegeskorte et al., 2008). For example, Edelman et al. (1999) discovered natural groupings within a representational space derived using MDS on the activation patterns under different task conditions and subjects. Here, the abstract state-space representation was used to compare the spatiotemporal signatures of mental processes in their entirety. Systematic differences in the cascades of recruitment of the functional modules between subject populations were shown indicating the necessity of retaining the temporal dimension. The MDS plots derived from the MI between subject pairs enabled a succinct assessment of the relationships between different groups with respect to experimental parameters. This ability to reveal and study the group-wise structure in the spatio-temporal patterns could guide in the design of more specific experiments to test interesting effects. Therefore, given its advantages and disadvantages with respect to other analysis methods, we believe that it is a complementary tool in an investigator's arsenal providing a new and different insight into mental processes. Acknowledgments We would like to thank Prof. Dana Brook, Prof. William Wells (III) and the anonymous reviewers for their suggestions and constructive comments, and Dr. Ruth Shalev for identifying and scanning subjects with dyscalculia and dyslexia. This work was partially funded by NSF grant CISE:DC:SMALL0916196 and was supported in part by an allocation of computing time from the Ohio Supercomputer Center. Appendix A. Supplementary data Supplementary data to this article can be found online at doi:10.1016/j.neuroimage.2011.03.047. References Baringhaus, L., Franz, C., 2004. On a new multivariate two-sample test. J. Multivar. Anal. 88, 190–206. Baumgartner, R., Ryner, L., Richter, W., Summers, R., Jarmasz, M., Somorjai, R., 2000. Comparison of two exploratory data analysis methods for fMRI: fuzzy clustering vs. principal component analysis. Magn. Reson. Imaging 18 (1), 89–94. Bellec, P., Rosa-Neto, P., Lyttelton, O.C., Benali, H., Evans, A.C., 2010. Multi-level bootstrap analysis of stable clusters in resting-state fMRI. Neuroimage 51 (3), 1126–1139. Bishop, C.M., 2007. Pattern Recognition and Machine Learning, 1st ed. Springer. 2006. corr. 2nd printing edition.
Blaschko, M., Shelton, J., Bartels, A., 2009. Augmenting feature-driven fMRI analyses: semi-supervised learning and resting state activity. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.), Adv Neural Info Proc Sys (NIPS), 22, pp. 126–134. Buckner, R.L., Andrews-Hanna, J.R., Schacter, D.L., 2008. The brain's default network: anatomy, function, and relevance to disease. Ann. N.Y. Acad. Sci. 1124, 1–38. Calhoun, V.D., Adali, T., 2006. Unmixing fMRI with independent component analysis. IEEE Eng. Med. Biol. Mag. 25 (2), 79–90. Calhoun, V.D., Liu, J., Adali, T., 2009. A review of group ica for fMRI data and ica for joint inference of imaging, genetic, and erp data. Neuroimage 45 (1 Suppl.), S163–S172. Cecchi, G., Rish, I., Thyreau, B., Thirion, B., Plaze, M., Paillere-Martinot, M.-L., Martelli, C., Martinot, J.-L., Poline, J.-B., 2009. Discriminative network models of schizophrenia. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (Eds.), Adv Neural Info Proc Sys (NIPS), 22, pp. 252–260. Chung, F., 1997. Lectures on Spectral Graph Theory. CBMS Reg Conf Series Math. Am Math Soc. Dehaene, S., Piazza, M., Pinel, P., Cohen, L., 2003. Three parietal circuits for number processing. J. Cogn. Neuropsychol. 20, 487–506. Edelman, S., Grill-Spector, K., Kushnir, T., Malach, R., 1999. Towards direct visualization of the internal shape representation space by fMRI. Psychobiology 26, 309–321. Faisan, S., Thoraval, L., Armspach, J.-P., Heitz, F., 2007. Hidden Markov multiple event sequence models: a paradigm for the spatio-temporal analysis of fMRI data. Med Image Anal 11 (1), 1–20. Friman, O., Borga, M., Lundberg, P., Knutsson, H., 2003. Adaptive analysis of fMRI data. Neuroimage 19 (3), 837–845. Friston, K., Chu, C., Mourão-Miranda, J., Hulme, O., Rees, G., Penny, W., Ashburner, J., 2008. Bayesian decoding of brain images. Neuroimage 39 (1), 181–205. Ghebreab, S., Smeulders, A., 2010. Identifying distributed and overlapping clusters of hemodynamic synchrony in fMRI data sets. Pattern Anal. Appl. 1–18. doi:10.1007/ s10044-010-0186-6. Golub, G.H., Van Loan, C.F., 1996. Matrix Computations, 3rd edition. The Johns Hopkins Univ Press. Hari, R., Levänen, S., Raij, T., 2000. Timing of human cortical functions during cognition: role of MEG. Trends Cogn. Sci. 4 (12), 455–462. Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P., 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293 (5539), 2425–2430. Haynes, J.-D., Rees, G., 2006. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci. 7 (7), 523–534. Højen-Sørensen, P., Hansen, L.K., Rasmussen, C.E., 2000. Bayesian modelling of fMRI time series. In: S. S., et al. (Ed.), Adv Neural Info Proc Sys (NIPS), pp. 754–760. Hutchinson, R.A., Niculescu, R.S., Keller, T.A., Rustandi, I., Mitchell, T.M., 2009. Modeling fMRI data generated by overlapping cognitive processes with unknown onsets using Hidden Process Models. Neuroimage 46 (1), 87–104. Jain, A.K., 2010. Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31 (8), 651–666.Award winning papers from the 19th International Conference on Pattern Recognition (ICPR), 19th International Conference in Pattern Recognition (ICPR) Janoos, F., Machiraju, R., Sammet, S., Knopp, M.V., Mórocz, I., 2010a. Unsupervised learning of brain states from fMRI data. th Int Conf Med Image Comp & Comp Assist Intervent (MICCAI), volume 6362 of LNCS, pp. 201–208. Janoos, F., Machiraju, R., Singh, S., Mórocz, I.A., 2010b. Spatio-temporal representations and decoding cognitive processes from fMRI. Technical Report OSU-CISRC-9/10TR19. Ohio State Univ. Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Am. Stat. Assoc. 90 (430), 773–795. Kriegeskorte, N., Mur, M., Bandettini, P., 2008. Representational similarity analysis — connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 4. Krishnapuram, B., Carin, L., Figueiredo, M.A.T., Hartemink, A.J., 2005. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intel. 27 (6), 957–968. Lehmann, D., Strik, W.K., Henggeler, B., Koenig, T., Koukkou, M., 1998. Brain electric microstates and momentary conscious mind states as building blocks of spontaneous thinking: I. Visual imagery and abstract thoughts. Int. J. Psychophysiol. 29 (1), 1–11. Li, K., Guo, L., Nie, J., Li, G., Liu, T., 2009. Review of methods for functional brain connectivity detection using fMRI. J. Comp. Med. Imag. Graph. 33 (2), 131–139. Lin, Q.H., Liu, J., Zheng, Y.R., Liang, H., Calhoun, V.D., 2010. Semiblind spatial ICA of fMRI using spatial constraints. Hum. Brain Mapp. 31 (7), 1076–1088. Logothetis, N.K., 2008. What we can do and what we cannot do with fMRI. Nature 453 (7197), 869–878. McGrory, C.A., Titterington, D.M., 2009. Variational bayesian analyses for hidden markov models. Aust. NZ J. Stat. 51 (2), 227–244. McIntosh, A.R., Lobaugh, N.J., 2004. Partial least squares analysis of neuroimaging data: applications and advances. Neuroimage 23 (Suppl. 1), S250–S263.
F. Janoos et al. / NeuroImage 57 (2011) 362–377 Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K.-M., Malave, V.L., Mason, R.A., Just, M.A., 2008. Predicting human brain activity associated with the meanings of nouns. Science 320 (5880), 1191–1195. Molko, N., Cachia, A., Riviere, D., Mangin, J.F., Bruandet, M., LeBihan, D., Cohen, L., Dehaene, S., 2003. Functional and structural alterations of the intraparietal sulcus in a developmental dyscalculia of genetic origin. Neuron 40 (4), 847–858. Morocz, I., Gross-Tsur, A., von Aster, M., Manor, O., Breznitz, Z., Karni, A., Shalev, R., 2003. Functional magnetic resonance imaging in dyscalculia: preliminary observations. Ann. Neurol. 54 (S7), S145. Mourão-Miranda, J., Friston, K.J., Brammer, M., 2007. Dynamic discrimination analysis: a spatial-temporal SVM. Neuroimage 36 (1), 88–99. Multiple, 2007. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Acad Press. O'Toole, A.J., Jiang, F., Abdi, H., Pnard, N., Dunlop, J.P., Parent, M.A., 2007. Theoretical, statistical, and practical perspectives on pattern-based classification approaches to the analysis of functional neuroimaging data. J. Cogn. Neurosci. 19 (11), 1735–1752.
377
Price, C.J., Mechelli, A., 2005. Reading and reading disturbance. Curr. Opin. Neurobiol. 15 (2), 231–238. Quirós, A., Diez, R.M., Wilson, S.P., 2010. Bayesian spatiotemporal model of fMRI data using transfer functions. Neuroimage 52 (3), 995–1004. Scott, S.L., 2002. Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Stat. Assoc. 97 (457), 337–351. Shalev, R.S., 2004. Developmental dyscalculia. Child Neurol. 19 (10), 765–771. Shirdhonkar, S., Jacobs, D., 2008. Approximate earth mover's distance in linear time. Comp Vis Pat Recog., IEEE Conf, pp. 1–8. Stephan, K., Friston, K., 2010. Analyzing effective connectivity with functional magnetic resonance imaging. WIREs Cogn. Sci. 1, 446–459. Thirion, B., Flandin, G., Pinel, P., Roche, A., Ciuciu, P., Poline, J.-B., 2006. Dealing with the shortcomings of spatial normalization: multi-subject parcellation of fMRI datasets. Hum. Brain Mapp. 27 (8), 678–693. Zhang, L., Samaras, D., Alia-Klein, N., Volkow, N., Goldstein, R., 2006. Modeling neuronal interactivity using Dynamic Bayesian Networks. In: Weiss, Y., Schölkopf, B., Platt, J. (Eds.), Adv Neural Info Proc Sys (NIPS), 18. MIT Press, Cambridge, MA, pp. 1593–1600.