Combining sparseness and smoothness improves classification accuracy and interpretability


NeuroImage 60 (2012) 1550–1561


Matthew de Brecht a,⁎, Noriko Yamagishi a,b,c

a National Institute of Information and Communications Technology, 2-2-2 Hikaridai, Keihanna Science City, Kyoto 619-0288, Japan
b Cognitive Mechanisms Laboratories, Advanced Telecommunications Research Institute International, 2-2-2 Hikaridai, Keihanna Science City, Kyoto 619-0288, Japan
c PRESTO, Japan Science and Technology Agency (JST), 4-1-8 Honcho Kawaguchi, Saitama 332-0012, Japan

Article info

Article history: Received 4 October 2011; Revised 13 December 2011; Accepted 24 December 2011; Available online 10 January 2012

Keywords: Logistic regression; Classification; Decoding; Sparse; Smooth; MEG

Abstract

Sparse logistic regression (SLR) has been shown to be a useful method for decoding high-dimensional fMRI and MEG data by automatically selecting relevant feature dimensions. However, when applied to signals with high spatio-temporal correlations, SLR often over-prunes the feature space, which can result in overfitting and weight vectors that are difficult to interpret. To overcome this problem, we investigate a modification of ℓ1-normed sparse logistic regression, called smooth sparse logistic regression (SSLR), which has a spatio-temporal "smoothing" prior that encourages weights that are close in time and space to have similar values. This causes the classifier to select spatio-temporally continuous groups of features, whereas SLR classifiers often select a scattered collection of independent features. We applied the method to both simulation data and real MEG data. We found that SSLR consistently increases classification accuracy, and produces weight vectors that are more meaningful from a neuroscientific perspective.

© 2012 Elsevier Inc. All rights reserved.

Introduction

A major challenge with classifying recordings of brain activity is that the number of features is usually very large. This can cause difficulties in learning high-performance classifiers, and also makes it more difficult to interpret the model generated by the classifiers. If the number of features is much larger than the size of the training data set, the probability of spurious correlations between features and class labels increases. Typical classifiers, like basic logistic regression, tend to fit to these spurious correlations, a problem known as overfitting, and as a result will not generalize well to new examples. In many applications, it is also desirable for the classifier to produce an easily interpretable model. For example, we often want to know which brain regions and patterns of activity are most important for making accurate predictions, so that we can gain new insight into how the brain functions. In addition, artifacts due to eye or body movements correlated with a task are very common in recordings of brain activity, so it is important to be able to verify that classifier weights are truly related to brain activity and not just exploiting these artifacts. This kind of interpretability is very difficult to achieve, even for linear classifiers, if the model depends on all of the features of a very large feature space.

⁎ Corresponding author. E-mail addresses: [email protected] (M. de Brecht), [email protected] (N. Yamagishi).
doi:10.1016/j.neuroimage.2011.12.085

Sparse regression is one method that helps alleviate these problems. The ℓ1-normed regression of Tibshirani (1996) tends to produce sparse models that make predictions based on only a small number of relevant features. Features are less likely to be deemed relevant based on weak correlations in the data, which helps prevent the overfitting problem, and the model only depends on a small number of automatically selected features, which makes interpretation of the model easier. Sparse logistic regression (SLR) is a method for binary classification which extends the approach of sparse regression to logistic regression. SLR can be realized by using the ℓ1-norm approach of Tibshirani (1996) or by using automatic relevance determination (ARD) priors MacKay (1992) as proposed by Yamashita et al. (2008). SLR shares the basic properties of sparse regression, such as producing sparse models that avoid overfitting and are easier to interpret. Enforcing sparseness can greatly improve on traditional regression methods, and has been used in various forms for decoding brain activity (Miyawaki et al., 2008; Carroll et al., 2009; van Gerven et al., 2009; Ryali et al., 2010; Toda et al., 2011). However, sparse regression has some problems of its own. For one, it is designed to put all of its eggs in as few baskets as possible. If the experimental paradigm induces a robust signal, then sparse regression can often perform classification with very high accuracy based on a very small number of features. Unfortunately, a small number of selected features can quickly become irrelevant due to small changes in the experimental environment, say if the subject's head rotates slightly in the MEG scanner. In addition, sparse regression often rejects genuine correlations of features that failed to become relevant due to an insufficient number


of trials. This presents a dilemma when trying to interpret a selected set of features that appear to be independently scattered across a feature space that possesses a naturally continuous structure, for example in the spatial, temporal or frequency domains. Under such circumstances it becomes unclear as to what degree the relevant features are truly discrete and independent, and to what degree the classifier is either overpruning or selecting features due to artifacts. The elastic-net, proposed by Zou and Hastie (2005), combines ℓ1 and ℓ2-normed regression to help overcome some of the difficulties that sparse regression has with correlated data. The linear regression form of the elastic-net was applied to decoding fMRI data by Carroll et al. (2009), and the logistic regression form of the elastic-net was applied to fMRI decoding by Ryali et al. (2010). Both groups showed that the elastic-net improves decoding accuracy compared to the original version of sparse regression. However, the elastic-net still treats features independently, and does not take advantage of any prior-knowledge the experimenter may have about the correlational structure of the data. Thus, the elastic-net still rejects genuine correlations between features that have been partially hidden by noise, which can result in overpruning and weight vectors containing independently scattered features. In order to improve both the accuracy and interpretability of classifiers, in this paper we investigate a generalization of sparse logistic regression which combines a Laplacian prior on the weight vector to encourage sparseness with a multi-variate Gaussian prior to encode the correlational structure of the weights. This allows the experimenter to encode prior-knowledge into the classifier, which can help increase performance and produce more intuitive models. The elastic net form of logistic regression is a special case of this method, in which weights are assumed to all have equal variance and zero correlation. Although any multi-variate Gaussian prior on the weights can be applied, we will focus on the applications of smooth sparse logistic regression (SSLR) which uses a prior to encourage weights that are close in time and space to have similar values. This has the effect that weights “close together” in feature space are selected together, which can produce more understandable models and also increase decoding accuracy. This preference for smoothness is particularly relevant for EEG and MEG recordings, where spatio-temporally isolated weights have a higher chance of being due to noise or artifacts. The effect of a smoothing prior on the weights of a classifier has been previously investigated by Dyrholm and Parra (2006) and applied to EEG decoding, although this study did not apply sparseness. Recently, the combination of smoothness and sparseness has been investigated by Hebiri and van de Geer (2011) and applied to fMRI decoding by van Gerven and Heskes (2010), Ng and Abugharbieh (2011), and Grosenick et al. (2011). These studies have focused on very local smoothness constraints, by adding a penalty for immediately neighboring feature weights with different values. In this paper, we use globally defined smoothing constraints, which allow more distant feature weights to interact, and investigate how classification performance varies with differing degrees of smoothness. 
Long reaching interactions are particularly important for EEG and MEG decoding in order to obtain a more stable smoothing effect across the spatial and temporal dimensions. This approach is made practical, even for large data sets, using a dual optimization method that we derive. To better understand how sparseness and smoothness constraints interact, we compare SSLR with optimized smoothing and sparsifying parameters with special cases of SSLR with either no sparseness or no smoothing. Non-sparse SSLR, meaning the ℓ1-norm penalty is set to zero, corresponds to the smooth logistic regression method proposed by Dyrholm and Parra (2006). Non-smooth SSLR, meaning the prior assumes zero correlations between weights, corresponds to the logistic regression form of the elastic-net Zou and Hastie (2005). We will also show, using simulated data, that encoding “smoothness” as prior knowledge in the classifier is often more robust than forcing

smoothness on the data during a preprocessing step. Finally, we will demonstrate the advantages of SSLR over non-smooth SSLR using real MEG data.

Material and methods

Generalized sparse logistic regression

In this subsection we first investigate a generalization of sparse logistic regression which combines a Laplacian prior and a Gaussian prior on the weights. The Laplacian prior encourages sparseness of the weight vector, and the Gaussian prior can be used to encourage additional structure, such as smoothness, on the weight vector.

In the following, light-faced letters denote scalars, bold-faced lower-case letters denote vectors, and bold-faced upper-case letters denote matrices. The transpose of a matrix M is denoted M^T. Assume N observations x_1, x_2, …, x_N are made, where each observation x_i is assumed to be a D-element vector. Associated with each observation x_i is a label y_i ∈ {−1, 1} designating to which class the observation belongs. We will let X = (x_1, x_2, …, x_N)^T be the N × D matrix formed by taking the i-th row to be the i-th observation x_i, and we let y = (y_1, y_2, …, y_N)^T be the corresponding vector of labels.

Logistic regression assumes that the probability that x_i has label y_i = 1 is given by P(y_i = 1 | x_i, w) = 1/(1 + exp(−w^T x_i)), where w is a D-element weight vector. Assuming the observations are statistically independent, we have

P(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)}.

It follows from Bayes' theorem that

P(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = P(\mathbf{y} \mid \mathbf{w}, \mathbf{X})\, P(\mathbf{w} \mid \mathbf{X}) / P(\mathbf{y} \mid \mathbf{X}) \propto P(\mathbf{y} \mid \mathbf{w}, \mathbf{X})\, P(\mathbf{w}).

We can then find the most probable w given y and X by solving:

\arg\max_{\mathbf{w}} \; P(\mathbf{y} \mid \mathbf{w}, \mathbf{X})\, P(\mathbf{w}).

Several choices of the prior P(w) on the weight vector have been investigated in the literature. One popular choice is the Laplacian prior:

P_{\mathrm{Lap}}(\mathbf{w}) \propto \exp(-\lambda \|\mathbf{w}\|_1),

where λ is a scalar parameter and ||w||_1 is the ℓ1-norm of w defined as

\|\mathbf{w}\|_1 = \sum_{i=1}^{D} |w_i|.

The Laplacian prior has the advantage that it favors sparse weight vectors w, meaning that only a small number of the elements of w are non-zero (Tibshirani, 1996). Another option is to assume a Gaussian prior for w:

P_{\mathrm{Gauss}}(\mathbf{w}) \propto \exp\!\left(-\tfrac{1}{2} \mathbf{w}^T Q^{-1} \mathbf{w}\right),

where Q is a positive-definite covariance matrix. This prior has the advantage that we can model correlations between weights. The special case of regression where Q is a diagonal matrix is a form of ridge regression (Hoerl and Kennard, 1970).


In this paper we use a prior which combines the Laplacian and Gaussian priors:

P(\mathbf{w}) \propto \exp\!\left(-\tfrac{1}{2} \mathbf{w}^T Q^{-1} \mathbf{w}\right) \exp(-\lambda \|\mathbf{w}\|_1).

Similar methods combining a quadratic penalty with an ℓ1-norm penalty have recently been investigated by van Gerven and Heskes (2010), Hebiri and van de Geer (2011), Ng and Abugharbieh (2011), and Grosenick et al. (2011). We will call sparse logistic regression with the above prior generalized sparse logistic regression (GSLR). The log-likelihood of w given X and y used by GSLR is:

\ell(\mathbf{w}; \mathbf{X}, \mathbf{y}) := -\sum_{i=1}^{N} \log\!\left(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i)\right) - \tfrac{1}{2} \mathbf{w}^T Q^{-1} \mathbf{w} - \lambda \|\mathbf{w}\|_1,

where the positive-definite matrix Q and scalar λ are parameters. GSLR combines the benefits of sparseness with the ability to encode prior knowledge of the structure of the weight vector using the covariance matrix Q. Smooth sparse logistic regression Although there are many choices for the covariance matrix Q in GSLR, in this paper we will focus on covariance matrices which encourage smoothness in the weights. This is accomplished by defining Q so that the strength of the correlation between weights is directly proportional to some measure of the distance between the weights in feature space. We will refer to this subfamily of GSLR which encourages smooth weight vectors as smooth sparse logistic regression (SSLR). For example, in the case of MEG or EEG data, weights corresponding to measurements that are close in time and space will be assigned higher correlation, thereby placing a preference on weight vectors that are both temporally and spatially smooth. As we will see in this paper, this combination of smoothness and sparseness helps to prevent overfitting of the training data, which increases decoding accuracy. SSLR can also help improve the interpretability of the results, because features tend to be chosen in locally continuous groups. One variation of SSLR is to require λ=0, which we will call non-sparse SSLR. Non-sparse SSLR produces smooth weight vectors, but since there is no term to encourage sparseness, in general all of the weights will be nonzero. Non-sparse SSLR and its applications to EEG classification have been investigated by Dyrholm and Parra (2006). Another variation of SSLR is to require Q = I, the identity matrix. We will call this version of SSLR non-smooth SSLR. Non-smooth SSLR produces sparse weight vectors, but does not encourage any form of smoothness across the weights. Non-smooth SSLR is closely related to the elastic net proposed by Zou and Hastie (2005), which is a combination of ℓ1 and ℓ2-normed regressions. The elastic net was shown to be advantageous over sparse regression without the ℓ2-norm when there are high correlations between weights. Non-smooth SSLR could be extended to allow scalar multiples of the identity matrix, in which case we would recover a logistic regression form of Zou and Hastie's elastic net. Applications of this form of the elastic net to fMRI decoding was investigated by Ryali et al. (2010), and shown to be advantageous over sparse logistic regression without the ℓ2-norm penalty. For the purposes of this paper, however, our main interest is in the effects of giving prior information of the correlational structure of the weights to the decoder, so to simplify analysis we mainly focus on non-smooth SSLR with Q = I. The effects of scaling Q are briefly investigated in the supplementary text (“SupplementaryText.pdf”) for this paper.
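To make the optimization target concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of the GSLR log-likelihood defined above. For clarity it takes the inverse of Q as an argument; the dual formulation described in the next subsection avoids forming Q^{-1} altogether.

```python
import numpy as np

def gslr_log_likelihood(w, X, y, Q_inv, lam):
    """GSLR log-likelihood l(w; X, y) for labels y_i in {-1, +1}.

    w     : (D,) weight vector
    X     : (N, D) data matrix, one observation per row
    y     : (N,) label vector
    Q_inv : (D, D) inverse of the positive-definite covariance prior Q
    lam   : scalar sparseness parameter (lambda)
    """
    margins = y * (X @ w)                          # y_i * w^T x_i
    logistic = np.logaddexp(0.0, -margins).sum()   # sum_i log(1 + exp(-y_i w^T x_i))
    gaussian = 0.5 * w @ Q_inv @ w                 # (1/2) w^T Q^{-1} w
    sparse = lam * np.abs(w).sum()                 # lambda * ||w||_1
    return -logistic - gaussian - sparse           # value to be maximized over w
```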

Dual optimization problem for GSLR

If the feature dimension D is relatively small then the weight vector w maximizing the GSLR log-likelihood function ℓ(w; X, y) can be closely approximated using Newton methods, if some care is taken with the fact that ℓ(w; X, y) is not differentiable at w = 0. However, for applications to neuroimaging data it is common for D to be very large, and inverting the Hessian of ℓ(w; X, y) on each Newton step will be very expensive computationally. Furthermore, in many cases it is more natural to define Q than its inverse Q^{−1}, so if Q is large it would be preferable to work with Q alone. In addition, in order to produce truly sparse results it would be necessary to define thresholds so that weights below threshold are set exactly to zero. However, such thresholds depend heavily on Q and are non-trivial to compute.

Fortunately, the above problems are largely alleviated by optimizing a dual form of ℓ(w; X, y), rather than a direct optimization. Tomioka and Sugiyama (2009) and Koh et al. (2007) have also investigated dual optimization problems for sparse regression. For notational convenience we define x̄_i = y_i x_i and X̄ = (x̄_1, x̄_2, …, x̄_N)^T. We can maximize ℓ(w; X, y) by instead solving the following constrained dual optimization problem (see Appendix A for details):

\text{Minimize:} \quad g(\boldsymbol{\mu}, \boldsymbol{\eta}) = \sum_{i=1}^{N} \bigl(\mu_i \log(\mu_i) + (1 - \mu_i)\log(1 - \mu_i)\bigr) + \tfrac{1}{2}\bigl(\bar{\mathbf{X}}^T \boldsymbol{\mu} - \boldsymbol{\eta}\bigr)^T Q \bigl(\bar{\mathbf{X}}^T \boldsymbol{\mu} - \boldsymbol{\eta}\bigr)

\text{Subject to:} \quad |\eta_i| \le \lambda \quad \text{for } 1 \le i \le D. \qquad (1)

In the above equation, μ is an N-dimensional vector and η is a D-dimensional vector. If μ⁎, η⁎ is a solution to problem (1), then w⁎ = Q(X̄^T μ⁎ − η⁎) maximizes ℓ(w; X, y). Two nice features of the dual problem are that it is unnecessary to compute the inverse of Q, and that each weight w_i is guaranteed to be equal to zero whenever |η_i| < λ (see Appendix A). This makes it possible to compute the optimal weight vector w even when Q is very large, and also provides a simple criterion for determining which features are selected by the classifier.

Results

Simulation experiment 1

The purpose of the first simulation experiment was to investigate the effects of the smoothing and sparsifying parameters on smooth sparse logistic regression (SSLR). In particular, we tested SSLR with multiple parameter combinations on classifying noisy signals according to whether or not they contained a particular target signal. The target signal consisted of a 5 Hz sine wave multiplied by a 200 ms Hann window centered in a one second time interval and sampled at 200 Hz (see Fig. 1). Each instance of the target signal was corrupted with additive Gaussian noise with zero mean and standard deviation one. The classifier was trained to differentiate between corrupted target signals (assigned label y = 1) and instances of pure Gaussian noise with zero mean and standard deviation one (assigned label y = −1). The noisy signals were given directly to the classifier as 200-element vectors x_i without any preprocessing.

The covariance matrices used for SSLR were defined as

(Q_\sigma)_{i,j} = \exp\!\left(-\frac{(t_i - t_j)^2}{2\sigma^2}\right), \qquad (2)

where ti ∈ [0, 1] is the time of the i-th sample and σ is a parameter. For the case σ = 0 we defined Q σ to be the identity matrix.
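As a concrete illustration of this setup, the short sketch below (our own, under the stated assumptions) generates the Simulation experiment 1 data, a 5 Hz sine wave under a 200 ms Hann window that is non-zero only in the 0.4–0.6 s window, sampled at 200 Hz with unit-variance Gaussian noise, and builds the smoothing covariance Q_σ of Eq. (2).

```python
import numpy as np

fs = 200                                  # sampling rate (Hz); 200 samples in 1 s
t = np.arange(fs) / fs                    # sample times t_i in [0, 1)

# Target signal: 5 Hz sine multiplied by a 200 ms Hann window centered at 0.5 s
# (non-zero only within the 0.4-0.6 s window described in the text).
target = np.zeros(fs)
win = slice(int(0.4 * fs), int(0.6 * fs))
target[win] = np.sin(2 * np.pi * 5.0 * t[win]) * np.hanning(win.stop - win.start)

def make_trial(contains_target, rng):
    """One 200-element trial: corrupted target (label +1) or pure noise (label -1)."""
    noise = rng.standard_normal(fs)
    return target + noise if contains_target else noise

def smoothing_covariance(times, sigma):
    """Eq. (2): (Q_sigma)_{ij} = exp(-(t_i - t_j)^2 / (2 sigma^2)); identity for sigma = 0."""
    if sigma == 0:
        return np.eye(len(times))
    diff = times[:, None] - times[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

Q = smoothing_covariance(t, sigma=0.05)   # the optimal smoothing reported for this experiment
```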

[Figure panels: Optimal (σ = 0.05, λ = 5); Non-sparse (σ = 0.055, λ = 0); Non-smooth (σ = 0, λ = 5). Horizontal axis: Time (sec). See Figs. 1 and 3.]

Fig. 1. Example of the data used in the simulation experiments. The red line is the target signal. The blue line is a typical example of the target signal corrupted with Gaussian noise. The goal of the classifier was to distinguish between corrupted target signals and pure Gaussian noise.

We varied λ from 0 to 20 in increments of 0.5, and varied σ from 0 to 0.2 s in increments of 0.005 s. The special case λ = 0 corresponds to non-sparse SSLR, and the special case σ = 0 corresponds to non-smooth SSLR. For each combination of parameters, we trained SSLR with 100 corrupted signals and 100 instances of pure Gaussian noise. The learned weights were then tested on a new set of data of the same size as the training set. This process was repeated 100 times to determine the average classification performance for each combination of parameters.

The average classification accuracy for each parameter combination is shown in Fig. 2. The optimal parameters for SSLR were σ = 0.05 and λ = 5 (mean accuracy 88.7%, std. dev. 2.20%; marked by a black star in Fig. 2). An example of a weight vector produced with these parameters is shown in the top panel of Fig. 3. The similarity with the original target signal in Fig. 1 is readily apparent.
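The parameter sweep just described can be organized as a plain grid search. In the sketch below, `fit_sslr` is a hypothetical stand-in for an SSLR solver (for example, one based on the dual problem (1)) and `make_data` is a hypothetical helper that draws a fresh training and testing set; whether the same draw is shared across parameter pairs is an assumption on our part.

```python
import numpy as np

def accuracy(w, X, y):
    """Fraction of trials whose predicted sign of w^T x matches the label."""
    return float(np.mean(np.sign(X @ w) == y))

def grid_search(make_data, fit_sslr, sigmas, lams, n_runs=100, seed=0):
    """Average test accuracy over n_runs for every (sigma, lambda) combination.

    make_data(rng) -> (X_train, y_train, X_test, y_test)   # hypothetical helper
    fit_sslr(X, y, sigma, lam) -> weight vector w           # hypothetical solver
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros((len(sigmas), len(lams)))
    for _ in range(n_runs):
        X_tr, y_tr, X_te, y_te = make_data(rng)
        for i, sigma in enumerate(sigmas):
            for j, lam in enumerate(lams):
                w = fit_sslr(X_tr, y_tr, sigma, lam)
                scores[i, j] += accuracy(w, X_te, y_te)
    return scores / n_runs

# Grid matching the text: sigma from 0 to 0.2 s in steps of 0.005 s,
# lambda from 0 to 20 in steps of 0.5.
sigmas = np.arange(0.0, 0.2 + 1e-9, 0.005)
lams = np.arange(0.0, 20.0 + 1e-9, 0.5)
```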

[Fig. 2 axes: Smoothness − σ (sec) versus Sparseness − λ; color scale: classification accuracy (%).]

Fig. 2. Classification performance of SSLR in Simulation experiment 1 with different parameter combinations. The black star at (σ = 0.05, λ = 5) marks the parameter combination that gave the overall best average performance (optimal; mean 88.7%, std. dev. 2.20%). The black diamond at (σ= 0.055, λ = 0) marks the best average performance when λ = 0 (non-sparse; mean 86.2%, std. dev. 3.03%). The black circle at (σ = 0,λ = 5) marks the best average performance when σ = 0 (non-smooth; mean 81.9%, std. dev. 2.35%).

Fig. 3. Examples of the weight vectors learned by SSLR in Simulation experiment 1. The top panel shows a weight vector learned by SSLR using the optimal parameter pair (marked by a star in Fig. 2). The middle panel shows a weight vector learned with the optimal smoothing parameter without sparseness (marked by a diamond in Fig. 2). The bottom panel shows a weight vector learned with the optimal sparseness parameter without smoothing (marked by a circle in Fig. 2).

Non-sparse SSLR (λ = 0) obtained peak performance with σ = 0.055 (mean 86.2%, std. dev. 3.03%; marked by a black diamond in Fig. 2). An example of a weight vector produced with these parameters is shown in the middle panel of Fig. 3. Although the learned weight vector resembles the original target signal, large fluctuations throughout the weight vector are clearly visible in the figure. Non-smooth SSLR (σ = 0) performed best with λ = 5 (mean 81.9%, std. dev. 2.35%; marked by a black circle in Fig. 2). An example of a weight vector produced with these parameters is shown in the bottom panel of Fig. 3. In this case, the smooth shape of the original target signal is less apparent, and relatively large non-zero weights appear throughout the weight vector as a result of overfitting. Performance began to decrease rapidly when both λ and σ became too large (upper right region of Fig. 2). In these cases, the combination of very strong sparseness and smoothness often pushed all of the weights to zero. To investigate the ability of SSLR to select features, we divided the weight vectors into 5 time periods (t = 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1 s), and computed the average probability of features in each time period to be selected (Table 1). These values were computed by dividing the number of non-zero weights in each time window by the total number of weights in the time window (i.e. 40), and then averaging these ratios over the 100 weight vectors learned during the cross-validation runs. The uncorrupted target signal had non-zero values only within the 0.4–0.6 s time window (see Fig. 1). Non-sparse SSLR does not perform feature selection, so all weights were non-zero. SSLR with the optimal parameter values selected features from the target window Table 1 Average probability of feature being selected.

              t = 0–0.2 s   0.2–0.4 s   0.4–0.6 s   0.6–0.8 s   0.8–1 s
Optimal       0.13          0.62        0.96        0.61        0.11
Non-sparse    1.00          1.00        1.00        1.00        1.00
Non-smooth    0.16          0.16        0.55        0.14        0.16

Feature selection statistics for SSLR in Simulation Experiment 1. Each entry is the probability that each feature within the specified time window (columns) is given non-zero weight, averaged over 100 cross-validation runs. The parameters for each row are Optimal: (σ = 0.05, λ = 5), non-sparse: (σ = 0.055, λ = 0), non-smooth: (σ = 0, λ = 5) (see Fig. 2).


(0.4–0.6 s) with the highest probability, with the probability of feature selection decreasing with more distant time windows. Nonsmooth SSLR did better than SSLR with the optimal parameters at avoiding features in the intermediate windows (0.2–0.4 and 0.6–0.8 s), but tended to over-prune during the target window (0.4–0.6 s), and also had a tendency to select features in the distant time windows (0–0.2 and 0.8–1 s) with higher probability. Although smoothing caused SSLR to assign non-zero weights to the intermediate time windows with higher probability than nonsmooth SSLR, Table 2 shows that the weights in these windows were much smaller on average than for non-smooth SSLR. The values in Table 2 were computed by taking the average of the absolute value of the 40 weights in each time window, and then averaging over the 100 cross-validation runs. Thus smoothing caused SSLR to assign very small non-zero weights with higher probability, and non-smooth SSLR assigned relatively large weights to these windows albeit with a smaller probability. The results of this simulation experiment showed that SSLR can achieve high classification performance, even in the presence of large noise and no preprocessing of the data. In addition, SSLR accurately combined smoothing with feature selection, with most of the selected features (non-zero weights) being concentrated near the non-zero values of the original uncorrupted target signal. Even in cases where smoothing caused SSLR to select incorrect features, it tended to assign these features weights much smaller than those assigned by non-smooth SSLR. This is because smoothing causes neighboring irrelevant features to pull down each other's weights, whereas non-smooth SSLR does not assign any penalty for placing a large weight on a single feature isolated in the middle of a largely irrelevant area. As a result, the weight vectors produced by SSLR closely resembled the uncorrupted target signal, suggesting that SSLR may be useful in extracting accurate representations of the underlying phenomenon in noisy data. Simulation experiment 2 The purpose of the second simulation experiment was to determine whether the effects of the smoothing prior in SSLR could also be obtained by more standard sparse regression algorithms by first smoothing the data during a preprocessing step. In particular, we compared the performance of SSLR with a smoothing prior on unprocessed data, with the performance of non-smooth SSLR on data that was smoothed by a preprocessing step. To formalize this concern, let us first look at non-sparse SSLR, which is a special case of SSLR obtained by setting λ = 0 (see Material and methods). In this case, the log-likelihood is given by:

\ell(\mathbf{w}; \mathbf{X}, \mathbf{y}; Q) := -\sum_{i=1}^{N} \rho\!\left(y_i \mathbf{w}^T \mathbf{x}_i\right) - \tfrac{1}{2} \mathbf{w}^T Q^{-1} \mathbf{w},

where ρ(z) := log(1 + exp(−z)). Since Q is positive-definite, Q can be represented as Q = M M^T with M an invertible matrix. For example, we can take M = V D^{1/2} V^T, where the columns of V are the eigenvectors of Q and D is the corresponding diagonal matrix of eigenvalues. In this

Table 2
Average absolute value of weights.

              t = 0–0.2 s   0.2–0.4 s   0.4–0.6 s   0.6–0.8 s   0.8–1 s
Optimal       0.0001        0.0011      0.1211      0.0010      0.0001
Non-sparse    0.0636        0.0682      0.3111      0.0638      0.0690
Non-smooth    0.0119        0.0119      0.1149      0.0102      0.0109

Weight value statistics for SSLR in Simulation Experiment 1. Each entry is the average absolute value of the weight assigned to each feature within the specified time window (columns), averaged over 100 cross-validation runs. The parameters for each row are as in Table 1.

case M is symmetric and even positive definite. In general, for any invertible M satisfying Q = M M^T, the log-likelihood can be rewritten as:

\ell(\mathbf{w}; \mathbf{X}, \mathbf{y}; Q) = -\sum_{i=1}^{N} \rho\!\left(y_i \mathbf{w}^T M^{-T} M^T \mathbf{x}_i\right) - \tfrac{1}{2} \mathbf{w}^T \left(M M^T\right)^{-1} \mathbf{w}
= -\sum_{i=1}^{N} \rho\!\left(y_i \mathbf{u}^T M^T \mathbf{x}_i\right) - \tfrac{1}{2} \mathbf{u}^T \mathbf{u}
= \ell(\mathbf{u}; \mathbf{X} M, \mathbf{y}; I),

where u = M^{−1} w and I is the identity matrix. Thus, non-sparse SSLR can be viewed as applying ridge regression to an invertible linear transformation of the original data set. Conversely, if R is any invertible matrix, then R R^T is positive definite and we can interpret ridge regression on the transformed data set XR as non-sparse SSLR. We therefore have a natural correspondence between non-sparse SSLR and ridge regression on some invertible linear transformation of the data.

When we apply the same transformation above to SSLR, we obtain the same expression but with the extra term −λ||Mu||_1. Thus, SSLR resembles applying non-smooth SSLR (i.e., SSLR with Q = I, see Material and methods) to a linearly transformed data set XM, but with the difference that with SSLR the sparsification is with respect to the original feature space. This difference is important for two reasons.

The first reason is that in order to interpret a model it is often preferable to view the results within the original feature space. Applying non-smooth SSLR to XM results in a sparse weight vector u in the transformed feature space. We can transform this solution back to the original feature space by taking w = Mu, but in most cases w will no longer be sparse. Thus SSLR combines the benefits of data preprocessing with the capability to automatically perform feature selection in the original feature space.

The second reason is an issue of robustness. If the transformation M is sub-optimal, for example if it makes the data too smooth, then it may become impossible to obtain a sparse solution for the transformed data set XM. Even if the degree of smoothing is carefully chosen, it is easy to imagine scenarios where different regions of a single data set require different degrees of smoothing. In such a case, uniformly applying a smoothing filter will result in the loss of valuable information. With SSLR, a bad choice of Q will certainly decrease performance. However, we might expect that sparseness will to some degree counter-balance the effect of over-smoothing, making the algorithm more robust in general. In the second simulation experiment, we confirm that SSLR is more robust to the effects of an overly smooth Q when compared to non-smooth SSLR on data preprocessed with the corresponding (overly smooth) transformation M. This demonstrates potential advantages of moving some standard preprocessing stages into the learning stage as is done by SSLR.

For a given σ we define M_σ = V_σ D_σ^{1/2} V_σ^T, where V_σ and D_σ are respectively the matrices of eigenvectors and eigenvalues of the matrix Q_σ from Eq. (2) in Subsection Simulation experiment 1. In the first part of Simulation experiment 2, we investigated the effects of a smoothing preprocessing step on the performance of non-smooth SSLR applied to the same simulation data as in Simulation experiment 1. The training and testing procedures are the same as in Simulation experiment 1, with the only difference that non-smooth SSLR (i.e., SSLR with Q held fixed as the identity matrix) was used in each case, and the data matrix X was transformed (preprocessed) as XM_σ before training and testing.

Fig. 4 shows the average classification performance of non-smooth SSLR on the preprocessed data, averaged over 100 training/testing runs for each parameter combination. The black star at (σ = 0.025, λ = 7) marks the parameter combination that gave the overall best average performance (mean 88.4%, std. dev. 2.52%). The top panel of Fig. 5 shows an example of a corrupted target signal

Fig. 4. Classification performance of non-smooth SSLR (i.e., SSLR with Q = I) on the simulated data (see Fig. 1) preprocessed with different degrees of smoothing (i.e., transformed by M_σ). The black star at (σ = 0.025, λ = 7) marks the parameter combination that gave the overall best average performance (mean 88.4%, std. dev. 2.52%).

(blue line; this is the same signal as in Fig. 1) and the same signal transformed by M_σ (red line) for σ = 0.025. The bottom panel shows an example of a weight vector learned by non-smooth SSLR with the optimal parameter combination (σ = 0.025, λ = 7). The blue line is the weight vector in the transformed feature space (u), and the red line is the weight vector converted back to the original feature space (w = M_σ u). In this example, 10 values of u are non-zero, whereas all 200 values of w are non-zero, which was typical for this simulation experiment. In comparison with Fig. 2, we see that larger values of λ have less of an effect on performance when using non-smooth SSLR on the preprocessed data compared to using SSLR on the raw data. This is because, for appropriate values of σ, only a very small number of


features in the transformed feature space are sufficient for classification, whereas classification with respect to the original feature space inevitably requires a larger number of features. On the other hand, performance of non-smooth SSLR on preprocessed data drops faster with increasing σ than does the corresponding SSLR on raw data. This suggests that classification on preprocessed data may be more sensitive to over-smoothing than SSLR.

To compare the robustness of the two approaches of classification to the degree of smoothness, in the second part of Simulation experiment 2 we held parameters for the preprocessing and classifiers fixed, and varied the frequency of the noisy target signal. As in Simulation experiment 1, the target signal was a sine wave multiplied by a 200 ms Hann window centered in a one second time interval and corrupted with Gaussian noise. The frequency of the target signal was varied from 5 Hz to 25 Hz in steps of 1 Hz. For each frequency, we trained and tested SSLR on raw data and non-smooth SSLR on preprocessed data. Both training and testing data sets contained 100 corrupted target signals and 100 instances of pure Gaussian noise.

The average performance of both classification methods as a function of target signal frequency is shown in Fig. 6. For fixed parameters the performance of both classification methods decreases as the target signal frequency increases. However, SSLR suffers less from over-smoothing, and obtains significantly (p < 10^−6) better performance on high frequency target signals as measured by a paired t-test with Bonferroni correction for multiple comparisons across frequencies (see Fig. 6). This holds true when the parameters are optimal for SSLR for 5 Hz target signals (σ = 0.05, λ = 5; top panel of Fig. 6) and when the parameters are optimal for non-smooth SSLR on preprocessed 5 Hz target signals (σ = 0.025, λ = 7; bottom panel of Fig. 6).

The results of Simulation experiment 2 confirm that the effect of the smoothing prior in SSLR cannot simply be substituted by a smoothing preprocessing step on the data. This is because (1) a reverse transformation of the preprocessing step will in general produce results that are not sparse, thus the procedure fails to perform automatic feature selection, and (2) SSLR is significantly more robust to the effects of over-smoothing than non-smooth SSLR on preprocessed data.
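For reference, here is a minimal sketch (ours, under the assumptions stated in this subsection) of the smoothing preprocessing transform: M_σ = V_σ D_σ^{1/2} V_σ^T is built from the eigendecomposition of Q_σ, the data are smoothed as X M_σ, and a sparse solution u found in the transformed space maps back to a generally dense w = M_σ u. The solver named in the usage comment is hypothetical.

```python
import numpy as np

def smoothing_transform(Q):
    """M = V D^{1/2} V^T for a positive-definite Q, so that Q = M M^T."""
    eigvals, eigvecs = np.linalg.eigh(Q)
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Usage sketch:
#   M = smoothing_transform(Q_sigma)             # Q_sigma from Eq. (2)
#   X_smooth = X @ M                             # smoothing as a preprocessing step
#   u = fit_nonsmooth_sslr(X_smooth, y)          # hypothetical sparse solver with Q = I
#   w = M @ u                                    # back in the original space; generally not sparse
```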

Fig. 5. Example of preprocessed data and a weight vector learned by non-smooth SSLR. The top panel shows a corrupted target signal (blue line) and the same signal transformed by M_σ for σ = 0.025 (red line). The bottom panel shows a weight vector learned by non-smooth SSLR (λ = 7) trained on data transformed by M_σ (σ = 0.025). The blue line is the weight vector in the transformed feature space (u), and the red line is the weight vector in the original feature space (w = M_σ u). Although w closely resembles the original target signal, it is not sparse, and in fact none of its values are exactly zero.

Fig. 6. Classification accuracy as a function of target signal frequency. The average classification performance of SSLR on raw data is marked by blue lines, and the average performance of non-smooth SSLR on preprocessed data is marked by green lines. The top panel shows results for parameters optimal for SSLR on 5 Hz target signals (marked by a black star in Fig. 2) and the bottom panel shows results for parameters optimal for non-smooth SSLR on preprocessed 5 Hz target signals (marked by a black star in Fig. 4). SSLR performance was significantly better for all frequencies below 17 Hz (vertical blue dashed line) in the top panel, and for all frequencies above 13 Hz (vertical blue dashed line) in the bottom panel (Bonferroni corrected paired t-test, p < 10^−6).
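The frequency-wise comparison reported in Fig. 6 is a paired t-test with a Bonferroni correction over the tested frequencies. A small SciPy sketch of that test (our illustration, assuming per-run accuracies are stored for each method) is:

```python
import numpy as np
from scipy import stats

def bonferroni_paired_tests(acc_a, acc_b, alpha=0.05):
    """Per-frequency paired t-tests with Bonferroni-corrected p-values.

    acc_a, acc_b : arrays of shape (n_frequencies, n_runs) with matched
                   accuracies of the two classifiers (e.g., SSLR on raw data
                   vs. non-smooth SSLR on preprocessed data).
    Returns the corrected p-values and a boolean mask of significant frequencies.
    """
    n_freq = acc_a.shape[0]
    pvals = np.array([stats.ttest_rel(acc_a[f], acc_b[f]).pvalue for f in range(n_freq)])
    corrected = np.minimum(pvals * n_freq, 1.0)   # Bonferroni correction
    return corrected, corrected < alpha
```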


Simulation experiment 3

In the third simulation experiment, we applied SSLR to a two-dimensional generalization of the simulation data from the previous experiments. The purpose of this experiment was to verify that SSLR performs well in the presence of a large number of features organized into multiple dimensions. As in the previous classification experiments, the classification task was to determine whether or not noisy signals contained a particular target signal.

The two-dimensional target signal, s_2, used in this experiment was defined as

s_2(t, t') = s_1(t) \cdot s_1(t'),

where s_1 is the target signal from the previous simulations, and t, t' were restricted to the window between 0.25 and 0.75 s. Each instance of the target signal was corrupted with additive Gaussian noise with zero mean and standard deviation one. Each signal was therefore a 100 × 100 element matrix, which can be interpreted as a one dimensional vector of 10,000 elements when used by the classifier. The original target signal and an example of the corrupted target signal are shown in Fig. 7.

The covariance matrix used by SSLR took the form

Q_{\langle t_1, t'_1\rangle, \langle t_2, t'_2\rangle} = \exp\!\left(-\frac{(t_1 - t_2)^2}{2\sigma^2}\right) \exp\!\left(-\frac{(t'_1 - t'_2)^2}{2\sigma^2}\right), \qquad (3)

where ⟨t_i, t'_i⟩ is the index of the vectorized signal corresponding to s_2(t_i, t'_i). Due to the symmetric nature of the signal, only a single smoothing parameter σ was used for both dimensions. Non-smooth SSLR corresponds to σ = 0, where Q was defined as the identity matrix. Although the full covariance matrix Q is a rather large 10,000 × 10,000 matrix, the dual optimization algorithm (see Material and methods and Appendix A) allowed us to avoid computing the inverse of Q. Furthermore, the two components of Q corresponding to t and t' are independent and identical, so rather than storing all of Q in

memory we could compute the elements of Q as needed using a precomputed 100× 100 covariance matrix representing a single dimension. Although the two-dimensional signal is very similar to the previous experiments, a smaller value of the sparseness parameter λ was needed to deal with the larger number of features. To find an optimal value of λ, we trained and tested the performance of non-smooth SSLR while λ varied from 0 to 20 in steps of 0.5. For each value of λ, non-smooth SSLR was given 100 examples of the corrupted signal and 100 examples of pure Gaussian noise, and tested on a data set of the same size. This was repeated 10 times to determine the average accuracy for each value of λ. This parameter search showed that nonsmooth SSLR obtains a peak performance of 75.8 ± 4.4% (mean ± std. dev.) with λ = 1.5. On the other hand, since the frequency of the two-dimensional signal was the same as in the one-dimensional case, the optimal value of the smoothing parameter in the one-dimensional case (σ = 0.05) was sufficient in the two-dimensional case. Smooth SSLR with parameter combination (λ = 1.5, σ = 0.05) achieved an average accuracy of 99.8 ± 0.35%. Examples of the two-dimensional weight vectors learned by SSLR with and without a smoothing prior are shown in Fig. 7. The smooth SSLR weight vector learned is virtually identical to the uncorrupted target signal. The weight vector learned by non-smooth SSLR is much sparser, but the weights in the center follow the same general pattern as the target signal. Compared with the one-dimensional target signal, the twodimensional signal has a much larger number of features that are highly correlated. SSLR with a smoothing prior effectively exploited this property to obtain almost perfect accuracy and closely reproduce the underlying target signal. On the other hand, non-smooth SSLR treats the features independently, so the large number of features and strong noise resulted in less relevant features and more irrelevant features being selected. The results of Simulation experiment 3 show that SSLR performs well even with multi-dimensional signals with a large number of features, and further demonstrate the advantage of providing the classifier with prior knowledge of the structure of the weight vector.
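Because the covariance of Eq. (3) factorizes over the two time axes (and the spatio-temporal covariance used for the MEG data below, Eq. (4), factorizes over space and time in the same way), individual entries of the full 10,000 × 10,000 matrix can be computed on demand from a single precomputed 100 × 100 factor. A sketch of this bookkeeping, assuming row-major vectorization of the 100 × 100 signal, is:

```python
import numpy as np

def gaussian_factor(times, sigma):
    """One-dimensional factor exp(-(t_i - t_j)^2 / (2 sigma^2))."""
    diff = times[:, None] - times[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

n = 100
t = 0.25 + np.arange(n) / 200.0        # 100 samples covering the 0.25-0.75 s window
Q1 = gaussian_factor(t, sigma=0.05)    # single 100 x 100 factor, stored once

def Q2d_entry(i, j):
    """Entry of the implicit 10,000 x 10,000 covariance for vectorized indices i, j.

    With row-major vectorization, index k corresponds to grid point (k // n, k % n),
    so the entry is a product of two one-dimensional factors as in Eq. (3).
    """
    return Q1[i // n, j // n] * Q1[i % n, j % n]

# The full matrix equals np.kron(Q1, Q1), but it never needs to be materialized.
```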

MEG experiment

Fig. 7. The original target signal (upper left), an example of the corrupted target signal (upper right), and weight vectors learned by non-smooth SSLR (lower left) and SSLR with a smoothing prior (lower right) in Simulation experiment 3. Negative values are shown in blue, and positive values are shown in red.

To evaluate the performance of smooth sparse logistic regression (SSLR) on real data, we compared the decoding results of SSLR with non-smooth SSLR (i.e., SSLR with Q = identity matrix) on recordings from an MEG experiment. The decoding task in the MEG experiment was to determine whether subjects had pressed a button with their right index finger or their left index finger. Recordings were taken from the response period of a visual attention experiment. During a single trial, subjects fixated a gray bulls-eye fixation marker and watched a monochromatic grating stimulus displayed briefly for 17 ms. The fixation marker turned black 800 ms after stimulus offset to cue the subject to press a button with their right index finger if the grating was rotated slightly clockwise, and press a button with their left index finger if the grating was rotated counter-clockwise. Both grating conditions were presented with equal probability, which resulted in an approximately equal number of left and right button presses. Subjects performed 4 sessions, with each session containing 40 trials. Subjects were given a few minutes to rest between sessions.

Ten subjects participated in the experiment. Eye movements were monitored using EOG measurements. One subject was excluded from analysis due to excessive eye movements. The remaining nine subjects' trials were rejected individually if any large eye movements (EOG > 100 μV) or other artifacts (MEG > 3000 fT) were found. Recordings were made using a whole-head 208-channel system (MEG vision — PQ1400RM; Yokogawa Electric Co., Japan) sampled


Q_{\langle p_1, t_1\rangle, \langle p_2, t_2\rangle} = \exp\!\left(-\frac{\|p_1 - p_2\|^2}{2\sigma_p^2}\right) \exp\!\left(-\frac{(t_1 - t_2)^2}{2\sigma_t^2}\right), \qquad (4)

where 〈pi ; t i 〉 is the index of the feature vector corresponding to the value of the sensor at location pi and at time ti. The positional vector pi represents the (relative) location of the physical MEG sensor in 3dimensional space, measured in meters. The scalar ti is the (relative) time of the measurement in milliseconds. The variables σp and σt are parameters representing the degree of spatial smoothing and temporal smoothing, respectively. As in Simulation experiment 3, we computed the elements of Q as needed using a precomputed 208× 208 spatial covariance matrix corresponding to the first term of Eq. (3) and a 60× 60 temporal covariance matrix corresponding to the second term. SSLR therefore had three parameters, λ determining the strength of sparseness, σp determining spatial smoothing, and σt determining temporal smoothing. Appropriate values for these parameters were determined in the following way using all available data from one subject, whom we will refer to as the “pre-test subject”. First, λ was varied from 0 to 5 in steps of 0.5, and the average performance of non-smooth SSLR (i.e., Q = identity matrix) was evaluated for each value of λ using leave-one-session-out cross-validation. Based on this search, λ = 0.5 gave the best performance for the pre-test subject. A more refined search around λ = 0.5 with step size 0.01 showed no significant changes in performance accuracy, so we fixed λ = 0.5 in the following. Next, λ was held fixed at 0.5 and a grid search was performed by simultaneously varying σp from 0 to 0.05 m in steps of 0.01 m and varying σt from 0 to 200 ms in steps of 50 ms. For the special cases σp = 0 m and σt = 0 ms, identity matrices were used for the spatial, respectively temporal, components of Q. The performance of SSLR for each parameter combination was determined by cross-validation. Using the above parameter search, the optimal parameters for the pre-test subject was determined to be λ = 0.5, σp = 0.02 m, and σt = 100 ms. Average decoding performance for the pre-test subject using SSLR with these parameters was 87.3 ± 1.5% (mean ± standard deviation). Average performance using non-smooth SSLR (Q = I) with optimal parameter λ = 0.5 was 84.4 ± 1.7%, which was

significantly below the optimal SSLR performance according to a ttest paired by cross-validation runs (p b 0.001). The parameters obtained from the pre-test subject were then used for evaluating decoding performance on the remaining eight subjects (see Fig. 8). Across subjects (excluding the pre-test subject), average decoding accuracy was 91.6±7.8% for SSLR with the optimal parameters and 86.7±13.1% for non-smooth SSLR, which were significantly different (paired t-test, pb 0.005). To verify that the difference in performance was not due to any single subject, we removed subjects one at a time and recalculated the t-test. Under this analysis, the advantage of optimal SSLR still stayed significant at the p b 0.025 level. Within subjects, although optimal SSLR performed better than non-smooth SSLR for each subject, only Subject 6 attained significance (p b 0.05). The reason more subjects did not become significant is possibly due to the small number of cross-validation runs. The time course of weight vectors learned for Subjects 6 and 8 are shown in Figs. 9 and 10, respectively. In both figures, the top panels (a) show the weights learned by SSLR with optimal parameters, and the lower panels (b) show the weights learned by non-smooth SSLR. The full weight vectors can be seen in the supplementary movies (compare “SmoothWgts_Subj6.avi” versus “NonSmoothWgts_Subj6.avi”, and “SmoothWgts_Subj8.avi” versus “NonSmoothWgts_Subj8.avi”). Across subjects, non-smooth SSLR performed worst with Subject 6 (mean accuracy 64.6 ± 12.4%; mean ± std. dev.) and best with Subject 8 (96.0 ± 3.4%). In comparison, optimal SSLR had mean accuracy 83.1 ± 5.9% with Subject 6 and 96.5 ± 3.0% with Subject 8. The weights learned by optimal SSLR clearly show three distinct spatial clusters of weights which develop over time (see Discussion section). Furthermore, this general pattern is common in both Subject 6 and Subject 8. In contrast, the weight vectors learned by non-smooth SSLR appear more random, and it is difficult to find common patterns between the weights for Subjects 6 and 8. Qualitatively similar results were seen for other subjects. As can be seen in Figs. 9 and 10, SSLR with temporal smoothing often produces stronger weights much earlier than non-smooth SSLR. This raises the question as to whether the weights are truly contributing to classification, or if they are simply a result of over-smoothing. To investigate the contribution of the weights over time, we reevaluated the decoding accuracy of the learned weight vectors on the testing data as a function of time, where at time t only the values




at 1000 Hz. The MEG data was high-pass filtered offline at 1 Hz, and then down sampled to 100 Hz. The time interval starting 300 ms before and ending 300 ms after the button press was used for decoding. Only trials in which the subject pressed a button more than 300 ms after the cue were used. In total, an average of 143.8 ± 12.9 (mean ± std. dev.) trials were used per subject for decoding. The data for a single trial consisted of a 208 × 60 (channels × time samples) matrix, which can be viewed as a single 12,480 element feature vector when used by the decoding algorithm. For each train/test validation, each feature element was normalized by subtracting the mean and dividing by the standard deviation of the value of the feature element, based on the statistics of the training data set. Other than the high pass filtering, down sampling, and normalization described above, no additional preprocessing of the data was performed. Trials in which the subject pressed the button with their right index finger were assigned a label of y = 1, and trials with left finger button presses were assigned a label of y = −1. The performance of the decoding algorithms was evaluated using a leave-one-session-out cross-validation. Data from three sessions were used for training the decoder, and then prediction accuracy was determined using the remaining session. This was repeated four times using each available session once as testing data, and then the mean prediction accuracy was computed. The covariance matrix Q used by SSLR took the form



Fig. 8. Average decoding accuracy for the MEG experiment. The accuracy for SSLR with optimal parameters (λ = 0.5, σ_p = 0.02 m, σ_t = 100 ms) is shown in blue, and non-smooth SSLR (λ = 0.5, Q = identity) is shown in green. Error bars represent one standard error. Accuracy for the pre-test subject used to optimize parameters is not shown.


Fig. 9. Temporal snap-shots of weight vectors learned by SSLR with (a) the optimal smoothing prior and (b) non-smooth SSLR for Subject 6 in the MEG experiment. Red values correspond to positive weights, blue values to negative weights, and green values to weights that are close to zero. Times are relative to the time of button press.

of the weight vector and MEG data for time points up to t were given to the decoder. If the weights for early time points learned by SSLR with optimal parameters were purely the result of over-smoothing, then we would expect that the decoding accuracy at early time points would be at chance level. Fig. 11 shows the decoding performance of optimal SSLR (blue line) and non-smooth SSLR (green line) as a function of time, averaged across subjects (excluding the pre-test subject). Performance for both decoders began to rise sharply 40 ms before button press, and began to plateau around 80 ms after button press. Performance for optimal SSLR is significantly above chance (uncorrected paired t-test, p < 0.05) for all time points starting 210 ms before button press, as indicated by the vertical blue dashed line in Fig. 11. Non-smooth SSLR did not attain significance until 40 ms before the button press (vertical green dashed line), with the exception of one isolated time point at −120 ms. If a Bonferroni correction is applied for the multiple comparisons across time, then optimal SSLR, respectively non-smooth SSLR, become significant starting 40 ms, respectively 30 ms, before the button press (Bonferroni corrected paired t-test, p < 0.05). Although the decoding accuracy between −200 and −40 ms is rather poor and statistical significance is relatively weak,

Fig. 10. Temporal snap-shots of weight vectors learned by SSLR with (a) the optimal smoothing prior and (b) non-smooth SSLR for Subject 8 in the MEG experiment.

we nevertheless take this as evidence that the weights learned by optimal SSLR corresponding to time points as early as 200 ms prior to button press are contributing to classification, and are not simply a result of over-smoothing.

Finally, we investigated how SSLR behaves when the parameters are optimized for each subject independently. We used the same procedure for each subject that was used for the pre-test subject to determine an optimal value of λ for non-smooth SSLR, and then found optimal smoothing parameters σ_p and σ_t. These results are shown in Table 3. The results consistently show an increase in performance when a smoothing prior is added. However, performance did not change significantly compared to when the parameters were optimized using the pre-test subject's data (see values in parentheses), either across subjects (paired t-test; p > 0.2) or within subjects (p > 0.15). The variability in the parameter values but lack of change in performance adds evidence that SSLR can be robust to the choice of smoothing parameters (see Simulation experiment 2 and the supplementary text "SupplementaryText.pdf").

Discussion

We have shown that by incorporating spatio-temporal prior knowledge, in particular smoothing, into the classifier, smooth sparse


Fig. 11. Average decoding accuracy across subjects for the MEG experiment as a function of time. Each time point represents the accuracy obtained when only the values of the weight vector and MEG data up to that time point are given to the decoder. Performance for SSLR with the optimal smoothing prior is shown in blue, and performance for non-smooth SSLR is shown in green. The vertical dashed blue line (respectively, green line) represents the time point at which optimal SSLR (respectively, non-smooth SSLR) decoding accuracy starts to be significantly above chance (uncorrected paired t-test, p < 0.05).

logistic regression (SSLR) can achieve higher performance and select features in spatio-temporally continuous clusters. We have also derived a dual-optimization algorithm for SSLR which is useful in cases where there are a large number of features, which is typical in neuroscientific applications. Recently, there has been a growing interest in incorporating spatio-temporal information into classifiers. Below we compare SSLR with some other recent approaches. Dyrholm and Parra (2006) have proposed using a smoothing prior on the weight vector when classifying EEG signals, and have shown that this results in high classification accuracy and increased interpretability of the results. Applications of smooth regression to brain–computer interface (BCI) are given in Parra et al. (2008). Although the smoothing prior used by Dyrholm and Parra (2006) is similar to ours, they did not incorporate sparseness into their model. As our simulation results show (see Fig. 3), the addition of an ℓ1-norm to the smoothing prior is important not just for feature selection, but also for preventing “fluctuations” in the weight vector caused by overfitting to noise.

Table 3 Accuracy when parameters optimized individually.

             Non-smooth       Optimal (Pre-test)       Parameters
Subject 1    94.2 ± 5.5%      97.5 ± 1.7% (96.4%)      λ: 0.50, σp: 0.03 m, σt: 150 ms
Subject 2    94.9 ± 4.0%      97.9 ± 4.2% (97.0%)      λ: 1.50, σp: 0.01 m, σt: 100 ms
Subject 3    96.8 ± 2.7%      96.8 ± 2.7% (95.2%)      λ: 1.50, σp: 0.01 m, σt: 0 ms
Subject 4    85.2 ± 9.4%      92.5 ± 3.8% (92.4%)      λ: 0.50, σp: 0.01 m, σt: 100 ms
Subject 5    73.1 ± 9.8%      82.1 ± 9.9% (77.7%)      λ: 0.00, σp: 0.00 m, σt: 150 ms
Subject 6    73.3 ± 10.5%     81.4 ± 8.7% (83.1%)      λ: 0.00, σp: 0.02 m, σt: 50 ms
Subject 7    93.7 ± 3.9%      96.1 ± 3.9% (94.5%)      λ: 0.50, σp: 0.05 m, σt: 50 ms
Subject 8    95.3 ± 4.6%      97.2 ± 3.9% (96.5%)      λ: 2.50, σp: 0.02 m, σt: 100 ms

Average decoding accuracy for the MEG experiment when parameters are optimized for each subject individually using the same procedure that was used for the pre-test subject. The “Non-smooth” column shows the results for each subject using non-smooth SSLR with the optimized value of λ shown in the last column. “Optimal” shows the results when the smoothing parameters are also optimized. The value in parentheses shows the accuracy when the parameters optimized for the pre-test subject were used (the blue line in Fig. 8).


Gershman et al. (2011) took a slightly different approach to using spatial constraints in fMRI analysis by using linear combinations of latent sources modeled by spatial basis functions. Their approach showed better decoding performance than standard logistic regression without spatial constraints for some data sets. Recent approaches that use both smoothing and sparseness include theoretical investigations by Hebiri and van de Geer (2011), as well as applications to fMRI decoding by van Gerven and Heskes (2010), Ng and Abugharbieh (2011), and Grosenick et al. (2011). These papers have focused mainly on placing a constraint of the form (wi − wj) 2, scaled and summed over neighboring feature weights wi and wj. For fMRI data, “neighboring” means immediately adjacent voxels. A related approach is taken by Michel et al. (2011), where the weight vector was penalized by its total variation. For one-dimensional data, this corresponds to a penalty of the form |wi − wj| applied to neighboring weights (see the reference for the three-dimensional case), which encourages more of a block structure on the weight vector. For EEG or MEG data, longer reaching interactions between feature weights might be more beneficial. This is suggested by our Table 3, which shows that most subjects preferred smoothness extending over spatially and temporally distant features. van Gerven et al. (2010) have also investigated the use of spatiotemporal priors with sparse logistic regression by means of a multivariate Laplacian distribution. A major difference between their approach and SSLR is that the multi-variate Laplacian model correlations between the absolute value of the weights, whereas SSLR places correlations on the weights themselves. Thus, although the multi-variate Laplacian approach will select features in spatiotemporal clusters, it will not necessarily prefer smoothly changing weight vectors. This might result in unintuitive weight vectors in some cases. On the other hand, when the underlying signal is not smooth, the approach of van Gerven et al. (2010) might have benefits over SSLR. As discussed in Yamashita et al. (2008), if there is a strong positive correlation in the noise in two features, and one feature is positively correlated to the underlying class-related signal but the other feature is uncorrelated (or negatively correlated) to the signal, then the classifier can effectively remove noise by adding positive weight to the first feature and negative weight to the second feature. van Gerven et al. (2010) describe this as a kind of local filter produced by the classifier. In situations where the underlying signal is not smooth, the multivariate Laplacian approach might be more effective than SSLR at filtering out correlated noise in this way. Further experiments and analysis are needed to test this hypothesis and better understand how classifiers behave when there are correlations in both the signal and the noise. Our analysis in Section Simulation experiment 2 shows that SSLR resembles performing classification on preprocessed data while selecting features with respect to the original feature space. This shows that SSLR is a non-trivial example of incorporating part of the typical preprocessing stage into the learning stage. The results of Section Simulation experiment 2 demonstrate that moving some of the preprocessing into the learning stage can make the classifier more robust. However, there are additional advantages to using SSLR on raw data as opposed to smoothing the data by preprocessing. 
Since SSLR only applies smoothness to the weight vector, the classification accuracy of SSLR for different values of the smoothing parameter can be compared to test hypotheses concerning the degree of smoothness of the underlying signal. It is more difficult to perform such an analysis by smoothing data and then applying an unconstrained classifier, because it is possible for classifiers to learn a weight vector that partially reverses a (linear) transformation of the data. This issue was in fact the subject of a recent debate on the interpretation of the effects of smoothing fMRI data on classification performance (Op de Beeck, 2010; Kamitani and Sawahata, 2010). Furthermore, an analysis of decoding accuracy with respect to time, such as that shown in Fig. 11, would be less reliable if performed on smoothed data because the temporal structure of the data would be altered.
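Such a comparison amounts to a cross-validation loop over candidate smoothing values. The sketch below assumes a hypothetical scikit-learn-compatible factory make_classifier(s) that returns an unfitted SSLR-style estimator with smoothing parameter s; it illustrates only the experimental design, not the classifier used in this paper:

from sklearn.model_selection import cross_val_score

def compare_smoothing_levels(make_classifier, X, y, smoothing_values, cv=10):
    # cross-validated accuracy for each candidate smoothing value
    return {s: cross_val_score(make_classifier(s), X, y, cv=cv).mean()
            for s in smoothing_values}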

When applied to real MEG data, we found that the smoothing prior in SSLR tended to increase decoding accuracy when compared to non-smooth SSLR (see Fig. 8 and the supplementary text "SupplementaryText.pdf"). In addition, the weight vectors obtained when using a smoothing prior were much easier to interpret.

The three-cluster pattern found in the smooth weight vectors for both Subjects 6 and 8 (top panels of Figs. 9 and 10; see also the supplementary movies "SmoothWgts_Subj6.avi" and "SmoothWgts_Subj8.avi") is easily explained in terms of the movement-evoked fields (MEF) observed when subjects make voluntary finger movements (Cheyne et al., 2006; Mayville et al., 2005; Toda et al., 2011). Fig. 12 shows the MEG signal at the time of the left or right button press, averaged over all subjects and trials. When the portion of the smooth weight vector for t = 0 ms is applied to the activity pattern for the right button press condition seen in Fig. 12, the upper left and central clusters in the weights and the activity pattern overlap, producing a positive label. For the left button condition, the signs of the upper right and central clusters in the weights are opposite of the corresponding signs in the activity pattern, so the output is a negative label. Interestingly, a close inspection of the smooth weights reveals that the clusters change signs at about t = 100 ms. This is most noticeable for Subject 6, but can also be seen in the central cluster of weights for Subject 8. This change in polarity of the weight vectors is consistent with the transition from MEF-I to MEF-II observed in previous studies (Cheyne et al., 2006; Mayville et al., 2005). Note that, although the MEF can be seen by averaging over a large number of trials, the fact that SSLR achieved high decoding accuracy implies that these signals are present and robust even in individual trials.

In contrast, the weights produced by non-smooth SSLR (bottom panels of Figs. 9 and 10; see also the supplementary movies "NonSmoothWgts_Subj6.avi" and "NonSmoothWgts_Subj8.avi") are much noisier and share fewer similarities across subjects. Without the insight we gained from inspecting the smooth weights, it would be difficult to explain the seemingly random pattern produced by non-smooth SSLR. In particular, it is more difficult to distinguish between weights that are related to motor activity and weights that are due to artifacts.

Finally, although we have focused mainly on the effects of a spatio-temporal smoothing prior, our approach is general enough to allow more task-specific prior knowledge to be incorporated into the classifier. By changing the variance of the weights in the prior (i.e., the diagonal of Q), it is possible to encourage the classifier to select features that are expected to be more relevant to the task at hand. For example, in a motor decoding task, one can increase the variance of weights above the motor areas and decrease the variance of the surrounding weights to encourage the selection of task-relevant features. In this way, we can tell the classifier where to look, without explicitly removing potentially useful features during preprocessing. More complicated spatio-temporal patterns could also be given to the classifier as prior information, such as characteristic "N100–P200" patterns. Covariance matrices based on weight vectors learned from previous classification experiments could also be used to implement a form of transfer learning (Raina et al., 2006).
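As one possible illustration of such a prior, the sketch below builds a prior covariance Q from a squared-exponential spatio-temporal kernel and rescales its diagonal to favor a chosen set of features. The kernel form, the length scales, and the boost_idx/boost_factor names are our own assumptions; the covariance actually used in this paper is defined in Material and methods and may differ:

import numpy as np

def smoothing_prior(positions, times, space_scale=3.0, time_scale=50.0):
    # positions: (D, 3) sensor coordinates per feature; times: (D,) time stamps per feature (ms)
    # squared-exponential kernel over space and time (an assumed form)
    d_space = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    d_time = np.abs(times[:, None] - times[None, :])
    return np.exp(-(d_space / space_scale) ** 2 - (d_time / time_scale) ** 2)

def boost_variance(Q, boost_idx, boost_factor=2.0):
    # increase the prior variance (diagonal of Q) for task-relevant features,
    # leaving the correlation structure between features unchanged
    scale = np.ones(Q.shape[0])
    scale[boost_idx] = np.sqrt(boost_factor)
    return Q * np.outer(scale, scale)

# example: 2 sensors x 3 time points = 6 features, boost the first sensor's features
pos = np.repeat(np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]), 3, axis=0)
t = np.tile(np.array([0.0, 50.0, 100.0]), 2)
Q = boost_variance(smoothing_prior(pos, t), boost_idx=[0, 1, 2])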

Conclusions

We have investigated a generalization of ℓ1-normed sparse logistic regression, which we call smooth sparse logistic regression (SSLR). SSLR allows the user to encode prior knowledge of the correlational structure of weight vectors into the classifier, which can help the classifier to select features in the form of meaningful spatio-temporal clusters. We have shown, using simulated data and real MEG data, that a smoothing prior can increase classification accuracy and produce results that are easier to interpret from a neuroscientific perspective. Our results suggest that SSLR is a promising technique for incorporating prior knowledge into a classifier, and can have important applications to both brain–machine interfaces and neuroscience in general.

Supplementary materials related to this article can be found online at doi:10.1016/j.neuroimage.2011.12.085.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments and suggestions. We also thank S. Hirose, O. Yamashita, and M. Sato for valuable discussions and helpful comments on earlier versions of this paper. Finally, we thank Y. Furukawa for her support with the MEG recordings. This work was supported by the JST PRESTO program.

Appendix A

In this appendix we derive the dual optimization problem for GSLR (see Material and methods). This approach can be more efficient than trying to find the optimal weight vector directly when there are a large number of features. For more on the dual of optimization problems, see Chapter 5 of Boyd and Vandenberghe (2004).

As a notational convenience, we define x̄_i = y_i x_i and X̄ = (x̄_1, x̄_2, …, x̄_N)^T. The problem of maximizing the equation ℓ(w; X, y) from Subsection Generalized sparse logistic regression is equivalent to the following constrained optimization problem, where we have simply added some trivial constraints:

Maximize: f(z, u, w) = −Σ_{i=1}^{N} log(1 + exp(−z_i)) − (1/2) u^T Q^{−1} u − λ||w||_1

Subject to: z = X̄u and u = w.    (A-1)
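Substituting the constraints of (A-1) back in (z = X̄w, u = w) gives the primal objective ℓ(w; X, y), which can be evaluated directly. The following numpy sketch is only a sanity check of that objective, not the optimization code used for the experiments:

import numpy as np

def gslr_objective(w, X, y, Q_inv, lam):
    # data term: -sum_i log(1 + exp(-y_i * x_i^T w)), computed stably
    z = y * (X @ w)
    loglik = -np.sum(np.logaddexp(0.0, -z))
    # Gaussian prior term -(1/2) w^T Q^{-1} w and sparsity penalty -lambda * ||w||_1
    return loglik - 0.5 * (w @ Q_inv @ w) - lam * np.sum(np.abs(w))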

By introducing Lagrangian multiplier vectors μ and η, we obtain the Lagrangian function

L(z, u, w, μ, η) = f(z, u, w) + μ^T(X̄u − z) + η^T(w − u).

The dual Lagrange function g(μ, η) is defined as

g(μ, η) = sup_{z, u, w} L(z, u, w, μ, η).

Fig. 12. MEG signal for left and right button press conditions at time t = 0 ms, averaged over all trials for all subjects (excluding pre-test subject). The left condition is the average of 539 trials, and the right condition is the average of 601 trials.

Strong duality holds for g(μ, η), which means that the minimal value of g(μ, η) equals the maximal value of f(z, u, w). Furthermore, if μ*, η* minimize g, then the values of z, u, w that maximize L(z, u, w, μ*, η*) will also maximize f(z, u, w). If z, u, w satisfy the constraints of (A-1), then w is the solution to maximizing ℓ(w; X, y). The ℓ∞-norm of η, denoted ||η||∞, is defined as max{|η_1|, |η_2|, …, |η_D|}. Note that the only terms of L that depend on w are η^T w − λ||w||_1. It follows that if ||η||∞ > λ, then g(μ, η) = ∞ because L can take on arbitrarily large values by choosing large w_i with the same sign as η_i. If ||η||∞ ≤ λ, then L obtains a maximum at

z_i = −log(μ_i/(1 − μ_i)), u = Q(X̄^T μ − η), w = 0.    (A-2)

In general, this is not the only maximal point of L, since if |η_i| = λ then any value of w_i with the same sign as η_i will also maximize L. In this case, only w_i = u_i satisfies the constraints of the optimization problem, but the value of L is the same whether we choose w_i = u_i or w_i = 0, so we have adopted the latter to make the next derivation more transparent. On the other hand, if |η_i| < λ, then w_i = 0 is uniquely determined.

By plugging the values for z, u, and w from (A-2) into the definition of L, we obtain the following constrained dual to problem (A-1):

Minimize: g(μ, η) = Σ_{i=1}^{N} (μ_i log(μ_i) + (1 − μ_i) log(1 − μ_i)) + (1/2)(X̄^T μ − η)^T Q(X̄^T μ − η)

Subject to: ||η||∞ ≤ λ.    (A-3)

If μ*, η* solve the dual problem (A-3), then u* = w* = Q(X̄^T μ* − η*) and z* = X̄u* are the solution to problem (A-1). Therefore, w* = Q(X̄^T μ* − η*) maximizes ℓ(w; X, y). Below we list some properties of the dual problem.

1. At the solution μ*, η* to the dual problem, z_i* = −log(μ_i*/(1 − μ_i*)) = y_i x_i^T w*, which implies P(y_i | x_i; w*) = 1 − μ_i*. Therefore, w* does a good job of separating the two classes if μ* is close to the zero vector.
2. From our previous observations, w_i* = 0 whenever |η_i*| < λ. This provides us with a simple way to estimate which feature weights are non-zero given close approximations to the optimal solution.
3. There is no need to invert Q when dealing with g(μ, η). If the elements of Q are easily computed as a function of the indices, then it is not even necessary to keep Q in memory. This allows us to solve the dual problem (A-3) even in cases when Q is too large to solve problem (A-1) directly.
4. If the number of observations N is relatively small, then minimizing g(μ, η) with respect to μ and fixed η can be done very quickly using Newton's method. On the other hand, minimizing g(μ, η) with respect to η and fixed μ is complicated by the constraints on η. Fortunately, this is an example of a quadratic program, and many efficient algorithms are available (see Nocedal and Wright, 1999). We have found the gradient projection method to be effective even for problems involving several thousand features.

Finally, we note that in the special case that Q is the identity matrix, minimizing g(μ, η) with respect to η and fixed μ is easily obtained as η = P_λ^∞(X̄^T μ), where

[P_λ^∞(x)]_i = (x_i/|x_i|) min(|x_i|, λ)

is the projection onto the ℓ∞ ball of radius λ.
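The identity-Q special case is straightforward to code. The following numpy sketch implements the ℓ∞-ball projection, the corresponding closed-form η-update, and the recovery of the primal weights from the dual variables; it illustrates the formulas above rather than the full solver (for general Q, property 4 suggests a quadratic-programming or gradient projection routine for the η-step):

import numpy as np

def project_linf_ball(x, lam):
    # [P_lambda^inf(x)]_i = sign(x_i) * min(|x_i|, lambda), i.e. clipping to [-lambda, lambda]
    return np.clip(x, -lam, lam)

def eta_update_identity_Q(X_bar, mu, lam):
    # closed-form minimizer of g(mu, eta) over eta when Q is the identity: eta = P_lambda^inf(X_bar^T mu)
    return project_linf_ball(X_bar.T @ mu, lam)

def weights_from_dual(X_bar, mu, eta, Q=None):
    # recover the primal weights: w* = Q (X_bar^T mu - eta); Q defaults to the identity
    v = X_bar.T @ mu - eta
    return v if Q is None else Q @ v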

References

Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
Carroll, M.K., Cecchi, G.A., Rish, I., Garg, R., Rao, A.R., 2009. Prediction and interpretation of distributed neural activity with sparse models. Neuroimage 44 (1), 112–122.
Cheyne, D., Bakhtazad, L., Gaetz, W., 2006. Spatiotemporal mapping of cortical activity accompanying voluntary movements using an event-related beamforming approach. Hum. Brain Mapp. 27 (3), 213–229.
Dyrholm, M., Parra, L., 2006. Smooth bilinear classification of EEG. Engineering in Medicine and Biology Society, 2006. EMBS'06. 28th Annual International Conference of the IEEE, pp. 4249–4252.
Gershman, S.J., Blei, D.M., Pereira, F., Norman, K.A., 2011. A topographic latent source model for fMRI data. Neuroimage 57 (1), 89–100.
Grosenick, L., Klingenberg, B., Knutson, B., Taylor, J.E., 2011. A family of interpretable multivariate models for regression and classification of whole-brain fMRI data. In submission (arXiv:1110.4139v1 [stat.AP]).
Hebiri, M., van de Geer, S., 2011. The Smooth-Lasso and other ℓ1 + ℓ2-penalized methods. Electron. J. Stat. 5, 1184–1226.
Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67.
Kamitani, Y., Sawahata, Y., 2010. Spatial smoothing hurts localization but not information: pitfalls for brain mappers. Neuroimage 49 (3), 1949–1952.
Koh, K., Kim, S.-J., Boyd, S., 2007. A method for large-scale l1-regularized logistic regression. Proceedings of the 22nd National Conference on Artificial Intelligence, Volume 1. AAAI Press, pp. 565–571.
MacKay, D.J.C., 1992. Bayesian interpolation. Neural Comput. 4, 415–447.
Mayville, J., Fuchs, A., Kelso, J., 2005. Neuromagnetic motor fields accompanying self-paced rhythmic finger movement at different rates. Exp. Brain Res. 166, 190–199.
Michel, V., Gramfort, A., Varoquaux, G., Eger, E., Thirion, B., 2011. Total variation regularization for fMRI-based prediction of behavior. IEEE Trans. Med. Imaging 30 (7), 1328–1340.
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H., Sadato, N., Kamitani, Y., 2008. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron 60 (5), 915–929.
Ng, B., Abugharbieh, R., 2011. Modeling spatiotemporal structure in fMRI brain decoding using generalized sparse classifiers. IEEE International Workshop on Pattern Recognition in NeuroImaging, pp. 65–68.
Nocedal, J., Wright, S., 1999. Numerical Optimization. Springer.
Op de Beeck, H.P., 2010. Against hyperacuity in brain reading: spatial smoothing does not hurt multivariate fMRI analyses? Neuroimage 49 (3), 1943–1948.
Parra, L., Christoforou, C., Gerson, A., Dyrholm, M., Luo, A., Wagner, M., Philiastides, M., Sajda, P., 2008. Spatiotemporal linear decoding of brain state. Signal Process. Mag. IEEE 25 (1), 107–115.
Raina, R., Ng, A.Y., Koller, D., 2006. Constructing informative priors using transfer learning. Proceedings of the 23rd International Conference on Machine Learning. ICML'06. ACM, New York, NY, USA, pp. 713–720.
Ryali, S., Supekar, K., Abrams, D.A., Menon, V., 2010. Sparse logistic regression for whole-brain classification of fMRI data. Neuroimage 51 (2), 752–764.
Tibshirani, R., 1996. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58 (1), 267–288.
Toda, A., Imamizu, H., Kawato, M., Sato, M., 2011. Reconstruction of two-dimensional movement trajectories from selected magnetoencephalography cortical currents by combined sparse Bayesian methods. Neuroimage 54 (2), 892–905.
Tomioka, R., Sugiyama, M., 2009. Dual-augmented Lagrangian method for efficient sparse reconstruction. Signal Process. Lett. IEEE 16 (12), 1067–1070.
van Gerven, M.A., Heskes, T., 2010. Sparse orthonormalized partial least squares. 22nd Benelux Conference on Artificial Intelligence (BNAIC 2010).
van Gerven, M., Hesse, C., Jensen, O., Heskes, T., 2009. Interpreting single trial data using groupwise regularisation. Neuroimage 46 (3), 665–676.
van Gerven, M.A., Cseke, B., de Lange, F.P., Heskes, T., 2010. Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior. Neuroimage 50 (1), 150–161.
Yamashita, O., Sato, M., Yoshioka, T., Tong, F., Kamitani, Y., 2008. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. Neuroimage 42 (4), 1414–1429.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320.