Decoding naturalistic experiences from human brain activity via distributed representations of words

Satoshi Nishida, Shinji Nishimoto

PII: S1053-8119(17)30664-X
DOI: 10.1016/j.neuroimage.2017.08.017
Reference: YNIMG 14248
To appear in: NeuroImage
Received Date: 15 March 2017
Revised Date: 31 July 2017
Accepted Date: 3 August 2017

Please cite this article as: Nishida, S., Nishimoto, S., Decoding naturalistic experiences from human brain activity via distributed representations of words, NeuroImage (2017), doi: 10.1016/j.neuroimage.2017.08.017.
Decoding naturalistic experiences from human brain activity via distributed representations of words

Satoshi Nishida 1,2 and Shinji Nishimoto 1,2,3,*

1 Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology (NICT), Suita, Osaka 565-0871, Japan
2 Graduate School of Frontier Biosciences, Osaka University, Suita, Osaka 565-0871, Japan
3 Graduate School of Medicine, Osaka University, Suita, Osaka 565-0871, Japan

* Corresponding author: Shinji Nishimoto
1-4 Yamadaoka, Suita, Osaka 565-0871, Japan
[email protected]
Abstract

Natural visual scenes induce rich perceptual experiences that are highly diverse from scene to scene and from person to person. Here, we propose a new framework for decoding such experiences using a distributed representation of words. We used functional magnetic resonance imaging (fMRI) to measure brain activity evoked by natural movie scenes. Then, we constructed a high-dimensional feature space of perceptual experiences using skip-gram, a state-of-the-art distributed word embedding model. We built a decoder that associates brain activity with perceptual experiences via the distributed word representation. The decoder successfully estimated perceptual contents consistent with the scene descriptions by multiple annotators. Our results illustrate three advantages of our decoding framework: (1) three types of perceptual contents could be decoded in the form of nouns (objects), verbs (actions), and adjectives (impressions) contained in 10,000 vocabulary words; (2) despite using such a large vocabulary, we could decode novel words that were absent in the datasets used to train the decoder; and (3) the inter-individual variability of the decoded contents co-varied with that of the contents of scene descriptions. These findings suggest that our decoding framework can recover diverse aspects of perceptual experiences in naturalistic situations and could be useful in various scientific and practical applications.
Highlights

• A new decoding method that uses a distributed representation of words
• Decoding of movie-induced perceptions from cortical activity in the form of words
• Decoding could infer object, action, and impression perception separately
• Inter-individual variability in decoding correlated with variability in perception
• Our method provides a useful tool for scientific and practical applications

Key words

decoding; semantic perception; natural language processing; humans; fMRI; natural vision
1. Introduction

Recent developments in decoding techniques using functional magnetic resonance imaging (fMRI) have the potential to form the quantitative basis of non-invasive brain-machine interfaces (Haynes and Rees, 2006; Kay et al., 2008; Miyawaki et al., 2008; Naselaris et al., 2015, 2009; Nishimoto et al., 2011). One promising means of achieving interpretable decoding of the brain activity evoked by diverse experiences is in the form of language: i.e., words (Horikawa et al., 2013; Huth et al., 2016b; Pereira et al., 2011; Stansbury et al., 2013) or sentences (Anderson et al., 2016; Matsuo et al., 2016; Pereira et al., 2016; Yang et al., 2017).

Despite previous attempts to comprehensively decode objective and subjective experiences, several issues remain to be overcome. First, previous techniques tried to recover perceptual contents using a restricted representational space comprising at most a few thousand words, although language consists of tens of thousands of words or more and can represent richer information than examined in previous studies. Second, previous word-based decoding studies did not provide strong evidence that the variety of decoded contents was consistent with the subjective experiences of individual participants.

To address these issues, we propose a new decoding technique that can recover rich and diverse perceptual experiences from the brain activity of individuals by effectively applying a large-scale word representational space provided by a natural language processing (NLP) model. For several decades, NLP researchers have tried to extract word representations from the statistical characteristics of large text datasets (e.g., Blei et al., 2003; Deerwester et al., 1990; Mikolov et al., 2013a; Pennington et al., 2014). Some neuroimaging studies have demonstrated that the extracted feature representations of words can be used to model semantic representations in the brain (Chang et al., 2011; Huth et al., 2016a; Mitchell et al., 2008; Pereira et al., 2011; Stansbury et al., 2013). Among them, models that incorporate distributed word representations produced by skip-gram (Mikolov et al., 2013a), a state-of-the-art NLP algorithm, can predict visually evoked brain responses better than models with other NLP algorithms (Güçlü and van Gerven, 2015; Nishida et al., 2015). If the skip-gram word representation efficiently captures semantic representation in the human brain, it should also provide an effective feature space for word-based decoding to recover richer and more complex semantic perceptions from individual brain activity.

To test this idea, we introduce a word-based decoding technique that uses skip-gram word representations. Importantly, our decoding model does not learn the association between brain activity and words per se. Instead, it learns the association between brain activity and a low-dimensional feature space that captures the relational structure of tens of thousands of words. This enables us to dramatically increase the number of potential words used for decoding without increasing the model complexity. In this paper, we demonstrate the validity and advantages of our decoding technique using participants' fMRI responses while viewing natural movie scenes. In particular, we focus on the following three questions. (1) Is it possible to decode the perceptual contents induced by movie scenes into different types of content words, including not only objects (nouns) and actions (verbs), but also impressions (adjectives)? (2) How accurately can the decoder estimate words that were never presented to it as part of a training dataset? (3) Do the decoded contents co-vary with the inter-individual differences in perceptual experiences?
2. Material and methods

2.1. Participants

We scanned six healthy participants (P1–P6; age 26–41; 1 female) with normal or corrected-to-normal vision. P1 was an author of the study. Informed consent was obtained from all of the participants. The experimental protocol was approved by the ethics and safety committees of the National Institute of Information and Communications Technology (NICT).
2.2. fMRI data collection

Functional scans were collected on a 3T Siemens TIM Trio scanner (Siemens, Germany) using a 32-channel Siemens volume coil and a multiband gradient-echo EPI sequence (Moeller et al., 2010; TR = 2000 ms; TE = 30 ms; flip angle = 62°; voxel size = 2 × 2 × 2 mm; matrix size = 96 × 96; FOV = 192 × 192 mm; multiband factor = 3). Seventy-two axial slices covered the entire cortex. Anatomical data were collected using a T1-weighted MPRAGE sequence (TR = 2530 ms; TE = 3.26 ms; flip angle = 9°; voxel size = 1 × 1 × 1 mm; matrix size = 256 × 256; FOV = 256 × 256 mm) on the same 3T scanner.
2.3. Experimental design

The movie stimuli consisted of 298 clips from TV commercials that were selected from the database of Japanese TV commercials managed by the TEMS Corp. (Tokyo, Japan) and NTT DATA Corp. (Tokyo, Japan). These commercials were broadcast nationwide in Japan between 2009 and 2015. The commercial clips included a wide variety of product categories, including food, drink, clothing, appliances, cars, housing, insurance, and amusement facilities. The length of the commercial clips was typically around 15 to 30 s. To create the experimental movie stimuli, the original commercial clips were adjusted in size and then sequentially concatenated in a pseudo-random order. We made 13 non-overlapping movie clips ranging from 400 to 662 s in length. The individual movie clips were displayed in separate scans. Twelve of the clips were presented once each to collect a dataset for model training (training dataset; 7,232 s in total). The remaining clip was presented four times in four separate scans to collect a dataset for model testing (test dataset; 1,560 s in total).

Participants viewed the visual stimuli displayed on a projector screen inside the scanner (23.8 × 13.5 degrees of visual angle at 30 Hz) and listened to the audio stimuli through MR-compatible headphones. The participants were given no explicit task and were instructed to watch the clips naturally, as though they were watching commercials on TV in everyday life. The brain data from individual participants were collected in 2–4 separate recording sessions over 1–4 days.
2.4. fMRI data preprocessing

Motion correction in each functional scan was performed using the Statistical Parametric Mapping toolbox (SPM8, http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). All volumes were aligned to the first image from the first functional run for each participant. Low-frequency voxel response drift was estimated using a median filter with a 120 s window and subtracted from the signal. The response of each voxel was then normalized by subtracting the mean response and scaling it to unit variance. We used FreeSurfer (Dale et al., 1999; Fischl et al., 1999) to identify cortical surfaces from the anatomical data and register them to the voxels of the functional data. All voxels identified within the whole cortex for each participant were used for the analysis (P1, 68,942 voxels; P2, 61,843 voxels; P3, 51,474 voxels; P4, 69,899 voxels; P5, 67,060 voxels; P6, 61,091 voxels).
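For illustration, a minimal sketch of the drift-removal and normalization step described above is given below. It is not the code used in the study; the array name `bold` and the helper `detrend_and_zscore` are hypothetical, and a (time × voxels) array sampled at TR = 2 s is assumed.

```python
import numpy as np
from scipy.ndimage import median_filter

def detrend_and_zscore(bold, tr=2.0, window_s=120.0):
    """bold: (time, voxels) array of motion-corrected responses."""
    # Estimate slow drift per voxel with a running median over a 120-s window.
    win = int(round(window_s / tr))
    drift = median_filter(bold, size=(win, 1), mode="nearest")
    detrended = bold - drift
    # Subtract the mean response and scale each voxel to unit variance.
    mu = detrended.mean(axis=0, keepdims=True)
    sd = detrended.std(axis=0, keepdims=True)
    return (detrended - mu) / np.maximum(sd, 1e-8)
```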
2.5. Text data preprocessing

We used a text corpus from the Japanese Wikipedia dump of January 11, 2016 (http://dumps.wikimedia.org/jawiki) to learn a word feature space. From the 2,016,021 articles in the corpus, we selected the 633,477 articles that contained more than 50 words to use as the text dataset. All Japanese texts in the dataset were segmented into words and lemmatized using MeCab (http://taku910.github.io/mecab), open-source software for Japanese text segmentation and part-of-speech analysis that uses conditional random fields for sequential segmentation (Lafferty et al., 2001). We used a custom-made Japanese dictionary as a vocabulary database for the segmentation and part-of-speech analysis. The dictionary was made by combining the words from the titles of Japanese Wikipedia articles with the Japanese dictionary published by the Nara Institute of Science and Technology (http://sourceforge.jp/projects/naist-jdic). Only nouns, verbs, and adjectives were used for the following analysis; the other parts of speech were discarded.
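As an illustration of this segmentation step, the sketch below uses the mecab-python3 binding to keep only nouns, verbs, and adjectives and return their lemmas. It is not the authors' pipeline, and it assumes an IPADIC-style feature layout (part of speech in the first feature field, lemma in the seventh), which differs slightly from the custom NAIST-jdic-based dictionary described above.

```python
import MeCab  # mecab-python3 binding; assumes a system MeCab dictionary is installed

KEEP = {"名詞", "動詞", "形容詞"}  # nouns, verbs, adjectives

def content_lemmas(text):
    """Return lemmas of the nouns, verbs, and adjectives in a Japanese sentence."""
    tagger = MeCab.Tagger()
    lemmas = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            break
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # fields[0] is the part of speech; fields[6] is the lemma (IPADIC-style layout).
        if fields and fields[0] in KEEP:
            lemmas.append(fields[6] if len(fields) > 6 and fields[6] != "*" else surface)
    return lemmas
```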
2.6. Skip-gram vector space

The skip-gram algorithm was originally developed to learn a high-dimensional word vector space based on local (nearby) word co-occurrence statistics in natural language texts (Mikolov et al., 2013a). Although the original study of the skip-gram algorithm used English corpora for learning (Mikolov et al., 2013a), follow-up studies have demonstrated that the skip-gram algorithm performs well for NLP problems in other languages, including Japanese (e.g., Sakahara et al., 2014; Wang and Ittycheriah, 2015).

On the basis of the lemmatized word data, we constructed a skip-gram latent word vector space using the gensim Python library (Rehurek and Sojka, 2010). The training objective of the skip-gram algorithm is to obtain latent word representations that enable accurate prediction of the surrounding words given a word in a sentence (Mikolov et al., 2013a). More formally, given a sequence of training words w_1, w_2, ..., w_T, the skip-gram algorithm seeks a k-dimensional vector space that maximizes the average log probability, given as

\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),
\]

where c is the size of the training window, which corresponds to the number of to-be-predicted words before and after the central word w_t. The basic formulation of p(w_{t+j} | w_t) is the softmax function

\[
p(w_{t+j} \mid w_t) = \frac{\exp\left(\langle v_{w_{t+j}}, v_{w_t} \rangle\right)}{\sum_{i=1}^{N} \exp\left(\langle v_{w_i}, v_{w_t} \rangle\right)},
\]

where v_{w_i} is the vector representation of w_i, N is the number of words in the vocabulary, and ⟨v_1, v_2⟩ indicates the inner product of vectors v_1 and v_2. However, because this formulation has a high computational cost, we used the "negative sampling" technique (Mikolov et al., 2013a; number of negative samples = 5) as a computationally efficient approximation of the softmax function.

We used a vector dimensionality of k = 100 and a window size of c = 10. To improve the reliability of the learning and restrict the vocabulary size to around 100,000 words, words that appeared fewer than 178 times in the corpus were excluded from the analysis; nevertheless, the size of the vocabulary had little effect on our results. In addition, to accelerate learning and improve the quality of the learned vectors of rare words, we used a procedure for subsampling frequent words (Mikolov et al., 2013a).
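The following minimal gensim sketch shows how such a skip-gram space can be trained with the hyperparameters reported above; it is an illustration rather than the authors' script. The corpus file name, the output path, and the subsampling threshold are placeholders, and the parameter names follow the gensim 3.x API (gensim 4 renames size to vector_size).

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One lemmatized, whitespace-separated sentence per line (placeholder file name).
sentences = LineSentence("wikipedia_lemmas.txt")

model = Word2Vec(
    sentences,
    sg=1,           # skip-gram rather than CBOW
    size=100,       # vector dimensionality k = 100
    window=10,      # context window c = 10
    negative=5,     # negative sampling with 5 noise words
    min_count=178,  # drop words appearing fewer than 178 times
    sample=1e-5,    # subsampling of frequent words (illustrative threshold)
)

# model.wv[word] then returns the 100-dimensional vector of an in-vocabulary word.
model.wv.save_word2vec_format("skipgram_100d.txt")  # placeholder output path
```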
SC
RI PT
177
Each movie scene was manually annotated using natural Japanese language. The annotations were given at 1-s intervals to obtain precise descriptions of movie scenes that
187
changed dynamically from second to second. The annotators were native Japanese speakers (22
188
males and 26 females; age 18–51) who were neither the authors nor the MRI participants,
189
except that S5 and S6 provided the annotations of ≤150 scenes each in the training movies.
190
They were instructed to annotate each scene with descriptive sentences using more than 50
191
Japanese characters (see Figure 1A and Figure S1 for examples). We randomly assigned 4–7
192
annotators for each scene to reduce the potential effect of personal bias.
194
EP
AC C
193
TE D
186
Many of the movie scenes included text captions: they were present in 3,442 (45.2%) of
195
the 7,622 one-second scenes. In 1,420 (41.3%) of these 3,442 one-second scenes, the text
196
captions were contained in the descriptive sentences of the annotators. These text captions
197
might have affected the perceptual experiences of participants and annotators. However, we do
11
ACCEPTED MANUSCRIPT
not believe that this was an issue in the present study because we aimed to visualize various
199
perceptual experiences using our word-based decoding method, regardless of whether the
200
experiences were verbal or nonverbal.
RI PT
198
201 202
We also collected annotations from the MRI participants for the test movie scenes after the participants had viewed all of the movies in the scanner. The instructions for the annotation
204
were the same as above. We randomly assigned 130 scenes to each of the 6 participants without
205
repetition and obtained 2 annotations per 1-s scene.
207
2.8.
208
Scene-vector construction
M AN U
206
SC
203
Word vector representations for individual movie scenes were computed from the manual scene annotations using the learned word vector space. Each annotation for a given scene was
210
decomposed into nouns, verbs, and adjectives using the same method as described earlier (see
211
2.5. Text data preprocessing). Individual words were projected into the corresponding
212
100-dimensional word vector space (see 2.6. Skip-gram vector space). The word vectors were
213
averaged within each annotation. Then, for each scene, all vectors that were obtained from the
214
different annotations were averaged. This procedure yielded one vector representation for each
215
second of each movie. Finally, to match the sampling interval to the fMRI data (2 s) for the
216
following analysis, the vectors for individual seconds were averaged over two successive
217
seconds. These averaged vectors are referred to as annotation scene vectors.
AC C
EP
TE D
209
218 219
2.9.
Model fitting
12
ACCEPTED MANUSCRIPT
220
Our decoding model predicts scene vectors by a weighted linear summation of voxel responses. Specifically, a series of 100-dimensional annotation scene vectors in S movie scenes,
222
denoted by V, were modeled by a series of responses in the set of T voxels within the whole
223
cortex, denoted by R, times the linear weight W, plus isotropic Gaussian noise ε: + = ,-. + 0
RI PT
221
We used a set of linear temporal filters to model the slow hemodynamic response and its
225
coupling with brain activity (Nishimoto et al., 2011). To capture the hemodynamic delay in the
226
responses, the S × 3T matrix R was constructed by concatenating three sets of T-dimensional
227
response vectors with temporal shifts of 2, 4, and 6 s. The 3T × 100 weight matrix WE was
228
estimated using an L2-regularized linear least-squares regression. A regularized regression can
229
obtain good estimates even for models containing a large number of regressors (Çukur et al.,
230
2013; Huth et al., 2012). The estimation of the scene vector from newly measured responses
231
was conducted by multiplying the responses evoked in T voxels by WE. We refer to the scene
232
vector as a decoded scene vector.
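The delayed-regressor design and the L2-regularized fit described in this section can be sketched as follows (hypothetical names and toy data, not the authors' implementation; scikit-learn's Ridge stands in for the regularized least-squares solver).

```python
import numpy as np
from sklearn.linear_model import Ridge

def make_delayed(responses, delays=(1, 2, 3)):
    """Pair each stimulus time point with the voxel responses 2, 4, and 6 s later
    (1-3 samples at TR = 2 s), concatenated along the voxel axis."""
    n_t, n_vox = responses.shape
    out = np.zeros((n_t, n_vox * len(delays)))
    for i, d in enumerate(delays):
        out[:n_t - d, i * n_vox:(i + 1) * n_vox] = responses[d:]
    return out

rng = np.random.default_rng(0)
responses = rng.standard_normal((3556, 500))      # toy stand-in for (time, voxels)
scene_vectors = rng.standard_normal((3556, 100))  # toy stand-in for annotation scene vectors

model = Ridge(alpha=1.0, fit_intercept=False)     # alpha tuned by resampled validation in the paper
model.fit(make_delayed(responses), scene_vectors)
decoded = model.predict(make_delayed(responses))  # applied to held-out responses in practice
```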
M AN U
TE D
EP
234
The training dataset consisted of 3,616 samples (7,232 s), but the first 5 samples (10 s) of
AC C
233
SC
224
235
each block were discarded to avoid responses from the non-stimulus period. Hence, 3,556
236
samples were used for the model fitting procedure. In addition, to estimate the regularization
237
parameter, we divided the training dataset into two subsets by random resampling: 80% of the
238
samples were used for model fitting and the remaining 20% were used for model validation.
239
This random resampling procedure was repeated 10 times. We determined the optimal
240
regularization parameter for each subject using the mean value of the Pearson’s correlation
13
ACCEPTED MANUSCRIPT
241
coefficient between decoded and annotation scene vectors for the 20% of validation samples.
242 The test dataset consisted of 190 samples (380 s) after discarding the first 5 samples (10
RI PT
243
s) in each block. The fMRI signals for the four stimulus repetitions were averaged to improve
245
the signal-to-noise ratio. This dataset was not used for the model fitting or the parameter
246
estimation, but instead was used to evaluate the final prediction accuracy for each voxel. The
247
prediction accuracy was measured by the Pearson’s correlation coefficient between the decoded
248
and annotation scene vectors.
250 251
2.10. Word-based decoding
M AN U
249
SC
244
For each movie scene, we measured the similarity between each word in the vocabulary and each decoded movie scene vector using Pearson’s correlation coefficient in the
253
100-dimensional skip-gram vector space. We refer to the correlation coefficient as a word score.
254
Words with higher scores were regarded as more likely to reflect perceptual contents. We
255
restricted the size of the vocabulary to the 10,000 words that appeared most frequently in the
256
Wikipedia text dataset. Word scores were estimated for three parts of speech, nouns, verbs, and
257
adjectives, consisting of 9,320, 588, and 92 words, respectively.
259 260
EP
AC C
258
TE D
252
2.11. Word-wise decoding accuracy We conducted two analyses to assess the performance of our decoding model at the
261
single-word level. The first was a word-wise correlation analysis. For a given word, we
262
calculated the time series of word scores and the time series of annotation word scores which
14
ACCEPTED MANUSCRIPT
were defined by Pearson’s correlation coefficients between the word vector and annotation
264
scene vectors. Then, we calculated the Pearson’s correlation coefficient between these two
265
series and regarded it as the word-wise correlation coefficient of that word. We evaluated the
266
word-wise correlation for each of the 10,000 words we used for the word-based decoding unless
267
otherwise noted. The statistical significance of the word-wise correlations for each part of
268
speech was tested using Wilcoxon’s signed-rank test (p < 0.05); the null hypothesis was that
269
correlation coefficients were derived from a distribution with a median value of zero.
SC
RI PT
263
271
M AN U
270
The second analysis was a word-wise receiver operating characteristic (ROC) analysis (Huth et al., 2016b). After calculating the word score for a given word (see Word-based
273
decoding), we gradually increased the detection threshold from –1 to 1. For each threshold, we
274
counted the number of false positive detections (scenes for which the word score was higher
275
than the threshold but the word was not present in any of the annotations) and true positive
276
detections (scenes for which the word score was higher than the threshold and the word was
277
actually present in the annotations). Consequently, we obtained a function of the true positive
278
rate against the false positive rate across all thresholds to produce the area under the ROC curve
279
(AUC). An AUC value close to 1 indicates high decoding accuracy, whereas an AUC value
280
close to 0.5 indicates low decoding accuracy. We also varied the presence/absence threshold
281
that determined whether a given word was present or absent in each movie scene. We used three
282
thresholds: one, two, or three annotations. For example, the threshold of two annotations means
283
that a given word was regarded as present in a given scene if the word appeared in ≥2
284
annotations in that scene; otherwise, the word was regarded as absent.
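For illustration, the word-wise ROC analysis of 2.11 can be computed as below (a sketch with hypothetical names, not the authors' code), using scikit-learn's AUC on the time series of word scores against binary presence labels derived from the annotations.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def word_auc(scores, n_annotations_with_word, presence_threshold=1):
    """AUC for one word across test scenes.

    scores: (n_scenes,) word scores decoded from brain activity.
    n_annotations_with_word: (n_scenes,) count of annotations containing the word.
    """
    labels = np.asarray(n_annotations_with_word) >= presence_threshold
    if labels.all() or not labels.any():
        return np.nan  # AUC is undefined when only one class is present
    return roc_auc_score(labels, scores)
```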
AC C
EP
TE D
272
15
ACCEPTED MANUSCRIPT
285 286
In this ROC analysis, the infrequent words that were present in the annotations of less than five scenes were excluded to improve the quality of the analysis. The words that were
288
present in more than 97 two-second scenes (more than half of all scenes) were also excluded
289
from the analysis, because such words seemed to be highly common words that provided little
290
information about the scene (e.g., “be”). Using the selection criteria above, we selected 3,349
291
words for the ROC analysis.
293
M AN U
292
SC
RI PT
287
The statistical significance of the word-wise ROC analysis for single words was tested using a randomization test (p < 0.05). After randomly shuffling the presence/absence labels of a
295
given word in individual scenes, an AUC value was computed. This process was repeated 1,000
296
times to obtain a null distribution of AUC values. A p-value was computed as the fraction of the
297
null distribution that was higher than the original AUC value. The statistical significance of the
298
word-wise ROC analysis for each part of speech was tested using Wilcoxon’s signed-rank test
299
(p < 0.05); the null hypothesis was that AUC values minus 0.5 were derived from a distribution
300
with a median value of zero.
302 303
EP
AC C
301
TE D
294
2.12. Japanese–English translation For this paper, we translated the Japanese words (i.e., the language of the annotations and
304
corpus) into the corresponding English words. To achieve neutral word-by-word translation
305
without selection bias, we semi-automatically translated the words using Google Translate
306
(https://translate.google.com/). Specifically, we adopted the first candidate word in the
16
ACCEPTED MANUSCRIPT
Japanese–English translation returned from Google Translate. Occasionally, the part-of-speech
308
of an English word returned by Google Translate did not match with that of the original
309
Japanese word. In such cases, we manually translated it on the basis of another standard
310
Japanese–English dictionary.
RI PT
307
AC C
EP
TE D
M AN U
SC
311
17
ACCEPTED MANUSCRIPT
312
3.
313
3.1.
3. Results

3.1. Word-based decoding could recover perceptual contents varying across scenes

Our decoding model (Figure 1A) was trained to learn the association between movie-evoked brain activity and perceptual experiences via distributed word representations. We recorded the brain activity of six participants while they watched 147 minutes of movies (TV ads). The presented movie scenes were chunked into one-second clips, and four or more individual annotators described each clip using natural language (Figure 1A; see also Figure S1 for more annotation examples). The scene annotations were transformed into vector representations via a skip-gram word vector space (annotation scene vectors). We then performed L2-regularized linear regressions to obtain the linear weight matrix between brain activity and scene vectors (see Material and methods for further details). The optimal weights were obtained separately for individual participants. We used a training dataset consisting of 3,616 time samples to train the model, and a separate test dataset consisting of 195 time samples to quantify the performance of the trained model. Although the sample size of the test dataset was small, the annotation scene vectors in the test dataset broadly covered the representational space of the annotation scene vectors in the training dataset (Figure S2).

The scene vectors predicted from brain activity (decoded scene vectors) were substantially correlated with the annotation scene vectors in the same test dataset (the average Pearson's r for the six individual participants ranged from 0.40 to 0.45; p < 0.0001). Although the correlations varied from scene to scene even within each movie clip, the temporal profiles of the correlations of individual participants were similar throughout the whole movie (Figure S3).

The decoded scene vectors were then used to recover the perceptual contents induced by individual movie scenes in the form of words (Figure 1B). We calculated the correlation coefficients between decoded scene vectors and individual word vectors as word scores. Based on the word scores, the words that were likely to reflect perceptual contents were estimated; words with higher word scores were regarded as being more likely (see Material and methods for further details). We performed word score estimations for the 10,000 most frequent words (restricted to nouns, verbs, and adjectives) in the Wikipedia corpus. The word scores were ranked separately for these three parts of speech, which were considered to reflect the perceptual contents associated with objects, actions, and impressions, respectively.

The most likely words estimated by the decoder varied across individual scenes and across individual participants (Figure 2). Figure 2A shows the most likely words estimated from a single participant for a single scene, in which a man is looking at a tablet computer. Likely words were, for example, "display" and "system" as nouns, "represent" as a verb, and "wide" as an adjective, which adequately explain the visual contents of the scene. Figure 2B shows the estimation for another scene. In this scene, a woman is soaking in a bath and a Japanese caption is presented in the center. Likely words were, for example, "female" and "face" as nouns, and "cute" and "young" as adjectives. Figure 2C is the same as Figure 2B, but the estimation is from another participant. Likely words were, for example, "comment" and "catchphrase" as nouns. The words in Figure 2B are more related to the perception of the woman, whereas the words in Figure 2C are more related to the perception of the caption.

To test the validity of our decoding method at the single-word level, we conducted two analyses using the decoded scene vectors averaged across all participants. First, we conducted a word-wise correlation analysis. For each word and scene in the test dataset, we calculated annotation word scores: the correlation coefficients between the annotation scene vectors and the individual word vectors. The annotation word scores reflect the likelihood of words based on the annotators' scene descriptions, thereby forming an approximate ground truth for the word scores. We quantified the word-wise correlation by calculating the correlation coefficients between word scores and annotation word scores across the scenes in the test dataset (see Material and methods for details; Figure 3A). The correlation coefficients were significantly higher than zero for all three parts of speech (Wilcoxon signed-rank test, p < 0.0001; see Figure S4 for individual participants), suggesting that our decoded results reflect perceptual experiences regarding objects, actions, and impressions.

Second, we conducted a word-wise receiver operating characteristic (ROC) analysis (Huth et al., 2016b). For individual words, we obtained the time series of word scores and those of word occurrences in the scene annotations throughout the test dataset (Figure 3B shows the example "Building"). The word occurrence was evaluated using three word presence/absence thresholds; a word was regarded as being present in a given scene when the word appeared in at least n annotations (n = 1, 2, or 3). Using these time series with a particular detection threshold of word scores, we counted the number of false positive detections (scenes where the word score was higher than the threshold but the word was not present in any of the annotations) and true positive detections (scenes where the word score was higher than the threshold and the word was actually present in the annotations). Then, we drew an ROC curve given as a function of the true positive rate against the false positive rate while the detection threshold gradually increased from –1 to 1 (e.g., Figure 3C). Finally, we obtained the area under the ROC curve (AUC) to quantify the degree to which the presence or absence of a given word in the annotations could be predicted by word scores (see Material and methods for details). For the example word "Building," the AUC values were significantly higher than the level of chance (Figure 3C; randomization test, p < 0.001). The distribution of AUC values for all words was significantly higher than 0.5 for all of the tested parts of speech (Figure 3D; Wilcoxon signed-rank test, p < 0.0005; see Figure S5 for individual participants; see Table S1 for the most decodable words). In addition, the AUC values of the example word "Building" increased (from 0.79 to 0.88) as the presence/absence threshold of the word increased (Figure 3C). This indicates that a word that was commonly used by different annotators tended to have a higher word score for that scene. This tendency was preserved across all words (Figure S6). Together, these results indicate that our decoding successfully estimated the individual words related to objects, actions, and impressions consistent with the annotators' scene descriptions.
3.2. The decoder could infer words that had never appeared during its training

Our decoder was able to score novel words that had never appeared in the annotation dataset used for model training (for example, the words indicated by asterisks in Figure 2). We tested our decoder's ability to estimate these novel words. From the 10,000 most frequent words in the Wikipedia corpus, we selected 5,119 words that were not present in the training dataset. We evaluated the word-wise correlations for this subset of words (Figure 4A; see Figure S7 for individual participants) and found that all of the coefficients were significantly higher than zero (Wilcoxon signed-rank test, p < 0.0001). We then evaluated the word-wise AUC values using the words that were present only in the test dataset (n = 270; Figure 4B; see Figure S7 for individual participants). Note that because such words were rare, we extracted them from the entire vocabulary in the corpus (n = 100,035) rather than from the 10,000 words. We found that the AUC values were significantly higher than 0.5 (Wilcoxon signed-rank test, p < 0.0001). Thus, the decoder made successful estimations even for words that had never appeared in the training dataset.
3.3. Inter-individual variability in the decoded contents explained the variability in the annotators' scene descriptions

The results shown in Figure 2B and C raise the possibility that the contents our decoder recovered from brain activity may reflect the inter-individual variability in perceptual experiences induced by movie scenes. To test this possibility, we separately evaluated the inter-individual variability in decoded scene vectors and that in annotation scene vectors, and examined the association between them. The variability in each movie scene was quantified by the pairwise correlation distance between scene vectors across all possible pairs of experimental participants or annotators. The pairwise distance was then averaged across all pairs and transformed into z scores. The black and pink traces in Figure 5A show the temporal profiles of the normalized mean pairwise distance for decoded and annotation scene vectors, respectively. These two series of pairwise distances were significantly correlated (pink dots in Figure 5B; Pearson's r = 0.27, p = 0.0001; Spearman's ρ = 0.21, p = 0.003), indicating that we could infer the inter-individual variability of scene-evoked perceptions from the contents decoded from brain activity.

Although the two sets of inter-individual variability were derived from different groups, i.e., experimental participants and annotators, the association of inter-individual variability was also observed in data derived from the same group. We also collected scene annotations from the experimental participants themselves and calculated a new set of annotation scene vectors. We found that the series of pairwise distances between participant-derived annotation scene vectors (blue trace in Figure 5A) was significantly correlated with that of decoded scene vectors (blue dots in Figure 5B; Pearson's r = 0.27, p = 0.0001; Spearman's ρ = 0.23, p = 0.001).

One might argue that because the pairwise distance was likely to increase during scene transitions (particularly when one movie clip moved on to another), the clip transitions could yield a quasi-significant correlation between the pairwise distances. To exclude this possibility, we removed the samples that included clip transitions and re-calculated the correlation between the variability of decoded and annotation scene vectors. We found that a significant correlation remained (Pearson's r = 0.25, p = 0.001; Spearman's ρ = 0.21, p = 0.006; Figure S8), indicating that clip transitions had little effect on the relationship between the pairwise distances. Taken together, the inter-individual variability in decoded contents is likely to be closely related to that in scene descriptions.

To further examine whether we could decode the experiences of a specific participant from his or her brain activity, we conducted a binomial classification analysis in which the identity of each participant was examined according to the correlation between the participant's annotations and the decoded scene vectors. For each scene, we used two annotation scene vectors obtained from the annotations of two participants (PA and PB). We also used decoded scene vectors for each participant. For each scene, if the correlation coefficient between PA's decoded and annotation vectors was higher than that between PA's decoded and PB's annotation vectors, the identification was counted as correct; if not, the identification was counted as wrong. When we evaluated the rate of correct identifications for each participant (130 scenes per participant) across all self-annotated scenes, the rate was significantly higher than the level of chance for only one participant (P4) (Figure S9; binomial test, p < 0.05). Therefore, using our current data, the contents decoded from the individual participants could not be used to distinguish their subjective experiences as estimated from their annotations.
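As an illustration of the variability measure used in this section (a sketch with hypothetical names, not the authors' code): for one scene, compute correlation distances between the scene vectors of all pairs of individuals and average them.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_corr_distance(scene_vectors_by_individual):
    """scene_vectors_by_individual: (n_individuals, 100) array for one scene."""
    dists = [1.0 - np.corrcoef(a, b)[0, 1]
             for a, b in combinations(scene_vectors_by_individual, 2)]
    return float(np.mean(dists))

# Applying this per scene to decoded and to annotation scene vectors, then
# z-scoring each series, gives the two traces whose correlation is reported above.
```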
3.4. Higher visual areas contributed to our decoding

To investigate which cortical areas our decoder extracted perceptual information from, we estimated the informative voxels from the decoding-model weights and visualized those voxels on the cortical surface of each participant. Our decoding model consisted of three sets of time-lag terms to capture the hemodynamic delay of fMRI responses (see Material and methods for details); therefore, the size of the weight matrix was N × 3K, where N is the number of voxels and K is the vector dimensionality (= 100). We first averaged the weights across the three time lags, leading to an N × K matrix. The voxels with high decoding weights were considered to contribute computationally to the decoding. However, those voxels should not simply be regarded as voxels signaling rich perceptual information, because response covariance across voxels may produce voxels with high decoding weights but uninformative signals (Haufe et al., 2014). To avoid such dissociation, we corrected the decoding-model weights in the manner proposed by Haufe et al. (2014): the original weights were left-multiplied by the covariance matrix computed from the voxel response time courses. Then, we took the absolute values of the corrected weights and selected the maximum for each individual voxel, leading to an N-dimensional vector. Finally, the vector was z-scored for each participant and projected onto a cortical surface. We regarded voxels with higher z-scores as more informative.

Figure 6 shows the normalized absolute weights on the cortical surface of a single participant (P1). Although we used brain activity from all cortical voxels for our decoder, the informative voxels were mainly located in cortical regions that include the fusiform gyrus, parahippocampal gyrus, occipitotemporal areas, and posterior superior temporal sulcus, which are part of the higher visual areas. Similar results were observed for the other participants (Figure S10). These areas include previously reported object-selective regions such as the fusiform face area (Kanwisher et al., 1997), the extrastriate body area (Downing et al., 2001), and the lateral occipital complex (Malach et al., 1995), and scene-selective regions such as the parahippocampal place area (Epstein and Kanwisher, 1998). Therefore, consistent with the results from previous decoding studies (Huth et al., 2016b; Stansbury et al., 2013), our decoder read the perceptual contents primarily from the information conveyed across these higher visual areas.
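A compact sketch of this weight correction and voxel-scoring procedure is shown below (hypothetical names; a sketch of the Haufe et al. (2014) correction as described above, not the authors' code). For tens of thousands of voxels the full covariance matrix is large; it is written out explicitly here only for clarity.

```python
import numpy as np

def informative_voxel_scores(weights, responses):
    """weights: (n_voxels, k) decoding weights averaged over time lags.
    responses: (time, n_voxels) voxel time courses used for training."""
    # Haufe et al. (2014): turn decoding weights into activation-like patterns
    # by left-multiplying with the voxel response covariance.
    cov = np.cov(responses, rowvar=False)  # (n_voxels, n_voxels)
    corrected = cov @ weights              # (n_voxels, k)
    # Maximum absolute corrected weight per voxel, then z-scored across voxels.
    score = np.abs(corrected).max(axis=1)
    return (score - score.mean()) / score.std()
```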
4. Discussion

We developed a new word-based decoding technique that recovers perceptual contents from fMRI responses to natural movies via a word embedding space. Our decoder successfully visualized movie-induced perceptions of objects, actions, and impressions in the form of words, and showed generalizability in that it could estimate words that it had not seen during training. Moreover, the decoded contents reflected the inter-individual variability of perceptual experiences. These results suggest that our decoding technique provides a useful tool to recover naturalistic perceptual experiences varying across scenes and across individuals.

We decoded diverse perceptual experiences induced by continuous natural movies. Our work extends previous studies that decoded semantic contents induced by line drawings or static images (Pereira et al., 2011; Stansbury et al., 2013) to more complex real-life movie materials (TV ads). This study also extends previous studies that demonstrated word-based decoding of up to a few thousand nouns and verbs (Güçlü and van Gerven, 2015; Huth et al., 2016b; Pereira et al., 2011; Stansbury et al., 2013) to more comprehensive natural language descriptions using tens of thousands of words, including adjectives and novel words that did not appear in the training dataset. Taken together, these quantitative and partly qualitative extensions demonstrate the potential applicability of our technique to real-life uses such as neuromarketing. In addition, the use of natural language annotations to capture perceptual contents during movie viewing is an advantage of our decoding model. A recent study reported that natural language annotations projected onto a word embedding space were effective for characterizing movie-evoked brain responses varying across movie scenes (Vodrahalli et al., 2016). Therefore, natural language annotations of natural movie scenes may play a key role in our decoding technique, allowing us to effectively model the association between movie-evoked complex perceptions and brain responses.

Existing decoding models incorporate word variables that can be roughly sorted into two types: binary-based (Huth et al., 2016b) and feature-based (Pereira et al., 2011; Stansbury et al., 2013). Binary-based variables are represented by vectors with values of one or zero as categorical word labels corresponding to visual stimuli or to objects and actions in viewed scenes. Feature-based variables are represented by vectors that reflect a latent intermediate representation of words learned from word statistics (typically, co-occurrence statistics) in large-scale text data. For word-wise decoding, one of the most important benefits of using feature-based variables is improved generalization capability (Pereira et al., 2011). A feature-based variable model learns to associate brain responses with the dimensions of a word feature space rather than with words per se; therefore, not every categorical word label used in the test phase needs to be present during model training. The decoding method introduced in the present study is an extension of such feature-based variable models. Consistent with previous decoding studies (Pereira et al., 2016, 2011), our decoder showed high decoding accuracy for novel words that were absent in the training dataset (Figure 4), indicating its generalization capability. This generalization capability was also evident in the decoding of complex perceptions represented by the natural language annotations and the large vocabulary size (10,000 words). This suggests that our decoder is able to generalize more comprehensively than previous models.
The feature-based variables we used were derived from a state-of-the-art NLP algorithm (skip-gram; Mikolov et al., 2013a). In contrast, previous feature-based variable models incorporated conventional NLP algorithms, typically latent Dirichlet allocation (Blei et al., 2003). The skip-gram vector representation shows higher performance than other model representations, not only in NLP tasks (Mikolov et al., 2013a, 2013b), but also in modeling brain responses to natural stimuli (Güçlü and van Gerven, 2015; Nishida et al., 2015; but see Huth et al., 2016a). Although we did not directly compare the decoding accuracy of the skip-gram model with that of other NLP-based models using our data, this property may make it easier to capture the complex association of brain activity with a larger set (tens of thousands) of words for word-based decoding.

Using such a large set of potential words for decoding allows us to estimate perceptual contents related not only to objects (nouns) and actions (verbs) but also to impressions (adjectives). Strictly speaking, the word-wise ROC analysis indicated that some nouns, verbs, and adjectives were not decodable (i.e., their AUC values were not significantly higher than 0.5; Figure 3D). However, although a previous study used a similar word-wise ROC analysis for the evaluation of word decodability (Huth et al., 2016b), our ROC analysis was more conservative because our decoder covered a large language feature space, including a larger vocabulary and richer scene annotations using natural language. This large feature space meant that the occurrence of a specific word in the annotations was rarer, and thus our ROC analysis was likely to have lower statistical power and produce lower AUC values. Although we used the ROC analysis for compatibility with Huth et al. (2016b), the alternative word-wise correlation analysis revealed that almost all (99.3%) of the words had correlation coefficients significantly higher than 0 (Figure 3A). Therefore, we believe that our decoder successfully estimated many words relevant to the perception of objects, actions, and impressions.

Previous studies, in contrast, have primarily aimed to decode the words relevant only to object and action perceptions (Huth et al., 2016b; Pereira et al., 2011; Stansbury et al., 2013). Impression perception, which is more subjective than object and action perception, may considerably extend the possible applications of word-based decoding. A potential application of impression decoding is to evaluate movie contents, such as TV commercials and promotional films, on the basis of decoded contents. We believe that this application could contribute to progress in the field of neuromarketing (Plassmann et al., 2015).

The neuromarketing literature has suggested that a brain decoding framework potentially offers more reliable information for predicting human economic behavior than conventional questionnaire-based evaluations (Berns and Moore, 2012). Information decoded from brain activation in a small group of people has been shown to be sufficient to predict mass behavior such as music purchasing (Berns and Moore, 2012) and audience ratings of broadcast TV content (Dmochowski et al., 2014). These studies were able to directly associate brain activity with individuals' economic preferences or behavioral tendencies. As an extension of this line of research, our decoding technique is able to visualize the cognitive contents of naturalistic experiences, including subjective impressions. Such a content-based estimation of cognitive information may provide a powerful framework to estimate economic behavior more effectively.

We found that the inter-individual variability in the decoded contents of each movie scene was correlated with that in annotators' and participants' scene descriptions (Figure 5), although we failed to distinguish one person's subjective experience from another's using our decoding procedure (Figure S7). Note that even though the number of participants was relatively small (n = 6), we observed a significant relationship between the inter-individual variability of decoded and annotated contents. This suggests that the inter-individual variability in decoded contents is a robust measure for evaluating the inter-individual variability in perceptual experiences. A previous report showed that activation patterns in the human higher-order visual cortex were more variable across individuals when there was more inter-individual variability in the familiarity of the viewed objects (Charest et al., 2014). This suggests that the inter-individual variability of visual experiences is reflected in higher-order visual representations in the brain. Consistent with this finding, our results indicate that the inter-individual variability of semantic perceptions, even under natural visual conditions, can be decoded from brain activity via word-based decoding. To the best of our knowledge, this is the first direct evidence that word-based decoding can reflect the inter-individual variability in visual experiences.
What enabled our decoding results to reflect such inter-individual variability? A potentially important factor is that we used natural language annotations to evaluate the subjective perception of a scene (Figure 1A and Figure S1). The annotations contained rich information that captured some aspects of the inter-individual variability in subjective perceptions. It is also important that our model used an expressive word vector space that enabled us to learn the projection from the representation in the rich annotations to the representation in the brain. Our framework may thus provide a useful tool to study inter-individual differences in perceptual and cognitive representations in the brain.

The higher visual areas, including the fusiform gyrus, parahippocampal gyrus, occipitotemporal areas, and posterior superior temporal sulcus, made the greatest contributions to our decoding (Figure 6 and Figure S10). It could be argued that these areas do not cover some parts of the semantic network, such as the prefrontal cortex, defined in previous studies (Binder et al., 2009; Patterson et al., 2007). However, this is not surprising, because the contributing areas largely overlapped with those found in the previous study by Huth et al. (2016b), in which decoding was conducted using an experimental paradigm highly similar to ours. Both our experiments and those of Huth et al. (2016b) required participants to view natural movies passively in the scanner without any behavioral task. In such a situation, semantic processing relevant to visual processing may become dominant and involve higher visual areas but few other semantic areas.

Quantifying the inter-individual variability of perception may also offer potential applications, particularly in neuromarketing. TV commercials are created, for instance, with the intention of enhancing consumers' willingness to buy specific products or improving a company's brand image. To achieve this, the perceptions induced by commercial contents should be focused toward the company's intention, thereby reducing the variability of perceptions across individuals. In such a case, measuring the degree of inter-individual perceptual variability may be useful for evaluating the effectiveness of TV commercials. Indeed, a recent study reported that the favorability of viewers toward broadcast TV contents could be predicted from the inter-individual variability of the viewers' brain responses to the contents; viewers were more favorable toward a given TV scene when the variability of responses to the scene decreased (Dmochowski et al., 2014). Thus, our decoding technique again has the potential to be used as a method in neuromarketing and may contribute to developments in neuromarketing research.
Although our word-wise decoding method remedies some of the technical issues in
635
previous decoding methods, it also leaves several limitations. First, our method may make poor
636
estimations when word pairs that are distantly located in the skip-gram word vector space, such
637
as a pair of “human” and “mountain,” appear simultaneously in the annotations of a single scene.
638
Because likely words are estimated according to the distance from a single decoded scene
639
vector in each scene, it is difficult for multiple words located apart to be chosen together as
640
likely words. This could be problematic for practical applications. A potential solution for this
641
issue may be to estimate the likelihood of individual words independently of each other via the
642
one-to-one association of words with brain responses. The binary-based variable model, rather
643
than the feature-based model, may have the potential to achieve this; nevertheless, a previous
644
binary-based variable model also showed poorer decoding performance when the number of
AC C
EP
TE D
634
33
ACCEPTED MANUSCRIPT
object and action categories in a scene increased (Huth et al., 2016b). To address this issue,
646
further considerations will be needed to introduce a new modeling framework that incorporates
647
the advantages of both types of variable modeling.
RI PT
645
648 649
Second, the training of our decoding model required a large amount (>2 hours) of fMRI data from each participant, which may impose too much of a burden. For practical applications,
651
it is necessary to reduce the time that participants are constrained in the MRI scanner. A
652
possible solution for this issue may be to transfer decoders across individuals using methods to
653
align representational spaces across individual brains (Conroy et al., 2013; Haxby et al., 2011;
654
Sabuncu et al., 2010; Yamada et al., 2015). The so-called “hyperalignment” technique
655
(Guntupalli et al., 2016; Haxby et al., 2011) has recently attracted broad interest and has the
656
potential to be extensively used in fMRI studies that include brain decoding (Haxby et al., 2014;
657
Nishimoto and Nishida, 2016). Once we construct a decoder from abundant fMRI data for one
658
person, additional shorter experiments for other people may be sufficient to estimate their
659
decoders by transferring the representational spaces of the subsequent participants from that of
660
the first participant. Accordingly, this technique may reduce the experimental burden on
661
individual participants. This idea will be tested in future studies.
M AN U
TE D
EP
AC C
662
SC
650
663
Third, our decoder estimates multiple words independently, which leaves the dependency relations between the words unclear. For instance, if the decoder returns the words "man," "car," and "hit," this could indicate "A man hits a car" or "A car hits a man." Our current decoder cannot distinguish such different contents. A possible way to address this issue is to construct a decoding model that incorporates language features, including the dependency relations between words. An implicit implementation of such a model is a sentence-based decoding model that learns the association between brain responses and sentences themselves. Toward this end, we recently demonstrated such sentence-based decoding using the feature representations of a state-of-the-art deep-learning NLP algorithm (Matsuo et al., 2016; Vinyals et al., 2014). Further developments in NLP models may help decoding techniques to recover richer semantic information from brain activity.
5. Conclusions
Our new decoding framework incorporates a word embedding space and natural-language descriptions of scenes. The decoder successfully recovered movie-induced perceptions of objects, actions, and impressions, in the form of words, from fMRI signals. The decoder also showed remarkable generalizability by estimating words that were unseen during decoder training. Furthermore, the decoded contents reflected the inter-individual variability of movie-induced experiences. Because words express categorical and conceptual information in an interpretable manner, such word-based decoding should be effective for visualizing perceptual contents in the brain. Therefore, we believe that our decoding framework will contribute to the technical development of brain reading for various scientific and practical applications.
Acknowledgments
This work was supported by grants from the Japan Society for the Promotion of Science (JSPS; KAKENHI 15K16017) to S. Nishida and from JSPS (KAKENHI 15H05311 and 15H05710), NTT DATA Corporation, and NTT DATA Institute of Management Consulting, Inc. to S. Nishimoto. We also thank Ryo Yano, Masataka Kado, Naoya Maeda, Takuya Ibaraki, and Ippei Hagiwara for their help with the movie materials and fMRI data collection.
References
Anderson, A.J., Binder, J.R., Fernandino, L., Humphries, C.J., Conant, L.L., Aguilar, M., Wang, X., Doko, D., Raizada, R.D.S., 2016. Predicting neural activity patterns associated with sentences using a neurobiologically motivated model of semantic representation. Cereb. Cortex. doi:10.1093/cercor/bhw240
Berns, G.S., Moore, S.E., 2012. A neural predictor of cultural popularity. J. Consum. Psychol. 22, 154–160. doi:10.1016/j.jcps.2011.05.001
Binder, J.R., Desai, R.H., Graves, W.W., Conant, L.L., 2009. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb. Cortex 19, 2767–2796. doi:10.1093/cercor/bhp055
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
Chang, K.M.K., Mitchell, T., Just, M.A., 2011. Quantitative modeling of the neural representation of objects: How semantic feature norms can account for fMRI activation. Neuroimage 56, 716–727. doi:10.1016/j.neuroimage.2010.04.271
Charest, I., Kievit, R.A., Schmitz, T.W., Deca, D., Kriegeskorte, N., 2014. Unique semantic space in the brain of each beholder predicts perceived similarity. Proc. Natl. Acad. Sci. U. S. A. 111, 14565–14570. doi:10.1073/pnas.1402594111
Conroy, B.R., Singer, B.D., Guntupalli, J.S., Ramadge, P.J., Haxby, J.V., 2013. Inter-subject alignment of human cortical anatomy using functional connectivity. Neuroimage 81, 400–411. doi:10.1016/j.neuroimage.2013.05.009
Çukur, T., Nishimoto, S., Huth, A.G., Gallant, J.L., 2013. Attention during natural vision warps semantic representation across the human brain. Nat. Neurosci. 16, 763–770. doi:10.1038/nn.3381
Dale, A.M., Fischl, B., Sereno, M.I., 1999. Cortical surface-based analysis. I. Segmentation and surface reconstruction. Neuroimage 9, 179–194. doi:10.1006/nimg.1998.0395
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407.
Dmochowski, J.P., Bezdek, M.A., Abelson, B.P., Johnson, J.S., Schumacher, E.H., Parra, L.C., 2014. Audience preferences are predicted by temporal reliability of neural processing. Nat. Commun. 5, 1–9. doi:10.1038/ncomms5567
Downing, P.E., Jiang, Y., Shuman, M., Kanwisher, N., 2001. A cortical area selective for visual processing of the human body. Science 293, 2470–2473. doi:10.1126/science.1063414
Epstein, R., Kanwisher, N., 1998. A cortical representation of the local visual environment. Nature 392, 598–601. doi:10.1038/33402
Fischl, B., Sereno, M.I., Dale, A.M., 1999. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. Neuroimage 9, 195–207. doi:10.1006/nimg.1998.0396
Güçlü, U., van Gerven, M.A.J., 2015. Semantic vector space models predict neural responses to complex visual stimuli. arXiv preprint arXiv:1510.04738.
Guntupalli, J.S., Hanke, M., Halchenko, Y.O., Connolly, A.C., Ramadge, P.J., Haxby, J.V., 2016. A model of representational spaces in human cortex. Cereb. Cortex 26, 2919–2934. doi:10.1093/cercor/bhw068
Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.D., Blankertz, B., Bießmann, F., 2014. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 87, 96–110. doi:10.1016/j.neuroimage.2013.10.067
Haxby, J.V., Guntupalli, J.S., Connolly, A.C., Halchenko, Y.O., Conroy, B.R., Gobbini, M.I., Hanke, M., Ramadge, P.J., 2011. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron 72, 404–416. doi:10.1016/j.neuron.2011.08.026
Haxby, J.V., Connolly, A.C., Guntupalli, J.S., 2014. Decoding neural representational spaces using multivariate pattern analysis. Annu. Rev. Neurosci. 37, 435–456. doi:10.1146/annurev-neuro-062012-170325
Haynes, J.D., Rees, G., 2006. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci. 7, 523–534. doi:10.1038/nrn1931
Horikawa, T., Tamaki, M., Miyawaki, Y., Kamitani, Y., 2013. Neural decoding of visual imagery during sleep. Science 340, 639–642. doi:10.1126/science.1234330
Huth, A.G., de Heer, W.A., Griffiths, T.L., Theunissen, F.E., Gallant, J.L., 2016a. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458. doi:10.1038/nature17637
Huth, A.G., Lee, T., Nishimoto, S., Bilenko, N.Y., Vu, A.T., Gallant, J.L., 2016b. Decoding the semantic content of natural movies from human brain activity. Front. Syst. Neurosci. 10, 1–16. doi:10.3389/fnsys.2016.00081
Huth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L., 2012. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224. doi:10.1016/j.neuron.2012.10.014
Kanwisher, N., McDermott, J., Chun, M.M., 1997. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J. Neurosci. 17, 4302–4311.
Kay, K.N., Naselaris, T., Prenger, R.J., Gallant, J.L., 2008. Identifying natural images from human brain activity. Nature 452, 352–355. doi:10.1038/nature06713
Lafferty, J., McCallum, A., Pereira, F., 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289.
Malach, R., Reppas, J.B., Benson, R.R., Kwong, K.K., Jiang, H., Kennedy, W.A., Ledden, P.J., Brady, T.J., Rosen, B.R., Tootell, R.B., 1995. Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proc. Natl. Acad. Sci. U. S. A. 92, 8135–8139.
Matsuo, E., Kobayashi, I., Nishimoto, S., Nishida, S., Asoh, H., 2016. Generating natural language descriptions for semantic representations of human brain activity, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). pp. 22–29. doi:10.18653/v1/P16-3004
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013b. Efficient estimation of word representations in vector space, in: ICLR Workshop.
Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K., Malave, V.L., Mason, R.A., Just, M.A., 2008. Predicting human brain activity associated with the meanings of nouns. Science 320, 1191–1195. doi:10.1126/science.1152876
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H.C., Sadato, N., Kamitani, Y., 2008. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron 60, 915–929. doi:10.1016/j.neuron.2008.11.004
Moeller, S., Yacoub, E., Olman, C.A., Auerbach, E., Strupp, J., Harel, N., Uğurbil, K., 2010. Multiband multislice GE-EPI at 7 tesla, with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magn. Reson. Med. 63, 1144–1153. doi:10.1002/mrm.22361
Naselaris, T., Olman, C.A., Stansbury, D.E., Ugurbil, K., Gallant, J.L., 2015. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. Neuroimage 105, 215–228. doi:10.1016/j.neuroimage.2014.10.018
Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M., Gallant, J.L., 2009. Bayesian reconstruction of natural images from human brain activity. Neuron 63, 902–915. doi:10.1016/j.neuron.2009.09.006
Nishida, S., Huth, A.G., Gallant, J.L., Nishimoto, S., 2015. Word statistics in large-scale texts explain the human cortical semantic representation of objects, actions, and impressions. Soc. Neurosci. Abstr. 45, 333.13.
Nishimoto, S., Nishida, S., 2016. Lining up brains via a common representational space. Trends Cogn. Sci. 20, 565–567. doi:10.1016/j.tics.2016.06.001
Nishimoto, S., Vu, A.T., Naselaris, T., Benjamini, Y., Yu, B., Gallant, J.L., 2011. Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. 21, 1641–1646. doi:10.1016/j.cub.2011.08.031
Patterson, K., Nestor, P.J., Rogers, T.T., 2007. Where do you know what you know? The representation of semantic knowledge in the human brain. Nat. Rev. Neurosci. 8, 976–987. doi:10.1038/nrn2277
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1532–1543.
Pereira, F., Detre, G., Botvinick, M., 2011. Generating text from functional brain images. Front. Hum. Neurosci. 5, 72. doi:10.3389/fnhum.2011.00072
Pereira, F., Lou, B., Pritchett, B., Kanwisher, N., Botvinick, M., Deepmind, G., Fedorenko, E., 2016. Decoding of generic mental representations from functional MRI data using word embeddings. bioRxiv. doi:10.1101/057216
Plassmann, H., Venkatraman, V., Huettel, S., Yoon, C., 2015. Consumer neuroscience: Applications, challenges, and possible solutions. J. Mark. Res.
Rehurek, R., Sojka, P., 2010. Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50.
Sabuncu, M.R., Singer, B.D., Conroy, B., Bryan, R.E., Ramadge, P.J., Haxby, J.V., 2010. Function-based intersubject alignment of human cortical anatomy. Cereb. Cortex 20, 130–140. doi:10.1093/cercor/bhp085
Sakahara, M., Okada, S., Nitta, K., 2014. Domain-independent unsupervised text segmentation for data management, in: IEEE International Conference on Data Mining Workshop. pp. 481–487. doi:10.1109/ICDMW.2014.118
Stansbury, D.E., Naselaris, T., Gallant, J.L., 2013. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron 79, 1025–1034. doi:10.1016/j.neuron.2013.06.034
Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2014. Show and tell: a neural image caption generator. arXiv preprint arXiv:1411.4555.
Vodrahalli, K., Chen, P.-H., Liang, Y., Baldassano, C., Chen, J., Yong, E., Honey, C., Hasson, U., Ramadge, P., Norman, K., Arora, S., 2016. Mapping between fMRI responses to movies and their natural language annotations. arXiv preprint arXiv:1610.03914.
Wang, Z., Ittycheriah, A., 2015. FAQ-based question answering via word alignment. arXiv preprint arXiv:1507.02628.
Yamada, K., Miyawaki, Y., Kamitani, Y., 2015. Inter-subject neural code converter for visual image representation. Neuroimage 113, 289–297. doi:10.1016/j.neuroimage.2015.03.059
Yang, Y., Wang, J., Bailer, C., Cherkassky, V., Just, M.A., 2017. Commonality of neural representations of sentences across languages: Predicting brain activation during Portuguese sentence comprehension using an English-based model of brain function. Neuroimage 146, 658–666. doi:10.1016/j.neuroimage.2016.10.029
Figure legends

Figure 1 | Schematic of the word-based decoding using word vector representations
Our word-based decoding can be divided into training (A) and decoding (B) phases. The objective of the training phase is to train a linear model that estimates vector representations corresponding to movie scenes (scene vectors) from the cortical activity evoked by the scenes. The cortical activity was measured using fMRI while participants viewed TV commercials. The scene vectors were obtained from natural-language scene annotations by projecting them into a high-dimensional word vector space (annotation scene vectors). This space was constructed in advance from Wikipedia text corpus data using a statistical learning method (skip-gram; see Material and methods for details). Then, the fMRI responses of individual voxels were used to fit linear weights to the scene vectors using regularized linear regression. In the decoding phase, likely words for each movie scene were estimated in a test dataset separate from the dataset used for model training. Scene vectors were computed from the evoked cortical activity via the trained model (decoded scene vectors). Likely words for a given scene were determined on the basis of the correlation coefficients between the decoded scene vector and individual word vectors (word scores). A word with a higher score was regarded as more likely. © NTT DATA Corp. All rights reserved.
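The two phases described in this legend can be summarized by the rough Python sketch below. The ridge penalty, the placeholder data, and the dimensionalities are illustrative assumptions, not the exact regression procedure or scene-vector construction used in the study; the word-scoring step follows the correlation-based definition given above.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Placeholder data standing in for the quantities described in the legend.
n_voxels, dim, n_words = 2000, 300, 5000
fmri_train = rng.standard_normal((400, n_voxels))   # training responses (scenes x voxels)
annot_vecs = rng.standard_normal((400, dim))        # annotation scene vectors for training scenes
fmri_test = rng.standard_normal((50, n_voxels))     # test responses
word_vecs = rng.standard_normal((n_words, dim))     # word vectors for the vocabulary

# Training phase: regularized linear regression from voxel responses to scene vectors.
model = Ridge(alpha=1.0)            # the regularization strength here is only a placeholder
model.fit(fmri_train, annot_vecs)

# Decoding phase: predict scene vectors from held-out responses, then score every
# vocabulary word by its correlation with each decoded scene vector.
decoded = model.predict(fmri_test)

def row_correlations(A, B):
    # Pearson correlation between every row of A and every row of B.
    A = (A - A.mean(1, keepdims=True)) / A.std(1, keepdims=True)
    B = (B - B.mean(1, keepdims=True)) / B.std(1, keepdims=True)
    return A @ B.T / A.shape[1]

word_scores = row_correlations(decoded, word_vecs)        # (scenes x words)
likely_words = np.argsort(word_scores, axis=1)[:, ::-1]   # word indices, most likely first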
Figure 2 | Word-based decoding in three example scenes
(A) The most likely words decoded from the brain activity of a single participant (P1) in an example movie scene. Up to eight likely words are shown separately for each category: nouns, verbs, and adjectives. A more reddish color indicates that the word is more likely, namely that the word has a higher word score (see color bar). Words with a score lower than 0.2 are not shown. The words denoted by asterisks (*) are novel words that were absent in the training dataset. © NTT DATA Corp. All rights reserved.
(B, C) The most likely words decoded from two participants (B, P6; C, P4) for a single scene. © Shiseido Japan Co. Ltd. All rights reserved.
Figure 3 | Word-level evaluation of decoding accuracy
(A) Word-wise correlation. The distribution of correlations for individual words is shown separately for nouns (left), verbs (middle), and adjectives (right). The word-wise correlation reflects how the distance between individual word vectors and the decoded scene vectors co-varied with the distance between the individual word vectors and the annotation scene vectors (see Material and methods for details). Filled bars indicate the words with correlation coefficients significantly higher than zero (p < 0.05). The median correlations were significantly higher than zero (vertical lines) for all parts of speech (Wilcoxon signed-rank test, p < 0.0001).
(B) The relation between word scores and the presence in the scene annotations for the example word "Building." The trace shows the word scores for "Building" as a function of the time in the test movie. Shaded areas indicate the presence of "Building" in individual scenes. The color of the shading denotes the number of times the word appeared for each scene (green, 1 annotation; blue, 2 annotations; red, ≥3 annotations).
(C) ROC curves for "Building." The ROC curves reflect how well the word scores predict the presence of the word in the scene annotations (see Material and methods for details). The color of the curves differentiates the thresholds used to determine whether the word was present or absent in each scene (green, 1 annotation; blue, 2 annotations; red, 3 annotations). The AUC values of all of the curves (bottom right) were significantly higher than 0.5 (randomization test, p < 0.001).
(D) Word-wise ROC for all words. The distribution of the word-wise AUC values is shown separately for nouns (left), verbs (middle), and adjectives (right). Filled bars indicate the words with AUC values significantly different from 0.5 (randomization test, p < 0.05). The median AUC values were significantly higher than 0.5 (vertical lines) for all parts of speech (Wilcoxon signed-rank test, p < 0.001).
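In outline, the word-wise AUC analysis of panels (C) and (D) amounts to the following Python sketch with synthetic data. The randomization test shown here shuffles the presence labels across scenes to form a null distribution, which is an assumed implementation rather than a description of the exact test used in the study.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder data for one word and one present/absent threshold:
# scores[t]  : the word's score in test-movie scene t
# present[t] : 1 if the word appears in at least the threshold number of annotations
n_scenes = 300
scores = rng.standard_normal(n_scenes)
present = np.zeros(n_scenes, dtype=int)
present[rng.choice(n_scenes, 30, replace=False)] = 1

auc = roc_auc_score(present, scores)

# Assumed randomization test: build a null distribution of AUC values by
# shuffling the presence labels across scenes, then compute a one-sided p value.
null_auc = np.array([roc_auc_score(rng.permutation(present), scores)
                     for _ in range(1000)])
p_value = (np.sum(null_auc >= auc) + 1) / (null_auc.size + 1)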
Figure 4 | Decoding accuracy for words that did not appear in the training phase
Our model could recover the perceptual contents associated with novel words that had never appeared in the training dataset. The word-wise correlations (A) and the word-wise AUC values (B) were calculated using only novel words. The present/absent threshold of the AUC was 1 annotation. Filled bars indicate the words with significant correlation coefficients or AUC values (p < 0.05). Both median values were significant (Wilcoxon signed-rank test, p < 0.0001).
Figure 5 | Inter-individual variability of decoder-estimated and annotation-derived scene contents
(A) The temporal profile of the pairwise scene-vector distance between individuals. The pairwise distance was computed as the correlation distance between scene vectors for all possible pairs of the scene annotators or the experimental participants. The pairwise distances between the scene vectors derived from the decoder estimation (black), the annotations from the scene annotators (pink), and the annotations from the experimental participants (blue) were averaged across all pairs and transformed into z scores. The mean pairwise distance for each type of scene vector is shown as a function of the time it appeared in the test movie.
(B) The correlation between the decoded and annotation pairwise distances. Each dot represents the mean pairwise distance between the decoded (x-axis) and annotation (y-axis) scene vectors in each movie scene. The color of the dots differentiates the annotation scene vectors derived from the annotators (pink) and the experimental participants (blue). The pairwise distance for both annotations showed a significant correlation with the pairwise distance for the decoded scene vectors (Pearson's r = 0.27, p < 0.001).
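The distance measure in panel (A) corresponds, in outline, to the following Python sketch with placeholder scene vectors: the correlation distance is averaged over all pairs of individuals for each scene, and the resulting temporal profile is z-scored.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Placeholder scene vectors: scene_vecs[i, t] is the vector of individual i
# (annotator or participant) for test-movie scene t.
n_individuals, n_scenes, dim = 6, 300, 300
scene_vecs = rng.standard_normal((n_individuals, n_scenes, dim))

# Mean correlation distance (1 - Pearson r) over all pairs of individuals, per scene.
mean_dist = np.array([pdist(scene_vecs[:, t, :], metric="correlation").mean()
                      for t in range(n_scenes)])

# z-score the resulting temporal profile, as plotted in panel (A).
mean_dist_z = (mean_dist - mean_dist.mean()) / mean_dist.std()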
Figure 6 | Important cortical areas for decoding
To identify the cortical areas that made the greatest contributions to our decoding, the maximum absolute weights of individual voxels in the decoding model were projected onto the cortical surface of a single participant (P1). The weights were z-scored. Brighter locations in the surface maps indicate voxels with larger weights. Only voxels with weights above 1 SD are shown. LH, left hemisphere; RH, right hemisphere; A, anterior; P, posterior.
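In outline, the voxel-wise summary shown here amounts to the short Python sketch below; the weight matrix is a placeholder, and the projection onto the cortical surface (which requires each participant's anatomical data) is omitted.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder decoder weights: rows are scene-vector dimensions, columns are voxels.
weights = rng.standard_normal((300, 60000))

# Maximum absolute weight per voxel, z-scored across voxels; only voxels above
# 1 SD would be projected onto the cortical surface.
max_abs = np.abs(weights).max(axis=0)
z = (max_abs - max_abs.mean()) / max_abs.std()
voxels_to_show = np.where(z > 1.0)[0]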