Decoding naturalistic experiences from human brain activity via distributed representations of words


Satoshi Nishida, Shinji Nishimoto
PII: S1053-8119(17)30664-X
DOI: 10.1016/j.neuroimage.2017.08.017
Reference: YNIMG 14248
To appear in: NeuroImage
Received Date: 15 March 2017
Revised Date: 31 July 2017
Accepted Date: 3 August 2017


Decoding naturalistic experiences from human brain activity via distributed representations of words

Satoshi Nishida1,2 and Shinji Nishimoto1,2,3,*

1 Center for Information and Neural Networks (CiNet), National Institute of Information and Communications Technology (NICT), Suita, Osaka 565-0871, Japan
2 Graduate School of Frontier Biosciences, Osaka University, Suita, Osaka 565-0871, Japan
3 Graduate School of Medicine, Osaka University, Suita, Osaka 565-0871, Japan

* Corresponding author: Shinji Nishimoto
1-4 Yamadaoka, Suita, Osaka 565-0871, Japan
[email protected]

Abstract

Natural visual scenes induce rich perceptual experiences that are highly diverse from scene to scene and from person to person. Here, we propose a new framework for decoding such experiences using a distributed representation of words. We used functional magnetic resonance imaging (fMRI) to measure brain activity evoked by natural movie scenes. Then, we constructed a high-dimensional feature space of perceptual experiences using skip-gram, a state-of-the-art distributed word embedding model. We built a decoder that associates brain activity with perceptual experiences via the distributed word representation. The decoder successfully estimated perceptual contents consistent with the scene descriptions by multiple annotators. Our results illustrate three advantages of our decoding framework: (1) three types of perceptual contents could be decoded in the form of nouns (objects), verbs (actions), and adjectives (impressions) contained in 10,000 vocabulary words; (2) despite using such a large vocabulary, we could decode novel words that were absent in the datasets used to train the decoder; and (3) the inter-individual variability of the decoded contents co-varied with that of the contents of scene descriptions. These findings suggest that our decoding framework can recover diverse aspects of perceptual experiences in naturalistic situations and could be useful in various scientific and practical applications.

Highlights

- A new decoding method that uses a distributed representation of words
- Decoding of movie-induced perceptions from cortical activity in the form of words
- Decoding could infer object, action, and impression perception separately
- Inter-individual variability in decoding correlated with variability in perception
- Our method provides a useful tool for scientific and practical applications

Key words
decoding; semantic perception; natural language processing; humans; fMRI; natural vision

1. Introduction

Recent developments in decoding techniques using functional magnetic resonance imaging (fMRI) have the potential to form the quantitative basis of non-invasive brain-machine interfaces (Haynes and Rees, 2006; Kay et al., 2008; Miyawaki et al., 2008; Naselaris et al., 2015, 2009; Nishimoto et al., 2011). One promising means of achieving interpretable decoding of the brain activity evoked by diverse experiences is in the form of language: i.e., words (Horikawa et al., 2013; Huth et al., 2016b; Pereira et al., 2011; Stansbury et al., 2013) or sentences (Anderson et al., 2016; Matsuo et al., 2016; Pereira et al., 2016; Yang et al., 2017).

Despite previous attempts to comprehensively decode objective and subjective experiences, there are still several issues to overcome. First, previous techniques tried to recover perceptual contents using a restricted representational space comprising only a few thousand words at most, although language consists of tens of thousands or more words and can represent richer information than examined in previous studies. Second, previous word-based decoding studies did not provide strong evidence that the variety of decoded contents was consistent with the subjective experiences of individual participants.

To address these issues, we propose a new decoding technique that can recover rich and diverse perceptual experiences from the brain activity of individuals by effectively applying a large-scale word representational space provided by a natural language processing (NLP) model. For several decades, NLP researchers have tried to extract word representations from the statistical characteristics of large datasets of text (e.g., Blei et al., 2003; Deerwester et al., 1990; Mikolov et al., 2013a; Pennington et al., 2014). Some neuroimaging studies have demonstrated that the extracted feature representations of words can be used to model semantic representations in the brain (Chang et al., 2011; Huth et al., 2016a; Mitchell et al., 2008; Pereira et al., 2011; Stansbury et al., 2013). Among them, models that incorporate distributed word representations produced by skip-gram (Mikolov et al., 2013a), a state-of-the-art NLP algorithm, can predict visually evoked brain responses better than models with other NLP algorithms (Güçlü and van Gerven, 2015; Nishida et al., 2015). If the skip-gram word representation efficiently captures semantic representation in the human brain, it should also provide an effective feature space for word-based decoding to recover richer and more complex semantic perceptions from individual brain activity.

To test this idea, we introduce a word-based decoding technique that uses skip-gram word representations. Importantly, our decoding model does not learn the association between brain activity and words per se. Instead, it learns the association between brain activity and the low-dimensional feature space that captures the relational structure of tens of thousands of words. This enables us to dramatically increase the number of potential words used for decoding without increasing the model complexity. In this paper, we demonstrate the validity and advantages of our decoding technique using participants' fMRI responses while viewing natural movie scenes. In particular, we focus on the following three questions. (1) Is it possible to decode the perceptual contents induced by movie scenes into different types of content words, including not only objects (nouns) and actions (verbs) but also impressions (adjectives)? (2) How accurately can the decoder estimate words that have never been presented to it as part of a training dataset? (3) Do the decoded contents co-vary with the inter-individual differences in perceptual experiences?

2. Material and methods

2.1. Participants

We scanned six healthy participants (P1–P6; age 26–41; 1 female) with normal or corrected-to-normal vision. P1 was an author of the study. Informed consent was obtained from all of the participants. The experimental protocol was approved by the ethics and safety committees of the National Institute of Information and Communications Technology (NICT).

2.2. fMRI data collection

Functional scans were collected on a 3T Siemens TIM Trio scanner (Siemens, Germany) using a 32-channel Siemens volume coil and a multiband gradient echo-EPI sequence (Moeller et al., 2010; TR = 2000 ms; TE = 30 ms; flip angle = 62°; voxel size = 2 × 2 × 2 mm; matrix size = 96 × 96; FOV = 192 × 192 mm; multiband factor = 3). Seventy-two axial slices covered the entire cortex. Anatomical data were collected using a T1-weighted MPRAGE sequence (TR = 2530 ms; TE = 3.26 ms; flip angle = 9°; voxel size = 1 × 1 × 1 mm; matrix size = 256 × 256; FOV = 256 × 256 mm) on the same 3T scanner.

2.3. Experimental design

The movie stimuli consisted of 298 clips from TV commercials that were selected from the database of Japanese TV commercials managed by TEMS Corp. (Tokyo, Japan) and NTT DATA Corp. (Tokyo, Japan). These commercials were broadcast nationwide in Japan between 2009 and 2015. The commercial clips included a wide variety of product categories, including food, drink, clothing, appliances, cars, housing, insurance, and amusement facilities. The length of the commercial clips was typically around 15 to 30 s. To create the experimental movie stimuli, the original commercial clips were adjusted in size and then sequentially concatenated in a pseudo-random order. We made 13 non-overlapping movie clips ranging from 400 to 662 s in length. The individual movie clips were displayed in separate scans. Twelve of the clips were presented once each to collect a dataset for model training (training dataset; 7,232 s in total). The remaining clip was presented four times in four separate scans to collect a dataset for model testing (test dataset; 1,560 s in total).

Participants viewed the visual stimuli displayed on a projector screen inside the scanner (23.8 × 13.5 degrees of visual angle at 30 Hz) and listened to the audio stimuli through MR-compatible headphones. The participants were given no explicit task and were instructed to watch the clips naturally, as though they were watching commercials on TV in everyday life. The brain data from individual participants were collected in 2–4 separate recording sessions over 1–4 days.

2.4. fMRI data preprocessing

Motion correction in each functional scan was performed using the Statistical Parametric Mapping toolbox (SPM8, http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). All volumes were aligned to the first image from the first functional run for each participant. Low-frequency voxel response drift was estimated using a median filter with a 120 s window and subtracted from the signal. The response of each voxel was then normalized by subtracting the mean response and scaling it to unit variance. We used FreeSurfer (Dale et al., 1999; Fischl et al., 1999) to identify cortical surfaces from the anatomical data and register them to the voxels of the functional data. All voxels identified within the whole cortex of each participant were used for the analysis (P1, 68,942 voxels; P2, 61,843 voxels; P3, 51,474 voxels; P4, 69,899 voxels; P5, 67,060 voxels; P6, 61,091 voxels).
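The drift-removal and normalization steps map onto a few lines of array code. The following is a minimal sketch, assuming the motion-corrected responses of one scan are stored in a (time × voxels) NumPy array sampled at TR = 2 s; the function and variable names are illustrative rather than taken from the original analysis code.

import numpy as np
from scipy.signal import medfilt

def preprocess_scan(responses, tr=2.0, window_s=120.0):
    """Median-filter detrending (~120 s window) followed by per-voxel z-scoring.

    responses : array of shape (n_timepoints, n_voxels), motion-corrected.
    """
    kernel = int(round(window_s / tr))
    if kernel % 2 == 0:          # medfilt requires an odd kernel size
        kernel += 1
    # Estimate slow drift per voxel and subtract it from the signal.
    drift = np.apply_along_axis(medfilt, 0, responses, kernel_size=kernel)
    detrended = responses - drift
    # Normalize each voxel to zero mean and unit variance.
    return (detrended - detrended.mean(axis=0)) / detrended.std(axis=0)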

2.5. Text data preprocessing

We used a text corpus from the Japanese Wikipedia dump of January 11, 2016 (http://dumps.wikimedia.org/jawiki) to learn a word feature space. From the 2,016,021 articles in the corpus, we selected the 633,477 articles that contained more than 50 words to use as the text dataset. All Japanese texts in the dataset were segmented into words and lemmatized using MeCab (http://taku910.github.io/mecab), open-source software for Japanese text segmentation and part-of-speech analysis that uses conditional random fields for sequential segmentation (Lafferty et al., 2001). We used a custom-made Japanese dictionary as a vocabulary database for the segmentation and part-of-speech analysis. The dictionary was made by combining the words from the titles of Japanese Wikipedia articles with the Japanese dictionary published by the Nara Institute of Science and Technology (http://sourceforge.jp/projects/naist-jdic). Only nouns, verbs, and adjectives were used for the following analysis, and the other parts of speech were discarded.

EP

TE D

M AN U

142

153 154

2.6.

Skip-gram vector space

155

The skip-gram algorithm was originally developed to learn a high-dimensional word

156

vector space based on local (nearby) word co-occurrence statistics in natural language texts

9

ACCEPTED MANUSCRIPT

(Mikolov et al., 2013a). Although the original study of the skip-gram algorithm used English

158

corpora for learning (Mikolov et al., 2013a), follow-up studies have demonstrated that the

159

skip-gram algorithm performs well for NLP problems in other languages, including Japanese

160

(e.g., Sakahara et al., 2014; Wang and Ittycheriah, 2015).

RI PT

157

161

On the basis of the lemmatized word data, we constructed a skip-gram latent word vector

163

space using the gensim Python library (Rehurek and Sojka, 2010). The training objective of the

164

skip-gram algorithm is to obtain latent word representations that enable accurate prediction of

165

the surrounding words given a word in a sentence (Mikolov et al., 2013a). More formally, given

166

a sequence of training words w1, w2, … , wT, the skip-gram algorithm seeks a k-dimensional

167

vector space to maximize the average log probability, given as 



TE D

1  

M AN U

SC

162

  , 

log ( | )

where c is the size of the training window, which corresponds to the number of to-be-predicted

169

words before and after the central word wt. The basic formulation of p(wt+j|wt) is the softmax

170

function

AC C

EP

168

     =

exp〈"# , "# 〉 ) ∑( &'(〈"# , "#( 〉)

where v_{w_i} is the vector representation of w_i, N is the number of words in the vocabulary, and \langle v_1, v_2 \rangle indicates the inner product of vectors v_1 and v_2. However, because this formulation has a high computational cost, we used the "negative sampling" technique (Mikolov et al., 2013a; the number of negative samples = 5) for a computationally efficient approximation of the softmax function.
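For reference, the negative-sampling objective itself is not written out in the manuscript; in the standard form of Mikolov et al. (2013a), and keeping the notation above, each term \log p(w_{t+j} \mid w_t) is replaced by

\log \sigma(\langle v_{w_{t+j}}, v_{w_t} \rangle) + \sum_{n=1}^{5} \mathbb{E}_{w_n \sim P_n(w)} \left[ \log \sigma(-\langle v_{w_n}, v_{w_t} \rangle) \right]

where \sigma is the logistic function and P_n(w) is the noise distribution from which the five negative samples are drawn.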

178

reliability of the learning and restrict the vocabulary size to around 100,000 words, words that

179

appeared less than 178 times in the corpus were excluded from the analysis; nevertheless, the

180

size of the vocabulary had little effect on our results. In addition, to accelerate learning and

181

improve the quality of the learned vectors of rare words, we used a procedure for subsampling

182

frequent words (Mikolov et al., 2013a).

184 185

2.7.

Movie scene annotation

M AN U

183

SC

RI PT

177

Each movie scene was manually annotated using natural Japanese language. The annotations were given at 1-s intervals to obtain precise descriptions of movie scenes that

187

changed dynamically from second to second. The annotators were native Japanese speakers (22

188

males and 26 females; age 18–51) who were neither the authors nor the MRI participants,

189

except that S5 and S6 provided the annotations of ≤150 scenes each in the training movies.

190

They were instructed to annotate each scene with descriptive sentences using more than 50

191

Japanese characters (see Figure 1A and Figure S1 for examples). We randomly assigned 4–7

192

annotators for each scene to reduce the potential effect of personal bias.

194

EP

AC C

193

TE D

186

Many of the movie scenes included text captions: they were present in 3,442 (45.2%) of

195

the 7,622 one-second scenes. In 1,420 (41.3%) of these 3,442 one-second scenes, the text

196

captions were contained in the descriptive sentences of the annotators. These text captions

197

might have affected the perceptual experiences of participants and annotators. However, we do

11

ACCEPTED MANUSCRIPT

not believe that this was an issue in the present study because we aimed to visualize various

199

perceptual experiences using our word-based decoding method, regardless of whether the

200

experiences were verbal or nonverbal.

RI PT

198

201 202

We also collected annotations from the MRI participants for the test movie scenes after the participants had viewed all of the movies in the scanner. The instructions for the annotation

204

were the same as above. We randomly assigned 130 scenes to each of the 6 participants without

205

repetition and obtained 2 annotations per 1-s scene.

207

2.8.

208

Scene-vector construction

M AN U

206

SC

203

Word vector representations for individual movie scenes were computed from the manual scene annotations using the learned word vector space. Each annotation for a given scene was

210

decomposed into nouns, verbs, and adjectives using the same method as described earlier (see

211

2.5. Text data preprocessing). Individual words were projected into the corresponding

212

100-dimensional word vector space (see 2.6. Skip-gram vector space). The word vectors were

213

averaged within each annotation. Then, for each scene, all vectors that were obtained from the

214

different annotations were averaged. This procedure yielded one vector representation for each

215

second of each movie. Finally, to match the sampling interval to the fMRI data (2 s) for the

216

following analysis, the vectors for individual seconds were averaged over two successive

217

seconds. These averaged vectors are referred to as annotation scene vectors.

AC C

EP

TE D

209

218 219

2.9.

Model fitting

12

ACCEPTED MANUSCRIPT

220

Our decoding model predicts scene vectors by a weighted linear summation of voxel responses. Specifically, a series of 100-dimensional annotation scene vectors in S movie scenes,

222

denoted by V, were modeled by a series of responses in the set of T voxels within the whole

223

cortex, denoted by R, times the linear weight W, plus isotropic Gaussian noise ε: V = RW + ε

RI PT

221

We used a set of linear temporal filters to model the slow hemodynamic response and its

225

coupling with brain activity (Nishimoto et al., 2011). To capture the hemodynamic delay in the

226

responses, the S × 3T matrix R was constructed by concatenating three sets of T-dimensional

227

response vectors with temporal shifts of 2, 4, and 6 s. The 3T × 100 weight matrix WE was

228

estimated using an L2-regularized linear least-squares regression. A regularized regression can

229

obtain good estimates even for models containing a large number of regressors (Çukur et al.,

230

2013; Huth et al., 2012). The estimation of the scene vector from newly measured responses

231

was conducted by multiplying the responses evoked in T voxels by WE. We refer to the scene

232

vector as a decoded scene vector.

M AN U

TE D

EP

234

The training dataset consisted of 3,616 samples (7,232 s), but the first 5 samples (10 s) of

AC C

233

SC

224

235

each block were discarded to avoid responses from the non-stimulus period. Hence, 3,556

236

samples were used for the model fitting procedure. In addition, to estimate the regularization

237

parameter, we divided the training dataset into two subsets by random resampling: 80% of the

238

samples were used for model fitting and the remaining 20% were used for model validation.

239

This random resampling procedure was repeated 10 times. We determined the optimal

240

regularization parameter for each subject using the mean value of the Pearson’s correlation

13

ACCEPTED MANUSCRIPT

241

coefficient between decoded and annotation scene vectors for the 20% of validation samples.

242 The test dataset consisted of 190 samples (380 s) after discarding the first 5 samples (10

RI PT

243

s) in each block. The fMRI signals for the four stimulus repetitions were averaged to improve

245

the signal-to-noise ratio. This dataset was not used for the model fitting or the parameter

246

estimation, but instead was used to evaluate the final prediction accuracy for each voxel. The

247

prediction accuracy was measured by the Pearson’s correlation coefficient between the decoded

248

and annotation scene vectors.

250 251

2.10. Word-based decoding

M AN U

249

SC

244

For each movie scene, we measured the similarity between each word in the vocabulary and each decoded movie scene vector using Pearson’s correlation coefficient in the

253

100-dimensional skip-gram vector space. We refer to the correlation coefficient as a word score.

254

Words with higher scores were regarded as more likely to reflect perceptual contents. We

255

restricted the size of the vocabulary to the 10,000 words that appeared most frequently in the

256

Wikipedia text dataset. Word scores were estimated for three parts of speech, nouns, verbs, and

257

adjectives, consisting of 9,320, 588, and 92 words, respectively.

259 260

EP

AC C

258

TE D

252

2.11. Word-wise decoding accuracy We conducted two analyses to assess the performance of our decoding model at the

261

single-word level. The first was a word-wise correlation analysis. For a given word, we

262

calculated the time series of word scores and the time series of annotation word scores which

14

ACCEPTED MANUSCRIPT

were defined by Pearson’s correlation coefficients between the word vector and annotation

264

scene vectors. Then, we calculated the Pearson’s correlation coefficient between these two

265

series and regarded it as the word-wise correlation coefficient of that word. We evaluated the

266

word-wise correlation for each of the 10,000 words we used for the word-based decoding unless

267

otherwise noted. The statistical significance of the word-wise correlations for each part of

268

speech was tested using Wilcoxon’s signed-rank test (p < 0.05); the null hypothesis was that

269

correlation coefficients were derived from a distribution with a median value of zero.

SC

RI PT

263

271

M AN U

270

The second analysis was a word-wise receiver operating characteristic (ROC) analysis (Huth et al., 2016b). After calculating the word score for a given word (see Word-based

273

decoding), we gradually increased the detection threshold from –1 to 1. For each threshold, we

274

counted the number of false positive detections (scenes for which the word score was higher

275

than the threshold but the word was not present in any of the annotations) and true positive

276

detections (scenes for which the word score was higher than the threshold and the word was

277

actually present in the annotations). Consequently, we obtained a function of the true positive

278

rate against the false positive rate across all thresholds to produce the area under the ROC curve

279

(AUC). An AUC value close to 1 indicates high decoding accuracy, whereas an AUC value

280

close to 0.5 indicates low decoding accuracy. We also varied the presence/absence threshold

281

that determined whether a given word was present or absent in each movie scene. We used three

282

thresholds: one, two, or three annotations. For example, the threshold of two annotations means

283

that a given word was regarded as present in a given scene if the word appeared in ≥2

284

annotations in that scene; otherwise, the word was regarded as absent.

AC C

EP

TE D

272

15

ACCEPTED MANUSCRIPT

285 286

In this ROC analysis, the infrequent words that were present in the annotations of less than five scenes were excluded to improve the quality of the analysis. The words that were

288

present in more than 97 two-second scenes (more than half of all scenes) were also excluded

289

from the analysis, because such words seemed to be highly common words that provided little

290

information about the scene (e.g., “be”). Using the selection criteria above, we selected 3,349

291

words for the ROC analysis.

293

M AN U

292

SC

RI PT

287

The statistical significance of the word-wise ROC analysis for single words was tested using a randomization test (p < 0.05). After randomly shuffling the presence/absence labels of a

295

given word in individual scenes, an AUC value was computed. This process was repeated 1,000

296

times to obtain a null distribution of AUC values. A p-value was computed as the fraction of the

297

null distribution that was higher than the original AUC value. The statistical significance of the

298

word-wise ROC analysis for each part of speech was tested using Wilcoxon’s signed-rank test

299

(p < 0.05); the null hypothesis was that AUC values minus 0.5 were derived from a distribution

300

with a median value of zero.

302 303

EP

AC C

301

TE D

294

2.12. Japanese–English translation For this paper, we translated the Japanese words (i.e., the language of the annotations and

304

corpus) into the corresponding English words. To achieve neutral word-by-word translation

305

without selection bias, we semi-automatically translated the words using Google Translate

306

(https://translate.google.com/). Specifically, we adopted the first candidate word in the

16

ACCEPTED MANUSCRIPT

Japanese–English translation returned from Google Translate. Occasionally, the part-of-speech

308

of an English word returned by Google Translate did not match with that of the original

309

Japanese word. In such cases, we manually translated it on the basis of another standard

310

Japanese–English dictionary.

RI PT

307

AC C

EP

TE D

M AN U

SC

311

17

ACCEPTED MANUSCRIPT

312

3.

313

3.1.

Word-based decoding could recover perceptual contents varying across scenes Our decoding model (Figure 1A) was trained to learn the association between

RI PT

314

Results

movie-evoked brain activity and perceptual experiences via distributed word representations.

316

We recorded the brain activity of six participants while they watched 147 minutes of movies

317

(TV ads). The presented movie scenes were chunked into one-second clips, and four or more

318

individual annotators described each clip using natural language (Figure 1A; see also Figure S1

319

for more annotation examples). The scene annotations were transformed into vector

320

representations via a skip-gram word vector space (annotation scene vectors.) We then

321

performed L2-regularized linear regressions to obtain the linear weight matrix between brain

322

activity and scene vectors (see Material and methods for further details). The optimal weights

323

were obtained separately for individual participants. We used a training dataset consisting of

324

3,616 time samples to train the model, and a separate test dataset consisting of 195 time samples

325

to quantify the performance of the trained model. Although the sample size of the test dataset

326

was small, the annotation scene vectors in the test dataset broadly covered the representational

327

space of the annotation scene vectors in the training dataset (Figure S2).

329

M AN U

TE D

EP

AC C

328

SC

315

The scene vectors predicted from brain activity (decoded scene vectors) were

330

substantially correlated with the annotation scene vectors in the same test dataset (the average

331

Pearson’s r for the six individual participants ranged from 0.40–0.45; p < 0.0001). Although the

332

correlations varied from scene to scene even within each movie clip, the temporal profiles of the

333

correlations of individual participants were similar throughout the whole movie (Figure S3).

18

ACCEPTED MANUSCRIPT

334 335

Then, the decoded scene vectors were used to recover the perceptual contents induced by individual movie scenes in the form of words (Figure 1B). We calculated the correlation

337

coefficients between decoded scene vectors and individual word vectors as word scores. Based

338

on the word scores, the words that were likely to reflect perceptual contents were estimated;

339

words with higher word scores were regarded as being more likely (see Material and methods

340

for further details). We performed word score estimations for the 10,000 most frequent words

341

(restricted to nouns, verbs, and adjectives) in the Wikipedia corpus. The word scores were

342

ranked separately for these three parts of speech, which were considered to reflect the

343

perceptual contents associated with objects, actions, and impressions, respectively.

M AN U

SC

RI PT

336

344

The most likely words estimated by the decoder varied across individual scenes and

346

across individual participants (Figure 2). Figure 2A shows the most likely words estimated from

347

a single participant for a single scene, in which a man is looking at a tablet computer. Likely

348

words were, for example, “display” and “system” as nouns, “represent” as a verb, and “wide” as

349

an adjective, which adequately explain the visual contents of the scene. Figure 2B shows the

350

estimation for another scene. In this scene, a woman is soaking in a bath and a Japanese caption

351

is presented in the center. Likely words were, for example, “female” and “face” as nouns, and

352

“cute” and “young” as adjectives. Figure 2C is the same as Figure 2B, but the estimation is from

353

another participant. Likely words were, for example, “comment” and “catchphrase” as nouns.

354

The words in Figure 2B are more related to the perception of the woman, whereas the words in

355

Figure 2C are more related to the perception of the caption.

AC C

EP

TE D

345

19

ACCEPTED MANUSCRIPT

356 To test the validity of our decoding method at the single-word level, we conducted two

358

analyses using the decoded scene vectors averaged across all participants. First, we conducted

359

word-wise correlation analysis. For each word and scene in the test dataset, we calculated

360

annotation word scores: the correlation coefficients between the annotation scene vectors and

361

the individual word vectors. The annotation word scores reflect the likelihood of words based

362

on the annotators’ scene descriptions, thereby forming an approximate ground truth of the word

363

scores. We quantified the word-wise correlation by calculating the correlation coefficients

364

between word scores and annotation word scores across the scenes in the test dataset (see

365

Material and methods for details; Figure 3A). The correlation coefficients were significantly

366

higher than zero for all three parts of speech (Wilcoxon signed-rank test, p < 0.0001; see Figure

367

S4 for individual participants), suggesting that our decoded results reflect perceptual

368

experiences regarding objects, actions, and impressions.

SC

M AN U

TE D

EP

369

RI PT

357

Second, we conducted a word-wise receiver operating characteristic (ROC) analysis

371

(Huth et al., 2016b). For individual words, we obtained the time series of word scores and those

372

of word occurrences in scene annotations throughout the test dataset (Figure 3B shows the

373

example “Building”). The word occurrence was evaluated using three word presence/absence

374

thresholds; a word was regarded as being present in a given scene when the word appeared in

375

more than n annotations (n = 1, 2, or 3). Using these time series data with a particular detection

376

threshold of word scores, we counted the number of false positive detections (scenes where the

377

word score was higher than the threshold but the word was not present in any of the

AC C

370

20

ACCEPTED MANUSCRIPT

annotations) and true positive detections (scenes where the word score was higher than the

379

threshold and the word was actually present in the annotations). Then, we drew an ROC curve

380

given as a function of the true positive rate against the false positive rate while the detection

381

threshold gradually increased from –1 to 1 (e.g., Figure 3C). Finally, we obtained the area under

382

the ROC curve (AUC) to quantify the degree to which the presence or absence of a given word

383

in the annotations could be predicted by word scores (see Material and methods for details). For

384

the example word “Building,” the AUC values were significantly higher than the level of

385

chance (Figure 3C; randomization test, p < 0.001). The distribution of AUC values for all words

386

was significantly higher than 0.5 for all of the tested parts of speech (Figure 3D; Wilcoxon

387

signed-rank test, p < 0.0005; see Figure S5 for individual participants; see Table S1 for the most

388

decodable words). In addition, the AUC values of the example word “Building” increased (from

389

0.79 to 0.88) as the presence/absence threshold of the word increased (Figure 3C). This

390

indicates that a word that was commonly used by different annotators tended to have a higher

391

word score for that scene. This tendency was preserved for all words (Figure S6). Together,

392

these results indicate that our decoding successfully estimated the individual words related to

393

objects, actions, and impressions consistent with the annotators’ scene descriptions.

395 396

SC

M AN U

TE D

EP

AC C

394

RI PT

378

3.2.

The decoder could infer words that had never appeared during its training

Our decoder was able to score novel words that had never appeared in the annotation

397

dataset used for model training (for example, the words indicated by asterisks in Figure 2). We

398

tested our decoder’s ability to estimate these novel words. From the 10,000 most frequent words

399

in the Wikipedia corpus, we selected 5,119 words that were not present in the training dataset.

21

ACCEPTED MANUSCRIPT

We evaluated the word-wise correlations for the subset of words (Figure 4A; see Figure S7 for

401

individual participants) and found that all of the coefficients were significantly higher than zero

402

(Wilcoxon signed-rank test, p < 0.0001). We then evaluated the word-wise AUC values using

403

the words that were present only in the test dataset (n = 270; Figure 4B; see Figure S7 for

404

individual participants). Note that because such words were rare, we extracted them from the

405

entire vocabulary in the corpus (n = 100,035) rather than from the 10,000 words. We found that

406

the AUC values were significantly higher than 0.5 (Wilcoxon signed-rank test, p < 0.0001). The

407

decoder made a successful estimation even for the words that had never appeared in the training

408

data set.

M AN U

SC

RI PT

400

409 410

3.3.

Inter-individual variability in the decoded contents explained the variability in the annotators’ scene descriptions

412

The results shown in Figure 2B and C raise the possibility that the contents that our

TE D

411

decoder recovered from brain activity may reflect the inter-individual variability in perceptual

414

experiences induced by movie scenes. To test this possibility, we separately evaluated the

415

inter-individual variability in decoded scene vectors and that in annotation scene vectors, and

416

examined the association between them. The variability in each movie scene was quantified by

417

the pairwise correlation distance between scene vectors across all possible pairs of experimental

418

participants or annotators. The pairwise distance was then averaged across all pairs and

419

transformed into z scores. The black and pink traces in Figure 5A show the temporal profiles of

420

the normalized mean pairwise distance for decoded and annotation scene vectors, respectively.

421

These two series of pairwise distances were significantly correlated (pink dots in Figure 5B;

AC C

EP

413

22

ACCEPTED MANUSCRIPT

Pearson’s r = 0.27, p = 0.0001; Spearman’s ρ = 0.21, p = 0.003), indicating that we could infer

423

the inter-individual variability of scene-evoked perceptions from those decoded from brain

424

activity.

RI PT

422

425 426

Although the two sets of inter-individual variability were derived from different groups, i.e., experimental participants and annotators, the association of inter-individual variability was

428

also observed in data derived from the same group. We also collected scene annotations from

429

the experimental participants themselves and calculated a new set of annotation scene vectors.

430

We found that the series of pairwise distances between participant-derived annotation scene

431

vectors (blue trace in Figure 5A) were significantly correlated with those of decoded scene

432

vectors (blue dots in Figure 5B; Pearson’s r = 0.27, p = 0.0001; Spearman’s ρ = 0.23, p =

433

0.001).

TE D

M AN U

SC

427

434

One might argue that because the pairwise distance was likely to increase during scene

436

transitions (particularly, when one movie clip moved on to another), the clip transitions could

437

yield a quasi-significant correlation between the pairwise distances. To exclude this possibility,

438

we removed the samples that included clip transitions and re-calculated the correlation between

439

the variability of decoded and annotation scene vectors. We found that a significant correlation

440

remained (Pearson’s r = 0.25, p = 0.001; Spearman’s ρ = 0.21, p = 0.006; Figure S8), indicating

441

that clip transitions had little effect on the relationship between the pairwise distances. Taken

442

together, the inter-individual variability in decoded contents is likely to be closely related to that

443

in scene descriptions.

AC C

EP

435

23

ACCEPTED MANUSCRIPT

444 445

To further examine whether we could decode the experiences of a specific participant from his or her brain activity, we conducted a binomial classification analysis in which the

447

identity of each participant was examined according to the correlation between the participant’s

448

annotations and the decoded scene vectors. For each scene, we used two annotation scene

449

vectors obtained from the annotations of two participants (PA and PB). We also used decoded

450

scene vectors for each participant. For each scene, if the correlation coefficient between PA’s

451

decoded and annotation vectors was higher than that between PA’s decoded and PB’s

452

annotation vectors, the identification was correct; if not, the identification was wrong. When we

453

evaluated the rate of correct identifications for each participant (130 scenes per participant)

454

across all self-annotated scenes, the rate was significantly higher than the level of chance for

455

only one participant (P4) (Figure S9; binomial test, p < 0.05). Therefore, using our current data,

456

the contents decoded from the individual participants could not be used to distinguish their

457

subjective experiences estimated from their annotations.

458

460

3.4.

Higher visual areas contributed to our decoding

AC C

459

EP

TE D

M AN U

SC

RI PT

446

To investigate which cortical areas our decoder extracted perceptual information from,

461

we estimated the informative voxels from the decoding-model weights and visualized those

462

voxels on the cortical surface of each participant. Our decoding model consisted of three sets of

463

time lag terms to capture the hemodynamic delay of fMRI responses (see Material and methods

464

for details); therefore, the size of the weight matrix was N × 3K, where N is the number of

465

voxels and K is the vector dimensionality (= 100). We first averaged the weights across the

24

ACCEPTED MANUSCRIPT

three time lags, leading to an N × K matrix. The voxels with high decoding weights were

467

considered as computationally contributing to the decoding. However, those voxels should not

468

be simply regarded as voxels signaling rich perceptual information, because response

469

covariance across voxels may produce voxels with high decoding weights but uninformative

470

signals (Haufe et al., 2014). To avoid such dissociation, we corrected the decoding-model

471

weights in the same manner as proposed by Haufe et al. (2014): the original weights were

472

left-multiplied by the covariance matrix computed from the voxel response time-courses. Then,

473

we took the absolute values of the corrected weights and selected the maximum for each

474

individual voxel, leading to an N-dimensional vector. Finally, the vector was z-scored for each

475

participant and projected onto a cortical surface. We regarded voxels with higher z-scores as

476

more informative.

478

SC

M AN U

TE D

477

RI PT

466

Figure 6 shows the normalized absolute weights on the cortical surface of a single participant (P1). Although we used brain activity from all cortical voxels for our decoder, the

480

informative voxels were mainly located in the cortical regions that include the fusiform gyrus,

481

parahippocampal gyrus, occipitotemporal areas, and posterior superior temporal sulcus, which

482

are part of the higher visual areas. Similar results were observed for the other participants

483

(Figure S10). These areas involve previously reported object-selective regions such as the

484

fusiform face area (Kanwisher et al., 1997), the extrastriate body area (Downing et al., 2001),

485

and the lateral occipital complex (Malach et al., 1995), and scene-selective regions such as the

486

parahippocampal place area (Epstein and Kanwisher, 1998). Therefore, consistent with the

487

results from previous decoding studies (Huth et al., 2016b; Stansbury et al., 2013), our decoder

AC C

EP

479

25

ACCEPTED MANUSCRIPT

488

read the perceptual contents primarily from the information conveyed across these higher visual

489

areas.

AC C

EP

TE D

M AN U

SC

RI PT

490

26

ACCEPTED MANUSCRIPT

491 492

4.

Discussion We developed a new word-based decoding technique that recovers perceptual contents

from fMRI responses to natural movies via a word embedding space. Our decoder successfully

494

visualized movie-induced perceptions of objects, actions, and impressions in the form of words,

495

and showed generalizability in that it could estimate words that it had not seen during training.

496

Moreover, the decoded contents reflected the inter-individual variability of perceptual

497

experiences. These results suggest that our decoding technique provides a useful tool to recover

498

naturalistic perceptual experiences varying across scenes and across individuals.

M AN U

SC

RI PT

493

499 500

We decoded diverse perceptual experiences induced by continuous natural movies. Our work extends previous studies that decoded semantic contents induced by line drawings or static

502

images (Pereira et al., 2011; Stansbury et al., 2013) to more complex real-life movie materials

503

(TV ads.) This study also extends previous studies that demonstrated word-based decoding of

504

up to a few thousand nouns and verbs (Güçlü and van Gerven, 2015; Huth et al., 2016b; Pereira

505

et al., 2011; Stansbury et al., 2013) to more comprehensive natural language descriptions using

506

tens of thousands of words including adjectives, and also novel words that did not appear in the

507

training data set. Taken together, these quantitative and partly qualitative extensions

508

demonstrate the potential usability of our techniques to real-life applications such as

509

neuromarketing. In addition, the use of natural language annotations to capture perceptual

510

contents during movie viewing is an advantage of our decoding model. A recent study reported

511

that natural language annotations projected onto a word embedding space were effective for

512

characterizing movie-evoked brain responses varying across movie scenes (Vodrahalli et al.,

AC C

EP

TE D

501

27

ACCEPTED MANUSCRIPT

2016). Therefore, natural language annotations of natural movie scenes may play a key role in

514

our decoding technique, allowing us to effectively model the association between movie-evoked

515

complex perceptions and brain responses.

RI PT

513

516 517

Existing decoding models incorporate word variables that are roughly sorted into two

types: binary-based (Huth et al., 2016b) and feature-based (Pereira et al., 2011; Stansbury et al.,

519

2013). The binary-based variables are represented by vectors with values of one or zero as

520

categorical word labels corresponding to visual stimuli or objects and actions in viewed scenes.

521

The feature-based variables are represented by vectors that reflect the latent intermediate

522

representation of words learned from word statistics (typically, co-occurrence statistics) in

523

large-scale text data. For word-wise decoding, one of the most important characteristics of using

524

feature-based variables is to improve the model’s generalization capability (Pereira et al., 2011).

525

A feature-based variable model learns to associate brain responses with the dimensions of a

526

word feature space rather than with words per se; therefore, a label of every categorical word

527

the model used in a test phase does not need to be present during model training. The decoding

528

method introduced in the present study is an extension of such feature-based variable models.

529

Consistent with previous decoding studies (Pereira et al., 2016, 2011), our decoder showed high

530

decoding accuracy for the novel words that were absent in the training dataset (Figure 4),

531

indicating its generalization capability. Its generalization capability was also evident in the

532

decoding of complex perceptions represented by the natural language annotations and the large

533

vocabulary size (10,000 words). This suggests that our decoder is able to generalize more

534

comprehensively than previous models.

AC C

EP

TE D

M AN U

SC

518

28

ACCEPTED MANUSCRIPT

535 536

The feature-based variables we used were derived from a state-of-the-art NLP algorithm (skip-gram; Mikolov et al., 2013a). In contrast, previous feature-based variable models

538

incorporate conventional NLP algorithms, typically latent Dirichlet allocation (Blei et al., 2003).

539

The skip-gram vector representation shows higher performance than any other model

540

representation, not only in NLP tasks (Mikolov et al., 2013a, 2013b), but also in modeling brain

541

responses to natural stimuli (Güçlü and van Gerven, 2015; Nishida et al., 2015; but see Huth et

542

al., 2016a). Although we did not directly compare the decoding accuracy of the skip-gram

543

model with that of other NLP-based models using our data, this may make it easier to obtain the

544

complex association of brain activity with a larger set (tens of thousands) of words for

545

word-based decoding.

547

SC

M AN U

TE D

546

RI PT

537

Using such a large set of potential words for decoding allows us to estimate perceptual contents related not only to objects (nouns) and actions (verbs) but also to impressions

549

(adjectives). Strictly speaking, the word-wise ROC analysis indicated that some nouns, verbs,

550

and adjectives were not decodable (i.e., the AUC values were not significantly higher than 0.5;

551

Figure 3D). However, although a previous study used a similar word-wise ROC analysis for the

552

evaluation of word decodability (Huth et al., 2016b), our ROC analysis was more conservative

553

because our decoder covered a large language feature space, including a larger vocabulary and

554

richer scene annotations using natural language. This large feature space meant that the

555

occurrence of a specific word in the annotations was rarer, and thus our ROC analysis was

556

likely to have lower statistical power and produce lower AUC values. Although we used the

AC C

EP

548

29

ACCEPTED MANUSCRIPT

ROC analysis for compatibility with Huth et al. (2016b), the alternative word-wise correlation

558

analysis revealed that almost all (99.3%) of the words had correlation coefficients significantly

559

higher than 0 (Figure 3A). Therefore, we believe that our decoder successfully estimated many

560

words relevant to the perception of objects, actions, and impressions.

RI PT

557

561

Previous studies, in contrast, have primarily aimed to decode the words relevant to only

SC

562

object and action perceptions (Huth et al., 2016b; Pereira et al., 2011; Stansbury et al., 2013).

564

Impression perception, which is more subjective than object and action perception, may

565

considerably extend the possible applications of word-based decoding. A potential application

566

of impression decoding is to evaluate movie contents, such as TV commercials and promotional

567

films, on the basis of decoded contents. We believe that this application contributes to progress

568

in the field of neuromarketing (Plassmann et al., 2015).

TE D

570

The neuromarketing literature has suggested that a brain decoding framework potentially

EP

569

M AN U

563

offers more reliable information for predicting human economic behavior than conventional

572

questionnaire-based evaluations (Berns and Moore, 2012). Information decoded from brain

573

activation in a small group of people has been shown to be sufficient to predict mass behavior

574

such as music purchasing (Berns and Moore, 2012) and audience ratings of broadcast TV

575

content (Dmochowski et al., 2014). These studies were able to directly associate brain activity

576

with individuals’ economic preferences or behavioral tendencies. As an extension of this line of

577

research, our decoding technique is able to visualize the cognitive contents of naturalistic

578

experiences, including subjective impressions. Such a content-based estimation of cognitive

AC C

571

30

ACCEPTED MANUSCRIPT

579

information may provide a powerful framework to estimate economic behavior more

580

effectively.

582

RI PT

581 We found that the inter-individual variability in the decoded contents of each movie

scene was correlated with that in annotators’ and participants’ scene descriptions (Figure 5),

584

although we failed to distinguish one person’s subjective experience from another’s using our

585

decoding procedure (Figure S7). Note that even though the number of participants was

586

relatively small (n=6), we observed a significant relationship between the inter-individual

587

variability of decoded and annotated contents. This suggests that the inter-individual variability

588

in decoded contents is a robust measure to evaluate the inter-individual variability in perceptual

589

experiences. A previous report showed that activation patterns in the human higher-order visual

590

cortex were more variable across individuals when there was more inter-individual variability in

591

the familiarity of the viewed objects (Charest et al., 2014). This suggests that the

592

inter-individual variability of visual experiences is reflected in higher-order visual

593

representations in the brain. Consistent with this finding, our results indicate that the

594

inter-individual variability of semantic perceptions, even under natural visual conditions, can be

595

decoded from brain activity via word-based decoding. To the best of our knowledge, this is the

596

first direct evidence that word-based decoding can reflect the inter-individual variability in

597

visual experiences.

AC C

EP

TE D

M AN U

SC

583

598 599 600

What enabled our decoding results to reflect such inter-individual variability? A potentially important factor is that we used natural language annotations to evaluate the

31

ACCEPTED MANUSCRIPT

subjective perception of a scene (Figure 1A and Figure S1). The annotations contained rich

602

information that captured some aspects of the inter-individual variability in subjective

603

perceptions. It is also important that our model used an expressive word vector space that

604

enabled us to learn the projection from the representation in the rich annotations to the

605

representation in the brain. Our framework may provide a useful tool to study inter-individual

606

differences in perceptual/cognitive representations in the brain.

SC

607

The higher visual areas, including the fusiform gyrus, parahippocampal gyrus,

M AN U

608

RI PT

601

occipitotemporal areas, and posterior superior temporal sulcus, made the greatest contributions

610

to our decoding (Figure 6 and Figure S10). It could be argued that these areas did not involve

611

some parts of the semantic network, such as the prefrontal cortex, defined in previous studies

612

(Binder et al., 2009; Patterson et al., 2007). However, this is not surprising because the

613

contributing areas largely overlapped with those found in the previous study by Huth et al.

614

(2016b), in which decoding was conducted using an experimental paradigm highly similar to

615

ours. Both our experiments and those of Huth et al. (2016b) required participants to view

616

natural movies passively in the scanner without any behavioral tasks. In such a situation,

617

semantic processing relevant to visual processing may become dominant and involve higher

618

visual areas but few other semantic areas.

AC C

EP

TE D

609

619 620

Quantifying the inter-individual variability of perception may also offer potential

621

applications, particularly in neuromarketing. TV commercials are created, for instance, with the

622

intention to enhance consumers’ willingness to buy specific products or to improve a

32

ACCEPTED MANUSCRIPT

company’s brand image. To achieve this, consumers’ perceptions induced by commercial

624

contents should be focused toward a company’s intention, and thereby to reduce the variability

625

of inter-individual perceptions. In such a case, measuring the degree of inter-individual

626

perceptual variability may be useful to evaluate the effectiveness of TV commercials. Indeed, a

627

recent study reported that the favorability of viewers toward broadcast TV contents could be

628

predicted by the inter-individual variability of the viewers’ brain responses to the contents;

629

viewers were more favorable toward a given TV scene when the variability of responses to the

630

scene decreased (Dmochowski et al., 2014). Thus, again our decoding technique has the

631

potential to be used as a method in neuromarketing and may contribute to developments in

632

neuromarketing research.

M AN U

SC

RI PT

623

633

Although our word-wise decoding method remedies some of the technical issues of previous decoding methods, it still has several limitations. First, our method may produce poor estimates when words that are distant from each other in the skip-gram word vector space, such as "human" and "mountain," appear simultaneously in the annotations of a single scene. Because likely words are selected according to their distance from a single decoded scene vector in each scene, it is difficult for multiple words located far apart to be chosen together as likely words. This could be problematic for practical applications. A potential solution may be to estimate the likelihood of individual words independently of each other via a one-to-one association of words with brain responses. A binary-based variable model, rather than a feature-based model, may have the potential to achieve this; however, a previous binary-based variable model also showed poorer decoding performance as the number of object and action categories in a scene increased (Huth et al., 2016b). Addressing this issue will require a new modeling framework that combines the advantages of both types of variable modeling.
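To make this single-vector limitation concrete, the following minimal sketch (Python with NumPy; the function names, variable names, and toy vectors are illustrative rather than our actual implementation) scores candidate words by the correlation between each word vector and one decoded scene vector. Because all words are scored against the same vector, two words whose vectors point in nearly orthogonal directions cannot both receive high scores for the same scene.

```python
import numpy as np

def word_scores(decoded_scene_vec, word_vecs):
    """Score each candidate word by the Pearson correlation between its
    vector and a single decoded scene vector (one score per word)."""
    d = decoded_scene_vec - decoded_scene_vec.mean()
    W = word_vecs - word_vecs.mean(axis=1, keepdims=True)
    return (W @ d) / (np.linalg.norm(W, axis=1) * np.linalg.norm(d))

# Toy demonstration: two roughly orthogonal word vectors and a decoded
# scene vector pointing toward the first one; only one word scores highly.
rng = np.random.default_rng(0)
human = rng.standard_normal(100)
mountain = rng.standard_normal(100)   # nearly orthogonal to "human" by chance
decoded = human + 0.1 * rng.standard_normal(100)

print(word_scores(decoded, np.vstack([human, mountain])))  # high score, near-zero score
```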

Second, training our decoding model required a large amount (>2 hours) of fMRI data from each participant, which may impose too great a burden. For practical applications, the time that participants are constrained in the MRI scanner needs to be reduced. A possible solution is to transfer decoders across individuals using methods that align representational spaces across individual brains (Conroy et al., 2013; Haxby et al., 2011; Sabuncu et al., 2010; Yamada et al., 2015). The so-called "hyperalignment" technique (Guntupalli et al., 2016; Haxby et al., 2011) has recently attracted broad interest and has the potential to be used extensively in fMRI studies, including brain decoding (Haxby et al., 2014; Nishimoto and Nishida, 2016). Once a decoder has been constructed from abundant fMRI data for one person, considerably shorter experiments for other people may suffice to estimate their decoders by mapping the representational spaces of the subsequent participants onto that of the first participant. Accordingly, this approach may reduce the experimental burden on individual participants. This idea will be tested in future studies.
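As a rough illustration of this transfer idea, the sketch below uses an orthogonal Procrustes alignment (a simplified stand-in for the full hyperalignment procedure of Haxby et al., 2011, not that procedure itself) to map a new participant's responses into a reference participant's representational space so that the reference decoder can be reused. The array shapes, the shared feature dimensionality, and the scikit-learn-style `decoder_ref.predict` interface are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_alignment(resp_new, resp_ref):
    """Fit an orthogonal map R such that resp_new @ R approximates resp_ref.
    Both arrays hold responses (time points x features) to the same short
    alignment stimuli and are assumed here to have identical shapes."""
    R, _ = orthogonal_procrustes(resp_new, resp_ref)
    return R

def transfer_decode(decoder_ref, resp_new_test, R):
    """Project the new participant's test responses into the reference
    space and apply the reference participant's trained linear decoder."""
    return decoder_ref.predict(resp_new_test @ R)
```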

Third, our decoder estimates multiple words independently, which leaves the dependency relations between the words unclear. For instance, if the decoder returns the words "man," "car," and "hit," this could indicate either "A man hits a car" or "A car hits a man." Our current decoder cannot distinguish such different contents. A possible way to address this issue is to construct a decoding model that incorporates language features, including the dependency relations of words. An implicit implementation of such a model is a sentence-based decoder that learns the association between brain responses and sentences themselves. Toward this end, we recently demonstrated sentence-based decoding based on the feature representation of a state-of-the-art deep-learning NLP algorithm (Matsuo et al., 2016; Vinyals et al., 2014). Further developments in NLP models may help to improve decoding techniques that recover richer semantic information from brain activity.

5. Conclusions

Our new decoding framework incorporates a word embedding space and natural-language descriptions of scenes. The decoder successfully recovered movie-induced perceptions of objects, actions, and impressions, in the form of words, from fMRI signals. The decoder also showed remarkable generalizability by estimating words unseen during decoder training. Furthermore, the decoded contents reflected the inter-individual variability of movie-induced experiences. Because words express categorical and conceptual information in an interpretable manner, such word-based decoding should be effective for visualizing perceptual contents in the brain. We therefore believe that our decoding framework contributes to the technical development of brain reading for a range of scientific and practical applications.

Acknowledgments

This work was supported by grants from the Japan Society for the Promotion of Science (JSPS; KAKENHI 15K16017) to S. Nishida and by grants from the JSPS (KAKENHI 15H05311 and 15H05710), NTT DATA Corporation, and NTT DATA Institute of Management Consulting, Inc. to S. Nishimoto. We also thank Ryo Yano, Masataka Kado, Naoya Maeda, Takuya Ibaraki, and Ippei Hagiwara for their help with the movie materials and fMRI data collection.

References

Anderson, A.J., Binder, J.R., Fernandino, L., Humphries, C.J., Conant, L.L., Aguilar, M., Wang, X., Doko, D., Raizada, R.D.S., 2016. Predicting neural activity patterns associated with sentences using a neurobiologically motivated model of semantic representation. Cereb. Cortex. doi:10.1093/cercor/bhw240
Berns, G.S., Moore, S.E., 2012. A neural predictor of cultural popularity. J. Consum. Psychol. 22, 154–160. doi:10.1016/j.jcps.2011.05.001
Binder, J.R., Desai, R.H., Graves, W.W., Conant, L.L., 2009. Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb. Cortex 19, 2767–2796. doi:10.1093/cercor/bhp055
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
Chang, K.M.K., Mitchell, T., Just, M.A., 2011. Quantitative modeling of the neural representation of objects: How semantic feature norms can account for fMRI activation. Neuroimage 56, 716–727. doi:10.1016/j.neuroimage.2010.04.271
Charest, I., Kievit, R.A., Schmitz, T.W., Deca, D., Kriegeskorte, N., 2014. Unique semantic space in the brain of each beholder predicts perceived similarity. Proc. Natl. Acad. Sci. U. S. A. 111, 14545–14570. doi:10.1073/pnas.1402594111
Conroy, B.R., Singer, B.D., Guntupalli, J.S., Ramadge, P.J., Haxby, J.V., 2013. Inter-subject alignment of human cortical anatomy using functional connectivity. Neuroimage 81, 400–411. doi:10.1016/j.neuroimage.2013.05.009
Çukur, T., Nishimoto, S., Huth, A.G., Gallant, J.L., 2013. Attention during natural vision warps semantic representation across the human brain. Nat. Neurosci. 16, 763–770. doi:10.1038/nn.3381
Dale, A.M., Fischl, B., Sereno, M.I., 1999. Cortical surface-based analysis. I. Segmentation and surface reconstruction. Neuroimage 9, 179–194. doi:10.1006/nimg.1998.0395
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407.
Dmochowski, J.P., Bezdek, M.A., Abelson, B.P., Johnson, J.S., Schumacher, E.H., Parra, L.C., 2014. Audience preferences are predicted by temporal reliability of neural processing. Nat. Commun. 5, 1–9. doi:10.1038/ncomms5567
Downing, P.E., Jiang, Y., Shuman, M., Kanwisher, N., 2001. A cortical area selective for visual processing of the human body. Science 293, 2470–2473. doi:10.1126/science.1063414
Epstein, R., Kanwisher, N., 1998. A cortical representation of the local visual environment. Nature 392, 598–601. doi:10.1038/33402
Fischl, B., Sereno, M.I., Dale, A.M., 1999. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. Neuroimage 9, 195–207. doi:10.1006/nimg.1998.0396
Güçlü, U., van Gerven, M.A.J., 2015. Semantic vector space models predict neural responses to complex visual stimuli. arXiv preprint arXiv:1510.04738.
Guntupalli, J.S., Hanke, M., Halchenko, Y.O., Connolly, A.C., Ramadge, P.J., Haxby, J.V., 2016. A model of representational spaces in human cortex. Cereb. Cortex 26, 2919–2934. doi:10.1093/cercor/bhw068
Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.D., Blankertz, B., Bießmann, F., 2014. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 87, 96–110. doi:10.1016/j.neuroimage.2013.10.067
Haxby, J.V., Guntupalli, J.S., Connolly, A.C., Halchenko, Y.O., Conroy, B.R., Gobbini, M.I., Hanke, M., Ramadge, P.J., 2011. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron 72, 404–416. doi:10.1016/j.neuron.2011.08.026
Haxby, J.V., Connolly, A.C., Guntupalli, J.S., 2014. Decoding neural representational spaces using multivariate pattern analysis. Annu. Rev. Neurosci. 37, 435–456. doi:10.1146/annurev-neuro-062012-170325
Haynes, J.D., Rees, G., 2006. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci. 7, 523–534.
Horikawa, T., Tamaki, M., Miyawaki, Y., Kamitani, Y., 2013. Neural decoding of visual imagery during sleep. Science 340, 639–642. doi:10.1126/science.1234330
Huth, A.G., de Heer, W.A., Griffiths, T.L., Theunissen, F.E., Gallant, J.L., 2016a. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458. doi:10.1038/nature17637
Huth, A.G., Lee, T., Nishimoto, S., Bilenko, N.Y., Vu, A.T., Gallant, J.L., 2016b. Decoding the semantic content of natural movies from human brain activity. Front. Syst. Neurosci. 10, 1–16. doi:10.3389/fnsys.2016.00081
Huth, A.G., Nishimoto, S., Vu, A.T., Gallant, J.L., 2012. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224. doi:10.1016/j.neuron.2012.10.014
Kanwisher, N., McDermott, J., Chun, M.M., 1997. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J. Neurosci. 17, 4302–4311.
Kay, K.N., Naselaris, T., Prenger, R.J., Gallant, J.L., 2008. Identifying natural images from human brain activity. Nature 452, 352–355. doi:10.1038/nature06713
Lafferty, J., McCallum, A., Pereira, F., 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289.
Malach, R., Reppas, J.B., Benson, R.R., Kwong, K.K., Jiang, H., Kennedy, W.A., Ledden, P.J., Brady, T.J., Rosen, B.R., Tootell, R.B., 1995. Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proc. Natl. Acad. Sci. U. S. A. 92, 8135–8139.
Matsuo, E., Kobayashi, I., Nishimoto, S., Nishida, S., Asoh, H., 2016. Generating natural language descriptions for semantic representations of human brain activity, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). pp. 22–29. doi:10.18653/v1/P16-3004
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013b. Efficient estimation of word representations in vector space, in: ICLR Workshop.
Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K., Malave, V.L., Mason, R.A., Just, M.A., 2008. Predicting human brain activity associated with the meanings of nouns. Science 320, 1191–1195. doi:10.1126/science.1152876
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H.C., Sadato, N., Kamitani, Y., 2008. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron 60, 915–929. doi:10.1016/j.neuron.2008.11.004
Moeller, S., Yacoub, E., Olman, C.A., Auerbach, E., Strupp, J., Harel, N., Uğurbil, K., 2010. Multiband multislice GE-EPI at 7 tesla, with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magn. Reson. Med. 63, 1144–1153. doi:10.1002/mrm.22361
Naselaris, T., Olman, C.A., Stansbury, D.E., Ugurbil, K., Gallant, J.L., 2015. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. Neuroimage 105, 215–228. doi:10.1016/j.neuroimage.2014.10.018
Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M., Gallant, J.L., 2009. Bayesian reconstruction of natural images from human brain activity. Neuron 63, 902–915. doi:10.1016/j.neuron.2009.09.006
Nishida, S., Huth, A.G., Gallant, J.L., Nishimoto, S., 2015. Word statistics in large-scale texts explain the human cortical semantic representation of objects, actions, and impressions. Soc. Neurosci. Abstr. 45, 333.13.
Nishimoto, S., Nishida, S., 2016. Lining up brains via a common representational space. Trends Cogn. Sci. 20, 565–567. doi:10.1016/j.tics.2016.06.001
Nishimoto, S., Vu, A.T., Naselaris, T., Benjamini, Y., Yu, B., Gallant, J.L., 2011. Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. 21, 1641–1646. doi:10.1016/j.cub.2011.08.031
Patterson, K., Nestor, P.J., Rogers, T.T., 2007. Where do you know what you know? The representation of semantic knowledge in the human brain. Nat. Rev. Neurosci. 8, 976–987. doi:10.1038/nrn2277
Pennington, J., Socher, R., Manning, C.D., 2014. GloVe: Global vectors for word representation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1532–1543.
Pereira, F., Detre, G., Botvinick, M., 2011. Generating text from functional brain images. Front. Hum. Neurosci. 5, 72. doi:10.3389/fnhum.2011.00072
Pereira, F., Lou, B., Pritchett, B., Kanwisher, N., Botvinick, M., Fedorenko, E., 2016. Decoding of generic mental representations from functional MRI data using word embeddings. bioRxiv. doi:10.1101/057216
Plassmann, H., Venkatraman, V., Huettel, S., Yoon, C., 2015. Consumer neuroscience: Applications, challenges, and possible solutions. J. Mark. Res.
Rehurek, R., Sojka, P., 2010. Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50.
Sabuncu, M.R., Singer, B.D., Conroy, B., Bryan, R.E., Ramadge, P.J., Haxby, J.V., 2010. Function-based intersubject alignment of human cortical anatomy. Cereb. Cortex 20, 130–140. doi:10.1093/cercor/bhp085
Sakahara, M., Okada, S., Nitta, K., 2014. Domain-independent unsupervised text segmentation for data management, in: IEEE International Conference on Data Mining Workshop. pp. 481–487. doi:10.1109/ICDMW.2014.118
Stansbury, D.E., Naselaris, T., Gallant, J.L., 2013. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron 79, 1025–1034. doi:10.1016/j.neuron.2013.06.034
Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.
Vodrahalli, K., Chen, P.-H., Liang, Y., Baldassano, C., Chen, J., Yong, E., Honey, C., Hasson, U., Ramadge, P., Norman, K., Arora, S., 2016. Mapping between fMRI responses to movies and their natural language annotations. arXiv preprint arXiv:1610.03914.
Wang, Z., Ittycheriah, A., 2015. FAQ-based question answering via word alignment. arXiv preprint arXiv:1507.02628.
Yamada, K., Miyawaki, Y., Kamitani, Y., 2015. Inter-subject neural code converter for visual image representation. Neuroimage 113, 289–297. doi:10.1016/j.neuroimage.2015.03.059
Yang, Y., Wang, J., Bailer, C., Cherkassky, V., Just, M.A., 2017. Commonality of neural representations of sentences across languages: Predicting brain activation during Portuguese sentence comprehension using an English-based model of brain function. Neuroimage 146, 658–666. doi:10.1016/j.neuroimage.2016.10.029

Figure legends

Figure 1 | Schematic of the word-based decoding using word vector representations
Our word-based decoding can be divided into training (A) and decoding (B) phases. The objective of the training phase is to train a linear model that estimates the vector representations corresponding to movie scenes (scene vectors) from the cortical activity evoked by the scenes. The cortical activity was measured using fMRI while participants viewed TV commercials. The scene vectors were obtained from natural-language scene annotations by projecting them into a high-dimensional word vector space (annotation scene vectors). This space was constructed in advance from Wikipedia text corpus data using a statistical learning method (skip-gram; see Material and methods for details). Then, the fMRI responses of individual voxels were used to fit linear weights to the scene vectors using regularized linear regression. In the decoding phase, likely words for each movie scene were estimated in a test dataset different from the one used for model training. Scene vectors were computed from the evoked cortical activity via the trained model (decoded scene vectors). Likely words for a given scene were determined on the basis of the correlation coefficients between the decoded scene vector and individual word vectors (word scores). A word with a higher score was regarded as more likely. © NTT DATA Corp. All rights reserved.
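For readers who prefer code, the following minimal sketch outlines the two phases described in this legend, using scikit-learn ridge regression as the regularized linear model. The variable names, array shapes, and fixed regularization parameter are illustrative assumptions, not our actual analysis code.

```python
import numpy as np
from sklearn.linear_model import Ridge

def train_decoder(R_train, S_train, alpha=1.0):
    """Training phase: fit a regularized linear map from voxel responses
    (n_scenes x n_voxels) to annotation scene vectors (n_scenes x dim)."""
    model = Ridge(alpha=alpha)
    model.fit(R_train, S_train)
    return model

def decode_word_scores(model, R_test, W):
    """Decoding phase: estimate scene vectors for test responses and score
    every vocabulary word (W: n_words x dim) by its correlation with each
    decoded scene vector."""
    S_hat = model.predict(R_test)                        # decoded scene vectors
    S_z = (S_hat - S_hat.mean(1, keepdims=True)) / S_hat.std(1, keepdims=True)
    W_z = (W - W.mean(1, keepdims=True)) / W.std(1, keepdims=True)
    return (S_z @ W_z.T) / W.shape[1]                    # (n_test_scenes, n_words)
```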

Figure 2 | Word-based decoding in three example scenes
(A) The most likely words decoded from the brain activity of a single participant (P1) in an example movie scene. Up to eight likely words are shown separately for each category: nouns, verbs, and adjectives. A more reddish color indicates that the word is more likely, namely that the word has a higher word score (see color bar). Words with a score lower than 0.2 are not shown. The words denoted by asterisks (*) are novel words that were absent in the training dataset. © NTT DATA Corp. All rights reserved.
(B, C) The most likely words decoded from two participants (B, P6; C, P4) for a single scene. © Shiseido Japan Co. Ltd. All rights reserved.

Figure 3 | Word-level evaluation of decoding accuracy
(A) Word-wise correlation. The distribution of correlations for individual words is shown separately for nouns (left), verbs (middle), and adjectives (right). The word-wise correlation reflects how the distance between individual word vectors and the decoded scene vectors co-varied with the distance between the individual word vectors and the annotation scene vectors (see Material and methods for details). Filled bars indicate the words with correlation coefficients significantly higher than zero (p < 0.05). The median correlations were significantly higher than zero (vertical lines) for all parts of speech (Wilcoxon signed-rank test, p < 0.0001).
(B) The relation between word scores and the presence in the scene annotations for the example word "Building." The trace shows the word scores for "Building" as a function of the time in the test movie. Shaded areas indicate the presence of "Building" in individual scenes. The color of the shading denotes the number of times a word appeared for each scene (green, 1 annotation; blue, 2 annotations; red, ≥3 annotations).
(C) ROC curves for "Building." The ROC curves reflect how well the word scores predict the presence of the word in the scene annotations (see Material and methods for details). The color of the curves differentiates the thresholds used to determine whether the word was present or absent in each scene (green, 1 annotation; blue, 2 annotations; red, 3 annotations). The AUC values of all of the curves (bottom right) were significantly higher than 0.5 (randomization test, p < 0.001).
(D) Word-wise ROC for all words. The distribution of the word-wise AUC values is shown separately for nouns (left), verbs (middle), and adjectives (right). Filled bars indicate the words with AUC values significantly different from 0.5 (randomization test, p < 0.05). The median AUC values were significantly higher than 0.5 (vertical lines) for all parts of speech (Wilcoxon signed-rank test, p < 0.001).
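As a concrete illustration of how a word-wise AUC value in panels C and D can be computed, the sketch below (Python with scikit-learn; the arrays, function name, and thresholding scheme are illustrative assumptions rather than the exact analysis code) scores one word by how well its per-scene word scores separate scenes in which it was annotated at least a threshold number of times from the remaining scenes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def word_auc(scores, n_annotations, threshold=1):
    """AUC for one word: how well its per-scene word scores separate scenes
    annotated with the word at least `threshold` times from the other scenes."""
    present = (np.asarray(n_annotations) >= threshold).astype(int)
    return roc_auc_score(present, scores)

# Toy usage: word scores for six scenes and the corresponding annotation counts.
print(word_auc([0.40, 0.10, 0.35, -0.05, 0.50, 0.00], [2, 0, 1, 0, 3, 0]))
```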

Figure 4 | Decoding accuracy for words that did not appear in the training phase
Our model could recover the perceptual contents associated with novel words that had never appeared in the training dataset. The word-wise correlations (A) and the word-wise AUC values (B) were calculated using only novel words. The present/absent threshold of the AUC was 1 annotation. Filled bars indicate the words with significant correlation coefficients or AUC values (p < 0.05). Both median values were significant (Wilcoxon signed-rank test, p < 0.0001).

Figure 5 | Inter-individual variability of decoder-estimated and annotation-derived scene contents
(A) The temporal profile of the pairwise scene-vector distance between individuals. The pairwise distance was computed as the correlation distance between scene vectors for all possible pairs of the scene annotators or the experimental participants. The pairwise distances between the scene vectors derived from the decoder estimation (black), the annotations from the scene annotators (pink), and the annotations from the experimental participants (blue) were averaged across all pairs and transformed into z scores. The mean pairwise distance for each type of scene vector is shown as a function of the time it appeared in the test movie.
(B) The correlation between the decoded and annotation pairwise distances. Each dot represents the mean pairwise distance between the decoded (x-axis) and annotation (y-axis) scene vectors in each movie scene. The color of the dots differentiates the annotation scene vectors derived from the annotators (pink) and the experimental participants (blue). The pairwise distance for both annotations showed a significant correlation with the pairwise distance for the decoded scene vectors (Pearson's r = 0.27, p < 0.001).
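For concreteness, the following sketch (Python with NumPy/SciPy; the array layout and function name are illustrative assumptions) computes the quantity plotted in panel A before z-scoring: the correlation distance between individuals' scene vectors, averaged over all pairs of individuals, for each scene.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(scene_vectors):
    """scene_vectors: (n_individuals, n_scenes, dim) array of scene vectors,
    either decoded from brain activity or derived from annotations.
    Returns, for each scene, the correlation distance averaged over all
    pairs of individuals (panel A plots the z-scored version of this)."""
    n_scenes = scene_vectors.shape[1]
    return np.array([pdist(scene_vectors[:, s, :], metric="correlation").mean()
                     for s in range(n_scenes)])
```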

Figure 6 | Important cortical areas for decoding
To specify the cortical areas that made the greatest contributions to our decoding, the maximum absolute weights of individual voxels in the decoding model were projected onto the cortical surface of a single participant (P1). The weights were z-scored. Brighter locations in the surface maps indicate voxels that have larger weights. Only those with weights above 1 SD are shown. LH, left hemisphere; RH, right hemisphere; A, anterior; P, posterior.
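A minimal sketch of the voxel-importance summary described in this legend (Python with NumPy; the weight-matrix orientation and the masking convention are illustrative assumptions): take each voxel's maximum absolute decoding weight across scene-vector dimensions, z-score the values across voxels, and keep only those above 1 SD for display.

```python
import numpy as np

def voxel_importance(weights, sd_threshold=1.0):
    """weights: (n_voxels, dim) linear decoding weights. Returns each voxel's
    z-scored maximum absolute weight, masking sub-threshold voxels (NaN values
    correspond to voxels omitted from the surface map)."""
    max_abs = np.abs(weights).max(axis=1)
    z = (max_abs - max_abs.mean()) / max_abs.std()
    return np.where(z > sd_threshold, z, np.nan)
```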
