Short- and long-term reliability of language fMRI

Accepted Manuscript

Charlotte Nettekoven, Nicola Reck, Roland Goldbrunner, Christian Grefkes, Carolin Weiß Lucas

PII: S1053-8119(18)30360-4
DOI: 10.1016/j.neuroimage.2018.04.050
Reference: YNIMG 14897
To appear in: NeuroImage
Received Date: 19 October 2017
Revised Date: 23 March 2018
Accepted Date: 22 April 2018

Please cite this article as: Nettekoven, C., Reck, N., Goldbrunner, R., Grefkes, C., Weiß Lucas, C., Short- and long-term reliability of language fMRI, NeuroImage (2018), doi: 10.1016/j.neuroimage.2018.04.050. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Short- and long-term reliability of language fMRI

NETTEKOVEN Charlotte1,2*, RECK Nicola1*, GOLDBRUNNER Roland1, GREFKES Christian2,3, WEIß LUCAS Carolin1

*These authors contributed equally to the manuscript (shared first authorship)

AFFILIATIONS

1. Department of General Neurosurgery, Cologne University Hospital, 50924 Cologne, Germany
2. Department of Neurology, Cologne University Hospital, 50924 Cologne, Germany
3. Institute of Neuroscience and Medicine (INM-3), Juelich Research Centre, 52428 Juelich, Germany

CORRESPONDING AUTHOR

Name: Carolin Weiß Lucas, MD
Address: Department of General Neurosurgery, Uniklinik Koeln, Kerpener Straße 62, 50924 Koeln, Germany
Telephone Number: +49 (0)221 478 88937
Email: [email protected]

ABBREVIATIONS: ANOVA – analysis of variance, BA – Brodmann area, BOLD – blood-oxygen level dependent, CoG – center of gravity, ED – Euclidean distance, EPI – echo planar imaging, FDR – false discovery rate, fMRI – functional magnetic resonance imaging, FOV – field of view, FWE – family wise error, GLM – general linear model, ICC – intraclass correlation coefficient, IFG – inferior frontal gyrus, LI – laterality index, M1 – primary motor cortex, ROI – region of interest, STG – superior temporal gyrus, TA – time of acquisition, TE – echo time, TR – repetition time

Abstract

When using functional magnetic resonance imaging (fMRI) for mapping important language functions, a high test-retest reliability is mandatory, both in basic scientific research and for clinical applications. We, therefore, systematically tested the short- and long-term reliability of fMRI in a group of healthy subjects using a picture naming task and a sparse-sampling fMRI protocol. We hypothesized that test-retest reliability might be higher for (i) speech-related motor areas than for other language areas and for (ii) the short as compared to the long intersession interval.

16 right-handed subjects (mean age: 29 years) participated in three sessions separated by 2-6 (session 1 and 2, short-term) and 21-34 days (session 1 and 3, long-term). Subjects were asked to perform the same overt picture naming task in each fMRI session (50 black-white images per session). Reliability was tested using the following measures: (i) Euclidean distances (ED) between local activation maxima and Centers of Gravity (CoGs), (ii) overlap volumes and (iii) voxel-wise intraclass correlation coefficients (ICCs). Analyses were performed for three regions of interest which were chosen based on whole-brain group data: primary motor cortex (M1), superior temporal gyrus (STG) and inferior frontal gyrus (IFG).

Our results revealed that the activation centers were highly reliable, independent of the time interval, ROI or hemisphere, with significantly smaller ED for the local activation maxima (6.45 ± 1.36 mm) as compared to the CoGs (8.03 ± 2.01 mm). In contrast, the extent of activation revealed rather low reliability values with overlaps ranging from 24% (IFG) to 56% (STG). Here, the left hemisphere showed significantly higher overlap volumes than the right hemisphere. Although mean ICCs ranged between poor (ICC<0.5) and moderate (ICC 0.5-0.74) reliability, highly reliable voxels (ICC>0.75) were found for all ROIs. Voxel-wise reliability of the different ROIs was influenced by the intersession interval.

Taken together, we showed that, despite considerable ROI-dependent variations of the extent of activation over time, highly reliable centers of activation can be identified using an overt picture naming paradigm.

Introduction

The current gold standard and most accurate method to reliably identify the cortical representations of language function is direct cortical stimulation during awake craniotomy (Szelényi et al., 2010). Given the invasiveness of this method and other limitations, alternative approaches for cortical mapping of the human brain have been developed. Functional magnetic resonance imaging (fMRI) is currently one of the most frequently used methods for mapping the human brain. Achieving a high test-retest reliability of fMRI mappings is important not only for use in healthy subjects (basic scientific research), but also for presurgical mappings of patients with brain tumors involving eloquent cortex regions. Complete tumor resection while preserving the ability to speak is a major objective of the presurgical mapping of language areas in the tumor neighborhood (Stippich et al., 2007; Tyndall et al., 2017). Compared to the motor cortex, which can be reliably mapped with fMRI (Bennett and Miller, 2010, 2013; Kristo et al., 2014; Quiton et al., 2014; Weiss et al., 2013), the mapping of cortical areas responsible for speech and language functions is, however, much more challenging due to the complexity of the language network (Friederici, 2011). Previous studies have shown quite heterogeneous results, suggesting that the test-retest reliability of fMRI of language processing and speech production varies between poor and high (Bennett and Miller, 2010, 2013; Brannen et al., 2001; Eaton et al., 2008; Fernandez et al., 2003; Rutten et al., 2002). The high variability between studies may be due to differences in fMRI designs (block design vs. sparse sampling), tasks, time intervals, subjects/patients or outcome parameters, which make it difficult to compare results (Billingsley-Marshall et al., 2004).

In recent years, language-related areas have been intensively investigated with fMRI using a picture naming task, which also represents the most widely used task during intraoperative mappings (Freyschlag and Duffau, 2014) and has more recently also been used for non-invasive preoperative mappings by transcranial magnetic stimulation (Lefaucheur and Picht, 2016). However, to the best of our knowledge, the short- and long-term test-retest reliability of language fMRI has not been systematically tested to date. We, therefore, designed a serial fMRI study to compare the short- and long-term reliability using a picture naming task in a group of healthy subjects, with intersession intervals of 2-6 (short-term) and 21-34 days (long-term). In contrast to most of the studies using fMRI to investigate language function, we employed overt instead of covert speech, which makes it possible to control for correct task performance. This might be of high importance when, e.g., studying patients with aphasia (Wilson et al., 2017). However, overt speech can induce a considerable amount of motion artifacts causing spurious or missing brain activity (Eden et al., 1999; Gracco et al., 2005). Aiming at reliable fMRI results, we, thus, used a sparse-sampling volume acquisition design. This approach reduces the influence of movement artifacts by means of pauses in volume acquisition during the task execution period. Moreover, scanner noise and, thus, auditory interference and unrelated alertness effects can be reduced (Amaro et al., 2002; Langers and van Dijk, 2011; Yakunina et al., 2015).

To quantify the test-retest reliability, we applied frequently used measures of reliability such as the laterality index (LI), Euclidean distances between the cluster centers (i.e., local activation maxima and centers of gravity [CoGs]) and the overlap volumes of the entire cluster between sessions (Dice coefficient) as well as the voxel-wise intraclass correlation coefficient (ICC) (Bennett and Miller, 2010, 2013; Morrison et al., 2016). In order to keep the degrees of freedom in the reliability analysis to the necessary minimum (Bennett and Miller, 2013) and to avoid biasing language-network-related results by inclusion of distinct co-activated regions such as visual areas (Wilson et al., 2017), the primary analysis was restricted to three regions of interest (ROIs) based on the group analysis. However, key results were confirmed by using a whole-brain approach.

We hypothesized that the test-retest reliability might depend on the ROIs, i.e., the anatomical location. Here, we expected a higher reliability for speech-related primary motor areas than for other language areas (i.e., superior temporal gyrus, inferior frontal gyrus) as previously reported by other studies (Bennett and Miller, 2010, 2013; Morrison et al., 2016; Stevens et al., 2016). Furthermore, reliability was assumed to be influenced by the intersession interval with better reproducibility for the short as compared to the long intersession interval (Bennett and Miller, 2010, 2013).

Material and methods

Subjects and study design

We examined 16 healthy, right-handed subjects (9 female, mean age: 29 years, range: 24-40 years) with no prior history of neurological diseases or MR contraindications. All subjects were German native speakers, except for one subject who had excellent German language skills (mother-tongue equivalent). Moreover, all subjects were highly educated (students or graduates). The study was carried out according to the Declaration of Helsinki (1964, according to the revision of 2008) and was approved by the local ethics committee. Written informed consent was obtained prior to the first measurement. All subjects participated in three sessions: session 1, session 2 (2-6 days after session 1, short-term), session 3 (21-34 days after session 1, long-term). In each session, fMRI language mapping was performed by using a picture naming task.

fMRI acquisition

The fMRI task in our study was designed to detect the activation patterns associated with speech production. Therefore, the subjects were asked to name objects which were presented as black-white drawings by saying aloud a whole sentence introduced by the phrase “That is a/an…” (German: “Das ist ein/eine…”). The objects were displayed on a video screen, which was visible through a mirror attached to the MR head coil, and were taken from daily life scenarios including non-living objects as well as animals, plants and food. The pictures were composed of objects taken from the commercial software Nexspeech (Nexstim Ltd., Finland) as well as modified from the Snodgrass and Vanderwart Inventory (Snodgrass and Vanderwart, 1980). In each session, the same 50 pictures were presented in a randomized order. Every picture was presented for 3 s before being replaced by a black screen. Image acquisition started between 1000 and 3500 ms (jittered) after the picture naming. Pictures were presented on average every 11 s. Approximately 30% of the events were “null-events”, where a black screen was displayed instead of a picture, serving as a “resting baseline” during which the subjects were instructed to remain as motionless as possible. The whole fMRI experiment lasted ~20 min (raw acquisition time: 14.5 min), thus staying within a reasonable time frame with regard to attention span and alertness, particularly with respect to clinical applications. All subjects were trained for correct task performance outside the scanner as well as inside the scanner immediately before each session. Furthermore, subjects were advised to avoid unnecessary movements in the scanner. The performance of all subjects was recorded with Audacity (www.audacityteam.org) via an intercom connected to an external computer outside the scanner room.

A major problem of imaging overt speech, however, lies in the large number of imaging artifacts induced by head movements (Huang et al., 2002; Kemeny et al., 2005). We, thus, made use of an event-related design using a “sparse-sampling” protocol as previously described by our group (see e.g. Volz et al., 2015 for details). Here, the actual task and the image acquisition are decoupled, i.e., the image was acquired after the end of the task performance. Sparse-sampling makes use of the relatively long delay between a neural event and the evoked hemodynamic response (maximum approximately after 5-8 s).

Varying the time between the task and image acquisition (jitter) enables a better sampling of the hemodynamic response (Amaro et al., 2002; Dresel et al., 2005). Thereby, head movement artifacts as well as acoustic contamination in the imaging data can be minimized. Nevertheless, compared to a classical block design, sparse-sampling has the disadvantage of a reduced statistical power due to the lower number of images per condition, limiting its application depending on the task used to induce neural activity (Volz et al., 2015). A reduced number of images per condition can only be compensated by extending scanning times. Furthermore, less information about the time course of the response is available due to the long TR (Peelle, 2014). However, this lack of information can be partly ameliorated by varying the delay between the stimulus and the EPI acquisition (jitter; Robson et al., 1998; 1998; Belin et al., 1999) as applied in this study (see upper section). The advantages of sparse-sampling seem to outweigh its disadvantages, since it is recommended for use in language fMRI (Peelle, 2014).

fMRI was performed on a 3T Scanner (Trio, Siemens, Erlangen, Germany) using a gradient-echo planar imaging (EPI) sequence sensitive to detect blood-oxygenation level dependent (BOLD) signal changes in tissue contrast with the following imaging parameters: repetition time (TR) = 11000 ms, time of acquisition (TA) = 2000 ms, echo time (TE) = 30 ms, flip angle = 90°, voxel size = 3.0 × 3.0 × 3.0 mm³, field of view (FOV) = 192 mm², 30 slices (whole brain), 79 EPI volumes per session. In a separate session (before the EPI acquisition), high-resolution anatomical images were acquired for all subjects using a 3D T1 sequence with the following parameters: TR = 2250 ms, TE = 3.93 ms, FOV = 256 mm, 176 sagittal slices, voxel size = 1.0 × 1.0 × 1.0 mm³.
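As a quick plausibility check on these parameters, the short Python sketch below recomputes the raw acquisition time and the scanner-silent interval per repetition from the TR, TA and volume count quoted above. The function names are illustrative only, and treating TR minus TA as the window available for stimulus presentation and overt naming is an assumption about the sparse-sampling design rather than a statement taken from the protocol.

```python
# Acquisition parameters quoted in the text.
TR_S = 11.0      # repetition time in seconds (11000 ms)
TA_S = 2.0       # time of acquisition per volume in seconds (2000 ms)
N_VOLUMES = 79   # EPI volumes per session


def raw_acquisition_minutes(n_volumes: int, tr_s: float) -> float:
    """Total scan duration in minutes for n_volumes acquired once per TR."""
    return n_volumes * tr_s / 60.0


def silent_gap_seconds(tr_s: float, ta_s: float) -> float:
    """Interval per TR without gradient noise (assumed: TR minus TA)."""
    return tr_s - ta_s


print(round(raw_acquisition_minutes(N_VOLUMES, TR_S), 2))  # 14.48
print(silent_gap_seconds(TR_S, TA_S))                       # 9.0
```

With 79 volumes and a TR of 11 s this gives roughly 14.5 min, which agrees with the raw acquisition time stated for the experiment.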

fMRI preprocessing and ROI-selection

The fMRI data were analyzed using the Statistical Parametric Mapping software package (SPM 8; Wellcome Department of Imaging Neuroscience, London, UK, http://www.fil.ion.ucl.ac.uk) implemented in Matlab (version 2014a, The MathWorks Inc., Natick, MA, USA). The first two volumes (“dummy” images) of each session were discarded from further analyses to allow for magnetic field saturation. After realignment of the remaining EPI volumes to the mean image of each time series and co-registration with the structural T1-weighted image, all images were spatially normalized to the standard template of the MNI using the unified segmentation approach (Ashburner and Friston, 2005) and smoothed using an isotropic Gaussian kernel of 8 mm full-width at half-maximum.

Next, the statistical analysis was performed. The time series of each voxel were high-pass-filtered at 1/128 Hz. The six head motion parameters, as assessed by the realignment algorithm, were treated as covariates to remove movement-related variance from the image time series. Simple main effects were calculated for each subject by applying appropriate baseline contrasts (i.e., picture naming vs. “rest”). Image quality metrics, i.e. the six realignment parameters as well as the temporal signal-to-noise ratio (tSNR), were compared between sessions to rule out confounding factors on reliability. The tSNR was calculated using the formula

$$\mathrm{tSNR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mu_i}{\sigma_i}$$

where $n$ is the number of voxels, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the signal within the ith voxel across time (Gorgolewski et al., 2013). The average was taken across all voxels within the brain mask as implemented in SPM.

For the group analysis, a full-factorial analysis of variance (ANOVA) with the factor SESSION (3 levels) was computed using a random effects model. Voxels passing a threshold of p<0.05, FWE-corrected at the voxel level, were considered as significantly activated (Figure 1, see also Suppl. Figure 1 for session 1-3). Moreover, we calculated probabilistic overlap maps depicting the number of subjects with a significant activity (p<0.05 FWE-corrected) at that voxel (Fedorenko et al., 2010; Suppl. Figure 2).

Figure 1: Group analysis (second level; conjunction session 1-3; p<0.05, FWE-corrected at the voxel-level). Strongest activation, except for the visual cortex, was detected for the clusters of choice (IFG, STG, M1), which were considered for further ROI analysis.
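The tSNR measure used as an image quality metric in the preprocessing section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPM-based implementation used in the study: each voxel is represented as a plain list of signal values over time, the sample standard deviation (n - 1 denominator) is assumed, voxels without temporal variance are skipped, and the names `voxel_tsnr` and `mean_tsnr` are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def voxel_tsnr(timeseries: Sequence[float]) -> float:
    """Temporal SNR of one voxel: mean over time divided by SD over time."""
    sigma = stdev(timeseries)  # assumption: sample SD (n - 1 denominator)
    if sigma == 0:
        raise ValueError("constant time series: tSNR is undefined")
    return mean(timeseries) / sigma


def mean_tsnr(voxels: Sequence[Sequence[float]]) -> float:
    """Average the voxel-wise tSNR over all voxels of a (hypothetical) brain mask."""
    values = []
    for ts in voxels:
        if len(set(ts)) > 1:  # assumption: skip voxels without temporal variance
            values.append(voxel_tsnr(ts))
    if not values:
        raise ValueError("no voxel with non-zero temporal variance")
    return mean(values)


# Toy mask with two voxels and three time points each (made-up numbers).
toy_mask = [[9.0, 10.0, 11.0], [19.0, 20.0, 21.0]]
print(mean_tsnr(toy_mask))
```

Computing one such value per session and subject yields the quantities that can then be compared between sessions.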

According to the results of the group analysis (conjunction analysis of session 1-3; p<0.05, FWE-corrected, Figure 1) and the probabilistic overlap maps, we restricted the reliability analyses to the following ROIs: (i) superior temporal gyrus (STG, Brodmann area [BA] 22, commonly referred to as Wernicke’s area); (ii) inferior frontal gyrus (IFG, BA 44 and 45, commonly referred to as Broca’s area), i.e. the two major regions associated with language processing; and (iii) primary motor cortex (M1, lateral precentral gyrus, BA 4). MNI coordinates of the ROIs are shown in Table 1. We investigated both hemispheres, since the language-critical cortex frequently also includes right-hemispheric regions (Price, 2012). We did not include other areas which were also activated by the picture naming task (e.g. the visual cortex, Figure 1) in order to avoid overestimation of reliability by including robust regions which are not primarily involved in speech production.

Table 1: MNI coordinates (x, y, z) of the local activation maxima derived from the group analysis (all subjects and sessions)

ROI          MNI x    MNI y    MNI z    t-value
IFG left     -52.5    19.5     4.5      4.87
IFG right    48       25.5     0        5.14
STG left     -58.5    -33      10.5     9.53
STG right    63       -33      3        9.75
M1 left      -46.5    -9       34.5     15.91
M1 right     49.5     -9       31.5     15.01

For single subject analysis, voxels were considered significantly activated when passing a certain threshold. The following thresholds were applied at the voxel level for the respective ROI to ensure sufficient activation in each subject: p<0.05, family-wise error (FWE) corrected for M1; p<0.01, uncorrected for the IFG and p<0.001, uncorrected for the STG. In case of no significant activation, the threshold was lowered (starting at the respective threshold) until significant voxels were identified or until a minimum threshold of p<0.05, uncorrected was reached. In order to ensure comparability between sessions, threshold levels were kept stable within the respective subjects. The local activation maximum representing the highest t-value was identified for each ROI in every subject and session. Of note, if the STG cluster contained multiple local activation maxima, the most posterior one, presumably representing the more language-specific portion of the cluster such as lexico-semantic processing (e.g., Gow, 2012) and phonological encoding (Boatman and Miglioretti, 2005; cf. Poeppel, 2001 for review), was considered for calculation of Euclidean distances. Local maxima were controlled via the Anatomy toolbox v1.7 (Eickhoff et al., 2005) as implemented in SPM.

In addition to this approach, which we considered most in line with common clinical practice, a more data-driven approach was performed for single-subject analysis and served for complementary calculation of overlap volumes and ICCs. Here, we chose more stringent threshold levels for single-subject analysis in all cases for the STG (p<0.05, FWE-corrected; p<0.001, uncorrected) and IFG (p<0.05, FWE-corrected; p<0.001, uncorrected; p<0.01, uncorrected; Supplemental material).

After identifying the activation cluster yielding the highest BOLD activity in the different ROIs, all other voxels were removed from the respective SPM(T) map for each subject and session (cf. Weiss et al., 2013), resulting in separate clusters for the left and right IFG, STG and M1 for further analyses (CoGs, overlap volumes, ICCs). Here, we did not restrict the analysis by e.g. masking the ROIs, but considered the entire cluster of the respective ROIs even when exceeding the expected anatomical margins, in order to minimize the risk of driving the results towards a more favorable reliability result.

Although a ROI-based approach allows for comparison of local activation maxima/CoGs (Euclidean distances) and addresses location-specific clinical questions, it may not account for the complexity and variability of the language network as such (Friederici, 2011; Hagoort and Indefrey, 2014; Hickok and Poeppel, 2007; Price, 2012). We, therefore, also included whole-brain data in the manuscript, testing whether a good task-based reliability in the ROI-based analysis, which might be biased towards higher reliability indices, can be confirmed over the whole brain (overlap analysis: p<0.05, FWE-corrected; p<0.001, uncorrected; p<0.01, uncorrected).

Laterality index

To test for language lateralization, a laterality index (LI) was calculated including all ROIs using the formula:

$$\mathrm{LI} = \frac{L - R}{L + R} \times 100$$

with L representing the number of voxels within the left hemisphere and R the number of voxels within the right hemisphere (+100 = strong left hemispheric dominance; −100 = strong right hemispheric dominance).
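The laterality index can be illustrated with a minimal Python sketch. The function name `laterality_index` and the example voxel counts are hypothetical; raising an error when neither hemisphere contains voxels is an assumption, since the text does not say how that case was handled.

```python
def laterality_index(left_voxels: int, right_voxels: int) -> float:
    """LI = (L - R) / (L + R) * 100, ranging from -100 (right) to +100 (left)."""
    total = left_voxels + right_voxels
    if total == 0:
        # Assumption: the index is undefined without any voxels.
        raise ValueError("LI is undefined when both hemispheres contain no voxels")
    return 100.0 * (left_voxels - right_voxels) / total


# Made-up voxel counts for illustration.
print(laterality_index(60, 40))  # 20.0
print(laterality_index(0, 10))   # -100.0
```

Values near zero, as in the first call, correspond to weak lateralization.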

Statistical reliability measures

Local activation maximum and CoG

While the local activation maximum represents the site of highest local activity within a ROI, the CoG also includes the spatial extent and the distribution of the activation patterns. For CoG computation, the following formula was used (Wassermann, 1998):

$$\mathrm{CoG} = \left( \sum_i x_i \frac{t_i}{t_{\max}},\; \sum_i y_i \frac{t_i}{t_{\max}},\; \sum_i z_i \frac{t_i}{t_{\max}} \right) \Big/ \sum_i \frac{t_i}{t_{\max}}$$

with $t_i$ representing the t-value of one voxel with the coordinate ($x_i$, $y_i$, $z_i$) and $t_{\max}$ being the maximum t-value of all included voxels.
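The CoG formula amounts to a t-value-weighted mean of the voxel coordinates. The Python sketch below is an illustration under assumptions: the cluster is given as a list of coordinate triples with matching positive t-values, and the name `center_of_gravity` is hypothetical. The division by the maximum t-value follows the formula but cancels out in the final ratio.

```python
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def center_of_gravity(coords: Sequence[Coord], t_values: Sequence[float]) -> Coord:
    """Weighted mean coordinate, each voxel weighted by t_i / t_max."""
    if len(coords) != len(t_values) or not coords:
        raise ValueError("coords and t_values must be non-empty and of equal length")
    t_max = max(t_values)
    if t_max <= 0:
        # Assumption: clusters contain suprathreshold (positive) t-values only.
        raise ValueError("expected positive t-values")
    weights = [t / t_max for t in t_values]
    total = sum(weights)
    x = sum(w * c[0] for w, c in zip(weights, coords)) / total
    y = sum(w * c[1] for w, c in zip(weights, coords)) / total
    z = sum(w * c[2] for w, c in zip(weights, coords)) / total
    return (x, y, z)


# Two made-up voxels: the one with the higher t-value pulls the CoG towards it.
print(center_of_gravity([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0]))
```

In this toy case the CoG lies at x = 1.5, closer to the voxel with the larger t-value, whereas the local activation maximum would simply be the second voxel.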

Euclidean distances

For evaluation of the between-session differences of the local activation maxima and CoGs, Euclidean distances (ED) were computed in 3D single subject space using the following formula:

$$\mathrm{ED} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

with $x_{1,2}$, $y_{1,2}$, $z_{1,2}$ representing the coordinates of the respective sessions. Mean ED were computed between session 1 and 2 (short-term reliability) as well as between session 1 and 3 (long-term reliability).

Overlap volumes

The spatial overlaps/Dice coefficients (Sorensen, 1948; Dice et al., 1945) of session 1 with session 2 and 3 were computed in 3D single subject space for every ROI volume as well as for the whole brain using the binarized image volumes. The Dice coefficient was calculated as follows (Rombouts et al., 1997):

$$R_{\mathrm{overlap}} = \frac{2 \times V_{\mathrm{overlap}}}{V_1 + V_2}$$

where $V_1$ and $V_2$ are the activated volumes of the two sessions and $V_{\mathrm{overlap}}$ is the volume activated in both. The Dice coefficient ranges between 0 and 1, with 0 indicating no overlap and 1 indicating perfect overlap (low: 0.00 to 0.19, low-moderate: 0.20 to 0.39, moderate: 0.40 to 0.59, moderate-high: 0.60 to 0.79, high: 0.80 to 1.00) (Wilson et al., 2017).
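The two spatial measures described above, the Euclidean distance between session-wise coordinates and the Dice overlap of binarized activation volumes, can be sketched as follows. This is a minimal illustration: representing a binarized volume as a set of voxel index triples is an assumption made for brevity, and the function names are hypothetical.

```python
import math
from typing import Set, Tuple

Coord = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def euclidean_distance(p: Coord, q: Coord) -> float:
    """ED between the coordinates of two sessions (same units as the input, e.g. mm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dice_coefficient(session_a: Set[Voxel], session_b: Set[Voxel]) -> float:
    """Dice overlap 2 * V_overlap / (V_1 + V_2) of two binarized activation maps."""
    if not session_a and not session_b:
        # Assumption: overlap is undefined when neither session shows activation.
        raise ValueError("Dice coefficient is undefined for two empty volumes")
    overlap = len(session_a & session_b)
    return 2.0 * overlap / (len(session_a) + len(session_b))


# Made-up coordinates and voxel sets for illustration.
print(euclidean_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))            # 5.0
print(dice_coefficient({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)}))  # 0.5
```

A Dice value of 0.5, as in the toy example, would fall into the "moderate" category of the classification quoted above.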

Intraclass correlation coefficients

For analyzing the test-retest reliability of the activation level within a given voxel between the different sessions, voxel-wise ICCs were computed using the following formula (Shrout and Fleiss, 1979):

$$\mathrm{ICC}(1,1) = \frac{\mathrm{MSB} - \mathrm{MSW}}{\mathrm{MSB} + (k - 1)\,\mathrm{MSW}}$$

MSB represents the mean sum of squares between subjects, MSW the mean sum of squares within subjects, k the number of sessions and n the number of subjects. The data used for computing the mean sum of squares were the t-values of each voxel.

$$\mathrm{MSB} = \frac{k \sum_{i=1}^{n} (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2}{n - 1} \quad \text{and} \quad \mathrm{MSW} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n} (x_{ij} - \bar{x}_{i\cdot})^2}{n\,(k - 1)}$$

Here, $x_{ij}$ denotes the t-value of subject i in session j, $\bar{x}_{i\cdot}$ the mean of subject i across sessions and $\bar{x}_{\cdot\cdot}$ the grand mean.

The ICC ranges from -1 to 1: an ICC of <0 indicates no agreement while an ICC close to 1 indicates near-perfect agreement between the values of test- and retest-sessions. ICC values below 0.5 are considered to reflect poor reliability, 0.5 – 0.74 moderate reliability and ICC ≥ 0.75 high reliability (Portney and Watkins, 2000). For estimation of mean values as well as statistical comparisons between sessions, hemispheres and regions, ICC values were transformed to Fisher z-scores in Matlab (version 2014a, The MathWorks Inc., Natick, MA, USA).
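For a single voxel, the one-way ICC(1,1) of Shrout and Fleiss (1979) can be sketched in Python as shown below. The sketch assumes the standard one-way ANOVA mean squares, takes the data as a subjects-by-sessions table of t-values, and uses the inverse hyperbolic tangent for the Fisher z-transformation; the names `icc_1_1` and `fisher_z` are hypothetical and the code is not the Matlab implementation used in the study.

```python
import math
from typing import Sequence


def icc_1_1(data: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC(1,1); data[i][j] is subject i in session j."""
    n = len(data)        # number of subjects
    k = len(data[0])     # number of sessions
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need at least 2 subjects and 2 sessions per subject")
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    # Mean squares between and within subjects.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(data, subject_means) for x in row) / (n * (k - 1))
    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("ICC is undefined for data without any variance")
    return (msb - msw) / denominator


def fisher_z(icc: float) -> float:
    """Fisher z-transformation (assumed: atanh); undefined for |icc| = 1."""
    return math.atanh(icc)


# Made-up t-values of three subjects in two sessions.
toy = [[2.0, 2.5], [4.0, 3.5], [6.0, 6.5]]
value = icc_1_1(toy)
print(round(value, 3), round(fisher_z(value), 3))
```

Averaging in z-space and transforming back with `math.tanh` is one common way to obtain mean ICC values; whether the authors back-transformed in this way is not stated in the text.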

Statistical tests

Euclidean distances were entered into a repeated measures analysis of variance (ANOVA) with the factors LOCAL ACTIVATION MAXIMUM/COG (2 levels), INTERSESSION INTERVAL (2 levels: short-term = session 1 and 2, long-term = session 1 and 3), HEMISPHERE (2 levels: left, right) and REGION (3 levels: IFG, STG, M1) using SPSS version 21 (Statistical Package for the Social Sciences, IBM). Likewise, a repeated measures ANOVA was calculated for overlap volumes with the factors INTERSESSION INTERVAL (2 levels: short-term = session 1∩2, long-term = session 1∩3), HEMISPHERE (2 levels: left, right) and REGION (3 levels: IFG, STG, M1).

In case of significant main or interaction effects, post hoc Student's t-tests were performed to compare between the different variables of interest. Results were corrected for multiple comparisons by using the false-discovery-rate (FDR) procedure (Benjamini and Hochberg, 1995).
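The FDR correction named above is the step-up procedure of Benjamini and Hochberg (1995). A minimal Python sketch is given below; the function name, the default level of 0.05 and the example p-values are illustrative assumptions and do not reproduce the SPSS-based analysis of the study.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Step-up rule: find the largest rank r with p_(r) <= (r / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True  # all tests up to the largest passing rank
    return rejected


# Hypothetical p-values from four post hoc comparisons.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Because the rule searches for the largest passing rank, a p-value can be rejected even if it misses its own rank-specific threshold, as long as a larger p-value passes a higher one.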

Results

Feasibility

All subjects successfully completed all three fMRI sessions without any problems. The between-EPI displacement in x, y and z across the entire session was 0.13 ± 0.09 mm on average, ranging from -0.89 to 1.52 mm, thus indicating relatively little task-induced movement of the head. Accordingly, no subject had to be excluded due to excessive head movements (>3 mm; e.g. Gupta, 2014). Moreover, there were no differences between sessions in the mean EPI displacement (Suppl. Table 1) or in the tSNR (session 1: 120.99 ± 15.23, session 2: 116.32 ± 27.79, session 3: 115.78 ± 18.13; p>0.1, FDR-corrected). As far as a comparison to the literature is possible (given differences in signal detection protocols as well as SNR definitions), the tSNR was within a well-acceptable range (Welvaert and Rosseel, 2013).

While for most ROIs sufficient activation levels could be obtained at the single subject level, the statistical threshold for the left IFG had to be lowered to p<0.05 (uncorrected) in one case in order to find significantly activated voxels. For the right IFG the threshold had to be adjusted to p<0.05 (uncorrected) in three subjects. Of note, the statistical threshold was kept stable within subjects/between sessions. Examples of single-subject activation maps for the three sessions (depicting different levels of similarity between sessions) are shown in Suppl. Figure 3.

Laterality

There was no clear lateralization of the activation across ROIs to one hemisphere as indicated by a mean LI across sessions of 10 ± 18 (mean ± SD). The LI values for the different ROIs and sessions are shown in Table 2. Between-session reliability of the LI was rather poor with an ICC of 0.477 for the short and 0.338 for the long intersession interval. There was no significant difference between sessions or ROIs (REGION: F2,30=0.212, p=0.810, η²=0.014; INTERSESSION INTERVAL: F2,30=0.185, p=0.832, η²=0.012; REGION x INTERSESSION INTERVAL: F4,60=0.119, p=0.975, η²=0.008).

Table 2: LI values for each session and ROI as well as over all ROIs

ROI            session 1    session 2    session 3    mean
IFG            10 ± 57      11 ± 60      19 ± 46      13 ± 54
STG            6 ± 15       9 ± 19       9 ± 33       8 ± 23
M1             14 ± 32      12 ± 36      11 ± 21      12 ± 30
all ROIs       9 ± 15       9 ± 21       11 ± 20      10 ± 18
IFG and STG    9 ± 18       10 ± 22      11 ± 24      10 ± 21

Reliability of fMRI local activation maxima and CoG

The ANOVA revealed no significant main effect for the factors HEMISPHERE (F1,15=0.283, p=0.603, η²=0.019) or INTERSESSION INTERVAL (F1,15=1.610, p=0.224, η²=0.097), indicating no overall differences either between hemispheres (left vs. right) or between the two intersession intervals (short- vs. long-term) (Figure 2). Likewise, there was no significant interaction between factors (Suppl. Table 2). In contrast, we found a significant main effect for the factor REGION (F2,30=12.164, p<0.05, η²=0.448) as well as a statistical trend for the factor LOCAL ACTIVATION MAXIMUM/COG (F1,15=4.060, p=0.062, η²=0.213). Post-hoc Student’s t-tests showed a significant difference between local activation maxima and CoGs (averaged across regions, hemispheres and intersession intervals). Here, ED were significantly smaller for the local activation maxima (6.45 ± 1.36 mm) as compared to the CoGs (8.03 ± 2.06 mm, p<0.05). When testing for differences between regions, we found a statistical trend pointing towards smaller ED of the local activation maxima of M1 (5.31 ± 0.40 mm) compared to both the IFG (8.01 ± 1.14 mm) and the STG (6.02 ± 0.35 mm) as well as for the STG compared to the IFG (p<0.1, FDR-corrected; Figure 2A). A similar statistical trend (p<0.1, FDR-corrected) was evident for the CoGs with smaller ED for M1 (7.77 ± 0.52 mm) compared to the IFG (10.31 ± 0.83 mm), but not compared to the STG (6.02 ± 1.19 mm, Figure 2B).

Figure 2: Euclidean distances (ED) between the short (light grey) and long (dark grey) intersession interval for the local activation maxima (A) and CoGs (B). For both, no significant differences were evident between the intersession intervals or between hemispheres. ED of the local activation maxima were significantly smaller than for the CoGs. Statistical trends for differences between regions are indicated by asterisks (*) (p<0.1, FDR-corrected).

Mean ICC values for the local activation maxima and CoGs ranged between -0.023 and 0.903, indicating no to excellent reliability (Table 3). Fisher z-transformed ICC values showed significant differences between local activation maxima and CoGs with significantly higher ICCs for the local activation maxima (p<0.001). This finding was evident for both the short and the long intersession interval. However, there was no difference between the left and right hemisphere or between regions.

Table 3: ICCs of the local activation maxima and CoGs (coordinates of high reliability are highlighted in grey, moderate reliability in light grey)

local activation maximum

                 short-term                        long-term
                 x       y       z       mean      x       y       z       mean
left   IFG       0.223   0.877   0.767   0.701     0.603   0.778   0.838   0.753
       STG       0.649   0.683   0.467   0.604     0.319   0.832   0.277   0.523
       M1        0.688   0.592   0.903   0.762     0.496   0.744   0.840   0.664
right  IFG       0.514   0.499   0.505   0.508     0.376   0.633   0.681   0.578
       STG       0.563   0.263   0.515   0.446     0.737   0.172   0.811   0.635
       M1        0.825   0.744   0.840   0.808     0.795   0.575   0.814   0.744

CoG

                 short-term                        long-term
                 x       y       z       mean      x       y       z       mean
left   IFG       0.263   0.523   0.820   0.585     0.498   0.592   0.646   0.585
       STG       0.182   0.331   0.780   0.478     0.481   0.744   0.494   0.592
       M1        0.262   0.446   0.713   0.501     0.235   0.411   0.530   0.397
right  IFG       0.106   0.283   0.475   0.291     0.053   0.709   0.658   0.523
       STG       0.280   0.152  -0.023   0.139     0.609   0.168   0.350   0.389
       M1        0.086   0.467   0.384   0.319     0.208   0.214   0.341   0.254
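The ICC comparison above relies on Fisher z-transformed values. A minimal sketch of that transformation, assuming the standard definition z = artanh(r), is given below. The function names and input values are hypothetical, and the sketch is not meant to reproduce the statistical pipeline of the study.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation-like coefficient with |r| < 1."""
    return math.atanh(r)


def fisher_z_mean(values):
    """Average coefficients in z-space and transform the mean back."""
    zs = [fisher_z(v) for v in values]
    return math.tanh(sum(zs) / len(zs))


# Hypothetical ICC values for the x, y and z coordinate of one ROI.
iccs = [0.2, 0.5, 0.8]
print([round(fisher_z(v), 3) for v in iccs])
print(round(fisher_z_mean(iccs), 3))
```

Working in z-space makes the sampling distribution of the coefficients closer to normal, which is why the transform is commonly applied before parametric tests.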

Spatial reliability - overlap volumes

The mean overlap volume of all ROIs between session 1 and 2 (short-term) was 45 ± 18% for the left hemisphere and 38 ± 20% for the right hemisphere. Similarly, the overlap between session 1 and 3 (long-term) was 46 ± 15% for the left hemisphere and 40 ± 12% for the right hemisphere. The highest overlap was found for the STG, whereas the IFG had the lowest overlap between sessions (Figure 3). For the factor INTERSESSION INTERVAL, the repeated-measures ANOVA revealed no significant main effect (F1,15=0.107, p=0.749, η²=0.007) or interaction effect (INTERSESSION INTERVAL x REGION: F=1.319, p=0.282, η²=0.081; INTERSESSION INTERVAL x HEMISPHERE: F1,15=0.080, p=0.782, η²=0.005; INTERSESSION INTERVAL x REGION x HEMISPHERE: F2,30=0.414, p=0.665, η²=0.027), indicating that the time interval (short- vs. long-term) had no effect on the overlap volumes. In contrast, the factor HEMISPHERE (F1,15=6.993, p<0.05, η²=0.318) as well as the factor REGION (F2,30=27.391, p<0.001, η²=0.646) were significant, indicating differences in the overlap volume between the different ROIs as well as between hemispheres. However, there was no significant interaction effect (REGION x HEMISPHERE: F2,30=0.095, p=0.909, η²=0.006).

Post-hoc Student t-tests revealed significantly higher overlaps within the left hemisphere (45 ± 15%) as compared to the right hemisphere (38 ± 15%, averaged across all regions and intersession intervals; p<0.001, FDR-corrected). This difference was also evident for the short intersession interval (p<0.05, FDR-corrected) and as a statistical trend for the long intersession interval (p=0.086, FDR-corrected). When testing for differences between regions, we found significantly higher overlap volumes in the STG (56 ± 4%) as compared to M1 (46 ± 5%) and IFG (24 ± 6%) as well as for M1 compared to the IFG (averaged across hemispheres and intersession intervals; p<0.01, FDR-corrected).
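The effect sizes reported above are numerically consistent with partial η² derived from the F statistic and its degrees of freedom, η² = (F · df_effect) / (F · df_effect + df_error). This part of the manuscript does not state which η² variant was computed, so the following sketch rests on that assumption and merely checks it against the values reported for the factors HEMISPHERE and REGION. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# HEMISPHERE: F(1,15) = 6.993
print(round(partial_eta_squared(6.993, 1, 15), 3))   # 0.318
# REGION: F(2,30) = 27.391
print(round(partial_eta_squared(27.391, 2, 30), 3))  # 0.646
```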

Figure 3: Overlap volumes (Dice coefficient) between the short (light grey) and long (dark grey) intersession interval. There was no difference between the intersession intervals, but overlap volumes were significantly higher for the left as compared to the right hemisphere (p<0.001, FDR-corrected). Significant differences between regions are indicated by asterisks * (p<0.05, FDR-corrected).
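For orientation, the Dice coefficient of two thresholded activation maps A and B is 2·|A∩B| / (|A| + |B|). The sketch below assumes that each map is represented as a set of supra-threshold voxel indices; this representation, the function name and the example voxels are hypothetical.

```python
def dice_coefficient(voxels_a, voxels_b):
    """Dice overlap of two sets of supra-threshold voxel indices (0 = none, 1 = identical)."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty maps have no measurable overlap.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical activated voxels (i, j, k) in session 1 and session 2.
session1 = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
session2 = {(2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0)}

print(dice_coefficient(session1, session2))  # 0.5
```

Multiplying the result by 100 gives the percentage overlap used in the text.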

For the whole-brain approach, Dice coefficients ranged between 47 ± 16% and 60 ± 10% (Table 4), depending on the statistical threshold (higher overlap volumes when using less stringent thresholds). Of note, we again found no influence of the intersession interval on reliability when considering whole-brain activation.

In summary, whole-brain data confirmed, at least compared to the STG and M1 ROI, a moderate to moderately high Dice similarity (overlap volumes) between sessions, independent of the intersession interval.

Table 4: Overlap volumes - whole brain (mean ± SD)

threshold                 short-term (S1∩S2)    long-term (S1∩S3)
p<0.01, uncorrected       60 ± 10 %             60 ± 8 %
p<0.001, uncorrected      56 ± 11 %             56 ± 10 %
p<0.05, FWE-corrected     49 ± 15 %             47 ± 16 %

When applying different (more stringent) thresholds for analyzing the (left) IFG, we found similar overlap volumes for the short intersession interval (low to moderate, Suppl. Table 3). Nevertheless, results are not fully comparable due to differences in sample size (p<0.001, uncorrected: N=12; p<0.05, FWE-corrected: N=4). For the long intersession interval, overlap volumes were similar when comparing the more liberal threshold levels (i.e., p<0.01, uncorrected vs. p<0.001, uncorrected), whereas using p<0.05, FWE-corrected as a threshold led to the lowest overlap volumes (Suppl. Table 3). However, this difference was not statistically significant. Overlap volumes for the STG were moderate for both p<0.05, FWE-corrected and p<0.001, uncorrected (Suppl. Table 3), but with significantly lower overlap volumes for the more stringent threshold criterion (S1∩S2: p<0.05, S1∩S3: p<0.1).

Spatial reliability - intraclass correlation coefficient

The mean ICC values for session 1 and 2 (short-term) as well as session 1 and 3 (long-term) for all ROIs are shown in Table 5. Mean ICC values ranged between poor (ICC <0.5) and moderate (ICC <0.74) reliability. Nevertheless, highly reliable voxels (ICC >0.75) were found for all ROIs (Table 6). There was a statistical trend for a higher number of highly reliable voxels in the left as compared to the right hemisphere (p=0.052).
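To make the voxel-wise measure more concrete, the sketch below computes an intraclass correlation coefficient for one voxel from a subjects × sessions table. The ICC variant is an assumption: the sketch uses ICC(3,1) (two-way mixed, consistency, single measurement), which may differ from the variant used in the study. The function name and the data are hypothetical; only the cut-off of 0.75 for highly reliable voxels is taken from the text.

```python
def icc_3_1(data):
    """ICC(3,1) for a list of rows (one per subject), each holding k session values."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical contrast values of one voxel: 4 subjects x 2 sessions.
voxel = [[2.1, 2.4], [3.0, 2.8], [1.2, 1.5], [2.7, 2.9]]
icc = icc_3_1(voxel)
print(round(icc, 3), "highly reliable" if icc > 0.75 else "not highly reliable")
```

Counting the voxels of an ROI for which this value exceeds 0.75 yields the kind of number listed in Table 6.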

Table 5: Mean ICC values

                   left             right
short-term  IFG    0.345 ± 0.400    0.117 ± 0.285
            STG    0.414 ± 0.300    0.389 ± 0.300
            M1     0.515 ± 0.438    0.363 ± 0.414
long-term   IFG    0.273 ± 0.380    0.302 ± 0.504
            STG    0.422 ± 0.310    0.422 ± 0.282
            M1     0.422 ± 0.380    0.336 ± 0.462

Table 6: Number of highly reliable voxels (ICC>0.75)

                   left    right
short-term  IFG    431     25
            STG    353     212
            M1     544     336
long-term   IFG    297     332
            STG    482     290
            M1     313     275

When testing for differences between the ICCs of both intersession intervals, we obtained significant differences for all ROIs. Higher ICCs in the short intersession interval were found for the left IFG and bilateral M1. Conversely, ICCs were significantly higher in the long intersession interval for the right IFG and bilateral STG (p<0.05, FDR-corrected). For all ROIs, reliability was significantly higher in the left as compared to the right hemisphere in the short intersession interval (p<0.001, FDR-corrected). However, for the long intersession interval, this result was only evident for M1 (p<0.001, FDR-corrected), whereas for the IFG, ICC values were higher in the right hemisphere (p<0.001, FDR-corrected). The STG revealed no significant difference between hemispheres in the long intersession interval.
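Several of the comparisons above are reported as FDR-corrected, and the reference list cites Benjamini and Hochberg (1995). A minimal sketch of the Benjamini-Hochberg step-up adjustment is shown below, assuming that this is the procedure meant. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected p-values from four post-hoc tests.
raw = [0.01, 0.04, 0.03, 0.20]
print([round(p, 3) for p in benjamini_hochberg(raw)])  # [0.04, 0.053, 0.053, 0.2]
```

A test is then called significant if its adjusted p-value is below the chosen level, e.g. 0.05.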


Figure 4: Intraclass correlation coefficient (ICC) maps for IFG, STG and M1 for the short and long intersession interval (dark purple: ICC=0.0, red: ICC=1.0). Highly reliable voxels (ICC>0.75) are shown in the upper left and right corner.

Of note, ICCs of the left STG and IFG cluster significantly increased with more liberal (standardized) thresholds (Suppl. Table 4 and Suppl. Figure 4) and confirmed the higher short-term as compared to long-term reliability, regardless of the threshold chosen for ROI extraction, except for the STG when using a threshold of p<0.05, FWE-corrected for ROI extraction (no statistical difference).


Discussion

Analyzing the short- and long-term reliability of language fMRI using an overt picture naming task, we found a reliable location of the activation peak (i.e., local activation maxima, CoG) with mean Euclidean distances below 12 mm. In contrast, the extent and the magnitude of activation (i.e., overlap volumes, ICCs) were less reliable. The Dice similarity (overlap) between sessions was moderate to moderately high in the ROI analysis, at least for the STG and M1, as well as in the whole-brain analysis, with more liberal thresholds corresponding to higher overlap volumes. Reliability as measured by mean ICCs, however, was low to moderate, depending on the ROI. Nevertheless, voxels with ICCs >0.75, indicating a high reliability, were detected in each ROI for both time intervals. Moreover, overlap volumes and ICCs provided evidence for a lateralization of reliable activation towards the left hemisphere.

Lateralization

Although our results pointed towards a higher test-retest reliability of the left hemisphere, the picture naming task did not reveal a strong hemispheric lateralization as reflected by the laterality index (LI). Similar to our finding, previous studies reported that picture naming did not lead to left-lateralized activation patterns, but rather bilateral activations (Wilson et al., 2017). High LI values can be achieved when the analysis is limited to previously defined language regions instead of using a whole-brain approach and when non-language regions are excluded from the analysis (Rutten et al., 2002; Wilson et al., 2017). In our study, however, we did not find high LI values although we aimed at using predominantly language-related areas (activated by the picture naming task). Here, the data-driven approach of considering the entire cluster (particularly of the STG and M1), which also includes more bilaterally activated, not strictly language-specific functions such as audition (STG) and movement execution (M1), may have contributed to a certain underestimation of the LI (see Limitations). However, excluding e.g. M1 did not significantly change LI values. Furthermore, it is known that language lateralization is strongly threshold- and task-dependent (Rutten et al., 2002; Ruff et al., 2008). According to previous studies, a stronger left-lateralized activation pattern might also be achieved by the use of a non-resting baseline, e.g. a non-object control condition (Meltzer et al., 2009; Price et al., 2005; Wilson et al., 2017). However, the current study aimed at investigating the test-retest reliability of fMRI with reference to a clinical setting, i.e., using the picture naming paradigm, maintaining the comparability between experimental designs.
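The laterality index discussed above is commonly defined as LI = (L − R) / (L + R), where L and R quantify activation in the left and right hemisphere. This section does not give the exact definition used in the study, so the following sketch assumes that common form with supra-threshold voxel counts as input. The function name and the counts are hypothetical.

```python
def laterality_index(left, right):
    """LI in [-1, 1]: positive values indicate left-, negative values right-lateralization."""
    total = left + right
    if total == 0:
        raise ValueError("LI is undefined without any activation in either hemisphere")
    return (left - right) / total


# Hypothetical numbers of supra-threshold voxels in the left and right hemisphere.
print(round(laterality_index(620, 480), 3))  # 0.127
```

Because the voxel counts depend on the statistical threshold, the same data can yield different LI values at different thresholds, in line with the threshold dependence noted in the text.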

In line with our findings on LI reliability, previous studies reported that the picture naming task did not lead to a reproducible LI (Benson et al., 1999; Rutten et al., 2002). In a study by Rutten and colleagues (2002), the highest reproducibility of the LI was reported for a verb generation task and a conjoint analysis of three different language tasks, whereas the picture naming task revealed differences between sessions of 31 ± 30% and correlation coefficients below 0.4 (Pearson's correlation). In general, tasks of higher cognitive demand (e.g., verb generation or semantic decision tasks) have been reported to lead to strongly left-lateralized activation patterns and a more reliable LI (Eaton et al., 2008; Fernandez et al., 2003; Fesl et al., 2010; Harrington et al., 2006; Rutten et al., 2002). However, other studies found that robust and reliable activation patterns can be produced using tasks of low cognitive effort (Morrison et al., 2016).

Fernandez and colleagues (2003) reported a similar distribution of reliability between hemispheres when assessing fMRI twice within one day with a synonym judgement and a letter-matching condition. In contrast, we found a higher reliability (mean ICCs) for the left as compared to the right hemisphere for the short intersession interval. Differences between hemispheres in the long intersession interval were less consistent, i.e., varied between ROIs. Although the picture naming task does not seem to be suitable for defining hemispheric dominance (LI), there is a tendency towards a more reliable and stronger activation within the left hemisphere, as also reflected by significantly larger cluster sizes (and overlap volumes) as well as higher t-values of the local activation maxima (p<0.05, see Suppl. Table 5 and 6). Similarly, the left hemisphere revealed a higher number of highly reliable voxels (ICC>0.75). According to Fernandez and colleagues (2003), however, a similar reliability of activations of the left and right hemisphere is an important prerequisite when using fMRI to detect language dominance, especially in the case of atypical language dominance.

Reliability of local activation maxima and CoGs

The location of local activation maxima and CoGs was fairly reliable, with mean ED of 6.45 ± 1.36 mm and 8.03 ± 2.01 mm, respectively. The time interval between the different sessions did not significantly influence the results. Likewise, there was no difference between hemispheres. While ED are quite often used in studies on test-retest reliability of motor mappings, there are only a few studies reporting ED between the centers of language function across different sessions. A study by Morrison and colleagues (2016) investigated two different language tasks (phonemic fluency task, rhyming task) in a group of tumor patients and a healthy control group, repeated within an intersession interval of 20 min. In their healthy control group, they found significantly smaller ED for the CoGs (6.20 ± 0.09 mm and 3.42 ± 1.15 mm) as compared to the local activation maxima (11.27 ± 6.28 mm and 5.77 ± 2.78 mm) averaged across all ROIs. In contrast, we found significantly shorter ED as well as higher ICC values for the local activation maxima as compared to the CoGs (Figure 2). However, different factors might have contributed to the diverging results between studies, for example the cluster forming strategy for CoG calculation or the task. In line with this assumption, Morrison and colleagues (2016) reported significantly different ED between the phonemic fluency and the rhyming task.

As previously described by other studies (Bennett and Miller, 2010, 2013; Morrison et al., 2016), we could show that motor regions (M1) activated during the overt picture naming task exhibited higher reliability, i.e. lower ED, as compared to language regions activated by the same task. Of note, the STG revealed a relatively high reliability regarding the CoG stability (comparable to M1), suggesting the STG to be activated with at least a moderately high reliability, also concerning the spatial extent (cf. overlap volumes).
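Since the CoG depends on the cluster forming strategy, it can help to see how such a centre is obtained. The sketch below assumes that the CoG is the centroid of all voxel coordinates in a cluster, weighted by their statistical values; whether the study weighted by t-values or used an unweighted centroid is not stated in this section. The function name, coordinates and weights are hypothetical.

```python
def center_of_gravity(coords, weights):
    """Weighted centroid of 3-D voxel coordinates (e.g. weighted by t-values)."""
    total = sum(weights)
    return tuple(
        sum(w * c[axis] for c, w in zip(coords, weights)) / total
        for axis in range(3)
    )


# Hypothetical cluster: three voxels (mm) with their t-values as weights.
cluster = [(-50.0, 10.0, 20.0), (-48.0, 12.0, 20.0), (-46.0, 14.0, 22.0)]
t_values = [3.0, 5.0, 2.0]

print(tuple(round(v, 1) for v in center_of_gravity(cluster, t_values)))  # (-48.2, 11.8, 20.4)
```

Unlike the local maximum, this centre shifts whenever the cluster boundary changes, which is one way in which the cluster forming strategy can influence CoG-based distances.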

Spatial reliability – overlap volumes


Our ROI-based data revealed rather low Dice coefficients between sessions for the IFG (below 30%), whereas M1 and STG resulted in moderate overlap volumes between 40% and 60%. In comparison, the Dice similarity of whole-brain activation maps was slightly higher (ranging from 47% to 60%), probably due to the inclusion of more robust regions such as the sensory and visual cortex (Wilson et al., 2017) as compared to selected language-related areas. In a review on fMRI reliability, Bennett and Miller (2010) reported an average Dice overlap of 47.6% (range: 31-67%). Therefore, our data lie within the range of overlap volumes derived from a variety of different tasks. However, studies cannot be directly compared due to a number of factors (e.g., block design vs. sparse-sampling).

Several studies investigated the overlap volume between sessions using a picture naming task (e.g., Harrington et al., 2006; Rau et al., 2007; Rutten et al., 2002; Wilson et al., 2017). These studies reported overlap volumes ranging from 14% (frontal ROI, Harrington et al., 2006) to 61% (supratentorial ROI, i.e., the whole brain except for the cerebellum; Wilson et al., 2017), depending on the ROI, cluster forming threshold and time interval. Rutten and colleagues (2002) investigated the test-retest reliability of a picture naming task as well as a verb generation and an antonym generation task in two sessions separated by 2-7 months. The highest overlap volumes (no Dice coefficient, formula see Rutten et al., 2002) across language regions were achieved by a combined task analysis (~40%), followed by the verb and the antonym generation task. The lowest overlap volumes, however, were reported for the picture naming task (up to 27%, depending on the statistical threshold). The authors conclude that the combined analysis detects more language-critical areas, while being less sensitive to task-specific processing unrelated to critical language functions. In contrast, Wilson and colleagues (2016) described that out of four tested language tasks, picture naming was the most reliable in terms of overlap volumes (compared to sentence completion, naturalistic comprehension and narrative comprehension), especially when the analysis was performed over the whole brain. Nevertheless, the authors suggest that this high reliability reflects a robust mapping of sensory and motor processes rather than of language regions, because visual object processing and motor control dominate the activations during picture naming. In line with this assumption, we found relatively high overlap volumes for M1 and smaller overlap volumes for the IFG. However, significantly higher overlap volumes were found for the STG (up to 60%) as compared to M1 (and IFG), showing that picture naming serves not only for a relatively reliable mapping of M1 during speech production, but also for reliably delineating temporal language regions like the STG (see discussion below).

A study evaluating the reproducibility of a task using naming and naming plus noun generation in a group of healthy subjects in three sessions (3-35 days) found no reproducible activations across the three sessions (median overlap volume: 0%) within the left Broca's area and the left insula (for the naming task without noun generation; Rau et al., 2007). Similarly, the lowest overlap volumes were found for the IFG in our study. According to Rau and colleagues (2007), the combined task (picture naming plus noun generation) led to an improved reproducibility in the activation of Broca's area (triangular part: 48%, opercular part: 49%; for discussion see IFG activation), whereas activations in the insula were still not reliably reproduced. In line with the conclusion of other authors, the results of this study support the hypothesis that the combination of different language tasks might be advisable in order to achieve reliable and comprehensive results for all critical regions within the language network (Billingsley-Marshall et al., 2004; Bookheimer, 2007; Rutten et al., 2002; Wilson et al., 2017). However, the primary objective of this study was to specifically investigate the short- and long-term reliability of the picture naming task, which still represents the task most commonly used during awake surgery, since a high correspondence between pre- and intraoperative mapping paradigms is essential to ensure a high accuracy (Weng et al., 2018).

Spatial reliability – ICC maps

Overall, the mean voxel-wise ICC in our study revealed a test-retest reliability that was at most moderate. Similar results have been described by other studies testing a variety of different language tasks (e.g., Gonzalez-Castillo and Talavage, 2011; Gorgolewski et al., 2013; Maldjian et al., 2002; Morrison et al., 2016; Wilson et al., 2017). Although the mean ICCs were rather low in our study, we could find highly reliable voxels within each ROI which might correspond to language-critical "core" regions, even though, in general, a high variability exists among cortical language regions within and between individuals (Price, 2012). This provides further evidence that there are rather stable centers of fMRI activation not only for motor (cf. Weiss et al., 2013) but also for language regions. Of note, highly reliable voxels within M1 in this study showed a pattern similar to that of M1 derived from a lip pursing and tongue twisting task of a previous study (Weiss et al., 2013). Here, the more lateral cluster including highly reliable voxels could well represent the tongue area, whereas the more medial part of the cluster is assumed to represent M1 of the lips, in line with anatomical knowledge (Penfield and Rasmussen, 1950) and according to previous data from our group.

We expected a higher reliability within M1 as compared to STG and IFG. In accordance with this hypothesis, we found the highest mean ICCs for the (left) M1 for the short intersession interval. This finding is in line with the assumption that activity associated with motor components shows a lower inter- and intra-subject variability as compared to higher cognitive components like language (Seghier et al., 2004). Nevertheless, we found similar mean ICCs for M1 and the STG, indicating that the STG is also relatively reliably activated during the picture naming task. Clusters of high reliability were found in the medial as well as in the posterior portion of the STG, representing different components within the language network (Figure 4). The medial STG is also referred to as the primary auditory cortex, which has been shown to be reliably activated using fMRI (Binder et al., 1994; Di Salle et al., 2003). The auditory cortex was expected to be co-activated during the picture naming task because subjects hear themselves while speaking. The posterior STG, however, which is traditionally referred to as Wernicke's area, is involved in sensory speech processing. Due to their anatomical (and functional) proximity, these areas were not separated for reliability analyses. However, the approach of considering the entire STG cluster for ICC analysis may have led to a certain over-estimation of reliability (mean ICCs) with regard to more specifically language-related functions. In contrast to more specific language functions such as lexico-semantic and phonological processing, the representation of auditory functions seems to be rather robust (e.g., Leaver and Rauschecker, 2016). Nevertheless, our results clearly show areas of high reliability not only within the auditory-related section (BA 41/42) but also in the posterior part of the STG (BA 22p).

In contrast to the ED and overlap volumes, an influence of the intersession interval was evident for the mean ICC values, depending on the ROI. Higher reliability within the short as compared to the long intersession interval could be found for the left IFG (regardless of the statistical threshold) and bilateral M1, which is well in line with our hypothesis. In contrast, reliability was higher in the long intersession interval for the right IFG and bilateral STG. Here, attention and learning effects might have influenced the results. A study testing the reliability of a picture naming task in four sessions (one month interval) also found highly reliable voxels over the whole brain; however, this was only achieved when investigating untrained words (Meltzer et al., 2009). Untrained items led to consistent activation patterns, but the magnitude of activation declined over multiple testing sessions. In contrast, overlearned pictures showed only a small number of highly reliable voxels across the whole brain. The authors suggest that less reliable results are caused by a minimized and therefore stable activation magnitude in overlearned pictures (values closer to noise level). These findings lead to the assumption that the low to moderate mean ICCs in our study might partly be caused by the training of the pictures before each session as well as by unequal learning levels between repeated sessions. Moreover, the test-retest reliability might be improved by combining novel and trained stimulus materials.

When aiming at fMRI measurements with high voxel-wise reliability (as commonly measured by ICCs), the importance of the choice of the statistical threshold should be considered. Our supplementary data confirm that (at least for the left IFG and STG cluster) reliability indices increase with more liberal thresholds. However, choosing less stringent statistical thresholds goes along with a higher rate of false-positive results (e.g., Gorgolewski et al., 2012) which must be considered particularly when basing clinical decisions on such data (see Limitations).


IFG activation

Among the tested ROIs, the IFG revealed the lowest reliability measures, not only compared to M1 (according to our hypothesis) but also compared to other language-related areas, i.e. the STG. Although the IFG is a major region of the language network, it does not seem to be reliably activated by a picture naming task in combination with a sparse-sampling design as used in our study. The picture naming task is widely used to identify language-relevant regions for both non-invasive and, particularly, invasive cortex mapping (Freyschlag and Duffau, 2014). However, we were not the first to find such unfavorable results in fMRI studies: According to Etard et al. (2000), Broca's area might not be sufficiently activated during a picture naming task, since it is engaged in the selection of semantic knowledge among competitive alternatives. Picture naming tasks, however, are generally designed to ensure homogeneity of the expected response; that is, the object has a single possible label once it has been identified. Likewise, a meta-analysis by Indefrey and Levelt (2004) revealed that only 5 out of 9 studies reported activations in the IFG during a naming task.

The activation level in our study might have also been influenced by the training before each session, which was performed outside the scanner to guarantee stable performance within the scanner. A reduction in the activity of the IFG with repeated presentation of the same stimulus material has been described in several naming studies (e.g., Meister et al., 2005; Rau et al., 2007; van Turennout et al., 2003). In line with this, although not reaching statistical significance, the cluster size and the maximal activation level of the left IFG decreased from session to session in our study (possibly causing low test-retest reliability). However, the IFG of the right hemisphere, which in general is attributed to attention (Hampshire et al., 2010), showed a slightly but not significantly larger cluster size in session 3 as compared to session 2. This might speak for higher attention levels if the task was not well known from a few days before (or new, as indicated by the largest cluster size in session 1). A higher long- than short-term reliability within the right IFG might therefore be due to a more comparable novelty level of the stimulus material in the first and the last session. Considering and controlling for the effect of training is, therefore, not only important when investigating test-retest reliability in healthy subjects, but also for clinical routine. Our results strengthen the evidence that the degree of training before the measurement significantly influences the level and extent of event-related activation even in a single session (Meltzer et al., 2009), especially within the IFG during a picture naming task. Therefore, this task does not seem to be adequate for a robust and reliable IFG mapping, particularly in clinical routine where influencing factors like training level are usually harder to control as compared to experimental settings.

According to previous studies, a semantic decision task seems to be more suitable for activating Broca's area (Poldrack et al., 1999; Seghier et al., 2004; Thompson-Schill et al., 1997). A semantic decision task performed twice within one day in a group of patients undergoing evaluation for epilepsy surgery led to high ICC values, with a higher reliability in frontal areas (Broca, premotor) than in temporoparietal regions, revealing an intrahemispheric gradient of reliability (Fernandez et al., 2003). The higher reliability within temporoparietal regions as found in our study can be achieved by tasks focusing on lexical retrieval, auditory language perception and syntactic processing (Fernandez et al., 2003).


Since different tasks lead to distinct brain activation patterns (Wilson et al., 2017), a combination of different tasks seems to be advisable in order to reliably identify and delineate all important regions engaged in the language network (Ramsey et al., 2001; Rutten et al., 2002). This is especially crucial when fMRI is used in preoperative language mapping, where essential decisions (e.g., whether or not functional tissue can be resected without significant functional harm) may be, at least partly, based on fMRI results.

Limitations

One major challenge when investigating test-retest reliability between and within subjects is the choice of the statistical thresholds. Due to a large variety of factors such as alertness and (over-)learning of tasks as well as technical parameters (including motion artifacts) influencing the SNR, there is a high variability between subjects, making it difficult to use a predefined statistical threshold. Accordingly, thresholds had to be adjusted on an individual level in some cases, which is a common work-around, especially in clinical studies and routine. Although the influence of the statistical threshold on reliability was considered only low to moderate according to the literature (Fernandez et al., 2003; Rutten et al., 2002), its effect was still significant for the whole-brain analysis in our study, with higher overlap volumes when applying more liberal thresholds. We, however, decided to use distinct statistical thresholds for the ROIs, which were differentially activated by the picture naming task, as is usual in common practice. Applying high statistical thresholds for all ROIs would have led to a high number of missing values, particularly for the IFG (see Supplemental material), whereas low statistical thresholds would have caused large and implausible activation clusters, e.g. for M1. Therefore, we believe that, at least for the aim of our study, this approach represented the best trade-off between standardization and accuracy, leading to reasonable results. Nevertheless, since this approach introduces a considerable investigator-dependency and, thus, a potential bias, results were confirmed by additional analyses using different statistical thresholds (see Supplemental material). Of note, our methodological approach of using fixed statistical thresholds was oriented towards applications in basic research. However, when it comes to clinical applications like presurgical functional mappings, not only a good reliability of the analysis method is mandatory, but also an optimal trade-off between sensitivity and specificity as, e.g., suggested by Gorgolewski et al. (2012), who used a balanced statistical model (combining Gamma-Gaussian mixture modeling with topological thresholding) for data-driven, adaptive thresholding.


Another factor which influences not only lateralization metrics but also reliability results is the choice of functional areas / clusters for the analysis: We here aimed at balancing the choice of a rather data-driven approach against the restriction to language-related regions by selecting the ROIs based on the whole-brain group data and by including the entire cluster. However, the laterality index might have suffered from also considering more bilaterally activated regions like the auditory and primary motor cortices, whilst reliability results reflect the combination of both language-specific and (possibly more robust) language-related regions.

With regard to the depiction of higher-level language processes (such as addressed in this study), which typically have small effect sizes, the limited statistical power of the applied sparse-sampling fMRI protocol must be considered as a possible limitation. In our study, scanning time (~14.5 min) was a trade-off with statistical power in order to avoid fluctuations in alertness/attention as well as task-independent motion artifacts (also considering the particular requirements of clinical applications). Here, the applied acquisition parameters (TR, TA, number of volumes) were similar to those used in previous studies on language function (Gorgolewski et al., 2013; Smits et al., 2006; Volz et al., 2015; Wilson et al., 2017). The relatively high tSNR of the BOLD signal in our study (representing another main factor possibly influencing effect size, Murphy et al., 2007) allowed us to achieve sufficient statistical power even when assessing a rather low number of volumes, at least regarding more robust functional regions.
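The temporal signal-to-noise ratio mentioned above is usually computed per voxel as the mean of the time series divided by its standard deviation over time. The sketch below assumes that common definition with the sample standard deviation; the preprocessing applied before tSNR estimation in the study is not described in this section. The function name and the signal values are hypothetical.

```python
import statistics


def temporal_snr(time_series):
    """tSNR of one voxel: temporal mean divided by temporal (sample) standard deviation."""
    return statistics.fmean(time_series) / statistics.stdev(time_series)


# Hypothetical BOLD signal of one voxel across five volumes (arbitrary units).
signal = [100.0, 102.0, 98.0, 101.0, 99.0]
print(round(temporal_snr(signal), 1))  # 63.2
```

A higher tSNR means that a given percent signal change can be detected with fewer volumes, which is the trade-off with scanning time described in the text.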


Conclusion

Taken together, picture naming led to a reliable mapping of language-critical regions by fMRI, especially concerning the cluster centers. However, spatial reliability was rather low, especially for the IFG. Thus, even with an optimized fMRI design (including, e.g., sparse-sampling acquisition), the extent of functional areas must be interpreted with caution, particularly for clinical use such as preoperative imaging. Moreover, special attention should be paid to the intensity of task training prior to the measurement. To ensure reliable activation of the whole language network, a combination of multiple tasks seems advisable for language mapping. Which paradigm (or combination of paradigms) is best suited for presurgical language mapping should be addressed in future studies.


Literature

Amaro, E., Jr., Williams, S.C., Shergill, S.S., Fu, C.H., MacSweeney, M., Picchioni, M.M., Brammer, M.J., McGuire, P.K., 2002. Acoustic noise and functional magnetic resonance imaging: current strategies and future prospects. J Magn Reson Imaging 16, 497-510.
Ashburner, J., Friston, K.J., 2005. Unified segmentation. Neuroimage 26, 839-851.
Belin, P., Zatorre, R.J., Hoge, R., Evans, A.C., Pike, B., 1999. Event-related fMRI of the auditory cortex. Neuroimage 10, 417-429.
Benjamini, Y., Hochberg, Y., 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Stat Soc Ser B (Methodological) 57, 289-300.
Bennett, C.M., Miller, M.B., 2010. How reliable are the results from functional magnetic resonance imaging? Ann N Y Acad Sci 1191, 133-155.
Bennett, C.M., Miller, M.B., 2013. fMRI reliability: influences of task and experimental design. Cogn Affect Behav Neurosci 13, 690-702.
Benson, R.R., FitzGerald, D.B., LeSueur, L.L., Kennedy, D.N., Kwong, K.K., Buchbinder, B.R., Davis, T.L., Weisskoff, R.M., Talavage, T.M., Logan, W.J., Cosgrove, G.R., Belliveau, J.W., Rosen, B.R., 1999. Language dominance determined by whole brain functional MRI in patients with brain lesions. Neurology 52, 798.
Billingsley-Marshall, R.L., Simos, P.G., Papanicolaou, A.C., 2004. Reliability and validity of functional neuroimaging techniques for identifying language-critical areas in children and adults. Dev Neuropsychol 26, 541-563.
Binder, J.R., Rao, S.M., Hammeke, T.A., Yetkin, F.Z., Jesmanowicz, A., Bandettini, P.A., Wong, E.C., Estkowski, L.D., Goldstein, M.D., Haughton, V.M., et al., 1994. Functional magnetic resonance imaging of human auditory cortex. Ann Neurol 35, 662-672.
Boatman, D.F., Miglioretti, D.L., 2005. Cortical sites critical for speech discrimination in normal and impaired listeners. J Neurosci 25, 5475-5480.
Bookheimer, S., 2007. Pre-surgical language mapping with functional magnetic resonance imaging. Neuropsychol Rev 17, 145-155.
Brannen, J.H., Badie, B., Moritz, C.H., Quigley, M., Meyerand, M.E., Haughton, V.M., 2001. Reliability of functional MR imaging with word-generation tasks for mapping Broca's area. AJNR Am J Neuroradiol 22, 1711-1718.
Di Salle, F., Esposito, F., Scarabino, T., Formisano, E., Marciano, E., Saulino, C., Cirillo, S., Elefante, R., Scheffler, K., Seifritz, E., 2003. fMRI of the auditory system: understanding the neural basis of auditory gestalt. Magn Reson Imaging 21, 1213-1224.
Eaton, K.P., Szaflarski, J.P., Altaye, M., Ball, A.L., Kissela, B.M., Banks, C., Holland, S.K., 2008. Reliability of fMRI for studies of language in post-stroke aphasia subjects. Neuroimage 41, 311-322.
Eden, G.F., Joseph, J.E., Brown, H.E., Brown, C.P., Zeffiro, T.A., 1999. Utilizing hemodynamic delay and dispersion to detect fMRI signal change without auditory interference: the behavior interleaved gradients technique. Magn Reson Med 41, 13-20.

Eickhoff, S.B., Stephan, K.E., Mohlberg, H., Grefkes, C., Fink, G.R., Amunts, K., Zilles, K., 2005. A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. Neuroimage 25, 1325-1335.
Etard, O., Mellet, E., Papathanassiou, D., Benali, K., Houde, O., Mazoyer, B., Tzourio-Mazoyer, N., 2000. Picture naming without Broca's and Wernicke's area. Neuroreport 11, 617-622.
Fedorenko, E., Hsieh, P.J., Nieto-Castanon, A., Whitfield-Gabrieli, S., Kanwisher, N., 2010. New method for fMRI investigations of language: defining ROIs functionally in individual subjects. J Neurophysiol 104, 1177-1194.
Fernandez, G., Specht, K., Weis, S., Tendolkar, I., Reuber, M., Fell, J., Klaver, P., Ruhlmann, J., Reul, J., Elger, C.E., 2003. Intrasubject reproducibility of presurgical language lateralization and mapping using fMRI. Neurology 60, 969-975.
Fesl, G., Bruhns, P., Rau, S., Wiesmann, M., Ilmberger, J., Kegel, G., Brueckmann, H., 2010. Sensitivity and reliability of language laterality assessment with a free reversed association task--a fMRI study. Eur Radiol 20, 683-695.
Freyschlag, C.F., Duffau, H., 2014. Awake brain mapping of cortex and subcortical pathways in brain tumor surgery. J Neurosurg Sci 58, 199-213.
Friederici, A.D., 2011. The brain basis of language processing: from structure to function. Physiol Rev 91, 1357-1392.
Gonzalez-Castillo, J., Talavage, T.M., 2011. Reproducibility of fMRI activations associated with auditory sentence comprehension. Neuroimage 54, 2138-2155.
Gorgolewski, K.J., Storkey, A.J., Bastin, M.E., Pernet, C.R., 2012. Adaptive thresholding for reliable topological inference in single subject fMRI analysis. Front Hum Neurosci 6, 245.
Gorgolewski, K.J., Storkey, A.J., Bastin, M.E., Whittle, I., Pernet, C., 2013. Single subject fMRI test-retest reliability metrics and confounding factors. Neuroimage 69, 231-243.
Gow, D.W., Jr., 2012. The cortical organization of lexical knowledge: a dual lexicon model of spoken language processing. Brain Lang 121, 273-288.
Gracco, V.L., Tremblay, P., Pike, B., 2005. Imaging speech production using fMRI. Neuroimage 26, 294-301.
Gupta, S.S., 2014. fMRI for mapping language networks in neurosurgical cases. Indian J Radiol Imaging 24, 37-43.
Hagoort, P., Indefrey, P., 2014. The neurobiology of language beyond single words. Annu Rev Neurosci 37, 347-362.
Hampshire, A., Chamberlain, S.R., Monti, M.M., Duncan, J., Owen, A.M., 2010. The role of the right inferior frontal gyrus: inhibition and attentional control. Neuroimage 50, 1313-1319.
Harrington, G.S., Buonocore, M.H., Farias, S.T., 2006. Intrasubject reproducibility of functional MR imaging activation in language tasks. AJNR Am J Neuroradiol 27, 938-944.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nat Rev Neurosci 8, 393-402.
Indefrey, P., Levelt, W.J., 2004. The spatial and temporal signatures of word production components. Cognition 92, 101-144.

Kristo, G., Rutten, G.J., Raemaekers, M., de Gelder, B., Rombouts, S.A., Ramsey, N.F., 2014. Task and task-free FMRI reproducibility comparison for motor network identification. Hum Brain Mapp 35, 340-352.
Langers, D.R., van Dijk, P., 2011. Robustness of intrinsic connectivity networks in the human brain to the presence of acoustic scanner noise. Neuroimage 55, 1617-1632.
Leaver, A.M., Rauschecker, J.P., 2016. Functional Topography of Human Auditory Cortex. J Neurosci 36, 1416-1428.
Lefaucheur, J.P., Picht, T., 2016. The value of preoperative functional cortical mapping using navigated TMS. Neurophysiol Clin 46, 125-133.
Maldjian, J.A., Laurienti, P.J., Driskill, L., Burdette, J.H., 2002. Multiple reproducibility indices for evaluation of cognitive functional MR imaging paradigms. AJNR Am J Neuroradiol 23, 1030-1037.
Meister, I.G., Weidemann, J., Foltys, H., Brand, H., Willmes, K., Krings, T., Thron, A., Topper, R., Boroojerdi, B., 2005. The neural correlate of very-long-term picture priming. Eur J Neurosci 21, 1101-1106.
Meltzer, J.A., Postman-Caucheteux, W.A., McArdle, J.J., Braun, A.R., 2009. Strategies for longitudinal neuroimaging studies of overt language production. Neuroimage 47, 745-755.
Morrison, M.A., Churchill, N.W., Cusimano, M.D., Schweizer, T.A., Das, S., Graham, S.J., 2016. Reliability of Task-Based fMRI for Preoperative Planning: A Test-Retest Study in Brain Tumor Patients and Healthy Controls. PLoS One 11, e0149547.
Murphy, K., Bodurka, J., Bandettini, P.A., 2007. How long to scan? The relationship between fMRI temporal signal to noise ratio and necessary scan duration. Neuroimage 34, 565-574.
Peelle, J.E., 2014. Methodological challenges and solutions in auditory functional magnetic resonance imaging. Front Neurosci 8, 253.
Penfield, W., Rasmussen, T., 1950. The cerebral cortex of man; a clinical study of localization of function. Macmillan, Oxford, England.
Poeppel, D., 2001. Pure word deafness and the bilateral processing of the speech code. Cognitive Science 25, 679-693.
Poldrack, R.A., Wagner, A.D., Prull, M.W., Desmond, J.E., Glover, G.H., Gabrieli, J.D., 1999. Functional specialization for semantic and phonological processing in the left inferior prefrontal cortex. Neuroimage 10, 15-35.
Portney, L.G., Watkins, M.P., 2000. Foundations of clinical research: applications to practice. Prentice Hall; New Jersey.
Price, C.J., 2012. A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. Neuroimage 62, 816-847.
Price, C.J., Devlin, J.T., Moore, C.J., Morton, C., Laird, A.R., 2005. Meta-analyses of object naming: effect of baseline. Hum Brain Mapp 25, 70-82.
Quiton, R.L., Keaser, M.L., Zhuo, J., Gullapalli, R.P., Greenspan, J.D., 2014. Intersession reliability of fMRI activation for heat pain and motor tasks. Neuroimage Clin 5, 309-321.
Ramsey, N.F., Sommer, I.E., Rutten, G.J., Kahn, R.S., 2001. Combined analysis of language tasks in fMRI improves assessment of hemispheric dominance for language functions in individual subjects. Neuroimage 13, 719-733.

Rau, S., Fesl, G., Bruhns, P., Havel, P., Braun, B., Tonn, J.C., Ilmberger, J., 2007. Reproducibility of activations in Broca area with two language tasks: a functional MR imaging study. AJNR Am J Neuroradiol 28, 1346-1353.
Robson, M.D., Dorosz, J.L., Gore, J.C., 1998. Measurements of the temporal fMRI response of the human auditory cortex to trains of tones. Neuroimage 7, 185-198.
Rombouts, S.A., Barkhof, F., Hoogenraad, F.G., Sprenger, M., Valk, J., Scheltens, P., 1997. Test-retest analysis with functional MR of the activated area in the human visual cortex. AJNR Am J Neuroradiol 18, 1317-1322.
Rutten, G.J., Ramsey, N.F., van Rijen, P.C., van Veelen, C.W., 2002. Reproducibility of fMRI-determined language lateralization in individual subjects. Brain Lang 80, 421-437.
Seghier, M.L., Lazeyras, F., Pegna, A.J., Annoni, J.M., Zimine, I., Mayer, E., Michel, C.M., Khateb, A., 2004. Variability of fMRI activation during a phonological and semantic language task in healthy subjects. Hum Brain Mapp 23, 140-155.
Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 86, 420-428.
Smits, M., Visch-Brink, E., Schraa-Tam, C.K., Koudstaal, P.J., van der Lugt, A., 2006. Functional MR imaging of language processing: an overview of easy-to-implement paradigms for patient care and clinical research. Radiographics 26 Suppl 1, S145-158.
Snodgrass, J.G., Vanderwart, M., 1980. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J Exp Psychol Hum Learn 6, 174-215.
Stevens, M.T., Clarke, D.B., Stroink, G., Beyea, S.D., D'Arcy, R.C., 2016. Improving fMRI reliability in presurgical mapping for brain tumours. J Neurol Neurosurg Psychiatry 87, 267-274.
Stippich, C., Rapps, N., Dreyhaupt, J., Durst, A., Kress, B., Nennig, E., Tronnier, V.M., Sartor, K., 2007. Localizing and lateralizing language in patients with brain tumors: feasibility of routine preoperative functional MR imaging in 81 consecutive patients. Radiology 243, 828-836.
Szelényi, A., Bello, L., Duffau, H., Fava, E., Feigl, G.C., Galanda, M., Neuloh, G., Signorelli, F., Sala, F., 2010. Intraoperative electrical stimulation in awake craniotomy: methodological aspects of current practice. Neurosurgical Focus 28, E7.
Thompson-Schill, S.L., D'Esposito, M., Aguirre, G.K., Farah, M.J., 1997. Role of left inferior prefrontal cortex in retrieval of semantic knowledge: a reevaluation. Proc Natl Acad Sci U S A 94, 14792-14797.
Tyndall, A.J., Reinhardt, J., Tronnier, V., Mariani, L., Stippich, C., 2017. Presurgical motor, somatosensory and language fMRI: Technical feasibility and limitations in 491 patients over 13 years. Eur Radiol 27, 267-278.
van Turennout, M., Bielamowicz, L., Martin, A., 2003. Modulation of neural activity during object naming: effects of time and practice. Cereb Cortex 13, 381-391.
Volz, L.J., Eickhoff, S.B., Pool, E.M., Fink, G.R., Grefkes, C., 2015. Differential modulation of motor network connectivity during movements of the upper and lower limbs. Neuroimage 119, 44-53.
Wassermann, E.M., 1998. Risk and safety of repetitive transcranial magnetic stimulation: report and suggested guidelines from the International Workshop on the Safety of Repetitive Transcranial Magnetic Stimulation, June 5-7, 1996. Electroencephalogr Clin Neurophysiol 108, 1-16.

Weiss, C., Nettekoven, C., Rehme, A.K., Neuschmelting, V., Eisenbeis, A., Goldbrunner, R., Grefkes, C., 2013. Mapping the hand, foot and face representations in the primary motor cortex - Retest reliability of neuronavigated TMS versus functional MRI. Neuroimage 66C, 531-542.
Welvaert, M., Rosseel, Y., 2013. On the definition of signal-to-noise ratio and contrast-to-noise ratio for FMRI data. PLoS One 8, e77089.
Weng, H.H., Noll, K.R., Johnson, J.M., Prabhu, S.S., Tsai, Y.H., Chang, S.W., Huang, Y.C., Lee, J.D., Yang, J.T., Yang, C.T., Tsai, Y.H., Yang, C.Y., Hazle, J.D., Schomer, D.F., Liu, H.L., 2018. Accuracy of Presurgical Functional MR Imaging for Language Mapping of Brain Tumors: A Systematic Review and Meta-Analysis. Radiology 286, 512-523.
Wilson, S.M., Bautista, A., Yen, M., Lauderdale, S., Eriksson, D.K., 2017. Validity and reliability of four language mapping paradigms. Neuroimage Clin 16, 399-408.
Yakunina, N., Kang, E.K., Kim, T.S., Min, J.H., Kim, S.S., Nam, E.C., 2015. Effects of scanner acoustic noise on intrinsic brain activity during auditory stimulation. Neuroradiology 57, 1063-1073.


Acknowledgments

The last author (CWL) received additional funding from the Faculty of Medicine of the University of Cologne (grant No.: Gerok 8/2016).

Supplemental Material

Suppl. Table 1: Realignment parameters

               session 1      session 2      session 3
translation
  x            0.16 ± 0.50    0.05 ± 0.29    0.11 ± 0.24
  y            0.23 ± 0.26    0.28 ± 0.21    0.19 ± 0.18
  z            0.09 ± 0.09    0.02 ± 0.29    0.05 ± 0.33
rotation
  pitch        0.01 ± 0.01    0.01 ± 0.01    0.01 ± 0.01
  roll         0.00 ± 0.01    0.00 ± 0.01    0.00 ± 0.01
  yaw          0.00 ± 0.01    0.00 ± 0.01    0.00 ± 0.01

Suppl. Table 2: Non-significant interaction effects (ANOVA)

factors                                                                     F-value            p-value    effect size (η²)
local activation maxima/CoG x intersession interval                         F(1,15) = 0.908    p=0.356    0.057
local activation maxima/CoG x region                                        F(2,30) = 1.668    p=0.206    0.100
intersession interval x region                                              F(2,30) = 0.895    p=0.419    0.056
local activation maxima/CoG x intersession interval x region                F(2,30) = 0.335    p=0.718    0.022
local activation maxima/CoG x hemisphere                                    F(1,15) = 0.002    p=0.967    0.000
intersession interval x hemisphere                                          F(1,15) = 0.001    p=0.976    0.000
local activation maxima/CoG x intersession interval x hemisphere            F(1,15) = 0.056    p=0.816    0.004
region x hemisphere                                                         F(2,30) = 1.035    p=0.368    0.065
local activation maxima/CoG x region x hemisphere                           F(2,30) = 0.765    p=0.474    0.049
intersession interval x region x hemisphere                                 F(2,30) = 0.046    p=0.955    0.003
local activation maxima/CoG x intersession interval x region x hemisphere   F(2,30) = 0.387    p=0.683    0.025


Suppl. Table 3: Overlap volumes – STG and IFG, left hemisphere. Mean Dice coefficients (± SD) are reported, depending on threshold levels and intersession intervals. Of note, due to threshold-dependent reduction in activity, the sample size was diminished to n=12 (p<0.001, uncorrected) / n=4 (p<0.05, FWE-corrected) for the IFG.

threshold

short-term (S1∩S2) long-term (S1∩S3)

STG left p<0.001, uncorrected p<0.05, FWE-corrected IFG left p<0.01, uncorrected

59 ± 11 %

58 ± 10 %

55 ± 12 %

53 ± 14 %


ROI

25 ± 24 % 24 ± 22 %

p<0.05, FWE-corrected

21 ± 26 %

23 ± 17 % 5±8%


p<0.001, uncorrected

29 ± 28%
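As an illustration of the overlap measure reported above, the following minimal sketch computes a Dice coefficient between two thresholded activation maps. It assumes the standard definition 2·|A∩B| / (|A| + |B|); representing each map as a set of suprathreshold voxel coordinates, and the function name itself, are hypothetical choices for this example and not the analysis code of this study.

```python
def dice_coefficient(voxels_a: set, voxels_b: set) -> float:
    """Dice overlap between two sets of suprathreshold voxels.

    voxels_a, voxels_b: sets of (x, y, z) voxel indices that survive the
    statistical threshold in session A and session B, respectively.
    Returns a value between 0 (no overlap) and 1 (identical maps).
    """
    size_sum = len(voxels_a) + len(voxels_b)
    if size_sum == 0:
        raise ValueError("Dice is undefined when both maps are empty")
    return 2 * len(voxels_a & voxels_b) / size_sum


# Hypothetical example: two of three voxels are shared between sessions
session1 = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
session2 = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
print(f"{100 * dice_coefficient(session1, session2):.0f} %")
```

Because the coefficient is undefined when neither session shows suprathreshold voxels in a ROI, subjects without activity at a given threshold cannot contribute, which is consistent with the reduced sample sizes noted for the IFG.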

Suppl. Table 4: ICCs – STG and IFG, left hemisphere. Mean ICCs (± SD) are reported, depending on threshold levels and intersession intervals.

ROI        threshold                interval      mean ICC         n (ICC>0.75)
STG left   p<0.001, uncorrected     short-term    0.414 ± 0.300    353
                                    long-term     0.422 ± 0.310    482
           p<0.05, FWE-corrected    short-term    0.257 ± 0.460    456
                                    long-term     0.253 ± 0.428    439
IFG left   p<0.01, uncorrected      short-term    0.345 ± 0.400    431
                                    long-term     0.273 ± 0.380    297
           p<0.001, uncorrected     short-term    0.222 ± 0.468    634
                                    long-term     0.197 ± 0.508    375
           p<0.05, FWE-corrected    short-term    0.137 ± 0.549    140
                                    long-term     0.040 ± 0.227    46
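To make the ICC values above easier to interpret, the sketch below computes an intraclass correlation for one voxel from a subjects × sessions table using the two-way ANOVA decomposition of Shrout and Fleiss (1979). It assumes the consistency form ICC(3,1); which ICC variant was used in this study is not restated in this section, so the choice, the function name and the example values are all hypothetical.

```python
def icc_3_1(scores):
    """ICC(3,1), consistency form (assumed variant, illustrative only).

    scores: list of rows, one row per subject, one column per session,
    e.g. the activation value of a single voxel in sessions 1 and 2.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all values are identical")
    return (ms_rows - ms_error) / denominator


# Hypothetical voxel values of three subjects in two sessions
print(icc_3_1([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]))
```

In this form, a constant shift between sessions does not lower the ICC, whereas a change in the rank order of subjects does; voxels with ICC>0.75 are those counted as highly reliable in the table above.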

Suppl. Table 5: t-values of local activation maxima (mean ± SD)

             session 1       session 2      session 3
IFG left     5.23 ± 1.57     4.93 ± 1.65    4.53 ± 1.43
IFG right    4.74 ± 1.20     4.39 ± 1.18    4.20 ± 1.33
STG left     10.51 ± 2.42    9.59 ± 1.49    9.60 ± 1.82
STG right    9.73 ± 1.86     9.44 ± 1.62    8.86 ± 1.91
M1 left      9.51 ± 2.33     8.91 ± 2.14    8.82 ± 1.41
M1 right     9.37 ± 1.99     8.49 ± 1.76    8.42 ± 1.65
mean         8.18 ± 2.98     7.62 ± 2.69    7.40 ± 2.69


Suppl. Table 6: Cluster size (mean number of voxels ± SD)

             session 1      session 2      session 3
IFG left     2241 ± 2587    1729 ± 1592    1191 ± 782
IFG right    1287 ± 1143    988 ± 864      1059 ± 1071
STG left     8501 ± 3826    8856 ± 3668    7663 ± 3471
STG right    7474 ± 2910    7990 ± 4556    6135 ± 2817
M1 left      2149 ± 1446    1892 ± 1339    1822 ± 1363
M1 right     1764 ± 1075    1648 ± 1162    1496 ± 1189
mean         3902 ± 3199    3851 ± 3566    3228 ± 2897


Suppl. Figure 1: Group analysis for session 1-3 (p<0.05, FWE-corrected at the voxel-level)


Suppl. Figure 2: Probabilistic overlap maps (p<0.05, FWE-corrected) depicting the number of subjects with significant activity at that voxel. Regions with highest overlaps between subjects are shown in green.


Suppl. Figure 3: Examples of single-subject activation maps, by session. The representative samples exemplify different levels of similarity between sessions (for the left hemisphere, p<0.05, FWE-corrected at the voxel level). The Dice coefficients (whole-brain) of sessions 2 and 3 with session 1 are shown in the respective upper right corner.


Suppl. Figure 4: ICC maps of the (A) left STG and (B) left IFG with different statistical thresholds for ROI-extraction (dark purple: ICC=0.0, red: ICC=1.0). Highly reliable voxels (ICC>0.75) are shown in the upper right corner.


Table 1: MNI coordinates (x, y, z) of the local activation maxima derived from the group analysis (all subjects and sessions)

ROI          x        y        z       t-value
IFG left     -52.5    19.5     4.5     4.87
IFG right    48       25.5     0       5.14
STG left     -58.5    -33      10.5    9.53
STG right    63       -33      3       9.75
M1 left      -46.5    -9       34.5    15.91
M1 right     49.5     -9       31.5    15.01


Table 2: LI values for each session and ROI as well as over all ROIs

ROI            session 1    session 2    session 3    mean
IFG            10 ± 57      11 ± 60      19 ± 46      13 ± 54
STG            6 ± 15       9 ± 19       9 ± 33       8 ± 23
M1             14 ± 32      12 ± 36      11 ± 21      12 ± 30
all ROIs       9 ± 15       9 ± 21       11 ± 20      10 ± 18
IFG and STG    9 ± 18       10 ± 22      11 ± 24      10 ± 21


Table 3: ICCs of the local activation maxima and CoGs (coordinates of high reliability are highlighted in grey, moderate reliability in light grey)

x y z mean x y z mean

left IFG 0.263 0.523 0.820 0.585 0.498 0.592 0.646 0.585

CoG short-term


long-term

M1 0.688 0.592 0.903 0.762 0.496 0.744 0.840 0.664

STG 0.182 0.331 0.780 0.478 0.481 0.744 0.494 0.592

right IFG 0.514 0.499 0.505 0.508 0.376 0.633 0.681 0.578

STG 0.563 0.263 0.515 0.446 0.737 0.172 0.811 0.635

M1 0.825 0.744 0.840 0.808 0.795 0.575 0.814 0.744

right IFG 0.106 0.283 0.475 0.291 0.053 0.709 0.658 0.523

STG 0.280 0.152 -0.023 0.139 0.609 0.168 0.350 0.389

M1 0.086 0.467 0.384 0.319 0.208 0.214 0.341 0.254


long-term

x y z mean x y z mean

STG 0.649 0.683 0.467 0.604 0.319 0.832 0.277 0.523


short-term

left IFG 0.223 0.877 0.767 0.701 0.603 0.778 0.838 0.753


local activation maximum

M1 0.262 0.446 0.713 0.501 0.235 0.411 0.530 0.397


Table 4: Overlap volumes - whole brain (mean ± SD)

threshold                short-term (S1∩S2)    long-term (S1∩S3)
p<0.01, uncorrected      60 ± 10 %             60 ± 8 %
p<0.001, uncorrected     56 ± 11 %             56 ± 10 %
p<0.05, FWE-corrected    49 ± 15 %             47 ± 16 %


Table 5: Mean ICC values

                    left             right
short-term   IFG    0.345 ± 0.400    0.117 ± 0.285
             STG    0.414 ± 0.300    0.389 ± 0.300
             M1     0.515 ± 0.438    0.363 ± 0.414
long-term    IFG    0.273 ± 0.380    0.302 ± 0.504
             STG    0.422 ± 0.310    0.422 ± 0.282
             M1     0.422 ± 0.380    0.336 ± 0.462


Table 6: Number of highly reliable voxels (ICC>0.75)

                    left    right
short-term   IFG    431     25
             STG    353     212
             M1     544     336
long-term    IFG    297     332
             STG    482     290
             M1     313     275