NeuroImage 50 (2010) 1168–1176
Retest reliability of reward-related BOLD signals

Klaus Fliessbach ⁎,1, Tim Rohe 1, Nicolas S. Linder, Peter Trautner, Christian E. Elger, Bernd Weber

University of Bonn Medical Center, Department of Epileptology, and Life and Brain Center, Department of NeuroCognition, Sigmund-Freud-Str. 25, 53105 Bonn, Germany

⁎ Corresponding author. Department of Epileptology, University of Bonn Medical Center, Sigmund-Freud-Str. 25, D-53105 Bonn, Germany. Fax: +49 228 6885261. E-mail address: [email protected] (K. Fliessbach).
1 These authors contributed equally to the study.
Article history: Received 9 November 2009 Revised 30 December 2009 Accepted 11 January 2010 Available online 18 January 2010
Abstract

Reward processing is a central component of learning and decision making. Functional magnetic resonance imaging (fMRI) has contributed substantially to our understanding of reward processing in humans. The strength of reward-related brain responses might prove to be a valuable marker for, or correlate of, individual preferences or personality traits. An essential prerequisite for this is sufficient reliability of individual measures of reward-related brain signals. We therefore determined test–retest reliabilities of BOLD responses to reward prediction, reward receipt and reward prediction errors in the ventral striatum and the orbitofrontal cortex in 25 subjects undergoing three different simple reward paradigms (retest interval 7–13 days). Although on the group level the paradigms consistently led to significant activations of the relevant brain areas in both sessions, across-subject retest reliabilities were only poor to fair (with intraclass correlation coefficients (ICCs) of −0.15 to 0.44). ICCs for motor activations were considerably higher (ICCs of 0.32 to 0.73). Our results reveal the methodological difficulties behind across-subject correlations in fMRI research on reward processing and demonstrate the need for studies that address how to optimize the retest reliability of fMRI.
Introduction

Over the last 20 years, functional magnetic resonance imaging (fMRI) has made immense contributions to our understanding of the neural foundations of reward processing in humans. Reward processing plays a central role in basic cognitive functions such as reinforcement learning and decision making, and is thus of fundamental importance for neuroscientific studies in the fields of learning theory, economic decision making (neuroeconomics) and social neuroscience (because social decisions appear to be grounded in the same reward-processing brain structures as financial decisions; Fehr and Camerer, 2007). The fundamentals of reward processing were first identified in animal studies, which showed that dopaminergic midbrain neurons increase their activity when a cue signals an upcoming reward or when an unexpected reward actually occurs (reward prediction error, RPE) (Schultz et al., 1997). Secondary reward signals occur in the primary projection sites of those midbrain neurons, i.e. the ventral striatum (VS) and the orbitofrontal cortex (OFC). fMRI studies have shown that BOLD responses in humans closely parallel those findings: BOLD activity in the midbrain (D'Ardenne et al., 2008) and in its projection sites (Pagnoni et al., 2002; Rolls et al., 2008) increases with reward predictions, scales positively with the RPE, and shows the temporal characteristics expected from temporal difference models of reinforcement learning (O'Doherty et al., 2003). fMRI studies have demonstrated that a variety of different rewards can induce such reward-related brain activity. The strength of the brain's response to a reward correlates better with the subjective value of the reward for that individual (individual preference) than with objective reward magnitude (Kable and Glimcher, 2007; O'Doherty et al., 2006; Tobler et al., 2007). Together, these results suggest that activity in reward-processing brain areas can be interpreted as a surrogate biomarker of individuals' preferences (Knutson et al., 2009). There are also promising results suggesting that inter-individual differences in reward-related BOLD signals are related to personality traits (Beaver et al., 2006) and to genetic polymorphisms that affect dopamine metabolism (Cohen et al., 2005; Jocham et al., 2009).

Further establishing these links requires better knowledge about the reliability of the respective measures. This has been the topic of a recent scientific debate regarding the meaningfulness of across-subject correlations between BOLD signals and other individual measures such as personality traits, preferences or attitudes (Poldrack and Mumford, 2009; Vul et al., 2009). The correlation that can be observed between two measures A and B is theoretically limited by their reliabilities: the upper limit is given by the reliability index √(rel_A × rel_B) (Nunnally, 1970). Therefore, when using the BOLD signal as an indicator of individual characteristics, it is essential to know its reliability, i.e. the stability of inter-individual differences in the magnitude of BOLD contrasts over time. Given the central role of reward processing in many fields of cognitive neuroscience and the potential use of reward-related BOLD signals as biomarkers for individuals' preferences, we therefore specifically analyzed the test–retest reliability of reward-related BOLD responses.
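As a simple illustration of this bound (a minimal Python sketch; the numerical case mirrors the example taken up again in the Discussion, a perfectly reliable measure paired with a BOLD contrast of reliability 0.5):

```python
import math

def max_observed_correlation(rel_a: float, rel_b: float) -> float:
    """Upper bound on the correlation observable between two measures,
    given their retest reliabilities (Nunnally, 1970)."""
    return math.sqrt(rel_a * rel_b)

# A perfectly reliable measure (e.g., a genotype) vs. a BOLD contrast with reliability 0.5:
print(round(max_observed_correlation(1.0, 0.5), 2))  # 0.71, even if the true correlation were 1
```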
Material and methods

Subjects

Eight female and 17 male subjects, aged 19–35 years (mean = 26.12, SD = 3.98), were scanned in two sessions separated by 7.7 days on average (range 7–13 days). The time of day of each session was kept fairly constant for each subject (median difference in time of day between sessions: 18 min, range 0–260 min). Subjects had no history of psychiatric or neurological disorders. Two female subjects had to be excluded from the entire analysis, one due to an incidental finding of a brain abnormality and the other due to acuity problems with the video goggles in one session. The data of another subject from Paradigms A and C were discarded because he reported vision problems during one session of these tasks. Another subject was excluded from the analysis of Paradigm A because the subject fell asleep during one session. Finally, one subject was excluded from the analysis of Paradigm C due to excessive head motion (greater than 3 mm in one direction) within the scanner. For the remaining subjects, translational movement never exceeded 3 mm and rotational movement never exceeded 2.5° (with respect to the first acquired image) in any subject or session. The main analysis thus included data from 21 subjects for Paradigm A, 23 subjects for Paradigm B and 21 subjects for Paradigm C. In an additional analysis, a stricter head motion exclusion criterion (translation of less than 1.5 mm and rotation of less than 1°) was applied to test whether retest reliabilities were affected by head motion (leaving 13 participants in Paradigm A, 17 in B, and 12 in C). All subjects gave written informed consent. The study was approved by the Ethics Committee of the University of Bonn.

Experimental procedure

Before scanning, all subjects were told they were participating in tasks designed to study reward processing. They received detailed instructions and completed practice trials for each task. In the scanner, subjects completed three simple monetary reward tasks of similar experimental structure but different task requirements, with a parametric variation of the probability of winning a monetary reward of 10 cents per trial. In each paradigm, a cue signaled a certain reward probability (RP), subjects had to respond by pressing a button, and finally a reward feedback message was presented. Subjects were informed that the sum of the gains of all trials would be paid after the scanning session. Each of the three paradigms comprised 100 trials and took about 15 min to complete. The inter-trial interval (ITI) was jittered between 1 and 4 s in all three paradigms. The order of the paradigms was balanced across subjects but held constant for a given subject across the two sessions. The stimuli were presented via video goggles (NordicNeuroLab, Bergen, Norway) using Presentation© software (NeuroBehavioural Systems Inc.). Subjects gave their responses by pushing buttons on response grips (NordicNeuroLab, Bergen).

Paradigm A (Fig. 1A)

In Paradigm A, subjects had to guess whether the first number, presented on the left side of the screen (cue), was numerically greater or less than a subsequently shown second number appearing on the right side of the screen (for a similar design see e.g. D'Ardenne et al., 2008).
The first number was between two and eight and the second number was between one and nine and never equal to the first number. Thus, the cue signaled a reward probability of either 0.50, 0.625, 0.75 or 0.875 if subjects chose the most probable guess. Subjects indicated their guess by pressing a button with either their left (for “greater than”) or with their right (for
“less than”) index finger. If subjects failed to respond within a time limit of 3 s, they lost 10 cents for that trial. The chosen symbol was then shown as response feedback for a variable interval of 0.5–3.5 s. Finally, the correct number was displayed together with the reward feedback for 2 s (either 0 cents if the guess was incorrect or 10 cents if the guess was correct).

Paradigm B (Fig. 1B)

At the beginning of each trial of Paradigm B, subjects saw a pie chart indicating the reward probability for that trial (cue). The reward probability was either 0 (an empty pie), 0.25, 0.50, 0.75 or 1 (a filled pie). After the presentation of a fixation cross for a variable interval of 0.5–3.5 s, subjects had to respond with their left index finger if a square appeared or with their right index finger if a triangle appeared. By answering correctly, they preserved their chance to win 10 cents with the previously indicated reward probability. If subjects responded incorrectly, they did not receive a reward, and if they failed to give a response within 1 s, they lost 10 cents. Finally, subjects received reward feedback for 1.5 s. For similar designs, see e.g. Abler et al. (2006).

Paradigm C (Fig. 1C)

In Paradigm C, one, two, three or four quadratic boxes were displayed horizontally at the beginning of a trial for 2 s (cue). Subjects had to guess behind which of the boxes a circle was hidden, so the reward probability was either 0.25 (four boxes), 0.33, 0.5 or 1 (one box). Subjects indicated their guess by pressing one of the four buttons on the response grips, which corresponded to the positions of the boxes. If subjects failed to respond within 2 s or pressed a button that did not correspond to a box (in trials with fewer than four boxes), they lost 10 cents in that trial. Subjects received feedback on the selected box for 1.5–4.5 s. At the end of a trial, the circle was revealed behind one of the boxes together with the reward feedback for 2 s.

After each scanning session, subjects filled out personality questionnaires that included scales that have been theoretically and empirically linked to individual differences in reward processing: the NEO-FFI (Costa and McCrae, 1992), the Temperament and Character Inventory (Cloninger, 1994), and the Behavioral Inhibition System/Behavioral Activation System (BIS/BAS) scales (Carver and White, 1994). This was done in order to determine retest reliabilities for these personality measures in our sample and to calculate correlations between these measures and reward-related BOLD signals. In addition to receiving a payment of €15 per session, subjects were paid the sum of rewards from the three paradigms (on average €6.83 in Paradigm A, €4.83 in Paradigm B and €4.99 in Paradigm C).

Image acquisition

Scanning was performed on a 1.5 T Avanto scanner (Siemens, Erlangen, Germany) using an 8-channel head coil. In both sessions, functional data were acquired using an EPI sequence with a repetition time (TR) of 2.5 s, an echo time (TE) of 45 ms and a flip angle of 90°. Thirty-one axial slices covered the whole brain including the superior part of the cerebellum and the midbrain. Image resolution was 64 × 64 pixels with a field of view of 192 × 192 mm. Together with a slice thickness of 3 mm and an interslice gap of 0.3 mm, this resulted in a voxel size of 3 × 3 × 3.3 mm.

fMRI data analysis

Preprocessing and analysis of all fMRI data were performed using Statistical Parametric Mapping 5 (SPM5, http://www.fil.ion.ucl.ac.uk/spm/).
Preprocessing comprised slice time correction, motion correction, spatial normalization to the canonical EPI template used in SPM5 and smoothing with an 8 mm Gaussian kernel. After normalization, volumes were resliced to a voxel size of 3 × 3 × 3 mm.
Fig. 1. Time courses of a single trial in each of the three tasks. All tasks comprised a cue that signaled reward probability, a behavioral response and reward feedback. Reward probability and task requirements were varied. (A) Subjects had to guess whether one number would be numerically greater or less than a subsequently shown second number. Thus, subjects had a reward probability of either 50%, 62.5%, 75% or 87.5% in a trial if they chose the most favorable guess. (B) Subjects saw a pie chart indicating the reward probability: 0% (empty pie), 25%, 50%, 75% or 100% (filled pie). Then subjects had to correctly classify a stimulus as triangle or square. (C) Subjects had to guess behind which of one, two, three or four boxes a circle was hidden. Thus reward probability was 25% (four boxes), 33%, 50% or 100% (one box).
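The cue-signaled probabilities in panels (A) and (C) follow directly from the stimulus sets described above; the following minimal sketch (illustrative helper functions, not part of the original stimulus code) reproduces them:

```python
def rp_paradigm_a(first_number: int) -> float:
    """Reward probability in Paradigm A for the most probable guess: the second
    number is one of the 8 values in 1-9 that differ from the first number."""
    if not 2 <= first_number <= 8:
        raise ValueError("the first number is between 2 and 8")
    n_greater = 9 - first_number      # possible second numbers larger than the cue
    n_smaller = first_number - 1      # possible second numbers smaller than the cue
    return max(n_greater, n_smaller) / 8.0

def rp_paradigm_c(n_boxes: int) -> float:
    """Reward probability in Paradigm C: the circle hides behind one of n_boxes."""
    return 1.0 / n_boxes

print([rp_paradigm_a(c) for c in (5, 4, 3, 2)])            # [0.5, 0.625, 0.75, 0.875]
print([round(rp_paradigm_c(n), 2) for n in (4, 3, 2, 1)])  # [0.25, 0.33, 0.5, 1.0]
```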
Three different first-level general linear models were estimated for each of the three paradigms, for each session and for each subject in order to obtain measures for different aspects of reward processing and for motor activity (see Table 1). Because of the structural similarities between the paradigms, the models were essentially alike for all three:

i) Categorical reward model: this model included an (unmodulated) regressor for the onset of the cue and two regressors for the onset of the reward feedback, depending on whether a subject received a reward (“win trials”) or not (“no-win trials”). This model allowed us to determine the effects of reward receipt that are independent of reward expectation.

ii) Parametric reward model: this model included a regressor for the onset of the cue and a linear parametric regressor designating the reward probability (RP) indicated by the cue. It further included one regressor for the onset of the reward feedback together with a linear parametric regressor designating the size of the reward prediction error (RPE): 1 minus RP for win trials, 0 minus RP for no-win trials. This parameter designates the deviation of the reward outcome from the reward expectation (given by RP). This model allowed us to determine the linear effects of reward prediction and reward prediction errors on the BOLD signal.

iii) Motor model: this model included unmodulated regressors for the onsets of the cue and the reward feedback together with regressors for the onset of left- and right-sided motor responses.

The minimum number of events in an event category was 21 for the categorical reward model (Paradigm A). The median number of events that went into the reward receipt contrast was 50, which follows from the fact that there were 100 trials per paradigm and two event categories. The number of events in the parametric analyses was always 100, because here events were not split up according to event type. In addition to modeling responses for each paradigm separately, we set up a first-level model that combined the data from all three paradigms. For this analysis, data were available from 20 subjects who completed all sessions of all three paradigms.

In all models, the stimulus onsets were convolved with the canonical HRF of SPM5 and its time derivative. Low-frequency noise was reduced by a high-pass filter of 128 s, and temporal serial correlations in the fMRI signal were corrected by a first-order autoregressive model. The six realignment parameters of head motion were included in the models to account for residual head movement not corrected by the motion correction during preprocessing. To identify brain activity related to reward receipt, reward prediction, reward prediction errors and motor activity, we defined five contrasts: a reward prediction contrast, a reward receipt contrast, a reward prediction error contrast, a left motor contrast, and a right motor contrast (Table 1).
Table 1
Overview of the first-level models.

Model                      Onset regressors                       Parametric regressors      Contrasts
Categorical reward model   1. Cue                                 None                       1. Reward receipt (positive reward feedback > negative reward feedback)
                           2. Positive reward feedback (win)
                           3. Negative reward feedback (no win)
Parametric reward model    1. Cue                                 1. Reward probability      1. Reward prediction (reward probability > 0)
                           2. Reward feedback                     2. RPE                     2. Reward prediction error (RPE > 0)
Motor model                1. Cue and reward feedback             None                       1. Left motor (left side response > 0)
                           2. Left side response                                             2. Right motor (right side response > 0)
                           3. Right side response
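For concreteness, the RPE parametric regressor of the parametric reward model in Table 1 is simply the trial outcome (1 for a win, 0 for no win) minus the cue-signaled reward probability; a minimal sketch (illustrative only, not the authors' SPM code):

```python
def rpe(won: bool, reward_probability: float) -> float:
    """Trial-wise reward prediction error used as the parametric regressor:
    outcome (1 for a win, 0 for no win) minus the cue-signaled reward probability."""
    outcome = 1.0 if won else 0.0
    return outcome - reward_probability

print(rpe(True, 0.25))   # unexpected win     -> +0.75
print(rpe(False, 0.75))  # unexpected no-win  -> -0.75
print(rpe(True, 1.0))    # fully expected win ->  0.0
```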
Note that the reward receipt contrast and the RPE contrast are similar and correlated, but reflect different aspects of reward processing: the reward receipt contrast reflects responses to reward receipt vs. reward omission independent of reward probability, whereas the RPE contrast reflects the effect of the mismatch between the actual reward and the RP.

In order to identify the main effects of the respective events, each of these five contrasts for each paradigm was subjected to a random-effects second-level analysis with a one-sample t-test as a model, resulting in 15 t-maps per session. We then investigated the main effects of the contrasts in anatomically defined regions of interest (ROIs). For the reward contrasts, we investigated the left and right ventral striatum (VS) (spherical ROIs with a diameter of 12 mm centered on the MNI coordinates ±14, 12, −8) (Knutson et al., 2008) and the medial orbitofrontal cortex (OFC) (comprising approximately 420 voxels, defined according to the anatomic labeling of the WFU PickAtlas; Maldjian et al., 2003; Maldjian et al., 2004). For the motor contrasts, we used the left and right M1 (comprising approximately 600 voxels on each side, defined by the Brodmann area atlas implemented in the WFU PickAtlas) as ROIs. To test the stability of the main effects over the two sessions, we conducted paired t-tests comparing the contrast values of Sessions 1 and 2 within the ROIs. We additionally tested for differences between the three paradigms and report and discuss the results in Supplemental Table 4 and the Supplemental Discussion. Statistical thresholds for the main effect analyses were set at p < 0.05, small-volume corrected for the respective ROI using the family-wise error correction implemented in SPM5.

We then calculated measures of test–retest reliability on individual contrast values for each ROI and paradigm. For the main analysis, we extracted individual contrast values from a sphere of 5 mm radius (≈19 voxels) centered at the peak of the group activation in Session 1, and from the same site for Session 2. In additional analyses, we applied different methods to sample the voxels from which contrast values were extracted. These included calculating ICCs based on (i) the group peak voxel activation (and its 5 mm surrounding) from Sessions 1 and 2, (ii) individual peak voxel activations from Sessions 1 and 2 and (iii) activations from both sessions averaged over all voxels exceeding a preset threshold (see Supplemental Table 1). We also performed an analysis for the subgroup of male subjects, based on the consideration that effects of the menstrual cycle on task-related brain activity are known in female subjects and could increase the variability of the BOLD signal over time.

As a measure of retest reliability, we used the intra-class correlation coefficient (ICC), which sets within-subject (error) variance in relation to (true) between-subject variance. Specifically, we utilized ICC(3,1) (Shrout and Fleiss, 1979), which has been established as an adequate measure for the underlying question (Specht et al., 2003). ICC(3,1) tests the consistency of individual measures by relating (true) between-subject and (error) within-subject variance and, for the case of two test sessions, is computed as ICC(3,1) = (σ²between-subjects − σ²within-subjects) / (σ²between-subjects + σ²within-subjects).
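The following minimal Python sketch illustrates this ICC(3,1) computation for two sessions via the underlying two-way ANOVA mean squares (illustrative only; the study itself used SPM contrast estimates and, for the whole-brain maps, the ICC toolbox of Caceres et al., 2009; the data below are hypothetical):

```python
import numpy as np

def icc_3_1(scores):
    """ICC(3,1): two-way mixed model, consistency, single measures, for an
    (n_subjects x k_sessions) array of individual contrast estimates."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_subjects = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_sessions = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_sessions
    bms = ss_subjects / (n - 1)            # between-subjects mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual (error) mean square
    return (bms - ems) / (bms + (k - 1) * ems)

# Hypothetical contrast values for 21 subjects measured in two sessions:
rng = np.random.default_rng(1)
session1 = rng.normal(0.5, 0.3, 21)
session2 = 0.4 * session1 + rng.normal(0.3, 0.3, 21)  # weak session-to-session consistency
print(round(icc_3_1(np.column_stack([session1, session2])), 2))
```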
In the current study, ICC related the variance due to intraindividual changes in activation effects between
the two sessions to the variance due to interindividual differences in activation effects. ICCs were calculated for each of the three paradigms separately and for a model combining the data from the three paradigms. As a measure of internal consistency between the three paradigms, we calculated Cronbach's α (see Supplemental Table 2). We also calculated the ICC in the same way for the reaction times during the experimental sessions and for the scales of the applied personality questionnaires.

In addition to analyzing reliabilities within the ROIs, we calculated ICC maps for each contrast for the whole brain (using the ICC toolbox provided by Caceres et al., 2009). This allowed us to test for retest reliabilities outside the ROIs and enabled us to relate reliabilities to the strength of main effects via joint probability distributions of ICCs and t-values. By doing so, we could test whether reliabilities are consistently higher in brain areas that are activated by a given contrast, and thereby check the reliabilities within relevant “brain networks,” as done in previous studies (Aron et al., 2006; Caceres et al., 2009). As another potential estimate of reliability, ICC maps provide the opportunity to determine median values of the distribution of ICCs within an ROI.

Results

Behavioral data

The behavioral data revealed no gross differences in subjects' response behavior between the two sessions: neither the percentage of missed responses, the mean response times nor the amount of rewards differed significantly between Sessions 1 and 2 in any of the paradigms (p > 0.05, see Table 2), with the exception that mean response times in Session 2 of Paradigm C were significantly lower than in Session 1 (p = 0.015). The percentage of improbable guesses in Paradigm A and the percentage of incorrect responses in Paradigm B also did not differ between sessions. Moreover, mean response times proved to be highly reliable (ICC > 0.74 in every paradigm).

fMRI analysis: main effects

The reward receipt contrast, the RPE contrast and the motor contrasts evoked significant BOLD responses within the respective ROIs for every paradigm in each session. The only exception was that in Paradigm B the RPE contrast did not lead to significant OFC activation in both sessions. The RP contrast did not lead to significant activations in Paradigm A, and only to inconsistent activations in the other two paradigms (see Table 3). Within the ROIs, significant effects of session were only found for the RP contrast in Paradigm A in the left VS (seven voxels at MNI −12, 21, −15, F = 23.82) and in Paradigm B in the OFC (one voxel at MNI 12, 33, −12, F = 36.70). T-maps for the reward receipt contrast and the RPE contrast for the two sessions overlapped to a high degree (Fig. 2). These results show that on the group level the reward receipt contrast and the RPE contrast significantly activated the bilateral VS and the OFC and that these activations were reproducible in a second measurement.
Table 2
Behavioral data.

                                 Paradigm A                        Paradigm B                        Paradigm C
Session                          1                2                1                2                1                2
% missed responses (± SD)        0.19 ± 0.60      0.19 ± 0.51      1.35 ± 1.68      0.87 ± 1.51      1.38 ± 1.88      1.10 ± 2.14
% incorrect responses (± SD)     n/a              n/a              1.13 ± 1.94      2.04 ± 3.17      n/a              n/a
% improbable guesses (± SD)      6.71 ± 5.76      5.48 ± 4.48      n/a              n/a              n/a              n/a
Response times in ms (± SD)      896 ± 196        845 ± 142        532 ± 58         527 ± 52         872 ± 136        818 ± 120⁎
ICC response times (95% CI)      0.77 (0.51–0.90)                  0.75 (0.50–0.89)                  0.75 (0.49–0.89)
Rewards in € (± SD)              6.83 ± 0.60      6.84 ± 0.43      4.83 ± 0.53      4.82 ± 0.59      4.99 ± 0.61      4.99 ± 0.59

n/a: not applicable. ⁎ Significantly lower than in Session 1 (p = 0.015).
Table 3
Main effects of the reward and motor contrasts in the regions of interest.

                       Paradigm A                        Paradigm B                        Paradigm C
ROI        Session     MNI             No     t          MNI             No     t          MNI             No     t

RP contrast
Left VS       1        −12, 21, −15    –      2.90       −15, 21, −15    7      4.19⁎      −12, 24, −9     –      1.74
Left VS       2        −21, 9, 0       –      1.16       −12, 3, −15     –      3.05       −3, 12, −12     –      3.50
Right VS      1        18, 12, −18     –      2.80       9, 9, −3        5      5.13⁎      6, 21, −9       –      1.69
Right VS      2        12, 3, −12      –      2.20       18, 3, −6       2      4.57⁎      3, 9, −9        3      5.45⁎
OFC           1        12, 45, −9      –      3.88       15, 42, −3      –      2.45       3, 48, −3       69     8.30⁎
OFC           2        −12, 45, −6     –      2.01       12, 48, −3      –      3.68       15, 48, −6      3      4.66⁎

Reward receipt contrast
Left VS       1        −9, 9, −6       2      4.19⁎      −21, 12, −12    4      4.76⁎      −15, 3, −15     70     7.62⁎
Left VS       2        −15, 3, −12     37     7.2⁎       −18, 9, −6      33     5.81⁎      −15, 3, −15     75     6.68⁎
Right VS      1        21, 9, −15      3      4.52⁎      15, 3, −12      2      4.39⁎      18, 9, −12      73     5.86⁎
Right VS      2        12, 15, −9      8      5.46⁎      3, 12, −6       17     5.41⁎      24, 9, −12      74     6.13⁎
OFC           1        −9, 51, −3      96     7.34⁎      −6, 42, −9      2      4.29⁎      9, 51, −3       271    10.33⁎
OFC           2        −3, 57, −3      142    7.68⁎      3, 39, −6       10     5.75⁎      0, 39, −9       84     5.83⁎

RPE contrast
Left VS       1        −12, 6, −6      6      4.29⁎      −21, 15, −12    2      4.66⁎      −18, 6, −15     76     8.79⁎
Left VS       2        −18, 3, −12     40     7.49⁎      −9, 15, −3      1      4.12⁎      −21, 12, −9     114    7.45⁎
Right VS      1        15, 21, 0       3      4.30⁎      15, 6, −12      4      4.55⁎      15, 9, −12      97     8.64⁎
Right VS      2        9, 18, −9       5      5.02⁎      15, 18, −6      1      4.22⁎      12, 6, −12      119    6.44⁎
OFC           1        −3, 51, −3      43     6.93⁎      −6, 39, −9      –      3.57       6, 48, −3       121    7.59⁎
OFC           2        −3, 57, −3      59     6.71⁎      3, 39, −6       –      3.76       9, 36, −9       17     7.07⁎

Motor contrast
Right M1      1        39, −21, 63     92     7.97⁎      39, −15, 60     142    9.67⁎      45, −18, 63     386    12.51⁎
Right M1      2        54, −15, 42     71     6.01⁎      45, −18, 57     164    9.58⁎      45, −15, 60     328    9.41⁎
Left M1       1        −39, −18, 66    83     7.52⁎      −39, −15, 66    117    7.14⁎      −39, −21, 63    337    11.98⁎
Left M1       2        −39, −18, 63    83     10.06⁎     −36, −18, 66    149    8.18⁎      −36, −21, 57    390    11.99⁎

MNI: MNI coordinates of peak activation. No: number of suprathreshold voxels. –: no suprathreshold voxels. ⁎ p < 0.05, FWE-corrected for the search volume.
The comparison between the three paradigms yielded several significant differences (see Supplemental Table 4). While the RP contrast in the VS was significantly higher for Paradigm B than for the two other paradigms, the strongest RPE contrast in all regions was observed for Paradigm C. The reward receipt contrast yielded higher values in the OFC for Paradigms A and C than for Paradigm B.
Analysis of reliabilities within ROIs

Individual estimates of the RP contrast, the reward receipt contrast and the RPE contrast demonstrated mostly poor² (ICC < 0.40) reliability (Table 4). Fair reliabilities (ICC > 0.41) were only found for the reward receipt contrast and the RPE contrast in the OFC in Paradigm A and for the RP contrast in the right VS in Paradigm C. Good reliabilities (ICC > 0.60) were found for the motor contrasts in Paradigms A and C (Table 4). The different methods of sampling the voxels from which contrast values were extracted led to slightly different ICCs, but did not generally change the pattern of results of the main analysis (see Supplemental Table 1). The application of a stricter head movement criterion led to higher ICCs for some of the contrasts in Paradigm A; in this analysis, the highest ICC for a reward contrast was observed (ICC = 0.67 for the reward receipt contrast in the OFC). However, in other cases the stricter head movement criterion did not improve reliabilities. Across these different approaches, the general trend was that reliabilities of the motor contrasts were higher than reliabilities of the reward-related contrasts, for which observed reliabilities were fair at best. This was also the case for the analysis based on the first-level model that combined the data from all three paradigms: here, the highest ICC for a reward contrast was 0.57 for the reward receipt contrast in the OFC, while ICCs for all other reward contrasts in all other locations were poor. Again, ICCs for the motor contrasts were relatively high (0.59 for left M1 and 0.79 for right M1). The exclusion of female subjects from the analysis did not significantly alter the results (see Supplemental Table 3).

² Throughout this article we classify reliabilities as follows: <0.40 “poor”, 0.41–0.60 “fair”, 0.61–0.80 “good” and >0.80 “excellent” (see Cicchetti, 2001).

Analysis of reliabilities: whole brain

ICC maps for the whole brain showed that there were clusters of voxels exhibiting high reliabilities, but the overlap between networks of reward-related activity and high ICC values was small. Fig. 3 shows that in Paradigms A and C activity due to the reward receipt contrast and high reliabilities of this measure overlap in the OFC, but not in the VS. Whole-brain joint probability distributions showed an association between t-values and ICCs in the sense that ICCs were generally higher within brain areas displaying strong main effects (e.g. t > 4.0) (Fig. 4). This implies that activation within relevant networks (i.e. voxels with higher t-values) is measured more reliably than within areas not responding to a given condition. In the respective networks, median ICCs of the motor contrasts (ranging from 0.34 to 0.45) were higher than those of the reward contrasts (0.04 to 0.34).
Reliability of questionnaire scales and correlations with reward-related activity

The scales of the NEO-FFI, the TCI and the BIS/BAS were highly reliable, with ICCs ranging from 0.76 (BIS/BAS: fun seeking) to 0.95 (TCI: self-transcendence) (Supplemental Table 5). Only a few scales correlated significantly with individual estimates of the reward receipt contrast in any ROI or paradigm (Supplemental Table 6).

Discussion

This study determined retest reliabilities of BOLD signals related to reward processing. This was based on the consideration that these signals might be used as surrogate markers for individuals' preferences, but that such an application requires sufficient reliability.
Fig. 2. T-maps of the reward receipt contrast in Paradigms A, B and C, separately for Session 1 (upper row) and Session 2 (lower row). T-maps are thresholded at p < 0.001, uncorrected. The contours of the predefined VS and OFC are overlaid in white.
Despite highly significant main effects of the reward contrasts and good reliabilities of motor-related activity, we found that the retest reliabilities of the different reward contrasts in the ventral striatum and the orbitofrontal cortex were only poor to fair. These findings raise important questions concerning the reasons for, and the consequences of, limited reliability in fMRI.

Several other studies have previously assessed the reliability of BOLD signals for a variety of paradigms, such as simple visual stimulation (Specht et al., 2003), auditory signal detection (Caceres et al., 2009), working memory (Caceres et al., 2009), probabilistic learning (Aron et al., 2006), antisaccade generation (Raemaekers et al., 2007), finger tapping (Kong et al., 2007), emotion processing (Johnstone et al., 2005) and electroacupuncture stimulation (Kong et al., 2007). In the majority of these studies, retest reliabilities were very heterogeneous across contrasts and regions, ranging from below zero to very high (>0.9) reliabilities (Aron et al., 2006). The main difference between previous studies and this study lies in our focus on reward-related brain activity and our relatively large sample size. We will discuss potential sources of reduced reliability, addressing differences between our study and previous studies where appropriate.
Experimental conditions, measurement quality and sample characteristics

Any study addressing the reliability of fMRI data has to ensure that obvious and avoidable sources of error variance are excluded as an explanation for the observed reliabilities. We tried to achieve this in our study by keeping the experimental conditions between the two sessions as constant as possible, by applying a relatively short retest interval, by excluding data sets of doubtful quality and by choosing a relatively large and not overly homogeneous sample. A more detailed discussion of these points is provided as supplemental material. Additionally, we obtained parameters that indicate to what degree these factors were realized. The stability of the behavioral data collected during scanning argues against large inconsistencies in the experimental conditions; the highly significant main effects for two of the paradigms, their stability over time and the satisfactory reliabilities for motor responses ensure a certain data quality; and the high reliabilities of the personality scales argue against an overly homogeneous sample. Finally, the relatively large sample size (compared to other studies addressing the reliability of the BOLD signal) allows for a more precise estimation of reliability coefficients than in previous studies.
Table 4
Reliabilities of activations in the ROIs (main analysis).

                           Paradigm A                       Paradigm B                       Paradigm C
ROI                        ICC (95% CI)          p          ICC (95% CI)          p          ICC (95% CI)          p

RP contrast
Left VS                    0.13 (−0.31–0.52)     0.289      0.03 (−0.38–0.43)     0.451      0.13 (−0.31–0.52)     0.284
Right VS                   0.02 (−0.41–0.44)     0.472      −0.15 (−0.53–0.27)    0.763      0.44 (0.02–0.72)      0.022
OFC                        0.13 (−0.31–0.53)     0.278      0.05 (−0.36–0.45)     0.407      0.31 (−0.13–0.65)     0.079

Reward receipt contrast
Left VS                    0.18 (−0.26–0.56)     0.211      0.12 (−0.30–0.50)     0.295      0.22 (−0.22–0.59)     0.164
Right VS                   0.14 (−0.30–0.53)     0.268      0.09 (−0.33–0.48)     0.336      −0.13 (−0.52–0.31)    0.715
OFC                        0.44 (0.02–0.73)      0.020      0.25 (−0.17–0.59)     0.120      0.33 (−0.11–0.66)     0.067

RPE contrast
Left VS                    −0.05 (−0.46–0.38)    0.583      0.10 (−0.32–0.48)     0.325      0.10 (−0.34–0.50)     0.330
Right VS                   0.14 (−0.30–0.53)     0.262      0.10 (−0.31–0.49)     0.317      −0.01 (−0.43–0.42)    0.516
OFC                        0.43 (0.01–0.72)      0.022      −0.03 (−0.43–0.38)    0.562      0.06 (−0.37–0.47)     0.387

Motor contrast
Right M1                   0.67 (0.35–0.85)      0.000      0.32 (−0.09–0.64)     0.061      0.66 (0.32–0.84)      0.001
Left M1                    0.69 (0.38–0.86)      0.000      0.54 (0.17–0.78)      0.003      0.73 (0.45–0.88)      0.000

Scanner field strength and scanning parameters

Generally, it is assumed that an improved signal-to-noise ratio (SNR) can be obtained with higher field strength. Accordingly, one might expect better reliabilities of BOLD signals when using 3 T scanners instead of the 1.5 T scanner used in the present study. On the
other hand, higher field strength is associated with an increase in susceptibility artifacts, which is especially problematic for regions adjacent to air/tissue interfaces such as the orbitofrontal cortex. Whether higher field strengths are actually beneficial for the retest reliability of reward-related BOLD signals therefore has to be established in further studies.

A related question concerns the optimal slice orientation for acquiring BOLD responses from the OFC. In our study, we used standard axial (AC/PC-oriented) slice positioning in order to make our results comparable to the majority of studies aiming at whole-brain analyses. Other orientations have been suggested in order to improve the SNR especially for the OFC (Deichmann et al., 2003; Weiskopf et al., 2006) and are sometimes employed in imaging studies dealing with reward processing (Hampton et al., 2006; O'Doherty et al., 2006). Optimization of slice orientation might therefore be an approach for improving reliability, at least in the OFC.

Another potential source of error variance between the two sessions might arise from the use of interslice gaps. If the gap (and not a slice) accidentally covers a certain portion of the active neurons, this portion will not contribute to the signal. As it is impossible to position slices and gaps in exactly the same way in both sessions, this
portion may differ between sessions. This may reduce reliability, especially if the gap is large relative to the dimensions of the active brain area under investigation, and might therefore affect reward processing areas more than the more extended motor areas. The rationale for using gaps is to minimize cross-talk between adjacent slices resulting from imperfections in the slice selection pulse of the sequence. Although the excitation profile of these pulses is quite precise in modern scanners, small imperfections will still result in cross-talk. Without interslice gaps, this cross-talk would cause some portions of the active region to be acquired twice, which could again reduce reliability. To address these concerns and in accordance with general practice, we decided to use a gap of 10% (0.3 mm). In case of relevant imperfections of the excitation profile, this could even reduce signal loss.
Fig. 3. Overlay of the t-map (color coded in red) of the reward receipt contrast from Paradigms A, B and C and associated voxel-wise ICC values (color coded in blue). T-maps are averaged across Sessions 1 and 2 and thresholded at p < 0.001, uncorrected. ICC maps are thresholded at ICC > 0.40. Contours of the VS and OFC are overlaid in white.
Fig. 4. Joint probability distributions of voxel-wise t-values and associated ICCs (upper part). Frequency distributions of these ICCs for the whole brain (red), for all voxels within the activated network (blue, voxels with p < 0.001), and for the brain without the network (green) (lower part). (A) T-values represent the group effect of reward receipt in Paradigm C. (B) T-values represent the group effect of the left motor response in Paradigm A.

Experimental design

In order to obtain a good basis for the parameter estimation of the general linear model applied in the data analysis, we aimed at a sufficient number of events for each condition. Each of the three
paradigms consisted of 100 trials. The regressors for each subject contained at least 21 events per reward condition, and the median number of events for the reward receipt contrast was 50. Regressors for the parametric analyses (RP and RPE contrasts) always comprised 100 trials. This is in line with general recommendations for fMRI analysis; methodological studies suggest that approximately 25 events per regressor ensure sufficiently robust results in event-related fMRI (Murphy and Garavan, 2005). To test whether an increase in the number of observations would improve reliabilities, we calculated a model that combined all three paradigms and recalculated ICCs for this model. This yielded an increase in the ICC for the reward receipt contrast in the OFC, while ICCs for all other contrasts in all other locations remained poor. Thus, an increase in the number of observed events might improve reliability, although it might also raise new problems (e.g. head movement, fluctuations in motivation and vigilance).

Another important issue is the use of an event-related design, given that blocked designs are generally assumed to yield more robust results (Friston et al., 1999). However, when addressing reward processing, the unexpectedness of rewards is crucial (in order to observe the effects of the RPE), and this simply cannot be achieved with blocked designs. One might nevertheless think of inducing reward system activation with a blocked design, e.g. by presenting blocks of pleasant or unpleasant stimuli.

By using three different paradigms, we are able to generalize our results across different experimental settings. All of our designs manipulate reward expectancy and reward prediction errors in a simple way. Although they all led to significant reward-related activations, the strength of activation varied between the designs. The weakest activations were observed for Paradigm B, and for this paradigm there were also no significant ICCs. For Paradigms A and C, we found highly significant reward effects within our ROIs. Therefore, the lack of reliability for the reward contrasts in these paradigms cannot simply be explained by an inability to produce the relevant BOLD signal. Varying reward magnitude and type may allow further generalization of our results. It seems possible that larger monetary rewards would sustain higher levels of attention throughout an experiment, which could improve the reliability of the responses. It might also be that more abstract (e.g. social) rewards or more basic rewards (e.g. primary reinforcers such as food or drink for hungry or thirsty subjects) provide more reliable activations. The identification of general requirements for an experimental design suited to produce reliable results should be addressed by future studies.
Sampling of voxels for ICC calculation

A critical aspect in the determination of reliability coefficients (and in correlating BOLD and other measures) concerns the sampling of the voxels from which the BOLD signal is derived. On the whole-brain level, ICCs can be computed voxel-wise, and the resulting ICC maps can be related to maps of main effects, which designate task-relevant brain areas. The resulting distributions of ICCs within task-relevant networks can be characterized by distribution parameters such as the median or the maximum. We followed this approach and found positive but altogether low median ICCs for reward networks, and higher values for the motor contrasts. In agreement with Aron et al. (2006), who investigated the reliability of frontostriatal BOLD responses in probabilistic classification learning, we found high to excellent maximal ICCs in reward processing areas. However, maximal ICC values from large brain areas do not provide unbiased estimates of the size of reliability coefficients in given areas of the brain, which is especially problematic when sample sizes are small. The median ICC might not provide a very meaningful measure either, given that “activated networks” often consist of inhomogeneous functional areas that reflect several aspects of an experimental condition (e.g. attention, emotional reaction, cognitive appraisal in reward processing). Therefore, ROI approaches that combine anatomical and functional constraints appear more appropriate to the question at hand. In order to identify relevant reward processing regions, we determined sites of strong reward-related main effects within anatomically defined regions (VS and mOFC). We then extracted data from the surroundings of the group peak voxel for the respective main effect in Session 1 and calculated ICCs on the averaged parameter estimates of Sessions 1 and 2 for this site. Several alternative approaches were employed, but they did not substantially improve the ICCs. It is of note that such ROI analyses seem to be the most appropriate and unbiased approach for the selection of voxels when BOLD signal data are correlated with external individual characteristics (Vul et al., 2009). The reliability estimates obtained in this way should therefore be most informative when theoretical upper limits of correlations between the BOLD signal and other measures are in question.

Consequences of low reliability in reward-related BOLD measures

We conclude from the previously discussed points that the retest reliability of individual reward-related BOLD measures can be
severely limited despite careful control of the most obvious sources of error variance. We believe that this finding emphasizes the already self-evident need to cautiously validate correlational results in fMRI studies. Correlations obtained on a voxel-by-voxel basis have to be tested by applying the same strict corrections for multiple comparisons used for other fMRI results. Alternatively, data from a priori defined ROIs can be used to test for correlations with an external criterion at conventional statistical thresholds (Poldrack and Mumford, 2009). These recommendations are valid independently of the actual magnitude of the reliabilities. However, the results of our study might raise awareness of these issues. As a further consequence of our results, the chance of finding robust associations between reward-related BOLD signal changes and individual measures such as personality traits, preferences or attitudes, which also do not have perfect reliability, may be low (but not zero). In the case of a measure with perfect reliability (e.g. gene polymorphism status) and assuming a reliability of 0.5 for the BOLD signal (i.e. the upper limit of the reliabilities for the reward contrasts in our study), the expected value of an observed correlation between the two measures would be 0.7, even if the “true” correlation were perfect. Together with the previous studies cited above, our results further demonstrate that, in principle, higher reliabilities of the BOLD signal (>0.65) are obtainable (for the motor contrasts in our study). Further studies are needed to test how such reliabilities can be achieved in more complex cognitive domains. Finally, our results demonstrate the need to develop methods to improve reliabilities in fMRI research. All of the potential sources of reduced reliability discussed here can be addressed to optimize the quality of individual measurements. Further gains can also be expected from technical advances (e.g. higher field strengths, multi-coil arrays). We look forward to studies that report progress in this field.

Acknowledgments

We would like to thank Courtney B. Phillipps, Florian Mormann and Jason Aimone for valuable comments on the manuscript.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.neuroimage.2010.01.036.

References

Abler, B., Walter, H., Erk, S., Kammerer, H., Spitzer, M., 2006. Prediction error as a linear function of reward probability is coded in human nucleus accumbens. NeuroImage 31, 790–795.
Aron, A.R., Gluck, M.A., Poldrack, R.A., 2006. Long-term test–retest reliability of functional MRI in a classification learning task. NeuroImage 29, 1000–1006.
Beaver, J.D., Lawrence, A.D., van Ditzhuijzen, J., Davis, M.H., Woods, A., Calder, A.J., 2006. Individual differences in reward drive predict neural responses to images of food. J. Neurosci. 26, 5160–5166.
Caceres, A., Hall, D.L., Zelaya, F.O., Williams, S.C., Mehta, M.A., 2009. Measuring fMRI reliability with the intra-class correlation coefficient. NeuroImage 45, 758–768.
Carver, C.S., White, T.L., 1994. Behavioral inhibition, behavioral activation, and affective responses to impending reward and punishment: the BIS/BAS scales. J. Pers. Soc. Psychol. 67, 319–333.
Cicchetti, D.V., 2001. The precision of reliability and validity estimates re-visited: distinguishing between clinical and statistical significance of sample size requirements. J. Clin. Exp. Neuropsychol. 23, 695–700.
Cloninger, C.R., 1994.
The Temperament and Character Inventory (TCI): A Guide to Its
Development and use. Center for Psychobiology of Personality, Washington University, St. Louis, MO. Cohen, M.X., Young, J., Baek, J.M., Kessler, C., Ranganath, C., 2005. Individual differences in extraversion and dopamine genetics predict neural reward responses. Brain Res. Cogn. Brain Res. 25, 851–861. Costa, P., McCrae, R.R., 1992. Revised NEO Personality Inventory (NEO PI-R) and NEO Five Factor Inventory. Professional Manual. Psychological Assessment Resources, Odessa, Florida. D'Ardenne, K., McClure, S.M., Nystrom, L.E., Cohen, J.D., 2008. BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science 319, 1264–1267. Deichmann, R., Gottfried, J.A., Hutton, C., Turner, R., 2003. Optimized EPI for fMRI studies of the orbitofrontal cortex. NeuroImage 19, 430–441. Fehr, E., Camerer, C.F., 2007. Social neuroeconomics: the neural circuitry of social preferences. Trends Cogn. Sci. 11, 419–427. Friston, K.J., Zarahn, E., Josephs, O., Henson, R.N., Dale, A.M., 1999. Stochastic designs in event-related fMRI. NeuroImage 10, 607–619. Hampton, A.N., Bossaerts, P., O'Doherty, J.P., 2006. The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367. Jocham, G., Klein, T.A., Neumann, J., von Cramon, D.Y., Reuter, M., Ullsperger, M., 2009. Dopamine DRD2 polymorphism alters reversal learning and associated neural activity. J. Neurosci. 29, 3695–3704. Johnstone, T., Somerville, L.H., Alexander, A.L., Oakes, T.R., Davidson, R.J., Kalin, N.H., Whalen, P.J., 2005. Stability of amygdala BOLD response to fearful faces over multiple scan sessions. NeuroImage 25, 1112–1123. Kable, J.W., Glimcher, P.W., 2007. The neural correlates of subjective value during intertemporal choice. Nat. Neurosci. 10, 1625–1633. Knutson, B., Wimmer, G.E., Kuhnen, C.M., Winkielman, P., 2008. Nucleus accumbens activation mediates the influence of reward cues on financial risk taking. NeuroReport 19, 509–513. Knutson, B., Delgado, M.R., Phillips, P.E.M. (Eds.), 2009. Representation of Subjective Value in the Striatum. Academic Press, London. Kong, J., Gollub, R.L., Webb, J.M., Kong, J.T., Vangel, M.G., Kwong, K., 2007. Test–retest study of fMRI signal change evoked by electroacupuncture stimulation. NeuroImage 34, 1171–1181. Maldjian, J.A., Laurienti, P.J., Burdette, J.B., Kraft, R.A., 2003. An Automated Method for Neuroanatomic and Cytoarchitectonic Atlas-based Interrogation of fMRI Data Sets. NeuroImage 19, 1233–1239. Maldjian, J.A., Laurienti, P.J., Burdette, J.H., 2004. Precentral Gyrus Discrepancy in Electronic Versions of the Talairach Atlas. NeuroImage 21, 450–455. Murphy, K., Garavan, H., 2005. Deriving the optimal number of events for an eventrelated fMRI study based on the spatial extent of activation. NeuroImage 27, 771–777. Nunnally, J.C., 1970. Introduction to Psychological Measurement. McGraw-Hill, New York. O'Doherty, J.P., Dayan, P., Friston, K., Critchley, H., Dolan, R.J., 2003. Temporal difference models and reward-related learning in the human brain. Neuron 38, 329–337. O'Doherty, J.P., Buchanan, T.W., Seymour, B., Dolan, R.J., 2006. Predictive neural coding of reward preference involves dissociable responses in human ventral midbrain and ventral striatum. Neuron 49, 157–166. Pagnoni, G., Zink, C.F., Montague, P.R., Berns, G.S., 2002. Activity in human ventral striatum locked to errors of reward prediction. Nat. Neurosci. 5, 97–98. Poldrack, R.A., Mumford, J.A., 2009. 
Independence in ROI analysis: where is the voodoo? Soc. Cogn. Affect. Neurosci. 4, 208–213. Raemaekers, M., Vink, M., Zandbelt, B., van Wezel, R.J., Kahn, R.S., Ramsey, N.F., 2007. Test–retest reliability of fMRI activation during prosaccades and antisaccades. NeuroImage 36, 532–542. Rolls, E.T., McCabe, C., Redoute, J., 2008. Expected value, reward outcome, and temporal difference error representations in a probabilistic decision task. Cereb. Cortex 18, 652–663. Schultz, W., Dayan, P., Montague, P.R., 1997. A neural substrate of prediction and reward. Science 275, 1593–1599. Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations—uses in assessing rater reliability. Psychol. Bull. 86, 420–428. Specht, K., Willmes, K., Shah, N.J., Jancke, L., 2003. Assessment of reliability in functional imaging studies. J. Magn. Reson. Imaging 17, 463–471. Tobler, P.N., Fletcher, P.C., Bullmore, E.T., Schultz, W., 2007. Learning-related human brain activations reflecting individual finances. Neuron 54, 167–175. Vul, E., Harris, C., Winkielman, P., Pashler, H., 2009. Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspect. Psychol. Sci. 4, 274–290. Weiskopf, N., Hutton, C., Josephs, O., Deichmann, R., 2006. Optimal EPI parameters for reduction of susceptibility-induced BOLD sensitivity losses: a whole-brain analysis at 3 T and 1.5 T. NeuroImage 33, 493–504.