NeuroImage 110 (2015) 205–216
NeuroImage journal homepage: www.elsevier.com/locate/ynimg
Cortical delta activity reflects reward prediction error and related behavioral adjustments, but at different times
James F. Cavanagh ⁎
Department of Psychology, University of New Mexico, Logan Hall, 1 University of New Mexico, MSC03 2220, Albuquerque, NM 87131 USA
Article info
Article history: Accepted 2 February 2015; Available online 10 February 2015
Keywords: Prediction error; Delta; Reward positivity; Reinforcement learning; Decision making; Hierarchy
Abstract
Recent work has suggested that reward prediction errors elicit a positive voltage deflection in the scalp-recorded electroencephalogram (EEG), an event sometimes termed a reward positivity. However, a strong test of this proposed relationship remains to be defined. Other important questions also remain unaddressed, such as the role of the reward positivity in predicting future behavioral adjustments that maximize reward. To answer these questions, a three-armed bandit task was used to investigate the role of positive prediction errors during trial-by-trial exploration and task-set based exploitation. The feedback-locked reward positivity was characterized by delta band activities, and these related EEG features scaled with the degree of a computationally derived positive prediction error. However, these phenomena were also dissociated: the computational model predicted exploitative action selection and related response time speeding whereas the feedback-locked EEG features did not. Compellingly, delta band dynamics time-locked to the subsequent bandit (the P3) successfully predicted these behaviors. These bandit-locked findings included an enhanced parietal to motor cortex delta phase lag that correlated with the degree of response time speeding, suggesting a mechanistic role for delta band activities in motivating action selection. This dissociation in feedback-locked vs. bandit-locked EEG signals is interpreted as a differentiation in hierarchically distinct types of prediction error, yielding novel predictions about these dissociable delta band phenomena during reinforcement learning and decision making. © 2015 Elsevier Inc. All rights reserved.
Introduction Recent endeavors in the field of cognitive neuroscience have aimed to transcend our understanding of neural events from descriptive terminologies and psychological correlates toward a definition based on quantifiable constructs. The field of reinforcement learning has benefitted from this mechanistic approach, particularly our understanding of the electrophysiological signals that predict modulation of behavior following surprising punishments (the negative prediction error). In this report, the lesser known phenomena of reward-related electrophysiology and related behavioral adjustment were investigated with a similar mechanistic perspective. Motivated by earlier theoretical work (Holroyd and Coles, 2002), the electrophysiological representation of punishment prediction errors has been compellingly detailed (Cavanagh et al., 2010; Chase et al., 2010; Ichikawa et al., 2010; Philiastides et al., 2010). These studies and more recent specifically controlled investigations have revealed how frontal midline theta band activities reliably scale with the surprisal of punishment, yet can also be observed to surprising rewards (Cavanagh et al., 2012a; Talmi et al., 2013; Hauser et al., 2014). Thus frontal midline theta does not appear to reflect a negative prediction error per se, but
⁎ Fax: +1 505 277 1394. E-mail address:
[email protected].
http://dx.doi.org/10.1016/j.neuroimage.2015.02.007 1053-8119/© 2015 Elsevier Inc. All rights reserved.
rather reflects the surprise elicited from a system sensitive to the need for adjustment, a combined phenomenon that is often correlated with punishment avoidance (Cavanagh and Frank, 2014). Relatedly, meta-analyses reveal that frontal midline theta reliably predicts inhibited and avoidant reflexive behavioral adjustments (Cavanagh and Shackman, in press). While candidate substrates of positive prediction errors have been advanced (see the subsequent paragraph), a similarly sophisticated understanding of reward-related neuroelectric activities has yet to be detailed. Recent work has indicated how surprising rewards (positive prediction errors) reliably elicit a positive voltage deflection in the event-related potential (ERP), sometimes termed a reward positivity or Rew-P (Holroyd et al., 2008; Baker and Holroyd, 2011). While this observation has considerable empirical support, much of the existing research on this ERP component characterizes it as a difference wave of the reward minus punishment conditions (Holroyd et al., 2008; Baker and Holroyd, 2011; Walsh and Anderson, 2011; Kujawa et al., 2014; Lukie et al., 2014; Weinberg et al., 2014).1
[Footnote 1: This difference wave is oftentimes called a Feedback Related Negativity (FRN), which is unfortunately also a term sometimes used to describe the second negative deflection (i.e. the N2 component) in the punishment condition alone (Fig. 2A). Here, the reward condition is investigated without comparison to the punishment condition, so the term Rew-P is used. The lack of specificity in such ERP terms further motivates comparison based on spectral decomposition.]
Unfortunately,
this contrast confounds the accurate interpretation of this signal in dynamic tasks, as punishment prediction errors also modulate ERP activities across a similar temporo-spatial locale (Cavanagh et al., 2010; Chase et al., 2010; Ichikawa et al., 2010; Philiastides et al., 2010). Yet recent findings using spectral decomposition have indicated that a centro-posterior delta band phase dynamic underlies the Rew-P (Bernat et al., 2011; Cavanagh et al., 2014). The difference in dominant frequencies between punishment and reward (theta vs. delta) suggests one facet to leverage a deeper understanding of the information content and mechanistic role of the Rew-P. The current report utilized time–frequency analyses to decompose reward-related EEG into spectral quantities. Spectral analyses are necessary for a thorough understanding of the relationship between latent states and manifest EEG signals. For example, phase-varying vs. phase-consistent aspects of frontal midline theta have been proposed to relate to distinct aspects of unsigned (generic) surprise vs. signed negative prediction error (Hajihosseini and Holroyd, 2013), a hypothesis that would remain untestable with ERPs alone. The role of delta band phase in positive prediction errors remains unknown. In addition, there has been no formal test of single trial correlation of the Rew-P (or the constituent delta band dynamics) with the positive prediction error. Furthermore, it is not known if the Rew-P relates to subsequent behavioral adjustment that may be expected following a positive prediction error, such as win-stay tendencies and speeded response times (Frank et al., 2007; Cavanagh et al., 2010). Time–frequency analyses are well-suited to single-trial analyses, as they are not reliant on cross-trial averaging like ERPs (Cohen, 2011).
To test these explicit hypotheses, this study utilized a dynamic three-armed bandit reinforcement learning task where participants were required to adjust behavior based on rewarding feedback during both exploratory and exploitative phases. This task allows a sophisticated examination of the role of positive prediction errors during trial-bytrial exploratory reinforcement learning as well as rule-like exploitative decision making. Recent modeling work has demonstrated how a reinforcement learning actor can exploit the optimal stimulus–response pairing by creating a rule-based task-set, leading to exclusive action selection (Collins and Koechlin, 2012). It was expected that participants create similar latent task-sets in the three-armed bandit task (i.e. consistent selection of a rewarding bandit), particularly following high positive prediction errors during exploration. To test this hypothesis, competing computational models were contrasted on the manner by which positive prediction errors were utilized to hasten integrative reinforcement learning vs. enhance punctate task-set-like decision making. It was further hypothesized that the Rew-P, to the extent that it reflects the positive prediction error, should predict the adoption of these task-sets. The current study thus aimed to test the multi-dimensional electrophysiological dynamics of this reward-related phenomenon, particularly the predictive power of this signal on future behavioral adjustment. To preview the cliffhanger, while the feedback-locked Rew-P/delta scaled with positive prediction error, it did not predict behavioral adjustment, whereas the imperative bandit-locked P3/delta activity did. These findings are discussed in terms of hierarchical levels of prediction error. Methods Participants The experiment was approved by the University of New Mexico (UNM) Institutional Review Board and all participants provided written informed consent. 
Participants were N = 26 undergraduates (7 male) from UNM who received course credit for participation. The average age was 20 years old (SD = 4). All participants had normal or corrected-to-normal vision and no history of neurological, psychiatric, or any other relevant medical problems.
Task Each trial began with a crosshair (displayed for a duration selected from a uniform distribution of 800 to 1000 ms), followed by a display of three pictures of slot machines (“bandits”), Fig. 1A. On each trial, bandits were pseudo-randomly distributed to have an equal chance of appearing in the upper, left or right area. Participants had 1500 ms to select one bandit by pressing one of three joystick buttons with their right thumb. Following selection and a delay of 200 to 400 ms (random uniform distribution), feedback was provided based on a probabilistic schedule (green + 1 or red ~ for 500 ms duration). The probabilistic reward schedule for each bandit was created using a sine wave with a mean of 60%, a range from 20% to 100%, and a period of 120 trials. Bandit schedules were offset by 40 trials each, creating a harmonious correlated distribution of increasing and decreasing probabilities between bandits (Fig. 1B). Participants received instructions that they were to get as many points as possible by selecting the slot machine that was rewarding most often. They were warned of the probabilistic nature of rewards, that the machines may change how often they were rewarding, and that the position of the slot machine on the screen (upper, left, or right) was not related to reward. Following instructions and 8 practice trials, the task consisted of 480 trials with a self-paced break every 80 trials. The task took an average of 39 min (SD = 9).
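The reward schedule described above can be sketched in a few lines. This is a minimal illustration, not the task code used in the study: the function names are hypothetical, and the starting phase of each sine wave is an assumption, since the text specifies only the mean (60%), range (20% to 100%), period (120 trials) and offset (40 trials).

```python
import math

def reward_probability(trial, offset, mean=0.6, amplitude=0.4, period=120):
    """Reward probability on one trial for a bandit whose sine wave is
    shifted by `offset` trials (the starting phase is an assumption)."""
    return mean + amplitude * math.sin(2 * math.pi * (trial + offset) / period)

def build_schedules(n_trials=480, offsets=(0, 40, 80)):
    """One probability series per bandit, offset by 40 trials each."""
    return [[reward_probability(t, off) for t in range(n_trials)]
            for off in offsets]

schedules = build_schedules()
# With three sine waves offset by a third of a period, the probabilities
# average to the 60% mean on every trial, matching the chance level of
# 480 * .6 = 288 points reported in the Results.
mean_per_trial = [sum(s[t] for s in schedules) / 3 for t in range(480)]
```

Because the three schedules are shifted by exactly one third of the period, one bandit is always rising while another is falling, which is what forces participants to keep re-evaluating their choice.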
EEG recording and preprocessing
EEG was recorded continuously across .1 to 100 Hz with a sampling rate of 500 Hz and an online CPz reference on a 64 channel Brain Vision system. The vertical electrooculogram (VEOG) was recorded from bipolar auxiliary inputs. Data were epoched around the bandit onset (−2000 to 5000 ms), from which the associated feedbacks were isolated. CPz was re-created, and bad channels and bad epochs were identified using the FASTER algorithm (Nolan et al., 2010) and were subsequently interpolated and rejected, respectively. Eye blinks were removed following ICA (Delorme and Makeig, 2004). Data were then re-referenced to an average reference. Time–frequency measures were computed by multiplying the fast Fourier transformed (FFT) power spectrum of single trial EEG data with the FFT power spectrum of a set of complex Morlet wavelets (defined as a Gaussian-windowed complex sine wave: $e^{i 2 \pi t f} e^{-t^{2}/(2\sigma^{2})}$, where t is time, f is frequency (which increased from 1 to 50 Hz in 50 logarithmically spaced steps), and σ defines the width (or "cycles") of each frequency band, set according to 4/(2πf)), and taking the inverse FFT. The end result of this process is identical to time-domain signal convolution, and it resulted in estimates of instantaneous power (the magnitude of the analytic signal) and phase angle (the arctangent of the analytic signal). Each epoch was then cut in length (−500 to +1000 ms). Power was normalized by conversion to a decibel (dB) scale (10 ∗ log10(power_t / power_baseline)), allowing a direct comparison of effects across frequency bands. The baseline for each frequency consisted of the average power from −300 to −200 ms prior to the onset of the bandits. Inter trial phase clustering (ITPC) was quantified as the length of the average of unit-length vectors that were distributed according to their phase angles (Lachaux et al., 1999).
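The quantities defined in this section can be written out compactly. The following is a minimal sketch using only the Python standard library, not the Matlab pipeline used in the study. All function names are hypothetical, and the convolution step itself is omitted.

```python
import cmath
import math

def log_spaced_frequencies(f_min=1.0, f_max=50.0, n_steps=50):
    """Frequencies from f_min to f_max in logarithmically spaced steps."""
    ratio = f_max / f_min
    return [f_min * ratio ** (i / (n_steps - 1)) for i in range(n_steps)]

def morlet(t, f):
    """Complex Morlet wavelet at time t (s) and frequency f (Hz):
    a complex sine wave windowed by a Gaussian of width sigma = 4 / (2*pi*f)."""
    sigma = 4 / (2 * math.pi * f)
    return cmath.exp(1j * 2 * math.pi * t * f) * math.exp(-t ** 2 / (2 * sigma ** 2))

def to_decibel(power, baseline_power):
    """Baseline-normalized power: 10 * log10(power_t / power_baseline)."""
    return 10 * math.log10(power / baseline_power)

def itpc(phases):
    """Inter trial phase clustering: length of the mean of unit-length
    vectors placed at each trial's phase angle (radians)."""
    mean_vector = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(mean_vector)
```

For example, `itpc` returns a value near 1 when every trial has the same phase at a given time–frequency point, and a value near 0 when phases are spread evenly around the circle.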
ITPC quantifies the consistency of phase values for a given frequency band at each point in time, with values varying from 0 to 1 where 0 indicates random phases at that time–frequency point across trials, and 1 indicates identical phase values at that time–frequency point across trials. The time lag between CPz and surrounding sites was computed as a function of phase angles to provide a measure of presumed directional connectivity, where the lag in ms was computed as the average of delta band phase angle differences according to ms = 1000 ∗ (θtarget − θCPz) / (2πf) (Cohen, 2014, pg. 352). While this measure of phase lag
Fig. 1. Task and performance. A) Participants selected one of three bandit icons to get nominal rewards. B) The probability of reward for each bandit (red, green, blue) changed over time according to a deterministic schedule. C) Percent of optimal responses (left) and number of points earned (right); each measure is presented as a deviation from chance, which is at the bottom of the y-axis (33% and 288 points). Most participants performed the task quite well, with three clear outliers who did not (these were subsequently removed from analyses). D) Example selection of 30 trials of a participant's performance. It was hypothesized that participants tested hypotheses and formed rule-based task sets during performance. Task sets were defined by the selection of a bandit at least three times in a row following two rewards. The bandit onset following the first reward was hypothesized to reflect the occurrence of a candidate task set (shown here with dashed lines). To visualize the relative action values over time (see Computational modeling section), the difference between the maximum Q value and the two alternative Q values was calculated, revealing the onset of very deterministic value differences at the time of the candidate task sets (cyan and magenta lines). E) Participants had a median of 30 task sets and the median duration was 9.5 trials. F) An example of a task set. The median prediction error declined over the course of the task sets, and the average RT (+/− SEM) was faster on task sets than on the preceding exploratory trial, especially the RT of the candidate trial.
can be used as one piece of evidence for directional connectivity, it is limited in its estimation of connectivity (which is only implied if one site reliably lags the other in phase angle) and directionality (the “who's first” problem). Fortunately, in the delta band, the phase cycle is slow enough that a slight lag (i.e. site ‘B’ lags ‘A’ by 50 ms) can imply a plausible “who's first” A ➔ B gradient of 50 ms or a rather implausible B ➔ A gradient of 950 to 200 ms (across 1 to 4 Hz). ERPs were derived from these same epochs and were low pass filtered at 20 Hz. Topographical plots were displayed from the range of 300 to 500 ms for time–frequency transformed data (1 to 4 Hz), and from 250 to 350 ms for the Rew-P activities or 350 to 500 ms for bandit-locked P3 activities (each of which benefits from less temporal smearing than time–frequency measures). A Laplacian transform (Matlab function laplacian_perrinX.m from Cohen, 2014) was used to examine the topographical specificity of the Rew-P in subsequent discrete time ranges (250 to 350 ms in steps of 14 samples each).
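The phase lag conversion and the "who's first" reasoning above can be checked with a short sketch. This is an illustration under stated assumptions rather than the study's code. The function names are hypothetical, and averaging the phase differences as unit vectors (a circular mean) is an assumption about how wrap-around was handled.

```python
import cmath
import math

def phase_lag_ms(theta_target, theta_cpz, freq_hz):
    """Lag in ms from per-trial phase angles (radians) at one frequency:
    ms = 1000 * (theta_target - theta_CPz) / (2 * pi * f).
    Differences are averaged as unit vectors to respect circularity."""
    vectors = [cmath.exp(1j * (a - b)) for a, b in zip(theta_target, theta_cpz)]
    mean_difference = cmath.phase(sum(vectors) / len(vectors))
    return 1000 * mean_difference / (2 * math.pi * freq_hz)

def alternative_lag_ms(lag_ms, freq_hz):
    """Lag implied by the opposite direction: one full cycle minus the lag."""
    return 1000 / freq_hz - lag_ms

# A 50 ms lag implies an opposite-direction gradient of 950 ms at 1 Hz
# and 200 ms at 4 Hz, the range quoted in the text.
slow = alternative_lag_ms(50, 1)
fast = alternative_lag_ms(50, 4)
```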
EEG correlations with positive prediction error and RT
Regressors of interest included positive prediction error (derivation is described below) and RT change between trials. Correlations with positive prediction error by definition did not include all rewarding events, only those characterized by a degree of surprise in the rewarding outcome. Spearman's ρ correlations were performed between regressors and each time point in the raw EEG as well as each time–frequency point of spectral power. While correlations can be used to investigate such linear relationships, the relationship between regressors and phase consistency cannot be assessed with linear correlations as these data are circularly distributed. Using methods similar to those for phase-amplitude coupling (Canolty et al., 2006), the single trial influence of regressors on phase consistency can be investigated by taking each regressor–phase pair as a vector in complex space with the phase as the angle and the regressor as the modulus, as detailed in Cohen and Cavanagh (2011). The length of the average of regressor-length vectors
that were distributed according to their phase angles thus reflects the modulation of phase angle by the regressor, such that any relationship would indicate that phase consistency changes as a function of the regressor. To validate this assumption, ITPC was computed on a trinary split of positive prediction error and a binary split of RT change (as there were fewer trials in this latter comparison). All ITPC comparisons were matched for trial count. To account for the potential influence of a non-even distribution of phase values across trials (i.e. due to feedback-locked phase reset), the phase modulation was computed for 1000 sets of permuted regressor–phase pairings at each time–frequency point. These distributions were used to normalize the empirical magnitude of phase–regressor modulation (by computing the difference between the empirical and permuted means normalized by the standard deviation across these permuted distributions). This procedure created a modulation index (MI), identical to a z-scored difference from the permutation-tested null hypothesis (Canolty et al., 2006; Cohen and Cavanagh, 2011; Cohen, 2014).

Statistical methods
As the EEG feature associated with positive prediction error is well known as the Rew-P with constituent delta band power and ITPC, only these activities were of interest here. Thus, data were not corrected for multiple comparisons, as most of the spatio-temporo-frequency activities that are detailed in time–frequency and topographic plots were not of interest. Plots of statistical tests were thus thresholded at p < .05. While increased beta band power has also been observed following positive prediction errors (Cohen et al., 2007; Marco-Pallares et al., 2008; HajiHosseini et al., 2012), it is likely that this reflects modulation of motor commands (e.g. "preserving the status quo": Engel and Fries, 2010). Given that this 3-armed bandit task has varying motor strategies in exploratory vs.
exploitative states, beta band activities were expected to have a more complex relationship with positive prediction error and behavior than previously observed. Punishment prediction errors were not investigated in this experiment, as there were relatively few of them (~20%) which confounded the processing of punishment information with oddball-like novelty effects (see the P3b in Fig. 2A). There were an average of 373 (SD = 16) rewarding trials per participant, yet only an average of 207 (SD = 102) rewarding trials with positive prediction errors (i.e. a degree of surprise in the outcome). As described in the Introduction, and detailed later in the Discussion, exploitative periods were defined by the presence of task-sets, and exploratory periods were defined as the remaining trials. Task sets were defined when there were three consecutive selections of the same bandit with positive prediction errors on the first two choices (see Fig. 1D). Importantly, the task set was defined to start on the second of these events, as the first trial usually involved switching to a new bandit and was considered exploratory. The dissolution of a task set was defined as the selection of an alternative bandit. For the analysis of the first trial of exploitation, the candidate task set bandit onset was compared to the immediately preceding exploratory trial, which matched these events both in terms of event count and experimental timing. As described below, there were an average of 29 (median = 30, SD = 7) of these conditions for EEG analyses. During exploration, the occurrence of win-switch behaviors was rare; only N = 20 participants had some occurrences of these events. A minimum of 15 trials in each condition were determined to allow accurate signal-to-noise, yielding N = 15 participants for these analyses (win-stay trials M = 32, SD = 15, win-switch trials: M = 28, SD = 17). 
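The task-set definition above can be expressed as a simple scan over the choice sequence. This is a minimal sketch of one reading of that definition, with a hypothetical function name and input format: a list of chosen bandit indices, and a parallel list of flags marking trials with a positive prediction error.

```python
def find_task_sets(choices, positive_pe):
    """Return (start, end) trial indices of task sets.

    A task set is detected when the same bandit is chosen three times in a
    row with positive prediction errors on the first two choices. It starts
    on the second of those trials (the candidate trial) and ends on the last
    consecutive selection of that bandit."""
    task_sets = []
    i = 0
    n = len(choices)
    while i + 2 < n:
        same_bandit = choices[i] == choices[i + 1] == choices[i + 2]
        if same_bandit and positive_pe[i] and positive_pe[i + 1]:
            start = i + 1
            end = start
            # Extend until an alternative bandit is selected (dissolution).
            while end + 1 < n and choices[end + 1] == choices[start]:
                end += 1
            task_sets.append((start, end))
            i = end + 1
        else:
            i += 1
    return task_sets

example = find_task_sets(
    [0, 1, 1, 1, 1, 2, 2, 2, 0],
    [False, True, True, False, True, False, True, True, True],
)
```

In this example, the only task set spans trials 2 to 4: the run of bandit 1 begins on trial 1, so the candidate trial is trial 2, while the later run of bandit 2 does not qualify because its first choice lacked a positive prediction error.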
To equate for different trial counts between exploratory win-stay and win-switch occurrences, these conditions were matched within participants by randomly selecting events from the larger set (usually win-stay). Metrics were then calculated for each condition using the same number of trials. This process was repeated 1000 times and the mean of the distribution of the sampled condition was taken as the
Fig. 2. Reward and punishment related ERPs +/− SEM. A) The mid-frontal electrode FCz shows the well-known N2/FRN component to punishment, whereas the reward positivity (Rew-P) was more robust over posterior areas. Punishment was also characterized by a novelty parietal P3b component, as it was a rare occurrence (only on about 20% of trials). B) Topographic plot of reward trials from 250–350 ms, showing the centro-posterior distribution of the Rew-P. This topoplot used an ERP baseline of 0 ms to obviate the influence of the large negativity before the onset of the feedback over posterior sites. Black dots indicate electrode positions in A. C) Laplacian transform of topography in (B) evolving over time, demonstrating a clear centro-parietal distribution of the Rew-P.
most accurate representation of that condition under the constraint of similar sample size (this procedure was especially important for ITPC).

Computational modeling
In all models, state-action values were estimated for each bandit and a softmax choice function was used to predict the most likely action on each trial. State-action values (Q values) were updated according to the delta learning rule with a learning rate (α) scaling the prediction error (δ):

$$Q_t = Q_{t-1} + \alpha(\delta), \tag{1}$$
where prediction errors were calculated as the difference between reinforcements (r) and Q values:

$$\delta = r - Q, \tag{2}$$

and reinforcements were from a set of 0,1:

$$r \in \{0, 1\}. \tag{3}$$
The probability of action selection was predicted using a softmax logistic function with a free parameter for gain adjustment to select the highest value option (β, also termed behavioral consistency or inverse temperature):

$$p(Q_{\text{selected}}) = \exp(\beta\, Q_{\text{selected}}) \Big/ \sum_{\text{all}} \exp(\beta\, Q_{\text{all}}). \tag{4}$$
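Eqs. (1) to (4) together define the simplest learner. The sketch below shows how they combine into a log likelihood for one participant's choice sequence. It is an illustration, not the fitting code used in the study. The function names are hypothetical, and initializing all Q values to zero is an assumption, since the starting values are not stated here.

```python
import math

def softmax_probs(q_values, beta):
    """Eq. (4): choice probabilities from Q values with gain beta.
    Subtracting the maximum is only for numerical stability."""
    q_max = max(q_values)
    exps = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(choices, rewards, alpha, beta, n_bandits=3, q_init=0.0):
    """Log likelihood of a choice sequence plus trial-wise prediction errors.
    q_init is an assumed starting value, not taken from the text."""
    q = [q_init] * n_bandits
    lle = 0.0
    prediction_errors = []
    for choice, reward in zip(choices, rewards):
        lle += math.log(softmax_probs(q, beta)[choice])
        delta = reward - q[choice]          # Eq. (2), with r in {0, 1} (Eq. (3))
        q[choice] += alpha * delta          # Eq. (1)
        prediction_errors.append(delta)
    return lle, prediction_errors

lle, pes = log_likelihood([0, 0], [1, 1], alpha=0.5, beta=4.0)
```

Here the second reward produces a smaller prediction error than the first, because the Q value of the chosen bandit has already moved halfway toward 1.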
The computationally derived prediction error (Eq. (2)) was used as a single trial regressor in EEG analyses. However, the estimation of prediction errors may vary depending on dynamic task qualities captured by free parameters. A number of competing models were formally compared and prediction errors from the best-fitting model were used as regressors for EEG analyses. The probabilities of action selection (Eq. (4)) were used to compute the log likelihood estimate (LLE) of the participant having chosen that set of responses for a given set of parameters. The parameters that produced the maximum LLE were found using the Nelder–Mead simplex method, a standard hill-climbing search algorithm (implemented with Matlab function fminsearch.m). All models used the best fitting outcome of 10 different starting points (using Matlab function rmsearch.m). The first model (M1: Vanilla) included free parameters for learning rate and gain. A subsequent model (M2: Decay) included a free decay parameter (γ) for diminishing the Q values of non-selected trials following reward:

$$Q_{\text{non-selected}} = \begin{cases} (1-\gamma)\, Q_{\text{non-selected}} & \text{if reward} \\ Q_{\text{non-selected}} & \text{if punishment} \end{cases} \tag{5}$$
A third model (M3: Decay_All) included a similar free decay parameter (γ) for diminishing the Q values of non-selected trials following any feedback (reward or punishment). As detailed in the Results, this model was motivated by the expected (and confirmed) probability of participants sometimes adopting a "lose-stay" strategy to weather temporary disappointments in the pursuit of longer-term gains.

Computational modeling: modulation of parameters by prediction error
Three additional Boost models utilized the prediction error on each trial to modulate each parameter described above. Prediction errors contain important information about the volatility of the world, and thus may be used to scale the effective learning rate (Pearce and Hall, 1980; Krugel et al., 2009). Alternatively, a 'softened' beta parameter has oftentimes been used to model exploration (Daw et al., 2006), suggesting that prediction error may not solely influence learning, but instead could modulate action selection. These ideas were contrasted here with the comparison between M5 and M6 (prediction error modulation of alpha vs. beta) based on the aforementioned theoretical justification, whereas M4 (prediction error modulation of decay) was included solely for consistency. Model four (M4: Boost_Decay) adjusted the decay rate on a trial-by-trial level by fitting an intercept (γ0) and a weight (γ1) for the prediction error on that trial. The resultant γboost parameter was then used as γ in M3:

$$\gamma_{\text{boost}} = 1 / \left(1 + \exp(\gamma_0 + \gamma_1 \delta)\right). \tag{6}$$
Model five (M5: Boost_Learning) adjusted the learning rate on a trial-by-trial level by fitting an intercept (α0) and a weight (α1) for the prediction error on that trial. The resultant αboost parameter was then used as α in M3:

$$\alpha_{\text{boost}} = 1 / \left(1 + \exp(\alpha_0 + \alpha_1 \delta)\right). \tag{7}$$
Model six (M6: Boost_Gain) adjusted the gain on a trial-by-trial level by fitting an intercept (β0) and a weight (β1) following the prediction error (i.e. this gain adjustment influenced subsequent action selection). The resultant βboost parameter was then used as β in M3:

$$\beta_{\text{boost}} = \exp(-\beta_0 + \beta_1 \delta). \tag{8}$$
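The decay rule (Eq. (5)) and the three prediction-error-dependent transforms (Eqs. (6) to (8)) are each a one-line function. The sketch below uses hypothetical names and is meant only to make the functional forms concrete. Note that the logistic form keeps boosted rates between 0 and 1, and the exponential form keeps the boosted gain above zero.

```python
import math

def decay_unchosen(q_values, chosen, gamma, rewarded, decay_on_any_feedback=False):
    """Eq. (5): shrink Q values of non-selected bandits by (1 - gamma).
    M2 decays only after reward; M3 decays after any feedback."""
    if not (rewarded or decay_on_any_feedback):
        return list(q_values)
    return [q if i == chosen else (1 - gamma) * q
            for i, q in enumerate(q_values)]

def boost_rate(intercept, weight, delta):
    """Eqs. (6) and (7): trial-wise decay or learning rate, bounded in (0, 1)."""
    return 1 / (1 + math.exp(intercept + weight * delta))

def boost_gain(beta_0, beta_1, delta):
    """Eq. (8): trial-wise softmax gain, always above zero."""
    return math.exp(-beta_0 + beta_1 * delta)
```

For instance, `decay_unchosen([0.8, 0.4, 0.6], chosen=0, gamma=0.5, rewarded=True)` leaves the chosen value at 0.8 and halves the other two.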
Following convention, all learning and decay rates were constrained to remain between 0 and 1, and all softmax gain parameters were constrained to be above zero (Daw, 2011). Characterization of model fits were computed as pseudo-R2 statistics: (LLE-chance)/chance (Camerer and Ho, 1999). For model comparison, more complex models were penalized by computing the Akaike information criterion (AIC) for each subject. Exceedance probabilities were also reported, reflecting the likelihood that each model is the best of all candidate models given the distribution of AIC values across all subjects (Stephan et al., 2009). Results Performance Participants selected the optimal bandit on 64.6% of trials, earning an average of 90 points (positive prediction errors) above chance (which was 480 ∗ .6 = 288 points). However three participants were clear outliers, with two of them at or below chance and a third with only an 8% increase above chance at selecting the optimal bandit (but who still earned fewer points than one chance-level participant), Fig. 1C. These three participants were removed from all subsequent analyses. The remaining N = 23 participants had an average probability of selecting the optimal stimulus 68.4% of the time, they earned an average of 100 points above chance, and they had an average RT of 527 ms (SD = 52 ms). Participants did not follow a common win-stay/lose-switch performance pattern, rather they selected the same bandit following reward at a high probability (“win-stay”: M = 85%, SD = 7%) and they tended to also select the same bandit even after it was not rewarded (“lose-stay”: M = 63%, SD = 12%). This latter finding suggests a complex action selection strategy not immediately discernable from behavior. As such, computational modeling was used to compare competing learning and decision making strategies on this task. 
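The fit metrics named above can be sketched as follows. This is an illustration under stated assumptions, with hypothetical function names. Chance is taken to be the log likelihood of uniform guessing among three bandits. The pseudo-R2 divides by the magnitude of the chance log likelihood so that better-than-chance fits come out positive, which is an assumption about the sign convention behind the (LLE-chance)/chance expression. AIC uses its standard definition.

```python
import math

def chance_lle(n_trials, n_options=3):
    """Log likelihood of uniform random guessing among n_options."""
    return n_trials * math.log(1 / n_options)

def pseudo_r2(lle, chance):
    """(LLE - chance) / |chance|: 0 at chance, 1 for a perfect fit.
    Using the magnitude in the denominator is an assumed sign convention."""
    return (lle - chance) / abs(chance)

def aic(lle, n_parameters):
    """Akaike information criterion: 2k - 2 * LLE (lower is better)."""
    return 2 * n_parameters - 2 * lle

chance = chance_lle(480)
```

With these definitions, adding a parameter is only worthwhile under AIC if it raises the log likelihood by more than one unit.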
Each successive model (M1 to M6) fit the data better than the preceding model as determined by AIC, pseudo-R2, and sequential pairwise exceedance probabilities (Table 1). The best fitting model suggested that participants performed this task by strongly favoring a single bandit, especially immediately following increasingly positive prediction errors (i.e. creating a task-set). Trial-by-trial action values (Q values from Eq. (1)) and prediction errors (δ from Eq. (2)) were derived for each participant using their best fitting parameters in M6. To visualize the trajectory of evolving valuation, a single contrast was then derived for each participant as the difference between the maximum Q value and the average of the other two alternate Q values on each trial (Fig. 1D). Participants had a median of 30 (SD = 7) unique task sets, and the average of the median duration of task sets was 9.5 trials (SD = 5.6) (Fig. 1E). Interestingly, task sets were not always dissolved on the presentation of punishing feedback; in fact an average of 34% (SD = 20%) of task sets were dissolved following reward. To provide convergent evidence that these performance patterns represent reward-influenced rule creation, RTs were examined on the first four epochs of task sets (Fig. 1F). There was a significant quadratic interaction (F(1,21) = 26.10, p < .001) which was driven by significant pairwise t-tests between a slower first (exploratory) epoch compared to the others (1st & 2nd: t(22) = 5.48, p = 1.67e−05; 1st & 3rd: t(22) =
4.95, p = 5.98e−05; 1st & 4th: t(22) = 3.83, p = 9.74e−04), as well as a linear effect of increasing RTs throughout the task set (F(1,21) = 5.24, p = .03). This pattern suggests two additional features of task sets in this experiment: 1) task sets were characterized by faster RTs than exploratory responses, and 2) the first RT following the initial very high positive prediction error (i.e. the candidate task set trial) was fastest and thereafter rapidly regressed toward the mean.

Table 1 Mean (SD) model parameters and fit metrics. The intercept and weight of each Boost model are shown here, as well as the value of the parameter when prediction error was +.5. EP = exceedance probability that the model was a better fit across the entire group than the preceding model (i.e. M6 was a better fit than M5 with 99% probability).

Model  α          β            γ          Intercept     Weight         Param @ PE = +.5  AIC     Pseudo-R2  EP
M1     .78 (.21)  4.04 (1.08)  –          –             –              –                 10,141  .56        –
M2     .75 (.22)  3.97 (1.0)   .53 (.37)  –             –              –                 8850    .61        1
M3     .75 (.22)  4.17 (1.0)   .47 (.33)  –             –              –                 8661    .62        1
M4     .75 (.23)  4.20 (1.0)   –          56.45 (201)   −77.93 (205)   .40 (.42)         8600    .62        1
M5     –          4.25 (.91)   .52 (.34)  −11 (25.21)   −2.27 (25.71)  .77 (.23)         8575    .63        .77
M6     .69 (.29)  –            .52 (.33)  −1.42 (.30)   .41 (7.82)     3.95 (3.66)       8517    .63        .99

EEG: positive prediction error
The candidate EEG signature of a positive prediction error has been described as a reward positivity (Rew-P), see Fig. 2. As an initial step, the relationship between the EEG features underlying the Rew-P and positive prediction errors was defined. Subsequently, constituent neuroelectric activities underlying the translation of a positive prediction error to behavioral adjustment were examined. Fig. 3A shows the ERP to reward and the temporal and spatial correlations between the single trial EEG and positive prediction error. A significant enhancement of EEG amplitude from about 200 to 500 ms over central midline sites was observed. Spectral decomposition of these events revealed enhanced delta band power (Fig. 3B). The single trial correlation between power and positive prediction error revealed a similar low frequency effect over central sites (Fig. 3C). The Rew-P was also characterized by enhanced ITPC (Fig. 3D) over the same midline posterior sites. The single trial weighted phase modulation revealed a dependence of ITPC on the degree of positive prediction error (Fig. 3E). The descriptive trinary split of ITPC by prediction error revealed a significant linear trend where ITPC increased with the degree of prediction error (Fig. 3F, F(1,22) = 21.13, p = 1.4e−4). While the Rew-P and constituent delta band dynamics were characterized by a centro-posterior topography (Fig. 2B, Figs. 3B,D topoplots), each of these signals reliably scaled with single trial estimates of positive prediction error over more central areas (Figs. 3A,C,E topoplots).

Exploration
Next, the predictive power of these neuroelectric signals was tested. Since the task-set (exploitation) phase of the experiment did not appear to be strongly dependent on trial-to-trial learning and adaptation, only positive prediction errors during the exploratory phase were investigated here. It was hypothesized that if positive prediction errors predict behavioral adaptation, then the Rew-P and associated delta dynamics should show the same relationship. As shown in Fig. 4A, the computationally derived positive prediction error was larger prior to staying on the same bandit as opposed to switching to a new one (t(19) = 2.32, p = .03). Similar to the RT speeding shown in Fig. 1F, more positive prediction errors predicted the hastening of RT when staying (t(19) = 5.62, p = 2.01e−05), but not switching (t(19) = 1.18, p = .25). In contrast to these findings, neither the Rew-P nor delta band dynamics at CPz were predictive of subsequent behavioral adaptation. Rew-P amplitude did not differ prior to staying vs. switching (Fig. 4B; t < 1), nor did delta power (Fig. 4C; t(13) < 1) or ITPC (Fig. 4D; t(13) = 1.41, p = .18). These neuroelectric phenomena did not predict RT change either (ts < 1.35, ps > .20). There were no effects when these EEG data were contrasted between the full N = 20 who had
win-switch occurrences or when limited to N = 15 participants with at least 15 win-switch trials (shown in Fig. 4 and tested above), nor when tested in a later window (500 to 700 ms), nor when tested in the beta band. Examination of topographic plots for ERP, power and ITPC revealed only minor sporadic differentiation for stay vs. switch, with no interpretable differences except selectively increased delta band ITPC over motor cortex. This motor ITPC finding is only mentioned for possible benefit of future targeted investigations, as the density of topographic comparisons increased the likelihood of a Type I error. In summary, feedback-locked centro-parietal delta band activities faithfully reflect positive prediction error, but they do not reliably predict related behavioral adjustments.

Exploitation

The hypothesis that feedback-locked centro-parietal delta activities should reflect both positive prediction error and behavioral adjustment was partially motivated by similar findings in the literature of punishment-related theta band activities (Cavanagh and Shackman, in press). While the hypothesis that the Rew-P delta band activities may predict behavioral adjustment was unsupported for feedback-locked delta, here it was investigated whether bandit-locked delta during the P3-like response showed these expected patterns. EEG features to the bandit onset of the candidate trial at the beginning of a task set were selected to test this hypothesis. This candidate trial bandit onset was compared to the immediately preceding exploratory trial bandit onset, which matched these events both in terms of event count and experimental timing. The ERP to the candidate task set bandit onset displayed an enhanced P3-like positivity from ~350 to 500 ms (Fig. 5A), which was larger and statistically significant at the immediately posterior Pz electrode (t(22) = 2.39, p = .03).
Delta power and ITPC were enhanced over a similar centro-parietal area extending toward the left motor cortex (corresponding to the right button required for bandit selection) from 300 to 500 ms (with the curious subtle exception of CPz-specific delta power, Figs. 5B–C). Motivated by these robust phase dynamics, two additional analyses were applied to examine the possible functional operations of delta phase. Fig. 5D shows the weighted phase modulation of RT change, where faster RTs were associated with enhanced phase modulation as well as increased ITPC in a broad area encompassing centro-parietal and left motor cortex (descriptive offsets of binary split ITPC: CPz: t(22) = 2.15, p = .04 from 300 to 600 ms and 1.5 to 3 Hz; C3: t(22) = 2.00, p = .05 from 0 to 500 ms and 1.5 to 3 Hz). Fig. 6A shows the statistically significant difference in phase lag from the CPz electrode (in ms) between conditions. Only around left motor cortex was there a significantly different phase lag, with faster CPz-motor phase lag to the bandit associated with a candidate task set (average of 47 ms faster over motor cortex). This average time duration nearly perfectly accounts for the average 49 ms RT difference between stay and switch trials (Fig. 6B, previously shown in Fig. 1F). Across participants, the average single trial correlation between CPz-C3 phase lag and RT change was significantly different from zero on this candidate trial (Fig. 6C, t(22) = 2.63, p = .015), indicating that faster parietal–motor delta band phase transmission was intimately linked with the degree of RT speeding.
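The phase lag reported above is expressed in milliseconds. This section does not spell out how the conversion was done, so the following is only a minimal sketch of one common approach, assuming per-trial phase angles are already available at a single delta frequency: the circular mean of the phase differences between two channels is divided by the angular frequency. The function name `phase_lag_ms`, the 2 Hz frequency, and the simulated data are illustrative assumptions, not the author's pipeline; the 47 ms value is borrowed from the text purely as a target for the simulation.

```python
import numpy as np

def phase_lag_ms(phase_a, phase_b, freq_hz):
    """Mean phase lead of channel a over channel b, in milliseconds.

    phase_a, phase_b: per-trial phase angles in radians (hypothetical inputs).
    freq_hz: frequency at which the phases were extracted.
    """
    # Circular mean of the per-trial phase differences.
    mean_diff = np.angle(np.mean(np.exp(1j * (phase_a - phase_b))))
    # One full cycle (2*pi radians) lasts 1000 / freq_hz milliseconds.
    return mean_diff / (2 * np.pi * freq_hz) * 1000.0

# Simulated example: channel "C3" trails channel "CPz" by 47 ms at 2 Hz.
rng = np.random.default_rng(0)
freq = 2.0                                   # Hz, assumed delta frequency
true_lag_ms = 47.0                           # illustrative value from the text
true_shift = 2 * np.pi * freq * true_lag_ms / 1000.0
phase_cpz = rng.uniform(-np.pi, np.pi, size=200)
phase_c3 = phase_cpz - true_shift + rng.normal(0.0, 0.2, size=200)

lag = phase_lag_ms(phase_cpz, phase_c3, freq)
print(f"Recovered lag: {lag:.1f} ms")
```

Note that a phase difference is only unambiguous within one cycle, so at 2 Hz this conversion can represent lags of at most 250 ms in either direction.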
J.F. Cavanagh / NeuroImage 110 (2015) 205–216
Fig. 3. EEG correlations with positive prediction error (+PE). All ERP and time–frequency panels are from the CPz electrode (black dot on topographic maps). Topomaps are 1–4 Hz: 250–350 ms for ERP, 300–500 ms for the others. A) Rew-P as in Fig. 2A, as well as single trial correlation with +PE revealing significant correlations between 200 and 500 ms (red bar), most notable over central areas. B) Power (dB) of the Rew-P. C) Single trial power correlations with +PE revealed a direct relationship with central delta as well as an inverse relationship with beta power. D) Phase consistency (ITPC) of the Rew-P. E) Single trial weighted phase modulation by +PE, demonstrating a similar direct relationship with central delta band phase. F) Trinary split of +PE revealing greater ITPC as +PE is larger. Bars are mean +/− SEM. **p < .01.
Collectively, these findings suggest that bandit-locked parietal delta activities are mechanistically involved in facilitating action selection during the application of a rule-like strategy. Similarities and differences in centro-parietal delta band activities to feedback and bandit onsets are discussed below.
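The phase measures used throughout these results (ITPC, the weighted phase modulation, and the descriptive split of ITPC by a trial-wise variable) can be illustrated with a short sketch. The weighted variant below follows one common formulation in which each trial's unit phase vector is scaled by a trial-wise weight before averaging and the result is compared against a permutation null; whether this matches the exact computation used here is an assumption. All data are simulated, and the names `itpc`, `weighted_itpc`, and `pe` are hypothetical.

```python
import numpy as np

def itpc(phases):
    """Inter-trial phase clustering: length of the mean unit phase vector (0 to 1)."""
    return np.abs(np.mean(np.exp(1j * phases)))

def weighted_itpc(phases, weights):
    """Phase clustering with each trial's vector scaled by a trial-wise weight."""
    return np.abs(np.mean(weights * np.exp(1j * phases)))

# Simulated trials: phase is more tightly clustered when the (hypothetical)
# positive prediction error is larger.
rng = np.random.default_rng(1)
n_trials = 300
pe = rng.uniform(0.0, 1.0, n_trials)
phases = rng.normal(0.0, 1.5 - 1.2 * pe)     # phase noise SD shrinks as PE grows

# Descriptive trinary split: ITPC within low, middle, and high PE terciles.
order = np.argsort(pe)
terciles = np.array_split(order, 3)
itpc_by_tercile = [itpc(phases[idx]) for idx in terciles]

# Weighted ITPC compared against a null built by shuffling the weights.
observed = weighted_itpc(phases, pe)
null = np.array([weighted_itpc(phases, rng.permutation(pe)) for _ in range(200)])
z_score = (observed - null.mean()) / null.std()

print("ITPC by PE tercile:", np.round(itpc_by_tercile, 2))
print("Weighted ITPC z-score vs. shuffled weights:", round(float(z_score), 2))
```

The permutation step matters because the raw weighted value depends on the scale of the weights; only its position relative to the shuffled distribution is interpretable.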
Discussion

As suggested by the title, centro-parietal delta band activities correlated with positive prediction error (after feedback), yet they were only related to behavioral adjustment following the subsequent imperative
Bossaerts, 2011), fewer have investigated the spontaneous acquisition of a new exploitative rule (cf. Collins and Frank, 2013). Model comparisons suggested that positive prediction errors increased behavioral consistency above and beyond any contribution to slower integrative processes that would have been captured by the learning rate parameter (Table 1), and this effect was indeed observed in single trial correlations during the exploration phase (Fig. 4A). This ability to effectively model the task dynamics bolsters the interpretation of the functional role of positive prediction errors. Although model-free reinforcement learning could account for task performance due to the implicit counterfactual schedule of reversal used here (Doll et al., 2012) (i.e. when one bandit is better the others are worse), the rapidity of exploitative action selection in the current task suggests some internal representation of structure. While such imputed structure is sometimes termed model-based learning (Daw et al., 2011), an options model (Botvinick et al., 2009), or a rule (Bunge, 2004), here this phenomenon is described as a task-set (Botvinick et al., 2009; Collins and Koechlin, 2012; Collins and Frank, 2013). A task-set can be derived by a reinforcement learning actor, leading to exclusive action selection (Collins and Koechlin, 2012). Moreover, this depiction of a task-set differentiates the phenomenon observed here from the context of sequential behavioral state transitions associated with formal estimates of model-based (Daw et al., 2011) or hierarchical structure during learning (Ribas-Fernandes et al., 2011; Collins and Frank, 2013).

Different types of prediction error
Fig. 4. During exploration, positive prediction error predicted subsequent behavioral adaptation, but EEG features did not. A) Positive prediction errors were larger prior to staying on the rewarding bandit and they predicted the degree of RT speeding when staying. B) Rew-P ERPs (+/−SEM) did not differ prior to staying vs. switching. C) Feedback-locked delta power did not differ prior to stay vs. switch, nor did it predict RT change. D) Feedback-locked delta ITPC did not differ prior to stay vs. switch, nor was it modulated by RT change.
cue (bandit). It is intriguing that the feedback-locked activities did not predict subsequent behavioral adaptation like the computationally derived positive prediction error did. In contrast, bandit-locked delta band activities, particularly phase-based, were intimately related to immediate behavioral adjustment: 1) they were enhanced on the candidate task set trial, 2) they scaled with the degree of RT speeding, 3) they predicted a boost in parietal–motor cortex connectivity, and 4) this connectivity scaled with RT speeding. Below, it is argued that these dissociations between surprise and control might be accounted for by hierarchically differing surprise signals, motivating novel predictions for future hypothesis testing.

Building structure to exploit

In this report, aspects of reinforcement learning and decision making were integrated using a common task. While many prior studies of n-armed bandits have investigated determinants of exploration (Daw et al., 2006; Jepma and Nieuwenhuis, 2011; Payzan-LeNestour and
Recent work on hierarchical structure has capitalized on the understanding of how prediction errors can differ between simultaneously operating yet segregated systems (den Ouden et al., 2012; O'Reilly, 2013). For example, model-free and model-based systems can compete for behavioral control (Gläscher et al., 2010; Daw et al., 2011; Doll et al., 2012). A model-free learner is characterized by simple trial-and-error learning, which is slow but effective. A more complex model-based learner is characterized by forward prediction or creation of rules/task-sets, which can be rapid but risk being incorrect. Critically, each learner utilizes a different type of prediction error: reward prediction errors inform the model-free system, whereas pseudo-reward or state prediction errors inform the model-based system (Gläscher et al., 2010; Daw et al., 2011; Ribas-Fernandes et al., 2011). Without task-based sequential behavioral adjustments that forcibly segregate the update of these systems (Gläscher et al., 2010; Daw et al., 2011; Ribas-Fernandes et al., 2011), it is difficult to model the update of each system and interpret related neural signals in the context of either system. Thus, while a latent hierarchical structure may be the critical feature underlying the observed dissociation in posterior delta signals, the flat structure of the bandit task precludes conclusive identification of model-free or model-based updating. Nevertheless, through the combination of confirmed and null findings reported here it is suggested that the Rew-P (and constituent delta band activities) reflects a reward prediction error and the P3 (and constituent delta band activities) reflects a state prediction error, at least in this experiment.
These interpretations are consistent with the previous characterization of the Rew-P as a reward prediction error (Holroyd et al., 2008), description of maintained Rew-P yet altered behavior in the context of a task rule (Walsh and Anderson, 2011), and prior work suggesting that P3 activities may indicate a state prediction error (Gläscher et al., 2010). This interpretation predicts a similar dissociation in a hierarchically structured task. Given that such tasks tend to utilize probabilistic reinforcements for reward prediction errors (events commonly termed “feedback” in the ERP literature) and probabilistic stimuli that in turn predict reward for state prediction errors (events commonly termed imperative “cues” or “stimuli” in the ERP literature), it would be very surprising if these did not relate to Rew-P and P3-like activities, respectively.
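The distinction drawn above between reward and state prediction errors can be made concrete with a small sketch. It assumes a generic delta-rule value update for the model-free learner and a transition-probability update of the kind used in the model-based literature cited above, where the state prediction error is one minus the estimated probability of the transition that actually occurred. This is not the model fitted in this report (its parameters are in Table 1), and all function names, learning rates, and array shapes are illustrative.

```python
import numpy as np

def update_q(q, action, reward, alpha):
    """Model-free update: the reward prediction error is reward minus expected value."""
    rpe = reward - q[action]
    q = q.copy()
    q[action] += alpha * rpe
    return q, rpe

def update_transitions(T, state, action, next_state, eta):
    """Model-based update: the state prediction error is 1 minus the estimated
    probability of the transition that was actually observed."""
    spe = 1.0 - T[state, action, next_state]
    T = T.copy()
    T[state, action, :] *= (1.0 - eta)       # shrink every transition estimate
    T[state, action, next_state] += eta      # net effect: observed one grows by eta * spe
    return T, spe

# Reward prediction errors shrink as a rewarded bandit becomes expected.
q = np.zeros(3)                              # three bandits, as in the task
q, rpe_first = update_q(q, action=0, reward=1.0, alpha=0.5)
q, rpe_second = update_q(q, action=0, reward=1.0, alpha=0.5)

# State prediction errors track surprise about which state follows an action.
T = np.full((2, 1, 2), 0.5)                  # 2 states, 1 action, 2 successor states
T, spe = update_transitions(T, state=0, action=0, next_state=1, eta=0.2)

print("RPEs:", rpe_first, rpe_second)
print("SPE:", spe, "updated transitions:", T[0, 0])
```

The sketch makes the asymmetry visible: the reward prediction error is signed and depends on outcome value, whereas the state prediction error is unsigned surprise about task structure and is defined even when no reward is delivered.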
Fig. 5. Bandit-locked activities on the first trial of task-set exploitation (the candidate trial) compared to the preceding trial of exploration. A) ERPs (+/−SEM) show an enhanced centro-parietal P3 following the onset of the bandit on the candidate trial. The topographic plot shows the difference between trials from 350 to 500 ms. B–C) Power and ITPC differences, showing delta band enhancement on the candidate trial over centro-parietal areas as well as motor cortex (C3 electrode). D) Difference in RT-weighted phase modulation, demonstrating greater coupling between bandit-locked centro-parietal and motor cortex delta phase as a function of faster RT on the first (candidate) trial of a task set. Offset bar graphs show a binary split of RT change, revealing greater ITPC as RT is faster. Bars are mean +/− SEM. *p < .05.
Fig. 6. Parietal–motor cortex phase lag and RT speeding on the candidate task set trials. A) Significant differences in phase lag from the CPz electrode (350 to 500 ms) between the candidate task set and the prior exploratory trial. The candidate trial was associated with a faster CPz ➔ C3 phase lag of an average of 47 ms. B) The enhanced phase lag was nearly equal to the RT difference between conditions (previously shown in Fig. 1F). C) The average single trial correlation (mean +/− SEM) between CPz-C3 phase lag and RT change was significantly different from zero, implicating faster parietal–motor delta phase-based transmission in the strategic hastening of RT to motivationally salient cues. *p < .05.
Psychometrics of reward and the Rew-P

The nature of the EEG signals described here deserves some discussion. In this investigation, the Rew-P was somewhat smaller than the punishment-related ERP occurring at the same time (Fig. 2A). As described in the Methods section, this is likely due to the rare nature of punishment (~20%), which confounded the punishment ERP with oddball-like novelty effects. Many of the previous reports on the Rew-P cited in the Introduction show dissociation between reward and punishment conditions with a clearly enhanced positivity in this time range. The slow and punctate nature of the Rew-P and associated spectral activities suggests that this reward-related phenomenon may not be due to phase reset of an obligatory exogenously-invoked oscillation, such as has been postulated to underlie the N2/FRN and initial P3 at FCz (Luu et al., 2004; Trujillo and Allen, 2007; Cavanagh et al., 2012b), but rather may reflect an instantiated burst event (similar to that proposed by Holroyd et al., 2008). Note that a reliably timed burst can yield ITPC even in the absence of oscillations, but any absence or presence of oscillatory dynamics is difficult to infer empirically. Contrary to prior literature (Cohen et al., 2007; Marco-Pallares et al., 2008; HajiHosseini et al., 2012), beta band power was negatively correlated with positive prediction error. This beta band finding bolsters the suggestion that beta likely reflects more of an active inhibition/disinhibition of motor commands (Engel and Fries, 2010) and is not an axiomatic marker of positive prediction error (cf. Caplin and Dean, 2008). The Rew-P is oftentimes interpreted as having a source in the midcingulate cortex (MCC) (Walsh and Anderson, 2012), yet the signal reported here had a more posterior distribution (an interpretation bolstered by temporally specific topographic plots of the scalp Laplacian: Fig. 2C).
However, the correlations between positive prediction error and EEG activities were more central. The common practice of investigating reward-locked ERPs in contrast to punishment-locked ERPs (i.e. creation of a difference wave) may contribute to some of the minor differences between the prior literature and the condition-specific signal reported here. The findings reported here are consistent with previous suggestions that the Rew-P reflects the downstream outcome in MCC of a dopaminergic burst to better-than-expected reward prediction errors (Holroyd et al., 2008), but pharmacological studies are clearly needed for a direct test of that hypothesis (see footnote 2). If MCC selects actions based on average reward rate (Holroyd and McClure, 2015), input from a reward prediction error signal (such as that provided by the Rew-P) may be necessary in order to integrate average reward regardless of the policy instantiated by a higher-level model-based controller. In contrast, the model-based state prediction error may be utilized by dorsolateral PFC and intraparietal sulcus (Gläscher et al., 2010), both major areas implicated in the generation of the P3 (Friedman et al., 2001; Polich, 2007).
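The remark in this section that a reliably timed burst can yield ITPC even in the absence of oscillations can be checked with a small simulation. The sketch below is illustrative only: it adds an identical, non-oscillatory Gaussian bump at a fixed latency to white noise on every trial, extracts 2 Hz phase with a complex Morlet wavelet, and compares ITPC at the burst latency with a pre-stimulus baseline. The sampling rate, burst shape, amplitude, and wavelet width are all assumptions chosen for the demonstration and are not taken from the recorded data.

```python
import numpy as np

def morlet_phase(signal, srate, freq, n_cycles=3):
    """Phase (radians) of one trial after convolution with a complex Morlet wavelet."""
    sigma_t = n_cycles / (2 * np.pi * freq)          # Gaussian SD in seconds
    half = int(3 * sigma_t * srate)                  # truncate wavelet at +/- 3 SD
    t = np.arange(-half, half + 1) / srate
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))
    return np.angle(np.convolve(signal, wavelet, mode="same"))

rng = np.random.default_rng(2)
srate, n_trials, freq = 100, 100, 2.0                # assumed values
times = np.arange(-1.0, 2.0, 1 / srate)              # seconds relative to feedback

# A single monophasic bump at 350 ms: a burst, not an ongoing oscillation.
burst = 2.0 * np.exp(-(times - 0.35)**2 / (2 * 0.08**2))
trials = burst + rng.normal(0.0, 1.0, size=(n_trials, times.size))

# ITPC across trials at every time point.
phases = np.array([morlet_phase(trial, srate, freq) for trial in trials])
itpc = np.abs(np.mean(np.exp(1j * phases), axis=0))

itpc_burst = itpc[np.argmin(np.abs(times - 0.35))]   # at the burst latency
itpc_base = itpc[np.argmin(np.abs(times + 0.5))]     # pre-stimulus baseline
print(f"ITPC at burst: {itpc_burst:.2f}, baseline: {itpc_base:.2f}")
```

In this simulation the high ITPC arises purely from the consistent timing of an evoked event, which is why ITPC alone cannot distinguish a phase-reset oscillation from a time-locked burst.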
Differences and similarities in delta band surprise signals

While the eliciting circumstances of the Rew-P and P3 ERP components are clearly differentiated, there are many commonalities between these signals (e.g. polarity, latency, topography, spectral) that may suggest similar underlying processes. ERP components are inferred by indicators such as polarity, latency, and topography, but their ultimate definition relies upon a presumed latent computational operation, which can complicate a straightforward identification by manifest indicators (Luck, 2005). Sensitivity to some form of surprise is a latent computation common to many ERP components, yet different components may reflect hierarchically different surprise signals (Friston, 2003, 2005). A canonical posterior P3b is modulated by the probability and motivational significance of a rare event (Nieuwenhuis et al., 2005), whereas the anterior P3a is primarily sensitive to truly novel distracting stimuli (Luck, 2005; Polich, 2007). While manifest aspects of the Rew-P share some similarities with P3a and P3b, the specific sensitivity of the Rew-P to surprising rewards as well as the much earlier time course suggests a more specialized computational operation than a canonical P3a or P3b. In contrast, the bandit-locked P3 appears similar to a P3b, and it shares a common operation of enhanced motivational significance. Late parietal positivities bearing some spatiotemporal resemblance to a P3b have been recently implicated in the convergence of information accumulation leading to a decision. In response to imperative cues, P3b-like activities correlate with decision confidence (Fischer and Ullsperger, 2013), and peak at the moment of response execution, thus co-varying with the objective marker of decision execution, the RT (O'Connell et al., 2012; Kelly and O'Connell, 2013). Fitting with the hierarchical interpretation of the EEG signals reported here, reward and state prediction errors should share a broadly common sensitivity to surprise, but reward prediction errors (i.e. Rew-P) should be characterized by a greater specificity of surprise whereas state prediction errors (i.e. P3) should be more intimately related to an agent's decision making policy. While the comparison of manifest and latent determinants of ERP components is interesting, the nebulous science of defining EEG features as ERP components may hinder the ultimate aim of understanding common parsimonious mechanistic processes underlying these neural events.

2 As predicted from this same report, dopamine dips could modulate frontal midline theta activities over a half or quarter oscillation cycle to enhance punishment adaptation in the time range of the N2. The interpretation of frontal theta here is agnostic to this possibility in the absence of pharmacological data, and would remain largely unchanged (only modulated) by such an occurrence.
The manifest similarities (polarity, latency, topography, spectral) between the centro-posterior delta band activities observed in the Rew-P and later centro-parietal delta band positivities in the P3 are compelling, and thus may implicate shared generators or common low level operations. In addition to suggested MCC sources of the Rew-P and P3 (Polich, 2007; Walsh and Anderson, 2012), the posterior distribution and the slow timescale of delta band activities suggest that multiple areas in the cortical midline and parietal cortex could contribute to both signals. Posterior cingulate is also a compelling candidate generator, as it has been described as a hub linking reinforcement, memory, attention, and action selection, particularly when switching strategies (Pearson et al., 2009, 2011). Unfortunately, without highly detailed source analyses or clear behavioral correlates, it is difficult to infer why surprising rewards are reflected in centro-posterior delta instead of, for instance, theta or beta. The nature of the delta band may provide clues for further study. A large-amplitude low frequency temporal organization scheme like delta phase may be ideal for organizing activities across large spatial distances into a transient functional network (Buzsáki and Draguhn, 2004; Uhlhaas et al., 2010). Lateral parietal neurons are commonly implicated in evidence accumulation (Sugrue et al., 2005; Gold and Shadlen, 2007), and parietal delta oscillations modulate rhythmic gain of information accumulation with phase-specific dynamics (Lakatos et al., 2008; Schroeder and Lakatos, 2009; Wyart et al., 2012; Cheadle et al., 2014). Indeed, the current findings suggest that delta phase links integrative parietal activities with motor cortex disinhibition to motivate rule-based action selection (Fig. 6).
It is possible that the delta band response to surprising rewards observed in the Rew-P reflects a need to transiently link brain-wide states, actions, and outcomes together to inform downstream integrators of credit assignment, such as orbitofrontal cortex, MCC or striatum. If this hypothesis is true, pattern classification may reveal the activation of state or action representations during the time course of the Rew-P (see Collins et al., 2014 for the use of this method to reveal latent representations).

Conclusion

The feedback-locked Rew-P and constituent delta band activities appear to be specific to surprising rewards, but they do not predict
associated behavioral adjustments. In contrast, delta band phase dynamics to the imperative cue (P3) appear to be mechanistically involved in strategic behavioral adjustments. These findings suggest that the delta band activities underlying Rew-P and P3 reflect hierarchically different levels of prediction error. Experimenters are urged to test this novel hypothesis and report any errors in this prediction. Perhaps the findings will be even better than expected.

Acknowledgments

The author thanks the members of the UNM Cognitive Rhythms and Computation Lab for data collection and Clay Holroyd for helpful discussions of the Rew-P and hierarchical prediction errors.

References

Baker, T.E., Holroyd, C.B., 2011. Dissociated roles of the anterior cingulate cortex in reward and conflict processing as revealed by the feedback error-related negativity and N200. Biol. Psychol. 87 (1), 25–34. Bernat, E.M., Nelson, L.D., Steele, V.R., Gehring, W.J., Patrick, C.J., 2011. Externalizing psychopathology and gain–loss feedback in a simulated gambling task: dissociable components of brain response revealed by time–frequency analysis. J. Abnorm. Psychol. 120, 352–364. Botvinick, M.M., Niv, Y., Barto, A.G., 2009. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280. Bunge, S.A., 2004. How we use rules to select actions: a review of evidence from cognitive neuroscience. Cogn. Affect. Behav. Neurosci. 4, 564–579. Buzsáki, G., Draguhn, A., 2004. Neuronal oscillations in cortical networks. Science 304, 1926–1929. Camerer, C., Ho, T.H., 1999. Experience-weighted attraction learning in normal form games. Econometrica 67, 827–874. Canolty, R.T., Edwards, E., Dalal, S.S., Soltani, M., Nagarajan, S.S., Kirsch, H.E., Berger, M.S., Barbaro, N.M., Knight, R.T., 2006. High gamma power is phase-locked to theta oscillations in human neocortex. Science 313, 1626–1628. Caplin, A., Dean, M., 2008.
Axiomatic methods, dopamine and reward prediction error. Curr. Opin. Neurobiol. 18, 197–202. Cavanagh, J.F., Frank, M.J., 2014. Frontal theta as a mechanism for cognitive control. Trends Cogn. Sci. 1–8. Cavanagh, J.F., Shackman, A.J., 2014. Frontal midline theta reflects anxiety and cognitive control: meta-analytic evidence. J. Physiol. Paris http://dx.doi.org/10.1016/j.jphysparis.2014.04.003 (in press). Cavanagh, J.F., Frank, M.J., Klein, T.J., Allen, J.J.B., 2010. Frontal theta links prediction errors to behavioral adaptation in reinforcement learning. Neuroimage 49, 3198–3209. Cavanagh, J.F., Figueroa, C.M., Cohen, M.X., Frank, M.J., 2012a. Frontal theta reflects uncertainty and unexpectedness during exploration and exploitation. Cereb. Cortex 22, 2575–2586. Cavanagh, J.F., Zambrano-Vazquez, L., Allen, J.J.B., 2012b. Theta lingua franca: a common mid-frontal substrate for action monitoring processes. Psychophysiology 49, 220–238. Cavanagh, J.F., Masters, S.E., Bath, K., Frank, M.J., 2014. Conflict acts as an implicit cost in reinforcement learning. Nat. Commun. 5, 5394. Chase, H.W., Swainson, R., Durham, L., Benham, L., Cools, R., 2010. Feedback-related negativity codes prediction error but not behavioral adjustment during probabilistic reversal learning. J. Cogn. Neurosci. 23 (4), 936–946. Cheadle, S., Wyart, V., Tsetsos, K., Myers, N., de Gardelle, V., Herce Castañón, S., Summerfield, C., 2014. Adaptive gain control during human perceptual choice. Neuron 81, 1429–1441. Cohen, M.X., 2011. It's about time. Front. Hum. Neurosci. 5, 2. Cohen, M.X., 2014. Analyzing Neural Time Series Data: Theory and Practice. MIT Press, Cambridge, MA. Cohen, M.X., Cavanagh, J.F., 2011. Single-trial regression elucidates the role of prefrontal theta oscillations in response conflict. Front. Psychol. 2, 30. Cohen, M.X., Elger, C.E., Ranganath, C., 2007. Reward expectation modulates feedback-related negativity and EEG spectra. Neuroimage 35, 968–978.
Collins, A.G.E., Frank, M.J., 2013. Cognitive control over learning: creating, clustering, and generalizing task-set structure. Psychol. Rev. 120, 190–229. Collins, A., Koechlin, E., 2012. Reasoning, learning, and creativity: frontal lobe function and human decision-making. PLoS Biol. 10, e1001293. Collins, A.G.E., Cavanagh, J.F., Frank, M.J., 2014. Human EEG uncovers latent generalizable rule structure during learning. J. Neurosci. 34, 4677–4685. Daw, N.D., 2011. Trial by trial data analysis using computational models. In: Delgado, M.R., Phelps, E.A., Robbins, T.W. (Eds.), Decision Making, Affect, and Learning: Attention and Performance XXIII. Oxford University Press, pp. 1–26. Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J., 2006. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879. Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., Dolan, R.J., 2011. Model-based influences on humans' choices and striatal prediction errors. Neuron 69, 1204–1215. Delorme, A., Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21.
Den Ouden, H.E.M., Kok, P., de Lange, F.P., 2012. How prediction errors shape perception, attention, and motivation. Front. Psychol. 3, 548. Doll, B.B., Simon, D.A., Daw, N.D., 2012. The ubiquity of model-based reinforcement learning. Curr. Opin. Neurobiol. 22, 1075–1081. Engel, A.K., Fries, P., 2010. Beta-band oscillations—signalling the status quo? Curr. Opin. Neurobiol. 20, 156–165. Fischer, A.G., Ullsperger, M., 2013. Real and fictive outcomes are processed differently but converge on a common adaptive mechanism. Neuron 79, 1243–1255. Frank, M.J., Moustafa, A.A., Haughey, H.M., Curran, T., Hutchison, K.E., 2007. Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc. Natl. Acad. Sci. U. S. A. 104, 16311–16316. Friedman, D., Cycowicz, Y.M., Gaeta, H., 2001. The novelty P3: an event-related brain potential (ERP) sign of the brain's evaluation of novelty. Neurosci. Biobehav. Rev. 25, 355–373. Friston, K., 2003. Learning and inference in the brain. Neural Netw. 16, 1325–1352. Friston, K., 2005. A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 815–836. Gläscher, J., Daw, N., Dayan, P., O'Doherty, J.P., 2010. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66, 585–595. Gold, J.I., Shadlen, M.N., 2007. The neural basis of decision making. Annu. Rev. Neurosci. 30, 535–574. Hajihosseini, A., Holroyd, C.B., 2013. Frontal midline theta and N200 amplitude reflect complementary information about expectancy and outcome evaluation. Psychophysiology 50 (6), 550–562. HajiHosseini, A., Rodríguez-Fornells, A., Marco-Pallarés, J., 2012. The role of beta-gamma oscillations in unexpected rewards processing. Neuroimage 60, 1678–1685. Hauser, T.U., Iannaccone, R., Stämpfli, P., Drechsler, R., Brandeis, D., Walitza, S., Brem, S., 2014. 
The feedback-related negativity (FRN) revisited: new insights into the localization, meaning and network organization. Neuroimage 84, 159–168. Holroyd, C.B., Coles, M.G., 2002. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychol. Rev. 109, 679–709. Holroyd, C.B., McClure, S.M., 2015. Hierarchical control over effortful behavior by rodent medial frontal cortex. Psychol. Rev. 122 (1), 54–83. Holroyd, C.B., Pakzad-Vaezi, K.L., Krigolson, O.E., 2008. The feedback correct-related positivity: sensitivity of the event-related brain potential to unexpected positive feedback. Psychophysiology 45, 688–697. Ichikawa, N., Siegle, G.J., Dombrovski, A., Ohira, H., 2010. Subjective and model-estimated reward prediction: association with the feedback-related negativity (FRN) and reward prediction error in a reinforcement learning task. Int. J. Psychophysiol. 78, 273–283. Jepma, M., Nieuwenhuis, S., 2011. Pupil diameter predicts changes in the exploration–exploitation trade-off: evidence for the adaptive gain theory. J. Cogn. Neurosci. 23, 1587–1596. Kelly, S.P., O'Connell, R.G., 2013. Internal and external influences on the rate of sensory evidence accumulation in the human brain. J. Neurosci. 33, 19434–19441. Krugel, L.K., Biele, G., Mohr, P.N.C., Li, S.C., Heekeren, H.R., 2009. Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. Proc. Natl. Acad. Sci. U. S. A. 106, 17951–17956. Kujawa, A., Proudfit, G.H., Klein, D.N., 2014. Neural reactivity to rewards and losses in offspring of mothers and fathers with histories of depressive and anxiety disorders. J. Abnorm. Psychol. 123, 287–297. Lachaux, J.P., Rodriguez, E., Martinerie, J., Varela, F.J., 1999. Measuring phase synchrony in brain signals. Hum. Brain Mapp. 8, 194–208. Lakatos, P., Karmos, G., Mehta, A.D., Ulbert, I., Schroeder, C.E., 2008.
Entrainment of neuronal oscillations as a mechanism of attentional selection. Science 320, 110–113. Luck, S., 2005. An Introduction to the Event-Related Potential Technique (Cognitive Neuroscience). 1st ed. A Bradford Book. Lukie, C.N., Montazer-Hojat, S., Holroyd, C.B., 2014. Developmental changes in the reward positivity: an electrophysiological trajectory of reward processing. Dev. Cogn. Neurosci. 9, 191–199. Luu, P., Tucker, D.M., Makeig, S., 2004. Frontal midline theta and the error-related negativity: neurophysiological mechanisms of action regulation. Clin. Neurophysiol. 115, 1821–1835. Marco-Pallares, J., Cucurell, D., Cunillera, T., Garcia, R., Andres-Pueyo, A., Munte, T.F., Rodriguez-Fornells, A., 2008. Human oscillatory activity associated to reward processing in a gambling task. Neuropsychologia 46, 241–248. Nieuwenhuis, S., Aston-Jones, G., Cohen, J.D., 2005. Decision making, the P3, and the locus coeruleus–norepinephrine system. Psychol. Bull. 131, 510–532. Nolan, H., Whelan, R., Reilly, R.B., 2010. FASTER: fully automated statistical thresholding for EEG artifact rejection. J. Neurosci. Methods 192, 152–162. O'Connell, R.G., Dockree, P.M., Kelly, S.P., 2012. A supramodal accumulation-to-bound signal that determines perceptual decisions in humans. Nat. Neurosci. 15, 1729–1735. O'Reilly, J.X., 2013. Making predictions in a changing world-inference, uncertainty, and learning. Front. Neurosci. 7, 105. Payzan-LeNestour, E., Bossaerts, P., 2011. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Comput. Biol. 7, e1001048. Pearce, J.M., Hall, G., 1980. A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol. Rev. 87, 532–552. Pearson, J.M., Hayden, B.Y., Raghavachari, S., Platt, M.L., 2009. Neurons in posterior cingulate cortex signal exploratory decisions in a dynamic multioption choice task. Curr. Biol. 19, 1532–1537. 
Pearson, J.M., Heilbronner, S.R., Barack, D.L., Hayden, B.Y., Platt, M.L., 2011. Posterior cingulate cortex: adapting behavior to a changing world. Trends Cogn. Sci. 15, 143–151.
Philiastides, M.G., Biele, G., Vavatzanidis, N., Kazzer, P., Heekeren, H.R., 2010. Temporal dynamics of prediction error processing during reward-based decision making. Neuroimage 53, 221–232.
Polich, J., 2007. Updating P300: an integrative theory of P3a and P3b. Clin. Neurophysiol. 118, 2128–2148.
Ribas-Fernandes, J.J.F., Solway, A., Diuk, C., McGuire, J.T., Barto, A.G., Niv, Y., Botvinick, M.M., 2011. A neural signature of hierarchical reinforcement learning. Neuron 71, 370–379.
Schroeder, C.E., Lakatos, P., 2009. Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci. 32, 9–18.
Stephan, K.E., Penny, W.D., Daunizeau, J., Moran, R.J., Friston, K.J., 2009. Bayesian model selection for group studies. Neuroimage 46, 1004–1017.
Sugrue, L.P., Corrado, G.S., Newsome, W.T., 2005. Choosing the greater of two goods: neural currencies for valuation and decision making. Nat. Rev. Neurosci. 6, 363–375.
Talmi, D., Atkinson, R., El-Deredy, W., 2013. The feedback-related negativity signals salience prediction errors, not reward prediction errors. J. Neurosci. 33, 8264–8269.
Trujillo, L.T., Allen, J.J., 2007. Theta EEG dynamics of the error-related negativity. Clin. Neurophysiol. 118, 645–668.
Uhlhaas, P.J., Roux, F., Rodriguez, E., Rotarska-Jagiela, A., Singer, W., 2010. Neural synchrony and the development of cortical networks. Trends Cogn. Sci. 14, 72–80.
Walsh, M.M., Anderson, J.R., 2011. Modulation of the feedback-related negativity by instruction and experience. 2011.
Walsh, M.M., Anderson, J.R., 2012. Learning from experience: event-related potential correlates of reward processing, neural adaptation, and behavioral choice. Neurosci. Biobehav. Rev. 36, 1870–1884.
Weinberg, A., Riesel, A., Proudfit, G.H., 2014. Show me the money: the impact of actual rewards and losses on the feedback negativity. Brain Cogn. 87, 134–139.
Wyart, V., de Gardelle, V., Scholl, J., Summerfield, C., 2012. Rhythmic fluctuations in evidence accumulation during decision making in the human brain. Neuron 76, 847–858.