Cortex 49 (2013) 1627–1635
Research report
Value and prediction error estimation account for volatility effects in ACC: A model-based fMRI study

Massimo Silvetti a,b,*, Ruth Seurinck a,b and Tom Verguts a,b

a Department of Experimental Psychology, Ghent University, Ghent, Belgium
b GIfMI (Ghent Institute for Functional and Metabolic Imaging), Ghent University Hospital, Ghent, Belgium
article info

Article history:
Received 23 September 2011
Reviewed 30 January 2012
Revised 24 February 2012
Accepted 17 May 2012
Action editor Stefano Cappa
Published online 26 May 2012

Keywords:
Volatility
ACC
Reinforcement learning
Reward
Prediction error

abstract

In order to choose the best action for maximizing fitness, mammals can estimate the reward expectations (value) linked to available actions based on past environmental outcomes. Value updates are performed by comparing the current value with the actual environmental outcomes (prediction error). The anterior cingulate cortex (ACC) has been shown to be critically involved in the computation of value and its variability across time (volatility). Previously, we proposed a new neural model of the ACC based on single-unit ACC neurophysiology, the Reward Value and Prediction Model (RVPM). Here, using the RVPM in computer simulations and in a model-based fMRI study, we found that highly uncertain but non-volatile environments activate ACC more than volatile environments, demonstrating that value estimation by means of prediction error computation can account for the effect of volatility in ACC. These findings suggest that ACC response to volatility can be parsimoniously explained by basic ACC reward processing.

© 2012 Elsevier Ltd. All rights reserved.
1. Introduction
In order to obtain resources for its preservation and achieve its behavioral goals, any organism has to establish an adaptive interaction with its environment. For example, an animal looking for food must discover which places in its environment are more likely to contain alimentary rewards. This class of problems is studied by reinforcement learning (RL), a theoretical framework developed in the field of machine learning (Sutton and Barto, 1998). The simplest way to maximize reward consists of estimating the probability with which a reward will follow some action or environmental state, and then selecting an action based on these estimates. For
example, a weasel could have learned that a particular odor near a hole in the ground indicates the presence of a prey, and thus it could select the action of entering the hole. These estimates (or predictions; denoted by V here) can be updated by evaluating how well they correspond to actually obtained rewards. The difference between predicted and actual rewards is called "prediction error" (here δ). In other words, the estimates of V are constantly updated by the estimated δ. Neurophysiological investigations showed that the anterior cingulate cortex (ACC) has neuronal populations that code for the predicted reward (Amiez et al., 2006; Matsumoto et al., 2003) and two different neural populations coding for prediction errors (δ) (Amiez et al., 2006; Matsumoto et al., 2007). One
* Corresponding author. Department of Experimental Psychology, Ghent University, Henri Dunantlaan 2, 9000 Ghent, Belgium.
E-mail address: [email protected] (M. Silvetti).
0010-9452/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.cortex.2012.05.008
of the δ populations codes for positive prediction errors (here called δ+), i.e., when the reward is bigger than expected, and the other for negative prediction errors (here called δ−), when the reward is smaller than expected. Prediction error-related activity in ACC has been confirmed in humans by EEG and fMRI studies (Jessup et al., 2010; Oliveira et al., 2007; Nunez Castellar et al., 2010). For many years, the ACC has been indicated as one of the main areas involved in adaptive control of behavior (Ridderinkhof et al., 2004), and operations other than reward prediction and prediction error have also been ascribed to it. For example, it has been proposed that ACC is involved in error processing (Critchley et al., 2005), in the estimation of the probability of committing an error (Brown and Braver, 2005), or in conflict monitoring (Botvinick et al., 2001). In earlier work we proposed a new neural model of ACC, the Reward Value and Prediction Model (RVPM) (Silvetti et al., 2011). This model is based exclusively on the available single-unit evidence described above. It computes reward expectations by means of prediction error signals. The RVPM has been able to provide a neurocomputational account of various EEG and fMRI findings in humans, including error detection, error likelihood estimation, and response conflict (Silvetti et al., 2011), showing how heterogeneous experimental findings on the ACC can be interpreted as deriving from the computation of reward expectations and prediction errors. Besides the proposed ACC functions cited above, a recent and influential paper showed that volatility modulates ACC activity (Behrens et al., 2007). Volatility is a measure of the stability of statistical parameters (e.g., mean or variance) in a stochastic system. In reward paradigms, volatility refers to the stability of the probability with which rewards are associated with cues (environmental states or actions).
This empirical finding raises the question whether ACC also computes higher-order statistics (volatility) in addition to the lower-order statistics of reward prediction or prediction error. On the one hand, this hypothesis can elegantly account for both behavioral and fMRI findings in humans. On the other hand, the same hypothesis does not seem to have a precise neurophysiological counterpart, because single-unit studies have not documented ACC neuronal populations coding for volatility. In this paper we will show, using the RVPM, that volatility effects can be parsimoniously explained by means of reward prediction and prediction error computation. In the RVPM, the average activity of prediction error neurons over a time bin reflects the reward uncertainty during that time bin. Therefore, the average activity of δ+ and δ− neurons in the RVPM is maximal when the reward probability is .5 (i.e., when it is maximally uncertain; see Fig. S1 in Supplementary Material). Here we advance the hypothesis that average ACC activation during outcome monitoring (reward period) is mainly driven by the average prediction error evoked by outcomes. This also implies that ACC response to volatility is mainly due to the general level of uncertainty, as the predictive links between cues and outcomes change over time in volatile environments, increasing the average activation of the prediction error units (Silvetti et al., 2011). It is useful to point out here a distinction between expected versus unexpected uncertainty (Yu and Dayan, 2005). In the rest of the paper, when we
refer to uncertainty without further specification, we will mean "expected uncertainty" (i.e., the expected reliability of predictive cues) and not "unexpected uncertainty" (i.e., unexpected changes in the reliability of predictive cues) (Yu and Dayan, 2005). The concept of unexpected uncertainty is closely related to volatility. We propose that ACC calculates prediction error and is sensitive to all types of uncertainty. One consequence is that it is also active in volatile environments. A crucial prediction is that highly uncertain (but not volatile) environments should activate the ACC more than volatile environments. We test this in the current study. In the first part we investigate, by means of RVPM computer simulations, the effects of different levels of uncertainty and volatility on ACC activity. In the second part of the study we tested the predictions derived from the computer simulations in a model-based fMRI experiment, in which we administered to a group of healthy volunteers a task with monetary rewards in environments with different levels of volatility and uncertainty.
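The value-updating scheme just described can be sketched with a Rescorla–Wagner-style delta rule. This is a deliberate simplification (function names, learning rate and trial counts below are illustrative, not the RVPM's actual parameters), but it reproduces the property the argument rests on: average absolute prediction error tracks reward uncertainty and peaks at a reward probability of .5.

```python
import random

def run_delta_rule(p_reward, n_trials=2000, alpha=0.1, seed=0):
    """Track a value estimate V with a simple delta rule and return
    the average absolute prediction error |delta| over the run."""
    rng = random.Random(seed)
    v = 0.0
    total_abs_pe = 0.0
    for _ in range(n_trials):
        reward = 1.0 if rng.random() < p_reward else 0.0
        delta = reward - v       # prediction error
        v += alpha * delta       # value update toward the outcome
        total_abs_pe += abs(delta)
    return total_abs_pe / n_trials

# Average |delta| peaks when reward delivery is maximally uncertain
# (p = .5) and shrinks as p approaches 0 or 1.
for p in (0.1, 0.33, 0.5, 0.87):
    print(p, round(run_delta_rule(p), 3))
```

In a volatile environment the same mechanism produces elevated average |δ| after each contingency switch, since V must be re-learned; no explicit volatility estimator is needed for that effect.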
2. Methods

2.1. Simulation
The RVPM consists of three computational modules (Silvetti et al., 2011) (Fig. 1; see also Supplementary Methods). The ACC module receives afferents from units coding for external cues (C1 and C2, representing either stimuli or planned actions) and from a module simulating brainstem dopaminergic afferents (e.g., from the ventral tegmental area, VTA) (Williams and Goldman-Rakic, 1998). We performed 20 simulations, each including three runs of 72 trials (Fig. 2a). In each run we administered to the RVPM trials consisting of cues probabilistically followed by rewards (see also the Supplementary Methods). In the first run we exposed the model to a stationary (Stat) environment, where
Fig. 1 – Model structure. V: reward prediction unit; δ+ and δ−: positive and negative prediction error units; C1 and C2: units coding for events; RW: unit generating the reward signal.
Fig. 2 – (a) Schema representing the three SEs we administered to the RVPM in the simulation. (b) Grand average of RVPM activity (output from the whole ACC module) during a trial. (c) Grand average of the signal power between cue onset and reward offset, for all three SEs.
two cues were rewarded with constant probabilities (87% and 33% for C1 and C2 respectively). In the second run, we administered a second stationary (Stat2) environment, where the cues were rewarded with constant (i.e., non-volatile) but highly uncertain probabilities (each 60%). Finally, in the third run, we administered a volatile (Vol) environment, where the probabilities linking cues and rewards started as in the Stat condition, but switched halfway (Behrens et al., 2007). In every trial, the model was required to decide between cue 1 and cue 2, and then a reward was given with a defined probability. For this purpose, we provided the RVPM with a module (Actor) choosing between the two cues. In particular, a Softmax Actor made a choice between the two cues on the basis of the expected reward for each cue (V signal). We used a Boltzmann probability distribution having as argument the V value during the last cue period:

$$p(C_i) = \frac{e^{V_i/\mathrm{Temp}}}{\sum_j e^{V_j/\mathrm{Temp}}} \qquad (1)$$
where p(Ci) is the probability of selecting the ith cue, Temp = 4 is the temperature parameter, and Vi is the V response to the
last presentation of the ith cue. We set the temperature parameter to a quite high value; as a result, there was just a small preference for the cue with the higher expectation (55%). This functioned as a convenient approximation for simulating the behavioral results of Behrens et al. (2007), where subjects often chose the cue with the lowest reward probability. Moreover, it ensured an adequate number of selections of both cues. The Actor module was used only in these simulations and not in the simulations producing the regressor signal for the model-based fMRI analysis (see the RVPM regressor in the following section).
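The Softmax Actor of Eq. (1) can be implemented in a few lines. In this sketch the V values are illustrative estimates set to the asymptotic reward rates of the Stat run, not actual RVPM outputs:

```python
import math

def softmax_choice_probs(values, temp=4.0):
    """Boltzmann (softmax) choice probabilities over cue values,
    as in Eq. (1): p(C_i) = exp(V_i/Temp) / sum_j exp(V_j/Temp)."""
    exps = [math.exp(v / temp) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# With values near the Stat reward rates (.87 and .33) and the high
# temperature Temp = 4, the better cue is only mildly preferred,
# in line with the ~55% preference reported in the text.
probs = softmax_choice_probs([0.87, 0.33], temp=4.0)
print([round(p, 3) for p in probs])
```

A high temperature flattens the distribution, which is what guarantees that both cues are sampled often enough for prediction errors to be observed on each.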
2.2. fMRI experiment

2.2.1. Experimental procedure
Twenty-one right-handed healthy volunteers (mean age 23, range 19–26, 14 females) gave written informed consent to participate in the study, which was previously approved by the ethical committee of Ghent University. Two participants were excluded from the study because of excessive head motion during scanning. The protocol administered to each subject consisted of three phases, of which only the third
included fMRI scanning. The first phase consisted of setting the luminance of the stimuli, the second was a training session, and the third was the scanning session (see also the Supplementary Methods). Subjects performed a perceptual task with the same Stat, Stat2, and Vol conditions used in the computer simulations. Each subject performed each run (Stat, Stat2, Vol) twice, for a total of six runs, or 432 task trials and 96 catch trials, following the experimental timeline shown in Fig. 3a. Trials were characterized by two different colors (red and blue), coding for reward probability (Knutson et al., 2000). Fig. 3b shows the sequence of events occurring within each task trial. One of the three disks appeared some milliseconds before the other two. The subjects were asked to press a button indicating which disk appeared first. The delay between the appearance of the first disk and the other two was adjusted during the training period, in order to obtain the desired reward probabilities (see also the Behavioral task section in the Supplementary Material). The reward consisted of 6 Euro-cents, given after each correct response. Subjects were not informed about the association between colors and reward probability. Immediately after the response, an accuracy feedback sign was displayed. Each experimental run was performed in one fMRI run. The mean total scanning duration was 45′48″. To keep subjects' performance stable during the Stat2 epochs (the 60% hit rate was sensitive to learning, fatigue, and fluctuations of attention), the program computed a running mean of the accuracy and continued adjusting the delay to keep performance as close as possible to 60%.
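The delay-titration procedure can be sketched as a running-mean staircase. The step size, trial window and return values below are illustrative assumptions; the actual parameters are deferred to the Supplementary Material:

```python
def adjust_delay(delay_ms, recent_hits, target=0.60, step_ms=5,
                 min_trials=10):
    """Running-mean staircase: nudge the disk-onset delay so that
    accuracy stays near the target rate. A shorter delay makes the
    first disk harder to detect, lowering the hit rate."""
    if len(recent_hits) < min_trials:
        return delay_ms                      # not enough data yet
    hit_rate = sum(recent_hits) / len(recent_hits)
    if hit_rate > target:
        return max(0, delay_ms - step_ms)    # too easy: shrink delay
    if hit_rate < target:
        return delay_ms + step_ms            # too hard: grow delay
    return delay_ms

# 16 hits in the last 20 trials (80% > 60% target): delay shrinks.
print(adjust_delay(50, [1] * 16 + [0] * 4))
```

Titrating difficulty this way makes reward probability a controlled variable rather than a property of the subject's fixed skill level.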
2.2.2. RVPM regressor and design matrix
In order to obtain the RVPM time series predicting the ACC activity, we administered to our model the same sequence of stimulus–outcome pairs that each subject received (and produced) during the fMRI experiment (excluding the training sessions). We considered as RVPM output the sum of the three ACC units: V, δ+, and δ−. This allowed us to obtain a "personal" simulation of ACC activity for each single volume (i.e., every TR, 2 sec) of each subject (see visual summary in Fig. 4a). As both the behavioral choices and the response times (RTs) we administered to the model were those produced by the subjects, we did not implement an Actor module in these simulations. In the same design matrix we also included the regressors relative to task trials, catch trials and RTs in order to remove their effects as possible confounds. For a full description of all the regressors we used in the General Linear Model, we refer to the Methods section of the Supplementary Material.
2.2.3. Imaging protocol
We acquired fMRI images with a Siemens Trio 3T scanner and an 8-channel head coil. Head movements were minimized by mild restraints and cushioning. Each brain volume was acquired with 30 T2*-weighted slices, voxel resolution 3.5 × 3.5 × 3.5 mm³, repetition time 2 sec, echo time 30 msec, and interslice distance .75 mm. We used SPM8 (http://www.fil.ion.ucl.ac.uk/spm/) and the Marsbar toolbox (Brett et al., 2002) for data processing. The first four volumes of each run were discarded to allow for magnetic field equilibration. After that, all images were corrected for head movements and slice
Fig. 3 – (a) Experiment timeline. We administered three different SEs: stationary (Stat), stationary with a high level of uncertainty (Stat2) and volatile (Vol), following the same schema as the simulation (Fig. 2a). (b) Events occurring during a single trial (we arbitrarily depicted a red trial). First a fixation cross appears; the fixation cross is colored; then the actual task stimulus is shown (three disks); after the response, feedback is given. The catch trial timeline is described in Fig. S3 of the Supplementary Methods.
Fig. 4 – (a) Simplified flowchart showing the model-based analysis in its three key steps: the administration to the model of each individual subject's behavioral performance (RTs, rewards and cue type), the extraction of the model response, and finally the use of this response as a regressor in the GLM, for estimating the fit between the model output and the BOLD signal from each voxel. (b) Results of the model-based analysis showing that the RVPM activity is highly correlated with the ACC BOLD response (t-contrast: Stat+Stat2+Vol). The figure also shows the ventral striatal and left premotor activations. Bar plot: effect size related to the RVPM regressor in the ACC region of interest (ROI) (local maximum MNI: 12, 14, 44, see Methods) across the three SEs (to be compared with the plot in Fig. 2c).
acquisition delays, then normalized to the SPM8 EPI template and spatially smoothed by an isotropic Gaussian Kernel with 8 mm full-width half-maximum.
3. Results

3.1. Simulation
The first step of our study consisted of providing a general prediction of ACC behavior as a function of different statistical environments (SEs). Fig. 1 shows the architecture of the RVPM [see Silvetti et al. (2011) and Supplementary Methods for a more detailed description]. The model learns to associate a value (reward expectation, V unit activity) to each cue, based
on the history of past outcomes (prediction error, δ unit activity). We presented three different environments to the RVPM by manipulating the SE, varying both uncertainty and volatility (Fig. 2a). In the Stat environment, two cues were rewarded with constant probabilities; in the other stationary (Stat2) environment, the cues were rewarded with highly uncertain probabilities; and finally, in the Vol environment, the probability linking cues and rewards switched (volatility) (Behrens et al., 2007). As shown in Fig. 2b, during the cue period, the RVPM showed the same average activity for all three epochs [F(2,57) = .84, p = n.s.]. This is because the average reward rate was the same for each epoch (60%). However, during the reward period, there was a clear main effect of the SE factor [F(2,57) = 53.1, p < .0001; Fig. 2b]. Post hoc analysis revealed
that the RVPM activity was higher during the Vol run than during the Stat run [t(19) = 3.91, p < .0001]. The reason is the higher average activation of δ+ and δ− units during the Vol run. Indeed, after the switch of the associations between cues and reward rates, the RVPM performed a re-mapping of cue–reward expectations using the prediction errors computed by the δ+ and δ− units. However, the RVPM showed higher average activity during the Stat2 run (highly uncertain but non-volatile) than during the Vol run [t(19) = 7.43, p < .0001; see also Fig. 2c].
3.2. fMRI study
Subjects performed a perceptual task with probabilistic reward in the same SEs used for the computer simulations (Stat, Stat2, and Vol) (Fig. 3a). Trials were characterized by two different cues, coding for reward probability (Fig. 3b). As described in the Methods section, reward probability was manipulated by varying the difficulty of the perceptual task. The behavioral results showed that our preset error percentages were very well approximated and that RTs were faster in the easy (87%) than in the difficult (33%) condition; details are reported in the Supplementary Results section. We administered the behavioral data generated by each single subject to the RVPM, and used the model output as a regressor to determine which voxels were significantly correlated with the RVPM activity (model-based approach, Fig. 4a). Note that the simulations used to generate the predicted BOLD signal were separate from those described in section 3.1. The simulations from section 3.1 provided general predictions on ACC behavior in different SEs (see Methods and Supplementary Methods). However, based on the simulations from section 3.1, a quantitative prediction is impossible on a single-subject basis. This is why we constructed a new and "personalized" prediction of the ACC BOLD signal for each single subject, based on his/her behavioral data, to analyze the fMRI data. Fig. 4b shows the voxels correlated to the model (t-contrast Stat+Stat2+Vol), highlighting the fit between the ACC BOLD signal and the RVPM output. On the bottom right we plotted the effect size related to the RVPM regressor as a function of SE, in a region of interest (ROI) (269 voxels) surrounding the ACC local maximum (see Supplementary Methods). The fMRI results were in accordance with the computer simulation (Fig. 2b). Indeed, we obtained an SE main effect [F(2,55) = 18.77, p < .0001].
Post hoc analysis revealed stronger ACC activation for the Vol than for the Stat condition [t(18) = 5.41, p < .0001], but, crucially, also higher activation for the Stat2 than for the Vol condition [t(18) = 1.74, p < .05]. Hence, the overall level of uncertainty rather than volatility determined ACC activation in the current experimental paradigm. To check for the presence of ACC voxels showing higher activity for the Vol than for the Stat2 condition (as predicted by volatility theory), we also computed the contrast Vol−Stat2. Across the entire ACC (including our ROI), we found no voxels exceeding the uncorrected threshold of p < .05. Besides the ACC, other areas also showed a significant fit with the RVPM activity (Table 1), like the VTA, ventral striatum and anterior insula (AIns). We also found a significant fit in fronto-lateral regions and in the right temporo-occipital junction (TOJ) (see Discussion).
Table 1 – List of activations resulting from the contrast Stat+Stat2+Vol.

p (FWE-cor)   Cluster size (vx)   Z      x y z (mm)   Region
<.0001        2546                6.4    8 8 54       pre-SMA
                                  4.87   40 12 54     PMd_L
                                  4.65   12 14 44     ACC
.001          460                 5.35   46 2 28      MFG_R
.022          260                 5.12   12 10 8      Ventral Stri.
<.0001        742                 5      4 18 16      VTA
.001          495                 4.93   44 66 2      TOJ_R
<.0001        592                 4.57   38 6 52      PMd_R
.037          226                 4.55   34 24 0      AIns_R

Ventral Stri. = Ventral Striatum; pre-SMA = pre-Supplementary Motor Area; R = right; L = left. Coordinates are in the MNI frame of reference. Clusters were obtained with threshold p < .001 voxel-wise uncorrected and no extent threshold. The first column of the table indicates the p values, cluster-wise FWE corrected.
To remove a possible effect of the average signal amplitude of the RVPM regressor (which, as we showed in the Simulation section, follows the trend Stat2 > Vol > Stat), we ran an additional analysis. We normalized each RVPM regressor (mean = 0, SD = 1), eliminating any effect due to differences in average signal amplitude across conditions. Fig. 5 displays the t-map showing the voxels active for the main RVPM effect, the Stat2−Vol contrast, and the Vol−Stat contrast [strict conjunction, in which we tested the Conjunction Null hypothesis (Nichols et al., 2005); see Supplementary Methods]. This contrast allowed us to find the voxels that satisfied all of the following conditions: (1) being correlated with the RVPM regressor; (2) showing higher activation for the Stat2 than for the Vol environment; (3) showing higher activation for the Vol than for the Stat environment. The active cluster consisted of 51 voxels, each significant at p < .05 FWE voxel-wise corrected for the volume of the ACC ROI we built in the main analysis (which was here used as an inclusive mask, 269 voxels). The activation peak was at coordinates 6 16 50 [t(18) = 3.01]. To test for the possible presence of voxels more active for the Vol than for the Stat2 condition (i.e., as predicted by volatility theory), we computed another contrast, consisting of the linear contrast Vol−Stat2.
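The normalization step can be sketched as a z-score of each regressor. This is the conceptual operation only (we use the population SD here; whether the actual pipeline's internal scaling matches this exactly is not specified in the text):

```python
import statistics

def zscore(regressor):
    """Rescale a regressor to mean 0 and SD 1, removing condition
    differences carried only by average signal amplitude."""
    mu = statistics.mean(regressor)
    sd = statistics.pstdev(regressor)
    return [(x - mu) / sd for x in regressor]

# Toy regressor: after normalization, only the shape of the time
# series (not its mean level) can drive a fit with the BOLD signal.
reg = [0.2, 0.5, 0.9, 0.4, 0.7]
z = zscore(reg)
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```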
Fig. 5 – t-map showing the active voxels for the conjunction of the contrasts Stat+Stat2+Vol, Stat2−Vol and Vol−Stat.
This contrast revealed no voxels within the ACC ROI exceeding the uncorrected threshold of p < .05. Finally, in order to rule out the possibility that the Stat2 environment accidentally generated more volatility than the Vol environment (e.g., through local fluctuations of reward rates), we compared the reward rate variances of each SE and also ran a supplementary behavioral study. Both analyses confirmed that the Vol condition was indeed more volatile than the two stationary conditions (Stat and Stat2; see Supplementary Results section).
4. Discussion
We proposed a biologically plausible model of ACC, based on the hypothesis that one of the core functions of ACC is to compute reward predictions and prediction errors (Rushworth and Behrens, 2008). The results presented in this paper support a unified theory of ACC function (Silvetti et al., 2011). This theory holds that the computation of reward prediction and prediction error is responsible for most of the experimental findings on ACC function, including those that were previously explained by error detection, error likelihood estimation, conflict monitoring, and volatility. Starting from this framework, our computer simulations and fMRI experiment suggested that the ACC response to volatility (Behrens et al., 2007) can be parsimoniously explained by the sensitivity of this area to prediction errors, providing a neurophysiological and computational rationale for the effect. Moreover, our fMRI study also confirmed a crucial prediction of the RVPM, namely, that highly uncertain but non-volatile environments can lead to higher ACC activation than volatile environments. This suggests that the average response of this area over time is mainly driven by general uncertainty (i.e., by both expected and unexpected uncertainty). A recent theoretical paper described an ACC model based on similar neurocomputational assumptions (Alexander and Brown, 2011). This model, like the RVPM, also provides an explanation of volatility in terms of prediction error activity. The two models also exhibit a few differences; for instance, the RVPM computes expectations and prediction errors specifically for rewards, whereas Alexander and Brown's model computes expectations and prediction errors more generally for environmental outcomes. Regardless of the differences between the two models, the convergence of independent theoretical results about the role of prediction error signals in volatile environments corroborates the robustness of the results of the current study.
Finally, it is worth noting one relevant difference between our experimental approach and that used in some previous studies of the ACC. We investigated ACC processing of expected value and prediction error for stimuli (cue colors, i.e., environmental states), whereas the ACC has often been investigated in relation to choice selection (Behrens et al., 2007; Kennerley et al., 2011) (i.e., actions). Both theoretical considerations (Actor–Critic framework, see below) and experimental findings (Amiez et al., 2006) indicate that stimulus and action evaluation are probably processed in an analogous way by the ACC, and the results of our study (correlation between RVPM
activity and biological ACC activity) corroborate this point of view.
4.1. ACC, prediction error and volatility
These findings expand upon the pioneering observations by Behrens et al. (2007) but, at the same time, introduce some new observations that deserve full discussion. In order to isolate the influence of volatility on ACC activation, Behrens et al. de-confounded the effect of prediction errors by using the signal generated by a Rescorla–Wagner model (1972). This model provided positive and negative responses for positive and negative prediction errors. Here we adopted a different modeling approach, in which the RVPM coding of prediction errors is always positive and provides an output more similar to the neuronal discharge rate signal (Matsumoto et al., 2007), producing a closer approximation to the biological BOLD response. Moreover, in order to demonstrate that the volatility effect was not due to prediction errors, Behrens et al. also documented an absence of linear correlation between the absolute value of prediction errors and volatility. However, this finding is not in conflict with the fact that ACC responds more strongly to uncertain environments, as the amount of uncertainty is not coded by single values of prediction errors, but rather by the average prediction error (i.e., by the average activity of δ units over time). For instance, during the Stat period (87–33% condition) the RVPM showed higher maximal prediction error signals (due to highly unexpected violations of reward predictions) than during the Stat2 period (60–60% condition), but the average activity of δ units was higher in the Stat2 period.
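This dissociation between maximal and average prediction errors can be made concrete with a first-order approximation. Assume a converged delta-rule estimate V ≈ p for a cue rewarded with probability p (a simplification of the RVPM dynamics, not its actual equations):

```latex
% With V -> p, |delta| = 1 - p on rewarded trials (probability p)
% and |delta| = p on unrewarded trials (probability 1 - p):
E\left[|\delta|\right] = p(1-p) + (1-p)p = 2p(1-p)

% Stat cues:   2(.87)(.13) \approx .23 \quad\text{and}\quad 2(.33)(.67) \approx .44
% Stat2 cues:  2(.60)(.40) = .48 \quad\text{(higher average)}
% Maximal positive error (unexpected reward): \delta^{+}_{\max} = 1 - V
% Stat 33\% cue: .67 \quad\text{vs.}\quad Stat2 60\% cue: .40 \quad\text{(higher maximum in Stat)}
```

Under this approximation the Stat environment produces the larger single prediction errors, while Stat2 produces the larger average, which is exactly the pattern described above.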
4.2. Prediction error, noradrenalin and learning rate
In the same fMRI study, Behrens et al. also provided evidence that the volatility estimate determined the subjects' learning rate. We acknowledge that modeling learning rate modulation is beyond the scope of the RVPM, as it has a constant learning rate parameter. However, a dynamic learning rate parameter would not have changed our results, given that it would have led to a faster adaptation of the RVPM to the switches in reward contingencies during the volatile period, and thus to lower average activation of the δ units, further increasing the gap between the Vol and the Stat2 condition. Moreover, using the RVPM framework, we can advance a hypothesis on learning rate modulation. Indeed, Yu and Dayan (2005) proposed that the neuromodulator noradrenalin (originating from the brainstem nucleus locus coeruleus, LC) signals unexpected uncertainty. We suggest that learning rate modulation arises from top-down influences from ACC to LC. Being so densely connected to different cortical areas, ACC would be ideally suited to provide LC with a high-level summary of cortical processing, and it indeed has projections to LC (Aston-Jones and Cohen, 2005). We propose that LC activation, with its widespread cortical projections, sets learning rates in different cortical areas, including the ACC itself (Berridge and Waterhouse, 2003; Verguts and Notebaert, 2009; Jepma and Nieuwenhuis, 2011). The explicit computational formulation of this hypothesis and its experimental test will be the focus of future research.
4.3. The Actor–Critic framework
In the framework of RL, a Critic is a system that evaluates stimuli and possible actions in terms of expected reward, while the Actor is a system that selects actions based on the evaluations provided by the Critic. According to the RVPM, the role of the ACC would be that of a Critic module in the Actor–Critic framework (see also Supplementary Material), mapping cues (e.g., external stimuli, actions) onto values indicating their worth for the organism (Rushworth and Behrens, 2008). Consistent with this, ACC receives input from (high-level) motor areas, which code for actions, and also from the posterior parietal cortex (Devinsky et al., 1995), which can be considered as input coding for external stimuli. In mammals, it is reasonable to identify the dorsolateral prefrontal cortex (DLPFC), basal ganglia and high-level motor areas (like the caudal cingulate zone, CCZ) as the Actor. Indeed, these areas receive afferents from ACC (Devinsky et al., 1995), allowing a tight interaction between the two components. Besides ACC, a number of other areas showed activity correlated with the RVPM (Table 1). Some of them are typically involved in RL, such as the VTA and ventral striatum (D'Ardenne et al., 2008). The others (namely the right middle frontal gyrus, MFG; the dorsal premotor cortex, PMd; and the TOJ) are typically not considered to be involved in reward processing. The right MFG has also been shown to code for prediction errors (Matsumoto et al., 2007), but with a less marked distinction between positive and negative prediction errors (thus encoding general surprise). Hence, we propose that the right MFG may encode prediction errors in a similar way in humans, coding mainly for the surprise of incoming events (Doricchi et al., 2010). The PMd activation may be interpreted as a reward prediction error signal afferent from ACC (Critic) to the PMd neural populations deputized to select actions (Actor).
Finally, with respect to the TOJ, a previous fMRI study found prediction error activity in an area overlapping our TOJ activation (Summerfield and Koechlin, 2008). In the latter study, the TOJ activity was related to prediction errors on expectations about the features of incoming visual stimuli, not on reward expectations. This may also be relevant in our paradigm, because regardless of the reward, the colored cues predicted the incoming visual stimulus given by the feedback shape (Fig. 3b). Thus, the TOJ activity could be interpreted as deriving from prediction errors related to the expectation of the feedback shape. It is worth mentioning that the AIns was also among the reward-related areas we found. Recently, it has been proposed that the AIns has a complementary role to ACC (Preuschoff et al., 2008): whereas ACC is involved in reward processing and represents both reward and reward prediction errors, the AIns would represent risk and risk prediction errors. The insula is also strongly involved in representing feelings and predictions of feelings in others (Singer et al., 2009). These are not necessarily contradictory findings. In the Actor–Critic framework, the Critic can be multidimensional, representing different aspects of cue valuation. Neurophysiological and lesion studies have indeed provided evidence that different cortical areas perform partially segregated Critic roles, such as reward prediction (ACC, orbitofrontal cortex, OFC) (Schoenbaum et al., 2009; Rushworth and Behrens, 2008), risk evaluation (AIns) (Preuschoff et al., 2008) and cost evaluation (ACC, OFC) (Rushworth and Behrens, 2008). More broadly, the ubiquitous finding of ACC and AIns activation in cognitive tasks is congruent with the fact that cognition and emotion are not separate faculties: in an Actor–Critic system, cognition and emotion are two related components serving the purpose of adaptive control (Dolan, 2002; Verguts and Notebaert, 2009).
Acknowledgments

MS and TV were supported by Ghent University GOA grant BOF08/GOA/011. We acknowledge the support of the Ghent University Multidisciplinary Research Platform "The integrative neuroscience of behavioural control". RS is supported by the Fund for Scientific Research (Flanders, Belgium). All authors declare no conflict of interest. The authors thank Fabrizio Doricchi, Marcel Brass, Emiliano Macaluso and Anand Ramamoorthy for useful comments on this work.
Supplementary material

Supplementary data related to this article can be found online at doi:10.1016/j.cortex.2012.05.008.
References
Alexander WH and Brown JW. Medial prefrontal cortex as an action-outcome predictor. Nature Neuroscience, 14(10): 1338–1344, 2011.
Amiez C, Joseph JP, and Procyk E. Reward encoding in the monkey anterior cingulate cortex. Cerebral Cortex, 16(7): 1040–1055, 2006.
Aston-Jones G and Cohen JD. An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28: 403–450, 2005.
Behrens TE, Woolrich MW, Walton ME, and Rushworth MF. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9): 1214–1221, 2007.
Berridge CW and Waterhouse BD. The locus coeruleus-noradrenergic system: Modulation of behavioral state and state-dependent cognitive processes. Brain Research. Brain Research Reviews, 42(1): 33–84, 2003.
Botvinick M, Braver TS, Barch DM, Carter CS, and Cohen JD. Conflict monitoring and cognitive control. Psychological Review, 108(3): 624–652, 2001.
Brett M, Anton J, Valabregue R, and Poline J. Region of interest analysis using an SPM toolbox [abstract]. Presented at the 8th International Conference on Functional Mapping of the Human Brain. NeuroImage, 2002.
Brown JW and Braver TS. Learned predictions of error likelihood in the anterior cingulate cortex. Science, 307(5712): 1118–1121, 2005.
Critchley HD, Tang J, Glaser D, Butterworth B, and Dolan RJ. Anterior cingulate activity during error and autonomic response. NeuroImage, 27(4): 885–895, 2005.
D'Ardenne K, McClure SM, Nystrom LE, and Cohen JD. BOLD responses reflecting dopaminergic signals in the human ventral tegmental area. Science, 319(5867): 1264–1267, 2008.
Devinsky O, Morrell MJ, and Vogt BA. Contributions of anterior cingulate cortex to behaviour. Brain, 118(Pt 1): 279–306, 1995.
Dolan RJ. Emotion, cognition, and behavior. Science, 298(5596): 1191–1194, 2002.
Doricchi F, Macci E, Silvetti M, and Macaluso E. Neural correlates of the spatial and expectancy components of endogenous and stimulus-driven orienting of attention in the Posner task. Cerebral Cortex, 20(7): 1574–1585, 2010.
Jepma M and Nieuwenhuis S. Pupil diameter predicts changes in the exploration–exploitation tradeoff: Evidence for the adaptive gain theory. Journal of Cognitive Neuroscience, 23(7): 1587–1596, 2011.
Jessup RK, Busemeyer JR, and Brown JW. Error effects in anterior cingulate cortex reverse when error likelihood is high. Journal of Neuroscience, 30(9): 3467–3472, 2010.
Kennerley SW, Behrens TE, and Wallis JD. Double dissociation of value computations in orbitofrontal and anterior cingulate neurons. Nature Neuroscience, 14(12): 1581–1589, 2011.
Knutson B, Westdorp A, Kaiser E, and Hommer D. FMRI visualization of brain activity during a monetary incentive delay task. NeuroImage, 12(1): 20–27, 2000.
Matsumoto K, Suzuki W, and Tanaka K. Neuronal correlates of goal-based motor selection in the prefrontal cortex. Science, 301(5630): 229–232, 2003.
Matsumoto M, Matsumoto K, Abe H, and Tanaka K. Medial prefrontal cell activity signaling prediction errors of action values. Nature Neuroscience, 10(5): 647–656, 2007.
Nichols T, Brett M, Andersson J, Wager T, and Poline JB. Valid conjunction inference with the minimum statistic. NeuroImage, 25(3): 653–660, 2005.
Nunez Castellar E, Kuhn S, Fias W, and Notebaert W. Outcome expectancy and not accuracy determines posterror slowing: ERP support. Cognitive, Affective & Behavioral Neuroscience, 10(2): 270–278, 2010.
Oliveira FT, McDonald JJ, and Goodman D. Performance monitoring in the anterior cingulate is not all error related: Expectancy deviation and the representation of action-outcome associations. Journal of Cognitive Neuroscience, 19(12): 1994–2004, 2007.
Preuschoff K, Quartz SR, and Bossaerts P. Human insula activation reflects risk prediction errors as well as risk. Journal of Neuroscience, 28(11): 2745–2752, 2008.
Rescorla RA and Wagner AR. A theory of Pavlovian conditioning: Variation in the effectiveness of reinforcement and nonreinforcement. In Black AH and Prokasy WF (Eds), Classical Conditioning II: Current Research and Theory. New York: Appleton-Century-Crofts, 1972: 64–99.
Ridderinkhof KR, Ullsperger M, Crone EA, and Nieuwenhuis S. The role of the medial frontal cortex in cognitive control. Science, 306(5695): 443–447, 2004.
Rushworth MF and Behrens TE. Choice, uncertainty and value in prefrontal and cingulate cortex. Nature Neuroscience, 11(4): 389–397, 2008.
Schoenbaum G, Roesch MR, Stalnaker TA, and Takahashi YK. A new perspective on the role of the orbitofrontal cortex in adaptive behaviour. Nature Reviews Neuroscience, 10(12): 885–892, 2009.
Silvetti M, Seurinck R, and Verguts T. Value and prediction error in the medial frontal cortex: Integrating the single-unit and systems levels of analysis. Frontiers in Human Neuroscience, 5(75), 2011.
Singer T, Critchley HD, and Preuschoff K. A common role of insula in feelings, empathy and uncertainty. Trends in Cognitive Sciences, 13(8): 334–340, 2009.
Summerfield C and Koechlin E. A neural representation of prior information during perceptual inference. Neuron, 59(2): 336–347, 2008.
Sutton RS and Barto AG. Reinforcement Learning: An Introduction. Cambridge (MA): MIT Press, 1998.
Verguts T and Notebaert W. Adaptation by binding: A learning account of cognitive control. Trends in Cognitive Sciences, 13(6): 252–257, 2009.
Williams SM and Goldman-Rakic PS. Widespread origin of the primate mesofrontal dopamine system. Cerebral Cortex, 8(4): 321–345, 1998.
Yu AJ and Dayan P. Uncertainty, neuromodulation, and attention. Neuron, 46(4): 681–692, 2005.