Functions of striatal dopamine in learning and planning
Pergamon
PII: S0306-4522(00)00554-6
Neuroscience Vol. 103, No. 1, pp. 65–85, 2001
© 2001 IBRO. Published by Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0306-4522/01 $20.00+0.00
www.elsevier.com/locate/neuroscience
MODELING FUNCTIONS OF STRIATAL DOPAMINE MODULATION IN LEARNING AND PLANNING

R. E. SURI,a* J. BARGAS b and M. A. ARBIB a

a USC Brain Project, Los Angeles, CA 90089-2520, USA
b Departamento de Biofísica, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City DF, Mexico
Abstract: The activity of midbrain dopamine neurons is strikingly similar to the reward prediction error of temporal difference reinforcement learning models. Experimental evidence and simulation studies suggest that dopamine neuron activity serves as an effective reinforcement signal for learning of sensorimotor associations in striatal matrisomes. In the current study, we simulate dopamine neuron activity with the extended temporal difference model of Pavlovian learning and examine the influences of this signal on medium spiny neurons in striatal matrisomes. The modeled influences include transient membrane effects of dopamine D1 receptor activation, dopamine-dependent long-term adaptations of corticostriatal transmission, and effects of dopamine on rhythmic fluctuations of the membrane potential between an elevated "up-state" and a hyperpolarized "down-state". The predominant activity in the striatal matrisomes is assumed to elicit behaviors via projections from the basal ganglia to the thalamus and the cortex. This "standard model" performs successfully when tested for sensorimotor learning and goal-directed behavior (planning). To investigate the contributions of our model assumptions to learning and planning, we test the performance of several model variants that each lack one of these mechanisms. These simulations show that the adaptation of the dopamine-like signal is necessary for sensorimotor learning and planning. Sensorimotor learning requires dopamine-dependent long-term adaptation of corticostriatal transmission. Lack of dopamine-like novelty responses decreases the number of exploratory acts, which impairs planning capabilities. The model loses its planning capabilities if the dopamine-like signal is simulated with the original temporal difference model, because the original temporal difference model does not form novel associative chains. Transient membrane effects of the dopamine-like signal on striatal firing substantially shorten the reaction time in the planning task. The capability for planning is improved by influences of dopamine on the durations of membrane potential fluctuations and by manipulations that prolong the reaction time of the model. These results suggest that responses of dopamine neurons to conditioned stimuli contribute to sensorimotor reward learning, novelty responses of dopamine neurons stimulate exploration, and transient dopamine membrane effects are important for planning. © 2001 IBRO. Published by Elsevier Science Ltd. All rights reserved.

Key words: reinforcement learning, temporal difference learning, internal model, striatum, novelty.
Midbrain dopamine neurons are phasically activated by unpredicted rewards or by the first sensory event that allows the animal to predict the reward, but they do not respond to predicted rewards. When a predicted reward is omitted, their activity is depressed at the time when the reward fails to occur.55 The reward prediction error of temporal difference models (TD models) reproduces these features of dopamine neuron activity.42,57,63,64,68 Dopamine neurons also respond to novel, physically salient stimuli with action potential bursts followed by a decrease in activity below baseline levels.55 These biphasic novelty responses diminish with repeated stimulus presentations. TD models reproduce these characteristics of dopamine novelty responses only if the associative weights of stimulus onsets are initialized with positive values.64 In addition to associations between stimuli and rewards, humans and animals associate sensory events (stimuli, rewards or behavioral responses) with other sensory events, and use these associations to form novel associative chains.5,19,22,41,47,83 Consequently, the TD model was extended to learn predictions for behaviorally relevant stimuli and for different reinforcers.67,70 In order to form novel associative chains, this approach learns an "internal model of its environment" that emulates the temporal development of real-world processes. Since there is evidence that dopamine neuron activity is influenced by the formation of novel associative chains,84 dopamine neuron activity was modeled with such an internal model approach (Suri, submitted). Midbrain dopamine neurons project to the striatum, where they seem to influence long-term adaptation of corticostriatal transmission.55 In vitro, induction of long-term depression (LTD) by tetanic stimulation seems to require endogenous dopamine, depolarization of the postsynaptic membrane, Ca2+ influx through L-type
*Corresponding author. Present address: Computational Neurobiology Laboratory, Salk Institute, Post Office Box 85800, San Diego, CA 92186-5800, USA. Tel.: +1-858-453-4100 ext. 1215; fax: +1-858-587-0417. E-mail address: [email protected] (R. E. Suri).
Abbreviations: GPe, globus pallidus exterior; GPi, globus pallidus interior; LTD, long-term depression; LTP, long-term potentiation; NMDA, N-methyl-D-aspartate; SNr, substantia nigra pars reticulata; STN, subthalamic nucleus; TD, temporal difference.
Fig. 1. Interactions between cortex, basal ganglia and midbrain dopamine neurons mimicked by Actor–Critic models.36,42,57,63,64 The striatum receives projections from somatosensory cortices that contain neurons reporting about sensory stimuli.2 The striatum is divided into striosomes (patches) and matrisomes (matrix).32 The prefrontal and insular cortices chiefly project to striosomes, whereas sensory and motor cortices chiefly project to matrisomes.32 Midbrain dopamine neurons are contacted by medium spiny neurons in striosomes and project to both striatal compartments.32,60 Striatal matrisomes directly inhibit the basal ganglia output nuclei GPi and SNr, whereas they indirectly disinhibit these output nuclei via the GPe and STN.3,1 The basal ganglia output nuclei project via thalamic nuclei to motor, oculomotor, prefrontal and limbic cortical areas.3
Ca2+ channels and activation of metabotropic glutamate receptors.9,12,18 In contrast, the same tetanic stimulation protocol leads to N-methyl-D-aspartate (NMDA)-dependent long-term potentiation (LTP) for striatal neurons of mutant mice lacking the D2 receptor.13 Also, pulsatile dopamine application during tetanic stimulation reverses LTD to LTP.77 Furthermore, removal of the magnesium block was shown to reverse post-tetanic LTD to dopamine D1 receptor-dependent LTP.11,13,14 Taken together, D2 receptor activation seems to be crucial for LTD, whereas D1 receptor activation and NMDA receptor-mediated responses may be more important for LTP. Simulation studies with TD models suggest that dopamine-dependent plasticity in corticostriatal synapses may be important for sensorimotor learning.36,63,64 In such models, the TD model is termed the "Critic" and the model component that learns sensorimotor associations is termed the "Actor". The Critic was related to pathways from cortex via striatal striosomes to midbrain dopamine neurons, and the Actor to pathways from cortex via striatal matrisomes to motor centers (Fig. 1).36,42,57,63,64 In addition to corticostriatal plasticity, dopamine transiently modulates the firing rate of striatal medium spiny neurons.16,55 According to in vitro studies, dopamine D1 class receptor activation attenuates the firing rate if firing is evoked from hyperpolarized potentials,10,15,34,45,52,66 but enhances firing if it is evoked from elevated holding potentials.35,65 Also, in vivo, dopamine neuron bursts modulate firing rates of striatal neurons through activation of D1 class receptors.31 These effects of dopamine crucially depend on the membrane potential of striatal medium spiny neurons, which fluctuates in vivo with a frequency of about 1.2 Hz between a hyperpolarized down-state and a depolarized up-state.61,78,82 The up-state seems to be induced and maintained by activity of corticostriatal neurons.38,61,82 Membrane potential fluctuations were usually recorded in anesthetized animals, but also seem to occur in awake animals, although the durations of both states may be less regular.80,81
Fig. 2. (A) Configuration of the T-maze to test planning and sensorimotor learning in rats. (B) Task to test planning and sensorimotor learning. The task is composed of three consecutive phases. Top: exploration phase. When stimulus blue is presented, the model selects with equal chance the act left or the act right. Act left is followed by presentation of stimulus red, whereas act right is followed by presentation of stimulus green. Middle: rewarded phase. Presentation of stimulus green is followed by reward presentation. Bottom: test phase. Stimulus blue is presented to test if the model elicits the correct act right or the incorrect act left. As in the exploration phase, act left is followed by presentation of stimulus red, whereas act right is followed by presentation of stimulus green and by that of the reward.
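The three-phase protocol of Fig. 2B can be written down as a tiny tabular environment. The following Python sketch uses the stimulus and act names from the caption; the class and its interface are my own illustration, not code from the paper:

```python
# Minimal sketch of the T-maze task of Fig. 2 (interface is hypothetical).
# In state "blue", act "left" leads to stimulus "red" and act "right"
# leads to stimulus "green"; "green" is rewarded only outside the
# exploration phase.

class TMazeTask:
    TRANSITIONS = {("blue", "left"): "red", ("blue", "right"): "green"}

    def __init__(self, phase):
        assert phase in ("exploration", "rewarded", "test")
        self.phase = phase

    def step(self, stimulus, act):
        """Return (next_stimulus, reward) for one trial step."""
        nxt = self.TRANSITIONS.get((stimulus, act))
        reward = 1.0 if nxt == "green" and self.phase != "exploration" else 0.0
        return nxt, reward

# Example: in the test phase, a right turn from the start box is rewarded.
task = TMazeTask("test")
print(task.step("blue", "right"))  # -> ('green', 1.0)
```

Note that, as in the experiment, neither act is ever rewarded during exploration; the reward only appears once "green" has acquired value.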
In a broad spectrum of situations, animals learn act outcomes and select their acts based on the current motivational value of these outcomes (reviewed in Refs 6, 22, 23 and 41). Such goal-directed behavior was termed "planning" in reinforcement learning studies and "cognition" in animal learning studies.19,23,67,69,70 Planning and sensorimotor learning were demonstrated for rats in T-maze experiments (Fig. 2A; reviewed in Refs 40 and 72). This experiment consists of three phases (Fig. 2B). In the exploration phase, the rat is repeatedly placed in the start box, where it can go left or right without seeing the two goal boxes at the end of the maze. When the rat turns to the left it reaches the red goal box, and if it turns to the right it reaches the green goal box. No rewards are presented in this exploration phase. In the rewarded phase, the rat is fed in the green goal box. In the test phase, the rat is returned to the start of the T-maze. In the first trial of the test phase, the majority of the rats turn right. Note that neither the act of turning right nor the act of turning left is ever temporally associated with reward. It was concluded that the rat forms a novel associative chain between its own act, the color of the box and the reward. Moreover, the rat selects its act in relation to the outcome predicted by this novel associative chain. Thus, the rat demonstrates, in this first test phase trial, its
Fig. 3. Model architecture (compare with Fig. 1). The extended TD model serves as the Critic component (gray box), and the Actor component (remaining architecture) elicits acts. Actor: sensory stimuli influence the membrane potentials of two medium spiny projection neurons in striatal matrisomes (large circles). The membrane potentials of these neurons are also influenced by fluctuations between an elevated up-state and a hyperpolarized down-state that are simulated with the functions s1(t) and s2(t). Adaptations in corticostriatal weights (filled dots) and dopamine D1 receptor-mediated membrane effects are influenced by the membrane potential and the dopamine-like signal DA(t) (open dots). The firing rates y1(t) and y2(t) of both striatal neurons inhibit the basal ganglia output nuclei SNr and GPi. An indirect disinhibitory pathway from the striatum via the GPe and STN to the GPi/SNr suppresses insignificant inhibitions in the basal ganglia output nuclei.7 The winning inhibition disinhibits the thalamus. These signals in the thalamus only lead to acts, coded by the signals act1(t) and act2(t), if they are sufficiently strong and persistent. This is accomplished by integrating the cortical signals and eliciting acts when these signals reach a threshold. Critic: the Critic computes the dopamine-like reward prediction error DA(t) from the sensory stimuli, the reward signal, the thalamic signals (multiplied with the salience a), and the act signals act1(t) and act2(t).
capability for planning. Based on such experimental findings, animal learning theorists suggest that animals form an internal model of their environment that allows them to predict the sensory consequences of their acts and to form novel associative chains. Furthermore, animals seem to use these predictions to select their acts.6,23,67,70 These insights led to the implementation of neural network architectures in which an internal model, serving as the Critic, transiently influences the Actor to elicit acts.67 In test phase trials, the rat is fed in the green goal box but not in the red goal box. The more test phase trials are presented, the higher the probability that the rat turns right. Since the rat is rewarded after right turns, this progressive performance improvement is interpreted as a result of sensorimotor learning.72 Several studies demonstrated that dopamine is involved in planning tasks.39,53,71,76 Dopamine may guide action selection in these tasks because dopamine bursts transiently influence striatal activity and dopamine concentration is influenced by the formation of novel associative chains. To test this hypothesis, we extend previous Actor–Critic models by simulating transient membrane effects of dopamine neuron activity on medium spiny neurons in striatal matrisomes. We then test the performance of this model in a task that assesses planning and sensorimotor learning.62

MODEL EQUATIONS
We model the influence of dopamine neuron activity on medium spiny neurons in striatal matrisomes in concert with the sensorimotor components of the basal ganglia–thalamocortical system (Fig. 3). To compute the dopamine-like signal we adapt the previously proposed internal model approach (Suri, submitted). This model is termed the "extended TD model" and serves as the Critic. We propose a firing rate model for striatal medium spiny neurons that mimics membrane potential fluctuations, as well as the effects of dopamine on membrane properties and on corticostriatal transmission. To model the transient influences of dopamine on membrane properties of medium spiny neurons in striatal matrisomes, we simulate the in vitro finding that activation of D1 class dopamine receptors decreases firing evoked from resting potentials but increases firing evoked from elevated holding potentials. Likewise, we assume that the long-term effects of dopamine on corticostriatal transmission depend on the postsynaptic membrane potential, which is influenced by synaptic inputs, by dopamine levels, and by rhythmic fluctuations of about 1 Hz between a depolarized up-state and a hyperpolarized down-state.

Since one out of two acts can be selected in the simulated experiment, we model dopamine-dependent influences on two neuron populations in striatal matrisomes. Each neuron population is thought to correspond to a small population of medium spiny neurons with highly correlated activations that is able to elicit one of these acts. We assume that the simulated signals in the basal ganglia–thalamocortical pathway correspond to neural activities of similar populations of neurons. For simplicity, we call these neuron populations "(simulated) neurons". We assume that only the predominant striatal firing rates are represented in the basal ganglia output nuclei because of interactions between the direct and the indirect pathways. Strong and persistent depressions of firing rates in these basal ganglia output nuclei elicit acts via thalamocortical projections. The model is implemented in time steps of 100 ms to
Fig. 4. (A) Model for effects of dopamine D1 class receptor activation on the firing rate of medium spiny neurons in vitro. The membrane potential Esub(t) depends on the constant resting membrane potential Erest, and on the product of the injected current I(t) and the resistance R. The membrane potential Esub(t) and dopamine D1 agonist concentration DA(t) influence the value of the signal Wmem(t). The firing rate y(t) is a monotonically increasing function of the membrane potential Esub(t) and the signal Wmem(t). (B) Model for dopamine membrane effects and synaptic effects for a medium spiny neuron in vivo. As in the model for the in vitro findings, the transient effect of dopamine on the striatal membrane potential via D1 class receptor activation is mimicked with the dopamine membrane effect signal Wmem(t). The weight Wsyn(t) coding for the gain of corticostriatal transmission is adapted according to dopamine concentration, time-averaged membrane potential and presynaptic activity. As dopamine concentration will be modeled with the firing rate of dopamine neurons [Eq. (17)], the signal coding for dopamine neuron activity DA(t) influences Wmem(t) and Wsyn(t). Membrane potential fluctuations are simulated with a rhythmically fluctuating signal s(t).
reduce computation time (about 200 h for the results shown, on a Sun Ultra 1) and is limited to physiological processes that are slower than 100 ms. We propose a firing rate model that does not take account of membrane current integrations. The model was implemented in the programming language Matlab (code available at www.cnl.salk.edu/~suri). We plan to translate the Matlab code to Neural Simulator Language (NSLJ), which will be executable on-line with standard web browsers (Suri et al., in preparation at http://wwwhpb.usc.edu/Projects/bmw.htm/).

Striatal membrane effects of dopamine D1 agonists in vitro

Effects of neuromodulators on neuronal membrane properties have been simulated with various modeling techniques.28 Although we would prefer to use a biophysical Hodgkin–Huxley type model for the effects of dopamine D1 receptor activation on single ionic currents, we are not aware of such a model for striatal medium spiny neurons.
Unfortunately, neither the time-course of dopamine D1 receptor effects on single ionic currents nor the quantitative contribution of the single ionic currents is known. For the current study, the conditions shown in figure 1 in Hernandez-Lopez et al.35 are particularly interesting because they resemble membrane potential oscillations in vivo. In that study, the effects of dopamine D1 class agonists were investigated for firing evoked by 200- to 300-ms current steps at a frequency of 0.1–0.2 Hz. Bath application of D1 class agonists attenuated firing when evoked from resting membrane potentials, but increased firing evoked from elevated membrane potentials. We propose a crude phenomenological model for this situation (Fig. 4A). Since the effect of dopamine D1 class receptor activation on evoked firing switches its sign depending on the holding potential, we introduce a reverse potential (Table 1). This reverse potential corresponds to the hypothetical holding potential between resting potential and firing threshold for which the effect of D1 activation on evoked firing vanishes as it reverses its sign. The term reverse potential should not be confused with the biophysically defined term reversal potential, which is defined by a current or voltage reversal. The influence of dopamine D1 receptor activation on the firing rate during a current step is modeled using a slowly varying parameter Wmem(t) that is initialized with a value of zero and then adapted with

Wmem(t) = d · Wmem(t − 100) + h · DA(t − 100) · [E(t − 100) − reverse_potential].   (1)

The time t is given in units of ms throughout this paper. The constant d denotes the decay rate of the effects of D1. Since the effects of the D1 agonist decay to a value of 40% in 10 min,35 we estimate the value of the decay rate d of the dopamine membrane effects to be 0.99985, which corresponds to a 0.015% decrease for each 100 ms. A signal DA(t) corresponds to the dopamine D1 agonist concentration and the parameter h is a scaling factor (for values, see section Simulations). A signal E(t) denotes the time-averaged membrane potential in mV and is defined below. The absolute value of the dopamine membrane effects Wmem(t) is limited to Wmem,max (Table 1), as this value is appropriate to reproduce the maximal amplitude of the effects of the D1 agonist. We estimate the membrane potential Esub(t) for values below the firing threshold. This signal does not have a physiological meaning when the neuron fires. This membrane potential Esub(t) is computed from the resting membrane potential Erest, from the resistance R, and from the injected current I(t) according to Ohm's law with

Esub(t) = Erest + R · I(t).   (2)
From figure 1 in Hernandez-Lopez et al.,35 the value of the resting potential is estimated as Erest = −82 mV and the value of the resistance R as 27 MΩ. The latter value does not correspond to a biophysical property of the neuron, but depends on the electrode used. Since the seal between the intracellular sharp electrode used and the neural membrane is far from complete, a major part of the injected current I(t) does not reach the inside of the neuron. As we assume that the firing rate increases with increasing values of the membrane potential Esub(t) and of the parameter Wmem(t), we compute the firing rate y(t) of the striatal neuron with

y(t) = ymax × tanh{a × [Esub(t) + Wmem(t) − firing_threshold] / ymax}.   (3)
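Eqs (1)–(4) amount to a few arithmetic updates per 100-ms step. The Python sketch below paraphrases the in vitro model; parameter values follow the text where given, while the in vitro reverse potential (here −58 mV, just below the −56 mV threshold) and the unit conventions are my assumptions:

```python
import math

# Sketch of the in vitro D1 membrane-effect model, Eqs (1)-(4).
D_DECAY = 0.99985   # decay of W_mem per 100-ms step (agonist effect decays to 40% in 10 min)
H_SCALE = 180.0     # scaling factor h
W_MAX = 9.0         # limit on |W_mem|
E_REST = -82.0      # resting potential, mV
R_EFF = 27.0        # effective resistance, MOhm (electrode-dependent, not biophysical)
FIRE_THR = -56.0    # in vitro firing threshold, mV
REV_POT = -58.0     # reverse potential, mV (assumption: just below FIRE_THR)
Y_MAX = 6.0         # maximal firing rate, spikes / 100 ms
A_SCALE = 0.3       # scaling factor a, spikes / (100 ms * mV)
SPIKE_MV = 6.0      # A / 100 ms: average elevation per spike, mV

def e_sub(i_nA):
    # Eq. (2): subthreshold membrane potential from Ohm's law (MOhm * nA -> mV).
    return E_REST + R_EFF * i_nA

def firing_rate(e, w_mem):
    # Eq. (3): saturating firing rate; no firing when the drive is subthreshold.
    drive = A_SCALE * (e + w_mem - FIRE_THR)
    return Y_MAX * math.tanh(drive / Y_MAX) if drive > 0.0 else 0.0

def e_avg(e, y):
    # Eq. (4): time-averaged membrane potential, including spike contributions.
    return FIRE_THR + y * SPIKE_MV if e > FIRE_THR else e

def update_w_mem(w_mem, da, e_prev):
    # Eq. (1): slowly varying D1 membrane effect, clipped to |W_mem| <= W_MAX.
    w = D_DECAY * w_mem + H_SCALE * da * (e_prev - REV_POT)
    return max(-W_MAX, min(W_MAX, w))
```

With DA(t) = 0 the membrane effect decays away; a positive agonist signal combined with a depolarized E(t) drives Wmem(t) positive and enhances evoked firing, reproducing the sign switch described above.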
From figure 1 in Hernandez-Lopez et al.,35 the firing threshold is estimated to be −56 mV. This value is not equal to the average firing threshold in vivo and will be adjusted for the in vivo model. The hyperbolic tangent tanh is a sigmoid function, and the function ymax × tanh{·/ymax} smoothly limits the firing rate y(t) to values below the maximal firing rate ymax of medium spiny neurons (Table 1). A factor a is used to scale the firing rate to experimental data and is estimated from firing rates for constant-current injections in the absence of dopamine agonists (Table 1). Note that Eq. (3) does not take into account that minor firing rate adaptations occur during a few hundred milliseconds after current step injections.35,48

For the adaptation rule [Eq. (1)], we estimate the time-averaged membrane potential E(t). Eq. (2) computes the membrane potential Esub(t) only for values below the firing threshold. Since the 100-ms step size of our implementation is too long to simulate single spikes, we approximate the time-averaged membrane potential E(t) for a certain firing rate with the average membrane potential of real neurons with this firing rate. This is achieved by computing the contribution of measured action potentials to the average membrane potential. We estimate for a single spike an area A (Table 1) as the integral over time of the difference between the membrane potential and the firing threshold. Therefore, the time-averaged membrane potential is

E(t) = firing_threshold + y(t) × A / 100 ms,  if Esub(t) > firing_threshold;
E(t) = Esub(t),  otherwise.   (4)

For membrane potentials Esub(t) below the firing threshold, this equation sets the time-averaged membrane potential E(t) to Esub(t), whereas for the firing neuron, E(t) is set to the time-averaged membrane potential of real neurons with this firing rate.

Striatal dopamine modulation in vivo

Dopamine membrane effects. We adapt the proposed model for dopamine membrane effects mediated by D1 class receptors in vitro [Eqs (1)–(4)] to simulate the membrane effects of dopamine D1 receptor activation in vivo. Some parameter values are adapted to the in vivo condition. As effects of dopamine neuron bursts decay much faster in vivo than those of dopamine bath
application in vitro,31,35,65,79 the decay rate d [Eq. (1)] is set to the value reported for dopamine bursts in vivo31 (Table 1). Furthermore, the firing threshold for medium spiny neurons is set to the average value for medium spiny neurons (Table 1). As for the in vitro simulation, the reverse potential is set to a value just below firing threshold (Table 1).

Dopamine modulation of corticostriatal transmission. To simulate dopamine-dependent long-term plasticity of corticostriatal transmission (Fig. 4B), we express long-term effects of dopamine on corticostriatal transmission in the dynamics of adaptive weights.42,57 Motivated by experimental findings (see Introduction), we assume that the direction of long-term effects of dopamine depends both on the membrane potential of the postsynaptic medium spiny neuron and on the dopamine concentration:

Wsyn(t) = Wsyn(t − 100) + ε · DA(t − 100) · [E(t − 100) − synaptic_reverse_potential] · u(t − 100).   (5)

The parameter ε denotes the adaptation rate of the corticostriatal weight (Table 1). The signal DA(t) represents the normalized dopamine concentration. This signal is zero for average physiological concentrations, negative for lower and positive for higher concentrations. As long-term adaptation in corticostriatal transmission presumably requires presynaptic activity, the adaptation of the corticostriatal weight Wsyn(t) is proportional to the presynaptic firing rate u(t − 100). E(t) denotes the time-averaged membrane potential of the striatal medium spiny neuron [Eq. (4)]. The synaptic reverse potential is the membrane potential for which synaptic adaptation switches its sign and should not be confused with the biophysically defined synaptic reversal potential. Since dopamine membrane effects may mediate long-term adaptations in corticostriatal transmission, Eq. (5) is somewhat similar to Eq. (1), and the value of the synaptic reverse potential is set to a similar value as the reverse potential for dopamine membrane effects (Table 1). In contrast to Eq. (1), we assume in Eq. (5) that the synaptic adaptations do not decay during the simulated experiment.

Striatal membrane potential fluctuations. We assume that the simulated medium spiny neurons switch their state every 400 ms when dopamine D1 receptor-mediated membrane effects and corticostriatal inputs are at baseline levels [Wmem(t) = 0 and Wsyn(t)u(t) = 0]. Furthermore, we assume that effects enhancing firing rate or membrane potential [Wmem(t) > 0 or Wsyn(t)u(t) > 0] prolong the duration of the up-state and shorten the duration of the down-state, whereas effects decreasing firing rate or membrane potential [Wmem(t) < 0, Wsyn(t)u(t) < 0] have the opposite effects on state durations. This is accomplished with the fluctuating function

s(t) = Eup, during {400 ms + [Wmem(t) + Wsyn(t)u(t)] · w};
s(t) = Edown, during {400 ms − [Wmem(t) + Wsyn(t)u(t)] · w}.   (6)
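The plasticity rule of Eq. (5) and the state-duration rule of Eq. (6) can be sketched as follows (values of ε, the synaptic reverse potential and w are from Table 1; the function names are my own):

```python
# Eq. (5): dopamine-dependent corticostriatal weight update (no decay).
EPS = 0.45            # learning rate epsilon (Table 1)
SYN_REV_POT = -41.0   # synaptic reverse potential, mV (Table 1)

def update_w_syn(w_syn, da_prev, e_prev, u_prev):
    # da_prev: normalized dopamine (0 at baseline); e_prev: time-averaged
    # membrane potential (mV); u_prev: presynaptic firing rate.
    return w_syn + EPS * da_prev * (e_prev - SYN_REV_POT) * u_prev

# Eq. (6): up-/down-state durations modulated by W_mem and W_syn * u.
W_MOD = 0.1           # parameter w (Table 1)

def state_durations(w_mem, w_syn_u):
    drive = (w_mem + w_syn_u) * W_MOD
    up = 400.0 + drive     # duration of the up-state, ms
    down = 400.0 - drive   # duration of the down-state, ms
    return up, down
```

For example, a positive dopamine transient while the neuron is depolarized above −41 mV strengthens the weight, whereas the same transient at a hyperpolarized potential weakens it; at baseline inputs the two states simply alternate every 400 ms.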
Table 1. Standard model parameters for striatal medium spiny neurons in vivo and the basal ganglia–thalamocortical pathway (values are not always equal to values for the simulation of Fig. 7; see section Striatal dopamine modulation in vivo)

| Name | Symbol | Value | Choice of value |
|---|---|---|---|
| D1 membrane effects | | | |
| Decay of dopamine D1 membrane effects | d | 0.8/100 ms | From Ref. 31 |
| Maximal absolute value of the D1 membrane effect Wmem(t) | Wmem,max | 9 | From in vitro simulation |
| Initial value of the D1 membrane effects Wmem,1(t) and Wmem,2(t) | Wmem,1(t = 0), Wmem,2(t = 0) | 0 | From in vitro simulation |
| Maximal firing rate | ymax | 6 spikes/100 ms | From Refs 4 and 44 |
| Scaling factor | a | 0.3 spikes × (100 ms · mV)^−1 | From Ref. 44 |
| Firing threshold | firing_threshold | −46 mV | Average firing threshold (Ref. 78) |
| Reverse potential of D1 membrane effect | reverse_potential | −48 mV | Just below firing threshold |
| Amplitude of dopamine D1 membrane effects | h | 180 | Appropriate for planning |
| Area below one spike | A | 6 mV × 100 ms | Figure 3F in Ref. 78 |
| Corticostriatal transmission | | | |
| Learning rate for corticostriatal learning | ε | 0.45 | Chosen to induce sensorimotor learning in a few trials |
| Synaptic reverse potential | synaptic_reverse_potential | −41 mV | Appropriate for simulated task |
| Initial corticostriatal weights stimulus blue–act left and stimulus blue–act right | Wsyn,left(0), Wsyn,right(0) | 5 | To induce a few exploratory acts in exploration phase |
| Membrane potential fluctuations | | | |
| Membrane potential in up-state | Eup | −47 mV | From Ref. 61 |
| Membrane potential in down-state | Edown | −69 mV | From Ref. 61 |
| Lower limit of membrane potential | Emin | −82 mV | From Ref. 61 |
| Dopamine modulation of up- and down-state durations | w | 0.1 | To get a small but significant effect on model performance as compared to w = 0 |
| Thalamus–cortex | | | |
| Integration constant for leaky integrator | λact | 0.7/100 ms | For plausible reaction time in planning |
| Act threshold | actthres | 2.6 | For plausible reaction time in planning |
| Salience of thalamic activity for extended TD model | a | 0.1 | Appropriate for successful planning |
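For convenience, the standard values of Table 1 can be collected in a single configuration object. This is only an organizational sketch; the field names are my own, and rate-like quantities are expressed per 100-ms step:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandardParams:
    """Standard in vivo parameter values of Table 1 (field names are mine)."""
    # D1 membrane effects
    d: float = 0.8                  # decay of D1 membrane effects per 100-ms step
    w_mem_max: float = 9.0
    y_max: float = 6.0              # spikes / 100 ms
    a_scale: float = 0.3            # spikes / (100 ms * mV)
    firing_threshold: float = -46.0 # mV
    reverse_potential: float = -48.0  # mV
    h: float = 180.0
    spike_area: float = 6.0         # mV * 100 ms
    # Corticostriatal transmission
    eps: float = 0.45
    synaptic_reverse_potential: float = -41.0  # mV
    w_syn_init: float = 5.0
    # Membrane potential fluctuations
    e_up: float = -47.0             # mV
    e_down: float = -69.0           # mV
    e_min: float = -82.0            # mV
    w: float = 0.1
    # Thalamus-cortex
    lambda_act: float = 0.7         # leaky-integrator constant per 100-ms step
    act_thres: float = 2.6
    salience: float = 0.1

P = StandardParams()
```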
Parameters Eup and Edown denote the average membrane potentials in the up-state and in the down-state, respectively (Table 1). A parameter w scales the influence of the dopamine membrane effects Wmem(t) and the corticostriatal activation Wsyn(t)u(t) on the typical state duration of 400 ms (Table 1).

Membrane potential of medium spiny neurons. Below firing threshold, we assume that the membrane potential Esub(t) is determined by membrane potential fluctuations s(t), the presynaptic activity u(t), and the corticostriatal weight Wsyn(t) with

Esub(t) = s(t) + Wsyn(t) · u(t).   (7)

To avoid decreasing the simulated membrane potential Esub(t) below physiological values, it is limited to values above the constant Emin (Table 1). For simulations of in vivo conditions, Eq. (7) replaces Eq. (2), since Eq. (2) describes the current injection in vitro.

Basal ganglia–thalamus–cortex

In this section, we present the model for the influence of sensory stimuli on activity in striatal matrisomes and, via basal ganglia–thalamocortical pathways, on motor acts (compare Fig. 3). Since only two acts are important for the simulated task, two striatal medium spiny neurons are simulated, and each of them can elicit one act. The effects of dopamine on corticostriatal transmission and dopamine membrane effects on the firing rates y1(t) and y2(t) of the two simulated medium spiny neurons are simulated by using, for each neuron, Eqs (1) and (3)–(7). We assume that the firing rates of corticostriatal neurons code for the visual stimuli. Thus, we replace the presynaptic activity u(t) in Eqs (5)–(7) with the signal for the visual stimulus blue delayed by 100 ms [u(t) := u1(t − 100)]. The delay of 100 ms accounts for delays of the sensory systems. Since associations of the stimuli green and red with the acts are not required, the simulated neurons do not receive inputs reporting about these stimuli. As Eqs (8)–(11) are linear in the firing rates, we can simplify the equations without loss of generality and arbitrarily normalize the firing rates in the globus pallidus interior (GPi)/substantia nigra pars reticulata (SNr), thalamus and cortex. Therefore, tonic firing rates of the neurons are ignored, and the baselines and the amplitudes of the simulated firing rates are chosen arbitrarily. The model does not account for direct interactions between medium spiny neurons. Instead, we implement a selection mechanism resembling the proposal of Berns and Sejnowski.7 We assume that firing rates GPi_SNr1(t) and GPi_SNr2(t) of two simulated neurons in the basal ganglia output nuclei are inhibited by striatal firing rates y1(t) and y2(t), but remain at baseline levels when both striatal neurons fire, due to the influence of the indirect pathway via the globus pallidus exterior (GPe) and subthalamic nucleus (STN) [y1(t) > 0 and y2(t) > 0]:

GPi_SNrn(t) = 0, if y1(t) > 0 and y2(t) > 0;
GPi_SNrn(t) = −yn(t), else.   (8)

These neurons inhibit two neurons in the thalamus:

thalamusn(t) = −GPi_SNrn(t),   for n = 1, 2.   (9)

The thalamus elicits cortical firing rates u5(t) and u6(t) that are proportional to the salience parameter a (Table 1):

u5(t) = a × thalamus1(t),
u6(t) = a × thalamus2(t).   (10)
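Eqs (7)–(10) chain into a small winner-suppression pathway from striatum to cortex. A sketch for the two simulated channels (function names are my own; Emin and the salience a are from Table 1):

```python
E_MIN = -82.0    # mV, lower limit of the membrane potential (Table 1)
SALIENCE = 0.1   # salience parameter a (Table 1)

def membrane(s, w_syn, u):
    # Eq. (7), limited below by E_MIN.
    return max(E_MIN, s + w_syn * u)

def gpi_snr(y1, y2):
    # Eq. (8): the indirect pathway keeps both outputs at baseline
    # when both striatal neurons fire; otherwise each is inhibited.
    if y1 > 0 and y2 > 0:
        return 0.0, 0.0
    return -y1, -y2

def thalamus(g1, g2):
    # Eq. (9): disinhibition of the thalamus.
    return -g1, -g2

def cortical_inputs(t1, t2):
    # Eq. (10): cortical firing rates scaled by the salience a.
    return SALIENCE * t1, SALIENCE * t2

# Only the channel with unopposed striatal firing reaches the cortex:
g = gpi_snr(2.0, 0.0)        # -> (-2.0, -0.0)
t = thalamus(*g)             # -> (2.0, 0.0)
print(cortical_inputs(*t))   # -> (0.2, 0.0)
```

Note that if both striatal channels fire [gpi_snr(2.0, 1.0)], the output nuclei stay at baseline and no act signal is passed on, which implements the suppression of insignificant inhibitions described for Fig. 3.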
These cortical firing rates serve as inputs for the extended TD model. An act is elicited when the thalamic activation of motor cortical areas is substantial and persistent enough. To take the duration and the amplitude of the thalamic activity into account, cortical activity is computed from the thalamic signal with the leaky integrator

cortexintegr,n(t) = λact × cortexintegr,n(t − 100) + thalamusn(t),   (11)

where the initial values of the cortical activations equal zero [cortexintegr,n(t = 0) = 0, for n = 1 or 2] and λact denotes an integration constant (Table 1). The two acts are coded by the binary signals act1(t) (act left) and act2(t) (act right). We say that the model is executing an act when the corresponding act signal equals 1. The act with number n is executed when the signal cortexintegr,n(t) is above the threshold parameter actthres:

actn(t + 100) = 1, if cortexintegr,n(t) > actthres;
actn(t + 100) = 0, else.   (12)

If both signals cortexintegr,1(t) and cortexintegr,2(t) are above actthres, then the act corresponding to the larger signal is selected. Since we assume that a selected act persists for 200 ms, actn(t) is set to the value of one for 200 ms.

Extended temporal difference model

The TD model assumes that each stimulus is represented with a series of short components following stimulus onset. This is achieved by mapping each stimulus to a fixed temporal pattern of phasic signals x1(t), x2(t), … that follow stimulus onset with varying delays. This temporal pattern is referred to as a "complete serial compound stimulus" or "temporal stimulus representation" (Fig. 5A). The temporal stimulus representation is used to compute the reward prediction signal with

p(t) = Σm vm(t) × xm(t),
where vm(t) are the adaptive weights (Fig. 5B). 42,57,63,68 The extended TD model (Fig. 6) closely resembles the previously proposed internal model approach (Suri, submitted). To ensure learning of the correct dopaminelike signal, we use appropriate temporal representations of the input signals, but the extended TD model otherwise does not distinguish between the stimuli, the thalamic activities, the acts and the reward. The extended TD
Fig. 5. TD model. (A) Temporal stimulus representation x1(t), x2(t) and x3(t). Stimulus u1(t) is represented over time as a series of phasic signals x1(t), x2(t) and x3(t) that cover the stimulus duration. This temporal stimulus representation is used to implement a time estimation mechanism. Dopamine neurons appear to have access to a time estimation mechanism, since dopamine neuron activity is decreased below baseline levels when an expected reward is omitted.55 (B) TD model. The temporal stimulus representation x1(t), x2(t) and x3(t) is computed for the stimulus u1(t). Each component xm(t) is multiplied with an adaptive weight vm(t) (filled dots). The reward prediction p(t) is the sum of the weighted representation components. The difference operator Δ takes temporal differences from this prediction signal (discounted with factor γ). The dopamine-like reward prediction error e(t) is computed from these temporal differences and from the reward signal. The weights vm(t) are adapted proportionally to the prediction error signal e(t) and the learning rate β.
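The mechanism of Fig. 5 can be made concrete in a short sketch. This is our own minimal illustration, not the authors' code; it assumes 100-ms time steps, a 300-ms stimulus covered by three serial-compound components, a reward arriving one step after stimulus offset, and the β and γ values of Table 3.

```python
import numpy as np

N = 3           # representation components covering a 300-ms stimulus
gamma = 0.98    # discount factor per 100-ms step (Table 3)
beta = 0.5      # learning rate (Table 3)

def run_trial(v, reward_step=3, T=6):
    """One trial of a TD model in the style of Fig. 5 (100-ms steps).

    The stimulus (onset at step 0) is represented as a complete serial
    compound: component x_m is active m steps after onset.  The prediction
    is p(t) = sum_m v_m * x_m(t); the error combines the reward with the
    discounted temporal difference of successive predictions.
    """
    p_prev, x_prev = 0.0, np.zeros(N)
    errors = []
    for t in range(T):
        x = np.zeros(N)
        if t < N:
            x[t] = 1.0                      # serial-compound representation
        p = float(v @ x)
        r = 1.0 if t == reward_step else 0.0
        e = r + gamma * p - p_prev          # TD prediction error e(t)
        v += beta * e * x_prev              # adapt the preceding component's weight
        p_prev, x_prev = p, x
        errors.append(e)
    return errors

v = np.zeros(N)
for _ in range(200):                        # repeated stimulus-reward pairings
    run_trial(v)
errors = run_trial(v)
# After learning, the error at the reward step is near zero, and the
# phasic response has transferred to stimulus onset, as for dopamine neurons.
```

With these conventions the trained onset response settles near γ³, since the earliest component has nothing preceding it that could predict the stimulus itself.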
Fig. 6. Extended TD model for two input events u1(t) and u2(t). The event signals uk(t) report about stimuli, rewards, thalamic activity and acts. Each temporal representation component xm(t) is multiplied with an adaptive weight vkm (filled dots). Event prediction pk(t) is computed from the sum of the weighted components. Event prediction pk(t) is multiplied by a small constant κ and fed back to the temporal event representation of this event uk(t). This feedback is necessary to form novel associative chains. Analogous to the TD model, the dopamine-like prediction error ek(t) is computed from the event uk(t) and from the temporal differences between successive predictions pk(t) − γ pk(t + 100) (discounted with a factor γ). The weights vkm (filled dots) are adapted as in the TD model.
model learns to predict all its input signals: the signals u1(t), u2(t) and u3(t) coding for the stimuli green, red and blue, respectively; the reward signal u4(t) = reward(t − 100); the thalamic activities u5(t) = a × thalamus1(t) and u6(t) = a × thalamus2(t); and the acts u7(t) = act1(t) and u8(t) = act2(t) (Table 2). Sensory events (acts, stimuli, reward) are coded as signals with a value of 1 when they are present and 0 when they are absent, except stimulus blue in the extended TD model. This stimulus corresponds to the start box in the T-maze. Since this place serves as a known context, novelty responses of dopamine neurons should be smaller than those elicited by the stimuli red and green (see Introduction). Therefore, the presence of stimulus blue is coded with a signal of a small positive value, referred to as the "salience of stimulus blue" (Table 3). As in the TD model, each stimulus is represented with a series of stimulus representation components that cover the duration of its presentation, in order to learn to predict when the stimulus occurs (Fig. 5A). Since the stimuli green and red are presented for 300 ms, both are represented with three temporal representation components. Stimulus green is represented with the representation components x1(t), x2(t) and x3(t), and stimulus red with the representation components x4(t), x5(t) and x6(t). Stimulus blue, which is presented for at most 600 ms, is represented with the components x7(t), x8(t), …, x12(t) (Table 2). The temporal representation components of stimulus blue are set to the value of the salience of stimulus blue when they become active, and the components are set to zero when stimulus blue is extinguished. The stimuli green and red are coded with the binary signals u1(t) and u2(t), respectively (100-ms sensory delays). The reward is not represented over time, as the reward signal reward(t) is only non-zero during the 100 ms when the reward is presented. Also, the thalamic signal is not represented over time, because this internal signal is not binary and therefore does not have clear onsets and offsets. Instead of representing these signals over time, the reward signal is copied with a delay of 100 ms to the input representation component x13(t) = reward(t − 100), and the thalamic signals to the
Table 2. Definitions of Critic input signals uk(t) and their temporal representation components xm(t) (see section Extended temporal difference model)

Event                                  Signal                         Temporal representation components
Green goal box (stimulus green)        u1(t)                          x1(t), x2(t), x3(t)
Red goal box (stimulus red)            u2(t)                          x4(t), x5(t), x6(t)
Start box (stimulus blue)              u3(t)                          x7(t), x8(t), …, x12(t)
Reward                                 u4(t) = reward(t − 100)        x13(t)
Thalamic activity related to act 1     u5(t) = a × thalamus1(t)       x14(t)
Thalamic activity related to act 2     u6(t) = a × thalamus2(t)       x15(t)
Act left (act 1)                       u7(t) = act1(t)                x16(t), x17(t)
Act right (act 2)                      u8(t) = act2(t)                x18(t), x19(t)
representation components x14(t) = a × thalamus1(t) and x15(t) = a × thalamus2(t). Since acts usually lead to sensory stimuli during their execution, they are also represented over time during their duration. As each of the two acts act1(t) and act2(t) persists for 200 ms, act1(t) is represented with the components x16(t) and x17(t), and act2(t) with the components x18(t) and x19(t) (Table 2).

To form novel associative chains, a predicted event should elicit prediction signals similar to those elicited by the experience of this event. More precisely, a predicted stimulus should produce internal representation signals that resemble the representation of the stimulus itself. Therefore, Suri (submitted) proposed that the prediction of each stimulus is fed back to the temporal stimulus representation of this stimulus and used to estimate further prediction signals. The feedback is computed twice within each time step because the loop time τ is assumed to be much shorter than the usual 100-ms step size of the model. The prediction pk(t − 2τ) for the input signal uk(t) (k = 1, 2, …, 8) is computed from the product of the adaptive weights vkm(t) with the components of the temporal representation xm(t):

pk(t − 2τ) = Σm=1…19 vkm(t) × xm(t).                          (13a)

This prediction is fed back twice to the temporal stimulus representation with the two equations:

pk(t − τ) = Σm=1…19 vkm(t) × [xm(t) + κ skm pk(t − 2τ)],      (13b)

pk(t) = Σm=1…19 vkm(t) × [xm(t) + κ skm pk(t − τ)].           (13c)

To avoid very large absolute values of the prediction signals pk(t), these signals are limited to values between −pmax and +pmax (Table 3). We introduce the small time constant τ as it corresponds, in a neural substrate, to the minimal time period that is required to form a novel associative chain from two learned associations. However, we do not assign a value to this time constant τ, as τ occurs only in Eqs. (13a) and (13b) and prediction signals will be shown in the figures for time steps of 100 ms. The feedback constant κ (Table 3) determines the gain of the feedback loop and therefore the impact of
a predicted stimulus on further stimulus predictions. The number of these update equations seems to correspond to the number of novel links in the associative chain the model can compute (unpublished result). We are aware neither of mathematical considerations nor of experimental evidence that would indicate which components of the temporal stimulus representation xm(t) should be influenced by this feedback. For simplicity, we assume that the feedback influences only the earliest component of the temporal stimulus representation. Feedback to later temporal stimulus representation components does not influence most simulation results, but leads to slightly different time-courses of prediction signals in simulations that test the formation of novel associative chains (unpublished result). Feedback to the earliest component of the temporal stimulus representation is accomplished by setting the factor skm to 1 for the earliest component of the temporal stimulus representation of each stimulus or act. Also, for the reward and the two thalamic signals, skm is set to 1. Otherwise, the factor skm is set to 0 (s1,1 = 1, s2,4 = 1, s3,7 = 1, s4,13 = 1, s5,14 = 1, s6,15 = 1, s7,16 = 1, s8,18 = 1; skm = 0 otherwise).

The following equations of the extended TD model are analogous to those of the TD model. Therefore, the proposed extended TD model with the feedback constant κ = 0 is equivalent to a set of eight independent TD models: for κ = 0 and for each value of k, Eqs. (13a)-(16) are equivalent to the TD model. The prediction errors ek(t) are computed with

ek(t) = uk(t − 100) − [pk(t − 100) − γ pk(t)],                (14)
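To see how the feedback of Eqs. (13a)-(13c) lets two separately learned associations chain into a novel prediction, consider a deliberately stripped-down sketch (our own, not the authors' code): three events with one representation component each, so that skm reduces to the identity, and hand-set weights in which the act predicts stimulus green and green predicts the reward, while the act → reward link was never trained directly.

```python
import numpy as np

kappa = 0.8   # feedback constant (Table 3)

# Three events, one representation component each (s_km = identity here):
#   0 = act right, 1 = stimulus green, 2 = reward.
# Hypothetical weights v_km as if learned in the earlier phases:
V = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],    # p_green is driven by the act's component
              [0.0, 1.0, 0.0]])   # p_reward is driven by green's component

def predictions(x):
    """Eqs. (13a)-(13c): compute the predictions p_k, feed each one back
    to its event's own representation component, and recompute twice."""
    p = V @ x                      # Eq. (13a)
    p = V @ (x + kappa * p)        # Eq. (13b): first feedback pass
    p = V @ (x + kappa * p)        # Eq. (13c): second feedback pass
    return p

p = predictions(np.array([1.0, 0.0, 0.0]))   # only the act is present
# p[1] = 1.0: green is predicted directly.
# p[2] = kappa: the reward is predicted through the novel associative chain.
```

With kappa = 0 the reward prediction stays at zero, which is the model variant that loses its planning capability in the Results.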
where γ is the discount factor (Table 3) and the input

Table 3. Standard model parameters in the extended temporal difference model

Parameter name                            Symbol    Value
Initial weights for novelty responses     v         0.001*
Critic learning rate                      β         0.5
Feedback constant                         κ         0.8
Maximal value of prediction signals       pmax      10
Discount factor                           γ         0.98/100 ms†
Decay of eligibility trace                λc        0.3/100 ms
Salience of stimulus blue for Critic      none      0.05

*The value of v is chosen small enough to avoid errors due to exploration in the test phase, but large enough to significantly increase the number of acts in the exploration phase.
†From Ref. 64.
signals uk(t) (k = 1, 2, …, 8) denote the three stimuli, the reward, the thalamic signals (multiplied with salience a) and the two act signals. To minimize the prediction error signals, the weights vkm(t) are incrementally adapted according to the product of the prediction errors ek(t) with the eligibility traces x̄m(t) of the temporal representation components:

vkm(t + 100) = vkm(t) + β ek(t) x̄m(t).                       (15)

The three weights vkm that associate the earliest component of the temporal representations of the three stimuli with the reward are initialized with the positive value v (Table 3) to reproduce dopamine novelty responses.64 The other values of the matrix vkm are initialized with zeros [v4,1(t = 0) = v, v4,4(t = 0) = v, v4,7(t = 0) = v; vkm(t = 0) = 0 otherwise]. The traces x̄m(t) are slowly decaying versions of the input representation components xm(t). Such eligibility traces were introduced to explain how animals learn to associate sensory events even if they are separated by a delay period.68 Although TD models with temporal stimulus representations learn to associate sensory events over a delay without representation traces,42 traces accelerate learning.69 At the beginning of an experiment, x̄m(t) is set to the initial condition x̄m(0) = 0. Then, the traces are computed with

x̄m(t) = λc x̄m(t − 100) + (1 − λc) xm(t).                     (16)

The parameter λc denotes the decay of the eligibility trace (Table 3).

The output signal of the extended TD model is the reward prediction error e4(t) [Eq. (14)], which resembles the firing rate of dopamine neurons (Suri, submitted). Since the phasic dopamine concentration in the extracellular space seems to be closely time-correlated with the evoked firing rate of dopamine neurons,31 we compute the striatal dopamine concentration DA(t) with

DA(t) = e4(t).                                                (17)
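One 100-ms update step of Eqs. (15) and (16) can be sketched as follows. This is our own minimal illustration; the two-step scenario at the bottom (a component switching off one step before the prediction error arrives) is hypothetical and chosen only to show why the trace matters.

```python
import numpy as np

beta = 0.5     # Critic learning rate (Table 3)
lam_c = 0.3    # decay of the eligibility trace per 100-ms step (Table 3)

def update(v, xbar, e, x):
    """One 100-ms step of Eqs. (16) and (15) for a single event k.

    xbar: eligibility traces of the representation components x_m(t);
    v:    weights v_km for this event; e: prediction error e_k(t).
    """
    xbar = lam_c * xbar + (1.0 - lam_c) * x      # Eq. (16): trace update
    v = v + beta * e * xbar                      # Eq. (15): weight adaptation
    return v, xbar

# A component active at one step leaves a decaying trace, so a prediction
# error arriving one step later can still credit that component.
v, xbar = np.zeros(3), np.zeros(3)
v, xbar = update(v, xbar, e=0.0, x=np.array([1.0, 0.0, 0.0]))  # component on
v, xbar = update(v, xbar, e=1.0, x=np.zeros(3))                # error one step later
# v[0] is now positive (beta * e * trace) even though x_0 is already off.
```

Without the trace (λc = 0 and x already off), the weight of the earliest component could only be credited while the component itself is active, which slows learning across delays.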
The simulated dopamine concentration DA(t) is used to simulate the effect of dopamine on medium spiny neurons in striatal matrisomes [Eqs. (1) and (5)].

SIMULATIONS
Striatal membrane effects of dopamine D1 agonists in vitro

To test the proposed model for the influences of the dopamine D1 class receptor on the striatal firing rate [Eqs. (1)-(4)], we simulate the experimental conditions of figure 1 in Hernandez-Lopez et al.35 Therefore, we simulate current step injections of 1.3 nA amplitude and 300 ms duration with a frequency of 0.1 Hz. Furthermore, the holding potential of −57 mV is induced with a sustained current I(t) of 0.9 nA. Superimposed on this holding current, current steps of 0.4 nA amplitude and 300 ms duration are injected with a frequency of 0.1 Hz. Therefore, these two simulated current injection protocols differ only in the holding current and not in the total amount of current injected during current steps. For the in vitro experiment, this was at least approximately the case.35 The effective D1 agonist concentration h × DA(t) is set to zero for the condition without D1 agonist application and to the value of 0.1 during D1 agonist application (the constant h is introduced here only to make the notation consistent throughout this paper). This value is chosen to reproduce the extent of the effect measured by Hernandez-Lopez et al.35 To simulate the condition with D1 agonist application, current steps are repeated until the firing response is stable, which is the case after 30 s of simulated physical time. To reproduce dopamine modulation for the recorded neuron,35 the value of the reversal potential is set to −58 mV. Values for the reversal potential between −70 and −57 mV lead to results that are consistent with the experimental findings.

Planning and sensorimotor learning task

We test model performance in a task analogous to the T-maze task. In the exploration phase (Fig. 2B, top), each trial starts with presentation of a stimulus called "blue" (stimulus blue) that represents sensory features of the start box. When either the act right or the act left is executed during presentation of stimulus blue, the stimulus is extinguished and either stimulus green or stimulus red is presented, respectively. The act right and the act left represent the rat's right and left turns, respectively, whereas the stimuli green and red correspond to the colors of the goal boxes. If no act is selected, stimulus blue is extinguished after 600 ms. No rewards are presented in this exploration phase. This phase is simulated for a time span corresponding to 80 s (exclusive of intertrial intervals), during which stimulus blue is presented about 100 times. The subsequent rewarded phase consists of only one trial (Fig. 2B, middle), in which presentation of stimulus green is followed by reward presentation. The beginning of the test phase (Fig. 2B, bottom) is equal to the exploration phase. Stimulus green, but not stimulus red, is followed by reward presentation.
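The task contingencies just described can be summarized in a short sketch. This is our own abstraction, not part of the model: it ignores the 100-ms timing and all membrane dynamics, and the function and policy names are ours.

```python
import random

def run_trial(policy, rewarded=False):
    """One trial of the simulated T-maze task (compare Fig. 2B).

    `policy('blue')` returns 'left', 'right' or None while stimulus blue
    (the start box) is shown.  Act left yields stimulus red; act right
    yields stimulus green; in rewarded phases, green is followed by the
    reward.  Stimulus durations (600/300/100 ms) are abstracted away.
    """
    act = policy('blue')
    if act is None:                       # blue extinguishes after 600 ms
        return []
    outcome = 'green' if act == 'right' else 'red'
    events = [act, outcome]
    if rewarded and outcome == 'green':   # only green is ever rewarded
        events.append('reward')
    return events

# Exploration phase: random acts, and no rewards are presented.
explore = lambda s: random.choice(['left', 'right', None])
assert 'reward' not in run_trial(explore)
# Test phase: the correct act right leads to green and then the reward.
assert run_trial(lambda s: 'right', rewarded=True) == ['right', 'green', 'reward']
assert run_trial(lambda s: 'left', rewarded=True) == ['left', 'red']
```

The point of the structure is that act → green and green → reward are experienced in separate phases, so selecting act right in the first test trial requires chaining the two learned associations.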
In all three phases, the stimuli green and red are presented for 300 ms and the reward for 100 ms. Planning is assessed in the first trial of the test phase and the progress of sensorimotor learning in subsequent trials. At the beginning of each trial, the fluctuating function s(t) [Eq. (6)] is set to random initial conditions for each neuron, since physiological membrane potential fluctuations are only weakly periodic and intertrial intervals are assumed to be sufficiently long. The simulated task requires that the presentation of stimulus blue is often followed by act left or act right during the exploration phase. To induce execution of these exploratory acts, the two corticostriatal weights Wsyn,left(t = 0) and Wsyn,right(t = 0) that associate stimulus blue with the firing rate of the two striatal neurons are initialized with a positive value (Table 1).

RESULTS
Striatal membrane effects of dopamine D1 agonists in vitro

The finding that D1 agonists can be inhibitory or
Fig. 7. Simulation of the experimental finding shown in figure 1 of Hernandez-Lopez et al.35 with the proposed firing rate model [Eqs. (1)-(4)]. The signal E(t) (in mV) denotes the time-averaged membrane potential (B, lines 1 and 2). Above firing threshold, values of E(t) also correspond to firing rates (spikes/100 ms). (A) Current injection of 1.3 nA for 300 ms (bottom line). Without dopamine D1 agonist application, current injection leads to a firing rate of about 3 spikes/100 ms [line 1, h × DA(t) = 0]. The signal coding for the dopamine D1 receptor-mediated membrane effects Wmem(t) remains at the initial value of zero [not shown; follows from Eq. (1)]. With D1 agonist application, evoked firing is attenuated to less than 1 spike/100 ms [line 2, h × DA(t) = 0.1] because the value of the dopamine membrane effect signal Wmem(t) is negative (line 3). (B) Current injection of 1.3 nA for 300 ms from a sustained holding current of 0.9 nA (bottom line). Without dopamine D1 agonist application (line 1), the rate of evoked firing does not depend on the holding current because the dopamine membrane effect signal Wmem(t) remains at the value of zero (not shown). With dopamine D1 agonist application, evoked firing is increased to 4.5 spikes/100 ms [line 2, h × DA(t) = 0.1] because the dopamine membrane effect signal Wmem(t) is positive (line 3).
excitatory for medium spiny neurons depending on the holding potential (see figure 1 in Hernandez-Lopez et al.35) is reproduced with the model (Fig. 7). Since the term h × DA(t) is set to zero to simulate the control condition without D1 agonist application, the membrane effect signal Wmem(t) remains at the initial value of zero [not shown; follows from Eq. (1)]. Therefore, the evoked firing rate (Fig. 7A, B, top) does not depend on the holding potential, but only on the injected current [Fig. 7A, B, bottom line; Eqs. (2) and (3)]. Although this model feature does not hold exactly for medium spiny neurons,48 the simulated membrane potentials and firing rates approximately reproduce those for the medium spiny neuron shown at the top of figure 1 in Hernandez-Lopez et al.35 Since the term h × DA(t) is positive to simulate D1 agonist application and the resting membrane potential of −82 mV is below the reversal potential, the dopamine membrane effect signal Wmem(t) is negative [Fig. 7A, line 3; Eq. (1)]. In contrast, Wmem(t) is positive if the holding membrane potential of −57 mV is above the reversal potential (Fig. 7B, line 3). Since the firing rate depends on the dopamine membrane effect signal Wmem(t) [Eq. (3)], D1 agonist application attenuates firing evoked from the resting potential of −82 mV, but enhances firing evoked from the holding potential of −57 mV (Fig. 7A, B, line 2). These simulated D1 effects approximately reproduce those measured for the medium spiny neuron shown in figure 1 of Hernandez-Lopez et al.35
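The sign logic just described can be captured in a qualitative sketch. This is our own illustration: Eq. (1) itself is not reproduced in this excerpt, so the simple linear form below is only an assumption about the sign behavior, not the model's actual dynamics.

```python
def d1_membrane_effect_sign(E, da, E_rev=-58.0, h=1.0):
    """Qualitative sign of the D1 membrane effect signal Wmem.

    Our reading of the text: D1 receptor activation (h * da > 0) drives
    the membrane effect negative when the membrane potential E is below
    the reversal potential E_rev, and positive when E is above it.  The
    linear product below is a stand-in for the dynamics of Eq. (1).
    """
    return h * da * (E - E_rev)

# Firing evoked from rest (-82 mV, below E_rev) is attenuated:
assert d1_membrane_effect_sign(-82.0, da=0.1) < 0
# Firing evoked from the holding potential (-57 mV, above E_rev) is enhanced:
assert d1_membrane_effect_sign(-57.0, da=0.1) > 0
# Without the agonist, there is no membrane effect at all:
assert d1_membrane_effect_sign(-82.0, da=0.0) == 0
```

This sign reversal around the reversal potential is what lets the same dopamine signal suppress the down-state neuron and boost the up-state neuron in the planning simulations below.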
Planning and sensorimotor learning

Since each trial started with a random state of the striatal membrane potentials, the model performed differently in each experiment. In the exploration phase, the model learns to associate act left with stimulus red and act right with stimulus green; no rewards are presented in this phase. In the first trial (Fig. 8A), stimulus blue was presented and the model executed the act left (bottom line), which led to presentation of stimulus red (line 1). Since certain associative weights of the extended TD model had been initialized with positive values, this novel stimulus phasically activated the reward prediction signal [line 2; Eqs. (13a)-(13c)]. This led to a biphasic response of the dopamine-like reward prediction error signal [line 3; Eq. (14)] resembling dopamine novelty responses. Since the salience of stimulus blue had been set to a smaller value than that of the stimuli red and green (see section Extended temporal difference model and Table 3), onset of stimulus blue led to very small activations of the reward prediction signal and the reward prediction error signal (hardly visible). The time-averaged membrane potentials Eleft(t) and Eright(t) of the two medium spiny neurons in the striatal matrisome compartment fluctuated between the elevated up-state and the hyperpolarized down-state [line 5; Eqs. (4) and (7)]. The striatal membrane potentials were slightly increased during presentation of stimulus blue, since corticostriatal weights associating stimulus blue with
Fig. 8. Model performance during exploration phase. (A) First trial. When stimulus blue was presented (line 1), the model elicited the act left (bottom line), which led to presentation of stimulus red (line 1). Since stimulus red was presented for the first time, its onset phasically activated the reward prediction signal (line 2) and biphasically activated the dopamine-like reward prediction error signal (line 3). Membrane potentials of the two simulated striatal medium spiny neurons fluctuated between an elevated up-state and a hyperpolarized down-state (line 5). During presentation of stimulus blue, the simulated striatal neuron coding for act left was firing for 500 ms. Neurons in the motor cortex integrated this striatal firing rate over time (line 6). The act left was elicited (bottom line) when the integrated signal reached a threshold. (B) A trial at the end of the exploration phase. When stimulus blue was presented (line 1), the model elicited the act right (bottom line), which led to presentation of stimulus green (line 1). Since stimulus green had been presented repeatedly during the exploration phase, novelty responses were almost absent in the reward prediction signal (line 2) and in the dopamine-like reward prediction error signal (line 3). Prediction of stimulus green (line 4) was already increased when the striatal neuron coding for the act right increased its firing rate (line 5), because this had often antedated execution of act right and presentation of stimulus green.
striatal activity were set to positive initial values (compare section Simulations). The striatal neuron coding for act left was firing for 500 ms. This signal was integrated by two neurons in the motor cortex [line 6; Eq. (11)]. When the firing rate of the neuron coding for act left reached the act threshold actthres, this act was elicited [bottom line; Eq. (12)]. The signal coding for the act right remained at the value of zero (not shown). For the following 80 presentations of stimulus blue, the model executed the act left 11 times, the act right 14 times and no act 55 times (not shown). Trials without acts occurred when the striatal membrane potentials of both neurons happened to fluctuate synchronously, as the indirect pathway [Eq. (8)] suppressed the effects of synchronous striatal firing on the cortical neurons. The model performance for the 81st presentation of stimulus blue is shown in Fig. 8B. Since the model selected act right (bottom line), stimulus green was presented (line 1). The signals coding reward prediction and reward prediction error remained almost zero, since novelty responses had been extinguished as a consequence of the Critic learning rule [lines 2 and 3; Eq. (15)]. The green prediction signal was slightly activated when the striatal neuron coding the act right was firing (lines 4 and 5), as such increased striatal firing had often been followed by the act right in previous trials. Striatal firing influenced the green prediction signal via the basal ganglia output nuclei, the thalamus and cortex [via salience a; see Fig. 3 and Eq. (10)]. Since act right had previously been followed by green presentation and an efference copy of the act signal reached the Critic [Fig. 3 and Eq. (13a)], the green prediction signal was fully activated when the act right was executed (bottom line). The green prediction peaked at the correct value of 3, as this value reflects the predicted future duration of stimulus green in units of 100 ms. The striatal firing rates were integrated over time in cortical neurons (line 6), which led to execution of the act right (bottom line).

The rewarded phase consisted of only one trial (Fig. 9) in which presentation of the stimulus green (line 1) was followed by that of the reward (line 2). An act was not executed, as the initial values of the corticostriatal weights disfavored acts when stimulus blue was absent (see section Simulations). Since the reward was unpredictable, the reward prediction error (line 3) was equal to the reward signal [Eq. (14)]. The temporal representation of stimulus green in the Critic consisted of three phasic components (lines 6-8; Fig. 5A) that followed green presentation with delays of 100, 200 and 300 ms. An eligibility trace was computed for each component [lines 9-12; Eq. (16)]. The adaptive weights v41(t), v42(t) and v43(t) (three lines at bottom; the index 4 codes for the reward) associate the components x1(t), x2(t) and x3(t) of the temporal representation of stimulus green
Fig. 9. Associative learning during rewarded phase. In this second phase, presentation of stimulus green (line 1) was followed by presentation of the reward (line 2) and no act was executed. Since the reward was unpredictable, the reward prediction error (line 3) was equal to the reward signal. The three temporal representation components of stimulus green were phasic signals with peaks following green onset with different delays (lines 4-6). For each component an eligibility trace was computed (lines 7-9) that was used to adapt the weight associating this component with the reward (three lines at bottom). (All signals shown in this figure start with a value of zero.)
with the reward prediction signal [Eqs. (13a)-(13c)]. Each of these weights was adapted proportionally to the product of the trace of its component x̄1(t), x̄2(t) or x̄3(t) with the reward signal [Eq. (15)]. Likewise, the Critic learns the associative weights vkm(t) between the other input events (stimuli, reward, acts, thalamic firing rate; Fig. 6).

Planning was assessed in the first trial of the test phase, as this trial tested the model's ability to select the correct act right based on the formation of novel associative chains. In the first trial (Fig. 10A), presentation of stimulus blue (line 1) was responded to with the correct act right (bottom line), which was followed by presentation of the stimulus green and the reward (line 1). This correct act was selected because of a positive feedback between the striatal firing coding for the act right (line 8) and the dopamine-like reward prediction error (line 5). In this trial, the striatal neuron coding for the act right happened to stay in the elevated up-state for several hundred milliseconds during presentation of stimulus blue (line 8). (If instead the other neuron had happened to stay in the up-state, this would not have increased the dopamine-like reward prediction error and no act might have been elicited.) This persistent striatal firing reached the Critic via the basal ganglia output nuclei and the thalamus [via salience a; Fig. 3 and Eq. (10)]. Since the model had
previously learned to associate such striatal firing coding for the act right with presentation of stimulus green, the green prediction signal slightly increased (line 2). Because the model formed novel associative chains via the feedback over the constant κ, this activation of the green prediction signal served as the stimulus green itself [Fig. 6; Eqs. (13a)-(13c)]. Since the model had learned in the rewarded phase to associate stimulus green with the reward, the reward prediction (line 4) and therefore the dopamine-like reward prediction error were slightly activated [first small peak in line 5; Eq. (14)]. This activation of the reward prediction error increased the signal representing dopamine membrane effects [line 6; Eq. (1)] and the corticostriatal weight [line 7; Eq. (5)] of the striatal neuron coding for the act right. In contrast, both signals were decreased for the act left, since the membrane potential of this neuron was below both reversal potentials. These influences increased the firing rate of the striatal neuron coding for the act right and decreased the membrane potential of the neuron coding for the act left [line 8; Eqs. (3) and (7)], which led to the execution of the correct act right.

In the following 60 simulated test phase trials the correct act right was executed, except in the incorrect trial 3 (not shown). In trial 19 (Fig. 10B), the prediction error signals (lines 3 and 5) were phasically increased at onset of stimulus blue, as all the event occurrences following stimulus blue were correctly predicted. These events were completely predictable, as the previous 10 trials had been alike (not shown). Because of the large corticostriatal weight associating stimulus blue with the act right [line 7; Eq. (5)], the striatal neuron coding for the act right became strongly activated [line 8; Eqs. (3), (4) and (7)] and the model executed the correct act right (bottom line).
The previous three figures show the performance of the standard model in one typical experiment. As the outcome of each experiment was influenced by the randomly selected states of the striatal membrane potentials at the beginning of each trial, the model performance was tested in 1000 experiments. In addition, we produced seven model variants by setting, for each model variant, one model parameter to zero. As for the standard model, we simulated 1000 experiments for each of these physiologically less plausible model variants. During the exploration phase, the model variant without dopamine-like novelty responses to novel stimuli [v = 0, Eq. (15)] and the model variant without transient effects of the dopamine-like signal on the membrane potentials of striatal neurons [h = 0, Eq. (1)] executed substantially fewer acts than the other model variants (Table 4). The percentage of correct acts in trials 1-19 of the test phase is shown for each model variant in Fig. 11.
Table 4. Number of acts during the exploration phase for different model variants

Standard       v = 0        β = 0        κ = 0        a = 0        ε = 0        h = 0        f = 0
26.3 ± 0.3     13.6 ± 0.1   26.3 ± 0.2   25.8 ± 0.1   22.7 ± 0.1   27.1 ± 0.2   13.8 ± 0.1   29.5 ± 0.1

Mean values ± S.E.M. were computed from 1000 experiments for each model variant.
Fig. 10. Model performance in the test phase. When presentation of stimulus blue (line 1) was responded to with the correct act right (bottom line), the stimulus green was presented, which was followed by the reward presentation (line 1). (A) Successful planning in the first trial. The signal coding for prediction of stimulus green (line 2) was already slightly activated when the firing rate of the striatal neuron coding for the act right was increased (line 8). Since the green prediction was associated with the reward prediction, the reward prediction shows a first small activation (line 4). This signal shows a second, higher peak when the partially predicted reward occurs. The first slight activation of the reward prediction error (line 5) enhanced the firing rate of the striatal neuron coding for the act right (line 8), as the reward prediction error increased both the dopamine membrane effect signal (line 6) and the corticostriatal weight (line 7) coding for act right. The cortical neurons integrated the striatal neural activity over time, and the correct act right was elicited (bottom line) when the cortical firing rate reached the threshold (line 9). (B) Successful sensorimotor association in the 19th test phase trial. Since the onset of stimulus blue was unpredictable, this onset activated the prediction error signals for the stimulus green (line 3) and for the reward (line 5). These error signals were otherwise at the value of zero, as the presentations of the stimulus green and of the reward were correctly predicted. The corticostriatal weights associating stimulus blue with the striatal membrane potentials (line 7) substantially increased the membrane potential of the striatal neuron coding for the act right (line 8), which triggered execution of the correct act right (bottom line).
Significantly more than 50% correct trials in trial 1 indicates planning capabilities, while performance improvements over successive trials demonstrate successful sensorimotor learning due to dopamine-dependent adaptations of corticostriatal weights [Eq. (5)]. For the standard model (Fig. 11, solid line with stars), trial 1 was correct in 79% of the 1000 experiments; performance then worsened for three trials but reached 100% correct trials with further training. For a model variant without dopamine-like novelty responses (ν = 0, dashed line with crosses), performance in the first trial was significantly worse than that of the standard model (74% correct, χ² = 7.2, P < 0.01). For a model variant without adaptation of the dopamine-like signal [β = 0, Eq. (15) and Fig. 6; dash-dotted line with triangles], the dopamine-like signal unconditionally responded to the reward and the response did not transfer to predictive stimuli. Since responses of the dopamine-like signal reached the striatum too late to learn the correct weights, this model variant performed at chance levels in all trials. If the dopamine-like signal was simulated with the TD model instead of the extended TD model [κ = 0, Eqs. (13a), (13b) and (13c); dashed line with triangles], novel associative chains were not formed. Since this model variant did not benefit from the two previous training phases, performance was at chance levels in the first trial. Surprisingly, performance then decreased below chance levels for some trials and only slowly improved with further training. For a model variant without striatal influence on the dopamine-like signal (salience α = 0; solid line with squares), model performance was at chance levels in the first trial and progressively improved during further trials. For a model variant without adaptation of the sensorimotor weights [ε = 0, Eq. (5); Fig. 11, dash-dotted line with triangles], 79% of the first trials were correct, as planning was mostly controlled by transient dopamine D1 receptor-dependent membrane effects. Performance then worsened and remained only slightly above chance levels because the Critic weights progressively deteriorated. Since sensorimotor
Fig. 11. Learning curves in the test phase for different model variants. Each curve was computed from 1000 experiments (standard errors < 1.6%). Trial 1 assesses planning and successive trials test the progress in sensorimotor learning. The standard model (solid line with stars) and the model variant without dopamine D1 receptor-mediated membrane effects (η = 0, dash-dotted line with triangles) performed best. The model variant without dopamine novelty responses (ν = 0, dashed line with crosses) performed significantly worse in the first trials than the standard model. The other learning curves were computed without adaptation of the dopamine-like signal (β = 0, dash-dotted line with triangles), with the TD model instead of the extended TD model (κ = 0, dashed line with triangles), without striatal activity influencing the dopamine-like signal (salience α = 0, solid line with squares), without adaptation of the corticostriatal weights (ε = 0, dash-dotted line with triangles), and without influence of the dopamine-like signal on the durations of up- and down-states (ω = 0, dotted line with circles).
learning was prevented in this model variant, the performance above chance levels reflects a minor influence of dopamine D1 receptor-dependent membrane effects. A model variant without dopamine membrane effects (η = 0; Fig. 11, dash-dotted line with triangles) performed significantly above chance levels in the first trial (60% correct, χ² = 21, P < 0.01), as dopamine-dependent synaptic adaptations led to some planning capabilities. This model variant then learned the task more rapidly than the standard model. A model variant without the influence of the dopamine-like signal on the durations of up- and down-states (ω = 0; Fig. 11, dotted line with circles) performed significantly worse than the standard model (73% correct, χ² = 9.9, P < 0.01 for the first trial). We had expected that model performance in the test phase would progressively increase over successive trials because sensorimotor reward learning would progressively dominate planning processes. Surprisingly, the learning curves of some model variants were U-shaped. We suspected that the presentation durations of stimuli green and red (300 ms) interfered with the durations of up- and down-states (about 400 ms). Therefore, performance was tested with the stimuli green and red presented for 800 ms. Indeed, this prevented the performance of the model variant with the original TD model (κ = 0) from decreasing below chance levels. However, the learning curves of other model variants were still U-shaped. In such incorrect trials, the dopamine-like signal was usually slightly increased while the striatal neuron coding for the incorrect act was active, which seemed to elicit the incorrect act. Therefore, we suspected that the stimulus blue became associated with the reward during the first correct test phase trials and that the activation of the dopamine-like signal elicited not only correct but also incorrect acts. This supposition was tested by decreasing the salience of stimulus blue in the Critic (from the standard value of 0.05 to 0.02). The learning curve of the model with decreased salience of stimulus blue was substantially better and less U-shaped than that computed with the standard parameters (improvement in trial 1: +2%; trial 2: +8%; trial 3: +14%). However, reaction times to stimulus blue increased (trial 1: +200 ms; trial 2: +600 ms; trial 3: +1000 ms), presumably because the dopamine-like signal no longer elicited incorrect movements. The average reaction times for the model variants are shown in Fig. 12. The reaction time was defined as the sum of the time periods during which stimulus blue was present before the execution of any act. Consequently, presentations of stimulus blue without acts were not counted as trials, but contributed to the reaction time of the subsequent act. Reaction times usually decreased progressively with repeated trial presentations, as reaction times for sensorimotor performance were shorter than those for planning. For the standard model, the reaction time in the first trial was 690 ± 10 ms (mean ± S.E.M.) and then progressively decreased to 200 ms. The reaction time of 200 ms corresponds to the sensory and motor delays of the model [section Simulations and Eq. (12)]. For the model variants without planning capabilities (β = 0, κ = 0, α = 0), the reaction times in the first trial increased to about 2000 ms. Without transient dopamine membrane effects (η = 0), this reaction time was even longer (3000 ± 100 ms). In further simulations with the standard model, the
Fig. 12. Average reaction times in trials 1–19 of the test phase for different model variants. The reaction time for the act in the first trial, which assessed planning, was usually longer than the reaction times in successive trials, which assessed sensorimotor associations (line types and experimental data correspond to Fig. 11). The average reaction time of the standard model (solid line with stars) was much shorter than that of the model variant without dopamine D1 receptor-mediated membrane effects (η = 0, dash-dotted line with triangles).
experimental paradigm was changed slightly to show that, for planning performance, there is a trade-off between the reaction time and the percentage of correct acts. In these simulations, acts were prevented during the first 10 presentations of stimulus blue in the test phase. This was achieved by setting the two act signals act1(t) and act2(t) to zero during these 10 blue presentations [Eq. (12)]. With this paradigm, performance in the first trial of the test phase significantly improved to 85.2 ± 0.01% correct trials (mean ± S.E.M., for 1000 experiments). If the first 10 blue presentations were not counted toward the reaction time, the average reaction time to stimulus blue was similar to that in the standard paradigm (670 ms). The performance in the following trials was also better than that in the control condition (73%, 85%, 94%, 99% and 100% correct). This performance improvement was the result of adjustments of the corticostriatal weights during the 10 presentations of stimulus blue without execution of any act and without presentation of any reward. The corticostriatal weight associating stimulus blue with the correct act right increased by 0.76 ± 0.01, whereas the weight associating stimulus blue with the incorrect act left decreased by 0.85 ± 0.01 (mean ± S.E.M., 1000 experiments). In this experimental paradigm with prolonged planning time, the model sequentially evaluates possible actions to improve sensorimotor corticostriatal weights without executing any act and without presentation of any reward.

DISCUSSION
This simulation study demonstrates that characteristics of dopamine neuron activity and of striatal dopamine modulation may be advantageous for exploration, sensorimotor learning, planning and behaviorally silent improvement of associative strengths. Furthermore, this study proposes, probably for the first time, a biologically motivated model for planning. Three types of membrane potential-dependent influences of dopamine on striatal medium spiny neurons are simulated: transient D1 receptor-mediated membrane effects [Eq. (1)], long-term adaptation of corticostriatal transmission [Eq. (5)], and influences on the durations of up- and down-states [Eq. (6)]. In our model, all three types of dopamine effects contribute to planning capabilities. Dopamine D1 receptor-mediated membrane effects substantially decrease the reaction time for planning. Sensorimotor learning requires long-term adaptation of corticostriatal transmission and the transfer of the dopamine-like signal to the first reward-predictive stimulus. Dopamine-like novelty responses stimulate exploration, which significantly improves planning capabilities.

Striatal membrane effects of dopamine D1 agonists

Although the proposed model does not include transient effects of other dopamine receptor classes,55 the model for transient dopamine D1 class receptor effects on the firing rate of striatal medium spiny neurons [Eqs. (1)–(4)] serves as a crude phenomenological model for the transient effects of dopamine. According to the proposed model, in vivo dopamine application increases the firing rate when the membrane potential averaged over several hundred milliseconds is depolarized, and decreases the firing rate otherwise [Eqs. (1)–(4)]. This model is supported by the finding that striatal dopamine application in vivo increases the ratio between high firing rates and low firing rates (signal-to-noise ratio),51,58 as the firing rate usually correlates with the time-averaged membrane potential. Since dopamine D1 receptor activation also seems to increase high but not low firing rates in prefrontal cortical pyramidal neurons,25,26 cortical dopamine D1 receptor activation may have functions similar to those simulated in the current study for the striatum.

Dopamine neuron activity

Suri (submitted) proposed to model dopamine neuron activity with the reward prediction error of an internal model approach. This approach is simplified for the current study and termed the "extended TD
model". In contrast to the TD model, the dopamine-like reward prediction error of the extended TD model is influenced by the formation of novel associative chains. Learning of reward predictions may involve plastic changes in projections from the cortex to striatal striosomes36,42,57,64 and may depend on metabotropic glutamate receptor-driven calcium spikes of striatal medium spiny neurons.8 Reciprocal projections between cortical areas may be involved in the learning of stimulus predictions and the formation of novel associative chains (Suri, submitted). After learning stimulus–reward associations, the reward prediction and the dopamine-like reward prediction error of the extended TD model are equal to these signals of the TD model (compare Fig. 10B with Ref. 68). We model biphasic dopamine responses to novel stimuli by initializing certain weights with positive values.64 With this assumption, the model reproduces biphasic dopamine novelty responses (Fig. 8A). Furthermore, these novelty responses progressively decrease when the same neutral stimulus is presented repeatedly (Fig. 8B), which resembles the habituation of dopamine novelty responses to repeated presentation of a neutral stimulus (see Introduction).

Sensorimotor learning

As in previous studies, we relate sensorimotor learning to dopamine-dependent plasticity in projections from the cortex to striatal matrisomes.36,63,64 For successful sensorimotor learning, the dopamine-like signal has to be adaptive63 and can be computed with the extended TD model (standard parameter values) or with the original TD model (parameter κ = 0).63,68 Previous models assume that corticostriatal transmission increases when "eligibility traces" of pre- and postsynaptic firing are active together with the dopamine-like signal. In the current model, corticostriatal transmission adapts without eligibility traces, according to the product of dopamine activity, presynaptic activity and postsynaptic membrane potential [Eq. (5)].
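The three-factor adaptation rule just described (the weight change as the product of the dopamine-like signal, presynaptic activity, and a term derived from the postsynaptic membrane potential) can be sketched as follows. This is an illustrative stand-in for the paper's Eq. (5), with assumed names, threshold and learning rate; the key property it reproduces is that the sign of the update reverses between down-state and up-state neurons.

```python
# Illustrative three-factor corticostriatal learning rule (a sketch, not the
# paper's actual Eq. (5)): weight change = learning rate * dopamine signal *
# presynaptic activity * (postsynaptic potential relative to a threshold).

def update_weight(w, dopamine, pre, post_potential, threshold=-60.0, lr=0.01):
    """Return the adapted corticostriatal weight.

    dopamine: reward-prediction-error-like signal (may be negative).
    pre: presynaptic (cortical) activity, >= 0.
    post_potential: postsynaptic membrane potential in mV; the sign of the
    update flips as it crosses `threshold` (down-state vs up-state).
    """
    post_term = post_potential - threshold  # negative in down-state, positive in up-state
    return w + lr * dopamine * pre * post_term

# A positive dopamine signal strengthens synapses onto an up-state neuron...
w_up = update_weight(0.5, dopamine=1.0, pre=1.0, post_potential=-50.0)
# ...and weakens synapses onto a down-state (hyperpolarized) neuron.
w_down = update_weight(0.5, dopamine=1.0, pre=1.0, post_potential=-75.0)
```

With a negative dopamine signal the two cases reverse, so processes that produced a worse-than-expected outcome are weakened, as discussed below for behavioral variation.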
Both adaptation rules are motivated by reinforcement learning theories33,67,69,70 and by experimental findings about the plasticity of corticostriatal transmission. Both rules seem to be consistent with the experimental finding that corticostriatal LTD induced by tetanic stimulation of corticostriatal fibers is reversed by simultaneous pulsatile dopamine application.77 While the previously proposed learning rule36,63,64 requires postsynaptic firing for weight adaptation, we propose here that the direction of weight adaptation reverses when the postsynaptic membrane potential changes from a hyperpolarized to a depolarized potential. Although plasticity of corticostriatal transmission without postsynaptic firing has not been reported, our assumption may be supported by the finding that enabling of NMDA receptor-mediated currents can reverse LTD to LTP.11,13,14

Planning

The simulated planning task requires the formation of novel associative chains and the selection of the act that
predicts the optimal outcome. These planning processes take place during the reaction time, without execution of any act or receipt of any sensory input, while act preparation activity in striatal matrisomes alternates between the two possible acts, driven by the fluctuating membrane potentials of these neurons. Indeed, the activity of a subset of striatal neurons is related to act preparation.56 Without executing any act, the reward-predictive values of striatal act preparation activities are evaluated by the extended TD model and reflected in the dopamine-like signal. For this evaluation, the extended TD model forms novel associative chains from the associations learned in the two previous training phases. This includes formation of the novel associative chain that preparation for the act right may lead to presentation of stimulus green, which will be followed by the reward. Increases in the dopamine-like signal reinforce preparatory firing of the neuron that corresponds to the outcome with the best reward prediction. This reinforcement prolongs and increases reward-promising preparatory striatal activities. Thus, the dopamine-like signal guides the act preparation activities and elicits acts. Indeed, pharmacological studies suggest that dopamine may be involved in behavioral activation.50,54 Our model simulations demonstrate that successful planning requires the extended TD model, as novel associative chains cannot be formed with the TD model. Therefore, previous models using a dopamine-like reinforcement signal for learning of sensorimotor associations are not able to solve the planning task.36,42,57,63,64 Since the model sequentially evaluates the reward predictions for both alternative acts, reaction times for planning are longer than reaction times for sensorimotor responding.

When the paradigm is adapted to prolong the reaction time, planning capabilities improve, since the model improves corticostriatal weights by simulating experience with the extended TD model without performing any act and without any reward presentation.67,69 Consistent with this simulation result, the motor performance of humans and animals can be improved when planning processes are stimulated in the absence of rewards or movements.21,23 The transient influence of dopamine D1 receptor activation on the firing of striatal medium spiny neurons contributes to planning by shortening the reaction time and increasing the number of correct acts in the first trial of the test phase. In addition, the long-term effects of dopamine on corticostriatal transmission contribute to planning capabilities. The model variant without sensorimotor learning exhibits planning capabilities in the first test phase trial, but its performance deteriorates in subsequent trials (ε = 0 in Fig. 11). The weight adaptations of the extended TD model cause this performance decline, as only these weights are adapted in this model variant. Such poor performance of the extended TD model in repeated test phase trials may also cause the U-shaped learning curves of other model variants. This suggests that our ad hoc assumptions for the extended TD model that lead to the formation of novel associative chains may not allow successful performance in complex planning tasks. Unfortunately, the mathematical knowledge about TD
models that form novel associative chains is still incomplete (Suri, submitted), and the experimental findings about predictive neural activity in situations that test the formation of novel associative chains are sparse.84 Planning is a basic concept of modern control algorithms used in reinforcement learning and in the engineering sciences.29,69 Planning may be involved not only in "cognitive" tasks,39,53,71,76 but also in "motor" tasks that cannot be solved with sensorimotor learning.83 For example, saccadic eye movements to a behaviorally important object cannot be learned with sensorimotor learning if the muscle commands vary due to varying start or goal directions of the gaze. The proposed model suggests how such goal-directed actions may be generated. The proposed extended TD model (Critic) learns prediction signals that precede the sensory consequences of planned acts. Indeed, neural recording studies during saccadic eye movement tasks report direct evidence for this postulated phenomenon. Sensory and sensorimotor neural activity in the frontal eye fields,30,74 the superior colliculus,75 the lateral intraparietal area,24 and striate and extrastriate cortex43 anticipates the retinal consequences of saccades about 100 ms before these saccades are elicited. According to the proposed model, this anticipatory activity selects reward-promising targets via increases in dopamine neuron activity. This is consistent with the observation that the onset of dopamine neuron activation to visual stimuli occurs before the saccadic eye movement,49 and with the view that dopamine attributes salience to reward-related stimuli and thereby directs the animal's visual and internal attention to such targets.46,49,54

Striatal membrane potential fluctuations

Animal learning theorists suggest that animals vary their behavior in repetitive trials in order to find the optimal behavior.37,59 According to these theories, animals repeat rewarded variations because a reward that follows a certain behavioral variation strengthens the processes that led to this variation. In the proposed model, each striatal neuron uses this strategy to optimize the dopamine-like reward prediction error. Spontaneous striatal membrane potential fluctuations lead to variations in act selection and variations in the act preparation signals. Therefore, acts are varied in the exploration phase, and striatal act preparation signals are varied before they are executed in the planning phase. Membrane effects [Eq. (1)] and corticostriatal transmission [Eq. (5)] adapt proportionally both to the dopamine-like signal and to this variation in the firing rate. Therefore, processes that caused a behavioral variation are strengthened by the dopamine-like signal if the outcome is better than expected, and weakened if the outcome is worse. A reinforcement learning algorithm with such random variations has been proposed for tasks that require continuous output signals.33 Although state durations in awake animals may be less regular than in our model,80,81 we do not expect that this would impair the performance of the proposed planning mechanism as long as these durations stay within certain limits. Since the different states of the
membrane potentials reflect different action plans, the planning mechanism requires that the fluctuation durations are shorter than the animal's reaction time for planning. Furthermore, the durations of membrane potential fluctuations have to be longer than the response time of the dopamine-like signal to the activations of striatal neurons, because this feedback elicits the actions. Since the simulated dopamine effects [Eqs. (5) and (15)] for elevated and for hyperpolarized membrane potentials are antagonistic, sustained activity increases of dopamine neurons projecting to matrisomes do not primarily serve as a reinforcement signal, but rather influence the reaction time and the number of acts. This seems to be consistent with findings from pharmacological studies demonstrating that tonic increases in striatal dopamine levels do not serve as a reinforcement, but rather increase response speed and vigor.50,54 Also, the finding that tonic dopamine levels in the striatum are increased or decreased when aversive stimuli are presented53,54 suggests that slow changes in striatal dopamine levels are not primarily involved in reinforcement learning.

Dopamine-like novelty responses stimulate exploration and improve planning capabilities

As in previous studies, we simulate dopamine neuron activity as a reward prediction error signal and demonstrate that this signal may serve as an effective reinforcement signal for sensorimotor learning.36,63,64 This hypothesis does not imply that dopamine is involved in anhedonia or that dopamine is always involved in sensorimotor reward learning.54 In contrast to our view, it has been claimed that dopamine neurons do not signal a reward prediction error because they respond to non-rewarding stimuli.46,49,54 Indeed, the onset of dopamine activation to visual stimuli hardly reflects reward prediction errors, because this onset occurs before the saccadic eye movement to the stimulus and therefore before the subject identifies the stimulus.49 Dopamine neurons respond with an activity increase to rewards, and with an increase followed by a decrease to novel stimuli, physically salient stimuli and stimuli resembling rewarded stimuli.55 Therefore, it is not the onset of dopamine neuron activation that reliably distinguishes between rewards and salient stimuli, but rather the activity that immediately follows. Furthermore, these biphasic responses to salient but non-rewarding stimuli do not argue against the usefulness of dopamine neuron activity as a reinforcement signal. For actions that are much slower than the duration of biphasic dopamine responses (with eligibility traces of much more than 300 ms duration), the proposed learning rules suggest that the effects of the two phases should cancel out, so that these biphasic responses do not reinforce slow actions.36,63,64 For brief actions, including saccadic eye movements, our simulations demonstrate that dopamine-like biphasic novelty responses do not substantially impair sensorimotor learning, since these responses extinguish during exploration. Novelty responses increase the number of acts in the exploration phase (Table 4), suggesting that dopamine novelty responses stimulate exploration. Such
stimulation of exploratory acts improves learning in the exploration phase, which improves planning capabilities in the test phase (Fig. 11). In theoretical studies of TD learning, similar biphasic reward prediction error signals have been used to stimulate exploration. In such studies, an agent typically moves between different places in a maze and learns to find the locations of rewards. The performance of such algorithms depends on a trade-off between exploration and exploitation, as well as on the exploration strategy.27 One popular technique is to be optimistic about places that have never been explored.20,69,73 Similar to the proposal to reproduce dopamine novelty responses with positive initial weights,64 this has been achieved for TD models by attributing optimistic initial values to novel places.20,68,69 When a novel place is visited that is preceded and followed by familiar places that are not associated with
the reward, the reward prediction error increases above the baseline level of zero when the novel place is visited and decreases below zero when the following place is visited. This biphasic reward prediction error in studies of TD learning resembles biphasic dopamine responses to novel stimuli. Furthermore, the biphasic responses of both dopamine neurons and the reward prediction error signal progressively diminish when the same situation is repeatedly presented. This suggests that dopamine-like novelty responses may stimulate exploratory behaviors in animals, as the novelty bonus does for the acts of reinforcement learning agents.
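The optimistic-initialization effect described above can be sketched with a tabular TD(0) example. The chain of states, the optimistic initial value and the learning rate are assumed for illustration; what the sketch shows is that on the first pass the prediction error is positive on entering the novel state and negative on leaving it, mirroring the biphasic dopamine novelty response, and that this biphasic error shrinks on repetition.

```python
# Sketch of a TD(0) novelty bonus: one state in a chain starts with an
# optimistic initial value, producing a biphasic prediction error that
# habituates over repeated passes. All parameters are illustrative.

def td_errors_on_pass(V, states, gamma=1.0, alpha=0.3, reward=0.0):
    """Run one unrewarded pass through `states`; return the TD errors."""
    errors = []
    for s, s_next in zip(states, states[1:]):
        delta = reward + gamma * V[s_next] - V[s]  # TD prediction error
        errors.append(delta)
        V[s] += alpha * delta                      # value update
    return errors

V = {"A": 0.0, "novel": 0.5, "B": 0.0, "end": 0.0}  # optimistic init for "novel"
first = td_errors_on_pass(V, ["A", "novel", "B", "end"])
second = td_errors_on_pass(V, ["A", "novel", "B", "end"])
# first pass: positive error approaching "novel", negative error leaving it;
# second pass: the same biphasic pattern, with smaller amplitude.
```

Used as a reinforcement signal, the transient positive phase favors acts leading toward the unexplored state, which is the exploration bonus discussed in the text.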
Acknowledgements: This study was supported by a fellowship from the Swiss National Science Foundation (R.E.S.) and by a Program Project grant from the Human Brain Project (M.A.A.).
REFERENCES
1. Albin R. L., Young A. B. and Penney J. B. (1989) The functional anatomy of basal ganglia disorders. Trends Neurosci. 12, 366–375.
2. Alexander G. E., DeLong M. R. and Strick P. L. (1986) Parallel organization of functionally segregated circuits linking basal ganglia and cortex. A. Rev. Neurosci. 9, 357–381.
3. Alexander G. E. and Crutcher M. D. (1990) Functional architecture of basal ganglia circuits: neural substrates of parallel processing. Trends Neurosci. 13, 266–271.
4. Apicella P., Scarnati E., Ljungberg T. and Schultz W. (1992) Neuronal activity in monkey striatum related to the expectation of predictable environmental events. J. Neurophysiol. 68, 945–960.
5. Arbib M. A. (1972) The Metaphorical Brain. Wiley-Interscience, New York.
6. Balleine B. W. and Dickinson A. (1998) Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology 37, 407–419.
7. Berns G. S. and Sejnowski T. J. (1996) How the basal ganglia make decisions. In Neurobiology of Decision-Making (eds Damasio A. R., Damasio H. and Christen Y.). Springer, Berlin.
8. Brown J., Bullock D. and Grossberg S. (1999) How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci. 19, 10,502–10,511.
9. Calabresi P., Maj R., Pisani A., Mercuri N. B. and Bernardi G. (1992) Long-term synaptic depression in the striatum: physiological and pharmacological characterization. J. Neurosci. 12, 4224–4233.
10. Calabresi P., Mercuri N., Stanzione P., Stefani A. and Bernardi G. (1987) Intracellular studies on the dopamine-induced firing inhibition of neostriatal neurons in vitro: evidence for D1 receptor involvement. Neuroscience 20, 757–771.
11. Calabresi P., Pisani A., Mercuri N. B. and Bernardi G. (1992) Long-term potentiation in the striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. Eur. J. Neurosci. 4, 929–935.
12. Calabresi P., Pisani A., Mercuri N. B. and Bernardi G. (1994) Post-receptor mechanisms underlying striatal long-term depression. J. Neurosci. 14, 4871–4881.
13. Calabresi P., Saiardi A., Pisani A., Baik J. H., Centonze D., Mercuri N. B., Bernardi G. and Borrelli E. (1997) Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors. J. Neurosci. 17, 4536–4544.
14. Centonze D., Gubellini P., Picconi B., Calabresi P., Giacomini P. and Bernardi G. (1999) Unilateral dopamine denervation blocks corticostriatal LTP. J. Neurophysiol. 82, 3575–3579.
15. Cepeda C., Chandler S. H., Shumate L. W. and Levine M. S. (1995) Persistent Na+ conductance in medium-sized neostriatal neurons: characterization using infrared videomicroscopy and whole cell patch-clamp recordings. J. Neurophysiol. 74, 1343–1348.
16. Cepeda C. and Levine M. S. (1998) Dopamine and N-methyl-D-aspartate receptor interactions in the neostriatum. Devl Neurosci. 20, 1–18.
18. Choi S. and Lovinger D. M. (1997) Decreased probability of neurotransmitter release underlies striatal long-term depression and postnatal development of corticostriatal synapses. Proc. natn. Acad. Sci. USA 94, 2665–2670.
19. Craik K. (1943) The Nature of Explanation. Cambridge University Press, Cambridge.
20. Dayan P. and Sejnowski T. J. (1996) Exploration bonuses and dual control. Machine Learning 25, 5–22.
21. Decety J. (1996) The neurophysiological basis of motor imagery. Behav. Brain Res. 77, 45–52.
22. Dickinson A. (1980) Contemporary Animal Learning Theory. Cambridge University Press, Cambridge.
23. Dickinson A. and Balleine B. (1994) Motivational control of goal-directed action. Anim. Learn. Behav. 22, 1–18.
24. Duhamel J. R., Colby C. L. and Goldberg M. E. (1992) The updating of the representation of visual space in parietal cortex by intended eye movements. Science 255, 90–92.
25. Durstewitz D., Kelc M. and Gunturkun O. (1999) A neurocomputational theory of the dopaminergic modulation of working memory functions. J. Neurosci. 19, 2807–2822.
26. Durstewitz D., Seamans J. K. and Sejnowski T. J. (2000) Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. J. Neurophysiol. 83, 1733–1750.
27. Fel'dbaum A. A. (1965) Optimal Control Systems. Academic, New York.
28. Fellous J. M. and Linster C. (1998) Computational models of neuromodulation. Neural Comput. 10, 771–805.
29. Garcia C. E., Prett D. M. and Morari M. (1989) Model predictive control: theory and practice – a survey. Automatica 25, 335–348.
30. Goldberg M. E. and Bruce C. J. (1990) Primate frontal eye fields. III. Maintenance of a spatially accurate saccade signal. J. Neurophysiol. 64, 489–508.
31. Gonon F. (1997) Prolonged and extrasynaptic excitatory action of dopamine mediated by D1 receptors in the rat striatum in vivo. J. Neurosci. 17, 5972–5978.
32. Graybiel A. M. (1990) Neurotransmitters and neuromodulators in the basal ganglia. Trends Neurosci. 13, 244–254.
33. Gullapalli V. (1990) A stochastic reinforcement algorithm for learning real-valued functions. Neural Networks 3, 671–692.
34. Hernandez-Lopez S., Bargas J., Reyes A. and Galarraga E. (1996) Dopamine modulates the afterhyperpolarization in neostriatal neurones. NeuroReport 7, 454–456.
35. Hernandez-Lopez S., Bargas J., Surmeier D. J., Reyes A. and Galarraga E. (1997) D1 receptor activation enhances evoked discharge in neostriatal medium spiny neurons by modulating an L-type Ca2+ conductance. J. Neurosci. 17, 3334–3342.
36. Houk J. C., Adams J. L. and Barto A. G. (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.). MIT, Cambridge, MA.
37. Hull C. L. (1952) A Behavioral System: An Introduction to Behavior Theory Concerning the Individual Organism. Yale University Press, New Haven, CT.
38. Katayama Y., Tsubokawa T. and Moriyasu N. (1980) Slow rhythmic activity of caudate neurons in the cat: statistical analysis of caudate neuronal spike trains. Expl Neurol. 68, 310–321.
39. Lange K. W., Robbins T. W., Marsden C. D., James M., Owen A. M. and Paul G. M. (1992) L-dopa withdrawal in Parkinson's disease selectively impairs cognitive performance in tests sensitive to frontal lobe dysfunction. Psychopharmacology, Berlin 107, 394–404.
40. MacCorquodale K. and Meehl P. E. (1954) Section 2: Edward C. Tolman. In Modern Learning Theory (ed. Estes W. K.). Appleton-Century-Crofts, New York.
41. Mackintosh N. M. (1974) The Psychology of Animal Learning. Academic, London.
42. Montague P. R., Dayan P. and Sejnowski T. J. (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, 1936–1947.
43. Nakamura K. and Colby C. L. (1999) Updating of the visual representation in monkey striate and extrastriate cortex during saccades. Soc. Neurosci. Abstr. 25, 1163.
44. Nisenbaum E. S., Xu Z. C. and Wilson C. J. (1994) Contribution of a slowly inactivating potassium current to the transition to firing of neostriatal spiny projection neurons. J. Neurophysiol. 71, 1174–1189.
45. Pacheco-Cano M. T., Bargas J., Hernandez-Lopez S., Tapia D. and Galarraga E. (1996) Inhibitory action of dopamine involves a subthreshold Cs+-sensitive conductance in neostriatal neurons. Expl Brain Res. 110, 205–211.
46. Pennartz C. M. (1995) The ascending neuromodulatory systems in learning by reinforcement: comparing computational conjectures with experimental findings. Brain Res. Rev. 21, 219–245.
47. Piaget J. (1954) The Construction of Reality in the Child. Basic Books, New York.
48. Pineda J. C., Galarraga E., Bargas J., Cristancho M. and Aceves J. (1992) Charybdotoxin and apamin sensitivity of the calcium-dependent repolarization and the afterhyperpolarization in neostriatal neurons. J. Neurophysiol. 68, 287–294.
49. Redgrave P., Prescott T. J. and Gurney K. (1999) Is the short-latency dopamine response too short to signal reward error? Trends Neurosci. 22, 146–151.
50. Robbins T. W., Granon S., Muir J. L., Durantou F., Harrison A. and Everitt B. J. (1998) Neural systems underlying arousal and attention. Implications for drug abuse. Ann. N. Y. Acad. Sci. 846, 222–237.
51. Rolls E. T., Thorpe S. J., Boytim M., Szabo I. and Perrett D. I. (1984) Responses of striatal neurons in the behaving monkey. 3. Effects of iontophoretically applied dopamine on normal responsiveness. Neuroscience 12, 1201–1212.
52. Rutherford A., Garcia-Munoz M. and Arbuthnott G. W. (1988) An afterhyperpolarization recorded in striatal cells in vitro: effect of dopamine administration. Expl Brain Res. 71, 399–405.
53. Salamone J. D. (1992) Complex motor and sensorimotor functions of striatal and accumbens dopamine: involvement in instrumental behavior processes. Psychopharmacology, Berlin 107, 160–174.
54. Salamone J. D., Cousins M. S. and Snyder B. J. (1997) Behavioral functions of nucleus accumbens dopamine: empirical and conceptual problems with the anhedonia hypothesis. Neurosci. Biobehav. Rev. 21, 341–359.
55. Schultz W. (1998) Predictive reward signal of dopamine neurons. J. Neurophysiol. 80, 1–27.
56. Schultz W., Apicella P., Romo R. and Scarnati E. (1995) Context-dependent activity in primate striatum reflecting past and future behavioral events. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.). MIT, Cambridge, MA.
57. Schultz W., Dayan P. and Montague P. R. (1997) A neural substrate of prediction and reward. Science 275, 1593–1599.
58. Servan-Schreiber D., Bruno R. M., Carter C. S. and Cohen J. D. (1998) Dopamine and the mechanisms of cognition: Part I. A neural network model predicting dopamine effects on selective attention. Biol. Psychiat. 43, 713–722.
59. Skinner B. F. (1938) The Behavior of Organisms: An Experimental Analysis. Appleton Century, New York.
60. Smith A. D. and Bolam J. P. (1990) The neural network of the basal ganglia as revealed by the study of synaptic connections of identified neurons. Trends Neurosci. 13, 259–265.
61. Stern E. A., Kincaid A. E. and Wilson C. J. (1997) Spontaneous subthreshold membrane potential fluctuations and action potential variability of rat corticostriatal and striatal neurons in vivo. J. Neurophysiol. 77, 1697–1715.
62. Suri R. E. and Arbib M. A. (1998) Modeling sensorimotor learning in striatal projection neurons. Soc. Neurosci. Abstr. 24, 174.
63. Suri R. E. and Schultz W. (1998) Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Expl Brain Res. 121, 350–354.
64. Suri R. E. and Schultz W. (1999) A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience 91, 871–890.
65. Surmeier D. J., Bargas J., Hemmings H. C. Jr, Nairn A. C. and Greengard P.
(1995) Modulation of calcium currents by a D1 dopaminergic protein kinase/phosphatase cascade in rat neostriatal neurons. Neuron 14, 385±397. 66. Surmeier D. J., Eberwine J., Wilson C. J., Cao Y., Stefani A. and Kitai S. T. (1992) Dopamine receptor subtypes colocalize in rat striatonigral neurons. Proc. natn. Acad. Sci. USA 89, 10,178±10,182. 67. Sutton R. S. and Barto A. G. (1981) An adaptive network that constructs and uses an internal model of its world. Cogn. Brain Theory 4, 217±246. 68. Sutton R. S. and Barto A. G. (1990) Time derivative models of Pavlovian reinforcement. In Learning and Computational Neuroscience: Foundations of Adaptive Networks (eds Gabriel M. and Moore J.). MIT, Cambridge, MA. 69. Sutton R. S. and Barto A. G. (1998) Reinforcement Learning: An Introduction, MIT, Bradford Books, Cambridge, MA (available at http://wwwanw.cs.umass.edu/,rich/book/the-book.html). 70. Sutton R. S. and Pinette B. (1985) The learning of world models by connectionist networks. Proceedings of the Seventh Annual Conference of the Cognitive Science Society, Lawrence Erlbaum, Irvine, CA. 71. Taylor A. E. and Saint-Cyr J. A. (1995) The neuropsychology of Parkinson's disease. Brain Cogn. 28, 281±296. 72. Thistlethwaite D. (1951) A critical review of latent learning and related experiments. Psychol. Bull. 48, 97±129. 73. Thrun S. B. (1992) The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches (eds White D. A. and Sofge D. A.). Van Nostrand Reinhold, New York. 74. Umeno M. M. and Goldberg M. E. (1997) Spatial processing in the monkey frontal eye ®eld. I. Predictive visual responses. J. Neurophysiol. 78, 1373±1383.
75. Walker M. F., Fitzgibbon E. J. and Goldberg M. E. (1995) Neurons in the monkey superior colliculus predict the visual result of impending saccadic eye movements. J. Neurophysiol. 73, 1988–2003.
76. Wallesch C. W., Karnath H. O., Papagno C., Zimmermann P., Deuschl G. and Lucking C. H. (1990) Parkinson's disease patients' behaviour in a covered maze learning task. Neuropsychologia 28, 839–849.
77. Wickens J. R., Begg A. J. and Arbuthnott G. W. (1996) Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70, 1–5.
78. Wickens J. R. and Wilson C. J. (1998) Regulation of action-potential firing in spiny neurons of the rat neostriatum in vivo. J. Neurophysiol. 79, 2358–2364.
79. Williams G. V. and Millar J. (1990) Concentration-dependent actions of stimulated dopamine release on neuronal activity in rat striatum. Neuroscience 39, 1–16.
80. Wilson C. J. (1993) The generation of natural firing patterns in neostriatal neurons. Prog. Brain Res. 99, 277–297.
81. Wilson C. J. and Groves P. M. (1981) Spontaneous firing patterns of identified spiny neurons in the rat neostriatum. Brain Res. 220, 67–80.
82. Wilson C. J. and Kawaguchi Y. (1996) The origins of two-state spontaneous membrane potential fluctuations of neostriatal spiny neurons. J. Neurosci. 16, 2397–2410.
83. Wolpert D. M., Ghahramani Z. and Jordan M. I. (1995) An internal model for sensorimotor integration. Science 269, 1880–1882.
84. Young A. M., Ahier R. G., Upton R. L., Joseph M. H. and Gray J. A. (1998) Increased extracellular dopamine in the nucleus accumbens of the rat during associative learning of neutral stimuli. Neuroscience 83, 1175–1183.

(Accepted 17 November 2000)