Pergamon
Neuroscience Vol. 91, No. 3, pp. 871–890, 1999
Copyright © 1999 IBRO. Published by Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
0306-4522/99 $20.00+0.00
PII: S0306-4522(98)00697-6
A NEURAL NETWORK MODEL WITH DOPAMINE-LIKE REINFORCEMENT SIGNAL THAT LEARNS A SPATIAL DELAYED RESPONSE TASK

R. E. SURI and W. SCHULTZ*

Institute of Physiology and Program in Neuroscience, University of Fribourg, CH-1700 Fribourg, Switzerland
Abstract—This study investigated how the simulated response of dopamine neurons to reward-related stimuli could be used as a reinforcement signal for learning a spatial delayed response task. Spatial delayed response tasks assess the functions of frontal cortex and basal ganglia in short-term memory, movement preparation and expectation of environmental events. In these tasks, a stimulus appears for a short period at a particular location, and after a delay the subject moves to the location indicated. Dopamine neurons are activated by unpredicted rewards and reward-predicting stimuli, are not influenced by fully predicted rewards, and are depressed by omitted rewards. Thus, they appear to report an error in the prediction of reward, which is the crucial reinforcement term in formal learning theories. Theoretical studies on reinforcement learning have shown that signals similar to dopamine responses can be used as effective teaching signals for learning. A neural network model implementing the temporal difference algorithm was trained to perform a simulated spatial delayed response task. The reinforcement signal was modeled according to the basic characteristics of dopamine responses to novel stimuli, primary rewards and reward-predicting stimuli. A Critic component analogous to dopamine neurons computed a temporal error in the prediction of reinforcement and emitted this signal to an Actor component which mediated the behavioral output. The spatial delayed response task was learned via two subtasks introducing spatial choices and temporal delays, in the same manner as monkeys in the laboratory. In all three tasks, the reinforcement signal of the Critic developed in a similar manner to the responses of natural dopamine neurons in comparable learning situations, and the learning curves of the Actor replicated the progress of learning observed in the animals. Several manipulations demonstrated further the efficacy of the particular characteristics of the dopamine-like reinforcement signal. Omission of reward induced a phasic reduction of the reinforcement signal at the time of the reward and led to extinction of learned actions. A reinforcement signal without prediction error resulted in impaired learning because of perseverative errors. Loss of learned behavior was seen with sustained reductions of the reinforcement signal, a situation in general comparable to the loss of dopamine innervation in Parkinsonian patients and experimentally lesioned animals. The striking similarities in teaching signals and learning behavior between the computational and biological results suggest that dopamine-like reward responses may serve as effective teaching signals for learning behavioral tasks that are typical for primate cognitive behavior, such as spatial delayed responding. © 1999 IBRO. Published by Elsevier Science Ltd.

Key words: striatum, basal ganglia, synaptic plasticity, teaching signal, temporal difference.
*To whom correspondence should be addressed.
Abbreviation: TD, temporal-difference.

A large body of experimental evidence has established general principles of associative learning which hold for a broad spectrum of learning situations. 15 This has led to the development of mathematical models according to which learning is dependent on the degree of unpredictability of primary reinforcers. 56 These theories suggest that reinforcers become progressively less efficient for task acquisition as their predictability grows during the course of learning. The difference between the actual occurrence of a reinforcer and its prediction is usually referred to as an 'error' in the prediction of reinforcement. This concept has been employed in the temporal-difference (TD) algorithm of reinforcement learning, which computes the prediction error
continuously in real time and uses it as the "Effective Reinforcement Signal". 78 The Effective Reinforcement Signal is positive when unpredicted reinforcement occurs but is nil when reinforcement is predicted by a conditioned stimulus. It is negative when predicted reinforcement is omitted. In addition, the Effective Reinforcement Signal is positive when a conditioned stimulus predicts the reinforcer. Learning in this model is based on adapting the connective weights of synapses according to this signal. TD models are more efficient than most conventional reinforcement models in learning a wide variety of behavioral tasks, from balancing a pole on a cart 4 to playing world-class backgammon. 81 Robots using TD algorithms learn to move about two-dimensional space and avoid obstacles, reach and grasp 16 or insert a peg into a hole. 28 In biological
applications, TD models replicate foraging behavior of honeybees, 49 simulate human decision making 50 and learn eye movements 18,48 and movement sequences. 75

The short-latency response of dopamine neurons to primary rewards and conditioned, reward-predicting stimuli 42,46,60,62,66 might constitute a biological implementation of a TD reinforcement signal. 30,48,50,67 Similar to the Effective Reinforcement Signal, activations of dopamine neurons occur only following rewards that are unpredictable and not following fully predicted rewards. Depressions occur when a predicted reward is omitted. This short-latency response is largely restricted to appetitive stimuli and occurs rarely with unconditioned or conditioned non-noxious aversive stimuli. 47 However, at variance with the reinforcement signal of basic TD models, dopamine neurons are also activated by novel stimuli, physically particularly salient events and stimuli resembling reward-related stimuli. 42,47,72

The present study investigated the potential efficacy of a dopamine-like TD reinforcement signal for learning a behavioral task that is typical for the functions of prefrontal cortex and basal ganglia. In a first step, we modeled a teaching signal which replicated as closely as possible the natural responses of dopamine neurons in behaving monkeys. A previous study showed the general resemblance between the reinforcement term of the TD algorithm and dopamine responses. 50 In the present study, we included a number of additional features of biological dopamine responses, such as the effects of temporal variations of reward delivery, 29 the processing of novel and physically salient stimuli 42,55,72 and the different response components revealed by response generalization. 47 We then studied how a TD model with this reinforcement signal learned a spatial delayed response task in several steps, analogous to the learning behavior of macaque monkeys studied in the laboratory. 66 This task comprises spatial discrimination, stimulus–response association, expectation of external signals, movement preparation and working memory and assesses some of the prominent cognitive functions of the prefrontal cortex and basal ganglia. 1,6,19,20,33,39,80,83 This approach allowed us to compare the computational Effective Reinforcement Signal with the biological dopamine responses over distinctive steps of learning, and to relate the performance of the model to the behavior of the monkeys in the laboratory. The results were previously presented in abstract form. 74
EXPERIMENTAL PROCEDURES

Behavioral situations

The basic characteristics of dopamine neurons were modeled in three behavioral situations. (1) Free reward: Outside of any behavioral task, drops of apple juice were delivered to the mouth of the animal at unpredictable moments. 46 (2) Novel and conditioned stimuli: A novel visual or auditory stimulus was presented. After several repetitions, animals were operantly conditioned in reaction time tasks to perform a reaching movement in order to receive a reward. 42,46 These conditioned stimuli can be considered as reliable reward predictors. (3) Changed stimulus–reward interval: The interval between the last stimulus predicting the reward and the reward itself was shortened or prolonged for a few trials in animals familiar with a fixed interval. 29

The development and use of the dopamine-like reinforcement signal were studied during the learning of a simulated spatial delayed response task. Animals learn complicated tasks better by approach through intermediate tasks than by trial-and-error acquisition of the complete task. In the behavioral experiments to be modeled we used two successive intermediate tasks, a spatial choice task and an instructed spatial task. 66 (1) Spatial choice task (Fig. 1A): The animal was presented with three lights and two levers. The light above the left or right hand lever was illuminated in combination with the center light. The animal was rewarded for releasing the resting key and pressing the corresponding left or right lever. (2) Instructed spatial task (Fig. 1B): The left or right instruction light was illuminated 1 s before the center light which served as trigger for eliciting the movement. The animal was rewarded for pressing the lever indicated by the instruction. (3) Spatial delayed response task (Fig. 1C): The left or right instruction light was presented for 1 s, followed by the trigger light after a random delay of 1.5–2.5 s. Following the trigger light, the animal pressed the lever corresponding to the side of the instruction. Incorrect actions, such as withholding of movement, premature resting key release or lever press, or incorrect lever press terminated the trial without reward. Each task was run for several hundred trials.

General architecture of the model

A modified TD algorithm was implemented with an explicit Critic–Actor architecture 3 which resembles the general architecture of the basal ganglia 63 (Fig. 2). The Critic component generated an Effective Reinforcement Signal and sent it to the Actor component which learned and executed behavioral output. The Effective Reinforcement Signal of the Critic was modeled according to the basic characteristics of dopamine responses to novel stimuli, rewards and reward-predicting stimuli. The model did not generate responses to aversive stimuli, as these stimuli rarely lead to phasic responses in dopamine neurons 47 (aversive stimuli mostly induce only slower activating and depressant impulse responses 68 and increased dopamine release at a slower time-scale through presynaptic interactions 44,61). Similar to previous models of dopamine activity 50,67,75 we modeled firing rate and used time steps of 100 ms duration. Although this loses potential information contained in the intervals between impulses, data on interval coding of dopamine neurons are presently not available from behaving animals and thus constitute a factor of uncertainty for biologically plausible models. The Actor learned to perform associations between stimuli and actions in a spatial delayed response task using a similar learning schedule as described above for the monkeys in the laboratory.

Effective reinforcement signal of the Critic component
Primary reward. Experimental findings. Dopamine neurons are activated by unpredicted food and liquid rewards. 46,60 They are also activated by rewards during the first few stimulus–reward pairings. However, the activation following the reward decreases after repeated pairings and disappears entirely upon completion of learning. 42,46 In
established tasks, dopamine neurons are activated by reward that occurs earlier or later than predicted. 29 Thus dopamine neurons are only activated by rewards occurring unpredictably in time. By contrast, dopamine neurons are depressed in their activity when a fully predicted reward is omitted. 29,66 The depression occurs at the time at which the reward would have occurred, despite the absence of any stimulus at that time. It thus appears that dopamine neurons code an error in the temporal prediction of reward.

Fig. 1. Behavioral tasks. A trial was initiated when the animal kept its right hand on a resting key. Illumination of an instruction light (round circle) above the left or right lever indicated the target of reaching. The trigger light (center square) determined the time when the resting key should be released and the lever should be touched. Left and right instruction lights alternated semirandomly (maximally three trials on the same side). (A) In the spatial choice task, instruction and trigger lights appeared at the same time, and both lights extinguished upon lever touch. (B) In the instructed spatial task, the trigger light appeared 1 s after the instruction light. (C) In the spatial delayed response task, the instruction was presented for 1 s. The trigger appeared 1.5–2.5 s after extinction of the instruction. Reward was given 0.5 s after correct lever touch in all tasks. In the model, the left instruction light corresponded to stimulus A, the center trigger light to stimulus B, the right instruction light to stimulus C, the left lever to action X and the right lever to action Z. Tasks A and B were intermediate tasks during the learning of the spatial delayed response task (C). 66 This procedure assures better task acquisition in monkeys than trial-and-error learning in a single step.
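For readers who wish to simulate the task protocols, the timing described above and in Fig. 1 can be written down compactly. The sketch below is only an illustration of the trial structure, assuming 100-ms model time steps; the reaction-time constant and the function name are ours and not part of the original implementation.

```python
import numpy as np

def trial_timeline(task, rng=np.random.default_rng(0)):
    """Onset times (s) of instruction, trigger and reward for one simulated trial.

    task: "choice", "instructed" or "delayed"; timing follows Fig. 1.
    """
    t_instruction = 1.0                          # instruction light on (arbitrary trial-start offset)
    if task == "choice":
        t_trigger = t_instruction                # instruction and trigger appear together
    elif task == "instructed":
        t_trigger = t_instruction + 1.0          # trigger 1 s after instruction onset
    elif task == "delayed":
        # instruction shown for 1 s, trigger 1.5-2.5 s after instruction offset
        t_trigger = t_instruction + 1.0 + rng.uniform(1.5, 2.5)
    else:
        raise ValueError(task)
    reaction_time = 0.3                          # assumed lever-touch latency (illustrative only)
    t_reward = t_trigger + reaction_time + 0.5   # reward 0.5 s after correct lever touch
    return {"instruction": t_instruction, "trigger": t_trigger, "reward": t_reward}

print(trial_timeline("delayed"))
```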
Model. The ingestion of liquid and food rewards leads to vegetative changes in the body. Results of food discrimination experiments suggest that the effects of rewards ultimately depend on vegetative reactions. 31,59 The phasic visual, auditory, olfactory or somatosensory stimulus components of rewards would be associated with the taste induced on the tongue, and the taste in turn would be associated with the slower vegetative effects. In order to account for this, we modeled the reward as a composite of the
predictive stimulus el(t) followed by the primary reward λ(t) itself. The predictive stimulus corresponds to the phasic sensory stimulus of liquid or food, and the primary reward reflects the slower activation of gustatory receptors and the vegetative reactions. The Critic component of the model computes the prediction error and emits it as Effective Reinforcement Signal (Fig. 2). In order to model the temporal aspects of dopamine responses, we assured a prolonged influence of every stimulus by representing it as a fixed temporal pattern of signals following the stimulus. For learning the duration of a particular stimulus–reinforcer interval, the component of the temporal stimulus representation with a corresponding duration has to be characteristically different from the components matching other stimulus–reinforcer intervals. Mathematical considerations do not suggest a specific time-course of the stimulus representation components as TD models should learn independently of the chosen temporal stimulus representation. Since the temporal representation constitutes a memory process, we define the temporal representation xlm(t) of a stimulus as a series of sustained components which are elicited by the stimulus and decay after varying time-intervals. As longer time-intervals are more difficult to learn than shorter ones, the longer representation components are progressively smaller than the shorter ones (Fig. 3A). Similar models of temporal coding with large phasic components followed by progressively smaller tonic components have been suggested previously, 25 on the basis of behavioral evidence for biological timing mechanisms, 13,21 and would correspond to sustained activity observed in striatal and cortical neurons. 1,2,39,64,65,80,83 The adaptive weights Wlm are used to associate the temporal stimulus representation xlm(t) of stimulus l with the reward prediction Pl(t):

Pl(t) = Σm Wlm xlm(t)    (1)

Fig. 2. Overview of model architecture. The model consists of an Actor component (left) and a Critic component (right). Actor and Critic receive input stimuli 1 and 2 which are coded as functions of time, e1(t) and e2(t), respectively. The Critic computes the Effective Reinforcement Signal r(t) which serves to modify the weights wlm of the Critic and the weights νnl of the Actor at the adaptive synapses (heavy dots). (Critic) The function of the Critic is to associate the input stimuli 1 and 2 with the Effective Reinforcement Signal r(t). Every stimulus l is represented as a series of components xlm of different durations. Each of these components influences the reward prediction signal according to its own adaptive weight wlm. This form of temporal stimulus representation allows the Critic to learn the correct duration of the stimulus–reward interval. For every stimulus l, a specific prediction Pl(t) is computed as the weighted sum of the representational components. A winner-take-all competition between predictions Pl(t) of different stimuli sets all representational components to zero except the strongest one. The change in the prediction of a stimulus l is computed by taking the temporal difference between successive time steps, γPl(t) − Pl(t − 1), where Pl(t − 1) is the previous prediction, Pl(t) the current prediction and γ = 0.98 per 100 ms the discounting factor. The temporal difference of successive predictions is summed over all stimuli and added to the primary reinforcement signal input λ(t) coding the reward. The result of this summation is the Effective Reinforcement Signal r(t) which codes the error in the prediction of reward. (Actor) The Actor learns to associate stimuli with behavioral actions. A winner-take-all rule prevents the Actor from performing two actions at the same time.
The weights Wlm are adapted during learning. Before learning, the prediction Pl(t) is nil for all times t if a stimulus is physically weak. After learning, Pl(t) is proportional to the amount of reward during the stimulus–reward interval, and nil at other times. For several stimuli, the prediction P(t) is the sum over the reward predictions associated with all stimuli:

P(t) = Σl Pl(t)    (2)
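As a concrete reading of Eqns 1 and 2, the sketch below builds one plausible temporal stimulus representation (a phasic onset component followed by smaller, longer-lasting components, as in Fig. 3A) and computes the stimulus-specific and summed predictions. The component shapes and array sizes are our assumptions; the exact time-courses are defined in the paper's appendix, which is not reproduced here.

```python
import numpy as np

RHO = 0.94                 # decrease per representation component (Table 1)
N_COMP, N_STEPS = 30, 60   # components per stimulus and 100-ms steps per trial (assumptions)

def stimulus_representation(onset):
    """x[m, t] for one stimulus: phasic first component at onset, then smaller
    but longer-lasting components starting 100 ms later (shapes are illustrative)."""
    x = np.zeros((N_COMP, N_STEPS))
    x[0, onset] = 1.0                              # phasic onset component
    for m in range(1, N_COMP):
        stop = min(onset + 1 + 2 * m, N_STEPS)     # later components last progressively longer
        x[m, onset + 1:stop] = RHO ** m            # and are progressively smaller
    return x

def predictions(x_per_stimulus, W):
    """Eqn 1: Pl(t) = sum_m Wlm xlm(t); Eqn 2: P(t) = sum_l Pl(t)."""
    P_l = np.array([W[l] @ x_per_stimulus[l] for l in range(len(W))])
    return P_l, P_l.sum(axis=0)

x_A = stimulus_representation(onset=5)
W = np.zeros((1, N_COMP)); W[0, 0] = 0.2           # illustrative weights; only the first component non-zero
P_l, P = predictions([x_A], W)
```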
The response of dopamine neurons is modeled as the Effective Reinforcement Signal

r(t) = d + λ(t) + γP(t) − P(t − 1)    (3)
The primary reinforcement signal λ(t) is positive at the time of primary reinforcement and zero at other times. One step of time t corresponds to 100 ms. The constant d is set to d = 0.5 and accounts for the baseline firing rate of dopamine neurons, which is 2–4 impulses/s. The discounting factor γ accounts for the decreasing motivational value of progressively more distant rewards and is set to γ = 0.98 per time-step (time-steps of 100 ms). Equation 3 closely follows previous computational studies in reinforcement learning. 3,5,76,78 Its full form is presented in the appendix. The TD algorithm is designed to adapt the weights of synapses so as to minimize prediction errors. Without a prediction error, the Effective Reinforcement Signal r(t) is equal to the baseline d. The weights are adapted in proportion to prediction errors, which constitutes an underlying mechanism of learning rules in engineering 4,35 and psychology. 15,56,78 Similar to a previous proposal, 37 the weights Wlm are changed according to a learning rule in which the changes of the temporal stimulus representation xlm(t) are multiplied with the prediction error signal r(t) − d:

Wlm(t) = Wlm(t − 1) + ηc [r(t) − d] ⌊xlm(t − 1) − γxlm(t)⌋+    (4)

with ηc as the learning rate. The brackets ⌊ ⌋+ indicate that a negative number is set to zero and a positive number remains unchanged. The factor ⌊xlm(t − 1) − γxlm(t)⌋+ in the learning rule is positive only for the decreasing component of the temporal stimulus representation which covers the time-interval between stimulus onset and the actual time t. Therefore, for an unexpected reinforcer only the weight for the particular stimulus representation component is adapted that covers the actual stimulus–reinforcer interval, and the other weights remain unchanged.

Conditioned stimuli. Experimental findings. Dopamine neurons are activated by conditioned visual and auditory stimuli that have been repeatedly paired with a reward. 42,46,69 In the case of successive reward-predicting stimuli, dopamine neurons are activated by the earliest stimulus. 66 Activation magnitude decreases with increasing intervals between the first reward-predicting stimulus and the reward. According
to the present evidence, 66 an interval increase by 1 s results in an approximately 20% decrease of activation (see also Fig. 6C, bottom, and Fig. 7B, bottom).

Fig. 3. (A) Temporal stimulus representation in the Critic component of the model. The physical salience function e1(t) takes value one when the stimulus is present and zero otherwise. The magnitude of the temporal stimulus representation x1m(t) is proportional to the physical salience of the stimulus e1(t). The first component of the temporal stimulus representation x11(t) is phasically increased following stimulus onset. All following components begin 100 ms later and consist of smaller but longer increases. (B) Representation reset of weaker predictive stimuli by the winner-take-all mechanism in the Critic. All temporal representation components of the preceding stimulus 1 are set to zero when the prediction P1(t) associated with this stimulus is smaller than the prediction P2(t) from the subsequent stimulus 2. (The parameter ρ describes the decrease in magnitude for the successive components and was set for illustrative reasons to 0.6/component in this figure.)
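Putting Eqns 3 and 4 together, a minimal sketch of the Critic computation could look as follows. It assumes the prediction P(t) and primary reinforcement λ(t) are available as arrays over 100-ms time steps and applies the weight update once per step after the fact; the real model updates on-line during the trial, and ηc here stands for the acquisition rate of Table 1.

```python
import numpy as np

D, GAMMA, ETA_C = 0.5, 0.98, 0.08    # baseline, discount per time step, Critic learning rate (Table 1)

def effective_reinforcement(P, lam):
    """Eqn 3: r(t) = d + lambda(t) + gamma*P(t) - P(t-1), taking P(-1) = 0."""
    P_prev = np.concatenate(([0.0], P[:-1]))
    return D + lam + GAMMA * P - P_prev

def update_critic_weights(W, x, r):
    """Eqn 4 for one stimulus: Wm += eta_c * (r(t) - d) * max(xm(t-1) - gamma*xm(t), 0)."""
    W = W.copy()
    for t in range(1, x.shape[1]):
        eligibility = np.maximum(x[:, t - 1] - GAMMA * x[:, t], 0.0)   # only the decaying components
        W += ETA_C * (r[t] - D) * eligibility
    return W
```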
Model. Most of the processing of conditioned stimuli follows from Eqns 1–4. The prediction of reward P(t) is zero for all times t when a known, physically weak stimulus is paired with a reward, as this stimulus will not elicit a dopamine response. After conditioning, the prediction P(t) computed in Eqns 1–4 increases rapidly at the time of the conditioned stimulus. For this reason, the difference of predictions between successive time-steps P(t) − P(t − 1) increases phasically and induces an increase of the Effective Reinforcement Signal at the time of the conditioned stimulus (Eqn 3). The discounting factor γ in the TD algorithm is used to indicate that rewards predicted in the near future increase the prediction more significantly than rewards in the far future. 3,76,78 For this reason, the correctly learned prediction P(t) is equal to the discounted sum of future rewards. For the present model, γ was estimated according to the magnitude of dopamine responses for different stimulus–reward intervals as γ = 0.98 per time-step of 100 ms. This decreases the magnitude of the Effective Reinforcement Signal at the time of the conditioned stimulus by 20% for each increase of 1 s of the stimulus–reward interval (0.98^10 ≈ 0.8).

Novel stimuli, physically salient stimuli and stimulus generalization. Experimental findings. Dopamine neurons are activated by novel stimuli, physically salient stimuli and stimuli resembling reward-related stimuli. These responses have different forms and temporal characteristics as compared to responses to conditioned stimuli. They consist of activations lasting about 100 ms, followed by depressions below baseline of similar or slightly longer durations. 42,55,72,73 These responses disappear completely after repeated stimulus presentation, 42,55,72 but are only partially reduced with very intense stimuli. 73 Dopamine neurons show generalization
responses with similar biphasic forms to stimuli that physically resemble conditioned, reward-predicting stimuli. 47,69
Model. The first component of the temporal stimulus representation is defined as a phasic increase at stimulus onset (see Fig. 3A, the first and largest component). Novelty responses were modeled by introducing non-zero initial values of the weights Wl1 associated with the first phasic component of the temporal stimulus representation of a stimulus l:

Wl1(t = 0) = 0.2 and Wl,m≠1(t = 0) = 0    (5)
With these initial weights (Table 1), the Effective Reinforcement Signal following a novel, physically salient stimulus consists of a phasic increase followed by a decrease. Novelty and generalization responses are less directly related to the appetitive characteristics of stimuli than responses to conditioned, reward-predicting stimuli and might originate from fast but unspecific neural systems. The separation into fast and slow systems is modeled by assuming early and late components in the temporal stimulus representation (Fig. 3A). In order to reproduce the slow extinction of novelty and generalization responses, 29,42,73 we introduce a particular learning rate ηc− for decreasing the weights Wl1 associated with novelty and generalization, which is smaller than the usual learning rate ηc+ of the Critic (Table 1).

Temporal aspects in reward prediction. Experimental findings. The reward prediction inherent in dopamine activity also concerns the precise timing of reward following a conditioned stimulus. Although dopamine neurons fail to respond to rewards occurring at a predicted time, they are strongly activated by rewards occurring earlier or later than predicted. 29 In addition, a depression is observed at the usual time of reward when a reward occurs later than predicted, but not when it occurs too early.
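A small sketch of the novelty mechanism described above, under the values of Table 1: only the first, phasic representation component starts with a positive weight (Eqn 5), and decreases of that weight use the slower rate ηc−. The helper function and its arguments are our own naming.

```python
import numpy as np

N_STIMULI, N_COMP = 4, 30
ETA_C_PLUS, ETA_C_MINUS = 0.08, 0.002   # Table 1

# Eqn 5: positive initial weight only for the first (phasic) component,
# which produces the biphasic novelty response to a new stimulus.
W = np.zeros((N_STIMULI, N_COMP))
W[:, 0] = 0.2

def critic_learning_rate(component, weight_change):
    """Weight decreases of the first component extinguish slowly (novelty and
    generalization responses); all other updates use the usual rate."""
    if component == 0 and weight_change < 0:
        return ETA_C_MINUS
    return ETA_C_PLUS
```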
Table 1. Constants used in simulations

Critic
  Initial weights for m ≠ 1**                                wlm    0
  Initial weights for m = 1**                                wl1    0.2
  Learning rate for m ≠ 1, and for m = 1 in acquisition**    ηc+    0.08
  Learning rate for m = 1 in extinction**                    ηc−    0.002
  Baseline of Effective Reinforcement Signal                 d      0.5
  Discount factor                                            γ      0.98*
  Decrease of the onset activations of the stimulus
    representation components                                ρ      0.94*
  Reset duration                                             τ      10 s

Actor
  Initial weights                                            νnl    0.4
  Learning rate for acquisition                              ηa+    0.1
  Learning rate for extinction                               ηa−    0.7
  Maximum of random distribution                             σ      0.5
  Decay of stimulus trace and action trace                   δ      0.96*

*Values with respect to 100 ms time-steps. **m = 1 indicates the first component of the temporal stimulus representation.
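For convenience, the constants of Table 1 can be collected in a single mapping. The dictionary keys below are our own naming; the values are taken from the table (starred entries are per 100-ms time step).

```python
# Constants of Table 1 gathered in one place (illustrative naming).
MODEL_CONSTANTS = {
    "critic": {
        "w_lm_init": 0.0,      # initial weights, m != 1
        "w_l1_init": 0.2,      # initial weights, m = 1 (novelty response)
        "eta_c_plus": 0.08,    # learning rate (m != 1, and m = 1 in acquisition)
        "eta_c_minus": 0.002,  # learning rate (m = 1 in extinction)
        "d": 0.5,              # baseline of the Effective Reinforcement Signal
        "gamma": 0.98,         # discount factor per 100-ms time step
        "rho": 0.94,           # decrease of the stimulus representation components
        "tau_reset": 10.0,     # reset duration (s)
    },
    "actor": {
        "nu_init": 0.4,        # initial weights
        "eta_a_plus": 0.1,     # learning rate for acquisition
        "eta_a_minus": 0.7,    # learning rate for extinction
        "sigma_max": 0.5,      # maximum of the random perturbation
        "delta_trace": 0.96,   # decay of stimulus and action traces per time step
    },
}
```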
Model. The basic TD algorithm reproduces well the depression and activation in relation to delayed rewards. 50 However, the Effective Reinforcement Signal of the basic TD algorithm computes a decrease at the predicted time of reward when reward occurs earlier than predicted, which is at variance with dopamine responses. 29 There are several possibilities to account for this experimentally observed nonlinear effect of the early reward. We assume that the surprisingly early reward induces attention, which disrupts the representation of the reward-predicting stimulus. Note that in the present model, reward delivery can act as a stimulus and has a predictive component, because the final reward is assumed to be associated with a vegetative state (see Results). As the event with the highest predictive component would induce the highest level of attention, we introduced a winner-take-all competition between predictions, in analogy to lateral inhibition between neighboring neurons. The attention shift to the stimulus with the higher reward prediction P(t) could disrupt earlier stimulus representations with lower prediction P(t − 1) and set the representation of the losing stimulus to nil (Fig. 3B):

If Pstimulus onset(t) = Σl Wl1 xl1(t) > P(t − 1), then xl,m≠1(t) = 0    (6)
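A sketch of the representation reset of Eqn 6: when the onset prediction carried by the first components of a newly appearing stimulus exceeds the prediction from the previous time step, the sustained components of the earlier representations are zeroed. How long the reset lasts (the 10-s reset duration of Table 1) and exactly which stimuli it applies to are specified in the appendix; applying it from the current step onward to all active representations is a simplifying assumption here.

```python
import numpy as np

def apply_representation_reset(x_all, W_first, P_prev, t):
    """Eqn 6, sketched. x_all: list of (components, time) representations, one per
    stimulus (modified in place); W_first[l] = Wl1; P_prev = P(t-1)."""
    onset_prediction = sum(W_first[l] * x_all[l][0, t] for l in range(len(x_all)))
    if onset_prediction > P_prev:
        for x in x_all:
            x[1:, t:] = 0.0    # wipe the sustained (m != 1) components from t onward (assumption)
    return x_all
```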
Behavioral output by the Actor component

Actions. The Effective Reinforcement Signal is emitted as teaching signal to the Actor component of the model. The Actor consists of a one-layer neural network, which learns associations between stimuli el(t) and actions an(t) (Fig. 2, left). Every neuron of the Actor (large circles in Fig. 2, left) codes a specific action. We interpret el(t) = 1 as 'stimulus number l is on' and el(t) = 0 as 'stimulus number l is off'. From every stimulus el(t), a stimulus trace ēl(t) was defined as a slowly decaying consequence of the stimulus (see Appendix) which serves as a kind of memory. 32,77 Activations of the neurons were defined as the weighted sum of the stimulus traces ēl(t) minus a small random perturbation σn(t):

a′n(t) = Σl νnl ēl(t) − σn(t)    (7)
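The Actor activation of Eqn 7 and the winner-take-all selection formalized in Eqn 8 below can be sketched together. The perturbation σn(t) is drawn here from a uniform distribution bounded by the "maximum of random distribution" in Table 1; the actual distribution is not specified in this section, so that choice is an assumption.

```python
import numpy as np

SIGMA_MAX = 0.5    # maximum of the random perturbation (Table 1)

def actor_step(nu, e_trace, rng=np.random.default_rng(0)):
    """Eqn 7: a'_n(t) = sum_l nu[n, l] * e_trace[l] - sigma_n(t);
    Eqn 8: only the most active neuron, if positive, emits 1 (winner-take-all)."""
    sigma = rng.uniform(0.0, SIGMA_MAX, size=nu.shape[0])   # assumed uniform noise
    activation = nu @ e_trace - sigma
    action = np.zeros(nu.shape[0])
    winner = int(np.argmax(activation))
    if activation[winner] > 0.0:
        action[winner] = 1.0
    return action, activation
```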
The model performed only one action at a time, similar to the behavior of an animal. This restriction was implemented by a winner-take-all rule between Actor neurons which allowed only the neuron with the largest activation a′n(t) to provide an output of one:

an(t) = 1 if a′n(t) > 0 and a′n(t) > a′m(t) for all m ≠ n; an(t) = 0 otherwise    (8)

We interpret an(t) = 1 for a specific action n as 'the Actor performs action n'. In the present model, the Actor was implemented with three neurons corresponding to three possible behavioral reactions, namely left lever press (action X, a1(t) = 1), right lever press (action Z, a3(t) = 1), and withholding of movement (action Y, a2(t) = 1).

Learning. Stimuli and actions induce stimulus traces ēl(t)
and action traces ān(t), in order to make previously used synapses eligible for modification by the Effective Reinforcement Signal occurring after the stimuli or actions are over. 32,77 Weights were increased for neurons if stimulus and action traces coincided with an increased Effective Reinforcement Signal, whereas weights were decreased if traces coincided with an Effective Reinforcement Signal decreased below baseline. The weights νnl were adapted according to the product of stimulus trace ēl(t), action trace ān(t) and the difference between the Effective Reinforcement Signal r(t) and baseline activity d: 3

νnl(t) = νnl(t − 1) + ηa [r(t) − d] ān(t) ēl(t)    (9)

where ηa is the learning rate. This learning rule includes the action trace ān(t), in order to adapt only the weights of the winner neuron which has elicited the action. Weights were initially set to 0.4 in order to allow spontaneous, unlearned actions when stimuli were presented (Table 1). These initial weights were only used for the 'free reward' situation. As several situations were employed, the weights computed in each situation were used as initial weights for the subsequent situation. The model used the same time-courses of stimuli as in the experiments.

Modifications of effective reinforcement signal

In order to assess the contribution of the adaptive properties of the Critic to learning, we investigated a model variant with an unconditional reinforcement signal. The learning rates ηc and ηs of the Critic were set to zero after repeated exposure to 'free reward'. This resulted in a lack of transfer of the reinforcement signal from primary reward to reward-predicting stimuli (stimuli A, B or C). Thus, the increase in the reinforcement signal at the time of reward remained independent of its predictability and failed to occur with the predictive stimulus. Learning performance in the spatial choice task, instructed spatial task and spatial delayed response task was tested with this unconditional reinforcement signal. As learning with unconditional reinforcement resulted in perseverative errors already in the spatial choice task, we further tested the propensity for perseveration. The weights of the Actor were initially set to assure correct performance in trials alternating semirandomly between left and right, and the weights of the Critic were initialized with the values learned in the 'free reward' situation. We then made the model perform only on one side and tested for subsequent perseveration.
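A sketch of the Actor learning rule (Eqn 9) with simple decaying eligibility traces. The exact trace dynamics are defined in the appendix; here the traces decay by δ = 0.96 per time step and are refreshed by the current stimulus or action, and the acquisition/extinction rates of Table 1 are selected by the sign of the prediction error, which is our reading of the rule rather than the original specification.

```python
import numpy as np

D, DELTA = 0.5, 0.96                 # baseline and trace decay (Table 1)
ETA_A_PLUS, ETA_A_MINUS = 0.1, 0.7   # Actor learning rates for acquisition / extinction

def decay_traces(e_trace, a_trace, e_now, a_now):
    """Slowly decaying stimulus and action traces (illustrative form)."""
    e_trace = np.maximum(DELTA * e_trace, e_now)
    a_trace = np.maximum(DELTA * a_trace, a_now)
    return e_trace, a_trace

def update_actor_weights(nu, r_t, a_trace, e_trace):
    """Eqn 9: nu[n, l] += eta_a * (r(t) - d) * a_trace[n] * e_trace[l]."""
    eta = ETA_A_PLUS if r_t >= D else ETA_A_MINUS
    return nu + eta * (r_t - D) * np.outer(a_trace, e_trace)
```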
Fig. 4. Response transfer from primary reward to earliest reward-predicting stimulus during learning. (A) First drop of free reward. Primary reward was modeled with two components. Input from stimulus D represented touch of the tongue, and the primary reinforcement was derived from the vegetative consequences of the liquid (top). The Effective Reinforcement Signal (bottom) was the result of the prediction (middle) being differentiated and summed with the primary reinforcement input. (Activity of dopamine neurons has not been measured in this situation.) (B) Acquisition of full response to the composite reward. The prediction induced by stimulus D became more sustained and led to a phasic increase of the Effective Reinforcement Signal at onset of stimulus D. For comparison, the population histogram at the bottom shows the average response of 80 dopamine neurons to an unpredicted drop of liquid delivered to the mouth of a monkey. 46 (C) Conditioning of a stimulus by pairing with the composite reward. In trial 6 (top), the Effective Reinforcement Signal was slightly increased at onset of stimulus A and strongly increased at reward onset. In trial 20 (middle), the Effective Reinforcement Signal was stronger at onset of stimulus A than at reward onset. After 200 trials (bottom), the Effective Reinforcement Signal reached its final magnitude and was strongly increased after stimulus A but at baseline after the reward. The population histograms show a comparable gradual response transfer of dopamine responses during conditioning of an auditory stimulus. 46 The same time-scales apply to model and biological data in this and all subsequent figures.

RESULTS
Basic properties of the Critic

This section describes how the Effective Reinforcement Signal of the Critic replicated the basic characteristics of the responses of dopamine neurons to primary rewards and conditioned stimuli.

First free reward. The rewarding drop of liquid was modeled as a composite of a stimulus e4(t) and a primary reinforcement λ(t) (stimulus D and 'primary reinforcement' in Fig. 4A, top, respectively). Due to initial positive weights, stimulus D
induced a positive prediction signal P(t) already in the first trial (Fig. 4A, middle). Following temporal differentiation, this signal is summed with the later primary reinforcement by the Critic (Fig. 2, right) to produce the Effective Reinforcement Signal (Fig. 4A, bottom). The biphasic initial response represents a novelty response and is due to the first occurrence of stimulus D. The biphasic novelty response is similar to novelty responses of dopamine neurons in animals. 42,73 Experienced with free reward. Following 200 pairings between stimulus D and primary
reinforcement, the stimulus component of the reward (stimulus D) led to a positive and sustained prediction signal P(t) during the interstimulus interval (Fig. 4B, middle). The predictive component of the Effective Reinforcement Signal increased strongly because stimulus D came to completely predict the primary reinforcement, whereas the primary reinforcement component disappeared (Fig. 4B, bottom). The Effective Reinforcement Signal resembles dopamine responses to drops of apple juice delivered to the mouth of animals. 29,46

Conditioned stimulus. We employed the composite reward for conditioning an arbitrary stimulus A. The Effective Reinforcement Signal began to show an increase after stimulus A in trial six but remained increased at the time of reward (Fig. 4C, top). During successive trials, the signal further increased at the time of the earliest reward-predicting stimulus A, but progressively decreased and returned to baseline at the time of reward (Fig. 4C, middle and bottom). The Effective Reinforcement Signal developed in a similar manner to dopamine responses during the different phases of associating auditory or visual stimuli with liquid or food reward. 42,46

Modified time of reward. In order to test the temporal aspects of reward prediction, we changed the time of reward occurrence. Using the weights computed with a fully conditioned stimulus A, the reward was delivered earlier or later than predicted (Fig. 5A, B). This resulted in an increased Effective Reinforcement Signal at the new time of reward in both situations. In addition, a phasic decrease occurred at the previous time of reward when reward was delivered later than predicted (Fig. 5B), but not with reward earlier than predicted (Fig. 5A). The differential decrease of the reinforcement signal was due to the relative durations and magnitudes of the predictions evoked by stimuli A and D. Through the learning process, the prediction evoked by stimulus A lasted during the constant and well established interval from stimulus A to reward onset (onset of stimulus D). With a premature reward, onset of the earlier reward (stimulus D) evoked a prediction of the primary reinforcement before the smaller reward prediction from stimulus A was terminated and thus reset this prediction because of the winner-take-all rule (Fig. 5A, middle). Without this competitive representation reset, the Effective Reinforcement Signal would have been decreased below baseline level at the predicted time of reward (not shown). By contrast, with a delayed reward, the reward prediction evoked by stimulus A terminated before reward onset and thus induced a phasic decrease at the habitual time of reward (Fig. 5B). The Effective Reinforcement Signals for rewards occurring earlier or later than predicted are comparable to the activity of dopamine neurons in these situations. 29

Fig. 5. Reward predictions including the time of reward occurrence. The time of reward was changed for one trial after the model was trained with a constant stimulus–reward interval. (A) Reward occurring earlier than predicted. The Effective Reinforcement Signal was increased at the new time of reward but remained unchanged at the previous time of reward, as the prediction from stimulus D had reset the prediction from stimulus A. (B) Reward occurring later than predicted. The Effective Reinforcement Signal was decreased at the predicted time of reward and increased at the new time of reward. These characteristics of the Effective Reinforcement Signal are comparable to the responses of dopamine neurons. 29

Learning a delayed response task in successive steps

This section describes how the modeled Effective Reinforcement Signal of the Critic developed and how the behavior of the Actor changed during the learning of a simulated spatial delayed response task via two intermediate tasks.

Spatial choice task. The left or the right instruction (stimulus A or C) was presented simultaneously with the trigger (stimulus B), and the Actor was required to perform the appropriate movement out of three possible actions (left action X or right action
Fig. 6. Actor performance and Effective Reinforcement Signal during learning of the spatial choice task. (A) In a correct trial during learning, simultaneous presentation of the trigger (stimulus B) and right instruction (stimulus C) was followed by correct right lever press (action Z). The Effective Reinforcement Signal was increased at the onset of reward-predicting stimuli B and C and at reward onset (stimulus D). The second peak led to an increase of the weight associating the right instruction and right lever press (stimulus C – action Z). Shaded areas indicate the time at which the Effective Reinforcement Signal interacted with stimulus and action traces and changed synaptic weights. (B) In an incorrect trial during learning, simultaneous presentation of the trigger and right instruction (stimuli B + C) was followed by erroneous left lever press (action X). This left the weight associating the right instruction with right lever press unchanged (line 5). The Effective Reinforcement Signal was increased at trigger-instruction onset and decreased at the time of expected reward. (C) After learning, the trigger and right instruction (stimuli B + C) were correctly followed by right lever press (action Z). The Effective Reinforcement Signal was increased at the time of reward-predicting stimuli B + C but was unaffected by the reward. Lines below letters Z and X indicate action durations. In parts (A–C), the Effective Reinforcement Signal was comparable to the activity of single dopamine neurons (A, B) or their populations (C) recorded in corresponding learning situations. 66
Z, withholding of movement Y) in order to receive reward. During learning (Fig. 6A), the trigger and the right instruction occurred simultaneously
(stimuli B + C; Fig. 6A, line 1) which left two identical stimulus traces (line 2). The Effective Reinforcement Signal increased (line 6), as trigger
Fig. 7. Actor performance and Effective Reinforcement Signal during learning of the instructed spatial task. (A) In an erroneous trial during learning, the right instruction (stimulus C) was followed by premature right lever press (action Z). This error immediately terminated the trial. The Effective Reinforcement Signal was increased at instruction onset and decreased when the reward was predicted from the previously learned spatial choice task. The model replicated the frequently precocious movements by the animals when learning this task. The Effective Reinforcement Signal was comparable to dopamine responses measured in trials with precocious movements (unpublished results from study 66). (B) With more advanced learning, the right instruction (stimulus C) was followed by the trigger which elicited correct right lever press (action Z). The Effective Reinforcement Signal was increased at instruction onset and unaffected by the fully predicted trigger and reward. This was comparable to responses of dopamine neurons in this situation. 66
and instructions had already acquired reward-predicting properties. The Actor pressed the lever to the right (action Z; line 3), leaving an action trace (line 4). This correct reaction resulted in delivery of reward and consequently increased the Effective Reinforcement Signal at reward onset D (line 6). This led to an increase in the synaptic weight associating the right instruction (stimulus C) with right lever press (action Z), according to the product (Effective Reinforcement Signal × trace of stimulus C × trace of action Z) (shaded areas in Fig. 6A, lines 2, 4, 5, 6). In an erroneous learning trial (Fig. 6B), presentation of trigger and right instruction (stimuli B and C, line 1) was followed by incorrectly pressing the left lever (action X; line 3), leaving a trace of left lever press (line 4) but no trace of right lever press. The absence of trace of right lever press prevented any change in the weight associating the right instruction C and right lever press (line 5). The incorrect action terminated the trial without leading to reward and consequently decreased the Effective Reinforcement Signal at the time of the partially expected reward (line 6). The decreased reinforcement signal led to a decrease in the synaptic weight associating the right instruction with left lever press (not shown), according to the product (Effective Reinforcement Signal × trace of stimulus C × trace of action X). The model reacted in a similar manner to the inappropriate withholding of movement (action Y)
following presentation of the instruction and trigger stimuli (not shown). After learning (Fig. 6C), the trigger and right instruction were correctly followed by lever press to the right. The Effective Reinforcement Signal was increased at the time of the reward-predicting stimuli and at baseline at the time of reward. Before, during and after learning, the Effective Reinforcement Signal was comparable to activity of dopamine neurons in the same, correctly or incorrectly performed task situations (histograms below Fig. 6A–C). Instructed spatial task. The instruction was presented 1.0 s before the trigger stimulus, and the action should occur only after the trigger. Initially, both the model and the animals performed erroneously by pressing the lever already after the instruction and before the trigger, as they reacted to the instruction according to the preceding spatial choice task (Fig. 7A). The error canceled the subsequent trigger and the reward. The Effective Reinforcement Signal increased after the instruction because it predicted reward in the previous task. However, as reward failed to occur with the error, the Effective Reinforcement Signal was decreased at the predicted time of reward. In a typical correct trial at a later learning stage (Fig. 7B), the instruction (stimulus C) indicated a movement to the right (action Z), and all action was withheld until the
trigger occurred (stimulus B). Then the appropriate right lever (action Z) was pressed and reward was delivered (stimulus D and primary reinforcement). The Effective Reinforcement Signal was increased after the instruction but unaffected by the predicted occurrence of trigger and reward. Thus the increase occurred to the earliest reward-predicting stimulus. The Effective Reinforcement Signal was comparable to the activity of dopamine neurons during learning of the instructed spatial task. 66 Spatial delayed response task. In this final task, the delay between instruction and trigger varied randomly between 2.5 and 3.5 s. In a typical incorrect trial during early learning, an action was withheld (Fig. 8A). The Effective Reinforcement Signal showed a biphasic increase–decrease after the instruction, followed by a small decrease 1 s after the instruction at the time of the trigger in the previously learned task. A biphasic response followed by a second decrease was typical for a partially extinguished reward-predicting stimulus because of the different learning rates of the weights. The monophasic increase after the trigger reflected the reward-predicting properties of this stimulus acquired in the previous task. The Effective Reinforcement Signal decreased again 1.0 s after the trigger, at the time of reward in the previous task. With fully established task performance (Fig. 8B), the Effective Reinforcement Signal lost the two decreases but continued to show increases at instruction and trigger onsets. The increase following the trigger failed to disappear for all instruction–trigger intervals, as the time of trigger was variable and unpredictable. The Effective Reinforcement Signal following the trigger showed a progressively smaller increase with decreasing instruction–trigger intervals at this learning stage. It showed a constant increase with more advanced learning (not shown), as the Critic reached its final characteristics later than task performance. The characteristics of the Effective Reinforcement Signal in incorrect and correct trials were comparable to the activity of dopamine neurons during learning of the spatial delayed response task (A bottom; B right). 66 Behavioral performance. For all three spatial tasks, the actions emitted by the Actor showed similar improvements in performance to the behavior of the animals, resulting in largely comparable learning curves (Fig. 9). Performance dropped in both model and animals between successive tasks. During early learning of the spatial delayed response task, the error rate was higher in the model compared to the animal. The instruction–trigger interval was increased in a single step for the model, resulting in a flattened learning curve, whereas the interval was increased in small steps for the animals in order to maintain their cooperation.
Fig. 8. Actor performance and Effective Reinforcement Signal during learning of the spatial delayed response task. (A) In an erroneous trial during learning, the trigger failed to elicit an action (line 2). The Effective Reinforcement Signal (line 3) was comparable to responses of dopamine neurons in incorrect learning trials (line 4) 47. (B) In more advanced learning trials, the left instruction (stimulus A) preceded the trigger which was correctly followed by left lever press (action X). Bottom shows Effective Reinforcement Signal increases following both instruction and trigger stimuli in six trials with different interstimulus intervals, ranked according to decreasing intervals. Right part shows the corresponding activity of dopamine neurons in single, correct trials. Trials are shown rank-ordered according to intervals between instruction (small vertical bars) and trigger (vertical line). Vertical bars after the trigger indicate reward (unpublished results from study 66).
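The learning curves referred to in the Behavioral performance paragraph above (and plotted in Fig. 9) are simply the percentage of correct trials per block, 15 trials per block for the model. A minimal way to compute such a curve from a trial-by-trial record, with our own function naming, is sketched below.

```python
import numpy as np

def learning_curve(correct, block_size=15):
    """Percentage of correct trials per block (15 trials/block for the model,
    about 40 for the animals)."""
    trials = np.asarray(correct, dtype=float)
    n_blocks = len(trials) // block_size
    blocks = trials[: n_blocks * block_size].reshape(n_blocks, block_size)
    return 100.0 * blocks.mean(axis=1)
```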
Fig. 9. Learning curves for the three successive tasks in model compared to monkey. One block contained 15 trials for the model (top) and about 40 trials for the animal (bottom). 66

Behavioral extinction. The behavior of the Actor was studied during extinction trials in which reward was withheld. After all investigations with the spatial delayed response task were terminated, the model was set to the synaptic weights of the fully acquired spatial choice task. The complete reward consisting of stimulus D and primary reinforcement was omitted during 40 simultaneous presentations of the left instruction and trigger stimuli. This induced a phasic decrease of the Effective Reinforcement Signal below baseline at the predicted time of reward. Performance remained correct for the first four trials, but deteriorated during the subsequent six trials and became entirely incorrect for the remainder of the block of 40 trials. Behavioral errors consisted of a single response to the incorrect side and of the absence of action in all other error trials. This simulation shows that unreinforced actions become extinguished in the model in a similar manner as in animals.

Modifications of effective reinforcement signal

Learning with unconditional reinforcement. The aim was to investigate the importance of the adaptive properties of the Critic by using a reinforcement signal that was unaffected by learning. This unconditional reinforcement signal increased also after fully predicted rewards, did not decrease with reward omission and did not increase after reward-predicting stimuli. Thus, it failed to code a reward prediction error. This reinforcement signal had a baseline value of d = 0.5 and increased phasically to 2.0 after a reward, identical to the signal shown in Fig. 4B, bottom. The Actor was trained with the unconditional reinforcement signal in the three spatial tasks in exactly the same manner as with the Effective Reinforcement Signal. This resulted in perseveration on initially learned behavior. In the spatial choice task, the stimuli selectively instructing for left vs right trials shared a common feature, the trigger light, which led to generalizations between both instructions. The Actor continued to respond to the initially reinforced side irrespective of the position indicated by the instruction, resulting in 50% incorrect trials. Performance in the instructed spatial and the spatial delayed response tasks was nearly entirely incorrect, as the Actor continued to react precociously to the instruction, as learned in the preceding spatial choice task. These errors would have been prevented by the decrease with omitted reward had the adaptive Effective Reinforcement Signal been used. Further simulations suggested that perseverative errors may constitute a general problem of learning with unconditional reinforcement signals (Fig. 10). In the example of the spatial choice task, weights were initialized with values assuring correct performance. Then a block of 10 regularly left–right alternating trials was presented, followed by 18 left trials and again 30 alternating trials. Whereas use of the Effective Reinforcement Signal resulted in correct performance of all 58 trials (Fig. 10A), the unconditional reinforcement signal resulted in task performance persevering on left actions during the final 30 alternating trials. Thus, with unconditional reinforcement, behavior persevered on the left action, because the 18 left trials had resulted in strong synaptic weights linking the trigger stimulus with left lever press. In addition, analysis of the weight space revealed that the probability for relearning the spatial choice task decreased with every left trial, thus impairing relearning of the complete task (dashed lines in Fig. 10B). Despite initially correct performance, severe errors occurred with unconditional reinforcement whenever a large number of successive trials was performed on the same side. Taken together, the common learning deficit in all behavioral situations tested with unconditional reinforcement consisted of perseveration on initially reinforced behavior when similar stimuli required different actions.

Fig. 10. Advantage of learning with an adaptive Effective Reinforcement Signal (A) as compared to an unconditional reinforcement signal not coding a reward prediction error (B). In both simulations of the spatial choice task, initial weights were set to 0.4 for the associations left instruction–left action, trigger–left action, right instruction–right action, and trigger–right action, the remaining weights being zero. These values resulted in correct behavior. Performance was tested with 10 consecutively alternating trials, followed by 18 left-only trials and by 30 alternating trials. (A) Development of two weights associating instructions with actions trained with an adaptive Effective Reinforcement Signal. All trials were correctly performed. Weight changes became progressively smaller with increasing experience. (B) Development of the same weights with unconditional reinforcement. Weights for the association left instruction–left action continued to increase with correct performance. Weights for right instruction–right action remained constant after an initial increase, as the 15 right trials during the 30 alternating trials were incorrectly performed to the left side (indicated by dots). Weights did not approach an asymptotic value despite continuing experience. Dashed lines separate three regions in the weight space comprising only correct right trials, both right and left correct trials, and only left correct trials.

Sustained decrease of Effective Reinforcement Signal. The aim was to replicate pathological, sustained decreases of dopamine activity. In the model, the value of the Effective Reinforcement Signal was constantly set to zero. This contrasted with the extinction trials described above in which the signal maintained a positive baseline value d which only decreased phasically at the time of omitted reward. Thus, unpredicted rewards, reward predictions and reward omissions could no longer influence the Effective Reinforcement Signal. The Actor was set with the synaptic weights of the fully learned spatial choice task using the original Effective Reinforcement Signal. The left instruction and the trigger (stimuli A and B, respectively) were simultaneously presented. The left lever (action X) was correctly pressed in the first and second trial, whereas no actions were performed in the remaining eight trials tested. Thus, a sustained decrease of the Effective Reinforcement Signal below the usual baseline resulted in extinction of previously trained actions, similar to extinction of action by omitted reward.
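The two manipulations of the reinforcement signal described in this section can be stated compactly. The sketch below shows the "unconditional" control signal (baseline d with a fixed increase to 2.0 at each reward, regardless of predictions) and the lesion-like condition in which the signal is clamped to zero; array lengths and function names are illustrative.

```python
import numpy as np

D = 0.5    # baseline of the Effective Reinforcement Signal

def unconditional_signal(reward_steps, n_steps):
    """Control signal: unaffected by learning, rises to 2.0 at every reward."""
    r = np.full(n_steps, D)
    r[list(reward_steps)] = 2.0
    return r

def lesioned_signal(n_steps):
    """Sustained decrease: the signal is constantly zero, so rewards, predictions
    and omissions can no longer influence it."""
    return np.zeros(n_steps)
```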
DISCUSSION
This study has demonstrated that TD Critic–Actor architectures with dopamine-like reinforcement signals can learn cognitive tasks that incorporate typical functions of the frontal cortex and basal ganglia. The present model has extended the basic TD algorithm in order to match the Effective Reinforcement Signal and the natural dopamine responses in their basic characteristics. The results show that the Effective Reinforcement Signal developed adaptive properties very similar to those of the dopamine responses in each learning step. The power of the dopamine-like Effective Reinforcement Signal is shown by the observation that the relatively simple Actor network without hidden layers acquired each learning step of this complicated task very effectively, in a similar manner to monkeys in the laboratory. Together with a previous study employing a comparable extended TD reinforcement signal for learning sequential movements, 75 these results support the hypothesis that dopamine responses could be used as reinforcement signals for the acquisition of a range of behavioral tasks.
Dopamine-like teaching signal of the Critic component

Dopamine neurons process an error in the prediction of reward, rather than responding unconditionally to rewards. Dopamine responses correspond in several aspects to the Effective Reinforcement Signal of traditional TD models. 3,30,48,50,67 In order to reproduce the responses of dopamine neurons in a larger spectrum of behavioral situations, the existing TD algorithms were modified with respect to stimulus representation and extended to include novelty responses, generalization responses and temporal aspects in reward prediction. The depression in dopamine activity at the time of omitted reward reflects a central timing mechanism. 29,66 This mechanism was modeled by representing a stimulus as a series of sustained signals of different durations. Only the weights associated with the particular signal that covered the duration of the appropriate stimulus–reward interval were strengthened. A similar timing mechanism was proposed in the 'spectral timing model'. 24,25 This form of stimulus representation allowed very effective learning of the temporal aspects of reward delivery, as demonstrated by the depression at the time of omitted reward. The biphasic responses of dopamine neurons to novel and physically salient stimuli were modeled by initializing the associative synaptic weight of the earliest phasic component of the temporal stimulus representation with a positive value. This assumption avoided the inconsistency of previous TD models of dopamine activity 50,67 with experimentally established responses to novel stimuli. 42,55,72,73
Comparable biphasic responses of dopamine neurons were observed with stimuli resembling rewarded stimuli (generalization responses). 42 Such biphasic responses to salient stimuli may reflect 'associations' to reward and occur also in TD models without temporal stimulus representation. 4,5,81 The slow habituation of dopamine novelty responses to physically salient stimuli 29,73 was reproduced with a small learning rate of the weights associated with such biphasic responses. This assumption may also explain biphasic generalization responses if it is assumed that these result from extinction of stimulus components shared by the rewarded stimulus and the unrewarded similar stimulus. Furthermore, the slow learning rate of weights associated with phasic prediction signals suggests biphasic responses of dopamine activity after extinction of a conditioned stimulus. Modeling of the temporal aspects of reward prediction included a winner-take-all competition between predictions originating from different stimuli. The algorithm resets the prediction signal elicited by a stimulus when it is followed by a second reward-predicting or physically salient stimulus eliciting a stronger prediction signal. This reflects the shifting of attention in animal learning when a more important stimulus follows in a stimulus sequence. 43,57 A similar competition between temporal representations of different stimuli and the resetting by successive stronger stimuli was proposed previously to explain behavioral evidence for attention shifts. 24,25 This mechanism corrects inconsistencies of earlier TD models 50,67 with the experimental findings in situations in which a reward occurs earlier than predicted. Both Effective Reinforcement Signal and dopamine activity 29 are increased at the new, earlier time of reward but remain at baseline at the original, later time of reward. In previous TD models without representation reset, 50,67 a depression occurs at the original time of reward which is not consistent with the experimental findings. 29 The depression would compromise learning and produce behavioral extinction similar to omission of reward, although this would be partly compensated by the signal increase at the new time of reward. It would be interesting to test the operation of a representation reset in future experiments by assessing whether the depression at the time of an omitted reward still occurs when the animal is distracted by other salient stimuli.

Properties of the Actor component

The present model implemented the TD algorithm with a full Critic–Actor architecture. Sensorimotor associations were reinforced when the eligibility trace of this association coincided with an increased Effective Reinforcement Signal and extinguished
when the trace coincided with a decreased Effective Reinforcement Signal. Biphasic, initially positive Effective Reinforcement Signals, resembling novelty and generalization responses of dopamine neurons, influence learning in the Actor component. Their effect on the slow movements simulated in the current study, with eligibility traces of several hundred milliseconds, is small, as the effects from the two phases of the novelty responses cancel out. However, novelty and generalization responses may influence short actions, for example saccades or attentional processes, if the duration of the trace is only on the order of 100 ms. This suggests an influence of novelty and generalization responses on short attentional processes. This interpretation is supported by theoretical studies of TD learning which showed that positive initial weights, leading to an overestimation of reward prediction, improve learning performance because these weights stimulate novelty seeking. 79 Therefore, we suggest that biphasic dopamine responses stimulate visual search and attentional processes directed to novel, generalized and physically salient stimuli. These considerations suggest a computational basis for the hypothesis that dopamine neurotransmission attributes motivational salience to stimuli 58 and enhances attention. 9 Previous models of dopamine activity showed that the use of the TD reinforcement signal for selecting behavioral output without an explicit Actor made it possible to replicate the foraging behavior of honeybees 49 and human decision making. 50 More elaborate TD models with explicit Critic–Actor architectures very efficiently learned eye movements, 18,48 sequential movements, 75 and orienting reactions. 14 However, dopamine neurons were not recorded in tasks comparable to these simulations. In contrast, the present simulations reproduced both the activity of dopamine neurons and animal learning behavior. The complicated delayed response task was learned with a rather simple Actor comprising basically a single layer, reflecting the extraordinary power of the TD reinforcement signal modeled. Previous models of delayed response tasks adapted synaptic weights with unconditional reinforcement 27 and used more elaborate architectures with several layers or optimization methods with limited biological plausibility. 7,51 However, these studies replicated some of the task-related activities of neurons in frontal cortex and basal ganglia. It would be interesting to see how much complexity would be required for the Actor component of TD models with dopamine-like reinforcement signals in order to replicate similar neuronal activities.

Biological substrates

The Critic–Actor architecture is particularly attractive for neurobiology because of its separate teaching and performance modules and its local
learning rules. The presently used architecture (Fig. 2) resembles the connectivity of the basal ganglia, including the reciprocity of striatonigral projections. 30,63 The learning of reward predictions may be mediated by projections from cortex via the patch-striosome compartments of striatum to midbrain dopamine neurons. The time derivative of the prediction may be computed in striatal patch neurons or in dopamine neurons. 50 The learning of sensorimotor associations and motor output by the Actor may involve projections from cortex via the matrix of striatum, globus pallidus and thalamus to premotor and motor cortical areas. Subsets of striatal and cortical neurons show sustained anticipatory activity preceding reward-predicting stimuli 2,65 similar to the prediction signal of TD models. Transient and sustained responses in striatal and cortical neurons 1,2,39,64,65,80,83 resemble the components of the presently modeled stimulus representation of the Critic with their different durations. In addition, convergence of information occurs in the striatal projections to basal ganglia output nuclei, 22 in analogy to the convergence of sensory representations to smaller numbers of actions in Critic–Actor architectures. 4 The Critic component emits the global Effective Reinforcement Signal to the Actor, similar to the divergent projections from midbrain dopamine neurons to a several hundredfold higher number of striatal neurons. 54 Many dendritic spines of striatal medium spiny neurons are contacted by both cortical and dopamine afferents, 10,17,26 thus allowing dopamine varicosities to decisively influence corticostriatal transmission. This may occur in the form of dopamine-dependent posttetanic long-term changes of corticostriatal transmission reported by in vitro studies. 11,12,84 Such dopamine-dependent plasticity might provide a biological basis for the three-factor learning rule used for modifying synaptic weights in the Actor (stimulus trace × action trace × dopamine-like reinforcement signal). The model assumed that synapses in the Actor which were previously activated by stimuli and actions stayed eligible for modification by dopamine reinforcement signals. The physiological substrates of such 'eligibility traces' 77 in the striatum might consist of several mechanisms, such as sustained neuronal activity, 1,2,8 prolonged changes in calcium concentration, 53,85 or formation of calmodulin-dependent protein kinase II. 30 The present model did not include immediate influences of dopamine responses on activity of striatal neurons. This should not rule out the possibility that an Actor component could use the reward-predicting signals of dopamine neurons for directly guiding striatal activities in movement selection. The winner-take-all competition in the Actor may be considered a simplified implementation of lateral inhibition between neurons. This mechanism produces competition, as the neuron with the
strongest activation suppresses the neurons with weaker activations. 23 Inhibitory interactions between neighboring neurons may take place in cortical neurons projecting to striatum. Anatomical evidence also allows lateral inhibition to occur between striatal medium spiny neurons, 5,36 although this was not confirmed in a recent physiological experiment. 34

Modifications of effective reinforcement signal

The importance of the adaptive properties of the Effective Reinforcement Signal was investigated by using an unconditional reinforcement signal not coding an error in reward prediction. With the adaptive Effective Reinforcement Signal, incorrect actions were extinguished through reward omission, as the signal decreased below baseline when the reward should have occurred. By contrast, use of the unconditional reinforcement signal resulted in perseveration on incorrect, unrewarded actions and prevented shifting to correct actions. Perseveration of animal behavior could result from decreased learning in structures which are connected reciprocally with dopamine neurons. In correspondence to the recurrent architecture of the Critic, striatum and frontal cortex are connected reciprocally with dopamine neurons. 70 Perseveration is a typical symptom in Parkinson patients 40 with loss of dopamine neurons and in rats with lesions of medial frontal cortex. 52 These results demonstrate the efficacy of reinforcement signals coding an error in reward prediction and suggest that additional learning mechanisms are required when only unconditional reinforcement signals are used. The potential influence of decreased dopamine activity on associative learning was modeled by setting the Effective Reinforcement Signal to nil. This led to extinction of previously conditioned actions in proportion to the number of trials. This result resembles the progressive decrease of lever-pressing in animals systemically injected with the dopamine receptor blocker pimozide. 45,86 It is also compatible with the finding that Parkinson patients with degeneration of dopamine neurons show impaired associative learning. 38,41,71,82 It might be speculated that some of the Parkinsonian deficits in learned behavior could represent a kind of extinction effect following the loss of sustained dopamine reinforcing activity, in a manner demonstrated by the present model.

CONCLUSIONS
Temporal difference algorithms can be implemented as a Critic–Actor architecture that resembles the structure of the basal ganglia, including the projection of dopamine neurons to the striatum. The reinforcement signal of the Critic strongly resembles the responses of midbrain dopamine
neurons, and the present study refined the reinforcement signal with further characteristics of dopamine responses derived from neurophysiological experiments in behaving primates. The model further assumed a number of more putative biological processes, including dopamine-dependent plasticity of corticostriatal synapses. ‘Eligibility traces’ of neuronal activity were left by stimuli and actions and were acted upon by subsequent rewards. The model was able to learn a spatial delayed response task in a very similar manner to monkeys in the laboratory. The characteristics of the reinforcement signal were comparable to those of dopamine
responses during the different learning periods. Changes of the particular properties of dopamine responses resulted in reduced learning and perseveration. These results suggest that dopamine-like reward responses could be used as very effective reinforcement signals for learning behavioral tasks that are typical for primate behavior. Acknowledgements—This study was supported by the James S. McDonnell Foundation grant 94-39. R. E. Suri is presently at the USC Brain Project, University of Southern California, Hedco Neurosciences Building, 3614 Watt Way, Los Angeles CA 90089-2520.
REFERENCES
1. Alexander G. E. (1987) Selective neuronal discharge in monkey putamen reflects intended direction of planned limb movements. Expl Brain Res. 67, 623–634.
2. Apicella P., Scarnati E., Ljungberg T. and Schultz W. (1992) Neuronal activity in monkey striatum related to the expectation of predictable environmental events. J. Neurophysiol. 68, 945–960.
3. Barto A. G. (1995) Adaptive critics and the basal ganglia. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.), pp. 215–232. MIT, Cambridge.
4. Barto A. G., Sutton R. S. and Anderson C. W. (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Systems, Man, and Cybernetics SMC-13, 834–846.
5. Barto A. G., Sutton R. S. and Watkins C. J. C. H. (1990) Learning and sequential decision making. In Learning and Computational Neuroscience: Foundations of Adaptive Networks (eds Gabriel M. and Moore J.), pp. 539–602. MIT, Cambridge.
6. Bauer R. H. and Fuster J. M. (1976) Delayed-matching and delayed-response deficit from cooling dorsolateral prefrontal cortex in monkeys. J. comp. Physiol. Psychol. 90, 293–302.
7. Beiser D. G. and Houk J. C. (1998) Model of cortical-basal ganglionic processing: encoding the serial order of sensory events. J. Neurophysiol. 79, (6) 3168–3188.
8. Bishop G. A., Chang H. T. and Kitai S. T. (1982) Morphological and physiological properties of neostriatal neurons: an intracellular horseradish peroxidase study in the rat. Neuroscience 7, 179–191.
9. Blackburn J. R., Pfaus J. G. and Phillips A. G. (1992) Dopamine functions in appetitive and defensive behaviours. Prog. Neurobiol. 39, (3) 247–279.
10. Bouyer J. J., Park D. H., Joh T. H. and Pickel V. M. (1984) Chemical and structural analysis of the relation between cortical inputs and tyrosine hydroxylase-containing terminals in rat neostriatum. Brain Res. 302, 267–275.
11. Calabresi P., Pisani A., Mercuri N. B. and Bernardi G. (1992) Long-term potentiation in the striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. Eur. J. Neurosci. 4, 929–935.
12. Calabresi P., Saiardi A., Pisani A., Baik J. H., Centonze D., Mercuri N. B., Bernardi G. and Borrelli E. (1997) Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors. J. Neurosci. 17, (12) 4536–4544.
13. Church R. M. (1978) The internal clock. In Cognitive Processes in Animal Behavior (eds Hulse S. H., Fowler H. and Honig W. K.), Lawrence Erlbaum Associates, Hillsdale, New Jersey.
14. Contreras-Vidal J. L. and Schultz W. (1999) A predictive reinforcement model of dopamine neurons for learning approach behavior. J. comput. Neurosci. (in press).
15. Dickinson A. (1980) Contemporary Animal Learning Theory. Cambridge University Press, Cambridge.
16. Fagg A. H. (1993) Reinforcement learning for robotic reaching and grasping. In New Perspectives in the Control of the Reach to Grasp Movement (eds Bennet K. M. B. and Castiello U.), pp. 281–308. North Holland.
17. Freund T. F., Powell J. F. and Smith A. D. (1984) Tyrosine hydroxylase-immunoreactive boutons in synaptic contact with identified striatonigral neurons, with particular reference to dendritic spines. Neuroscience 13, 1189–1215.
18. Friston K. J., Tononi G., Reeke G. N. Jr., Sporns O. and Edelman G. M. (1994) Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience 59, 229–243.
19. Funahashi S., Bruce C. J. and Goldman-Rakic P. S. (1989) Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. J. Neurophysiol. 61, 331–349.
20. Fuster J. M. and Alexander G. E. (1971) Neuron activity related to short-term memory. Science 173, 652–654.
21. Gallistel C. R. (1990) The Organization of Learning. A Bradford Book, MIT, Cambridge, Massachusetts.
22. Graybiel A. M., Aosaki T., Flaherty A. W. and Kimura M. (1994) The basal ganglia and adaptive motor control. Science 265, 1826–1831.
23. Grossberg S. (1976) Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybern. 23, 121–134.
24. Grossberg S. and Levine D. S. (1987) Neural dynamics of attentionally modulated Pavlovian conditioning: blocking, interstimulus interval, and secondary reinforcement. Appl. Optics 26, (23) 5015–5030.
25. Grossberg S. and Schmajuk N. A. (1989) Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks 2, 79–102.
26. Groves P. M., Linder J. C. and Young S. J. (1994) 5-hydroxydopamine-labeled dopamine axons: three-dimensional reconstructions of axons, synapses and postsynaptic targets in rat neostriatum. Neuroscience 58, 593–604.
27. Guigon E., Dorizzi B., Burnod Y. and Schultz W. (1995) Neural correlates of learning in the prefrontal cortex of the monkey: a predictive network model. Cerebr. Cortex 5, 135–147.
28. Gullapalli V., Barto A. G. and Grupen R. A. (1994) Learning admittance mapping for force-guided assembly. In Proceedings of the 1994 International Conference on Robotics and Automation, pp. 2633–2638. Computer Society Press, Los Alamitos, California.
29. Hollerman J. and Schultz W. (1998) Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci. 1, 304–309.
30. Houk J. C., Adams J. L. and Barto A. G. (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.), pp. 215–232. MIT, Cambridge, Massachusetts.
31. Hrupka B. J., Lin Y. M., Gietzen D. W. and Rogers Q. R. (1997) Small changes in essential amino acid concentrations alter diet selection in amino acid-deficient rats. J. Nutr. 127, (5) 777–784.
32. Hull C. L. (1943) Principles of Behavior. Appleton-Century-Crofts, New York.
33. Jacobsen C. F. and Nissen H. W. (1937) Studies of cerebral function in primates: IV. The effects of frontal lobe lesions on the delayed alternation habit in monkeys. J. comp. Physiol. Psychol. 23, 101–112.
34. Jaeger D., Kita H. and Wilson C. J. (1994) Surround inhibition among projection neurons is weak or nonexistent in the rat neostriatum. J. Neurophysiol. 72, (3) 2555–2558.
35. Kalman R. E. (1960) A new approach to linear filtering and prediction problems. J. Basic Eng. Trans. ASME 82, 35–45.
36. Kawaguchi Y., Wilson C. J. and Emson P. C. (1990) Projection subtypes of rat neostriatal matrix cells revealed by intracellular injection of biocytin. J. Neurosci. 10, 3421–3438.
37. Klopf A. H. (1988) A neuronal model of classical conditioning. Psychobiology 16, (2) 85–125.
38. Knowlton B. J., Mangels J. A. and Squire L. R. (1996) A neostriatal habit learning system in humans. Science 273, (5280) 1399–1402.
39. Kubota K. and Niki H. (1971) Prefrontal cortical unit activity and delayed alternation performance in monkeys. J. Neurophysiol. 34, 337–347.
40. Lees A. J. and Smith E. (1983) Cognitive deficits in the early stages of Parkinson's disease. Brain 106, 257–270.
41. Linden A., Bracke-Tolkmitt R., Lutzenberger W., Canavan A. G. M., Scholz E., Diener H. C. and Birbaumer N. (1990) Slow cortical potentials in Parkinsonian patients during the course of an associative learning task. J. Psychophysiol. 4, 145–162.
42. Ljungberg T., Apicella P. and Schultz W. (1992) Responses of monkey dopamine neurons during learning of behavioral reactions. J. Neurophysiol. 67, 145–163.
43. Mackintosh N. M. (1983) Theoretical analysis of instrumental conditioning. In Conditioning and Associative Learning (ed. Mackintosh N. M.), pp. 77–112. Oxford University Press, Oxford.
44. Mark G. P., Blander D. S. and Hoebel B. G. (1991) A conditioned stimulus decreases extracellular dopamine in the nucleus accumbens after the development of a learned taste aversion. Brain Res. 551, 308–310.
45. Mason S. T., Beninger R. J., Fibiger H. C. and Phillips A. G. (1980) Pimozide-induced suppression of responding: evidence against a block of food reward. Pharmac. Biochem. Behav. 12, 917–923.
46. Mirenowicz J. and Schultz W. (1994) Importance of unpredictability for reward responses in primate dopamine neurons. J. Neurophysiol. 72, 1024–1027.
47. Mirenowicz J. and Schultz W. (1996) Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature 379, 449–451.
48. Montague P. R., Dayan P., Nowlan S. J., Pouget A. and Sejnowski T. J. (1993) Using aperiodic reinforcement for directed self-organization. In Advances in Neural Information Processing Systems (eds Giles C. L., Hanson S. J. and Cowan J. D.), pp. 969–976. Morgan Kaufmann, San Mateo, California.
49. Montague P. R., Dayan P., Person C. and Sejnowski T. J. (1995) Bee foraging in uncertain environments using predictive Hebbian learning. Nature 377, 725–728.
50. Montague P. R., Dayan P. and Sejnowski T. J. (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci. 16, (5) 1936–1947.
51. Moody S. L., Wise S. P., di Pellegrino G. and Zipser D. (1998) A model that accounts for activity in primate frontal cortex during a delayed matching-to-sample task. J. Neurosci. 18, 399–410.
52. Muir J. L., Everitt B. J. and Robbins T. W. (1996) The cerebral cortex of the rat and visual attentional function: dissociable effects of mediofrontal, cingulate, anterior dorsolateral, and parietal cortex lesions on a five-choice serial reaction time task. Cerebr. Cortex 6, 470–481.
53. Muller W. and Connor J. A. (1991) Dendritic spines as individual compartments for synaptic Ca2+ responses. Nature 354, 73–76.
54. Percheron G., Francois C., Yelnick J. and Fenelon G. (1989) The primate nigro-striato-pallido-nigral system. Not a mere loop. In Neural Mechanisms in Disorders of Movement (eds Crossman A. R. and Sambrook M. A.), pp. 103–109. John Libbey, London.
55. Rasmussen K., Strecker R. E. and Jacobs B. (1986) Single unit response of noradrenergic, serotonergic and dopamine neurons in freely moving cats to simple sensory stimuli. Brain Res. 369, 336–340.
56. Rescorla R. A. and Wagner A. R. (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and non-reinforcement. In Classical Conditioning II: Current Research and Theory (eds Black A. H. and Prokasy W. F.), Appleton-Century-Crofts, New York.
57. Revusky S. (1971) The role of interference in association over a delay. In Animal Memory (eds Honig W. K. and James P. H. R.), pp. 155–213. Academic, New York.
58. Robinson T. E. and Berridge K. C. (1993) The neural basis for drug craving: an incentive-sensitization theory of addiction. Brain Res. Rev. 18, 247–291.
59. Rogers Q. R. and Harper A. E. (1970) Selection of a solution containing histidine by rats fed a histidine-imbalanced diet. J. comp. Physiol. Psychol. 72, 66–71.
60. Romo R. and Schultz W. (1990) Dopamine neurons of the monkey midbrain: contingencies of responses to active touch during self-initiated arm movements. J. Neurophysiol. 63, 592–606.
61. Salamone J. D., Cousins M. S. and Snyder B. J. (1997) Behavioral functions of nucleus accumbens dopamine: empirical and conceptual problems with the anhedonia hypothesis. Neurosci. biobehav. Rev. 21, (3) 341–359.
62. Schultz W. (1986) Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. J. Neurophysiol. 56, 1439–1462.
63. Schultz W. (1998) Predictive reward signal of dopamine neurons. J. Neurophysiol. 80, 1–27.
64. Schultz W., Apicella P., Romo R. and Scarnati E. (1995) Context-dependent activity in primate striatum reflecting past and future events. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.), pp. 11–27. MIT, Cambridge.
65. Schultz W., Apicella P., Scarnati E. and Ljungberg T. (1992) Neuronal activity in monkey ventral striatum related to the expectation of reward. J. Neurosci. 12, 4595–4610.
66. Schultz W., Apicella P. and Ljungberg T. (1993) Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J. Neurosci. 13, (3) 900–913.
67. Schultz W., Dayan P. and Montague P. R. (1997) A neural substrate of prediction and reward. Science 275, (5306) 1593–1599.
68. Schultz W. and Romo R. (1987) Responses of nigrostriatal dopamine neurons to high-intensity somatosensory stimulation in the anesthetized monkey. J. Neurophysiol. 57, (1) 201–217.
69. Schultz W. and Romo R. (1990) Dopamine neurons of the monkey midbrain: contingencies of responses to stimuli eliciting immediate behavioral reactions. J. Neurophysiol. 63, 607–624.
70. Smith A. D. and Bolam J. P. (1990) The neural network of the basal ganglia as revealed by the study of synaptic connections of identified neurons. Trends Neurosci. 13, 259–265.
71. Sprengelmeyer R., Canavan A. G. M., Lange H. W. and Hömberg V. (1995) Associative learning in degenerative neostriatal disorders: contrasts in explicit and implicit remembering between Parkinson's and Huntington's disease patients. Mov. Disord. 10, 85–91.
72. Steinfels G. F., Heym J., Strecker R. E. and Jacobs B. L. (1983) Behavioral correlates of dopamine unit activity in freely moving cats. Brain Res. 258, 217–228.
73. Strecker R. E. and Jacobs B. L. (1985) Substantia nigra dopamine unit activity in behaving cats: effect of arousal on spontaneous discharge and sensory evoked activity. Brain Res. 361, 339–350.
74. Suri R. and Schultz W. (1996) A neural learning model based on the activity of primate dopamine neurons. Soc. Neurosci. Abstr. 22, 1389.
75. Suri R. E. and Schultz W. (1998) Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Expl Brain Res. 121, 350–354.
76. Sutton R. S. (1984) Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, MA, USA.
77. Sutton R. S. and Barto A. G. (1981) Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135–170.
78. Sutton R. S. and Barto A. G. (1990) Time derivative models of Pavlovian reinforcement. In Learning and Computational Neuroscience: Foundations of Adaptive Networks (eds Gabriel M. and Moore J.), pp. 539–602. MIT, Cambridge.
79. Sutton R. S. and Barto A. G. (1998) Section 2.7 Optimistic initial values. In Reinforcement Learning: An Introduction, pp. 39–41. MIT Press/Bradford Books, Cambridge, MA.
80. Tanji J., Taniguchi K. and Saga T. (1980) Supplementary motor area: neuronal responses to motor instructions. J. Neurophysiol. 43, 60–68.
81. Tesauro G. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219.
82. Vriezen E. R. and Moscovitch M. (1990) Memory for temporal order and conditional associative-learning in patients with Parkinson's disease. Neuropsychologia 28, 1283–1293.
83. Weinrich M. and Wise S. P. (1982) The premotor cortex of the monkey. J. Neurosci. 2, 1329–1345.
84. Wickens J. R., Begg A. J. and Arbuthnott G. W. (1996) Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience 70, (3) 1–5.
85. Wickens J. and Kötter R. (1995) Cellular models of reinforcement. In Models of Information Processing in the Basal Ganglia (eds Houk J. C., Davis J. L. and Beiser D. G.), pp. 187–214. MIT, Cambridge, Massachusetts.
86. Wise R. A., deWit J. S. H. and Gerber G. J. (1978) Neuroleptic-induced 'anhedonia' in rats: pimozide blocks reward quality of food. Science 201, 262–264.

(Accepted 10 November 1998)

APPENDIX
Equations are given in their exact form. Values of time-dependent variables were zero at the beginning of the simulations. Presentation of stimuli A (left light), B (center light), C (right light) and D (touch receptors of tongue) was represented with the functions e_l(t), with l = 1, 2, 3 or 4. The function e_l(t), describing the physical salience of one of these stimuli, took the value one during presentation of the stimulus and zero otherwise.

Actor

The stimulus trace \bar{e}_l(t) of a stimulus e_l(t) (Fig. 6), decaying with the parameter \delta (Table 1), was defined as

\bar{e}_l(t) = h[\, e_l(t-2) + \delta\, \bar{e}_l(t-1) \,]   (10)

The function h() limited the argument to values smaller than or equal to one and was defined as

h(y) = \begin{cases} y & \text{for } y < 1 \\ 1 & \text{otherwise} \end{cases}   (11)
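As a purely illustrative aid, the following minimal sketch shows the trace dynamics of Eqns 10 and 11 in Python; the function names and the value of the decay parameter δ are assumptions of this sketch and are not taken from the published Matlab programs.

```python
# Illustrative sketch of Eqns 10-11 (names and the delta value are assumptions).

def limit(y):
    """Eq. 11: cap the argument at one."""
    return y if y < 1.0 else 1.0

def stimulus_trace(e, delta=0.8):
    """Eq. 10: decaying trace of one stimulus, updated once per 100-ms time step.

    e -- list of 0/1 salience values e_l(t) for a single stimulus l
    """
    trace = [0.0] * len(e)
    for t in range(len(e)):
        e_lag2 = e[t - 2] if t >= 2 else 0.0
        prev = trace[t - 1] if t >= 1 else 0.0
        trace[t] = limit(e_lag2 + delta * prev)
    return trace

# Example: a stimulus that is present for three time steps.
print(stimulus_trace([1, 1, 1, 0, 0, 0, 0, 0]))
```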
Since the movement time for pressing a lever was about 300 ms, 66 actions a_n chosen at time t−1 continued over three successive time-steps (= 300 ms):

a_n(t) = \begin{cases} 1 & \text{if } a_n(t-1) = 1 \text{ and } a_n(t-3) = 0 \\ 0 & \text{otherwise} \end{cases}   (12)
If there was no ongoing action (a_n(t) = 0), the activity a'_n(t) was computed from the weights \nu_{nl} and the stimulus traces \bar{e}_l(t):

a'_n(t) = \Big[ \sum_l \nu_{nl}\, \bar{e}_l(t) - \sigma_n(t) \Big]\, g(e(t))   (13)
The value of \sigma_n(t) \in [0, \sigma] was chosen randomly (Table 1). The monkeys were pretrained to press levers when stimuli were presented. For the model, pretraining was omitted, and actions were instead assumed to be elicited in response to stimuli. The function g(e(t)) allowed actions only if a stimulus was presented:

g(e(t)) = \begin{cases} 1 & \text{if } e_l(t) - e_l(t-1) > 0 \text{ for } l = 1, 2 \text{ or } 3 \\ 0 & \text{otherwise} \end{cases}   (14)

As the animal could only press one lever at a time, actions were selected using competition between the activities a'_n(t) in the form of a winner-take-all rule:

a_n(t) = \begin{cases} 1 & \text{if } a'_n(t) > 0 \text{ and } a'_n(t) > a'_m(t) \text{ for all } m \neq n \\ 0 & \text{otherwise} \end{cases}   (15)

The representation of the selected action was extended over time with the trace

\bar{a}_n(t) = h[\, a_n(t) + \delta\, \bar{a}_n(t-1) \,]   (16)

with the limiting function h() (Eq. 11) and the parameter \delta (Table 1) describing the decay of the trace.
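The Actor equations above can be summarized in a short, hedged sketch. The function and variable names, the 0-based stimulus indices and the noise bound sigma_max are assumptions of this illustration, not identifiers from the original Matlab code.

```python
# Illustrative sketch of the Actor's action selection, Eqns 12-15.
import random

def actor_select(nu, e_bar, e_now, e_prev, a_hist, sigma_max=0.1):
    """Return the action vector a_n(t) for one 100-ms time step.

    nu     -- Actor weights nu[n][l] from stimulus l to action n
    e_bar  -- stimulus traces bar-e_l(t) of Eq. 10
    e_now, e_prev -- salience values e_l(t) and e_l(t-1)
    a_hist -- past action vectors [a(t-1), a(t-2), a(t-3)]
    """
    n_actions = len(nu)

    # Eq. 12: an action started at t-1 persists for three time steps (300 ms).
    ongoing = [1 if a_hist[0][n] == 1 and a_hist[2][n] == 0 else 0
               for n in range(n_actions)]
    if any(ongoing):
        return ongoing

    # Eq. 14: actions are allowed only at the onset of a lever-related stimulus
    # (stimuli A, B, C; indices 0-2 in this sketch).
    gate = 1 if any(e_now[l] - e_prev[l] > 0 for l in range(3)) else 0

    # Eq. 13: noisy activation of each action, gated by stimulus onset.
    activity = [(sum(nu[n][l] * e_bar[l] for l in range(len(e_bar)))
                 - random.uniform(0.0, sigma_max)) * gate
                for n in range(n_actions)]

    # Eq. 15: winner-take-all; at most one action with positive activation is chosen.
    best = max(range(n_actions), key=lambda n: activity[n])
    return [1 if n == best and activity[n] > 0 else 0 for n in range(n_actions)]
```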
Critic

Stimulus representation. This subsection describes the computation of the temporal representation x_{lm}(t) from a stimulus e_l(t) (Fig. 2A). For a single stimulus e_l(t), the temporal representation x_{lm}(t) depended only on the onset of this stimulus and not on its offset. x_{lm}(t) was defined with the recursive function

x_{lm}(t) = f(\, e_l(t),\; x_{lm}(t-1) \,)   (17)
Three cases were distinguished for the definition of the function f().

Case 1: Onset of stimulus e_l(t) elicited the first representation component

x_{l1}(t) = e_l(t), \qquad x_{l, m \neq 1}(t) = 0   (18)

Case 2: The slower components followed one time-step (100 ms) after the onset of stimulus e_l(t)

x_{lm}(t) = \rho^{\,m-1}\, e_l(t)   (19)

Case 3: More than one time-step after the onset of stimulus e_l(t), the components

x_{lm}(t) = c_{lm}(t)\, x_{lm}(t-1) / \gamma   (20)

increased gradually with the discount factor \gamma (Table 1). This increase was chosen in order to make the time-course of the representation components proportional to the time-course of the desired prediction signal. The decays of the representation components were implemented with the function

c_{ln}(t) = \begin{cases} 0 & \text{for } n = \arg\max_{n \in \{1,\dots,m\}} x_{ln}(t-1) \\ 1 & \text{otherwise} \end{cases}   (21)

which set the largest component of x_{lm}(t) to zero.
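A hedged numerical sketch of Eqns 17–21 for a single stimulus is given below. The names, the number of components m_max, the values of ρ and γ, and the treatment of the first component one step after onset are assumptions of this sketch (Table 1 lists the parameters actually used in the simulations).

```python
# Illustrative sketch of the temporal stimulus representation, Eqns 17-21.

def temporal_representation(e, m_max=8, rho=0.85, gamma=0.98):
    """Return x[t][m], the m-th representation component of one stimulus.

    e -- list of 0/1 salience values e_l(t); only the onset matters (Eq. 17).
    """
    x = [[0.0] * m_max for _ in range(len(e))]
    onset_time = None
    for t in range(len(e)):
        prev = x[t - 1] if t > 0 else [0.0] * m_max
        if e[t] == 1 and (t == 0 or e[t - 1] == 0):          # stimulus onset
            onset_time = t
            x[t][0] = e[t]                                    # Eq. 18
        elif onset_time is not None and t == onset_time + 1:
            x[t][0] = prev[0] / gamma                         # assumption for m = 1
            for m in range(2, m_max + 1):                     # Eq. 19
                x[t][m - 1] = rho ** (m - 1) * e[onset_time]
        elif onset_time is not None:
            # Eqns 20-21: every surviving component grows by 1/gamma, while the
            # currently largest component is set to zero, so that the components
            # form a family of ramps with different durations.
            largest = max(range(m_max), key=lambda m: prev[m])
            for m in range(m_max):
                x[t][m] = 0.0 if m == largest else prev[m] / gamma
    return x
```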
Effective Reinforcement Signal. Changes in the temporal representation were defined as

\Delta x_{lm}(t) = \lfloor\, x_{lm}(t-1) - \gamma\, x_{lm}(t) \,\rfloor_+   (22)

where \lfloor \cdot \rfloor_+ indicates that negative values of the argument were set to zero and positive values of the argument remained unchanged. The onset prediction P_{stimulus\,onset}(t) was defined as the sum over the onset predictions of all presented stimuli

P_{stimulus\,onset}(t) = \Big\lfloor \sum_l w_{l1}\, x_{l1}(t) \Big\rfloor_+   (23)
Only the sum over the first components of the temporal stimulus representation was taken, as the signals of the other components were zero at the time of stimulus onset (Fig. 2). If the onset prediction P_{stimulus\,onset}(t) of a later stimulus was larger than the prediction P_l(t−1) originating from an earlier stimulus,

P_{stimulus\,onset}(t) > \Big\lfloor \sum_l P_l(t-1) \Big\rfloor_+ / \gamma   (24)

then the old stimulus representation x_{lm}(t) was reset to zero with

x_{lm}(t) = f(\, e_l(t),\; 0 \,)   (25)

The function f() was defined by Eqns 17–21. In addition, the value of the reset function b_l(t) was set to one during the time interval \tau

b_l(t + t') = 1 \quad \text{if } e_l(t-1) \neq 0 \text{ and } 0 < t' < \tau   (26)
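The competition and reset of Eqns 23–26 can be illustrated with the following hedged sketch; the function name, the dictionary-based bookkeeping and the value of τ are assumptions made only for this illustration.

```python
# Illustrative sketch of the representation reset, Eqns 23-26 (names and tau are
# assumptions; w1 holds the first-component weights w_{l1}).

def maybe_reset(x, w1, P_prev, x_onset_new, gamma=0.98, tau=3):
    """Reset older stimulus representations when a stronger stimulus appears.

    x           -- dict l -> current representation list x_lm(t) of earlier stimuli
    w1          -- dict l -> weight w_{l1} of the first component
    P_prev      -- dict l -> prediction P_l(t-1) carried by earlier stimuli
    x_onset_new -- dict l -> first-component value x_{l1}(t) of newly appearing stimuli
    Returns the number of time steps for which the reset flag b_l stays at one.
    """
    # Eq. 23: onset prediction summed over the stimuli appearing at time t.
    p_onset = max(0.0, sum(w1[l] * x_onset_new[l] for l in x_onset_new))
    # Eq. 24: compare with the predictions carried by the earlier stimuli.
    if p_onset > max(0.0, sum(P_prev.values())) / gamma:
        for l in x:                       # Eq. 25: clear the old representations
            x[l] = [0.0] * len(x[l])
        return tau                        # Eq. 26: b_l = 1 during the next tau steps
    return 0
```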
and to zero otherwise. The prediction P_l(t) (Figs 4A, B, 5) associated with a stimulus e_l(t) was defined as

P_l(t) = \sum_m w_{lm}\, x_{lm}(t)   (27)

and the temporal difference in the prediction as

V(t) = -\sum_l \big( b_l(t) - 1 \big)\, \big[ \gamma P_l(t) - P_l(t-1) \big]   (28)

After a reset, the factor (b_l − 1) set the temporal difference signal in the time interval \tau to zero. The Effective Reinforcement Signal r(t) was defined as

r(t) = \lfloor\, d + \lambda(t) + V(t) \,\rfloor_+   (29)

with the primary reinforcement signal \lambda(t) and the constant baseline level d (Table 1).

Learning

Critic. The synaptic weights of the Critic were adapted according to a two-factor learning rule. The Effective Reinforcement Signal r(t) with respect to the baseline d was multiplied by the presynaptic activity change \Delta x_{lm}(t):

w_{lm}(t) = w_{lm}(t-1) + \eta_c\, \big( r(t) - d \big)\, \Delta x_{lm}(t)   (30)
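A hedged sketch of the Critic computation, Eqns 27–30, is given below. All names and parameter values are assumptions of this illustration; the reset factor (b_l − 1) of Eq. 28 is folded into the boolean reset_active, and Eq. 22 is recomputed inline for the weight update.

```python
# Illustrative sketch of Eqns 27-30 (names and parameter values are assumptions).

def critic_step(w, x_now, x_prev, reward, reset_active,
                gamma=0.98, baseline=0.1, eta_c=0.1):
    """Compute the Effective Reinforcement Signal and update the Critic weights.

    w             -- dict l -> list of weights w_lm for stimulus l
    x_now, x_prev -- dict l -> representation x_lm at times t and t-1
    reward        -- primary reinforcement lambda(t)
    reset_active  -- dict l -> True while the reset flag b_l(t) = 1 (Eq. 26)
    """
    # Eq. 27: prediction carried by each stimulus.
    P_now = {l: sum(wm * xm for wm, xm in zip(w[l], x_now[l])) for l in w}
    P_prev = {l: sum(wm * xm for wm, xm in zip(w[l], x_prev[l])) for l in w}

    # Eq. 28: temporal difference of the predictions; zero during a reset interval.
    V = sum(0.0 if reset_active[l] else gamma * P_now[l] - P_prev[l] for l in w)

    # Eq. 29: Effective Reinforcement Signal with baseline d, cut off at zero.
    r = max(0.0, baseline + reward + V)

    # Eqns 22 and 30: two-factor rule; only weights whose representation
    # component has just switched off (positive Delta x) are changed.
    for l in w:
        for m in range(len(w[l])):
            dx = max(0.0, x_prev[l][m] - gamma * x_now[l][m])
            w[l][m] += eta_c * (r - baseline) * dx
    return r
```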
As the positive factors \Delta x_{lm}(t) were only non-zero at the offsets of the temporal stimulus representation components, only weights predicting the correct time of the reinforcer were adapted. Dopamine responses to conditioned stimuli seem to adapt much faster than responses to novel and salient stimuli extinguish. Therefore, the learning rate for decreasing the weight of the first component of the temporal stimulus representation was smaller than the usual learning rate (Table 1):

\eta_c = \begin{cases} \eta_{c-} & \text{for } m = 1 \text{ in extinction} \\ \eta_{c+} & \text{otherwise} \end{cases}   (31)
Actor. The Actor weights were adapted according to a three-factor learning rule, comprising the factors Effective Reinforcement Signal r(t), stimulus trace \bar{e}_l(t) and action trace \bar{a}_n(t):

\nu_{nl}(t) = \nu_{nl}(t-1) + \eta_a\, \big( r(t) - d \big)\, \bar{a}_n(t)\, \bar{e}_l(t)   (32)
For learning the delayed response task, previously learned associations between instructions and actions had to be extinguished before the correct associations between the trigger and the actions could be learned. If reward predictions extinguished before the incorrect actions did, the Effective Reinforcement Signal would remain at baseline value, and these actions would persevere. In order to avoid such perseveration, specific learning rates for acquisition and extinction assured that actions were extinguished faster than they were acquired (Table 1):

\eta_a = \begin{cases} \eta_{a+} & \text{for } r(t) - d > 0 \text{ (acquisition)} \\ \eta_{a-} & \text{for } r(t) - d \leq 0 \text{ (extinction)} \end{cases}   (33)

Matlab programs of the model are available at ftp://ftp.usc.edu/pub/bsl/Suri/Suri_Schultz.
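Finally, a hedged sketch of the three-factor Actor update, Eqns 32 and 33, in Python; the function name and the learning-rate values are assumptions of this illustration, with the extinction rate chosen larger than the acquisition rate as required by Eq. 33.

```python
# Illustrative sketch of Eqns 32-33 (names and learning-rate values are assumptions).

def actor_update(nu, r, a_bar, e_bar, baseline=0.1,
                 eta_plus=0.05, eta_minus=0.2):
    """Three-factor rule: reinforcement x action trace x stimulus trace.

    nu     -- Actor weights nu[n][l]
    r      -- Effective Reinforcement Signal r(t) of Eq. 29
    a_bar  -- action traces bar-a_n(t) of Eq. 16
    e_bar  -- stimulus traces bar-e_l(t) of Eq. 10
    """
    # Eq. 33: extinction (r at or below baseline) uses a larger rate than
    # acquisition, so that incorrect actions are unlearned quickly.
    eta_a = eta_plus if r - baseline > 0 else eta_minus
    for n in range(len(nu)):
        for l in range(len(nu[n])):
            nu[n][l] += eta_a * (r - baseline) * a_bar[n] * e_bar[l]   # Eq. 32
    return nu
```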