Studies in History and Philosophy of Biological and Biomedical Sciences 45 (2014) 57–67
Deep and beautiful. The reward prediction error hypothesis of dopamine
Matteo Colombo
Tilburg Center for Logic, General Ethics, and Philosophy of Science, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands
Article info
Article history: Received 12 February 2013; Received in revised form 21 October 2013; Available online 16 November 2013
Keywords: Dopamine; Reward-prediction error; Explanatory depth; Incentive salience; Reinforcement learning
Abstract
According to the reward-prediction error hypothesis (RPEH) of dopamine, the phasic activity of dopaminergic neurons in the midbrain signals a discrepancy between the predicted and currently experienced reward of a particular event. It can be claimed that this hypothesis is deep, elegant and beautiful, representing one of the largest successes of computational neuroscience. This paper examines this claim, making two contributions to existing literature. First, it draws a comprehensive historical account of the main steps that led to the formulation and subsequent success of the RPEH. Second, in light of this historical account, it explains in which sense the RPEH is explanatory and under which conditions it can be justifiably deemed deeper than the incentive salience hypothesis of dopamine, which is arguably the most prominent contemporary alternative to the RPEH.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

According to the reward-prediction error hypothesis of dopamine (RPEH), the phasic activity of dopaminergic neurons in specific regions in the midbrain signals a discrepancy between the predicted and currently experienced reward of a particular event. The RPEH is widely regarded as one of the largest successes of computational neuroscience. Terrence Sejnowski, a pioneer in computational neuroscience and prominent cognitive scientist, pointed to the RPEH when, in 2012, he was invited by the online magazine Edge.org to answer the question ‘‘What is your favorite deep, elegant, or beautiful explanation?’’ Several researchers in cognitive and brain sciences would agree that this hypothesis ‘‘has become the standard model [for explaining dopaminergic activity and reward-based learning] within neuroscience’’ (Caplin & Dean, 2008, p. 663). Even among critics, the ‘‘stunning elegance’’ and the ‘‘beautiful rigor’’ of the RPEH are recognized (Berridge, 2007, pp. 399, 403). However, the type of information coded by dopaminergic transmission—along with its functional role in cognition and behaviour—is very likely to go beyond reward-prediction error. The RPEH is not the only available hypothesis about what type of
information is encoded by dopaminergic activity in the midbrain (cf., Berridge, 2007; Friston et al., 2012; Graybiel, 2008; Wise, 2004). Current evidence does not speak univocally in favour of this hypothesis, and disagreement remains about the extent to which the RPEH is supported by available evidence (Dayan & Niv, 2008; O’Doherty, 2012; Redgrave & Gurney, 2006). On the one hand, it has been claimed that ‘‘to date no alternative has mustered as convincing and multidirectional experimental support as the prediction-error theory of dopamine’’ (Niv & Montague, 2009, p. 342; see also Niv, 2009; Glimcher, 2011); on the other hand, the counter-claims have been put forward that the RPEH is an ‘‘elegant illusion’’ and that ‘‘[s]o far, incentive salience predictions [that is, predictions of an alternative hypothesis about dopamine] appear to best fit the data from situations that explicitly pit the dopamine hypotheses against each other’’ (Berridge, 2007, p. 424). How, then, has the RPEH become so successful? What does it explain exactly? And, granted that it is at least intuitively uncontroversial that the RPEH is beautiful and elegant, in which sense can it be justifiably deemed deeper than alternatives? The present paper addresses these questions by first reconstructing the main historical events that led to the formulation and subsequent success of the RPEH (Section 2).
With this historical account in the background, it is elucidated what the RPEH explains and how, contrasting it with the incentive salience hypothesis—arguably its most prominent current alternative. It is clarified that both hypotheses are concerned only with what type of information is encoded by dopaminergic activity. Specifically, the RPEH has the dual role of accurately describing the dynamic profile of phasic dopaminergic activity in the midbrain during reward-based learning and decision-making, and of explaining this profile by citing the representational role of dopaminergic phasic activity. If the RPEH is true, then a mechanism composed of midbrain dopaminergic neurons and their phasic activity carries out the task of learning what to do in the face of expected rewards, generating decisions accordingly (Section 3). The paper finally explicates under which conditions some explanation of learning, motivation or decision-making phenomena based on the RPEH can be justifiably deemed deeper than some alternative explanation based on the incentive salience hypothesis. Two accounts of explanatory depth are considered. According to one account, deeper explanatory generalizations have wider scope (e.g., Hempel, 1959); according to the other, deeper explanatory generalizations show more degrees of invariance (e.g., Woodward & Hitchcock, 2003). It is argued that, although it is premature to maintain that explanations based on the RPEH are actually deeper—in either of these two senses of explanatory depth—than alternative explanations based on the incentive salience hypothesis, relevant available evidence indicates that they may well be (Section 4). The contribution of the paper to existing literature is summarised in the conclusion.

2. Reward-prediction error meets dopamine

Dopamine is a neurotransmitter in the brain.1 It has significant effects on many aspects of cognition and behaviour, including motor control, learning, attention, motivation, decision-making and mood regulation. Dopamine is implicated in pathologies such as Parkinson’s disease, schizophrenia, attention deficit hyperactivity disorder (ADHD) and addiction. These are some of the reasons why so much work has been directed at understanding the type of information carried by neurons that utilize dopamine as a neurotransmitter as well as their functional roles in cognition and behaviour.

Neurons that use dopamine as a neurotransmitter to communicate information are called dopamine or dopaminergic neurons. Such neurons are phylogenetically old, and found in all mammals, birds, reptiles and insects. Dopaminergic neurons are localized in several brain networks in the diencephalon (a.k.a. interbrain), mesencephalon (a.k.a. midbrain) and olfactory bulb (Björklund & Dunnett, 2007). Approximately 90% of the total number of dopaminergic neurons is in the ventral part of the midbrain, which comprises different dopaminergic networks with separate pathways. One of these pathways is the nigrostriatal pathway. It links the substantia nigra, a structure in the midbrain, with the striatum, which is the largest nucleus of the basal ganglia in the forebrain and has two components: the putamen and the caudate nucleus. Another pathway is the mesolimbic pathway, which links the ventral tegmental area in the midbrain to structures in the forebrain, external to the basal ganglia, such as the amygdala and the medial prefrontal cortex.
Dopamine neurons show two main patterns of firing activity, which modulate the level of extracellular dopamine: tonic and phasic activity (Grace, 1991). Tonic activity consists of regular firing patterns of 1–6 Hz that maintain a slowly changing base level of extracellular dopamine in afferent brain structures. Phasic activity consists of a sudden change in the firing rate of dopamine neurons, which can increase up to 20 Hz, causing a transient increase of extracellular dopamine concentrations.

The discovery that neurons can communicate by releasing chemicals was due to the German-born pharmacologist Otto Loewi—Nobel Prize winner in Physiology or Medicine along with co-recipient Sir Henry Dale—in 1921 (cf., Loewi, 1936). The discovery of dopamine as a neurotransmitter in the brain dates to 1957, and was due to the Swedish pharmacologist Arvid Carlsson—Nobel Prize in Physiology or Medicine in 2000 along with co-recipients Eric Kandel and Paul Greengard (cf., Carlsson, 2003). Carlsson’s work in the 1950s and 1960s paved the way for the findings that the basal ganglia contain the highest dopamine concentrations, that dopamine depletion is likely to impair motor function and that patients with Parkinson’s disease have markedly reduced concentrations of dopamine in the caudate and putamen (cf. Carlsson, 1959, 1966).

The search for the mechanisms of reward-based learning and motivation has been under way since at least the 1950s. James Olds and Peter Milner set out to investigate how electrical stimulation of certain brain areas could reinforce behaviour. They implanted electrodes in different areas of rats’ brains and allowed them to move about a Skinner box. Rats received stimulation whenever they pressed a lever in the box. When this stimulation was targeted at the ventral tegmental area and basal forebrain, the rats showed signs of positive reinforcement, as they would repeatedly press the lever up to 2000 times per hour. These results suggested to Olds and Milner that they had ‘‘perhaps located a system within the brain whose peculiar function is to produce a rewarding effect on behavior’’ (Olds & Milner, 1954, p. 426). The notion of ‘‘reward’’ here is to be understood within Thorndike’s (1911) and Skinner’s (1938) theories of learning. As Olds and Milner put it: ‘‘In its reinforcing capacity, a stimulus increases, decreases, or leaves unchanged the frequency of preceding responses, and accordingly it is called a reward, a punishment, or a neutral stimulus’’ (Olds & Milner, 1954, p. 419). So, some brain stimulation or some environmental stimulus is ‘‘rewarding’’ if animals learn to perform actions that are reliably followed by that stimulation or stimulus. Later experiments confirmed that electrical self-stimulation of specific brain regions has the same impact on motivation as other natural rewards, like food or water for hungry or thirsty animals (Crow, 1972; Trowill, Panksepp, & Gandelman, 1969).

The idea that some neurotransmitter could be a relevant causal component of some mechanism of reward-based learning and motivation was substantiated by pharmacological studies (Stein, 1968, 1969). Based on subsequent pharmacological (Fibiger, 1978) and anatomical research (Lindvall & Björklund, 1974), hypotheses about the involvement of dopaminergic neurons in this mechanism began to be formulated. In Roy Wise’s (1978) words: ‘‘[from the available evidence] it can be concluded that dopamine plays a specialized role in reward processes . . . It does seem to be the case that a dopaminergic system forms a critical link in the neural circuitry which confers rewarding qualities on intracranial stimulation . . .
and on intravenous stimulant injections’’ (Wise, 1978, pp. 237–238). Wise (1982) put forward one of the first hypotheses about dopamine function in cognition and behaviour that aimed to
1 Neurotransmitters are chemicals that carry information from one neuron to another across synapses. Synapses are structures connecting neurons that allow one nerve cell to pass an electric or chemical signal to one or more cells. Synapses consist of a presynaptic nerve ending, which can contain neurotransmitters, a postsynaptic nerve ending, which can contain receptor sites for neurotransmitters, and the synaptic cleft, which is a physical gap between the presynaptic and the postsynaptic ending. After neurotransmitters are released by a presynaptic ending, they diffuse across the synaptic cleft and then bind with receptors on the postsynaptic ending, which alters the state of the postsynaptic neuron.
explain a set of relevant data from anatomy, pharmacology, brain self-stimulation, pathology and lesion studies. It was called the anhedonia hypothesis, and was advanced based on pharmacological evidence that moderate doses of neuroleptic drugs (i.e. dopamine antagonists)2 can disrupt behavioural phenomena during reinforcement tasks without severely compromising motor function (cf., Costall & Naylor, 1979). The anhedonia hypothesis was proposed as an alternative to simple motor hypotheses that claimed that the dopamine system is a mechanism of motor control and that dopaminergic impairment causes only motor deficits (see, e.g., Koob, 1982). The anhedonia hypothesis stated that ‘‘the normal functioning of some unidentified dopaminergic substrate (it could be one or more of several dopaminergic projections in the brain) and its efferent connections are necessary for the motivational phenomena of reinforcement and incentive motivation and for the subjective experience of pleasure that usually accompanies these phenomena’’ (Wise, 1982, p. 53).3 This hypothesis is committed to the claims that some network of dopaminergic neurons to be specified is a causally relevant component of the mechanism of reinforcement, that some network of dopaminergic neurons is necessary to feel pleasure, and that pleasure is a necessary correlate of reinforcement.

The explanatory link between dopamine and pleasure was superficial, however. Testing the effects of selective lesion of the mesostriatal dopamine system on rats’ reactions to different tastes, Berridge, Venier, and Robinson (1989) observed that the predictions of both the anhedonia and the motor hypothesis were not borne out. It was found that the subjective experience of pleasure is not a necessary correlate of reinforcement and that dopaminergic neurons are not necessary for pleasure (see Wise, 2004, for a later reassessment of the evidence). On the basis of taste-reactivity data and of the psychopharmacological effects of drug addiction, and drawing upon earlier theories of incentive motivation (e.g., Bindra, 1974; Toates, 1986), Kent Berridge and colleagues put forward the incentive salience hypothesis of dopamine (ISH). According to this hypothesis, dopamine release by mesencephalic structures such as the ventral tegmental area assigns ‘‘incentive value’’ to objects or behavioural acts. Incentive salience is a motivational, ‘‘magnet-like’’ property, which makes external stimuli or internal representations more salient, and more likely to be wanted, approached or consumed. Attribution of incentive salience to a stimulus that predicts some reward makes both the stimulus and the reward ‘‘wanted’’ (Berridge & Robinson, 1998; Robinson & Berridge, 1993). Since the ISH has been claimed to be the foremost contemporary alternative to the RPEH (e.g., Berridge, 2007), the sections below consider it more closely, and compare it with the RPEH along two dimensions of explanatory depth. For now, let us move to the next steps on the road to the RPEH.

In the 1980s, the role of dopamine in motor function remained an active topic of research (e.g., Beninger, 1983; Stricker & Zigmond, 1986; White, 1986; see Dunnett & Robbins, 1992, for a later review). This interest was justified by earlier findings that Parkinsonian patients display a drastic reduction of dopaminergic neurons in the striatum (Ehringer & Hornykiewicz, 1960; Hornykiewicz, 1966), associated with symptoms like tremor, hypokinesia and rigidity.
Wolfram Schultz was among the neuroscientists working on the relationship between dopamine
depletion, motor function and Parkinson’s disease (Schultz, Ruffieux, & Aebischer, 1983). As a way to assess this relationship, he used single-cell recordings of dopaminergic neurons in awake monkeys while they were performing reaching movements for food reward in response to auditory or visual stimuli (Schultz et al., 1983; Schultz, 1986). Phasic activity of midbrain dopamine neurons was found to be associated with the presentation of the visual or auditory stimuli that would be followed by the food reward. Some such neurons showed phasic changes in activity also at the time the reward was obtained. Execution of reaching movements was less significantly associated with dopaminergic activity, indicating that activity of midbrain dopaminergic neurons does not encode specific movement parameters. Schultz and colleagues hypothesised that such activity carried out some more general function having to do with a change in the level of behavioural reactivity triggered by stimuli leading to a reward.

In the following ten years, Schultz and colleagues carried out similar single-cell recording experiments from midbrain dopaminergic neurons in the ventral tegmental area and substantia nigra of awake monkeys while they repeatedly performed an instrumental or Pavlovian conditioning task4 (Ljungberg, Apicella, & Schultz, 1992; Romo & Schultz, 1990; Schultz, Apicella, & Ljungberg, 1993; Schultz & Romo, 1990; Mirenowicz & Schultz, 1994). In a typical experiment, a thirsty monkey was seated before two levers. After a visual stimulus was displayed (e.g. a light flashing), the monkey had to press the left but not the right lever in order to receive the juice reward. An idiosyncratic pattern of dopaminergic activity was observed during this experiment. During the early phase of learning—when the monkey was behaving erratically—dopamine neurons displayed a phasic burst of activity only when the reward was obtained. After a number of trials, as the monkey had learnt the correct stimulus-action-reward association, the response of the neurons to the reward disappeared. Now, whenever the visual stimulus was displayed, the monkey began to show anticipatory licking behaviour, and its dopaminergic neurons showed phasic bursts of activity associated with the presentation of the visual stimulus. If an expected juice reward was omitted, the neurons responded with a dip of activity, below basal firing rate, at the time at which the reward would have been delivered, which suggested that dopaminergic activity is sensitive to both the occurrence and the timing of the reward.

The pattern of dopaminergic activity observed in these types of tasks was explained in terms of generic ‘‘attentional and motivational processes underlying learning and cognitive behavior’’ (Schultz et al., 1993, p. 900). Schultz and colleagues did not refer to previous research by Wise and others about the involvement of dopamine in the mechanisms of reward, motivation and learning, nor did they refer to the growing literature on reinforcement learning from psychology and artificial intelligence. Thus, in the early 1990s, the questions of what type of information dopaminergic activity encodes and what its causal role is in the mechanism of reward-based learning and motivation remained open.

Meanwhile, by the late 1980s, Reinforcement Learning (RL) had been established as one of the most popular computational frameworks in machine learning and artificial intelligence.
RL offers a collection of algorithms to solve the problem of learning what to do in the face of rewards and punishments received by taking different actions in an unfamiliar environment (Sutton & Barto, 1998). One widely used RL algorithm is the temporal difference (TD)
2 Drugs that block the effects of dopamine by binding to and occluding the dopamine receptor are called dopamine antagonists.
3 ‘Incentive motivation’ is synonymous with ‘secondary reinforcement’ (or ‘conditioned reinforcement’), which refers to a stimulus or situation that has acquired its function as a reinforcer after it has been associated with a primary reinforcer such as water or food. When a stimulus acquires incentive properties, it acquires not only the ability to elicit and maintain instrumental behaviour, but also to attract approach and to elicit consummatory behaviour (cf., Bindra, 1974).
4 In instrumental (or operant) conditioning, animals learn to respond to specific stimuli in such a way as to obtain rewards and avoid punishments. In Pavlovian (or classical) conditioning, no response is required to get rewards and avoid punishments, since rewards and punishments come after specific stimuli independently of the animal’s behaviour.
learning algorithm, whose development is most closely associated with Rich Sutton (1988). The development of the TD-algorithm was influenced by earlier theories of animal learning in mathematical psychology, especially by a seminal paper by Bush and Mosteller (1951), which formulated a formal account of how rewards increment the probability of a given behavioural response during instrumental conditioning tasks. Bush and Mosteller’s account was extended by Rescorla and Wagner (1972), whose model provided the basis for the TD-learning algorithm. The Rescorla–Wagner model is a formal model of instrumental and Pavlovian conditioning that describes the underlying changes in associative strength between a signal (e.g., a conditioned stimulus) and a subsequent stimulus (e.g., an unconditioned stimulus). The basic insight is similar to the one informing the Bush–Mosteller model: learning depends on error in prediction. As Rescorla and Wagner put it: ‘‘Organisms only learn when events violate their expectations. Certain expectations are built up about the events following a stimulus complex; expectations initiated by the complex and its component stimuli are then only modified when consequent events disagree with the composite expectation’’ (Rescorla & Wagner, 1972, p. 75). Accordingly, learning is driven by prediction errors, and the basic unit of learning is the conditioning trial. Change in associative strength between a conditioned stimulus and an unconditioned stimulus is a function of differences between what was predicted (i.e. the animal’s expectation of the unconditioned stimulus, given all the conditioned stimuli present on the trial) and what actually happened (i.e. the unconditioned stimulus) in a conditioning trial.

TD-learning extends the Rescorla–Wagner model by taking account of the timing of different stimuli within learning trials, which in fact influences how associative strength changes. TD-learning is driven by the difference between temporally successive estimates (or predictions) of a certain quantity—for example, the total amount of reward expected over the future (i.e. value). At any given time step, the estimate of this quantity is updated to bring it closer to the estimate at the next time step. The TD-learning algorithm makes predictions about what will happen. Then it compares these predictions with what actually happens. If the prediction is wrong, then the difference between what was predicted and what actually happened is used for learning. This core of TD-learning is captured by two equations. The first is an update rule:
\[ V(S)_{\text{new}} = V(S)_{\text{old}} + \eta\,\delta(t_{\text{outcome}}), \tag{1} \]

where V(S) denotes the value of a chosen option S, η is a learning rate parameter, and δ(t_outcome) is the temporal-difference reward-prediction error computed at each of two consecutive time steps (t_stimulus and t_outcome = t_stimulus + 1). The second equation defines the reward-prediction error at time t as:

\[ \delta(t) = r(t) + V(t) - V(t-1), \tag{2} \]

where V(t) is the predicted value of some option at time t, and r(t) is the reward outcome obtained at time t. The reward-prediction error at t_outcome is used to update V(S), that is, the value of the chosen option.
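To make the two equations concrete, here is a minimal Python sketch of TD-learning on a Pavlovian trial, added as an illustration rather than taken from the paper; the trial length, the stimulus and reward times, and the learning rate are arbitrary choices. It reproduces the qualitative pattern described above: early in training the prediction error peaks when the reward arrives, after training it shifts to the reward-predicting stimulus, and omitting an expected reward produces a negative error at the expected time of delivery.

```python
# Minimal, illustrative TD(0) simulation of equations (1) and (2); an editorial
# sketch, not code from the paper. A trial has T time steps; a conditioned
# stimulus appears at t = 1 and a reward of 1.0 is delivered at t = 3.

T = 5
reward_time = 3
eta = 0.3                    # learning rate, eta in equation (1)
V = [0.0] * (T + 1)          # V[t]: predicted future reward at time step t

def run_trial(V, reward=1.0):
    """Run one conditioning trial; return the prediction error at each step."""
    errors = []
    for t in range(1, T + 1):
        r = reward if t == reward_time else 0.0
        delta = r + V[t] - V[t - 1]        # equation (2)
        if t > 1:                          # equation (1): update the prediction
            V[t - 1] += eta * delta        # made at the previous time step;
        errors.append(round(delta, 2))     # V[0] (pre-stimulus baseline) stays
    return errors                          # at zero, since stimulus onset is
                                           # unpredictable in the task

print("first trial:    ", run_trial(V))
for _ in range(200):                       # train until predictions settle
    run_trial(V)
print("after learning: ", run_trial(V))
print("reward omitted: ", run_trial(V, reward=0.0))
# Early on the error peaks at reward delivery (t = 3); after learning it peaks
# at the stimulus (t = 1) and vanishes at the reward; omitting the expected
# reward yields a negative error (a 'dip') at the expected delivery time.
```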
The potential of TD-learning, and of RL more generally, to build neural network models and help interpret some results in brain science was clear from the 1980s. As Sutton and Barto (1998, p. 22) recall, ‘‘some neuroscience models developed at that time are well interpreted in terms of temporal-difference learning (Byrne, Gingrich, & Baxter, 1990; Gelperin, Hopfield, & Tank, 1985; Hawkins and Kandel, 1984; Tesauro, 1986)’’; however, the connection between dopamine and TD-learning had yet to be made explicit.

In the early 1990s, Read Montague and Peter Dayan were working in Terry Sejnowski’s Computational Neurobiology lab at the Salk Institute in San Diego. Dayan’s PhD at the University of Edinburgh, in artificial intelligence and computer science, focused
on RL, while some of Montague’s work, both as a graduate student and postdoc in biophysics and neuroscience, focused on models of self-organization and learning in neural networks. They both approached questions about brains and neural circuits by asking what computational functions they carry out (cf., Dayan, 1994). One spring day in 1991, TD-learning and dopamine got connected (Montague, 2007, pp. 108–109). Dayan came across one of Schultz and colleagues’ articles, which presented data from recordings of dopamine neurons’ activity during an instrumental learning task. By examining the plots in the article showing the firing patterns of the monkeys’ dopamine neurons, Dayan and Montague recognized the signature of TD-learning. The similarity between the reward-prediction error signal used in TD-learning algorithms and Schultz and colleagues’ recordings was striking. Activity of dopamine neurons appeared to encode a reward-prediction error signal.

Montague, Dayan and Sejnowski began writing a paper that interpreted Schultz and colleagues’ results within the computational framework of RL. Their project was to provide a unifying TD-learning model that could explain the neurophysiological and behavioural regularities observed in Schultz and colleagues’ experiments. In a short abstract, Quartz, Dayan, Montague, and Sejnowski (1992) laid out the insight that on-going activity in dopaminergic neurons in the midbrain can encode comparisons of expected and actual reward outcomes, which drive learning and decision-making. The insight was articulated by Montague, Dayan, Nowlan, Pouget, and Sejnowski (1993), and presented at the Conference on Neural Information Processing Systems (NIPS)—an annual event bringing together researchers interested in biological and artificial learning systems. In that paper a statement of the connection between TD-learning and Schultz and colleagues’ pattern of results was made: Some ‘‘diffuse modulatory systems . . . appear to deliver reward and/or salience signals to the cortex and other structures to influence learning in the adult. Recent data (Ljungberg et al., 1992) suggest that this latter influence is qualitatively similar to that predicted by Sutton and Barto’s (1981, 1987) classical conditioning theory’’ (Montague et al., 1993, p. 970). However, theoretical neuroscience—computational neuroscience, particularly—was in its infancy at the time, and it had not yet been recognized as a field integral to neuroscience (cf., Abbott, 2008). The paper that Montague, Dayan and Sejnowski had originally set out to publish in 1991 was getting rejected by every major journal in neuroscience, partly because the field was dominated by experimentalists (Montague, 2007, p. 285).

Dayan and Montague approached the issue from a different angle. The foraging behaviour of honeybees was known to follow the pattern of TD-learning, and evidence from single-cell recordings and from intracellular current injections indicated that bees use a neurotransmitter, octopamine, to implement TD-learning (Hammer, 1993). Motivated by these findings, Dayan and Montague developed a model of the foraging behaviour of honeybees (Montague, Dayan, Person, & Sejnowski, 1995). First, they identified a type of neural architecture that might be common to both vertebrates and invertebrates and could implement TD-learning. Second, they argued for the biological feasibility of this neurocomputational architecture, noting that bees’ diffuse octopaminergic system is suited to carry out TD-learning.
Finally, they showed that a version of the TD-learning algorithm running on the neurocomputational architecture they specified could produce some learning and decision-making phenomena displayed by bees during foraging. Montague and colleagues highlighted that: ‘‘There is good evidence for similar predictive responses in primate mesencephalic dopaminergic systems. Hence, dopamine delivery in primates may be used by target neurons to guide action selection and learning, suggesting the conservation of an important functional principle, albeit
differing in its detailed implementation’’ (Montague et al., 1995, p. 728).5

By the mid-1990s, other research groups recognized that the activity of dopaminergic neurons in tasks of the sort used by Schultz and colleagues can be accurately described as implementing some reward-prediction error algorithm. Friston, Tononi, Reeke, Sporns, and Edelman (1994) considered value-dependent plasticity in the brain in the context of evolution and adaptive behaviour. They hypothesised that the ascending neuromodulatory systems, and—in light of Ljungberg et al.’s (1992) findings—the dopaminergic system in particular, are core components of some value-based mechanism, whose processes are selective for rewarding stimuli. Houk, Adams, and Barto (1995) put forward a hypothesis about the computational architecture of the basal ganglia, where dopaminergic neurons would control learning and bias the selection of actions by computing reward-prediction errors. The neuroscience community started to pay closer attention to the relationship between TD-learning and dopamine.

After five years, Montague, Dayan and Sejnowski had their original paper published in The Journal of Neuroscience (Montague, Dayan, & Sejnowski, 1996). In this paper, after having noted, with Wise (1982), that dopamine neurons are involved in a number of cognitive and behavioural functions, they examined Schultz and colleagues’ results. These results indicated that whatever is encoded in dopaminergic signals should be capable of explaining four sets of data. ‘‘(1) The activities of these neurons do not code simply for the time and magnitude of reward delivery. (2) Representations of both sensory stimuli (lights, tones) and rewarding stimuli (juice) have access to driving the output of dopamine neurons. (3) The drive from both sensory and reward representations to dopamine neurons is modifiable. (4) Some of these neurons have access to a representation of the expected time of reward delivery’’ (Montague et al., 1996, p. 1938). Montague et al. (1996) emphasised an underappreciated aspect of Schultz and colleagues’ results: dopaminergic neurons are sensitive not only to the expected and actually experienced magnitude of reward, but also to the precise temporal relationships between the occurrence of a reward-predictor and the occurrence of the actual reward. This aspect was crucial to draw the connection between TD-computation and dopaminergic activity. For it suggested that dopamine neurons should be able to represent relationships between reward-predictors, predictions of both the likely time and magnitude of a future reward, and the actual experienced time and magnitude of the reward.

The core of Montague and colleagues’ (1996) paper consists in laying out the computational framework of reinforcement learning and bringing it to bear on neurophysiological and behavioural evidence related to dopamine so as to connect neural function to cognitive function. By means of modelling and computer simulations, they showed that the type of algorithm that can solve learning tasks of a certain kind could accurately and compactly describe the behaviour of many dopaminergic neurons in the midbrain: ‘‘the fluctuating delivery of dopamine from the VTA [i.e., ventral tegmental area] to cortical and subcortical target structures in part delivers information about prediction errors between the expected amount of reward and the
actual reward’’ (Ibid., p. 1944, emphasis in original). One year later, in 1997, Montague and Dayan published another similar paper co-authored with Schultz in Science, which has remained the standard reference for the RPEH.

3. Reward-prediction error and incentive salience: what do they explain?

In light of Montague et al. (1996) and Schultz, Dayan, and Montague (1997), the RPEH can now be more precisely characterised. The hypothesis states that the phasic firing of dopaminergic neurons in the ventral tegmental area and substantia nigra ‘‘in part’’ encodes reward-prediction errors. Montague and colleagues did not claim that all types of activity in all dopaminergic neurons encode only (or in all circumstances) reward-prediction errors. Their hypothesis is about ‘‘a particular relationship between the causes and effects of mesencephalic dopaminergic output on learning and behavioural control’’ (Montague et al., 1996, p. 1944). This relationship may hold for a certain type of activity of some dopaminergic neurons during certain kinds of learning and decision-making tasks. The claim is not that dopaminergic neurons encode only reward-prediction errors. The claim is neither that prediction errors can only be computed by dopaminergic activity, nor that all learning and action selection is carried out through reward-prediction errors or dependent on dopaminergic activity.

The RPEH relates dynamic patterns of activity in specific structures of the brain to a precise computational function. As reward-prediction errors are differences between experienced and expected rewards, whether or not dopamine neurons respond to a particular reward depends on whether or not this reward was expected at all, on its expected magnitude, and on the expected time of its delivery. Thus, the hypothesis relates dopaminergic responses to two types of variables: reward and belief (or expectation) about the magnitude and time of delivery of a reward that may be obtained in a situation. Accordingly, the RPEH can be understood as relating dopaminergic responses to probability distributions over prizes (or lotteries), from which a prize with a certain magnitude is obtained at a given time (Caplin & Dean, 2008; Caplin, Dean, Glimcher, & Rutledge, 2010). The hypothesis has the dual role of accurately describing the dynamic profile of phasic dopaminergic activity in the midbrain during reward-based learning and decision-making, and of explaining this profile by citing the representational role of dopaminergic phasic activity. Thus, the RPEH addresses two distinct questions. First, how are some of the regularities in dopaminergic neurons’ firing patterns accurately and compactly described? Second, what is the computational function carried out by those firing patterns? By answering the second question, the RPEH furnishes the basis for a neurocomputational explanation of reward-based learning and decision-making. Neurocomputational explanations explain cognitive phenomena and behaviour (e.g., blocking and second-order conditioning6) by identifying and describing relevant mechanistic components (e.g., dopaminergic neurons), their organized activities (e.g., dopaminergic neurons’ phasic firings), the computational routines they perform
5 This conclusion might engender confusion, as it is not obvious how the suggestion that a ‘functional principle’ is ‘conserved’ across evolution should be understood. When two unrelated organisms share some trait, this is often taken as evidence that the trait is homologous (i.e., derived from a common ancestor). But this is never sufficient evidence of homology. A proper phylogenetic reconstruction of the trait (involving more than two species at the very least) is necessary for establishing homology. Given the available evidence, Montague and colleagues’ suggestion is better understood in terms of analogy rather than homology. Two (probably more) species—including honeybees, macaque monkeys and other primates—might have independently evolved distinct diffuse neurotransmitter systems that have analogous (similar) functional properties. The similarity is not due to a common ancestor. Rather, the similarity is due to convergent evolution: both species faced similar environmental challenges and selective pressures, which suggests that TD-learning is an adaptive strategy to solve a certain class of learning and decision-making problems that recur across species. I am grateful to an anonymous referee for drawing my attention to this point.
6 In classical conditioning, blocking is the phenomenon whereby little or no conditioning occurs to a new stimulus if it is combined with a previously conditioned stimulus during the conditioning process. Second-order conditioning is a phenomenon where a conditional response is acquired by a neutral stimulus, when the latter is paired with a stimulus that has previously been conditioned.
(e.g., computations of reward-prediction errors) and the informational architecture of the system that carries out those computations (e.g., the actor-critic architecture, which implements TD-learning and maps onto separable neural components, see, e.g., Joel, Niv, & Ruppin, 2002; Balleine, Daw, & O’Doherty, 2009). Neural computation can be understood as the transformation—via sensory input and patterns of activity of other neural populations—of neural firings according to algorithmic rules that are sensitive only to certain properties of neural firings (Churchland & Sejnowski, 1992; Colombo, 2013; Piccinini & Bahar, 2013). If the RPEH is true, then a neurocomputational mechanism composed of midbrain dopaminergic neurons and their phasic activity carries out the task of learning what to do in the face of expected rewards and punishments, generating decisions accordingly. Currently, several features of this mechanism remain to be identified. So, RPEH-based explanations of reward-learning and decision-making are currently gappy (Dayan & Niv, 2008; O’Doherty, 2012). Nonetheless, several cognitive and brain scientists would agree that some RPEH-based neurocomputational mechanism to be specified can adequately explain many learning and decision-making phenomena in a manner not only more beautiful but also deeper than available alternatives.

The most prominent alternative to the RPEH currently available is probably the incentive salience hypothesis (ISH). This hypothesis states that firing of dopaminergic neurons in a larger mesocorticolimbic system mediates only incentive salience attribution. In Berridge’s words: ‘‘Dopamine mediates only a ‘wanting’ component, by mediating the dynamic attribution of incentive salience to reward-related stimuli, causing them and their associated reward to become motivationally ‘wanted’’’ (Berridge, 2007, p. 408). The ISH relates dopaminergic activations to a psychological construct: incentive salience (a.k.a. ‘‘wanting’’). Thereby, it answers the question of what the causal role of dopamine is in reward-related behaviour. By answering this question, the ISH furnishes the basis for a neuropsychological explanation of reward-based motivation and decision-making. The ISH is committed to the claim that dopaminergic firing codes incentive salience, which bestows stimuli or internal representations with the properties of being appetitive and attention-grabbing. Incentive salience attributions need not be conscious and need not involve feelings of pleasure (a.k.a. ‘‘liking’’). Dopaminergic activity is necessary to motivate actions aimed at some goal, as it would be a core component of a mechanism of motivation (or mechanism of ‘‘wanting’’). On the ISH, dopaminergic neurons are not relevant components of the mechanism of reward-based learning: ‘‘to say dopamine acts as a prediction error to cause new learning may be to make a causal mistake about dopamine’s role in learning: it might . . . be called a ‘dopamine prediction error’’’ (Berridge, 2007, p. 399). So, the ISH can be considered an alternative to the RPEH because it denies two central claims of the RPEH: first, it denies that dopamine encodes reward-prediction errors; second, it denies that dopamine is a core component of a mechanism of reward-based learning. Similarly to the RPEH, the explanation grounded in the ISH is gappy: it includes dopamine as a core component of the mechanism of incentive salience attribution and motivation, but it leaves several explanatory gaps (see, e.g., Berridge, 2007, note 8).
In particular, the hypothesis is under-constrained in at least three ways that make it less precise than the RPEH. First, it does not precisely identify the relevant anatomical location of the dopaminergic components; second, it is uncommitted as to the possibly distinct roles of phasic and tonic dopaminergic signals; finally, it is not formalised by a single computational model that could yield quantitative predictions.

As they are formulated, both the RPEH and the ISH are only concerned with what type of information is encoded by dopaminergic activity. Nonetheless, by putting forward different claims about what dopaminergic neurons do, these hypotheses motivate different dopamine-centred explanations of phenomena related to learning, motivation and decision-making. Although these dopamine-centred explanations are currently tentative and incomplete, so that it might be premature to argue that one explanation is actually deeper than the other, it is worthwhile explicating under which conditions a RPEH-based explanation can be justifiably deemed deeper than an alternative ISH-based explanation, while pointing to relevant available evidence.
4. Explanatory depth, reward-prediction error and incentive salience

A number of accounts of explanatory depth have recently been proposed in philosophy of science (e.g., Strevens, 2009; Weslake, 2010; Woodward & Hitchcock, 2003). While significantly different, these accounts agree that explanatory depth is a feature of generalizations that express the relationship between an explanans and an explanandum. According to Woodward and Hitchcock (2003), in order to be genuinely explanatory, a generalization should exhibit patterns of counterfactual dependence relating the explanans to the explanandum. Explanatory generalizations need not be laws or exceptionless regularities. They should enable us to answer what-if-things-had-been-different questions that show what the explanandum phenomenon depends upon. These questions concern the ways in which the explanandum would change under changes or interventions on its explanans, where the key feature of such interventions is that they do not causally affect the explanandum except through their effect on the explanans (Woodward, 2003). The degree of depth of an explanatory generalization is a function of the range of the counterfactual questions concerning possible changes in the target system that it can answer. Given two competing explanatory generalizations G1 and G2, if G1 is invariant (or continues to hold) under a wider range of possible interventions or changes than G2, then G1 is deeper than G2.7 Call this the ‘‘invariance account of explanatory depth.’’

According to a different view (cf., Hempel, 1959), explanatory generalizations should enable us to track pervasive uniformities in nature by being employable to account for a wide range of phenomena displayed by several types of possible systems. The depth of an explanatory generalization is a function of the range of possible systems to which it can apply.8 For a hypothesis to apply to a target system is for the hypothesis to accurately describe the relevant structures and dynamics of the system—where what is relevant and what is not is jointly determined by the causal structure of the real-world system under investigation, the scientists’ varying epistemic interests and purposes in relation to that system, and the scientists’ audience. Given two competing explanatory generalizations G1 and G2, if G1 can be applied to a wider range of possible systems or phenomena than G2, then G1 is deeper than G2. So, deeper explanatory generalizations have wider scope. Call this the ‘‘scope account of explanatory depth.’’9
7 Woodward & Hitchcock (2003, sec. 3) distinguish a number of ways in which a generalization may be more invariant than another. For the purposes of this paper, suffice it to point out that what they share is that they spell out different ways in which an explanatory generalization enables us to answer what-if-things-had-been-different questions.
8 Kitcher (1989) put forward a similar idea. His view, however, is that depth is a function of the range of actual situations to which a generalization can apply. For a discussion of some of the problems raised by this view, see Woodward & Hitchcock (2003, sec. 4).
9 It bears mention that the jury is still out on how these two views on depth relate to one another (see, e.g., Strevens, 2004 for a discussion of this issue).
4.1. Depth as scope, reward-prediction error and incentive salience

If some RPEH-based explanatory generalization can be applied to a wider range of possible phenomena or systems than some alternative ISH-based explanatory generalization, then the RPEH-based generalization is deeper according to the scope account of explanatory depth. What evidence is available to assess this claim?

ISH-based explanations have been most directly applied to rats’ behaviour and to the phenomenon of addiction in rodents and humans. In the late 1980s and early 1990s, incentive salience was invoked to explain the differential effects on ‘‘liking’’ (i.e. the experience of pleasure) and ‘‘wanting’’ (i.e. incentive salience) of pharmacological manipulations of dopamine in rats during taste-reactivity tasks (Berridge, Venier, & Robinson, 1989). Since then, incentive salience has been used to explain results from electrophysiological and pharmacological experiments that manipulated dopaminergic activity in mesocorticolimbic areas of rats performing Pavlovian or instrumental conditioning tasks (cf. Berridge & Robinson, 1998; Peciña, Cagniard, Berridge, Aldridge, & Zhuang, 2003; Tindell, Berridge, Zhang, Peciña, & Aldridge, 2005; Wyvell & Berridge, 2000; Robinson, Sandstrom, Denenberg, & Palmiter, 2005). Most ISH-explanations applied to humans concern a relatively small set of phenomena observed in addiction and Parkinson’s disease (O’Sullivan et al., 2011; Robinson & Berridge, 2008). From the viewpoint of incentive salience, addiction to certain substances or behaviours is caused by over-attribution of incentive salience. Compulsive behaviour would depend on an excessive attribution of incentive salience to drug-rewards and their cues, due to hypersensitivity or ‘‘sensitization’’ (i.e. an increase in a drug effect caused by repeated drug administration) in mesocortical dopaminergic projections. Sensitized dopaminergic systems would then cause pathological incentive motivation for drugs or other stimuli.

It may appear that a RPEH-based explanation has obviously wider scope than an ISH-based explanation. For TD-learning has been applied to many biological and artificial systems (see e.g. Sutton & Barto, 1998, chap. 11). TD-learning seems to be widespread in nature. For instance, recall that while Montague et al. (1995) argued that release of octopamine by a specific neuron in the honeybee brain may signal a reward-prediction error, they also suggested that the same type of ‘‘functional principle’’ guiding learning and action selection may well be conserved across species. However, if honeybees, primates and other species share an analogous TD-learning mechanism, or if many artificial systems implement TD-learning, this is not evidence for the wider explanatory scope of a RPEH-based explanation. Rather, it is evidence for the wider explanatory scope of RL, and particularly of TD-learning. The RPEH and the ISH are about dopamine. So, relevant evidence for wider scope should involve dopaminergic neurons and their activity.

RPEH-based explanations of learning and decision-making apply at least to rats, monkeys, and humans. The RPEH was formulated by comparing monkey electrophysiological data during instrumental and Pavlovian conditioning tasks to the dynamics of a TD reward-prediction error signal (Montague et al., 1996; Schultz et al., 1997).
Since then, single-cell experiments with monkeys have strengthened the case for a quantitatively accurate correspondence between phasic dopaminergic firings in the midbrain and TD reward-prediction errors (Bayer & Glimcher, 2005; Bayer, Lau, & Glimcher, 2007). Recordings from the ventral tegmental area of rats that performed a dynamic odour-discrimination task indicate that the RPEH generalizes to that species as well (Roesch, Calu, &
Schoenbaum, 2007). Finally, a growing number of studies using functional magnetic resonance imaging (fMRI) in humans engaged in decision-making and learning tasks have shown that activity in dopaminergic target areas such as the striatum and the orbitofrontal cortex correlates with reward-prediction errors of TD-models (Berns, McClure, Pagnoni, & Montague, 2001; Knutson, Adams, Fong, & Hommer, 2001; McClure, Berns, & Montague, 2003a; O’Doherty, Dayan, Friston, Critchley, & Dolan, 2003). These findings are in fact consistent with the RPEH, since fMRI measurements seem to reflect the incoming information that an area is processing, and striatal and cortical areas such as the orbitofrontal cortex are primary recipients of dopaminergic input from the ventral tegmental area (cf., McClure & D’Ardenne, 2009; Niv & Schoenbaum, 2008).10

Some RPEH-based explanation is employable to account for many phenomena related to learning and decision-making. Among the cognitive phenomena and behaviour for which some RPEH-based explanation has been put forward are: habitual vs. goal-directed behaviour (Daw, Niv, & Dayan, 2005; Tricomi, Balleine, & O’Doherty, 2009), working memory (O’Reilly & Frank, 2006), performance monitoring (Holroyd & Coles, 2002), pathological gambling (Ross, 2010), and a variety of psychiatric conditions including depression (e.g., Huys, Vogelstein, & Dayan, 2009; for a review of computational psychiatry see Montague, Dolan, Friston, & Dayan, 2012).

Most relevant here, a RPEH-based explanation of incentive salience itself has also been proposed, which indicates that different hypotheses about dopamine may well cohere to a much greater extent than might be supposed once they are properly formalized (McClure, Daw, & Montague, 2003b). According to this proposal, incentive salience corresponds to expected future reward, and dopamine—as suggested by the RPEH—serves the dual role of learning to predict future reward and of biasing action selection towards stimuli predictive of reward. McClure and colleagues demonstrated that some of the phenomena explained by the ISH, such as the dissociation between wanted and liked objects, directly follow from the role in biasing action selection that dopamine possesses according to the RPEH. Dopamine release would assign incentive salience to stimuli or actions by increasing the likelihood of choosing some action that leads to reward. So, dopamine receptor antagonism would reduce the probability of selecting any action, because estimated values for each available option would also decrease. If this proposal correctly captures the concept of incentive salience—which has been debated (Zhang, Berridge, Tindell, Smith, & Aldridge, 2009)—then there would be a compelling reason to believe that some RPEH-based explanation has indeed wider scope than some ISH-based explanation. We would be in the position to make a direct comparison between them, and the ISH-based explanation would be entailed by a more general RPEH-based explanation. So, for any possible target system to which the ISH-based explanation applies, there will be a RPEH-based explanation that applies to the same system, but not vice versa.

4.2. Depth as invariance, reward-prediction error and incentive salience

If some RPEH-based generalization is invariant (or continues to hold) under a wider range of possible interventions or changes than an alternative ISH-based generalization, then the RPEH-based generalization is deeper according to the invariance account of explanatory depth.
These interventions—recall—should not causally affect the explanandum phenomena except through their effect on the dopamine-centred mechanism whose behaviour is described by the explanatory generalization.
10 It is currently not possible to use fMRI data to assess the dopaminergic status of the MRI signal. Reliable information about the precise source of the fMRI signal is difficult to gather for the cortex, let alone the basal ganglia, particularly because neuromodulators can be vasoactive themselves (see Kishida et al., 2011 and Zaghloul et al., 2009 on methodologies for the measurement of sub-second dopamine release in humans).
In order to assess the relative degree of depth of alternative RPEH-based and ISH-based explanations, relevant interventions should be on particular mechanisms found in a particular biological lineage, such as some dopamine-centred mechanism found in primates. Interventions on merely analogous mechanisms found across biological lineages will not provide evidence relevant for depth-as-invariance.

It should also be noted that the degree of precision of the RPEH is higher than that of the ISH. Unlike the ISH, the RPEH makes claims specific to dopaminergic phasic activity, and to dopaminergic neurons in the ventral tegmental area and in the substantia nigra. It may be thought that this means that the range of interventions on dopaminergic activity relevant to assess depth is narrower for a RPEH-based explanation than for an ISH-based explanation: while for the ISH-based explanation relevant interventions may be on both phasic and tonic activity, or on dopaminergic neurons in mesocorticolimbic areas besides the ventral tegmental area and the substantia nigra, for the RPEH-based explanation relevant interventions will not concern tonic activity or dopaminergic neurons in any mesocorticolimbic circuit. This thought is mistaken, however. For the ISH remains silent about the specific range of interventions relevant to assess the invariance of an ISH-based explanation. So, the range of interventions relevant to assess the depth of an ISH-based explanation does not strictly contain the range of interventions relevant to assess the depth of a RPEH-based explanation. Finally, unlike the RPEH, the ISH lacks an uncontroversial formalization that could be implemented in the design of an experiment and yield precise, quantitative predictions. So, a RPEH-based explanation may be deeper than an alternative ISH-based explanation, even if they are both invariant under interventions on e.g. phasic dopaminergic activity in the ventral tegmental area. For the RPEH-based explanation will yield more accurate answers about how the explanandum phenomenon will change.

One set of available relevant evidence concerns dopamine manipulations by drugs. If some dopamine-centred explanatory generalization, based on either the RPEH or the ISH, correctly gives information about how some target behaviour would change, had dopaminergic signalling been enhanced or reduced, then the generalization will show some degree of depth-as-invariance. Before illustrating this idea with two examples, some caveats are in order. One caveat is that neuromodulators like dopamine have multiple, complex, and poorly understood effects on target neural circuits and on cognition at different spatial and temporal scales (Dayan, 2012). Knowledge is lacking about the precise effects on neurophysiology and behaviour of drugs that intervene on dopaminergic signals. Moreover—as mentioned above—RPEH-based and ISH-based explanations are currently tentative and gappy, partly precisely because the effects of interventions on dopaminergic systems are currently not well understood. Since both types of explanation are tentative and gappy, it remains controversial to what extent they would indeed always be fundamentally different (cf. e.g. McClure et al., 2003b; Niv & Montague, 2009, pp. 341–342).
To probe the explanatory link between dopaminergic signalling and reward-based learning and decision-making, Pessiglione, Seymour, Flandin, Dolan, and Frith (2006) used an instrumental learning task that involved monetary gains and losses, in combination with a pharmacological manipulation of dopaminergic signalling, as well as computational and functional imaging techniques, in healthy humans. Either haloperidol (an antagonist of dopamine receptors) or L-DOPA (a metabolic precursor of dopamine) was administered to different groups of participants. The effects of these manipulations were examined on both brain activity and choice behaviour. It was found that L-DOPA enhanced activity in the striatum—a main target of dopaminergic signalling—while haloperidol diminished it, which suggested that the magnitude of reward-prediction error signals targeting the striatum was enhanced (or diminished) by treatment with L-DOPA (or haloperidol). Choice behaviour was found to be systematically modulated by these manipulations. L-DOPA improved learning performance towards monetary gains, while haloperidol impaired it; that is, participants treated with L-DOPA were more likely than participants treated with haloperidol to choose stimuli associated with greater reward. Computational modelling results demonstrated that differences in reward-prediction error magnitude were sufficient for a TD-learning model to predict the effects of the manipulations on choice behaviour.

Some of the caveats spelled out above apply to Pessiglione et al.'s study. Pessiglione and colleagues acknowledged that it was not obvious how the drugs they administered affected different aspects of dopaminergic signalling with respect to e.g. tonic versus phasic firing, or distinct dopamine receptors. Notably, they did not consider whether their interventions might have affected learning behaviour (i.e. one of their target explananda) also through effects on motivation or attention. Nonetheless, evidence of the sort they provided is relevant to assess the relative depth of RPEH-based explanatory generalizations. For it can demonstrate that reward-based learning and decision-making are often modulated by reward-prediction errors encoded by dopaminergic activity. And it may indicate that the relation between RPEs and dopaminergic activity on the one hand, and choice behaviour during some reinforcement learning tasks on the other, shows some degree of invariance.
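To make the modelling step concrete, the following minimal sketch shows how, in a simple reinforcement-learning agent, merely scaling the magnitude of the reward-prediction error is enough to shift choice behaviour towards or away from the better-rewarded option, in the direction of the L-DOPA and haloperidol effects just described. The sketch is not Pessiglione et al.'s model or code: the two-armed task, the softmax choice rule, the learning rate, and the pe_gain multiplier standing in for the pharmacological manipulation are all illustrative assumptions.

import math
import random

def simulate(pe_gain, n_trials=200, alpha=0.3, beta=3.0, seed=0):
    # Two-armed instrumental learning with a scaled reward-prediction error.
    # pe_gain > 1 mimics an enhanced prediction-error signal (L-DOPA-like);
    # pe_gain < 1 mimics a diminished one (haloperidol-like). Illustrative only.
    rng = random.Random(seed)
    values = [0.0, 0.0]        # learned values of the two stimuli
    reward_prob = [0.8, 0.2]   # stimulus 0 pays off more often
    better_choices = 0
    for _ in range(n_trials):
        # softmax (logistic) choice between the two stimuli
        p0 = 1.0 / (1.0 + math.exp(-beta * (values[0] - values[1])))
        choice = 0 if rng.random() < p0 else 1
        reward = 1.0 if rng.random() < reward_prob[choice] else 0.0
        # reward-prediction error, scaled to mimic the drug manipulation
        delta = pe_gain * (reward - values[choice])
        values[choice] += alpha * delta
        better_choices += (choice == 0)
    return better_choices / n_trials

print(simulate(pe_gain=1.5))   # enhanced prediction error (L-DOPA-like)
print(simulate(pe_gain=0.5))   # diminished prediction error (haloperidol-like)

Under these assumptions, the agent with the amplified prediction error typically comes to choose the better-rewarded stimulus more often than the agent with the dampened one, which is the qualitative pattern that the TD-based analysis of Pessiglione et al.'s behavioural data captured.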
Fewer human studies have probed the link between pharmacological manipulation of dopaminergic signalling, on the one hand, and incentive salience attribution and motivation, on the other. One such study concerned the mechanism of sexual motivation. Oei, Rombouts, Soeter, van Gerven, and Both (2012) investigated how dopamine modulates activation in the ventral striatum during subconscious processing of sexual stimuli; the ventral striatum, together with its dopaminergic pathways, is suggested to be a component of a larger mesocorticolimbic mechanism of incentive salience. Incentive salience is thought to be an essential property of sexual stimuli that would motivate behavioural approach tendencies, capture attention, and elicit urges to pursue sex. An ISH-based explanation of sexual motivation would claim that sexual desire is produced by processes in a large mesocorticolimbic network driven by the release of dopamine into striatal targets. Dopaminergic activations would bestow incentive salience on sexual cues and sexual unconditioned stimuli, making these stimuli ‘‘wanted’’ and attention-grabbing. Oei and colleagues used fMRI combined with dopaminergic manipulations through administration of L-DOPA and haloperidol in healthy participants to probe the amplification (or depression) of the incentive salience of unconsciously perceived sexual cues. It was found that L-DOPA significantly enhanced activation in the ventral striatum and dorsal anterior cingulate—a brain region involved in cognitive control, action selection, emotional processing and motor control—when sexual stimuli were subliminally presented, in contrast to emotionally neutral and emotionally negative stimuli. Haloperidol, instead, decreased activation in those areas when the sexual stimuli were presented. It was concluded that the processing of sexual incentive stimuli is sensitive to pharmacological manipulations of dopamine levels in the midbrain.
These findings provide some evidence relevant to assess the degree of depth of an ISH-based explanation of sexual motivation, because they would indicate that such an explanation might be invariant over changes in the magnitude of dopaminergic signals in the mesocorticolimbic network, which would enable sexual motivation, as well as over changes in conscious perception of the sexual incentive stimuli. However—as Oei and colleagues acknowledged—these results did not speak to whether dopamine-dependent regulation of incentive salience attribution is
related to increases (or decreases) in sexual desire or behavioural approach tendencies. Nor did they discriminate whether the dopaminergic changes they observed could be predicted by the reward-prediction signals of a TD-learning algorithm. Hence, studies such as this one leave open the possibilities that the dopaminergic intervention did not in fact affect attention or motivation to pursue sexual behaviour (i.e. target explananda of an ISH-based explanation of sexual motivation), and that the intervention could have affected sexual motivation through the effects of reward-prediction error neural computations underlying reinforcement learning.

5. Conclusion

This paper has made two types of contributions to the existing literature, which should be of interest to both historians and philosophers of cognitive science. First, the paper has provided a comprehensive historical overview of the main steps that have led to the formulation of the RPEH. Second, in light of this historical overview, it has made explicit what precisely the RPEH and the ISH explain, and under which circumstances neurocomputational explanations of learning and decision-making phenomena based on the RPEH can be justifiably deemed deeper than explanations based on the ISH.

From the historical overview, it emerges that the formulation and subsequent success of the RPEH depend, at least partly, on its capacity to combine several threads of research across psychology, neuroscience and machine learning. By bringing the computational framework of RL to bear on the neurophysiological and behavioural sets of data that have been gathered about dopaminergic neurons since the 1960s, the RPEH connects dopamine’s neural function to cognitive function in a quantitatively precise and compact fashion.

It should now be clear that the RPEH, as well as the ISH, which is arguably its main current alternative, are hypotheses about the type of information encoded by dopaminergic activity. As such, they do not explain by themselves why or how people and other animals display certain types of phenomena related to learning, decision-making or motivation. Nonetheless, by putting forward different claims about what dopaminergic neurons encode, these hypotheses furnish the basis for distinct dopamine-centred explanations of those phenomena. The paper has examined some such explanations, contrasting them along two dimensions of explanatory depth. It has not been established that RPEH-based explanations are actually deeper—in either of the two senses of explanatory depth considered—than alternative explanations based on the ISH. For the dopamine-centred explanations that the two hypotheses motivate are currently tentative and incomplete. Nonetheless, from the relevant available evidence discussed in the paper, there are grounds to tentatively believe that, currently, for at least some phenomenon related to learning, decision-making or motivation, some RPEH-based explanation has wider scope or a greater degree of invariance than some ISH-based alternative explanation.

Acknowledgements

I am sincerely grateful to Aistis Stankevicius, Charles Rathkopf, Peter Dayan, and especially to Gregory Radick, editor of this journal, and to two anonymous referees, for their encouragement, constructive criticisms and helpful suggestions. The work on this project was supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the priority program ‘‘New Frameworks of Rationality’’ (SPP 1516).
The usual disclaimers about any remaining error or misconception in the paper apply.
References

Abbott, L. F. (2008). Theoretical neuroscience rising. Neuron, 60, 489–495.
Balleine, B. W., Daw, N. D., & O’Doherty, J. P. (2009). Multiple forms of value learning and the function of dopamine. In P. W. Glimcher, C. F. Camerer, E. Fehr, & R. A. Poldrack (Eds.), Neuroeconomics: Decision making and the brain (pp. 367–388). New York: Academic Press.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47(1), 129–141.
Bayer, H. M., Lau, B., & Glimcher, P. W. (2007). Statistics of midbrain dopamine neuron spike trains in the awake primate. Journal of Neurophysiology, 98(3), 1428–1439.
Berns, G. S., McClure, S. M., Pagnoni, G., & Montague, P. R. (2001). Predictability modulates human brain response to reward. Journal of Neuroscience, 21(8), 2793–2798.
Berridge, K. C. (2007). The debate over dopamine’s role in reward: The case for incentive salience. Psychopharmacology (Berl), 191, 391–431.
Berridge, K. C., & Robinson, T. E. (1998). What is the role of dopamine in reward: Hedonic impact, reward learning, or incentive salience? Brain Research Reviews, 28, 309–369.
Berridge, K. C., Venier, I. L., & Robinson, T. E. (1989). Taste reactivity analysis of 6-OHDA aphagia without impairment of taste reactivity: Implications for theories of dopamine function. Behavioral Neuroscience, 103, 36–45.
Bindra, D. A. (1974). A motivational view of learning, performance, and behavior modification. Psychological Review, 81, 199–213.
Björklund, A., & Dunnett, S. B. (2007). Dopamine neuron systems in the brain: An update. Trends in Neurosciences, 30(5), 194–202.
Bush, R. R., & Mosteller, F. (1951). A mathematical model for simple learning. Psychological Review, 58, 313–323.
Byrne, J. H., Gingrich, K. J., & Baxter, D. A. (1990). Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In R. D. Hawkins & G. H. Bower (Eds.), Computational models of learning (pp. 31–63). New York: Academic Press.
Caplin, A., & Dean, M. (2008). Dopamine, reward prediction error, and economics. Quarterly Journal of Economics, 123(2), 663–701.
Caplin, A., Dean, M., Glimcher, P. W., & Rutledge, R. B. (2010). Measuring beliefs and rewards: A neuroeconomic approach. Quarterly Journal of Economics, 125(3), 923–960.
Carlsson, A. (1959). The occurrence, distribution, and physiological role of catecholamines in the nervous system. Pharmacological Reviews, 11, 490–493.
Carlsson, A. (1966). Morphologic and dynamic aspects of dopamine in the central nervous system. In E. Costa, L. J. Côté, & M. D. Yahr (Eds.), Biochemistry and pharmacology of the basal ganglia (pp. 107–113). Hewlett, NY: Raven Press.
Carlsson, A. (2003). A half-century of neurotransmitter research: Impact on neurology and psychiatry. In H. Jörnvall (Ed.), Nobel lectures. Physiology or medicine, 1996–2000 (pp. 308–309). Singapore: World Scientific Publishing Co.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.
Colombo, M. (2013). Constitutive relevance and the personal/subpersonal distinction. Philosophical Psychology, 26, 547–570.
Costall, B., & Naylor, R. J. (1979). Behavioural aspects of dopamine agonists and antagonists. In A. S. Horn, J. Korf, & B. H. C. Westerink (Eds.), The neurobiology of dopamine (pp. 555–576). London: Academic Press.
Crow, T. J. (1972). A map of the rat mesencephalon for electrical self-stimulation. Brain Research, 36, 265–273.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8, 1704–1711.
Dayan, P. (1994). Computational modelling. Current Opinion in Neurobiology, 4(2), 212–217.
Dayan, P. (2012). Twenty-five lessons from computational neuromodulation. Neuron, 76, 240–256.
Dayan, P., & Niv, Y. (2008). Reinforcement learning: The good, the bad and the ugly. Current Opinion in Neurobiology, 18, 185–196.
Dunnett, S. B., & Robbins, T. W. (1992). The functional role of mesotelencephalic dopamine systems. Biological Reviews of the Cambridge Philosophical Society, 67(4), 491–518.
Ehringer, H., & Hornykiewicz, O. (1960). Verteilung von Noradrenalin und Dopamin (3-Hydroxytyramin) im Gehirn des Menschen und ihr Verhalten bei Erkrankungen des extrapyramidalen Systems. Klinische Wochenschrift, 38, 1236–1239.
Fibiger, H. C. (1978). Drugs and reinforcement mechanisms: A critical review of the catecholamine theory. Annual Review of Pharmacology and Toxicology, 18, 37–56.
Friston, K. J., Shiner, T., FitzGerald, T., Galea, J. M., Adams, R., Brown, H., et al. (2012). Dopamine, affordance and active inference. PLoS Computational Biology, 8, e1002327. http://dx.doi.org/10.1371/journal.pcbi.1002327.
Friston, K. J., Tononi, G., Reeke, G. N., Sporns, O., & Edelman, G. M. (1994). Value-dependent selection in the brain: Simulation in a synthetic neural model. Neuroscience, 59, 229–243.
Gelperin, A., Hopfield, J. J., & Tank, D. W. (1985). The logic of Limax learning. In A. Selverston (Ed.), Model neural networks and behavior (pp. 237–262). New York: Plenum Press.
Glimcher, P. W. (2011). Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences USA, 108(Suppl. 3), 15647–15654.
Grace, A. A. (1991). Phasic versus tonic dopamine release and the modulation of dopamine system responsivity: A hypothesis for the etiology of schizophrenia. Neuroscience, 41(1), 1–24.
Graybiel, A. M. (2008). Habits, rituals and the evaluative brain. Annual Review of Neuroscience, 31, 359–387.
Hammer, M. (1993). An identified neuron mediates the unconditioned stimulus in associative olfactory learning in honeybees. Nature, 366, 59–63.
Hawkins, R. D., & Kandel, E. R. (1984). Is there a cell-biological alphabet for simple forms of learning? Psychological Review, 91, 375–391.
Hempel, C. G. (1959). The logic of functional analysis. In L. Gross (Ed.), Symposium on sociological theory (pp. 271–287). New York: Harper & Row. Reprinted with revisions (1965) in Aspects of scientific explanation and other essays in the philosophy of science (pp. 297–330). New York: Free Press.
Holroyd, C. B., & Coles, M. G. H. (2002). The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4), 679–709.
Hornykiewicz, O. (1966). Dopamine (3-hydroxytyramine) and brain function. Pharmacological Reviews, 18, 925–964.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generates and uses neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Huys, Q. J. M., Vogelstein, J., & Dayan, P. (2009). Psychiatry: Insights into depression through normative decision-making models. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems 21 (pp. 729–736). MIT Press.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535–547.
Kishida, K. T., Sandberg, S. S., Lohrenz, T., Comair, Y. G., Saez, I. G., Phillips, P. E. M., et al. (2011). Sub-second dopamine detection in human striatum. PLoS ONE, 6(8), e23291.
Kitcher, P. (1989). Explanatory unification and the causal structure of the world. In P. Kitcher & W. Salmon (Eds.), Scientific explanation (pp. 410–505). Minneapolis: University of Minnesota Press.
Knutson, B., Adams, C. M., Fong, G. W., & Hommer, D. (2001). Anticipation of increasing monetary reward selectively recruits nucleus accumbens. Journal of Neuroscience, 21(16), RC159.
Koob, G. F. (1982). The dopamine anhedonia hypothesis: A pharmacological phrenology. Behavioral and Brain Sciences, 5, 63–64.
Lindvall, O., & Björklund, A. (1974). The organization of the ascending catecholamine neuron systems in the rat brain as revealed by the glyoxylic acid fluorescence method. Acta Physiologica Scandinavica (Suppl. 412), 1–48.
Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology, 67, 145–163.
Loewi, O. (1936). The chemical transmission of nerve action. Nobel Lecture. Reprinted in Nobel Lectures, Physiology or Medicine, Vol. 2 (1922–1941), pp. 416–432. Amsterdam: Elsevier, 1965.
McClure, S. M., Berns, G. S., & Montague, P. R. (2003a). Temporal prediction errors in a passive learning task activate human striatum. Neuron, 38(2), 339–346.
McClure, S. M., & D’Ardenne, K. (2009). Computational neuroimaging: Monitoring reward learning with blood flow. In J.-C. Dreher & L. Tremblay (Eds.), Handbook of reward and decision making (pp. 229–247). Oxford: Academic Press.
McClure, S. M., Daw, N. D., & Montague, P. R. (2003b). A computational substrate for incentive salience. Trends in Neurosciences, 26(8), 423–428.
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72(2), 1024–1027.
Montague, P. R. (2007). Your brain is almost perfect: How we make decisions. New York: Plume.
Montague, P. R., Dayan, P., Nowlan, S. J., Pouget, A., & Sejnowski, T. J. (1993). Using aperiodic reinforcement for directed self-organization. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 969–977). San Mateo, CA: Morgan Kaufmann.
Montague, P. R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
Montague, P. R., Dolan, R. J., Friston, K. J., & Dayan, P. (2012). Computational psychiatry. Trends in Cognitive Sciences, 16, 72–80.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154.
Niv, Y., & Montague, P. R. (2009). Theoretical and empirical studies of learning. In P. W. Glimcher et al. (Eds.), Neuroeconomics: Decision making and the brain (pp. 331–351). New York: Academic Press.
Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7), 265–272.
O’Doherty, J. P. (2012). Beyond simple reinforcement learning: The computational neurobiology of reward-learning and valuation. The European Journal of Neuroscience, 35(7), 987–990.
O’Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. (2003). Temporal difference learning model accounts for responses in human ventral striatum and orbitofrontal cortex during Pavlovian appetitive learning. Neuron, 38, 329–337.
O’Reilly, R. C., & Frank, M. J. (2006). Making working memory work: A computational model of learning in prefrontal cortex and basal ganglia. Neural Computation, 18, 283–328.
O’Sullivan, S. S., Wu, K., Politis, M., Lawrence, A. D., Evans, A. H., Bose, S. K., et al. (2011). Cue-induced striatal dopamine release in Parkinson’s disease-associated impulsive-compulsive behaviours. Brain, 134, 969–997.
Oei, N. Y., Rombouts, S. A., Soeter, S. P., van Gerven, J. M., & Both, S. (2012). Dopamine modulates reward system activity during subconscious processing of sexual stimuli. Neuropsychopharmacology, 37, 1729–1737.
Olds, J., & Milner, P. M. (1954). Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative and Physiological Psychology, 47, 419–427.
Peciña, S., Cagniard, B., Berridge, K. C., Aldridge, J. W., & Zhuang, X. (2003). Hyperdopaminergic mutant mice have higher ‘wanting’ but not ‘liking’ for sweet rewards. Journal of Neuroscience, 23, 9395–9402.
Pessiglione, M., Seymour, B., Flandin, G., Dolan, R. J., & Frith, C. D. (2006). Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature, 442, 1042–1045.
Piccinini, G., & Bahar, S. (2013). Neural computation and the computational theory of cognition. Cognitive Science, 34, 453–488.
Quartz, S. R., Dayan, P., Montague, P. R., & Sejnowski, T. J. (1992). Expectation learning in the brain using diffuse ascending projections. Society for Neuroscience Abstracts, 18, 1210.
Redgrave, P., & Gurney, K. (2006). The short-latency dopamine signal: A role in discovering novel actions? Nature Reviews Neuroscience, 7, 967–975.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). New York: Appleton-Century-Crofts.
Robinson, T. E., & Berridge, K. C. (1993). The neural basis of drug craving: An incentive-sensitization theory of addiction. Brain Research Reviews, 18, 247–291.
Robinson, T. E., & Berridge, K. C. (2008). The incentive sensitization theory of addiction: Some current issues. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1507), 3137–3146.
Robinson, S., Sandstrom, S. M., Denenberg, V. H., & Palmiter, R. D. (2005). Distinguishing whether dopamine regulates liking, wanting, and/or learning about rewards. Behavioral Neuroscience, 119, 5–15.
Roesch, M. R., Calu, D. J., & Schoenbaum, G. (2007). Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nature Neuroscience, 10(12), 1615–1624.
Romo, R., & Schultz, W. (1990). Dopamine neurons of the monkey midbrain: Contingencies of responses to active touch during self-initiated arm movements. Journal of Neurophysiology, 63, 592–606.
Ross, D. (2010). Economic models of pathological gambling. In D. Ross, H. Kincaid, D. Spurrett, & P. Collins (Eds.), What is addiction? (pp. 131–158). Cambridge, MA: MIT Press.
Schultz, W. (1986). Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. Journal of Neurophysiology, 56, 1439–1461.
Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. Journal of Neuroscience, 13, 900–913.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Romo, R. (1990). Dopamine neurons of the monkey midbrain: Contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63, 607–624.
Schultz, W., Ruffieux, A., & Aebischer, P. (1983). The activity of pars compacta neurons of the monkey substantia nigra in relation to motor activation. Experimental Brain Research, 51, 377–387.
Skinner, B. F. (1938). The behavior of organisms. New York: D. Appleton-Century.
Stein, L. (1968). Chemistry of reward and punishment. In D. H. Efron (Ed.), Proceedings of the American College of Neuropsychopharmacology (pp. 105–123). Washington, DC: U.S. Government Printing Office.
Stein, L. (1969). Chemistry of purposive behavior. In J. T. Tapp (Ed.), Reinforcement and behavior (pp. 328–355). New York: Academic Press.
Strevens, M. (2004). The causal and unification accounts of explanation unified—Causally. Noûs, 38, 154–176.
Strevens, M. (2009). Depth: An account of scientific explanation. Cambridge, MA: Harvard University Press.
Stricker, E. M., & Zigmond, M. J. (1986). Brain monoamines, homeostasis, and adaptive behavior. In Handbook of physiology: Intrinsic regulatory systems of the brain (Vol. IV, pp. 677–696). Bethesda, MD: American Physiological Society.
Sutton, R. S., & Barto, A. G. (1987). A temporal-difference model of classical conditioning. In Proceedings of the ninth annual conference of the Cognitive Science Society. Seattle, WA.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2), 135–170.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tesauro, G. J. (1986). Simple neural models of classical conditioning. Biological Cybernetics, 55, 187–200.
Thorndike, E. (1911). Animal intelligence. New York: Macmillan.
Tindell, A. J., Berridge, K. C., Zhang, J., Peciña, S., & Aldridge, J. W. (2005). Ventral pallidal neurons code incentive motivation: Amplification by mesolimbic sensitization and amphetamine. European Journal of Neuroscience, 22, 2617–2634.
Toates, F. (1986). Motivational systems. Cambridge: Cambridge University Press.
Tricomi, E., Balleine, B., & O’Doherty, J. (2009). A specific role for posterior dorsolateral striatum in human habit learning. European Journal of Neuroscience, 29, 2225–2232.
Trowill, J. A., Panksepp, J., & Gandelman, R. (1969). An incentive model of rewarding brain stimulation. Psychological Review, 76, 264–281.
Weslake, B. (2010). Explanatory depth. Philosophy of Science, 77(2), 273–294.
White, N. M. (1986). Control of sensorimotor function by dopaminergic nigrostriatal neurons: Influences of eating and drinking. Neuroscience and Biobehavioral Reviews, 10, 15–36.
Wise, R. A. (1978). Catecholamine theories of reward: A critical review. Brain Research, 152, 215–247.
Wise, R. A. (1982). Neuroleptics and operant behavior: The anhedonia hypothesis. Behavioral and Brain Sciences, 5, 39–88.
Wise, R. A. (2004). Dopamine, learning and motivation. Nature Reviews Neuroscience, 5, 483–494.
Woodward, J. (2003). Making things happen: A theory of causal explanation. New York: Oxford University Press.
Woodward, J., & Hitchcock, C. (2003). Explanatory generalizations, part II: Plumbing explanatory depth. Noûs, 37, 181–199.
Wyvell, C. L., & Berridge, K. C. (2000). Intra-accumbens amphetamine increases the conditioned incentive salience of sucrose reward: Enhancement of reward ‘‘wanting’’ without enhanced ‘‘liking’’ or response reinforcement. Journal of Neuroscience, 20, 8122–8130.
Zaghloul, K. A., Blanco, J. A., Weidemann, C. T., McGill, K., Jaggi, J. L., Baltuch, G. H., et al. (2009). Human substantia nigra neurons encode unexpected financial rewards. Science, 323(5920), 1496–1499.
Zhang, J., Berridge, K. C., Tindell, A. J., Smith, K. S., & Aldridge, J. W. (2009). A neural computational model of incentive salience. PLoS Computational Biology, 5, e1000437.