Neuroscience Research 148 (2019) 1–8
Perspective
Revisiting a theory of cerebellar cortex

Tadashi Yamazaki a,∗, William Lennon b

a Graduate School of Informatics and Engineering, The University of Electro-Communications, Japan
b Department of Electrical and Computer Engineering, University of California, San Diego, United States

∗ Corresponding author at: 1-5-1 Chofugaoka, Chofu 182-8585, Japan. E-mail address: [email protected] (T. Yamazaki).
Article history: Received 4 September 2018; Received in revised form 14 February 2019; Accepted 5 March 2019; Available online 26 March 2019.

Keywords: Cerebellum; Learning; Theory; Reinforcement learning; Actor-critic model; Molecular layer interneuron; Marr-Albus-Ito model
Abstract

Long-term depression at parallel fiber-Purkinje cell synapses plays a principal role in learning in the cerebellum, which acts as a supervised learning machine. Recent experiments, however, demonstrate various forms of synaptic plasticity at different sites within the cerebellum. In this article, we take into consideration synaptic plasticity at parallel fiber-molecular layer interneuron synapses as well as at parallel fiber-Purkinje cell synapses, and propose that the cerebellar cortex performs reinforcement learning, a form of learning that is more capable than supervised learning. We posit that reinforcement learning eliminates the need for explicit teacher signals in cerebellar learning; instead, learning can proceed from evaluative feedback alone. We demonstrate the learning capacity of cerebellar reinforcement learning using simple computer simulations of delay eyeblink conditioning and the cart-pole balancing task.

© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Contents

1. Introduction
2. Results
 2.1. Derivation of the cerebellar RL model
 2.2. Simulation of delay eyeblink conditioning
 2.3. Demonstration of the cart-pole balancing task
3. Discussion
 3.1. Cerebellar cortex as an RL machine
 3.2. MLIs as a predictor of CF signals in cerebellar RL
 3.3. Role of rebound potentiation in cerebellar RL
 3.4. Potential complementary roles of the cerebellum with the basal ganglia
4. Methods
 4.1. Modeling a compound learning rule for PF-PC and PF-MLI synapses
 4.2. Simulation of delay eyeblink conditioning
 4.3. Demonstration of the cart-pole task
Acknowledgements
References
1. Introduction

Fifty years ago, David Marr published a seminal paper titled "A theory of cerebellar cortex" (Marr, 1969), in which he proposed that the cerebellar cortex is a supervised learning (SL) machine
known as a perceptron (Rosenblatt, 1958). At the core of the theory lie the distributed representation of mossy fiber (MF) inputs by granule cells and the synaptic plasticity at parallel fiber-Purkinje cell (PF-PC) synapses. In particular, it was assumed that PF-PC synapses are potentiated by conjunctive activation of the PF and the climbing fiber (CF) innervating the same PC. Two years later, James Albus revised the theory, proposing that PF-PC synapses should be depressed, rather than potentiated, by the conjunctive activation (Albus, 1971). More than 10 years later,
https://doi.org/10.1016/j.neures.2019.03.001
Masao Ito and his colleagues first discovered the synaptic plasticity in vivo predicted by Albus, which is now known as long-term depression (LTD) of PF-PC synapses (Ito et al., 1982; Ito and Kano, 1982; Ito, 1989). Thus, the Marr-Albus-Ito model was established. To date, this theory has continued to provide the basis for understanding the computational principles of the cerebellum (Ito, 1984, 2012).

Since its establishment, the theory has been challenged and extended repeatedly (Strata, 2009; Raymond and Medina, 2018). In particular, the involvement of PF-PC LTD in the adaptation of the vestibulo-ocular reflex (VOR) was a long-lasting debate in the field of neuroscience (Miles and Lisberger, 1981; Ito, 1982). Ito (1982) proposed that PF-PC LTD plays a causal role in VOR adaptation, whereas Miles and Lisberger (1981) argued that the cerebellar cortex only guides plasticity in the vestibular nuclei. The latest attempt to deny the causal role of PF-PC LTD in VOR adaptation was made by Schonewille et al. (2011), who reported that genetically manipulated mice, which seemed to lack normal PF-PC LTD, showed normal motor learning behavior, including VOR adaptation. They even proposed that PF-PC LTP, but not LTD, is indispensable for cerebellar learning (Schonewille et al., 2010). Ito's group refuted these inconsistent findings by providing multiple potential interpretations from both experimental and theoretical viewpoints (Ito et al., 2014). Yamazaki et al. (2015) proposed a theory of cerebellar memory consolidation that unifies the inconsistent experimental results with the standard Marr-Albus-Ito model. Later, Ito's group reported that even the genetically manipulated mice can show PF-PC LTD (Yamaguchi et al., 2016), suggesting that PF-PC LTD is still central to memory acquisition, whereas long-term potentiation (LTP) at MF-vestibular nucleus neuron synapses plays an important role in memory consolidation (Shutoh et al., 2006). Recently, Kakegawa et al. (2018) succeeded in optically manipulating the internalization of α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA) receptors on PC dendrites, which is the key mechanism underlying PF-PC LTD, and demonstrated that blocking the internalization resulted in a failure of motor learning. This is the first piece of evidence for a causal role of PF-PC LTD in cerebellar motor learning.

The current consensus seems to be that the widely distributed synaptic plasticity within the cerebellum acts synergistically (Gao et al., 2012), while PF-PC LTD plays a central role in cerebellar learning. To date, at least nine types of synaptic plasticity within the cerebellum have been reported (D'Angelo, 2014), making the situation still more complicated. Two questions thus arise: how do multiple plasticity mechanisms work synergistically, and could the distributed plasticity enhance the computational power of the cerebellar cortex beyond what is explained by the Marr-Albus-Ito model?

In this article, we focus on the functional role of molecular layer interneurons (MLIs) and the associated synaptic plasticity (Jörntell et al., 2010). Previous in vivo studies showing CF-driven changes in PF-MLI receptive fields led to the hypothesis that concomitant PF and CF activation strengthens PF-MLI synapses, whereas PF stimulation alone weakens them (Jörntell and Ekerot, 2002, 2003, 2011), in a mirror-symmetric fashion to the plasticity observed at PF-PC synapses.
In vitro studies have also shown adaptive changes in MLI activity (Liu and Cull-Candy, 2000; Rancillac and Crépel, 2004; Bender et al., 2009). Taking these findings into account, in this article we extend the Marr-Albus-Ito model (Fig. 1). We start from a set of simple equations describing these forms of plasticity, and derive a learning principle known as reinforcement learning (RL), which is more capable than SL (Sutton and Barto, 1998). The essential difference between RL and SL is that explicit "teacher" or "error" information is not required in RL. Rather, RL assumes a condition in which one receives evaluative
Fig. 1. Theoretical model of the cerebellar cortex. (A) The standard perceptron model (i.e., the Marr-Albus-Ito model). (B) The present model incorporating an MLI network and associated synaptic plasticity. Stars represent sites where synaptic plasticity exists. Arrowheads represent excitatory connections, whereas round heads represent inhibitory connections. The self-inhibition on the MLI in (B) represents a recurrent inhibitory network composed of MLIs. Red lines, symbols, and text represent the added components. Abbreviations: MF, mossy fiber; GR, granule cell; PF, parallel fiber; PC, Purkinje cell; IO, inferior olive; CF, climbing fiber; MLI, molecular layer interneuron.
feedback called a "reward" in response to an action taken in a certain situation, and maximizes the expected future reward through trial and error. Thus, RL is minimally supervised: subjects are not explicitly instructed what action to take in a given situation, but must discover it for themselves based on the reward they receive. RL is so powerful that an RL algorithm combined with deep learning technology has beaten the best human players of the game of Go (Silver et al., 2016), and the algorithm requires no prior knowledge of how to play the game (Silver et al., 2017). Therefore, if the cerebellum acts as an RL machine, it would be much more powerful and capable of more flexible learning than previously thought. Swain et al. (2011) first mentioned the possibility that the cerebellum is involved in RL. The present article, however, is the first to propose how an RL algorithm could be implemented by the cerebellar circuit.
2. Results

2.1. Derivation of the cerebellar RL model

We propose a theoretical model of the cerebellar cortex with two plasticity sites, incorporating MLIs and their associated network and plasticity into the standard Marr-Albus-Ito model, as follows:

$$\begin{aligned} PC(t) &= \sum_i w_i PF_i(t) - MLI(t) + \overline{PC},\\ MLI(t) &= \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c\,CF(t),\\ w_i(t+1) &= w_i(t) - \alpha \left\langle PF_i(t)\,MLI(t) \right\rangle,\\ v_i(t+1) &= v_i(t) + \beta \left\langle PF_i(t)\,MLI(t) \right\rangle, \end{aligned} \tag{1}$$
where $PC(t)$ is the activity of a PC at time $t$, $PF_i(t)$ is the activity of the $i$th PF, $w_i$ is the synaptic weight of $PF_i$ on the PC, $MLI(t)$ is the activity of the population of MLIs, $\overline{PC}$ is the baseline activity of the PC, $v_i$ is the synaptic weight of $PF_i$ on the population of MLIs, $CF(t)$ is the activity of a CF, $\langle f(t) \rangle$ represents the temporal average of $f(t)$, and $c$, $\alpha$, $\beta$, and $\gamma$ are constants. The detailed derivation is shown in the Methods.

This set of equations represents the simplest form of an actor-critic model (Barto et al., 1983), which is a description of an RL algorithm. In an actor-critic model, an actor calculates the preferences of actions for given states, $p(s(t), a(t))$, where $s(t)$ is the state and $a(t)$ is the action taken at time step $t$. The preference $p$ is updated as

$$p(s(t+1), a(t+1)) = p(s(t), a(t)) + \alpha \delta(t) e_i(t), \tag{2}$$

where $\alpha$ is a constant describing the learning efficacy, $\delta(t)$ is the evaluation of the action, called the temporal difference (TD) error, and $e_i(t)$ is the activity of input pathway $i$ at $t$, called the eligibility. $\delta(t)$ is calculated as

$$\delta(t) = r(t) + \gamma V(s(t+1)) - V(s(t)), \tag{3}$$

where $r(t)$ is the reward at $t$, $\gamma$ is a constant called the discount factor, and $V(s(t))$ is the value of $s(t)$, which is the evaluation of the given state. The critic updates the value $V(s(t))$ as

$$V(s(t+1)) = V(s(t)) + \beta \delta(t) e'(t), \tag{4}$$

where $\beta$ is a constant and $e'(t)$ is the eligibility at $t$. Now, we can consider the following correspondence to regard Eq. (1) as an actor-critic model:

$$\begin{aligned} \sum_i w_i(t) PF_i(t)\ (\text{and } PC(t)) &\Leftrightarrow p(s(t), a(t)),\\ MLI(t) &\Leftrightarrow -\delta(t),\\ CF(t) &\Leftrightarrow -r(t),\\ \sum_i v_i(t) PF_i(t) &\Leftrightarrow -V(s(t)). \end{aligned} \tag{5}$$

In particular, the PC and the MLIs serve as the actor and the critic, respectively, and the CF conveys the reward. It should be noted that because PCs are inhibitory neurons and PF-PC synapses are depressed rather than potentiated by CF activity (PF-PC LTD), the signs of some terms, including $r$, are flipped. In this sense, CF activity might be better called punishment rather than reward.
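To make the correspondence in Eq. (5) concrete, the following minimal Python sketch implements one time step of Eq. (1). It is our illustration rather than the authors' simulation code; the function name `step` and the default parameter values (taken from the eyeblink simulation in the Methods) are our own choices, and the temporal average ⟨·⟩ is approximated by the instantaneous product.

```python
import numpy as np

def step(pf_prev, pf, cf, w, v,
         alpha=0.02, beta=0.01, gamma=1.0, c=1.0, pc_baseline=0.0):
    """One update of the cerebellar actor-critic network of Eq. (1).

    pf_prev, pf -- PF activity vectors at times t-1 and t
    cf          -- scalar CF activity (corresponds to -r(t) in Eq. (5))
    w, v        -- PF-PC and PF-MLI weight vectors (updated in place)
    """
    # Critic: MLI activity is a value difference plus the CF input
    # (2nd line of Eq. (1)); by Eq. (5), MLI(t) corresponds to -delta(t).
    mli = v @ pf - gamma * (v @ pf_prev) + c * cf
    # Actor: PC activity (1st line of Eq. (1)).
    pc = w @ pf - mli + pc_baseline
    # Plasticity: PF-PC depression and PF-MLI potentiation gated by MLI
    # activity; the instantaneous product stands in for the temporal
    # average <PF_i(t) MLI(t)> of Eq. (1).
    w -= alpha * pf * mli
    v += beta * pf * mli
    return pc, mli
```

With this sign structure, a PF pattern paired with a CF pulse yields `mli > 0`, which depresses the active PF-PC weights (LTD-like) and potentiates the active PF-MLI weights, matching the learning rules of Eq. (1).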
2.2. Simulation of delay eyeblink conditioning
Fig. 2. Delay eyeblink conditioning. (A) Conditioning paradigm. When a CS (e.g., tone) and a paired US (e.g., airpuff) are presented repeatedly to a subject, the subject learns to elicit a CR (e.g., eyeblink). If the US onset is offset from the CS onset (e.g., by 500 ms), the subject elicits the CR at the US onset. (B–D) Simulation results. (B) PC activity at the 1st, 50th, and 100th trials (red, orange, and green, respectively). (C) The same plot as (B), except that the CS was co-terminated with the US. (D) PC activity at the 100th trial when only PF-PC plasticity is intact (red) or only PF-MLI plasticity is intact (blue). In all panels, the horizontal axis represents time (ms), and the vertical axis represents normalized activity. Activities are shown only for ±100 ms around the US onset.
First, we demonstrate how the cerebellar RL model acts differently from a standard SL cerebellar model. Delay eyeblink conditioning is a temporal association task involving the cerebellum (Mauk and Donegan, 1997; Christian and Thompson, 2003). In this task, an animal receives repeated paired presentations of a tone (conditioned stimulus; CS) and an airpuff (unconditioned stimulus; US). The animal is conditioned to close its eyes in response to the tone (conditioned response; CR), and the CR is elicited after a delay equal to the interstimulus interval (ISI) between the CS and US onsets (Fig. 2A). In other words, the animal learns the timing of the CR. Previous theoretical studies based on the standard Marr-Albus-Ito model have succeeded in reproducing these experimental results (see Yamazaki and Tanaka, 2009, for a review). Here, we show how the extended RL model exhibits dynamics different from those of the standard models during acquisition of the CR. The detailed simulation settings are described in the Methods.

Fig. 2B shows the activity of a PC at the 1st, 50th, and 100th trials. The PC learns to decrease its activity to a greater degree and earlier than the US onset, suggesting that the PC learns to pause in advance of the US. Even when the CS was co-terminated with the US, the pause was acquired successfully (Fig. 2C). In contrast, if either
of the PF-PC or PF-MLI plasticity mechanisms is blocked, the predictive pause is abolished (Fig. 2D). If only the PF-PC plasticity is intact, which corresponds to the SL case, the PC shows a larger pause around the US onset, while the anticipatory component is almost abolished compared with the normal case. On the other hand, if only the PF-MLI plasticity is intact, the PC no longer shows such a strong pause. These results suggest that the PF-PC and PF-MLI plasticity mechanisms contribute differently to generating the strong, anticipatory pause of the PC activity.

2.3. Demonstration of the cart-pole balancing task

To demonstrate the learning capability of cerebellar RL, we applied it to a cart-pole balancing task. The purpose of this task is to keep a pole on a cart standing upright while moving the cart either to the left or to the right (Fig. 3A). We assumed that when a PC increases its firing rate from baseline, it issues a motor command to move the cart to the left; when its activity decreases, a command is issued to move the cart to the right. When the pole falls down, a negative reward (i.e., punishment) is fed to the CF. Here, we do not provide the correct direction in which the cart should move when the pole falls; instead, we only tell the cerebellar RL machine that the pole has fallen.

Fig. 3B shows how learning occurred. In the initial trials, the pole fell down almost immediately. Around the 40th trial, our model learned how to move the cart correctly, and from the 41st trial onward it consistently succeeded in keeping the pole standing for 1 min. When the PF-MLI plasticity was blocked (i.e., with standard SL), the task could not be performed successfully until the 960th trial. Fig. 3C and D show the pole angle and the cart position trajectories, respectively, at the 1st, 40th, and 100th trials. At the 1st trial, the pole fell down immediately after the start. At the 40th trial, the cart first moved to the left and then reversed direction around 5 s to keep the pole standing, but the pole fell down around 10 s. At the 100th trial, the cart succeeded in keeping the balance for 60 s while moving back and forth repeatedly. Thus, the model acquired a behavior of moving the cart back and forth at the appropriate times to keep the pole standing.

3. Discussion

3.1. Cerebellar cortex as an RL machine

Multiple synergistic plasticity mechanisms are likely simultaneously involved in cerebellar learning. In this study, we integrated PF-MLI synaptic plasticity into the Marr-Albus-Ito model. Blocking PF-PC LTD abolished the adaptive change in the final output of the cerebellar cortex; this indicates that PF-PC LTD plays the central role and is therefore the final resort in cerebellar learning. Additionally, MLIs and their associated plasticity assist PCs in learning efficiently, as demonstrated by our cart-pole task. In this way, we extended, but did not replace, the Marr-Albus-Ito model.

The resulting network can be considered an actor-critic model, in which PCs and MLIs act as the actor and the critic, respectively, and CFs convey information on reward, a scalar value representing the evaluative feedback on the previously generated action. In our model, CFs no longer need to provide explicit teacher or error information. A number of previous studies have shown that CF activities contain information about directional errors in reaching (Kitazawa et al., 1998) and in saccades (Herzfeld et al., 2015).
Directional errors represent the mismatch between the desired and actual directions. The errors tell the agent how much the realized action differed from the desired one, and that the errors should be minimized in future trials.
Fig. 3. Cart-pole task. (A) Schematic of the system. x is the location of the cart, θ is the angle made by the pole, and F is the external force applied to the cart. (B) Learning performance. The horizontal axis represents trials, and the vertical axis represents the time (s) until failure (i.e., the pole falls down). (C) Pole angle trajectory in radians for 60 s at the 1st, 40th, and 100th trials (red, orange, and green, respectively). (D) Cart position trajectory in meters for 60 s at the 1st, 40th, and 100th trials (red, orange, and green, respectively).
Directional errors themselves do not tell an agent how to correct the errors explicitly in terms of joint torques or muscle contractions. Rather, they are evaluations of the actual directions made by the agent. In this sense, directional errors can be regarded as rewards or punishments in the context of RL.

The computational power of the cerebellar cortex as an SL machine has been examined using various machine learning problems (Hausknecht et al., 2017). That study demonstrated that a cerebellum modeled in an SL paradigm failed to perform complex tasks in which explicit teacher signals are not provided. This is natural, because SL machines require explicit teacher signals.
However, with our present model, it would be possible to successfully complete these tasks. Moreover, owing to its repetitive anatomical structure, the cerebellar cortex could be considered a collection of a number of corticonuclear microcomplexes, or RL modules. Thus, the entire cerebellar cortex could act as a massively parallel RL machine. Parallel RL has been studied in depth in the field of machine learning, and parallel RL machines have demonstrated a much more powerful learning capacity than single RL machines (Kulkarni et al., 2016; Van Seijen et al., 2017). We have built a very large-scale spiking network model of the cerebellum, the size of which is comparable to that of a cat's cerebellum; this model is composed of 1280 learning modules (Yamazaki et al., 2019). We have also built a spiking network model of MLIs with synaptic plasticity (Lennon et al., 2014, 2015). The model resulting from their integration could be useful as a massively parallel RL machine for machine learning and for solving real-world problems.

A major limitation of the present formulation is that there is no direct experimental evidence or rigorous mathematical justification for the feedback inhibition term $\gamma \sum_i v_i PF_i(t-1)$ in Eq. (1). Rather, we simply assumed that this inhibition is fed by the MLI network. This is our working hypothesis, and we believe that a detailed spiking network model that works as we expect could be implemented based on our previous models (Lennon et al., 2014, 2015). Last but not least, even if the cerebellum is an RL machine, PF-PC LTD remains the central principle of cerebellar learning (Kakegawa et al., 2018). The other plasticity mechanisms considered in our theory play supportive roles for faster and more efficient learning.

3.2. MLIs as a predictor of CF signals in cerebellar RL

In the standard RL paradigm, the role of the critic is to provide the reward prediction error (RPE) to the actor. In our cerebellar learning scheme, MLIs perform this role; namely, MLIs provide the prediction error of CF signals to PCs. In delay eyeblink conditioning, CFs emit spikes at the onset of a US, suggesting that CFs convey teacher signals carrying the correct timing information (Medina et al., 2002). Recent studies, however, demonstrate that CFs convey other types of information as well. In particular, Ohmae and Medina (2015) reported RPE-like complex spike activity in PCs during eyeblink conditioning in mice. In this particular case, the inferior olive (IO) could provide the RPE information, as dopamine neurons in the substantia nigra pars compacta (SNc) do, as hypothesized by Swain et al. (2011). Thus, the IO and PCs could perform RL without the aid of MLIs. Further investigations will clarify the different roles of MLIs and the IO in cerebellar RL.

3.3. Role of rebound potentiation in cerebellar RL

Rebound potentiation (RP) is a form of synaptic plasticity (Kano et al., 1992, 1996) wherein CF activation enhances the inhibition that a PC receives from nearby MLIs, which are exposed to spillover of the glutamate released by the CF innervating that PC. RP has been examined in terms of adaptation of eye movement reflexes (Tanaka et al., 2013; Hirano, 2013), and is considered a mechanism complementary to PF-PC LTD (Hirano et al., 2016). Its functional role in cerebellar learning, however, remains unclear.
If RP were incorporated in the present model, the plasticity would enhance the inhibition exerted by MLIs on a PC, which would in turn enhance the effect of the 2nd term on the right-hand side of the update equation for $w_i(t)$ in Eq. (1); specifically, the coefficient $\alpha$ would increase. Thus, in terms of cerebellar RL, RP corresponds to a mechanism that adaptively changes the learning efficacy parameter $\alpha$ in Eq. (1) depending on CF activity. The effect of adaptive changes in learning efficacy has been well studied in terms of speeding up the learning process (Murphy, 2012).
Therefore, we propose that RP plays a supportive role in enabling faster and more efficient learning, with shorter training periods and a smaller number of trials. Tanaka et al. (2013) reported paradoxical experimental results in RP-deficient mice. These mice showed normal VOR and optokinetic response (OKR), and even gain-up adaptation of the OKR, but failed to achieve gain-up adaptation of the VOR, although the circuit mechanisms for the VOR and OKR are thought to be identical. Perhaps OKR gain-up adaptation is easier than VOR gain-up adaptation, and the time courses of VOR and OKR gain-up adaptation might differ. Thus, in RP-deficient mice, OKR adaptation would still be possible, whereas VOR adaptation would not.

3.4. Potential complementary roles of the cerebellum with the basal ganglia

Traditionally, RL has been thought to be mediated by the basal ganglia (Schultz et al., 1997). Specifically, dopamine neurons in the ventral tegmental area (VTA) or SNc show RPE-like activity. Doya (1999, 2000) proposed that the basal ganglia and the cerebellum play different roles, in RL and SL, respectively. On the contrary, the present results argue that both the basal ganglia and the cerebellum are RL machines. The question then arises: how do the basal ganglia and cerebellum act cooperatively as distinct RL machines? In the machine learning community, several attempts have been made to enhance the power of RL beyond that of a single RL machine. One successful attempt is parallel RL, as discussed in the previous subsection; another is hierarchical RL. In hierarchical RL, two RL machines are instantiated and communicate with each other hierarchically. The higher RL machine performs global planning and decomposes a large, complex task into several small and simple subtasks, while the lower RL machine solves these subtasks one by one. Hierarchical RL has been demonstrated to work efficiently in adaptive robotic control (Morimoto and Doya, 2001) as well as in an Atari game that cannot be solved by a single RL machine (Kulkarni et al., 2016). Although there is no direct experimental evidence or theoretical consideration to date, if the basal ganglia and the cerebellum work cooperatively as a hierarchical RL machine, the resulting network would demonstrate powerful learning capabilities.

4. Methods

4.1. Modeling a compound learning rule for PF-PC and PF-MLI synapses

Let $PF_i(t)$, $PC(t)$, $MLI(t)$, and $CF(t)$ be the activity of the $i$th PF, a PC, a population of MLIs, and a CF at time $t$, respectively. We describe $PC(t)$ as

$$PC(t) = \sum_i w_i PF_i(t) - MLI(t) + \overline{PC}, \tag{6}$$

where $w_i$ is the synaptic weight of $PF_i$ on a PC. The 1st term on the right-hand side represents excitatory PF inputs; the 2nd term, inhibitory inputs from the MLI population; and the 3rd term, the baseline activity of the PC. We assume that this equation represents the simple spike activity of the PC; hence, we ignore the contribution of CF input in this equation. Further, we describe $MLI(t)$ as

$$MLI(t) = \sum_i v_i PF_i(t) - FB(t) + c' CF(t), \tag{7}$$

where $v_i$ is the synaptic weight of $PF_i$ on the MLI, and $c'$ is a positive constant. The 1st term on the right-hand side represents excitatory PF inputs; the 2nd term, the feedback inhibition from the recurrent network of MLIs (Rieubland et al., 2014); and the 3rd term, the CF input. Here, we assume that MLIs receive excitatory inputs from the CF via the spillover of glutamate (Szapiro and Barbour, 2007). The feedback term was assumed to be

$$FB(t) = \gamma \sum_i v_i PF_i(t-1), \tag{8}$$

which means that the recurrent inhibitory network of MLIs retains the activity of excitatory inputs from the previous time step with a slight decay denoted by the parameter $\gamma \le 1$; this is an important element of temporal difference learning in the actor-critic framework. We describe the change in $w_i$ as

$$w_i(t+1) = w_i(t) - a_1 \left\langle PF_i(t) CF(t) \right\rangle - a_2 \left\langle PF_i(t) MLI(t) \right\rangle, \tag{9}$$

where $a_1$ and $a_2$ are positive constants and $\langle \cdots \rangle$ denotes averaging over time. The 2nd term on the right-hand side represents a decrease induced by the conjunctive activation of $PF_i$ and the CF (i.e., PF-PC LTD) (Ito, 1989). The 3rd term represents a decrease induced indirectly by MLIs. Electrophysiological studies have shown that activation of GABA_A receptors on a granule cell axon enhances the excitability of the cell (Stell et al., 2007; Pugh and Jahr, 2011, 2013; Dellal et al., 2012). Correlated activation of PFs and MLIs could induce secretion of GABA from MLIs, which in turn enhances the excitability of granule cells via GABA spillover. We assume that this enhancement could strengthen PF-PC LTD further via the 2nd term, and therefore decrease the PF-PC synaptic weight. Some studies, however, demonstrated that inhibition of PCs during the conjunctive activation could suppress LTD induction (Ekerot and Kano, 1985; Rowan et al., 2018). Thus, the balance between the enhancement and the suppression might be important. Next, we describe the change in $v_i$ as

$$v_i(t+1) = v_i(t) + b_1 \left\langle PF_i(t) MLI(t) \right\rangle + b_2 \left\langle PF_i(t) CF(t) \right\rangle, \tag{10}$$

where $b_1$ and $b_2$ are positive constants. The 2nd term on the right-hand side represents an increase occurring through a Hebbian mechanism (Rancillac and Crépel, 2004; Smith and Otis, 2005), and the 3rd term represents an increase occurring through the conjunctive activation of PF and CF via the spillover of glutamate from the CF (Jörntell and Ekerot, 2002, 2003, 2011). Putting Eqs. (6)–(8) into Eqs. (9) and (10), we obtain

$$\begin{aligned} w_i(t+1) &= w_i(t) - a_1 \left\langle PF_i(t) CF(t) \right\rangle - a_2 \left\langle PF_i(t) MLI(t) \right\rangle\\ &= w_i(t) - a_1 \left\langle PF_i(t) CF(t) \right\rangle - a_2 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c' CF(t) \Big) \right\rangle\\ &= w_i(t) - a_2 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c' CF(t) + \frac{a_1}{a_2} CF(t) \Big) \right\rangle\\ &= w_i(t) - a_2 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c'' CF(t) \Big) \right\rangle, \end{aligned} \tag{11}$$

$$\begin{aligned} v_i(t+1) &= v_i(t) + b_1 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c' CF(t) \Big) \right\rangle + b_2 \left\langle PF_i(t) CF(t) \right\rangle\\ &= v_i(t) + b_1 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c' CF(t) + \frac{b_2}{b_1} CF(t) \Big) \right\rangle\\ &= v_i(t) + b_1 \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c''' CF(t) \Big) \right\rangle, \end{aligned}$$

where $c'' = c' + a_1/a_2$ and $c''' = c' + b_2/b_1$. Here, for simplicity, we set $a_1 = a_2 = \alpha$, $b_1 = b_2 = \beta$, and $c' + 1 = c$:

$$\begin{aligned} w_i(t+1) &= w_i(t) - \alpha \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + (c'+1) CF(t) \Big) \right\rangle\\ &= w_i(t) - \alpha \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c\, CF(t) \Big) \right\rangle,\\ v_i(t+1) &= v_i(t) + \beta \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + (c'+1) CF(t) \Big) \right\rangle\\ &= v_i(t) + \beta \left\langle PF_i(t) \Big( \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c\, CF(t) \Big) \right\rangle. \end{aligned} \tag{12}$$

Finally, we redefine $MLI(t)$ as

$$MLI(t) = \sum_i v_i PF_i(t) - \gamma \sum_i v_i PF_i(t-1) + c\, CF(t), \tag{13}$$

and obtain

$$w_i(t+1) = w_i(t) - \alpha \left\langle PF_i(t) MLI(t) \right\rangle, \qquad v_i(t+1) = v_i(t) + \beta \left\langle PF_i(t) MLI(t) \right\rangle. \tag{14}$$

Eqs. (6), (13), and (14) yield Eq. (1). In this formulation, we intended to indicate that afferent inputs (PFs) to the actor (PC) are modulated by the activity of the critic (MLI). To do so, the CF-dependent plasticity on the PC is effectively included in the 2nd term on the right-hand side, and hence Eq. (14) does not represent the term $CF(t)$ explicitly. Nevertheless, we emphasize that Eq. (14) contains the CF-dependent plasticity at PF-PC synapses.

4.2. Simulation of delay eyeblink conditioning

In delay eyeblink conditioning, the paired presentation of the CS and US is fed by PFs and a CF, respectively (Fig. 2A). In our simulation, a CS is represented by sequential activation of PFs. Let $PF_i(t)$ be the activity of the $i$th PF ($1 \le i \le N$), which is defined as

$$PF_i(t) = \exp\left( -\frac{(t - i\Delta t)^2}{2\tau_{PF}^2} \right), \tag{15}$$

where $N = 1000$ is the number of PFs, $\Delta t$ is the temporal resolution, which is set at 1 ms, and $\tau_{PF}$ is a time constant set at 5 ms. Let $T$ be the duration of the CS, set at 1 s; hence, $N\Delta t = T$. A US is represented by an instantaneous activation of a CF. The activation $CF(t)$ is defined as

$$CF(t) = \begin{cases} 1 & t = 500\ \text{ms}, \\ 0 & \text{otherwise}. \end{cases} \tag{16}$$

The CS-US pair is fed repeatedly 100 times. Initial weights $w_i$ and $v_i$ are set at 1.0 for all $i$, and parameters in Eq. (1) are set arbitrarily as $\alpha = 0.02$, $\beta = 0.01$, $\gamma = 1.0$, $c = 1.0$, and $\overline{PC} = 0$. In Fig. 2, the PC activity is normalized so that the baseline activity becomes 1.0.
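As a concrete illustration of this setup, the following self-contained Python sketch implements Eqs. (15), (16), and (1) with the parameter values given above. It reflects our reading of the text rather than the authors' published code; in particular, approximating the temporal average in Eq. (1) by the instantaneous product is our assumption.

```python
import numpy as np

N, dt, tau = 1000, 1.0, 5.0          # number of PFs, time step (ms), PF time constant (ms)
alpha, beta, gamma, c = 0.02, 0.01, 1.0, 1.0
pc_baseline = 0.0
t = np.arange(N) * dt                # one 1-s trial at 1-ms resolution

# Eq. (15): each PF is active around its own preferred time i*dt.
pf = np.exp(-((t[None, :] - t[:, None]) ** 2) / (2.0 * tau ** 2))  # pf[i, k] = PF_i at time k
cf = (t == 500.0).astype(float)      # Eq. (16): US-driven CF pulse at 500 ms

w = np.ones(N)                       # PF-PC weights
v = np.ones(N)                       # PF-MLI weights
for trial in range(100):             # 100 paired CS-US presentations
    pf_prev = np.zeros(N)
    for k in range(N):
        mli = v @ pf[:, k] - gamma * (v @ pf_prev) + c * cf[k]  # Eq. (1), MLI
        pc = w @ pf[:, k] - mli + pc_baseline                   # Eq. (1), PC
        # Weight updates; the instantaneous product stands in for the
        # temporal average <PF_i(t) MLI(t)> of Eq. (1).
        w -= alpha * pf[:, k] * mli
        v += beta * pf[:, k] * mli
        pf_prev = pf[:, k]
```

Recording `pc` across trials should reveal a pause that develops in advance of the 500-ms US onset, qualitatively as in Fig. 2B.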
4.3. Demonstration of the cart-pole task

We followed the same settings for the cart-pole task described elsewhere (Barto et al., 1983). Briefly, the dynamics of the cart and pole (Fig. 3A) are described by the following nonlinear differential equations:

$$\ddot{\theta} = \frac{ g \sin\theta(t) + \cos\theta(t) \dfrac{ -F(t) - m l \dot{\theta}^2 \sin\theta(t) + \mu_c\, \mathrm{sgn}(\dot{x}) }{ m_c + m } - \dfrac{ \mu_p \dot{\theta} }{ m l } }{ l \left( \dfrac{4}{3} - \dfrac{ m \cos^2\theta(t) }{ m_c + m } \right) }, \qquad \ddot{x} = \frac{ F(t) + m l \left( \dot{\theta}^2 \sin\theta(t) - \ddot{\theta} \cos\theta(t) \right) - \mu_c\, \mathrm{sgn}(\dot{x}) }{ m_c + m }, \tag{17}$$

where $\theta(t)$ is the angle made by the pole at time $t$ ($\theta(t) = 0$ is the standing position), $x(t)$ is the position of the cart at time $t$, $g = 9.8\ \mathrm{m/s^2}$ is the gravitational acceleration, $m_c = 1.0$ kg is the mass of the cart, $m = 0.1$ kg is the mass of the pole, $l = 0.5$ m is the half-length of the pole, $\mu_c = 0.0005$ is the coefficient of friction of the cart on the track, $\mu_p = 0.000002$ is the coefficient of friction of the pole on the cart, $F(t) = \pm 10.0$ N is the force applied to the cart's center of mass at time $t$, and $\mathrm{sgn}(\dot{x}) = +1$ if $\dot{x} > 0$ and $-1$ otherwise. Note, however, that we flipped the sign of the gravitational acceleration based on Florian (2005).

We considered the position and speed of the cart, as well as the angle and angular speed of the pole, to define the state space of the cart-pole system. We split the state space into 162 pieces according to Barto et al. (1983), and assumed that the space is represented by 162 PFs:

$$PF_i(t) = \begin{cases} 1 & \text{when } PF_i \text{ represents the current state}, \\ 0 & \text{otherwise}, \end{cases} \tag{18}$$

where $1 \le i \le 162$. Thus, only 1 out of 162 PFs is activated at each time step. $CF(t)$ provides the evaluative feedback on pole balancing. Specifically,

$$CF(t) = \begin{cases} 1 & \text{when the pole angle exceeds } \pm 12^\circ \text{ from the standing position}, \\ 0 & \text{otherwise}. \end{cases} \tag{19}$$

The cart was moved to the left when $PC(t) < \overline{PC}$, and to the right otherwise. In this formulation, because only 1 PF is activated at each time step, learning did not proceed well. Instead, we employed eligibility traces for PFs, as in the reservoir of Yamazaki and Tanaka (2007), and used them for updating $w_i(t)$ and $v_i(t)$. Let $PF_i^{(PC)}(t)$ be the effective activity of the $i$th PF on the PC at time $t$. It is defined as

$$PF_i^{(PC)}(t+1) = \delta\, PF_i^{(PC)}(t) + (1 - \delta) \cdot \mathrm{sgn}\!\left( PC(t) - \overline{PC} \right) x_i(t) + \epsilon, \tag{20}$$

where $\delta = 0.9$ determines the decay rate, and $x_i(t) = 1$ when the cart is in the $i$th state ($1 \le i \le 162$), and 0 otherwise. We assumed that PF activity is modulated by postsynaptic PC activity through the retrograde inhibition of Ca$^{2+}$ influx (Kreitzer and Regehr, 2001; Kawamura et al., 2006; Crepel, 2007). $\epsilon$ is Gaussian noise with mean and standard deviation set at 0 and 0.01, respectively. Similarly, the effect of the $i$th PF on the MLI population at time $t$ is defined as

$$PF_i^{(MLI)}(t+1) = \eta\, PF_i^{(MLI)}(t) + (1 - \eta)\, x_i(t), \tag{21}$$

where $\eta = 0.8$ determines the decay rate. Initial weights $w_i$ and $v_i$ are set at 1.0 for all $i$, and parameters in Eq. (1) are set arbitrarily as $\alpha = 1000$, $\beta = 0.5$, and $\gamma = 1.0$.
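For reference, here is a minimal Python sketch of the dynamics of Eq. (17) and the failure signal of Eq. (19). The forward-Euler integration and the 0.02-s time step are our assumptions (Barto et al. (1983) used a comparable discretization), and the 162-box state encoder of Eq. (18) is omitted for brevity.

```python
import numpy as np

G, M_C, M_P, L = 9.8, 1.0, 0.1, 0.5   # gravity (m/s^2), cart mass, pole mass, half-pole length
MU_C, MU_P = 0.0005, 0.000002          # friction: cart on track, pole on cart
DT = 0.02                              # Euler time step (s); our assumption

def step(state, force):
    """One forward-Euler step of the cart-pole dynamics of Eq. (17).

    state = (x, x_dot, theta, theta_dot); force = +/-10.0 N, chosen by the
    sign of PC(t) relative to its baseline in the model.
    """
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    sgn = 1.0 if x_dot > 0 else -1.0   # sgn() as defined in the text
    tmp = (-force - M_P * L * theta_dot**2 * sin_t + MU_C * sgn) / (M_C + M_P)
    theta_acc = (G * sin_t + cos_t * tmp - MU_P * theta_dot / (M_P * L)) \
                / (L * (4.0 / 3.0 - M_P * cos_t**2 / (M_C + M_P)))
    x_acc = (force + M_P * L * (theta_dot**2 * sin_t - theta_acc * cos_t)
             - MU_C * sgn) / (M_C + M_P)
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

def cf(state):
    """Eq. (19): the CF fires when the pole leaves +/-12 degrees."""
    return 1.0 if abs(state[2]) > np.deg2rad(12.0) else 0.0
```

A full reproduction would add the 162-box state encoder of Eq. (18) and the eligibility traces of Eqs. (20) and (21) on top of the update rules of Eq. (1).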
Acknowledgements

The authors would like to thank Professors Nobito Yamamoto and Robert Hecht-Nielsen, the Japan Society for the Promotion
of Science (#SP13031), and the U.S. National Science Foundation (#OISE-1308822) for making this collaboration possible. We also thank Drs. Kenji Doya, Hiroki Kurashige, Soichi Nagao, Shigeru Tanaka, and Hiroshi Yamaura for their insightful comments and discussions. Part of this study was supported by JSPS Kakenhi Grant Number 26430009, and Grant-in-Aid for High Performance Computing with General Purpose Computers (Research and development in the next-generation area, Large Scale Computational Sciences with Heterogeneous Many-Core Computers) from the Ministry of Education, Culture, Sports, Science and Technology, Japan. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). The authors would like to extend their deepest gratitude to the late Dr. Masao Ito for his mentorship and inspirational pursuit of understanding this mysterious brain region. Dr. Ito's legacy will live on through his nearly seven decades of scientific contributions to the field of neuroscience.

References

Albus, J.S., 1971. A theory of cerebellar function. Math. Biosci. 10, 25–61.
Barto, A.G., Sutton, R.S., Anderson, C.W., 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. SMC-13 (5), 834–846.
Bender, V., Pugh, J., Jahr, C., 2009. Presynaptically expressed long-term potentiation increases multivesicular release at parallel fiber synapses. J. Neurosci. 29, 10974–10978.
Christian, K.M., Thompson, R.F., 2003. Neural substrates of eyeblink conditioning: acquisition and retention. Learn. Mem. 11, 427–455.
Crepel, F., 2007. Developmental changes in retrograde messengers involved in depolarization-induced suppression of excitation at parallel fiber-Purkinje cell synapses in rodents. J. Neurophysiol. 97, 824–836.
D'Angelo, E., 2014. The organization of plasticity in the cerebellar cortex: from synapses to control. Prog. Brain Res. 210, 31–58.
Dellal, S.S., Luo, R., Otis, T.S., 2012. GABAA receptors increase excitability and conduction velocity of cerebellar parallel fiber axons. J. Neurophysiol. 107, 2958–2970.
Doya, K., 1999. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 12, 961–974.
Doya, K., 2000. Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr. Opin. Neurobiol. 10, 732–739.
Ekerot, C.F., Kano, M., 1985. Long-term depression of parallel fibre synapses following stimulation of climbing fibres. Brain Res. 342, 357–360.
Florian, R.V., 2005. Correct equations for the dynamics of the cart-pole system. https://www.semanticscholar.org/paper/Correct-equations-for-the-dynamics-of-the-cart-pole-Florian/f665cfa1143e6414cdcb459c5d84796908af4727.
Gao, Z., van Beugen, B., De Zeeuw, C., 2012. Distributed synergistic plasticity and cerebellar learning. Nat. Rev. Neurosci. 13, 619–635.
Hausknecht, M., Li, W., Mauk, M., Stone, P., 2017. Machine learning capabilities of a simulated cerebellum. IEEE Trans. Neural Netw. Learn. Syst. 28, 510–522.
Herzfeld, D.J., Kojima, Y., Soetedjo, R., Shadmehr, R., 2015. Encoding of action by the Purkinje cells of the cerebellum. Nature 526, 439–442.
Hirano, T., 2013. Long-term depression and other synaptic plasticity in the cerebellum. Proc. Jpn. Acad. Ser. B 89, 183–195.
Hirano, T., Yamazaki, Y., Nakamura, Y., 2016. LTD, RP, and motor learning. Cerebellum 15, 51–53.
Ito, M., 1982. Cerebellar control of the vestibulo-ocular reflex – around the flocculus hypothesis. Annu. Rev. Neurosci. 5, 275–297.
Ito, M., 1984. The Cerebellum and Neural Control. Raven Press.
Ito, M., 1989. Long-term depression. Annu. Rev. Neurosci. 12, 85–102.
Ito, M., 2012. The Cerebellum: Brain for the Implicit Self. FT Press.
Ito, M., Kano, M., 1982. Long-lasting depression of parallel fiber-Purkinje cell transmission induced by conjunctive stimulation of parallel fibers and climbing fibers in the cerebellar cortex. Neurosci. Lett. 33, 253–258.
Ito, M., Sakurai, M., Tongroach, P., 1982. Climbing fibre induced depression of both mossy fibre responsiveness and glutamate sensitivity of cerebellar Purkinje cells. J. Physiol. 324, 113–134.
Ito, M., Yamaguchi, K., Nagao, S., Yamazaki, T., 2014. Long-term depression as a model of cerebellar plasticity. Prog. Brain Res. 210, 1–30.
Jörntell, H., Bengtsson, F., Schonewille, M., De Zeeuw, C., 2010. Cerebellar molecular layer interneurons – computational properties and roles in learning. Trends Neurosci. 33, 524–532.
Jörntell, H., Ekerot, C., 2002. Reciprocal bidirectional plasticity of parallel fiber receptive fields in cerebellar Purkinje cells and their afferent interneurons. Neuron 34, 797–806.
Jörntell, H., Ekerot, C., 2003. Receptive field plasticity profoundly alters the cutaneous parallel fiber synaptic input to cerebellar interneurons in vivo. J. Neurosci. 23, 9620–9631.
Jörntell, H., Ekerot, C., 2011. Receptive field remodeling induced by skin stimulation in cerebellar neurons in vivo. Front. Neural Circuits 5, 3.
Kakegawa, W., Katoh, A., Narumi, S., Miura, E., Motohashi, J., Takahashi, A., Kohda, K., Fukazawa, Y., Yuzaki, M., Matsuda, S., 2018. Optogenetic control of synaptic AMPA receptor endocytosis reveals roles of LTD in motor learning. Neuron 99, 1–14.
Kano, M., Kano, M., Fukunaga, K., Konnerth, A., 1996. Ca2+-induced rebound potentiation of gamma-aminobutyric acid-mediated currents requires activation of Ca2+/calmodulin-dependent kinase II. Proc. Natl. Acad. Sci. U.S.A. 93, 13351–13356.
Kano, M., Rexhausen, U., Dreessen, J., Konnerth, A., 1992. Synaptic excitation produces a long-lasting rebound potentiation of inhibitory synaptic signals in cerebellar Purkinje cells. Nature 356, 601–604.
Kawamura, Y., Fukaya, M., Maejima, T., Yoshida, T., Miura, E., Watanabe, M., Ohno-Shosaku, T., Kano, M., 2006. The CB1 cannabinoid receptor is the major cannabinoid receptor at excitatory presynaptic sites in the hippocampus and cerebellum. J. Neurosci. 26, 2991–3001.
Kitazawa, S., Kimura, T., Yin, P.-B., 1998. Cerebellar complex spikes encode both destinations and errors in arm movements. Nature 392, 494–497.
Kreitzer, A., Regehr, W., 2001. Retrograde inhibition of presynaptic calcium influx by endogenous cannabinoids at excitatory synapses onto Purkinje cells. Neuron 29, 717–727.
Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J., 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc., pp. 3675–3683.
Lennon, W., Hecht-Nielsen, R., Yamazaki, T., 2014. A spiking network model of cerebellar Purkinje cells and molecular layer interneurons exhibiting irregular firing. Front. Comput. Neurosci. 8, 157.
Lennon, W., Yamazaki, T., Hecht-Nielsen, R., 2015. A model of in vitro plasticity at the parallel fiber–molecular layer interneuron synapses. Front. Comput. Neurosci. 9, 150.
Liu, S., Cull-Candy, S., 2000. Synaptic activity at calcium-permeable AMPA receptors induces a switch in receptor subtype. Nature 405, 454–458.
Marr, D., 1969. A theory of cerebellar cortex. J. Physiol. (Lond.) 202, 437–470.
Mauk, M.D., Donegan, N.H., 1997. A model of Pavlovian eyelid conditioning based on the synaptic organization of the cerebellum. Learn. Mem. 3, 130–158.
Medina, J.F., Nores, W.L., Mauk, M.D., 2002. Inhibition of climbing fibres is a signal for the extinction of conditioned eyelid responses. Nature 416 (6878), 330–333.
Miles, F., Lisberger, S., 1981. Plasticity in vestibulo-ocular reflex: a new hypothesis. Annu. Rev. Neurosci. 4, 273–299.
Morimoto, J., Doya, K., 2001. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robot. Auton. Syst. 36, 37–51.
Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
Ohmae, S., Medina, J.F., 2015. Climbing fibers encode a temporal-difference prediction error during cerebellar learning in mice. Nat. Neurosci. 18, 1798–1803.
Pugh, J.R., Jahr, C.E., 2011. Axonal GABAA receptors increase cerebellar granule cell excitability and synaptic activity. J. Neurosci. 31, 565–574.
Pugh, J.R., Jahr, C.E., 2013. Activation of axonal receptors by GABA spillover increases somatic firing. J. Neurosci. 33, 16924–16929.
Rancillac, A., Crépel, F., 2004. Synapses between parallel fibres and stellate cells express long-term changes in synaptic efficacy in rat cerebellum. J. Physiol. 554, 707–720.
Raymond, J.L., Medina, J.F., 2018. Computational principles of supervised learning in the cerebellum. Annu. Rev. Neurosci. 41, 233–253.
Rieubland, S., Roth, A., Häusser, M., 2014. Structured connectivity in cerebellar inhibitory networks. Neuron 81, 913–929.
Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408.
Rowan, M.J., Bonnan, A., Zhang, K., Amat, S.B., Kikuchi, C., Taniguchi, H., Augustine, G.J., Christie, J.M., 2018. Graded control of climbing-fiber-mediated plasticity and learning by inhibition in the cerebellum. Neuron 99, 999–1015.
Schonewille, M., Belmeguenai, A., Koekkoek, S., Houtman, S., Boele, H., van Beugen, B., Gao, Z., Badura, A., Ohtsuki, G., Amerika, W., Hosy, E., Hoebeek, F., Elgersma, Y., Hansel, C., De Zeeuw, C., 2010. Purkinje cell-specific knockout of the protein phosphatase PP2B impairs potentiation and cerebellar motor learning. Neuron 67 (4), 618–628.
Schonewille, M., et al., 2011. Reevaluating the role of LTD in cerebellar motor learning. Neuron 70, 43–50.
Schultz, W., Dayan, P., Montague, P.R., 1997. A neural substrate of prediction and reward. Science 275 (5306), 1593–1599.
Shutoh, F., Ohki, M., Kitazawa, H., Itohara, S., Nagao, S., 2006. Memory trace of motor learning shifts transsynaptically from cerebellar cortex to nuclei for consolidation. Neuroscience 139, 767–777.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D., 2017. Mastering the game of Go without human knowledge. Nature 550, 354–359.
Smith, S., Otis, T., 2005. Pattern-dependent, simultaneous plasticity differentially transforms the input–output relationship of a feedforward circuit. Proc. Natl. Acad. Sci. U.S.A. 102, 14901–14906.
Stell, B.M., Rostaing, P., Triller, A., Marty, A., 2007. Activation of presynaptic GABAA receptors induces glutamate release from parallel fiber synapses. J. Neurosci. 27, 9022–9031.
Strata, P., 2009. David Marr's theory of cerebellar learning: 40 years later. J. Physiol. 587, 5519–5520.
Sutton, R.S., Barto, A.G., 1998. Introduction to Reinforcement Learning, 1st ed. MIT Press, Cambridge, MA, USA.
Swain, R.A., Kerr, A.L., Thompson, R.F., 2011. The cerebellum: a neural system for the study of reinforcement learning. Front. Behav. Neurosci. 5, 8.
Szapiro, G., Barbour, B., 2007. Multiple climbing fibers signal to molecular layer interneurons exclusively via glutamate spillover. Nat. Neurosci. 10, 735–742.
Tanaka, S., Kawaguchi, S., Shioi, G., Hirano, T., 2013. Long-term potentiation of inhibitory synaptic transmission onto cerebellar Purkinje neurons contributes to adaptation of vestibulo-ocular reflex. J. Neurosci. 33, 17209–17220.
Van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., Tsang, J., 2017. Hybrid reward architecture for reinforcement learning. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., pp. 5392–5402.
Yamaguchi, K., Itohara, S., Ito, M., 2016. Reassessment of long-term depression in cerebellar Purkinje cells in mice carrying mutated GluA2 C terminus. Proc. Natl. Acad. Sci. U.S.A. 113, 10192–10197.
Yamazaki, T., Igarashi, J., Makino, J., Ebisuzaki, T., 2019. Real-time simulation of a cat-scale artificial cerebellum on PEZY-SC processors. Int. J. High Perform. Comput. Appl. 33, 155–168.
Yamazaki, T., Nagao, S., Lennon, W., Tanaka, S., 2015. Modeling memory consolidation during posttraining periods in cerebellovestibular learning. Proc. Natl. Acad. Sci. U.S.A. 112, 3541–3546.
Yamazaki, T., Tanaka, S., 2007. The cerebellum as a liquid state machine. Neural Netw. 20, 290–297.
Yamazaki, T., Tanaka, S., 2009. Computational models of timing mechanisms in the cerebellar granular layer. Cerebellum 8, 423–432.