Accepted Manuscript
Foraging decisions as multi-armed bandit problems: applying reinforcement learning algorithms to foraging data

Juliano Morimoto

PII: S0022-5193(19)30056-6
DOI: https://doi.org/10.1016/j.jtbi.2019.02.002
Reference: YJTBI 9822

To appear in: Journal of Theoretical Biology

Received date: 7 September 2018
Revised date: 1 February 2019
Accepted date: 5 February 2019

Please cite this article as: Juliano Morimoto, Foraging decisions as multi-armed bandit problems: applying reinforcement learning algorithms to foraging data, Journal of Theoretical Biology (2019), doi: https://doi.org/10.1016/j.jtbi.2019.02.002
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights
- Foraging behaviour can be seen as a multi-armed bandit problem
- Reinforcement Learning algorithms can be used to give insights into foraging behaviour
- A deterministic algorithm (Upper-Confidence-Bound) is a better estimator of empirical foraging data than a Bayesian algorithm (Thompson Sampling)
- Machine Learning can become an important framework to analyse and infer biological phenomena
Foraging decisions as multi-armed bandit problems: applying reinforcement learning algorithms to foraging data

Author: Juliano Morimoto1,2,*
Affiliations:
1 - Department of Biological Sciences, Macquarie University, NSW 2109, Australia
2 - Programa de Pós-Graduação em Ecologia e Conservação, Federal University of Paraná, Curitiba, Brazil, 19031, CEP: 81531-990
*To whom correspondence should be addressed. E-mail: [email protected]

Data and scripts availability: The data is available in Mendeley along with this paper. R scripts are available as Supplementary Material.
Abstract

Finding resources is crucial for animals to survive and reproduce, but our understanding of the decision-making that underlies foraging decisions to explore new resources and exploit old ones remains limited. Theory predicts an 'exploration-exploitation trade-off' in which animals must balance their effort between staying to exploit a seemingly good resource and moving to explore the environment. To date, however, it has been challenging to generate flexible yet tractable statistical models that can capture this trade-off, and our understanding of foraging decisions is limited. Here, I suggest that foraging decisions can be seen as multi-armed bandit problems, and apply a deterministic algorithm (the Upper-Confidence-Bound, or 'UCB') and a Bayesian algorithm (Thompson Sampling, or 'TS') to demonstrate how these algorithms generate testable a priori predictions from simulated data. Next, I use UCB and TS to analyse empirical foraging data from larvae of the tephritid fruit fly Bactrocera tryoni, providing a qualitative and quantitative framework to analyse animal foraging behaviour. Qualitative analysis revealed that TS displays a shorter exploration period than UCB, although both converged to similar qualitative results. Quantitative analysis demonstrated that, overall, UCB is more accurate than TS in predicting the observed foraging patterns, even though both algorithms failed to quantitatively estimate the empirical foraging patterns in high-density groups (i.e., groups with 50 larvae and, more strikingly, groups with 100 larvae), likely due to the influence of intraspecific competition on animal behaviour. The framework proposed here demonstrates how reinforcement learning algorithms can be used to model animal foraging decisions.
Introduction

Foraging behaviour is fundamental for animals' fitness (Bell 2012). Without food and/or mates, animals fail to contribute to the next generation and their genes eventually disappear from the population (Simpson & Raubenheimer 2012). In this sense, foraging strategies can be considered a product of evolution and, therefore, adaptive (Raubenheimer & Simpson 2018). Foraging behaviour is complex, and animals must assess internal (physiological) and external (environmental) factors, gather information from current and previous experience, and forecast future conditions in order to decide when, where, and for how long to forage (Bell 2012; Simpson & Raubenheimer 2012; Corrales-Carvajal, Faisal & Ribeiro 2016; Raubenheimer & Simpson 2018). In other words, animals have to interact with their environment in order to gain information that guides their foraging decision-making [e.g., (Krakauer & Rodriguez-Girones 1995; Laland & Williams 1997; Inglis et al. 2001; Dall et al. 2005)].

When foraging, animals need to invest time and effort into either staying in a current patch and exploiting a seemingly good resource, or moving and exploring the environment in search of potentially better resources (e.g., a more nutritious food patch) [see e.g., (McNamara & Houston 1985) and, more recently, (Patrick et al. 2017; Monk et al. 2018; von Helversen et al. 2018)]. This strategic allocation into exploring or exploiting patches is known as the 'exploration-exploitation trade-off' (Berger-Tal et al. 2014). A wide range of models have been proposed to better understand the exploration-exploitation trade-off and how it influences foraging behaviour [e.g., (Krebs 1977; Maynard-Smith 1982; McNamara 1982; Parker & Sutherland 1986; Stephens & Krebs 1986; Houston & McNamara 1987; McNamara & Houston 1990; Parker & Smith 1990; Kacelnik, Krebs & Bernstein 1992; Krakauer & Rodriguez-Girones 1995; Dall et al. 2005)]. For instance, optimal foraging theory – a prominent yet criticised stream of thought (Pierce & Ollason 1987) – assumes that an animal should be capable of ranking its foraging options and selecting those with the greatest returns (Krebs 1977; Stephens & Krebs 1986). However, it is unclear how optimal foraging theory takes into account non-optimal foraging choices (Pyke 1984; Pierce & Ollason 1987; Kennedy & Gray 1993). As a result, it is difficult to gain insights into the exploration-exploitation trade-off under this framework, because exploration of some patches over others can be seen as aberrant. More recent studies have taken into account sampling as well as learning as factors that could explain why animals often try sub-optimal options before repeatedly selecting the most profitable ones [e.g., (McNamara & Houston 1985; Cohen 1993; Dall et al. 2005; McNamara, Green & Olsson 2006)], but some of these models are thought to be too computationally demanding to have evolved in animals with simple brains (e.g., insects) (Real 1991; Thuijsman et al. 1995). This bonanza of approaches [see also (Simpson & Raubenheimer 2012; Giraldeau & Caraco 2018; Raubenheimer & Simpson 2018) for recent comprehensive reviews] reflects the challenge of generating flexible statistical models that can capture the behaviour of animals searching for resources; as a result, our understanding of the foraging decisions that contribute to the trade-off between exploration and exploitation in nature is limited.
Given that animals seek information in order to adopt foraging strategies that maximise their rewards, foraging can be considered a Reinforcement Learning (RL) process (McNamara & Houston 1985; Hughes et al. 1992; Niv et al. 2002; McNamara, Green & Olsson 2006; Kolling & Akam 2017). RL is defined as a process through which organisms learn from repetitive interactions with their environment, with the aim of achieving a goal (Sutton & Barto 1998). The benefit of modelling foraging as an RL problem is that computational tools to explore animal behaviour can be borrowed from the RL framework and applied to animal foraging data. For instance, a recent study developed an Artificial Neural Network model of honeybee foraging behaviour and implemented the algorithm in a robot that simulated the conditions in which bees forage (Niv et al. 2002). While the results suggest that this implementation performs a remarkable job in achieving the same results as those observed in real honeybees, such highly sophisticated algorithms can be costly to design and/or difficult to implement (Venkatachalam 1993; Tu 1996), need to overcome multiple computational challenges to achieve good performance (Lin 1993), and can be mathematically complex and often computationally demanding (Gurney 2014; Mäkisara et al. 2014; Shanmuganathan & Samarasinghe 2016). Could we be overcomplicating the way in which we describe nature? What if the decision-making performed by animals can be modelled with more tractable algorithms that nonetheless describe behaviour with accuracy?

A recent study suggested that foraging choices in bees could be modelled as a two-armed bandit problem, whereby bees could choose between two foraging options with unknown (to the bee) probabilities of rewards (Keasar et al. 2002). Bees seemed to use associative learning to solve this two-armed bandit problem and forage for resources. This study demonstrated that simple learning rules can yield important models from which animal behaviour can be inferred. However, animals are often exposed to multiple (and simultaneous) foraging options, and although the two-armed bandit situations described in Keasar et al. (2002) are useful, they likely have limited application in more realistic scenarios. With this in mind, I suggest that foraging decisions can be seen as multi-armed bandit problems whereby animals forage amongst many patches of unknown quality for both rewards and information (see the definition of the multi-armed bandit problem below). This approach allows for the use of simple, computationally tractable algorithms capable of generating a priori predictions. Moreover, the approach also allows for a better understanding of the exploration-exploitation trade-off in biological data, providing a framework from which insights into the mechanisms underlying animal foraging decisions can be deduced. I first explain what the multi-armed bandit problem is in jargon-free language accessible to biologists, and justify why foraging decisions can be seen as multi-armed bandit problems. Next, I show how two Reinforcement Learning (RL) algorithms applied to multi-armed bandit problems – the deterministic Upper-Confidence Bound (UCB) [see e.g., (Carpentier et al. 2011) and references therein] and the Bayesian algorithm Thompson Sampling (TS) [see e.g., (Russo & Van Roy 2016) and references therein] – can be used on simulated and empirical foraging datasets to generate testable predictions and investigate the foraging patterns emerging from empirical data (Table 1). Finally, I compare the outcomes of UCB and TS with the empirical data, and discuss the insights from this comparison for our understanding of animal foraging behaviour. Thus, the scope of this study is to conceptualise foraging behaviour as an RL multi-armed bandit problem and apply computationally cheap, tractable multi-armed bandit algorithms to generate a priori predictions and gain insight into biological data. The simplicity of UCB and TS means that the algorithms can (i) be accessible to a broader range of scientists (not only computer scientists and statisticians), (ii) be implemented at low cost, and (iii) provide a simple framework upon which future bio-inspired research can build more complex approaches (Pfeifer, Lungarella & Iida 2012).
Foraging as a multi-armed bandit problem

The multi-armed bandit problem arises when an individual has many options to choose from – each option with unknown rewards – and only a limited amount of resources to invest into gathering information about each option's rewards. The goal is to maximise rewards, and the individual needs to allocate limited resources into gathering information from the available options (exploration) while obtaining the maximum rewards from the most profitable option (exploitation) (Mahajan & Teneketzis 2008). In a biological context, a multi-armed bandit problem could be identified when, for example, an animal possesses a finite set of resources (e.g., energy) that must be partitioned amongst several competing options (e.g., foraging patches) in such a way that maximises the animal's rewards (e.g., nutrient acquisition) (Fig 1a). The animal's goal is to find the patch with the highest reward (i.e., the most nutritious patch) as quickly as possible. However, the animal has no a priori knowledge of the rewards gained in each of the patches, and must acquire information by experimenting with each patch. The longer the animal samples each patch (longer exploration), the more accurate its information will be about the distribution of rewards amongst the patches, but the lower its rewards will be because of the frequent use of reward-poor patches (shorter exploitation of the best patch). Conversely, the shorter the animal samples each patch (shorter exploration), the less accurate its information will be about the distribution of rewards amongst the patches, which can result in lower rewards due to over-exploitation of sub-optimal options (exploiting a patch with less-than-the-highest rewards). Therefore, by exploring for longer, the animal gathers almost-perfect information about the nutritional quality of all patches, but loses the opportunity to exploit the most nutritious patch for longer. Conversely, by not exploring each patch enough to gather accurate information about the rewards, the animal risks exploiting a deceptively good sub-optimal patch while overlooking the patch with the true highest rewards. In this context, animals can be seen as entities foraging for rewards (e.g., nutrients) as well as information (the quality of the patches in the environment) in order to learn which foraging strategy to employ (Greggers & Menzel 1993).
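To make this setup concrete, the short sketch below simulates such a foraging environment, assuming Bernoulli rewards (a visit to a patch either yields a reward or not); the reward probabilities and function names are illustrative assumptions, not values from this study.

```r
# Minimal sketch of a five-patch bandit environment (illustrative only:
# reward probabilities are hypothetical, not the paper's empirical values)
set.seed(42)

n_patches <- 5
true_reward_prob <- runif(n_patches)   # unknown (to the forager) patch quality

# One foraging decision: visit patch i and receive a reward of 1 or 0
visit_patch <- function(i) rbinom(1, size = 1, prob = true_reward_prob[i])

# A forager that only ever visits patch 1 earns that patch's reward rate:
mean(replicate(1000, visit_patch(1)))  # approximately true_reward_prob[1]
```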
Results and Discussion

RL algorithms can be used to generate a priori predictions of foraging data
Upper-Confidence Bound (UCB) and Thompson Sampling (TS) are two simple algorithms often applied to multi-armed bandit problems [e.g., (Mahajan & Teneketzis 2008; Carpentier et al. 2011; Russo & Van Roy 2016)] (see Table 1). UCB is a deterministic algorithm where decisions are made based on previous experience and the calculation of, as the name suggests, a confidence bound. TS relies on the Bayesian updating of priors and sampling of the posterior distribution to determine which foraging decision to make at a given time (Table 1). To demonstrate how both algorithms can be used to generate a priori predictions of foraging data, I considered a familiar biological problem of an insect larva choosing the best foraging patch upon which to feed. Insect larvae are relatively simple organisms that display accurate feeding choices, and their main role in insect development is to acquire nutrients for growth. As I discuss below, insect larvae also allow for controlled experimental designs to test animal foraging strategies [see e.g., (Zucoloto 1987; Wong et al. 2017; Morimoto et al. 2018)]. I simulated a foraging scenario in silico where larvae had a choice between five patches varying in rewards. The rewards of each patch were determined randomly by sampling from a uniform distribution, such that I had no a priori information on the best patch prior to the algorithms' implementation (note that I did this to avoid any potential biases, but it is possible to set the reward of each patch based on prior knowledge or hypotheses). For this simulation, virtual larvae could interact with the environment for as many iterations as desired; 2,000 iterations were used (see below for the empirical dataset). The goal was for the virtual larvae to find the best patch by exploring the environment (i.e., the exploration-exploitation trade-off). As a baseline, I also implemented a random selection algorithm, where larvae selected patches upon which to forage with a fixed probability of 0.2 (i.e., equal probability for all five patches) (see Table 1 for details). As expected, both UCB and TS performed significantly better than the random algorithm, and converged towards the same solution as to which patch gave the larvae the best rewards (Fig 1b, Fig S1). The way in which the solution was achieved differed, though: UCB had a longer exploration period than TS, given that patches of lower quality were selected with non-zero probability for longer than in TS (i.e., UCB did not exploit solely the best solution as promptly as TS) (see box highlighted in Fig 1b). In fact, UCB selected less-than-optimal patches with a probability of almost 0.1 (see highlighted box and dashed line in Fig 1b), confirming an exaggeratedly long exploration phase relative to TS. Nonetheless, both algorithms found the same solution. The simulation suggested that patch #3 provided the best reward for the larvae, and therefore an a priori expectation for an experiment could be that larvae would forage primarily on patch #3. One could also predict that, during the exploration phase, larvae would explore patch #1 with relatively higher frequency than patches #2, #4, and #5, as shown in Fig 1b, likely because patch #1 was the second-best patch in rewards for the larvae. Finally, if the number of decisions that larvae make in a given time in the experiment is known (see below for discussion on this topic), one could potentially compare the speed at which larvae in the experiment converged to the best patch with predictions from UCB and TS, thereby allowing for speculation on the underlying larval decision-making process. For instance, if larval behaviour in the experiment resembled the convergence predicted by TS, then it could be possible that larvae adopted a Bayesian process to decide where to forage, whereas if the exploration period was longer and more similar to the convergence predicted by UCB, then larvae could be using deterministic foraging rules to adjust their foraging choices. Of course, a definitive answer about the neural pathways involved in the decision-making might be difficult or impossible to obtain empirically, but this approach may provide an additional way to gain indirect insights into this mechanism.
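For reference, the random-selection baseline and the 'probability of choice' traces of the kind plotted in Fig 1b can be sketched as follows; this is a hedged illustration, and the supplementary R scripts remain the authoritative implementation.

```r
# Random-selection baseline: each of the five patches chosen with p = 0.2
set.seed(1)
n_iter <- 2000
choices <- sample(1:5, n_iter, replace = TRUE)

# Running probability of choosing patch #3 up to each iteration
p_patch3 <- cumsum(choices == 3) / seq_len(n_iter)
plot(p_patch3, type = "l", ylim = c(0, 1),
     xlab = "Iteration", ylab = "Probability of choosing patch #3")
abline(h = 0.2, lty = 2)  # expectation under random selection
```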
RL can be used to gain qualitative insights into foraging behaviour

Having shown how the algorithms can be used to generate predictions from simulated data, I applied the UCB and TS algorithms to an empirical dataset that investigated the foraging behaviour of larvae of the tephritid fruit fly Bactrocera tryoni (aka 'Queensland fruit fly') [see (Morimoto et al. 2018)]. I assumed that, if a larva was present in a given patch, it was because the larva had a reward from staying in the patch as opposed to moving to another patch. This allowed me to estimate the reward matrix of each patch, whereby I assigned a value of 1 for each larva present in a given patch and a value of 0 for each larva not present in that patch (see Methods for details). Because in this dataset it was impossible to know how many decisions each larva in the group makes per unit of time, I could not relate the iteration number in the RL algorithm to any specific point in time in the empirical data. Thus, the time-dependent RL estimates presented in this section are qualitative (see below for quantitative estimates from the RL algorithms). The empirical data showed that the larval exploration period lasted approximately four hours, after which a decision on which patch to exploit was relatively evident for groups of all densities (Fig 2). The qualitative analysis using both UCB and TS showed that, for groups with 10 larvae, patch #3 was likely the patch with the highest reward for the larvae (Fig 2). Both UCB and TS also revealed that, for groups with 25 or more larvae, patch #2 was likely the patch with the greatest reward for the larvae. Although the majority of the larvae exploited the preferred patch (50% or more of the larvae in the group), larvae did not all aggregate into a single patch and instead also utilised non-preferred patches at a lower frequency. It is important to note that, while UCB and TS reached similar conclusions when analysing the empirical data, TS converged to the solution more rapidly than UCB, thereby providing a clearer view of the foraging patterns in the data by suggesting a short exploration phase and a longer exploitation phase compared with UCB.
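In practice, running the algorithms on empirical data only requires replacing the simulated reward draw with a lookup in the empirical reward matrix; a minimal sketch, assuming a 0/1 matrix `rewards` with one row per larval decision and one column per patch (the object names are hypothetical, not those of the supplementary scripts):

```r
# Sketch: one iteration over empirical data. The algorithm's selection rule
# picks a patch i, and the 0/1 reward is read from the observed matrix
# instead of being simulated.
empirical_step <- function(rewards, m, select_patch) {
  i <- select_patch(m)   # UCB or TS selection rule at iteration m
  rewards[m, i]          # observed reward: 1 = larva present, 0 = absent
}
```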
RL can be used in quantitative analysis of foraging data
For a quantitative (time-independent) analysis of the data, the RL algorithms need to be applied to estimate the larval distribution across patches once the probability of choice for each foraging patch has converged. To do this, I calculated the number of larvae that UCB and TS expected in each patch using the probability of choice of each patch obtained at the end of all iterations of the RL algorithms. Both UCB and TS predicted the number of larvae in each patch with accuracy for groups with 10 larvae (Fig 3a, Table 2). However, for groups with 25 and 50 larvae (Fig 3b), TS statistically overestimated the number of larvae in patch #2 and underestimated the number of larvae in patch #3 (Fig 3b-c, Table 2). UCB estimated the correct number of larvae across all patches in groups with 25 larvae, although UCB underestimated the number of larvae in patch #3 and overestimated the number of larvae in patch #5 in groups with 50 larvae (Table 2, Fig 3b-c). In groups with 100 larvae, TS failed to estimate the number of larvae across all patches, while UCB only estimated correctly the number of larvae in patch #5 and the agar base (Fig 3d, Table 2). In total, TS over- or underestimated the number of larvae on ten occasions, whereas UCB made only six such mistakes (Table 2). Overall, the deterministic UCB was better at estimating larval distribution across multi-choice foraging options than TS, although both UCB and TS failed dramatically when applied to foraging data from high-density groups. It is possible that, at higher larval densities, other factors played a role in determining the spatial distribution of the larvae, such as larval competition and larval social interactions [e.g., (Quiring & McNeil 1984; Vijendravarma, Narasimha & Kawecki 2013)]. These factors have not yet been incorporated into simple RL algorithms such as UCB and TS, and this remains an important open area for future modelling using RL algorithms.
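The expected counts compared in Table 2 follow directly from the converged choice probabilities; a minimal sketch, with hypothetical probabilities standing in for the values obtained at the final iteration:

```r
# Converged probability of choosing each location (hypothetical values,
# for patches 1-5 plus the agar base)
final_prob <- c("1" = 0.05, "2" = 0.62, "3" = 0.18,
                "4" = 0.06, "5" = 0.05, "Agar" = 0.04)

group_size <- 50
expected_larvae <- final_prob * group_size  # compare against observed counts at 8h
round(expected_larvae)
```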
Previous studies have claimed that animal foraging decisions can be modelled with an underlying Bayesian approach [for instance (Dall et al. 2005; McNamara, Green & Olsson 2006)]. The results here show that the deterministic UCB algorithm was overall better at estimating larval foraging decisions in a multi-armed foraging scenario compared with the Bayesian TS algorithm, which suggests that animal decision-making can, in some cases, be accurately modelled with simple deterministic algorithms. Because UCB and TS are relatively simple algorithms to implement, these results demonstrate the potential to apply such algorithms to decision-making tasks in bio-inspired robots. For instance, it could be possible to implement UCB for simple decision-making, such as where to forage in a given space, in parallel with more complex models [e.g., Artificial Neural Networks as in (Niv et al. 2002)] for more complex decision-making tasks. By partitioning the algorithms' functions based on the complexity of the decision-making tasks, robots could become more efficient at learning tasks from the environment, utilising costly algorithms only for tasks in which complexity is mandatory. An important study to develop could be, for example, to implement the UCB and/or TS algorithms described here jointly with the Artificial Neural Network developed for honeybees in Niv et al. (2002), and assess the performance of the honeybee robots. Can honeybee robots learn faster when multiple RL algorithms are programmed into the system? Future investigations applying multiple RL algorithms to tasks of different complexity levels will answer this question. In short, it might be possible to use UCB and TS in simple decision-making scenarios, allowing for the implementation of more complex and sophisticated algorithms in more complex tasks, increasing learning efficiency while decreasing the costs of coding, implementation, and learning time of complex algorithms in bio-inspired robotics.
Both UCB and TS demonstrated that groups with 10 larvae strongly prefer patch #3 (i.e., 60% nutrient concentration), whereas groups with 25 or more larvae prefer patch #2 (i.e., 80% nutrient concentration). It is not totally clear why larval foraging preference shifts from lower- to higher-concentration diets as group size increases. One possibility is that, as the number of conspecifics in the foraging group increases, the rate of nutrient acquisition decreases, and to balance this, larger groups could prefer more concentrated diets. Furthermore, feeding might be facilitated when larvae forage in groups, either by increasing feeding rate or by helping larvae overcome toxicity in the diet [e.g., as seen in caterpillars (Clark & Faeth 1997; Denno & Benrey 1997)]. Highly concentrated diets are known to have toxic effects [e.g., (Um, D'Alessio & Thomas 2006; Musselman et al. 2011; Fanson & Taylor 2012; Schwarz, Durisko & Dukas 2014)]. Larvae could overcome this toxicity by feeding in large groups, which in turn could explain why foraging preference increased from 60% to 80% when group size increased from 10 to 25 larvae or more in the experiment. However, it is important to mention that there is no evidence that increasing diet concentration from patch #3 to patch #2 confers growth advantages to larvae in larger groups (Morimoto et al. 2018). Investigating the mechanisms underpinning the shift in larval feeding pattern is beyond the scope of this study, but future experiments should address other factors that are affected when the size of foraging groups increases, such as water loss and thermoregulation [see (Klok & Chown 1999)]. This will allow us to gain a better understanding of the density-dependent factors influencing larval feeding patterns and the ecological factors shaping foraging decisions in immature insects.
Material and Methods

Scripts

R scripts for the implementation of UCB and TS are given in the Supplementary Information. The initial prior for TS was established from the Beta distribution with parameters α = 1 and β = 1; the prior was updated at every iteration (see Table 1 for an overview of the algorithms). All algorithms were coded in R 3.4.0 (R Development Core Team 2017). Plots were made using the package 'ggplot2' (Wickham 2009).

Larval foraging experiment

Fly stock. We collected eggs from a laboratory-adapted stock of B. tryoni (>17 generations old). The colony has been maintained in non-overlapping generations in a controlled-environment room (humidity 65 ± 5%, temperature 25 ± 0.5°C) with a light cycle of 12h light : 0.5h dusk : 11h dark : 0.5h dawn. In the lab, the fly colony was fed as follows: larvae were maintained on the larval diet formulation of Moadeli, Taylor and Ponton (2017) for the last 7 generations (previously maintained on a carrot-based diet); adults were given a free-choice diet of hydrolysed yeast (MP Biomedicals, Cat. no 02103304) and commercial refined sucrose (CSR® White Sugar). Eggs were collected for 2h in a 300mL semi-transparent white plastic bottle that had numerous perforations (<1mm diameter) to allow female oviposition, after which eggs were transferred to fly colony larval diet with a soft brush and were allowed to hatch and larvae to develop until they reached 2nd instar.

Foraging arenas. We poured 20mL of diet with different nutrient concentrations (from 100% to 20% of the standard larval diet) and allowed it to set at room temperature. Meanwhile, we also prepared an agar solution that contained the same components as the diets varying in nutrient concentration, except that no yeast or sugar was included. 20mL of the agar solution was used to cover 90mm-diameter Petri dishes. Five equally spaced holes were made in the agar base of each foraging arena by perforating it with a 25mm-diameter plastic tube. The same tube was used to cut discs from the experimental diets. The discs of experimental diet were then deposited in the holes in the agar plate; the agar plate with the diets served as the 'foraging arena'. The agar was necessary to allow larvae to move across patches freely and to minimise the risk of desiccation. The pH of all experimental diets and the agar base was adjusted to 3.8-4 with citric acid.

Foraging experiment. Groups of B. tryoni larvae with 10, 25, 50 and 100 individuals were allowed to forage freely for eight hours in foraging arenas containing five foraging patches with different nutrient dilutions (from 100% to 20% concentration). Larvae in each patch were counted at 1h, 2h, 4h, 6h and 8h after the onset of the experiment, as described in Morimoto et al. (2018). These time points were used as the decision-making points to create the reward matrix (see below).

Estimating the reward matrix from empirical data
To create a reward matrix, I assumed that, if a larva was present in a given patch, it was because the larva had a reward from staying in the patch as opposed to moving to another patch. This allowed me to assign a value of 1 for each larva present in a given patch and a value of 0 for each larva not present in that particular patch. For example, if there were 10 larvae in the foraging arena and 5 were present in patch #1 at a given time, then patch #1 was assigned a reward streak of 1-1-1-1-1-0-0-0-0-0, and so on for all patches across all larval groups and for all time points. The number of iterations was proportional to the size of the larval groups: for groups with 10 larvae, there was a total of 10 larvae in a group x 6 replicate groups x 5 time points = 300 iterations; for groups with 25 larvae, 25 x 6 x 5 = 750 iterations; for groups with 50 larvae, 50 x 6 x 5 = 1,500 iterations; and for groups with 100 larvae, 100 x 6 x 5 = 3,000 iterations. In this way, each iteration represented one decision of one larva at a given time. It is important to mention that larvae were assumed to decide where to forage freely, and that these decisions reflected their location at each of the experimental time points. Without an individual tracking system, it was impossible to determine how many decisions each larva can make in any given time (e.g., 1, 10 or 100 decisions every second, minute or hour of the experiment), and therefore we cannot estimate how many decisions were made in between time points. As a result, I could not directly relate the iteration number in the RL output to time in the empirical dataset. For this reason, the time-dependent RL estimates are qualitative. To generate quantitative (time-independent) estimates of larval foraging behaviour, I used the final probability at the end of all iterations of the RL algorithms and multiplied it by the number of larvae in the group. I compared this estimate with the number of larvae at the final time point (i.e., 8h after the onset of the experiment) in the empirical dataset. To estimate the error around the RL estimates, I created 10 replicate reward matrices for each dataset per larval density using a resampling method (N = 4 larval densities x 2 algorithms x 10 replicates = 80 runs). For statistical inference, I used a Generalised Linear Model with a Poisson error distribution for count data and a quasi- extension to account for overdispersion. The predictions from UCB and TS were compared against the observed data in a pairwise fashion, and the p-values were corrected for multiple comparisons by adjusting the significance level to α = 0.001.
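As an illustration of this encoding and of the statistical comparison, here is a minimal sketch with hypothetical counts; the data frame layout is an assumption, not the published supplementary script.

```r
# Reward streak for one patch at one time point: 10 larvae in the arena,
# 5 of them observed on patch #1 (hypothetical counts)
n_larvae   <- 10
n_on_patch <- 5
streak_patch1 <- c(rep(1, n_on_patch), rep(0, n_larvae - n_on_patch))
streak_patch1                         # 1 1 1 1 1 0 0 0 0 0

# Iterations scale with group size: larvae x 6 replicate groups x 5 time points
n_iterations <- n_larvae * 6 * 5      # 300 iterations for groups of 10

# Quasi-Poisson GLM comparing observed vs. predicted counts (assumes a data
# frame 'dat' with columns: count, and source = "observed", "UCB", or "TS")
# fit <- glm(count ~ source, family = quasipoisson(link = "log"), data = dat)
# summary(fit)                        # contrasts judged at alpha = 0.001
```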
Conflict of interests

None to declare.

Acknowledgments

I acknowledge Dr Fleur Ponton for the support during the development of this work and two anonymous reviewers for comments that improved the final version of this manuscript.

References
Bell, W.J. (2012) Searching Behaviour: The Behavioural Ecology of Finding Resources. Springer Science & Business Media.
Berger-Tal, O., Nathan, J., Meron, E. & Saltz, D. (2014) The exploration-exploitation dilemma: a multidisciplinary framework. PLoS One, 9, e95693.
Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R. & Auer, P. (2011) Upper-confidence-bound algorithms for active learning in multi-armed bandits. International Conference on Algorithmic Learning Theory, pp. 189-203. Springer.
Clark, B.R. & Faeth, S.H. (1997) The consequences of larval aggregation in the butterfly Chlosyne lacinia. Ecological Entomology, 22, 408-415.
Cohen, D. (1993) The equilibrium distribution of optimal search and sampling effort of foraging animals in patchy environments. Adaptation in Stochastic Environments, pp. 173-191. Springer.
Corrales-Carvajal, V.M., Faisal, A.A. & Ribeiro, C. (2016) Internal states drive nutrient homeostasis by modulating exploration-exploitation trade-off. eLife, 5, e19920.
Dall, S.R.X., Giraldeau, L.A., Olsson, O., McNamara, J.M. & Stephens, D.W. (2005) Information and its use by animals in evolutionary ecology. Trends in Ecology & Evolution, 20, 187-193.
Denno, R.F. & Benrey, B. (1997) Aggregation facilitates larval growth in the neotropical nymphalid butterfly Chlosyne janais. Ecological Entomology, 22, 133-141.
Fanson, B.G. & Taylor, P.W. (2012) Protein:carbohydrate ratios explain life span patterns found in Queensland fruit fly on diets varying in yeast:sugar ratios. Age, 34, 1361-1368.
Giraldeau, L.-A. & Caraco, T. (2018) Social Foraging Theory. Princeton University Press.
Greggers, U. & Menzel, R. (1993) Memory dynamics and foraging strategies of honeybees. Behavioral Ecology and Sociobiology, 32, 17-29.
Gurney, K. (2014) An Introduction to Neural Networks. CRC Press.
Houston, A.I. & McNamara, J.M. (1987) Switching between resources and the ideal free distribution. Animal Behaviour, 35, 301-302.
Hughes, R.N., Kaiser, M.J., Mackney, P.A. & Warburton, K. (1992) Optimizing foraging behavior through learning. Journal of Fish Biology, 41, 77-91.
Inglis, I.R., Langton, S., Forkman, B. & Lazarus, J. (2001) An information primacy model of exploratory and foraging behaviour. Animal Behaviour, 62, 543-557.
Kacelnik, A., Krebs, J.R. & Bernstein, C. (1992) The ideal free distribution and predator-prey populations. Trends in Ecology & Evolution, 7, 50-55.
Keasar, T., Rashkovich, E., Cohen, D. & Shmida, A. (2002) Bees in two-armed bandit situations: foraging choices and possible decision mechanisms. Behavioral Ecology, 13, 757-765.
Kennedy, M. & Gray, R.D. (1993) Can ecological theory predict the distribution of foraging animals? A critical analysis of experiments on the ideal free distribution. Oikos, 68, 158-166.
Klok, C.J. & Chown, S.L. (1999) Assessing the benefits of aggregation: thermal biology and water relations of anomalous Emperor Moth caterpillars. Functional Ecology, 13, 417-427.
Kolling, N. & Akam, T. (2017) (Reinforcement?) Learning to forage optimally. Current Opinion in Neurobiology, 46, 162-169.
Krakauer, D.C. & Rodriguez-Girones, M.A. (1995) Searching and learning in a random environment. Journal of Theoretical Biology, 177, 417-429.
Krebs, J. (1977) Optimal foraging: theory and experiment. Nature, 268, 583-584.
Laland, K.N. & Williams, K. (1997) Shoaling generates social learning of foraging information in guppies. Animal Behaviour, 53, 1161-1169.
Lin, L.-J. (1993) Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Mahajan, A. & Teneketzis, D. (2008) Multi-armed bandit problems. Foundations and Applications of Sensor Management, pp. 121-151. Springer.
Mäkisara, K., Simula, O., Kangas, J. & Kohonen, T. (2014) Artificial Neural Networks. Elsevier.
Maynard-Smith, J. (1982) Evolution and the Theory of Games. Cambridge University Press.
McNamara, J. (1982) Optimal patch use in a stochastic environment. Theoretical Population Biology, 21, 269-288.
McNamara, J.M., Green, R.F. & Olsson, O. (2006) Bayes' theorem and its applications in animal behaviour. Oikos, 112, 243-251.
McNamara, J.M. & Houston, A.I. (1985) Optimal foraging and learning. Journal of Theoretical Biology, 117, 231-249.
McNamara, J.M. & Houston, A.I. (1990) State-dependent ideal free distributions. Evolutionary Ecology, 4, 298-311.
Moadeli, T., Taylor, P.W. & Ponton, F. (2017) High productivity gel diets for rearing of Queensland fruit fly, Bactrocera tryoni. Journal of Pest Science, 90, 507-520.
Monk, C.T., Barbier, M., Romanczuk, P., Watson, J.R., Alós, J., Nakayama, S., Rubenstein, D.I., Levin, S.A., Arlinghaus, R. & Calcagno, V. (2018) How ecology shapes exploitation: a framework to predict the behavioural response of human and animal foragers along exploration-exploitation trade-offs. Ecology Letters, 21, 779-793.
Morimoto, J., Nguyen, B., Tarahi Tabrizi, S., Ponton, F. & Taylor, P.W. (2018) Social and nutritional factors shape larval aggregation, foraging, and body mass in a polyphagous fly. Scientific Reports, 8, 14750.
Musselman, L.P., Fink, J.L., Narzinski, K., Ramachandran, P.V., Hathiramani, S.S., Cagan, R.L. & Baranski, T.J. (2011) A high-sugar diet produces obesity and insulin resistance in wild-type Drosophila. Disease Models & Mechanisms, 4, 842-849.
Niv, Y., Joel, D., Meilijson, I. & Ruppin, E. (2002) Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adaptive Behavior, 10, 5-24.
Parker, G. & Sutherland, W. (1986) Ideal free distributions when individuals differ in competitive ability: phenotype-limited ideal free models. Animal Behaviour, 34, 1222-1242.
Parker, G.A. & Smith, J.M. (1990) Optimality theory in evolutionary biology. Nature, 348, 27-33.
Patrick, S.C., Pinaud, D., Weimerskirch, H. & Morand-Ferron, J. (2017) Boldness predicts an individual's position along an exploration-exploitation foraging trade-off. Journal of Animal Ecology, 86, 1257-1268.
Pfeifer, R., Lungarella, M. & Iida, F. (2012) The challenges ahead for bio-inspired 'soft' robotics. Communications of the ACM, 55, 76-87.
Pierce, G.J. & Ollason, J. (1987) Eight reasons why optimal foraging theory is a complete waste of time. Oikos, 49, 111-118.
Pyke, G.H. (1984) Optimal foraging theory: a critical review. Annual Review of Ecology and Systematics, 15, 523-575.
Quiring, D.T. & McNeil, J.N. (1984) Exploitation and interference intraspecific larval competition in the dipteran leaf miner, Agromyza frontella (Rondani). Canadian Journal of Zoology, 62, 421-427.
R Development Core Team (2017) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Raubenheimer, D. & Simpson, S.J. (2018) Nutritional ecology and foraging theory. Current Opinion in Insect Science, 27, 38-45.
Real, L.A. (1991) Animal choice behavior and the evolution of cognitive architecture. Science, 253, 980-986.
Russo, D. & Van Roy, B. (2016) An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17, 2442-2471.
Schwarz, S., Durisko, Z. & Dukas, R. (2014) Food selection in larval fruit flies: dynamics and effects on larval development. Naturwissenschaften, 101, 61-68.
Shanmuganathan, S. & Samarasinghe, S. (2016) Artificial Neural Network Modelling. Springer.
Simpson, S.J. & Raubenheimer, D. (2012) The Nature of Nutrition: A Unifying Framework from Animal Adaptation to Human Obesity. Princeton University Press.
Stephens, D.W. & Krebs, J.R. (1986) Foraging Theory. Princeton University Press.
Sutton, R. & Barto, A. (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
Thuijsman, F., Peleg, B., Amitai, M. & Shmida, A. (1995) Automata, matching and foraging behavior of bees. Journal of Theoretical Biology, 175, 305-316.
Tu, J.V. (1996) Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of Clinical Epidemiology, 49, 1225-1231.
Um, S.H., D'Alessio, D. & Thomas, G. (2006) Nutrient overload, insulin resistance, and ribosomal protein S6 kinase 1, S6K1. Cell Metabolism, 3, 393-402.
Venkatachalam, A. (1993) Software cost estimation using artificial neural networks. Proceedings of the International Joint Conference on Neural Networks, pp. 987-990. IEEE, Nagoya.
Vijendravarma, R.K., Narasimha, S. & Kawecki, T.J. (2013) Predatory cannibalism in Drosophila melanogaster larvae. Nature Communications, 4, 1789.
von Helversen, B., Mata, R., Samanez-Larkin, G.R. & Wilke, A. (2018) Foraging, exploration, or search? On the (lack of) convergent validity between three behavioral paradigms. Evolutionary Behavioral Sciences, 12, 152-162.
Wickham, H. (2009) ggplot2: Elegant Graphics for Data Analysis. Springer.
Wong, A.C.-N., Wang, Q.-P., Morimoto, J., Senior, A.M., Lihoreau, M., Neely, G.G., Simpson, S.J. & Ponton, F. (2017) Gut microbiota modifies olfactory-guided microbial preferences and foraging decisions in Drosophila. Current Biology, 27, 2397-2404.e4.
Zucoloto, F.S. (1987) Feeding habits of Ceratitis capitata (Diptera: Tephritidae): can larvae recognize a nutritionally effective diet? Journal of Insect Physiology, 33, 349-353.
Figure captions

Figure 1. Foraging decisions as multi-armed bandit problems. (a) Schematic representation of a multi-armed bandit problem in animal foraging. The animal (in this case a fly) is given a choice between five foraging patches. The animal has no a priori information on the real distribution of rewards, and must make decisions based on the information obtained from exploration of the patches. (b) Larval foraging decisions in silico using a random algorithm, TS, and UCB. The reward rate of each patch was assigned by sampling randomly from a uniform distribution. The box (red) highlights the exploration region, where the probability of choosing any patch other than the patch with the highest reward was greater than zero.
Figure 2. Qualitative (time-dependent) estimates of UCB and TS when applied to empirical foraging data. The behaviour of the TS and UCB algorithms, as well as the observed larval spatial distribution in the experiment (N = 6), is shown for groups of 10, 25, 50, and 100 B. tryoni larvae in foraging arenas with five food patches varying in nutrient concentration. Shaded areas represent the standard error estimates of the RL algorithms.
Figure 3. Quantitative (time-independent) analysis of empirical foraging data using the UCB and TS algorithms. The number of larvae (y-axis) in each of the patches (x-axis) for groups with (a) 10 larvae, (b) 25 larvae, (c) 50 larvae, and (d) 100 larvae. White – observed empirical data; blue – UCB estimates; orange – TS estimates.
Tables
Table 1. Overview of the Random selection, UCB and TS algorithms.

Random selection
Step 1: At each iteration, select patch i with probability 1/N, where N is the number of available patches (here, N = 5).
Repeat Step 1 for M iterations (Supplementary Code 1).

Upper Confidence Bound (UCB)
Step 1: Sample each patch once in the first N iterations.
Step 2: Calculate the confidence in the reward of each patch by estimating the upper confidence bound¹ based on the rewards obtained in the first N iterations.
Step 3: Select patch i with the highest upper confidence bound.
Step 4: Update the upper confidence bound.
Repeat Steps 3 and 4 for M iterations (Supplementary Code 2).

Thompson Sampling (TS)
Step 1: Set a non-informative prior. Here, the Beta prior distribution with parameters α = 1 and β = 1 for each patch.
Step 2: Sample a random value from the Beta distribution of each patch.
Step 3: Select and forage on patch i that drew the highest value from the random sampling of Step 2.
Step 4: Update the Beta prior distribution with the reward information from patch i. If foraging in patch i was rewarded, update α as α + 1; otherwise, update β as β + 1.²
Repeat Steps 2 to 4 for M iterations (Supplementary Code 3).

¹ The upper confidence bound of patch i was calculated as $\bar{x}_i + \Delta_i$, where $\bar{x}_i$ is the average reward of patch i over the iterations so far (total reward divided by $s_i$) and $\Delta_i = \sqrt{3\log(M)/(2 s_i)}$ is the upper confidence term, with $s_i$ the number of times patch i has been selected. Note that for a constant value of $s_i$ and an increasing number of iterations M (i.e., the patch is not selected over many iterations), the value of $\Delta_i$ increases and patch i becomes more likely to be chosen, and vice-versa. This dynamic allows for a balance between the exploitation of known patches and the exploration of new patches.

² Note that Thompson Sampling samples a random value from the posterior distribution and updates the beliefs about the reward distribution of each patch at each iteration.
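The stepwise rules in Table 1 translate directly into code. The sketch below follows those steps on a simulated Bernoulli-reward environment; it is an illustration under assumed reward probabilities, not a substitute for Supplementary Codes 1-3, and the constant in the confidence term follows footnote 1.

```r
set.seed(1)
N <- 5                                   # number of patches
M <- 2000                                # iterations
true_p <- runif(N)                       # unknown Bernoulli reward probabilities

# --- Upper Confidence Bound (UCB) ---
s <- rep(0, N)                           # times each patch was selected
r <- rep(0, N)                           # total reward obtained from each patch
ucb_choices <- integer(M)
for (m in 1:M) {
  if (m <= N) {
    i <- m                               # Step 1: sample each patch once
  } else {
    xbar  <- r / s                       # average reward per patch
    delta <- sqrt(3 * log(m) / (2 * s))  # upper confidence term (footnote 1)
    i <- which.max(xbar + delta)         # Step 3: highest upper confidence bound
  }
  reward <- rbinom(1, 1, true_p[i])
  s[i] <- s[i] + 1
  r[i] <- r[i] + reward                  # Step 4: update the bound's inputs
  ucb_choices[m] <- i
}

# --- Thompson Sampling (TS) ---
alpha <- rep(1, N)                       # Step 1: non-informative Beta(1, 1) prior
beta  <- rep(1, N)
ts_choices <- integer(M)
for (m in 1:M) {
  draws <- rbeta(N, alpha, beta)         # Step 2: sample each patch's posterior
  i <- which.max(draws)                  # Step 3: forage on the highest draw
  reward <- rbinom(1, 1, true_p[i])
  if (reward == 1) alpha[i] <- alpha[i] + 1 else beta[i] <- beta[i] + 1  # Step 4
  ts_choices[m] <- i
}

# Both algorithms should converge on the best patch:
c(best = which.max(true_p),
  ucb  = as.integer(names(which.max(table(tail(ucb_choices, 500))))),
  ts   = as.integer(names(which.max(table(tail(ts_choices, 500))))))
```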
Table 2. Statistical comparison between the observed number of larvae in each food patch and the estimates from the UCB and TS algorithms. Outputs from GLM models with quasi-Poisson distribution for count data. Multiple comparison was taken into account by adjusting the statistical significance level to α = 0.001; p < 0.001 therefore indicates a significant difference. Comparisons are relative to the empirical data.

Larval density: 10
Patch   Algorithm   Estimate   Std Error   t-value    p-value
1       UCB          0.37294    0.28963     1.28767    0.21067
1       TS          -0.41552    0.33662    -1.23436    0.22953
2       UCB         -0.58533    0.26041    -2.24767    0.03448
2       TS          -0.78661    0.27506    -2.85979    0.00886
3       UCB         -0.20477    0.16027    -1.27764    0.21412
3       TS           0.28299    0.14657     1.93084    0.06592
4       UCB          0.62415    0.40725     1.5326     0.13901
4       TS           0.14266    0.43681     0.32658    0.74694
5       UCB          1.06815    0.53118     2.01091    0.0562
5       TS           0.26236    0.58471     0.44871    0.65784
Agar    UCB          0.38526    0.32992     1.16774    0.25488
Agar    TS          -0.5164     0.39374    -1.31152    0.20262

Larval density: 25
Patch   Algorithm   Estimate   Std Error   t-value    p-value
1       UCB          0.12222    0.41674     0.29327    0.77195
1       TS          -1.0427     0.55365    -1.88331    0.07236
2       UCB          0.09105    0.07419     1.22724    0.23215
2       TS           0.49849    0.06966     7.15609    <0.0001
3       UCB         -0.87031    0.24943    -3.48914    0.00198
3       TS          -1.86243    0.3527     -5.28058    <0.0001
4       UCB          0.43954    0.23369     1.8809     0.07271
4       TS          -0.61619    0.28835    -2.13693    0.04346
5       UCB          0.29267    0.25236     1.15975    0.25805
5       TS          -0.80169    0.32066    -2.5001     0.01999
Agar    UCB          0.222      0.18699     1.18722    0.24726
Agar    TS          -0.75199    0.23168    -3.24587    0.00356

Larval density: 50
Patch   Algorithm   Estimate   Std Error   t-value    p-value
1       UCB         -0.30155    0.28288    -1.066      0.29748
1       TS          -1.47468    0.40002    -3.68653    0.00122
2       UCB          0.14708    0.13003     1.13118    0.26964
2       TS           0.57257    0.1221      4.68944    0.0001
3       UCB         -0.88319    0.20345    -4.34103    <0.0001
3       TS          -1.75847    0.27512    -6.39171    <0.0001
4       UCB          0.1596     0.30906     0.51642    0.61049
4       TS          -0.74857    0.37861    -1.97715    0.06013
5       UCB          2.67139    0.52431     5.0951     <0.0001
5       TS           1.58515    0.54442     2.91161    0.00786
Agar    UCB          0.26748    0.18521     1.44416    0.16218
Agar    TS          -0.78484    0.2333     -3.36409    0.00268

Larval density: 100
Patch   Algorithm   Estimate   Std Error   t-value    p-value
1       UCB         -1.26213    0.2039     -6.18989    <0.0001
1       TS          -2.61428    0.35004    -7.46842    <0.0001
2       UCB          0.46321    0.07299     6.34661    <0.0001
2       TS           0.57578    0.07191     8.00666    <0.0001
3       UCB         -1.61315    0.22251    -7.24967    <0.0001
3       TS          -2.48758    0.31852    -7.80978    <0.0001
4       UCB         -1.11793    0.21939    -5.0957     <0.0001
4       TS          -2.69724    0.41003    -6.57818    <0.0001
5       UCB          0.23463    0.172       1.36417    0.18572
5       TS          -1.15342    0.24127    -4.78065    <0.0001
Agar    UCB         -0.51799    0.15087    -3.43332    0.00227
Agar    TS          -1.93102    0.24139    -7.99974    <0.0001
Graphical Abstract