Agent Based Modeling, Statistics of

David Banks, Duke University, Durham, NC, USA; Jacob Norton, North Carolina State University, Raleigh, NC, USA

© 2015 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/B978-0-08-097086-8.42110-0

Abstract

Agent-based models (ABMs) have become an important simulation tool for understanding certain categories of complex phenomena. They are widely used in epidemiology, ecology, transportation research, social networks, and other applications in which the global behavior is determined by local behavior (which is usually quite simple, and can be represented through a concise set of rules). However, unlike for many other models, such as linear regression, the statistical methodology associated with ABMs is largely undeveloped.

Introduction

Agent-based models (ABMs) are a simulation strategy that is especially useful when the phenomenon of interest is complex, not continuously dependent on underlying parameters, and can be described in terms of local actions. In Europe and in some academic disciplines, such as ecology, ABMs are commonly referred to as ‘individual-based models.’ Popular applications of ABMs include:

- Weather forecasting, in which each agent is a cubic kilometer of atmosphere, and the local interactions are the exchange of pressure, temperature, and moisture.
- Auctions, as in Yahoo! or Google, to determine which ads are shown to users (Charles et al., 2013).
- Traffic flow models, as in TRANSIMS, where agents (drivers) space themselves according to the actions of other nearby drivers, and make route choices based on congestion avoidance (Smith et al., 1995).
- Genetic algorithms, in which the agents are primitive algorithms that interact so as to evolve more successful algorithms (Chatterjee et al., 1996).

ABMs are commonly used in epidemiology, economics, social networks, and many other fields. Often the primary question of interest concerns emergent behavior generated by the cumulative actions of distinct entities, each making choices that satisfy its own requirements.

As a simple example to fix the basic concepts, assume that a safety engineer wants to determine how long it would take to evacuate an office building. The ABM approach would start with a virtual representation of the building, with its rooms and doors and stairs, and then place a typical number of people (agents) at random within the building. Each agent would follow two rules: (1) when the fire bell rings, go to the nearest exit; and (2) if that exit is blocked, go to the closest exit that is unblocked. There would be additional constraints on how much crowding is possible in, for example, a stairwell or doorway. In this framework, the safety engineer might run the simulator 100 times and create a histogram of how long was needed to complete the evacuation; a minimal sketch of such a simulator is given below.
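The following sketch is only an illustration of the idea, not a real safety engineer’s tool: the floor plan, the stairwell capacities, and the number of agents are invented for the example, and each agent simply steps toward the nearest unblocked exit, subject to crowding limits, until the building is empty.

```python
# Illustrative sketch only: a toy evacuation ABM on a hypothetical floor plan.
# The layout, stairwell capacities, and agent count are invented for the
# example, not taken from any real building study.
import random
from collections import deque

DOORS = {                                   # hypothetical floor plan: space -> adjoining spaces
    "officeA": ["hall"], "officeB": ["hall"], "officeC": ["hall"],
    "hall": ["officeA", "officeB", "officeC", "stairN", "stairS"],
    "stairN": ["hall", "exitN"], "stairS": ["hall", "exitS"],
    "exitN": [], "exitS": [],
}
EXITS = {"exitN", "exitS"}
CAPACITY = {"stairN": 3, "stairS": 3}       # agents allowed to enter a stairwell per time step

def distances_to_exit(blocked):
    """Breadth-first search distance from every space to the nearest open exit."""
    dist = {e: 0 for e in EXITS if e not in blocked}
    queue = deque(dist)
    while queue:
        room = queue.popleft()
        for nbr, doors in DOORS.items():
            if room in doors and nbr not in dist and nbr not in blocked:
                dist[nbr] = dist[room] + 1
                queue.append(nbr)
    return dist

def evacuate(n_agents=40, blocked=frozenset()):
    """Run one evacuation; return the number of time steps until the building is empty."""
    dist = distances_to_exit(blocked)
    rooms = [r for r in DOORS if r not in EXITS and r not in blocked]
    agents = [random.choice(rooms) for _ in range(n_agents)]
    t = 0
    while any(a not in EXITS for a in agents):
        t += 1
        used = {room: 0 for room in CAPACITY}
        for i, loc in enumerate(agents):
            if loc in EXITS:
                continue
            # Rule: step toward the nearest unblocked exit, respecting crowding limits.
            options = [n for n in DOORS[loc] if n in dist and dist[n] < dist[loc]]
            for nxt in sorted(options, key=dist.get):
                if nxt in CAPACITY and used[nxt] >= CAPACITY[nxt]:
                    continue
                if nxt in CAPACITY:
                    used[nxt] += 1
                agents[i] = nxt
                break
    return t

# 100 replications give the histogram of evacuation times (summarized here).
times = [evacuate() for _ in range(100)]
print("fastest", min(times), "average", sum(times) / len(times), "slowest", max(times))
```

Calling evacuate(blocked=frozenset({"exitN"})) plays the ‘what if?’ scenario in which the north exit is unusable.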


One particularly attractive feature of the ABM approach is that the engineer can play ‘what if?’ scenarios, such as blocking certain stairwells or adding agents who act as fire marshals and direct the traffic. A second attractive feature is that the engineer can examine the rule set to see if it makes sense – perhaps something more complicated is needed, such as allowing agents who are away from their office to return and pick up their laptops before exiting.

This example indicates a number of features that are common to most ABMs. First, there is a ‘geography’ in which the agents operate. In this example, it was the virtual building; in TRANSIMS it is the road network; and with weather forecasting it is a three-dimensional grid of cubic kilometers. However, geography is not essential – the auction example and the genetic algorithms example do not require one. Additionally, there are agents, which follow specific rules. The agents may be of multiple types, such as the employees and the fire marshals in the building evacuation example. Alternatively, the agents may be identical up to a prespecified parameter; for example, in virtual auctions different bidders respond to different keywords, have different amounts of money, and so forth, but all have the same objective. The rule sets are the most critical common element. The rules are usually simple, but very flexible, and can be readily examined for plausibility. In particular, the rules may allow very unsmooth simulations; for example, in the building evacuation example, an agent may pick among separate exit paths.

Often, an ABM evolves over time, and there is interest in some ensemble behavior that is determined by the interaction of the agents. In the evacuation example, the ensemble behavior is how long it takes to empty the building, and this depends on crowding and on the decisions of agents who have only local information about which exits are blocked. In the TRANSIMS example, the relevant behavior may be some measure of congestion, and in the weather forecasting application it may be temperature or rainfall prediction.

The following sections elaborate on these issues. Section The History of ABMs traces the history of ABMs, with special emphasis on three influential applications. Section Limitations of ABMs lays out the research challenges in ABM methodology, and points out some of the ways in which these are being addressed.

The History of ABMs

ABMs grew up with the era of modern computing. Their inception is rooted in cellular automata, a special case of ABMs
invented by John von Neumann and Stanislaw Ulam while pioneering the computing era at Los Alamos in the 1940s (Wolfram, 2002). Cellular automata are ABMs in which the geography is a grid, and the grid points or the cells formed by the grid interact according to well-chosen rules. The most famous of these is Conway’s Game of Life (Gardner, 1970), but the field is rich. Stephen Wolfram’s A New Kind of Science is an extensive survey of cellular automata, and argues that these represent a critical frontier for science. Cellular automata have been generalized in many ways. In probability, interacting particle systems extend cellular automata to situations in which time is continuous rather than discrete and randomness plays a larger role (Liggett, 1985). But the main development has been increasingly rich characterizations of the cells and the actions that are taken; instead of cells in a space that are assigned colors according to some rule set, they have become agents, whose locations may change, rules may evolve, and which can respond adaptively to their environment. The following subsections trace the historical growth in the sophistication of ABMs through three highly influential examples. Kauffman’s model introduced agents that interact through a random network rather than a grid, and led to new insights in biology and new mathematical problems. Artificial societies created by Epstein and Axtell opened the door to a wide range of ABM applications in the social sciences, and the analysis of the spread of rabies in raccoon populations introduced the use of hierarchical statistical models in ABMs.

Kauffman’s Random Networks

Stuart Kauffman wanted to understand how the same DNA could produce all of the different tissue types found in organisms. To study this, he used an ABM described in his seminal paper, ‘Metabolic Stability and Epigenesis in Randomly Constructed Genetic Nets’ (Kauffman, 1969). In Kauffman’s model, each agent is a gene. Each gene is either off or on, 0 or 1, depending upon the inputs it receives from other genes. Genes receive inputs from other genes, calculate a Boolean function, and send the result to other genes. In terms of the previous framework, the rule set is the Boolean function or functions, and the geography is the network through which inputs are received and outputs are transmitted.

In Kauffman’s ABM, a large number ($n$) of agents are connected in a network at random, subject to the constraint that each agent must receive inputs from $k$ agents (possibly including itself) and send output to $k$ agents. Given its inputs, each agent uses a Boolean function to produce its binary output. Different agents may have different Boolean functions, and Kauffman assigned those functions to agents at random (excluding two degenerate functions, which always produce 0 or 1). The property of interest was the number of stable cycles that such a system could produce.

If a gene receives $k$ inputs, then there are $2^k$ possible vectors of inputs (since each of the $k$ inputs may be 0 or 1). And for any given vector of inputs, there are two possible outputs, 0 or 1. So the number of possible functions mapping $\{0,1\}^k \to \{0,1\}$ is $2^{2^k}$. Table 1 shows 3 of the 16 possible Boolean functions when $k = 2$.

Table 1 The first table corresponds to the Boolean operator AND, the second to OR, and the last to tautology, one of the two degenerate operators that Kauffman excluded. The operator names derive from truth tables used in formal logic.

AND              OR               Tautology
Input   Output   Input   Output   Input   Output
0 0     0        0 0     0        0 0     1
0 1     0        0 1     1        0 1     1
1 0     0        1 0     1        1 0     1
1 1     1        1 1     1        1 1     1

Figure 1 shows how Kauffman’s model enabled study of the number of stable cycles in a randomly connected Boolean network. The first panel shows five randomly connected agents (nodes). The agents are either OR or AND operators, as indicated. The second panel shows the transitions between states for the largest component: if the five agents are initialized as (0, 1, 0, 0, 1) (the state in the lowest corner of the panel), then the next state is (1, 0, 0, 1, 1), then (1, 0, 1, 1, 1), then (1, 1, 1, 1, 1). This last state is an absorbing state; if the system reaches that state, it does not leave it. The last panel shows the three other components that are possible. If the system starts in the (0, 1, 0, 0, 0) state, then it evolves to the absorbing state (0, 0, 0, 0, 0). If it starts in the (0, 0, 1, 0, 1) state, it evolves to a stable cycle with three states: (1, 0, 0, 0, 0) goes to (0, 0, 0, 1, 0), which goes to (0, 0, 0, 0, 1), which returns to (1, 0, 0, 0, 0), and so forth.

Figure 1 This figure shows how the randomly connected Boolean network, under various initializations, transitions to different stable behaviors, where each stable behavior is an absorbing state or cycle. (a) A network of five agents (genes) in which each agent receives input from two agents and transmits its output to two agents. (b) The largest confluent of states. (c) The remaining three state confluents. Notice that two are, at their center, state cycles.
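A small random Boolean network of this kind is easy to simulate. The sketch below is illustrative rather than a reconstruction of Kauffman’s original experiments: it enforces the in-degree constraint (each gene reads k inputs) but not the out-degree constraint, excludes the two constant functions, and follows a single, arbitrarily chosen trajectory until it repeats, reporting the length of the resulting cycle (a length of 1 is an absorbing state).

```python
# Illustrative sketch (not Kauffman's original code): a random Boolean network
# with n genes, each reading k inputs and applying a randomly chosen Boolean
# function, iterated until the trajectory enters an absorbing state or cycle.
import random

def random_boolean_network(n=5, k=2, seed=1):
    rng = random.Random(seed)
    inputs = [tuple(rng.randrange(n) for _ in range(k)) for _ in range(n)]
    tables = []
    for _ in range(n):
        # A Boolean function of k inputs is a lookup table with 2**k entries;
        # redraw if we hit one of the two degenerate (constant) functions.
        while True:
            table = tuple(rng.randint(0, 1) for _ in range(2 ** k))
            if len(set(table)) > 1:
                break
        tables.append(table)
    return inputs, tables

def step(state, inputs, tables):
    """Synchronously update every gene from the current values of its k inputs."""
    new = []
    for gene, table in enumerate(tables):
        index = 0
        for src in inputs[gene]:
            index = 2 * index + state[src]
        new.append(table[index])
    return tuple(new)

def attractor(state, inputs, tables):
    """Follow the trajectory until a state repeats; return the cycle length."""
    seen = {}
    t = 0
    while state not in seen:
        seen[state] = t
        state = step(state, inputs, tables)
        t += 1
    return t - seen[state]          # 1 means an absorbing state

inputs, tables = random_boolean_network()
print("cycle length:", attractor((0, 1, 0, 0, 1), inputs, tables))
```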

Epstein and Axtell’s Artificial Societies

Epstein and Axtell (1996) brought the ABM perspective strongly into the field of social science. Specifically, they showed that simple rules could lead agents to display many of the complex behaviors found in human societies, including population dynamics, hunter-gatherer migration, division of labor, and a barter economy.

The Epstein and Axtell ABM was based on a planar lattice, which they called a ‘sugarscape.’ At each intersection of the lattice a resource, ‘sugar,’ grew at a constant rate. Initially, a fixed number of agents were placed randomly at the intersections of the lattice, with the rule that they were to consume sugar (at a constant rate greater than sugar’s growth rate) until the supply was exhausted, and then move to a nearby lattice point and continue consuming. The result was that agents tended to move in large circles whose circumference matched the rate of growth of sugar, so that when an agent returned to its starting point, the sugar was fully replenished. This mirrors the migratory patterns of hunter-gatherer societies, in which phenology drives movement. Figure 2 shows one time point in a sugarscape simulation.

Figure 2 A snapshot of the Sugarscape model described in Epstein, J., Axtell, R., 1996. Growing Artificial Societies: Social Science from the Bottom up. Brookings Institution Press, MIT Press, Cambridge/Washington, DC. Agent locations reflect geographical variation in the rate of growth of ‘sugar.’ The image was captured from the NetLogo model described in Li, J., Wilensky, U., 2009. NetLogo Sugarscape 3 Wealth Distribution Model. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL. http://ccl.northwestern.edu/netlogo/models/Sugarscape3WealthDistribution.

Next, Epstein and Axtell added gender and reproduction. When there was sufficient food and agents of opposite gender were on adjacent lattice intersections, they would have a child. This led to population pyramids, carrying-capacity limits to growth, and many other features found in population dynamics. If the rules were extended so that families preferred to stay near each other, tribalism emerged. Additional rules allowed pollution, diffusion of pollution, accumulation of wealth, the evolution of genetic traits, the spread of disease, specialized labor, cultural tags (memes) that could be shared or defended, trade, and combat.


In all, 17 rules were sufficient to produce a rich range of social behavior. To illustrate these rules, consider three of them (taken verbatim from Epstein and Axtell, 1996):

1. Sugarscape growback: At each lattice position, sugar grows back at a rate of α per time interval up to the capacity of that position.
2. Agent movement: Look out as far as vision permits in each of the four lattice directions, north, south, east, and west:
   a. Considering only unoccupied lattice positions, find the nearest position producing maximum welfare;
   b. Move to the new position;
   c. Collect all the resources at that location.
3. Agent mating:
   a. Select a neighboring agent at random;
   b. If the neighboring agent is of the opposite sex and if both agents are fertile and at least one of the agents has an empty neighboring site, then a newborn is produced by crossing over the parents’ genetic and cultural characteristics;
   c. Repeat for all neighbors.

Note that the first rule includes a tunable parameter; there are many such cases in the full list; and this is common in ABMs in general. The sugarscape rules are simple to program and easily interpretable in the context of the model. But they do not lend themselves to mathematical expression or analysis.
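The first two rules translate almost directly into code. The sketch below is a toy illustration, not Epstein and Axtell’s implementation: the grid size, the site capacities, the vision range, and the number of agents are arbitrary choices, welfare is taken to be simply the amount of sugar at a site, and none of the remaining rules is included.

```python
# Illustrative sketch of the first two sugarscape rules (growback and agent
# movement). Grid size, site capacities, vision, and agent count are invented
# for the example; Epstein and Axtell's full model has 17 rules.
import random

SIZE, ALPHA, VISION, N_AGENTS = 20, 1, 3, 40
random.seed(0)
capacity = [[random.randint(0, 4) for _ in range(SIZE)] for _ in range(SIZE)]
sugar = [row[:] for row in capacity]                     # every site starts at full capacity
agents = {(random.randrange(SIZE), random.randrange(SIZE)) for _ in range(N_AGENTS)}

def visible_sites(x, y, occupied):
    """Unoccupied positions within VISION steps in the four lattice directions."""
    sites = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        for d in range(1, VISION + 1):
            p = ((x + d * dx) % SIZE, (y + d * dy) % SIZE)
            if p not in occupied:
                sites.append((d, p))
    return sites

def tick():
    global agents
    # Rule 1 (growback): sugar grows back at rate ALPHA, up to each site's capacity.
    for i in range(SIZE):
        for j in range(SIZE):
            sugar[i][j] = min(capacity[i][j], sugar[i][j] + ALPHA)
    # Rule 2 (movement): each agent, in random order, moves to the nearest
    # unoccupied visible site with the most sugar and collects everything there.
    order = list(agents)
    random.shuffle(order)
    waiting = set(order)                                 # agents that have not yet moved
    moved = set()                                        # new positions of agents that have
    for (x, y) in order:
        waiting.discard((x, y))                          # the mover's own site is freed
        sites = visible_sites(x, y, waiting | moved) or [(0, (x, y))]   # stay put if boxed in
        best = max(sugar[i][j] for _, (i, j) in sites)
        _, (i, j) = min((d, p) for d, p in sites if sugar[p[0]][p[1]] == best)
        sugar[i][j] = 0
        moved.add((i, j))
    agents = moved

for _ in range(50):
    tick()
print("sugar remaining on the grid:", sum(map(sum, sugar)))
```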

Rabies in Raccoons

Hooten and Wikle (2010) introduced Bayesian hierarchical models into ABM research. Their technique does not apply to all of the very wide range of ABM formulations, but it is useful for spatiotemporal processes with fairly simple structure. Their motivating application is the spread of rabies in raccoon populations in Connecticut between 1991 and 1995. On a gridded map representing the townships in the state of Connecticut, they represented the presence or absence of rabies by a binary random variable whose distribution depended upon the states in the neighboring townships at the preceding time period, as well as covariates (which could also vary in time).

Let $\mathbf{u}_t = (u(1,t), u(2,t), \ldots, u(m,t))'$ denote the binary vector showing the presence or absence of rabies at time $t$ for each of the $m$ townships, and let $\mathbf{X}_t = (\mathbf{x}(1,t), \mathbf{x}(2,t), \ldots, \mathbf{x}(m,t))$ denote a matrix of corresponding covariates, such as population density, adjacency to the Connecticut River, and so forth. Define the neighborhood for township $i$ by $N_i$; this is a set of townships. Then the basic model for the spread of the disease is

$$\left[\, u_{i,t} \mid \mathbf{u}_{N_i,\,t-1} \,\right] \;=\; \left[\, u_{i,t} \mid h\!\left(\mathbf{u}_{N_i,\,t-1},\, \mathbf{x}_{N_i,\,t-1}\right) \,\right],$$

where $h(\cdot\,,\cdot)$ is a very general updating function, the subscript $N_i$ indicates the townships relevant to the disease spread at township $i$ (i.e., its neighboring townships), and the bracket notation indicates that the presence or absence of rabies is a random variable with parameters that depend on the conditioning within the bracket. The only substantive difference between this model and a Gaussian state space model is that the random variables need not be Gaussian (which generally precludes a closed-form solution, putting this in the realm of ABM simulation). This model is flexible, and enables disease spread to be anisotropic (i.e., directional, e.g., along the Connecticut River). It enables probabilistic statements about the posterior probability of disease in a particular township, but usually requires Markov chain Monte Carlo (cf. Robert and Casella, 2004) to evaluate. It does not apply to all ABMs (e.g., genetic algorithms or the evacuation of a building), but when it does apply, it permits more explicit statistical inference on the behavior of the ABM.
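The forward simulation implied by this kind of model is easy to sketch. The code below is illustrative only: it uses a toy grid rather than the Connecticut townships, a logistic form for h with arbitrary coefficients, and the added assumption that an infected township stays infected. Fitting such a model to data, as Hooten and Wikle do, requires the Bayesian hierarchical machinery and MCMC mentioned above.

```python
# Illustrative forward simulation of a binary spatio-temporal model of the kind
# used by Hooten and Wikle. The grid, the logistic form of h, and the
# coefficient values are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
M = 12                                   # a toy 12 x 12 grid of 'townships'
x = rng.normal(size=(M, M))              # one static covariate per township
u = np.zeros((M, M), dtype=int)
u[0, 0] = 1                              # seed the infection in one corner

def neighbor_sum(u):
    """Number of infected rook-adjacent neighbors for every township."""
    s = np.zeros_like(u)
    s[1:, :] += u[:-1, :]
    s[:-1, :] += u[1:, :]
    s[:, 1:] += u[:, :-1]
    s[:, :-1] += u[:, 1:]
    return s

def step(u, x, beta0=-4.0, beta1=2.0, beta2=0.5):
    """One update: u[i,t] ~ Bernoulli(h(neighbors, covariates)), with a logistic h."""
    eta = beta0 + beta1 * neighbor_sum(u) + beta2 * x
    p = 1.0 / (1.0 + np.exp(-eta))
    new = rng.binomial(1, p)
    return np.maximum(u, new)            # added assumption: once infected, always infected

for t in range(10):
    u = step(u, x)
print("infected townships after 10 steps:", int(u.sum()))
```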


Limitations of ABMs

ABMs are popular and will be used for the foreseeable future. The primary reasons for this popularity are that they are relatively simple to code and straightforward to validate, at least at a basic level. Not all problems are amenable to ABM representation, but for those that are, ABMs are generally easier to conceptualize and communicate than models based on complex stochastic processes or other mathematical representations. Nonetheless, ABMs are problematic, because there is no robust theory of statistical inference for them. An ABM is a model, just as a linear regression is a model. Statisticians know how to fit linear models, how to assess fit, how to make predictions from linear models with quantified uncertainty, and so forth. But there is virtually no principled theory yet for ABMs.

Verification and Validation

Verification pertains to determining whether the code in an ABM is error-free. Validation asks whether the ABM is sufficiently faithful to reality. Verification lies outside the scope of this article, other than to acknowledge that it is a significant problem, especially in complex ABMs that may entail many thousands of lines of code.

Regarding validation, there are different approaches. Some are more rigorous than others, and some apply better to certain situations than others (cf. Louie and Carley, 2008). But none of them is fully satisfactory.

Physics-based validation is commonly used in the hard sciences. One builds a simulation that incorporates all of the physical laws and interactions that are appropriate, and then feels confident that the model is faithful to reality. This can work on smaller problems where all mechanisms are fully understood. Examples where it is arguably successful include planetary motions (cf. Miller and Page, 2007: Section 6.6), flight simulators, and perhaps virtual mock-ups of semiconductor manufacturing processes. But it tends to break down as the stochasticity and complexity of the problem increase. Also, physics-based modeling often constructs the process in more detail than is actually required for reasonable fidelity, and thus takes a very long time to run.

A second approach might be termed intelligent design. This is the most common validation protocol, and is probably used in all but the most critical applications. Domain experts think through the simulation carefully, building in or approximating all the effects that they think are substantial. Then they hope that they have been smart enough, and wait to see whether their results work well enough. Intelligent design can handle more complicated situations than the physics-based models, but it is not a true validation: the review process is more like a careful check of the thinking behind the model.

Face validity is a true validation protocol. The designer tests the ABM by using selected inputs to explore the output behavior. Often the selected inputs are chosen according to a statistical design, such as a Latin hypercube, which increases efficiency greatly when the dimension of the input space is not too large. Alternatively, the designer can select values for the inputs that correspond to expected behaviors or to regions in which predictive accuracy is especially important.
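As an illustration of design-based face validation, the sketch below draws a Latin hypercube sample over three hypothetical ABM inputs; the input names, their ranges, and the run_abm() stub are placeholders for whatever the real simulator actually exposes.

```python
# A hypothetical face-validation design: Latin hypercube sampling over three
# ABM inputs. The input names, ranges, and run_abm() are invented placeholders.
import numpy as np
from scipy.stats import qmc

lower = np.array([0.05, 10.0, 0.1])      # e.g., infection rate, contacts per day, recovery rate
upper = np.array([0.50, 50.0, 0.5])

sampler = qmc.LatinHypercube(d=3, seed=0)
design = qmc.scale(sampler.random(n=30), lower, upper)   # 30 input settings

def run_abm(params):
    # Placeholder for a call to the real simulator; returns one summary output.
    rate, contacts, recovery = params
    return rate * contacts / recovery                    # stand-in for an emergent quantity

outputs = np.array([run_abm(p) for p in design])
# The designer then inspects whether the outputs behave sensibly across the design,
# e.g., monotone where theory says they should be, bounded where they must be.
print(outputs.round(2))
```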


Face validation is insufficient when the parameter space is large and there are many interactions. But, to varying degrees, it is used for systems such as TRANSIMS and the battlefield simulations produced at the Defense Modeling and Simulation Office (Davis and Anderson, 2004).

A stronger validation protocol is based upon comparison to another, independently derived, model. This is not done often enough, but it has the potential to be a powerful tool. The advantage is that one can better explore the full range of model behavior. The disadvantage is that it often requires duplication of effort and a great deal of additional development expense. But sometimes different scientific teams have developed different approaches to the same problem, as in the weather forecasting example, for which stochastic partial differential equation approximations are also used. In that case, comparison of the models can highlight the strengths and deficiencies of both.

The strongest form of validation occurs when one compares ABM outputs to the historical record for the phenomenon of interest. In principle, this is possible for weather forecasting, epidemic modeling, and TRANSIMS. However, all of these examples are noisy systems, so one expects divergence between the ABM prediction and the actual historical data. One must decide whether the error is unbiased and whether its variance is acceptable for the application of interest, and this requires a great deal of historical data and many runs of the ABM. Although comparison to real-world data is the strongest form of validation, it is still inadequate: one does not have confidence in the fidelity of the simulation in regimes that have not been previously observed, and this is often the context of greatest interest.
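A minimal version of such a check is sketched below; the ‘historical’ series and the run_abm() stub are placeholders, and in practice both the historical record and the number of ABM replicates would need to be far larger.

```python
# A sketch of comparing ABM output to the historical record: estimate the bias
# and variance of the prediction errors over many replicate runs. The data and
# the run_abm() stub are placeholders.
import numpy as np

rng = np.random.default_rng(1)
historical = np.array([12.0, 15.0, 11.0, 18.0, 14.0])    # observed outcomes (placeholder)

def run_abm():
    # Stand-in for one stochastic ABM run predicting the same five quantities.
    return historical + rng.normal(loc=0.5, scale=2.0, size=historical.size)

runs = np.array([run_abm() for _ in range(200)])         # 200 replicate runs
errors = runs - historical                               # prediction errors
print("mean error (bias):", float(errors.mean().round(3)))
print("error variance:   ", float(errors.var().round(3)))
# One then asks whether the bias is negligible and the variance is acceptable
# for the application of interest.
```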

Inference

Given an arbitrary ABM, there is no clearly formulated inferential procedure (as is available, say, for a linear regression model). One would like to determine how to tune the parameters in an ABM to fit a given data set, or how to decide which covariates are actually important to the behavior of interest in the ABM. There are two possible strategies for improving statistical inference in situations for which one cannot write out the likelihood function.

The first is based upon emulators. These are Gaussian process approximations to complex systems, where Bayesian methods allow one to combine real-world data with multiple runs of the ABM to estimate tuning functions that provide the best possible fit, and to identify regions of the input space for which the emulator offers a poor approximation to the ABM. Emulators were proposed by Kennedy and O’Hagan (2001), and have been subsequently elaborated by many researchers.

The second possible strategy is Approximate Bayesian Computation (ABC). The method was conceptually proposed by Rubin (1984), but realized in its modern form by Tavaré et al. (1997). ABC starts with a prior over the parameter space of the ABM. It generates a realization of those parameters, runs the ABM, and produces a simulated data set. That data set is compared to real-world data; if it is close with respect to some metric appropriate to the research domain, then the random parameters that generated the sample are accepted, and that point in the parameter space has increased posterior probability. The ABC process repeats until one has an estimate of the posterior density function.
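A minimal ABC rejection sampler, with a stand-in simulator in place of a real ABM, looks like the following. The prior, the distance on summary statistics, and the tolerance are all illustrative choices; in a real application each requires domain judgment.

```python
# A minimal Approximate Bayesian Computation (rejection) sketch. The 'ABM' is a
# stand-in simulator with a single parameter theta; the prior, summary distance,
# and tolerance are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.poisson(lam=4.0, size=50)          # pretend field data

def run_abm(theta, size=50):
    # Placeholder simulator: in practice this is the full agent-based model.
    return rng.poisson(lam=theta, size=size)

def distance(sim, obs):
    # Domain-appropriate metric on summary statistics; here just the gap in means.
    return abs(sim.mean() - obs.mean())

accepted = []
for _ in range(20000):
    theta = rng.uniform(0.0, 10.0)                # draw a parameter from the prior
    if distance(run_abm(theta), observed) < 0.3:  # keep parameters that reproduce the data
        accepted.append(theta)

posterior = np.array(accepted)
q = np.quantile(posterior, [0.025, 0.975])
print(f"posterior mean {posterior.mean():.2f}, 95% interval ({q[0]:.2f}, {q[1]:.2f})")
```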


Both emulators and ABC are ongoing areas of research, and their strengths and weaknesses, especially in the context of high-dimensional applications, are not fully understood. In particular, it is not known how to decide which of the two is most useful in a particular ABM application.
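For comparison with the ABC sketch above, the following is a minimal emulator sketch: a Gaussian process is fitted to a modest number of ABM runs and then used as a cheap surrogate, with uncertainty, at untried inputs. The run_abm() stub and input ranges are placeholders, and the full Kennedy and O’Hagan framework (calibration against field data and a model-discrepancy term) is omitted.

```python
# Emulator sketch: fit a Gaussian process to a small design of ABM runs and use
# it as a surrogate at new inputs. The run_abm() stub is a placeholder.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)

def run_abm(x):
    # Placeholder for an expensive ABM run with a two-dimensional input x.
    return float(np.sin(3 * x[0]) + 0.5 * x[1] ** 2 + rng.normal(0, 0.05))

X_design = rng.uniform(0, 1, size=(25, 2))           # design points (e.g., from a Latin hypercube)
y = np.array([run_abm(x) for x in X_design])         # one ABM output per design point

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(X_design, y)

X_new = rng.uniform(0, 1, size=(5, 2))               # untried input settings
mean, sd = gp.predict(X_new, return_std=True)        # emulator prediction and its uncertainty
print(np.column_stack([mean, sd]).round(3))
```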

Distances between Models

A third issue in the statistics of ABMs concerns the comparison of two models, when at least one of the models is an ABM. For example, consider the problem of estimating the spread of an epidemic. One person might build an ABM that included a geography based on a city network, where agents have rule sets that move them around the city, and when an infected agent meets an uninfected agent, there is a chance of transmitting the disease. A second person might create a similar ABM, but with more and different detail, such as higher transmission rates in day care centers, periodic crowding such as church services, or smaller time steps that allow more opportunity for people to meet. There is no clear strategy for deciding what degree of elaboration is needed, nor for determining when one model is a proper subset of another. This is complicated by the fact that the input parameters for one model may be the same as, entirely different from, or partially overlapping the input set for the other.

A similar issue arises when deciding between an ABM and a differential equation model. For example, again in the context of epidemiology, a mathematician might be tempted to use a Kermack–McKendrick model (Kermack and McKendrick, 1927), in which the changes in the numbers of susceptibles (S), infected (I), and recovered (R) are described by a system of coupled differential equations:

$$\frac{dS}{dt} = -\beta S I, \qquad \frac{dI}{dt} = \beta S I - \gamma I, \qquad \frac{dR}{dt} = \gamma I.$$

Qualitatively, the differential equations should produce dynamics similar to those obtained from an elaborate ABM. But there is no formal procedure for deciding how close these two models are, nor whether one is substantially better than the other in terms of the emergent behavior of interest. Bagni et al. (2002) discuss the comparison of such epidemiological models in more detail.
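For concreteness, the sketch below integrates the Kermack–McKendrick equations and compares the resulting infection curve with a placeholder ABM trajectory; the parameter values and the ABM output are invented, and a real comparison would use summaries of many ABM runs.

```python
# Sketch of the comparison: integrate the SIR equations and measure a crude
# discrepancy against an (invented) ABM epidemic curve.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.0005, 0.1                 # transmission and recovery rates (illustrative)

def sir(t, y):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 120), [990.0, 10.0, 0.0], t_eval=np.linspace(0, 120, 121))
S, I, R = sol.y

# abm_infected stands in for the mean infected-count trajectory over many ABM
# runs; the 'distance between models' is here just a mean absolute gap.
abm_infected = I + np.random.default_rng(0).normal(0, 5, size=I.size)   # placeholder
print("mean absolute gap between curves:", float(np.abs(I - abm_infected).mean()))
```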

See also: Hierarchical Models: Random and Fixed Effects; Social Simulation: Computational Models.

Bibliography

Bagni, R., Berchi, R., Cariello, P., 2002. A comparison of simulation models applied to epidemics. Journal of Artificial Societies and Social Simulation 5 (3).
Charles, D., Chakrabarty, D., Chickering, M., Devanur, N.R., Wang, L., 2013. Budget smoothing for internet ad auctions: a game theoretic approach. In: Proceedings of the Fourteenth ACM Conference on Electronic Commerce. ACM, New York, pp. 163–180.
Chatterjee, S., Laudato, M., Lynch, L.A., 1996. Genetic algorithms and their statistical applications: an introduction. Computational Statistics & Data Analysis 22 (6), 633–651.
Davis, P., Anderson, R., 2004. Improving the composability of DoD models and simulations. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology 1, 5–17.


Epstein, J., Axtell, R., 1996. Growing Artificial Societies: Social Science from the Bottom up. Brookings Institution Press, MIT Press, Cambridge/Washington, DC.
Gardner, M., 1970. Mathematical games: the fantastic combinations of John Conway's new solitaire game 'Life'. Scientific American 223, 120–123.
Hooten, M., Wikle, C., 2010. Statistical agent-based models for discrete spatiotemporal systems. Journal of the American Statistical Association 105, 236–248.
Kauffman, S.A., 1969. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology 22, 437–467.
Kennedy, M., O'Hagan, A., 2001. Bayesian calibration of computer models. Journal of the Royal Statistical Society, Series B 63, 425–464.
Kermack, W.O., McKendrick, A.G., 1927. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A 115, 700–720.
Li, J., Wilensky, U., 2009. NetLogo Sugarscape 3 Wealth Distribution Model. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL. http://ccl.northwestern.edu/netlogo/models/Sugarscape3WealthDistribution.
Liggett, T.M., 1985. Interacting Particle Systems. Springer, New York.

Louie, M., Carley, K., 2008. Balancing the criticisms: validating multi-agent models of social systems. Simulation Modelling Practice and Theory 16, 242–256.
Miller, J.H., Page, S.E., 2007. Complex Adaptive Systems. Princeton University Press, Princeton, NJ.
Robert, C., Casella, G., 2004. Monte Carlo Statistical Methods, second ed. Springer-Verlag, New York.
Rubin, D.B., 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics 12, 1151–1172.
Smith, L., Beckman, R., Baggerly, K., 1995. TRANSIMS: Transportation Analysis and Simulation System (No. LA-UR-95-1641). Los Alamos National Lab, NM.
Tavaré, S., Balding, D.J., Griffiths, R.C., Donnelly, P., 1997. Inferring coalescence times from DNA sequence data. Genetics 145, 505–518.
Wolfram, S., 2002. A New Kind of Science. Wolfram Media, Champaign, IL.
Wolters, B., Steffens, T., 2008. Learning agent-behavior for agent-based simulation using genetic algorithms. In: Proceedings of the European Simulation and Modeling Conference 2008. Le Havre, France, pp. 284–288.