Engineering Applications of Artificial Intelligence 23 (2010) 560–568. doi:10.1016/j.engappai.2009.11.009
Learning in groups of traffic signals

Ana L.C. Bazzan a,*,1, Denise de Oliveira a, Bruno C. da Silva b

a Instituto de Informática, UFRGS, Caixa Postal 15064, 91.501-970 Porto Alegre, RS, Brazil
b Department of Computer Science, University of Massachusetts, Amherst, MA 01003-9264, USA

* Corresponding author. E-mail addresses: [email protected] (A.L.C. Bazzan), [email protected] (D. de Oliveira), [email protected] (B.C. da Silva).
1 Author partially supported by CNPq.
Article history: Received 28 January 2009; received in revised form 28 August 2009; accepted 1 November 2009; available online 29 December 2009.

Abstract
Computer science in general, and artificial intelligence and multiagent systems in particular, are part of an effort to build intelligent transportation systems. An efficient use of the existing infrastructure relates closely to multiagent systems, as many problems in traffic management and control are inherently distributed. In particular, traffic signal controllers located at intersections can be seen as autonomous agents. However, challenging issues are involved in this kind of modeling: the number of agents is high; in general agents must be highly adaptive; and they must react to changes in the environment at the individual level while also producing an unpredictable collective pattern, as they act in a highly coupled environment. Therefore, traffic signal control poses many challenges for standard multiagent-system techniques such as learning. Despite the progress in multiagent reinforcement learning via formalisms based on stochastic games, these cannot cope with a high number of agents due to the combinatorial explosion in the number of joint actions. One possible way to reduce the complexity of the problem is to have agents organized in groups of limited size so that the number of joint actions is reduced. These groups are then coordinated by another agent, a tutor or supervisor. Thus, this paper investigates the task of multiagent reinforcement learning for control of traffic signals in two situations: agents act individually (individual learners) and agents can be "tutored", meaning that another agent with a broader view will recommend a joint action. © 2009 Elsevier Ltd. All rights reserved.
Keywords: Traffic signal control; Reinforcement learning; Multiagent systems; Multiagent learning; Markov decision process
1. Introduction

Given the increasing demand for mobility in our society and the fact that it is not always possible to provide additional capacity, a more efficient use of the available transportation infrastructure is necessary. This issue relates closely to artificial intelligence (AI) and multiagent systems. AI and multiagent techniques have been used in many stages of the traffic management process. In fact, many actors in a transportation system fit the concept of an autonomous agent very well: the driver, the pedestrian, and the traffic light. This paper focuses on control strategies as defined in the control loop described by Papageorgiou (2003) (the physical network, its model, the model of demand and disturbances; control devices; surveillance devices; and the control strategy). Specifically, control strategies based on learning are emphasized. In order to discuss this relationship, the next section briefly reviews basic
concepts on reinforcement learning and summarizes the current discussion on approaches for multiagent reinforcement learning (MARL). Techniques and methods from control theory are applied to traffic control in a fine-grained way. This leads to the problem that those techniques can be applied only to single intersections or a small number of them. For any arbitrarily large network, the real-time solution of the control loop faces a number of apparently insurmountable difficulties (Papageorgiou, 2003). Similarly, the problems posed by many actors in a multiagent reinforcement learning scenario are inherently more complex than in single-agent reinforcement learning (SARL). These problems arise mainly from the fact that no agent lives in a vacuum (Littman, 1994). While one agent is trying to model the environment (other agents included), the other agents are doing the same (or at least reacting). This results in an environment that is inherently non-stationary, so the notions of convergence as previously known can no longer be guaranteed. One popular formalism for MARL is based on stochastic games (SGs), which are investigated in game theory and extend the basic Markov decision process (MDP). However, the aforementioned increase in complexity has some consequences for this formalization. First, the approaches
proposed for the case of general-sum SGs require several assumptions regarding the game structure (agents' knowledge, self-play, etc.). Also, it is rarely stated what agents must know in order to use a particular approach. Those assumptions restrict the convergence results to common-payoff (team) games and other special cases such as zero-sum games. Moreover, the focus is normally put on two-agent games and, not infrequently, on two-action stage games. Otherwise, an oracle is needed if one wants to deal with the problem of equilibrium selection when two or more equilibria exist. This can be implemented in several ways (e.g. using the concept of a focal point) but it would probably require extra communication. Second, despite recent results on formalizing multiagent reinforcement learning using SGs, these cannot be used for systems of many agents if any flavor of joint action is explicitly considered, unless the obligation of visiting all state–action pairs is relaxed, which has an impact on convergence. The problem with a high number of agents arises mainly from the exponential increase in the space of joint actions. Up to now, these issues have prevented the use of SG-based MARL in real-world problems, unless simplifications are made, such as letting each agent learn individually through single-agent approaches. It is well known that this approach is not effective since agents converge to sub-optimal states. Therefore, partitioning the problem into several smaller multiagent systems may be a good compromise between complete distribution and complete centralization. This research line has also been pursued by others: Boutilier et al. (1999) propose the decomposition of actions, rewards and other components of an MDP. In Kok and Vlassis (2006) a sparse cooperative reinforcement learning algorithm is used in which local reward functions depend only on a subset of all possible states and variables. In traffic networks, where each agent represents a junction controlled by traffic signals, the problems mentioned above may prevent the use of MARL in control optimization, since this is a typical many-agent system in which joint actions do matter. It is not clear whether one must address all agents of the network at once, e.g. in a single-agent learning effort. This is not the practice in traffic engineering either, because control using standard measures (see e.g. Hunt et al., 1981; Diakaki et al., 2003) is computationally demanding as well. Rather, there is a trend in traffic engineering towards decentralization of control, with the network being divided into smaller groups. As shown later in this paper, even small groups pose a challenge for MARL, as the number of joint actions and states grows exponentially. Thus alternative approaches to cope with this increase are necessary. Here we propose the use of the standard individual agent-based reinforcement learning approach, with the addition of a supervisor that observes these individuals for a time period, collects joint actions and records their rewards in a kind of case base, where a case is the set of states the agents find themselves in, plus the rewards they get; the solution of the case is the joint action. After that period is over, the supervisor observes the joint state of the agents it controls, looks for the best reward it has observed so far when the agents were in that same joint state, and recommends a joint action.
This way, the present paper addresses a many-agent system in which the agents are divided into groups that are then supervised by further agents with an overview of each group's performance. The paper is organized as follows. Section 2 discusses some approaches to distributed traffic signal control that are based on AI and multiagent systems. Section 3 focuses on multiagent learning. Sections 4 and 5 introduce our approach and discuss experiments and results. Conclusions and future directions are presented in Section 6.
2. AI and MAS based approaches to distributed traffic signal control

It is beyond the scope of this paper to review the extensive literature on traffic signal control. The reader is referred to Bazzan (2009) for a survey on opportunities for learning and multiagent systems in traffic control. We note, however, that classical approaches, e.g. based on optimization via hill climbing (e.g. Robertson, 1969), signal synchronization and actuated control (e.g. Hunt et al., 1981), can be combined with the hierarchical approach proposed here in Section 4, in the sense that those forms of control can be encapsulated in agents which could then be supervised by another agent. The latter has a broader view and can collect instances of joint control in order to make future recommendations to the agents it supervises. In what follows we review some traffic signal control approaches proposed within the AI community in order to situate the reader and allow some comparisons at a qualitative level. Several AI and multiagent system based approaches to traffic modeling have been proposed. However, the focus of these approaches has usually been on the management level. Our work, on the other hand, focuses on a fine-grained level of traffic flow control via traffic signals. For an overview of other management and control measures, as well as open challenges, see Bazzan (2009). There is also a considerable number of publications reporting applications of various AI techniques to traffic control, such as genetic algorithms and fuzzy inference. We do not include all of them here because our focus is on distributed and/or decentralized control of traffic signals. However, for completeness, some references for readers interested in those forms of AI based control are included next. A reservation-based intersection control is proposed in Dresner and Stone (2004) for a simplified version of intersections without conventional traffic signals, designed for dealing with autonomous guided vehicles. Balan and Luke (2006) propose history-based controllers intended to provide global fairness, reducing the variance in waiting time. Both are expected to work only with future technologies related to autonomous vehicles and auctions among intersections. Next we review some distributed approaches to decentralized control of traffic signals in more detail. A simple stage game is discussed in Bazzan (2005) for the synchronization of traffic signals. Interactions are modeled as coordination games where the highest reward is given when neighboring traffic signals coordinate their actions so that they synchronize their green phases. Different signal plans can be selected in order to coordinate in a given traffic direction or during a pre-defined period of the day. This approach uses techniques of evolutionary game theory: self-interested agents receive a reward or a penalty given by the environment. Moreover, each agent possesses only information about its local traffic state. However, payoff matrices (or at least the utilities and preferences of the agents) are required, i.e. these figures have to be explicitly formalized by the designer of the system. In Oliveira et al. (2005) the formation of groups of traffic signals was considered in an application of a technique of distributed constraint optimization called cooperative mediation. However, in this case the mediation itself is not decentralized: group mediators communicated their decisions to the mediated agents in their groups and these agents just carried out the tasks.
Also, the mediation process could require a long time in highly constrained scenarios, imposing a negative impact on the coordination mechanism. For these reasons, Oliveira and Bazzan (2006) propose a decentralized and swarm-based model of task (signal plan) allocation in which the dynamic group formation,
without mediation, combines the advantages of decentralization via swarm intelligence and dynamic group formation. Camponogara and Kraus (2003) studied a simple scenario with only two intersections, using stochastic game theory and reinforcement learning. With this approach, their results were better than a best-effort (greedy) technique, better than a random policy, and also better than Q-learning. Also, in Nunes and Oliveira (2004) a set of different techniques was applied in order to try to improve the learning ability of agents in a simple scenario. A hierarchical multiagent system with three levels is proposed in France and Ghorbani (2003). On the first level, local traffic agents (LTAs) represent intersections. Because a local optimum may not be a good one when observed from another perspective, there is a second level in the system in which a coordinator traffic agent (CTA) supervises a few LTAs. The paper does not address the important issue of how to resolve conflicts among the LTAs, though. To address the highly dynamic and non-stationary nature of flow patterns, one solution would be to keep multiple models of the environment (and their respective policies). In Silva et al. (2006a), a technique is proposed that allows for the automatic partitioning of the environment dynamics into relevant partial models, thus using model-based reinforcement learning. However, the approach is single-agent and an extension is necessary to deal with joint states and joint actions. Wiering (2000) describes the use of reinforcement learning by traffic signal agents in order to minimize the overall waiting time of vehicles in a small grid. Those agents learn a value function that estimates the expected waiting times of vehicles given different settings of the traffic signals. One interesting issue tackled in this research is that a kind of co-learning is considered: the value functions are learned not only by the traffic signals, but also by the vehicles, which can thus compute policies to select optimal routes to their respective destinations. The ideas and some of the results presented in that paper are important. However, the communication necessary for knowledge formation has a high cost. Also, there is no account of the experience made by the drivers based on their local experiences only. In summary, the state of the art regarding traffic signal controllers is as follows. Classical approaches from traffic engineering are fine-grained but in general cannot deal with a high number of intersections in real time, especially in non-trivial, non-arterial-based topologies. AI based approaches normally address the coarse-grained, management level. Moreover, some are not yet ready for immediate deployment as the required technology is either expensive or nonexistent. The few works that do deal with control at the fine-grained level either cannot handle many intersections, require a lot of communication, or do not resolve conflicts.
3. Multiagent learning

3.1. Single agent reinforcement learning

Usually, reinforcement learning (RL) problems are modeled as Markov decision processes (MDPs). These are described by a set of states S, a set of actions A, a reward function R(s, a) → ℝ and a probabilistic state transition function T(s, a, s′) → [0, 1]. An experience tuple ⟨s, a, s′, r⟩ denotes the fact that the agent was in state s, performed action a and ended up in s′ with reward r. The goal of an MDP is to compute the optimal policy π*, which is a mapping from states to actions such that the discounted future reward is maximized.

Q-learning is a model-free approach to reinforcement learning that does not require the agent to have access to information about how the environment works. Q-learning works by estimating state–action values, the Q-values, which are numerical estimators of quality for a given pair of state and action. More precisely, a Q-value Q(s, a) represents the maximum discounted sum of future rewards an agent can expect to receive if it starts in state s, chooses action a and then continues to follow an optimal policy. The Q-learning algorithm approximates Q(s, a) as the agent acts in a given environment. The update rule for each experience tuple ⟨s, a, s′, r⟩ is given by Eq. (1), where α is the learning rate and γ is the discount for future rewards:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]    (1)

If all state–action pairs are visited during the learning process, then Q-learning is guaranteed to converge to the correct Q-values with probability one (Watkins and Dayan, 1992). When the Q-values have nearly converged to their optimal values, the action with the highest Q-value for the current state can be selected. During the learning process, the trade-off between exploitation and exploration has to be considered (Kaelbling et al., 1996).
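As an illustration, the following minimal sketch (in Python; it is our own and not part of the original paper, and all names are ours) implements the tabular update of Eq. (1):

```python
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-learning agent implementing the update of Eq. (1)."""

    def __init__(self, actions, alpha=0.5, gamma=0.0):
        self.q = defaultdict(float)      # Q-values indexed by (state, action)
        self.actions = actions
        self.alpha = alpha               # learning rate
        self.gamma = gamma               # discount for future rewards

    def update(self, s, a, r, s_next):
        # Eq. (1): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def greedy_action(self, s):
        # Exploitation: pick the action with the highest Q-value in state s.
        return max(self.actions, key=lambda a: self.q[(s, a)])
```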
3.2. Multiagent reinforcement learning: stochastic games

Learning in systems with two or more players has a long history in game theory. The connection between multiagent systems and game theory with regard to learning has also been explored, at least since the 1990s. Thus, it seems natural for the reinforcement learning community to explore the existing formalisms behind stochastic (Markov) games (SGs) as an extension of MDPs. Most of the research on SG-based MARL so far has been based on a static, two-agent stage game (i.e. a repeated game) with common payoff (the payoff is the same for both agents) and with few actions available, as in Claus and Boutilier (1998). The zero-sum case was discussed in Littman (1994), and attempts at generalization to general-sum SGs appeared in Hu and Wellman (1998), among many others (as a comprehensive description is not possible here, the reader is referred to Shoham et al. (2007) and references therein). The multiagent reinforcement learning problem can also be approached using a formalism that considers not individual states and actions as above, but joint states and/or actions, for any number of agents. In this case the formalism is a stochastic game or multiagent Markov decision process (SG or MMDP), a generalization of an MDP for n agents. An n-agent SG is a tuple (N, S, A, R, T) where:
- N = {1, ..., i, ..., n} is the set of agents.
- S is the discrete state space (set of n-agent stage games).
- A = A_1 × ⋯ × A_n is the discrete action space (set of joint actions).
- R_i is the reward function of agent i (r_i : S × A_1 × ⋯ × A_n → ℝ determines the payoff for agent i).
- T is the transition probability map (set of probability distributions over the state space S).
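To make the size of these joint spaces concrete, the short sketch below (ours, for illustration only; the numbers are a hypothetical small group and not the paper's implementation) enumerates the joint action space of three agents with three actions each:

```python
from itertools import product

# Hypothetical example: 3 traffic signal agents, each with 3 signal plans (actions 0..2).
individual_actions = [range(3)] * 3

# The joint action space A = A_1 x ... x A_n is the Cartesian product of the
# individual action sets; with 3 agents and 3 actions each it has 3**3 = 27 elements.
joint_actions = list(product(*individual_actions))
print(len(joint_actions))  # 27

# The joint state space grows the same way, so a table over joint states and joint
# actions for this small group already has 3**3 * 3**3 = 729 entries per agent.
```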
If all agents keep mappings of their joint states and actions, each agent needs to maintain tables whose size is exponential in the number of agents: |S_1| × ⋯ × |S_n| × |A_1| × ⋯ × |A_n|. This is hard even in the case of a single-state game, where |S| = 1. For example, assuming that the agents playing the repeated game have only two actions each, the size of the tables is 2^|N|. Therefore one possible approach is to partition the agents in order to decrease |N|. Even after this partition, it is necessary to
redefine Bellman's equations for the multiagent environment. For instance, using Q-learning the problem is how to update the Q-values, which now also depend on the other agents:

Q^i(s, \vec{a}) \leftarrow (1 - \alpha)\, Q^i(s, \vec{a}) + \alpha \left[ R^i(s, \vec{a}) + \gamma V^i(s') \right], \quad \text{where} \quad V^i(s') = \max_{\vec{a} \in A} Q^i(s', \vec{a})

(the superscript i refers to agent i and \vec{a} denotes a joint action). Thus, the alternative approach proposed here is not only to partition the agents into groups but also to let them use joint actions only when this has proven to be efficient. In order to do this we introduce a supervisor that is in charge of suggesting joint actions that it has recorded as good ones in the past.
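A minimal sketch of this joint-action update is given below (again our own illustration, under the assumption of a small, fixed set of joint actions; it is not the paper's implementation):

```python
from collections import defaultdict
from itertools import product

class JointQLearner:
    """Q-learning over joint actions for a single agent i, as in the MMDP update above."""

    def __init__(self, n_agents, n_actions, alpha=0.5, gamma=0.0):
        self.joint_actions = list(product(range(n_actions), repeat=n_agents))
        self.q = defaultdict(float)      # indexed by (joint_state, joint_action)
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, joint_a, reward_i, s_next):
        # V^i(s') = max over joint actions of Q^i(s', a_vec)
        v_next = max(self.q[(s_next, a)] for a in self.joint_actions)
        # Q^i(s, a_vec) <- (1 - alpha) Q^i(s, a_vec) + alpha [R^i(s, a_vec) + gamma V^i(s')]
        self.q[(s, joint_a)] = ((1 - self.alpha) * self.q[(s, joint_a)]
                                + self.alpha * (reward_i + self.gamma * v_next))
```

Even for a three-agent group the table indexed this way has hundreds of entries, which motivates the supervisor-based alternative described next.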
4. Supervised learning based approach

4.1. Overview

The supervised learning control strategy proposed here, when applied to traffic controllers, is composed of two kinds of agents. Each low-level, local agent controls one junction. In addition, hierarchically superior agents (supervisors or tutors) are in charge of controlling groups containing a small number of low-level agents. Control is used here in a relaxed sense: supervisors must in fact be seen as facilitators or tutors that observe the traffic situation from a broader perspective and recommend actions to the low-level agents in their groups. This recommendation is made from a group perspective, as opposed to the purely local perspective the low-level agents have. Details of this process are given in Section 4.3. Three traffic signal agents per group are used in this paper. Notice that the number of agents in each group determines the size of the search space for the learning algorithm. Therefore small groups are preferred, as the combinatorial space |S_1| × ⋯ × |S_n| × |A_1| × ⋯ × |A_n| can be probed within a relatively small time frame. The larger this space, the longer the supervisor agent must observe in order to collect information and be able to make useful recommendations.

4.2. Individual agents at low level

Low-level agents communicate with the traffic simulator infrastructure (described in Section 5.1) in order to get information about the state of the road portions they are controlling, as well as to send control actions back to the simulator. This communication can be as frequent as desired, but here we let agents decide about an action at intervals of δ_a = 60 s. Thus the simulator polls the signal agents every δ_a seconds and requests an action to be performed. In order to return this information to the simulator, agents decide which action to perform next using a reinforcement learning scheme. To do the necessary calculations, agents must have information about the state of their links in terms of the number of stopped vehicles in each link (an average over the last δ_s seconds). The agent then compares the load in the approaching links. It is assumed that links are one way and that each junction is a cross-type junction, so that there are only two approaching links, namely vertical and horizontal. However, this can easily be generalized to a more complex junction geometry. That comparison is then discretized into |S| states. In our case, three states are generated: state 1 if the vertical load is higher than the horizontal one; state 2 if the vertical load is lower than the horizontal one; and state 0 if the vertical load is nearly equal to the horizontal one. Nearly equal here means ε-equal, where ε = 20% was used. Notice that this coarse discretization of states aims at preventing a combinatorial explosion in the number of ⟨state, action⟩ pairs, which would demand a long
experimentation stage, especially when it comes to the supervisor, which has to consider not only those three states but the number of states raised to the size of its group, i.e. 3^3 = 27 joint states here. In traffic domains, which are highly dynamic, the issue of how fast agents react is an important one. Regarding the number of actions, each agent has three possible actions: to run signal plan 0, signal plan 1 or signal plan 2. All three plans have a cycle time equal to 60 s but different splits. Plan 0 allows both traffic directions the same green time (30–30). Plan 2 allows 70% of the cycle (42 s) to the vertical direction and 30% (18 s) to the horizontal one. Plan 1 does the opposite. Such plans are designed following traffic engineering manuals that not only specify the green time for a given topology and load at the junction, but also comply with safety issues. For instance, if plan 0 is active in the controller of a given junction, lights in one direction of this junction (say, horizontal) are red from time step 1 to 30 while lights in the other direction are green. From time step 31 to 60 the opposite happens, i.e. lights in the vertical direction are red. Because an action may change the plan that is active in the controller only at given time steps (see the discussion on δ_a above), a change to plan 2, for instance, ensures that at time step 61 lights in the vertical direction will receive green (in the case of plan 2 this happens for a further 42 s) while lights in the horizontal direction are red. Given the states and actions described before, a greedy strategy would be simply to associate each state with one action in the obvious way, i.e. state 0 with plan 0, state 1 with plan 2, and state 2 with plan 1. However, because agents are not alone, this strategy is likely to fail. Other agents act in this environment and their actions are highly coupled. One obvious coupling is the interaction between upstream and downstream traffic signals, which is, again, not trivial because each agent has more than one neighbor (i.e. two in the vertical and two in the horizontal direction). The reward agents get is inversely proportional to the average number of stopped vehicles they observe on their approaching links, normalized to remain between 0 and 1. The learning strategy used here is the basic Q-learning update presented in Section 3 (Eq. (1)). We refer to this as individual learning; it is described in Algorithm 1. Simulations were run where agents only use this algorithm in order to compare the performance of the system with the one it has when individual agents are supervised (described in the next section).
Algorithm 1. Plain individual learning.

1: for all j ∈ N do
2:   initialize Q values, list of neighbors
3:   while not end of simulation do
4:     when in state s_j, select action a_j with probability exp(Q(s_j, a_j)/T) / Σ_{a_j ∈ A_j} exp(Q(s_j, a_j)/T) (Boltzmann exploration)
5:     observe reward r_j
6:     Q(s_j, a_j) ← Q(s_j, a_j) + α (r_j + γ max_{a'_j} Q(s'_j, a'_j) − Q(s_j, a_j))
7:   end while
8: end for
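The sketch below is our own illustration of Algorithm 1 combined with the state and reward definitions of Section 4.2. The exact normalization of the reward, the reference used for the ε-comparison, and helper names such as `capacity` are assumptions, not the paper's implementation:

```python
import math
import random
from collections import defaultdict

EPSILON = 0.20   # loads within 20% of each other count as "nearly equal" (state 0)
PLANS = [0, 1, 2]

def discretize(vertical_load, horizontal_load):
    """Map the loads of the two approaching links to the three states of Section 4.2."""
    reference = max(vertical_load, horizontal_load, 1e-9)
    if abs(vertical_load - horizontal_load) / reference <= EPSILON:
        return 0                      # loads nearly equal
    return 1 if vertical_load > horizontal_load else 2

def reward(stopped_vehicles, capacity):
    """Reward inversely related to the number of stopped vehicles, kept in [0, 1]."""
    return 1.0 - min(stopped_vehicles / capacity, 1.0)

class SignalAgent:
    """Individual learner: Boltzmann exploration plus the Q-learning update of Eq. (1)."""

    def __init__(self, alpha=0.5, gamma=0.0, temperature=500.0):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.T = alpha, gamma, temperature

    def select_action(self, state):
        # Boltzmann exploration over the three signal plans.
        weights = [math.exp(self.q[(state, a)] / self.T) for a in PLANS]
        return random.choices(PLANS, weights=weights)[0]

    def learn(self, s, a, r, s_next):
        best_next = max(self.q[(s_next, a2)] for a2 in PLANS)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])
        self.T = max(0.9 * self.T, 0.1)   # temperature decay as in Table 1
```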
4.3. Supervising individual agents

Supervised learning works as formalized in Algorithms 2–5. Algorithm 2 describes the input parameters. The main ones are the set of low-level agents L; the set S of supervisor agents; Δ_ind
(the time period during which each L_j ∈ L learns and acts independently, updating the Q table Q_j^ind); Δ_tut (the time period during which each S_i ∈ S recommends an action to each L_j based on the cases observed so far); Δ_crit (the time period during which each L_j can act independently or follow the rule of its supervisor); the learning rate α; and the discount rate γ. As mentioned, in the three stages the low-level individual agents (those described in the previous section) are tutored by supervisor agents; each supervisor is associated with a group of low-level agents. The task of the supervisor is, initially, to observe states, actions, and rewards of the low-level agents and record this information in a case base. Later, in stages 2 and 3, this information is used to guide the actions of the low-level agents.
Algorithm 2. Supervised learning: input parameters.

1: input set S of supervisor agents (set of S_i's)
2: input set L of local agents (set of L_j's), one group for each S_i
3: input Δ_ind: time period during which each L_j learns and acts independently, updating the Q table Q_j^ind; each S_i observes and records the state and action of each of its L_j's, plus the average reward among the L_j's
4: input Δ_tut: time period during which each S_i recommends an action to each L_j based on cases observed so far
5: input Δ_crit: time period during which each L_j can act independently or follow the rule of the supervisor; S_i records new training instances if at least one L_j does not follow S_i's prescription
6: input α: learning rate
7: input γ: discount rate
8: t ← 0: current time period
9: Q_j^ind ← 0
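For concreteness, the parameters of Algorithm 2 could be gathered in a small configuration object (our sketch only; the default values follow Table 1, and the field names are ours):

```python
from dataclasses import dataclass

@dataclass
class SupervisedLearningParams:
    """Input parameters of Algorithm 2 (defaults taken from Table 1)."""
    delta_ind: int = 3000    # stage 1: individual learning, supervisor only observes
    delta_tut: int = 2000    # stage 2: supervisor recommends joint actions
    delta_crit: int = 3000   # stage 3: agents may refuse the recommendation
    alpha: float = 0.5       # learning rate
    gamma: float = 0.0       # discount rate
    tolerance: float = 0.01  # tau, used in stage 3 (varied in the experiments)
```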
Algorithm 3. Supervised learning (cont.): individual learning stage (stage 1).

1: while t ≤ Δ_ind do
2:   for all L_j ∈ L do
3:     when in state s_j, select action a_j with probability exp(Q(s_j, a_j)/T) / Σ_{a_j ∈ A_j} exp(Q(s_j, a_j)/T) (Boltzmann exploration), observe reward r_j
4:     Q_j^ind(s_j, a_j) ← Q_j^ind(s_j, a_j) + α (r_j + γ max_{a'_j} Q_j^ind(s'_j, a'_j) − Q_j^ind(s_j, a_j))
5:   for all S_i ∈ S do
6:     observe state, action, and reward for each L_j
7:     compute the average reward r̄ (among the L_j's)
8:     if tuple ⟨ā^t, s̄^t, r̄⟩ not yet in case base then
9:       add tuple ⟨ā^t, s̄^t, r̄⟩
10:    else
11:      r̄ ← α r̄ + (1 − α) r̄_old
12:      add tuple ⟨ā^t, s̄^t, r̄⟩
13:    end if
14:  end for
15:  end for
16: end while
Stage 1 is described in Algorithm 3. Each low-level agent j uses basic Q-learning to select an action (line 3) and update its Q table (line 4). Each supervisor i just observes the low-level agents and collects information into a base of cases (line 6). This information consists of joint states, joint actions, and rewards. Thus the case base is composed of tuples ⟨s̄, ā, r̄⟩, where r̄ is averaged over all supervised agents (line 7). Besides, if a tuple already exists in the base, the corresponding reward is computed by considering both the old value (r̄_old) and the newest observed value, as shown in line 11. This stage takes Δ_ind time steps.
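The following sketch shows how a supervisor might maintain this case base (our illustration; the reward blending follows line 11 of Algorithm 3, and the retrieval method anticipates the tutoring stage described next):

```python
class Supervisor:
    """Keeps a case base mapping joint states/actions to the average observed reward."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.cases = {}   # (joint_state, joint_action) -> average reward

    def record(self, joint_state, joint_action, avg_reward):
        key = (joint_state, joint_action)
        if key not in self.cases:
            self.cases[key] = avg_reward
        else:
            # Blend with the previously stored reward (line 11 of Algorithm 3).
            self.cases[key] = self.alpha * avg_reward + (1 - self.alpha) * self.cases[key]

    def recommend(self, joint_state):
        """Return (joint_action, expected reward) of the best case for this joint state,
        or None if no case has been observed yet (used in stages 2 and 3)."""
        candidates = {a: r for (s, a), r in self.cases.items() if s == joint_state}
        if not candidates:
            return None
        best_action = max(candidates, key=candidates.get)
        return best_action, candidates[best_action]
```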
Algorithm 4. Supervised learning (cont.): tutoring stage (stage 2).

1: while t ≤ (Δ_ind + Δ_tut) do
2:   for all S_i ∈ S do
3:     given s̄^t, find the ā^t in the case base for which r̄ is maximal; communicate a_j to each L_j
4:   end for
5:   for all L_j ∈ L do
6:     perform the action a_j communicated by the supervisor, or perform the best local action if the supervisor has not recommended any; collect reward r_j
7:     Q_j^ind(s_j, a_j) ← Q_j^ind(s_j, a_j) + α (r_j + γ max_{a'_j} Q_j^ind(s'_j, a'_j) − Q_j^ind(s_j, a_j))
8:   end for
9:   for all S_i ∈ S do
10:    observe state, action, and reward for each L_j
11:    compute the average reward r̄ (among the L_j's)
12:    if tuple ⟨ā^t, s̄^t, r̄⟩ not yet in case base then
13:      add tuple ⟨ā^t, s̄^t, r̄⟩
14:    else
15:      r̄ ← α r̄ + (1 − α) r̄_old
16:      add tuple ⟨ā^t, s̄^t, r̄⟩
17:    end if
18:  end for
19: end while

At the second stage (see Algorithm 4), which takes a further Δ_tut time steps, low-level agents stop learning individually and follow the joint action the supervisor finds in its case base. Supervisors act as in stage 1. It is important to note that in any case the local Q tables continue to be updated (line 7). In order to find an appropriate case, the supervisor observes the states the low-level agents are in and retrieves the set of actions that yielded the best reward when the agents were in those states in the past (line 3). This reward, which is the one the supervisor expects, is also communicated to the agents so that they can compare it with their expected Q values and with the actual reward they get when performing the recommended action. However, at this stage, even if the expected reward is not as good as the expected Q values, a low-level agent cannot refuse to perform the action recommended by the supervisor. If the supervisor does not have a case that relates to that particular joint state, then the low-level agents receive no action recommendation and select one independently (locally) using their local policies. In this case, the supervisor is again able to observe and record this new case (lines 12 and 13). At the third stage (which takes Δ_crit, see Algorithm 5), the low-level agents are no longer obliged to follow the recommended action. Rather, after comparing the expected reward that was communicated by the supervisor with the expected Q value (lines 7 and 8), an agent can decide to carry out the action associated with its local Q value (line 11). This means that the low-level agent will only select the recommended action if it is at least as good as the expected Q value up to a tolerance τ, as indicated in line 7 (see details when we explain the experiments). Whether or not the low-level agents follow the prescription, the supervisor is able to observe the states, actions, and rewards and thus form a new case (or update an existing one), as indicated in lines 16–22. During Δ_crit, each supervisor S_i records new training instances if at least one L_j does not follow S_i's prescription.
Algorithm 5. Supervised learning (cont.): critique stage (stage 3).

1: while t ≤ Δ_ind + Δ_tut + Δ_crit do
2:   for all S_i ∈ S do
3:     given s̄^t, find the ā^t in the case base for which r̄ is maximal; communicate a_j^p to each L_j plus the expected reward r_e
4:   end for
5:   for all L_j ∈ L do
6:     {// compare Q_j^ind and r_e:}
7:     if r_e (1 + τ) > Q_j^ind then
8:       perform a_j^p {// where a_j^p is the action recommended by the supervisor for this agent}
9:       update Q_j^ind
10:    else
11:      perform a^ind {// where a^ind is selected locally; in this case L_j should send a signal to S_i that the model is bad}
12:      update Q_j^ind
13:    end if
14:  end for
15:  for all S_i ∈ S do
16:    observe state, action, and reward for each L_j
17:    compute the average reward r̄ (among the L_j's)
18:    if tuple ⟨ā^t, s̄^t, r̄⟩ not yet in case base then
19:      add tuple ⟨ā^t, s̄^t, r̄⟩
20:    else
21:      r̄ ← α r̄ + (1 − α) r̄_old
22:      add tuple ⟨ā^t, s̄^t, r̄⟩
23:    end if
24:  end for
25: end while
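The acceptance rule of line 7 can be sketched as follows. This is our own illustration: `Supervisor.recommend`, `SignalAgent` and `PLANS` refer to the illustrative classes above, and `agent.index` (the agent's position within the joint action) is an assumption, not part of the paper's implementation:

```python
def choose_action(agent, supervisor, local_state, joint_state, tau=0.01, critique=True):
    """Stage 2/3 decision of a low-level agent (illustrative sketch only).

    In stage 2 (critique=False) the recommendation, if any, is always followed;
    in stage 3 it is followed only if the expected reward clears the local Q value
    up to the tolerance tau (line 7 of Algorithm 5).
    """
    recommendation = supervisor.recommend(joint_state)
    if recommendation is None:
        return agent.select_action(local_state)        # no recorded case: act locally
    joint_action, expected_reward = recommendation
    recommended = joint_action[agent.index]            # this agent's part of the joint action
    if not critique:
        return recommended                             # stage 2: must follow
    local_q = max(agent.q[(local_state, a)] for a in PLANS)
    if expected_reward * (1 + tau) > local_q:
        return recommended                             # stage 3: accept the recommendation
    return agent.select_action(local_state)            # stage 3: refuse and act locally
```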
5. Experimental validation

5.1. Simulation infrastructure: ITSUMO

The simulations described in this section were performed using ITSUMO (Silva et al., 2006b), a microscopic traffic simulator based on the Nagel–Schreckenberg cellular automaton (CA) model (Nagel and Schreckenberg, 1992). In short, each road is divided into cells of fixed length. This allows the representation of a road as an array in which vehicles occupy discrete positions. Each vehicle travels with a speed based on the number of cells it may advance without hitting another vehicle. The vehicle behavior is expressed by rules that represent a special form of car-following behavior. In this model there is a randomization factor (vehicles decelerate with probability p) in order to simulate the non-deterministic dynamics of vehicle movement. Obviously, the higher p, the higher the number of vehicles in the network at any given instant. Thus, we have used the parameter p in our simulations as a way to increase the number of vehicles in the network (over the basic input flow) and hence test the efficiency of our approach. ITSUMO is composed of several modules: the data module, the simulation kernel, the driver definition module, the traffic signal agent module, and the visualization module. The most relevant to the present paper is the traffic signal agent module, which controls the lights. Agents can control one or more intersections. For the approach discussed in the present paper we have incorporated the supervisor agent.
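For readers unfamiliar with the Nagel–Schreckenberg model, the single-lane update step can be sketched as follows. This is our simplified illustration, not ITSUMO code; the maximum speed, circular boundary and data layout are assumptions:

```python
import random

V_MAX = 5   # maximum speed in cells per time step (assumption)

def nasch_step(positions, speeds, road_length, p):
    """One Nagel-Schreckenberg update on a circular single-lane road.

    positions/speeds are parallel lists sorted by position; p is the random
    deceleration probability used in Section 5.2.
    """
    n = len(positions)
    new_speeds = []
    for i in range(n):
        gap = (positions[(i + 1) % n] - positions[i] - 1) % road_length
        v = min(speeds[i] + 1, V_MAX)   # 1. acceleration
        v = min(v, gap)                 # 2. braking: do not hit the vehicle ahead
        if v > 0 and random.random() < p:
            v -= 1                      # 3. random deceleration with probability p
        new_speeds.append(v)
    new_positions = [(positions[i] + new_speeds[i]) % road_length for i in range(n)]
    return new_positions, new_speeds
```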
5.2. Experiments and results

In order to run experiments evaluating the performance of the supervised learning, ITSUMO was also used to create the network topology depicted in Fig. 1. In this figure one sees a grid network composed of 64 nodes (denoted A1 to H8). Between each two nodes there is a one-way link.2 Border nodes are either sinks (lighter or red nodes at the borders) or sources (darker or blue nodes at the borders). Roads are defined as sets of links connecting one source with one sink in one of the two possible directions. Hence we have road A, road B, etc. in the vertical direction, and road 1, road 2, etc. in the horizontal direction. ITSUMO was run with the following setting: vehicles are inserted at the beginning of each road with a given probability. In the vertical roads this probability is 1/3, meaning that one vehicle is inserted, on average, every 3 time steps; in the horizontal roads it is 0.1. We keep these probabilities constant during the simulation because we are primarily interested in the comparison of supervised versus individual learning and not in the performance of the individual learners. Keeping constant insertion rates does not mean that the environment is deterministic. There are at least three sources of stochasticity. First (and mainly), individual traffic signal agents are learning independently (at least in stage 1 of the algorithm), which is the main problem in multiagent learning, as local agents are trying to maximize their rewards in an uncoordinated way (see Section 3). Second, vehicles can decelerate with probability p, a feature of the cellular automaton model (see Section 5.1), which may cause unexpected jams. Finally, drivers are able to turn at each junction, although this is likely to play a minor role here as we keep this probability artificially low (1%) in order to focus on the performance of individual learning versus supervised learning. We remark, however, that when this probability is higher the main trends do not change, although the time for convergence of the plots increases. Therefore, this setting is far from well behaved, and greedy strategies by the traffic signals (e.g. give priority to the more congested direction) are not likely to work. For the same reason the fixed strategy of always running signal plan 2 (which gives priority to the vertical direction, which carries more traffic) is not a good decision either. The simulator was run for 8000 time steps. If the values of Δ_ind, Δ_tut and Δ_crit change then the simulation time must change too. In all cases discussed below the total number of stopped vehicles was measured, i.e. the sum of vehicles stopped in all lanes over time. All simulations were repeated 10 times. Because the plots are running averages over the 8000 steps we do not show the deviations in the plots, but note that they are below 10%. Table 2 shows averages and standard deviations for learning stages 1, 2, and 3 for each curve of the graphs depicted in Figs. 2 and 3. Values for the main parameters used in the simulation are given in Table 1. A series of experiments was performed to evaluate the approach. First, there was no supervision and low-level agents just ran the standard Q-learning algorithm with learning rate α = 0.5 and discount rate γ = 0.0 in an individual way. No discounting is used here because reward values for both local and supervisor agents must be comparable; supervisors do not use discount values in the algorithm, only the learning rate.
The value of α = 0.5 was selected to give equal weight to new and past Q values. This is necessary in dynamic environments such as the one described here.
2 The fact that links are one-way does not pose any constraint on the use of the supervised learning approach; this configuration was selected only to simplify the design of the traffic signal plans.
Fig. 1. Scheme of groups and supervisors.
Table 1. Parameters and their values.

Parameter   Description                        Value
|L|         Number of agents                   36
|S|         Number of supervisors              12
T           Initial temperature (Boltzmann)    500
            Decay for temperature              0.9 T (each action selection, until T_min = 0.1)
α           Learning coefficient               0.5
γ           Discount rate                      0.0
Δ_ind       Stage 1                            3000 steps
Δ_tut       Stage 2                            2000 steps
Δ_crit      Stage 3                            3000 steps
τ           Tolerance                          varied
p           Deceleration probability           varied

Table 2. Averages and standard deviations for stages 1–3, for each curve appearing in Figs. 2 and 3.

Curve (time frame)        p = 0.05 (Fig. 2)       p = 0.1 (Fig. 3)
                          Avg.    St. dev.        Avg.    St. dev.
QL (0–3000)               729     32              768     23
τ = 0.1 (0–3000)          656     31              722     24
τ = 0.01 (0–3000)         649     32              716     23
QL (3001–5000)            994     33              1012    35
τ = 0.1 (3001–5000)       751     44              864     54
τ = 0.01 (3001–5000)      784     33              880     48
QL (5001–8000)            1084    41              1107    48
τ = 0.1 (5001–8000)       694     61              823     43
τ = 0.01 (5001–8000)      830     41              933     73

Fig. 2. Total number of stopped vehicles along time for p = 0.05, with and without supervision (curves: QL and supervised learning with τ = 0.1 and τ = 0.01, Mem = 300).

Fig. 3. Total number of stopped vehicles along time for p = 0.1, with and without supervision (curves: QL and supervised learning with τ = 0.1 and τ = 0.01, Mem = 300).
The performance of the individual learning scheme can be seen in Figs. 2 and 3 (dashed line). As one notes, the performance is good in the beginning (the time necessary for the network to saturate), with a lower number of stopped vehicles. However, this changes with time and there are peaks of more than 1000 stopped vehicles. We then ran simulations with supervised learning. The 12 supervisors are in charge of three low-level agents each. We let the low-level agents at the borders of the grid run greedy strategies because these are more efficient when agents are close to the sources and sinks (where vehicles are inserted or removed). This means that 36 low-level agents are supervised. This is depicted in Fig. 1. As mentioned before, in stage 3, which starts at time 5000, low-level agents may refuse to perform the action recommended by the supervisor. In order to decide whether or not to refuse, a low-level agent compares the reward the supervisor is expecting with the value of its Q table for the current state. The comparison between the local expected Q value and the expected reward
communicated by the supervisor has a tolerance factor τ. This tolerance means that when a supervisor says that the expected reward is r_e and the low-level agent expects a reward Q_j, the agent will accept the action proposed by the supervisor if it is nearly as good as Q_j (see Algorithm 5, line 7, where the tolerance τ is considered). Experiments with different values of τ were performed in order to see whether this makes a difference. When τ = 0 the agent has no tolerance: it will only carry out the action recommended by the supervisor if the expected reward is strictly higher than the one it expects from its local learning process. Here the results
for τ = 0.01 and 0.1 are shown. Further, two values of p were tested: p = 0.05 and 0.1. Given our experience, higher values of p are unrealistic: p = 0.1 already means that each driver decelerates with probability 10% at each time step. The results of these experiments are depicted in Fig. 2 (for p = 0.05) and Fig. 3 (for p = 0.1). One can see that up to time step 5000 the behavior of the curves that refer to supervised learning is nearly the same, because the tolerance τ is only used in stage 3. In all cases the number of stopped vehicles is lower than when the agents use only individual learning. Increasing p has the effect of yielding more jams. This is reflected in Fig. 3: the number of stopped vehicles is higher. In any case, one may conclude that supervision does pay off. While the agents are being strictly supervised (they must carry out the recommended joint action, which is the case in stage 2), the number of stopped vehicles is lower than when they learn independently. In stage 3 (after time step 5000) the best performance (i.e. the lowest number of stopped vehicles) is achieved when τ = 0.01, i.e. when low-level agents only carry out the action proposed by the supervisor if it is almost as good as the value they expect from their own Q tables. As a consequence they act in a coordinated way and carry out a joint action that was the one intended by the supervisor after searching its case base. When low-level agents are more tolerant there is a tendency to perform the joint action even when it is not completely appropriate. A final remark is that this problem of multiagent learning in a group of three traffic signal agents already poses computational challenges if tackled as a standard joint learning problem. To see this we return to the discussion at the end of Section 3. Considering the size of the learning problems we address within the groups, if there were no supervisor and the three agents were to keep mappings of their joint states and actions, each agent would need to maintain a table of size |S_1| × ⋯ × |S_n| × |A_1| × ⋯ × |A_n|. Given that n = 3, |S| = 3 and |A| = 3, the size of the table is 3^3 × 3^3 = 729 entries for each agent. In the supervised learning it is not always the case that the supervisor will see all these combinations; however, for those seen, it is able to keep records of previous rewards and is then able to make recommendations.
6. Conclusion

Multiagent reinforcement learning is inherently more complex than single-agent reinforcement learning because, while one agent is trying to model the environment, the other agents are doing the same. This results in an environment that is non-stationary. MDP-based formalisms extended to multiagent environments (MMDPs) are an efficient solution only in the case of few agents, few states, and few actions. For a large number of agents, alternative solutions are necessary. In this paper the problem is partitioned into several smaller multiagent systems. Our approach is to have agents divided into groups that are then supervised by further agents; these have a broader view, even if it is not detailed or up to date. A supervised learning scheme with three stages was proposed: in the first, the supervisor only collects information about states, actions, and rewards received by the agents it supervises, storing this in a case base. In the second stage, the supervisor retrieves the best case for a given state and recommends actions to the low-level agents, which must carry out these recommended joint actions. In the third stage low-level agents still receive suggestions but need not carry them out if they have a better action to select. During all stages each supervisor updates its case base, and the low-level agents update their Q tables.
This approach was applied to a traffic light control scenario. As mentioned at the end of Section 2, existing classical or AI-based approaches all have drawbacks. In particular, learning-based approaches are promising (see e.g. Bazzan, 2009; Camponogara and Kraus, 2003; Silva et al., 2006a; Wiering, 2000) but so far cannot deal with the exponential growth in the space of state–action pairs, especially if joint actions are explicitly considered. Therefore our partition- and supervision-based approach is a good compromise between complete distribution and complete centralization. Our results show that supervision pays off in the traffic control scenario with respect to the number of stopped vehicles. When there is no supervision, low-level agents just learn using individual Q-learning, which does not explicitly take joint actions into account. In this case the number of stopped vehicles is higher, meaning that the signal plans selected are not the best. In the future we plan to investigate the relation between the number of times a suggestion was or was not followed and the reward obtained over time. We also want to implement further levels of supervision (a kind of hierarchical learning). Finally, we are currently working on a different way to compare rewards in stage 3 in order to be able to use discounting in Q-learning.
Acknowledgements

Ana Bazzan is partially supported by CNPq and by the Alexander v. Humboldt Foundation; Denise de Oliveira was partially supported by CAPES and CNPq.

References

Balan, G., Luke, S., 2006. History-based traffic control. In: Nakashima, H., Wellman, M.P., Weiss, G., Stone, P. (Eds.), Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems. ACM Press, New York, NY, USA, pp. 616–621.
Bazzan, A.L.C., 2005. A distributed approach for coordination of traffic signal agents. Autonomous Agents and Multiagent Systems 10 (1), 131–164.
Bazzan, A.L.C., 2009. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multiagent Systems 18 (3), 342–375.
Boutilier, C., Dean, T., Hanks, S., 1999. Decision-theoretic planning: structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11, 1–94. URL: http://dblp.uni-trier.de/db/journals/jair/jair11.html#BoutilierDH99.
Camponogara, E., Kraus, W. Jr., 2003. Distributed learning agents in urban traffic control. In: Moura-Pires, F., Abreu, S. (Eds.), EPIA, pp. 324–335.
Claus, C., Boutilier, C., 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 746–752.
Diakaki, C., Dinopoulou, V., Aboudolas, K., Papageorgiou, M., Ben-Shabat, E., Seider, E., Leibov, A., 2003. Extensions and new applications of the traffic signal control strategy TUC. In: Proceedings of the 82nd Annual Meeting of the Transportation Research Board, pp. 12–16.
Dresner, K., Stone, P., 2004. Multiagent traffic management: a reservation-based intersection control mechanism. In: Jennings, N., Sierra, C., Sonenberg, L., Tambe, M. (Eds.), The Third International Joint Conference on Autonomous Agents and Multiagent Systems. IEEE Computer Society, New York, USA, pp. 530–537.
France, J., Ghorbani, A.A., 2003. A multiagent system for optimizing urban traffic. In: Proceedings of the IEEE/WIC International Conference on Intelligent Agent Technology. IEEE Computer Society, Washington, DC, USA, pp. 411–414.
Hu, J., Wellman, M.P., 1998. Multiagent reinforcement learning: theoretical framework and an algorithm. In: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 242–250.
Hunt, P.B., Robertson, D.I., Bretherton, R.D., Winton, R.I., 1981. SCOOT—a traffic responsive method of coordinating signals. TRRL Lab. Report 1014, Transport and Road Research Laboratory, Berkshire.
Kaelbling, L.P., Littman, M., Moore, A., 1996. Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, 237–285.
Kok, J., Vlassis, N., 2006. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research 7, 1789–1828.
Littman, M.L., 1994. Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the 11th International Conference on Machine Learning (ML). Morgan Kaufmann, New Brunswick, NJ, pp. 157–163.
Nagel, K., Schreckenberg, M., 1992. A cellular automaton model for freeway traffic. Journal de Physique I 2, 2221.
Nunes, L., Oliveira, E.C., 2004. Learning from multiple sources. In: Jennings, N., Sierra, C., Sonenberg, L., Tambe, M. (Eds.), Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multi Agent Systems, AAMAS, vol. 3. IEEE Computer Society, New York, USA, pp. 1106–1113.
Oliveira, D., Bazzan, A.L.C., 2006. Traffic lights control with adaptive group formation based on swarm intelligence. In: Dorigo, M., Gambardella, L.M., Birattari, M., Martinoli, A., Poli, R., Stuetzle, T. (Eds.), Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006. Lecture Notes in Computer Science, Springer, Berlin, pp. 520–521.
Oliveira, D., Bazzan, A.L.C., Lesser, V., 2005. Using cooperative mediation to coordinate traffic lights: a case study. In: Dignum, F., Dignum, V., Koenig, S., Kraus, S., Singh, M.P., Wooldridge, M. (Eds.), Proceedings of the 4th International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS). IEEE Computer Society, New York, pp. 463–470.
Papageorgiou, M., 2003. Traffic control. In: Hall, R.W. (Ed.), Handbook of Transportation Science. Kluwer Academic Publishers, Dordrecht, pp. 243–277 (Chapter 8).
Robertson, D.I., 1969. TRANSYT: a traffic network study tool. Report LR 253, Road Research Laboratory, London.
Silva, B.C.d., Basso, E.W., Bazzan, A.L.C., Engel, P.M., 2006a. Dealing with non-stationary environments using context detection. In: Cohen, W.W., Moore, A. (Eds.), Proceedings of the 23rd International Conference on Machine Learning (ICML). ACM Press, New York, pp. 217–224. URL: www.inf.ufrgs.br/maslab/pergamus/pubs/Silva2006icml.pdf.
Silva, B.C.d., Junges, R., Oliveira, D., Bazzan, A.L.C., 2006b. ITSUMO: an intelligent transportation system for urban mobility. In: Nakashima, H., Wellman, M.P., Weiss, G., Stone, P. (Eds.), Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS. ACM Press, New York, pp. 1471–1472. URL: www.inf.ufrgs.br/maslab/pergamus/pubs/Silva+2006Demo.pdf.
Shoham, Y., Powers, R., Grenager, T., 2007. If multi-agent learning is the answer, what is the question? Artificial Intelligence 171 (7), 365–377.
Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Machine Learning 8 (3), 279–292.
Wiering, M., 2000. Multi-agent reinforcement learning for traffic light control. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pp. 1151–1158.
Robertson, 1969. TRANSYT: A traffic network study tool. Report LR 253, Road Research Laboratory, London. Silva, B.C.d., Basso, E.W., Bazzan, A.L.C., Engel, P.M., 2006a. Dealing with nonstationary environments using context detection. In: Cohen, W.W., Moore, A., (Eds.), Proceedings of the 23rd International Conference on Machine Learning ICML, New York, ACM Press, pp. 217–224. URL /www.inf.ufrgs.br/maslab/ pergamus/pubs/Silva2006icml.pdfS. Silva, B.C.d., Junges, R., Oliveira, D., Bazzan, A.L.C., 2006b. ITSUMO: an intelligent transportation system for urban mobility. In: Nakashima, H., Wellman, M.P., Weiss, G., Stone, P. (Eds.), Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, ACM Press, New York, pp. 1471–1472. URL /www.inf.ufrgs.br/maslab/pergamus/pubs/Sil va + 2006Demo.pdfS. Shoham, Y., Powers, R., Grenager, T., 2007. If multi-agent learning is the answer, what is the question. Artif. Intell. 171 (7), 365–377. Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Machine Learning 8 (3), 279–292. Wiering, M., 2000. Multi-agent reinforcement learning for traffic light control. In: Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pp. 1151–1158.