Traffic Signal Control based on Markov Decision Process*

14th IFAC Symposium on Control in Transportation Systems, May 18-20, 2016, Istanbul, Turkey



IFAC-PapersOnLine 49-3 (2016) 067–072


Yunwen Xu, Yugeng Xi, Dewei Li, Zhao Zhou

Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China. (e-mail: [email protected], [email protected], [email protected], [email protected])

Abstract: This paper proposes a Markov state transition model for an isolated intersection in urban traffic and formulates the traffic signal control problem as a Markov Decision Process (MDP). In order to reduce the computational burden, a sensitivity-based policy iteration (PI) algorithm is introduced to solve the MDP. The proposed model is stage-varying according to the traffic flow variation around the intersection, and the state transition matrices and cost matrices are updated so that a new optimal policy can be searched by the PI algorithm. The proposed model can also be easily extended from an isolated intersection to a traffic network, based on the space-time distribution characteristics of traffic flow, and so can the PI algorithm. Numerical experiments on a small traffic network show that this approach is capable of reducing the number of vehicles substantially compared with fixed-time control, particularly for high traffic demand, while being computationally efficient.

Keywords: Markov state transition model; Traffic signal control; Policy iteration algorithm; Markov decision process

1. INTRODUCTION

Traffic congestion is a stringent issue in modern society due to increasing population and economic activity, which motivates the requirement for better utilization of the existing infrastructure and for efficient control of the traffic flow. Among all the measures, traffic signal control is a major component whose enhancement is the most efficient way to reduce traffic congestion.

The signal control problem has been studied extensively in the literature. Apart from the early developed adaptive signal control systems, such as SCOOT (Hunt et al. (1982)) and SCATS (Lowrie (1982)), the existing signal control problem generally contains two aspects: (1) a mathematical model for the complex traffic system, and (2) an appropriate control law such that the behavior of the system meets certain performance indices. Aboudolas et al. (2009), Lin et al. (2011), Zhou et al. (2014), Lo (2001), Beard and Ziliaskopoulos (2006), and Aziz and Ukkusuri (2012) described signal control as a mathematical programming problem with embedded deterministic traffic flow models. However, for large-scale road networks, most of the above strategies carry a heavy computational burden.

Abdulhai et al. (2003), Balaji et al. (2010), Prashanth and Bhatnagar (2011), El-Tantawy et al. (2013), Robertson and Bretherton (1974), and Yu and Recker (2006) applied the framework of the Markov Decision Process (MDP) to treat signal control in a random traffic environment as a sequential decision-making problem, where dynamic programming (DP) and reinforcement learning (RL), as well as their variants, were commonly used. DP has a bottleneck in the computational complexity of the recursive calculation of Bellman's equation, which is exponential in the size of the state space and action space. RL-based methods have the major advantage that the optimization can be executed off line, but a stochastically stable environment and constant state-action costs are the premise for obtaining the eventual optimal stationary control policy, which restricts the random events that can be handled, such as a surge of traffic flow or adverse weather.

In this paper, we propose a Markov state transition model for an intersection to describe the random nature of the traffic system. This model can be easily extended from an isolated intersection to a traffic network according to the space-time distribution characteristics of traffic flow. The entire traffic signal control problem is described as an MDP based on the proposed model. To solve the MDP efficiently, a PI algorithm from the sensitivity viewpoint is introduced. The computational complexity of the PI algorithm is very small when the number of policies is less than 10^10 (see details in Cao (2007), Section 4.1), which is much larger than that of the optimization problem formulated in this paper. Simulation on microscopic traffic simulation software (CORSIM) shows that the PI algorithm with the proposed model can reduce the number of vehicles significantly compared with fixed-time control, especially under scenarios of high traffic demand.

* This work is supported in part by the National Science Foundation of China (Grant No. 61374110, 61433002, 61221003) and an NSFC International Cooperation Project (Grant No. 71361130012).

2405-8963 © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. 10.1016/j.ifacol.2016.07.012


The rest of this paper is organized as follows. In Section 2, we give a brief introduction to the MDP and the PI algorithm from the sensitivity viewpoint. The detailed description of the proposed Markov state transition model is provided in Section 3. Section 4 demonstrates the simulation results and the numerical comparison with the fixed-time control. Finally, Section 5 concludes the paper.

2. MARKOV DECISION PROCESS AND POLICY ITERATION ALGORITHM

2.1 Markov decision process

The MDP has been extensively studied in the literature. Interested readers can find a good introduction to MDPs in the book by Puterman (2014). For the sake of completeness, a brief introduction to the discrete-time MDP is given below.

In an MDP, at any time k, k = 0, 1, 2, ..., the system is in a state X_k ∈ S, where S = {1, 2, ..., n} is a finite state space. According to the Markov property of a state process, its future behavior is determined only by the current state X_k, which contains all the information of the system history.

In addition to the state space, there is an action space A. If the system is in state s, s ∈ S, we can take (independently of the actions taken in other states) any action a ∈ A(s) ⊆ A and apply it to the system, where A(s) is the set of actions available in state s ∈ S and A = ∪_{s∈S} A(s). A (stationary and deterministic) policy is a mapping from S to A, denoted as d = (d(1), d(2), ..., d(n)), with d(s) ∈ A(s), s = 1, 2, ..., n, determining the action taken in state s. We use D = ×_{s∈S} A(s) to denote the space of all possible policies, where "×" is the Cartesian product of sets.

Therefore, if policy d is adopted, the state transition probability matrix is P^d = [p^{d(s)}(s_1|s)]_{s,s_1=1}^{n}. The reward vector is denoted as f^d = (f(1, d(1)), f(2, d(2)), ..., f(n, d(n)))^T. The steady-state probability vector of a Markov chain under policy d is denoted as π^d = (π^d(1), π^d(2), ..., π^d(n)), with π^d = π^d · P^d. Considering an ergodic Markov chain under policy d, the long-run average reward is:

η^d = lim_{L→∞} (1/L) Σ_{l=0}^{L−1} f[X_l, d(X_l)] = π^d f^d    (1)

in which L is the length of the Markov chain.

2.2 Sensitivity-based policy iteration algorithm

There are two basic approaches to solving standard MDPs: value iteration and policy iteration. Value iteration is basically a numerical approach, such as Q-learning (Sutton and Barto (1998)). In this subsection, a PI algorithm based on sensitivity analysis is introduced. Its basic principle can be briefly described as (Cao (2007)): by observing and analyzing the behavior of a system under a policy, find another policy that performs better, if such a policy exists.

Suppose η^h and η^d are the long-run average rewards corresponding to two policies h and d, h, d ∈ D. The comparison of the two policies is based on the performance difference formula:

η^h − η^d = π^h [(f^h + P^h g^d) − (f^d + P^d g^d)]    (2)

where g^d = [g^d(1), g^d(2), ..., g^d(n)]^T is the performance potential vector of policy d, and g^d(s), s ∈ S, measures the "potential" contribution of state s to the long-run average reward η^d. It can be calculated from the Poisson equation:

(I − P^d) g^d + η^d e = f^d    (3)

where I is the identity matrix and e = (1, 1, ..., 1)^T. The PI algorithm is shown in Algorithm 1.

Algorithm 1 Policy Iteration Algorithm (Cao (2007))
1: Guess an initial policy d_0, set k = 0.
2: (Policy evaluation) Obtain the potential g^{d_k} by solving the Poisson equation (I − P^{d_k}) g^{d_k} + η^{d_k} e = f^{d_k}.
3: (Policy improvement) Choose

d_{k+1} ∈ arg max_{d∈D} [f^d + P^d g^{d_k}]    (4)

component-wise (i.e., determine an action for each state). If in state s the action d_k(s) attains the maximum, then set d_{k+1}(s) = d_k(s).
4: If d_{k+1} = d_k, stop; otherwise, set k := k + 1 and go to step 2.

Therefore, for a specific control problem, once the state transition probability matrix P and the corresponding reward vector f for each available policy are defined, then by maximizing the long-run average reward η, a policy choosing an optimal action for each state can be obtained, which represents the optimal strategy that should be followed.

3. MARKOV STATE TRANSITION MODEL FOR AN INTERSECTION

In this section we present a Markov state transition model for an intersection, which consists of two steps: calculating the state transition matrices for the in-roads of an intersection, and then building the Markov state transition model for the intersection.

3.1 Notation and network description

Consider a traffic network composed of a number of intersections and roads. Each intersection j ∈ J consists of several in-roads, I_j, which are mutually disjoint; denote I = ∪_{j∈J} I_j. Each road i ∈ I with N_i lanes has a number of upstream roads I_{i,up} and downstream roads I_{i,down}. The vehicles entering road i come from I_{i,up}, and the vehicles leaving road i drive to I_{i,down} with different turning options. The service time (green light) for road i is represented by a vector t_g^i = [t_{g,L}^i, t_{g,T}^i, t_{g,R}^i]^T, denoting the service time for the left, through and right turning options respectively.

In the rest of this section we assume that all intersections have a common cycle length T, the time devoted to serving vehicles from the different in-roads at an intersection. We also define a uniform cycle length T_m for Markov model updating, T_m = M·T with M an integer. Control decisions in our policy are then made at the beginning of each cycle T_m, based on the assumption that the traffic flow remains unchanged in the short term (T_m) in the small area of the
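To make the sensitivity-based PI of Algorithm 1 concrete, here is a minimal sketch on a toy two-state, two-action average-reward MDP. The transition rows and rewards are hypothetical illustration values, not the traffic model of Section 3; policy evaluation solves the Poisson equation (3) with the normalization g(1) = 0, and policy improvement applies the component-wise maximization (4).

```python
import numpy as np

def evaluate(P, f):
    """Policy evaluation: solve the Poisson equation (I - P) g + eta * e = f,
    fixing g[0] = 0 to pin down the potential vector."""
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = np.eye(n) - P      # (I - P) g
    A[:n, n] = 1.0                 # + eta * e
    A[n, 0] = 1.0                  # normalization g[0] = 0
    b = np.append(f, 0.0)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    return x[:n], x[n]             # potential g, long-run average reward eta

def policy_iteration(P_sa, f_sa):
    """P_sa[s][a]: transition row under action a in state s; f_sa[s][a]: reward.
    Returns an optimal stationary policy and its average reward."""
    n = len(P_sa)
    d = [0] * n
    while True:
        P = np.array([P_sa[s][d[s]] for s in range(n)])
        f = np.array([f_sa[s][d[s]] for s in range(n)])
        g, eta = evaluate(P, f)
        d_new = []
        for s in range(n):
            vals = [f_sa[s][a] + P_sa[s][a] @ g for a in range(len(P_sa[s]))]
            best = int(np.argmax(vals))
            # keep the current action on ties so the iteration terminates
            d_new.append(d[s] if np.isclose(vals[d[s]], vals[best]) else best)
        if d_new == d:
            return d, eta
        d = d_new

# Hypothetical MDP: action 1 in state 0 moves toward the rewarding state 1
# (w.p. 0.9); action 0 in state 1 keeps the system there (w.p. 0.9).
P_sa = [[np.array([0.9, 0.1]), np.array([0.1, 0.9])],
        [np.array([0.1, 0.9]), np.array([0.9, 0.1])]]
f_sa = [[0.0, 0.0], [1.0, 1.0]]
d_opt, eta = policy_iteration(P_sa, f_sa)   # -> d_opt = [1, 0], eta = 0.9
```

The sketch converges here in two iterations, matching the intuition that the best policy steers toward, and then holds, the rewarding state; with the ergodic chains above the augmented Poisson system is full rank, so the least-squares solve is exact.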


traffic network. In addition, we assume that all the vehicles in the network have approximately the same length.

3.2 Queue dynamics model

In our model, the road occupancy is chosen as the state variable instead of the vehicle number to represent the dynamic behavior on the road, which establishes a uniform scheme for defining the traffic states of roads with different lengths. The occupancy of road i evolves according to the conservation of vehicles:

O_i(k+1) = O_i(k) + α_i · (Q_{i,in}(k) − Q_{i,out}(k)),  α_i = l_veh / (N_i l_i),  O_i(k) = α_i · Q_i(k)    (5)

where Q_i(k) is the number of vehicles on in-road i at time step k; Q_{i,in}(k) and Q_{i,out}(k) represent the total vehicles entering and leaving road i during the time interval [kT, (k+1)T], respectively; O_i(k) denotes the occupancy of road i at time step k; l_i is the length of road i; and l_veh is the vehicle length including the safety gap between two adjacent vehicles. Let Z_i(k) be the number of vehicles served from in-road i in traffic cycle T if there are enough vehicles on road i. Thus the number of vehicles actually served on road i is

Q_{i,out}(k) = min(Q_i(k), Z_i(k)),  Z_i(k) = T_g^i / h_min    (6)

in which h_min is the minimum headway, and T_g^i is the total service time over the different turning options for road i.

With respect to arriving vehicles, Wang and Ruskin (2002) and Yang et al. (2005) applied a Poisson distribution to describe vehicle arrivals. In this paper, we employ the same distribution, and the probability of m vehicles arriving during the interval [kT, (k+1)T] can be computed as

P(Q_{i,in}(k) = m) = λ_i(k)^m · e^{−λ_i(k)} / m!    (7)

where λ_i(k) denotes the expected arrival rate per time interval T for road i at time step k. It can be estimated from the historical information of the last M time intervals [(k−M)T, (k−1)T] and keeps the same value for the next M time intervals [kT, (k+M−1)T]. Therefore, the expected arrival rate at time step k can be calculated by:

λ_i(k) = ( Σ_{t=1}^{M} Σ_{i′∈I_{i,up}} Q_{i′,in}(k−t) ) / (L_{i,up} · M)    (8)

where L_{i,up} is the number of roads in I_{i,up}.

3.3 Markov state transition matrix for a road

Considering the accuracy and complexity of the model, we characterize the state of road occupancy by three congestion levels in this paper: the free-flow state (denoted as 1), the moderate congestion state (denoted as 2), and the heavy congestion state (denoted as 3). Specifically, the discretization of states for road i is given by:

s_i(k) = { 1 if 0 ≤ O_i(k) < a;  2 if a ≤ O_i(k) < b;  3 if b ≤ O_i(k) ≤ 1 }    (9)

where a = α_i · floor(0.3 · Q_{i,max}) and b = α_i · floor(0.6 · Q_{i,max}). Herein Q_{i,max} represents the number of vehicles that road i can hold, and floor(x) refers to the largest integer that is less than or equal to x. For each state, the road occupancy O_i(k) follows a discrete uniform distribution with two parameters (the upper and lower bounds).

Due to the time-varying traffic flow and the uncertain service time for road i, nine events are possible at each time interval T, with different probabilities of occurrence. These probabilities compose a Markov state transition matrix, which is a function of the expected arrival rate λ_i(k) and the service time t_g^i at time step k. We denote the matrix and its elements as below:

P_i(k) = [ p_i^{11}(k)  p_i^{12}(k)  p_i^{13}(k)
           p_i^{21}(k)  p_i^{22}(k)  p_i^{23}(k)
           p_i^{31}(k)  p_i^{32}(k)  p_i^{33}(k) ]

Next we specify the probabilities used to evaluate the road occupancy; the nine probabilistic events are defined as:

p_i^{mn}(k) ≜ Pr( s_i(k+1) = n | s_i(k) = m )    (10)

with m, n = 1, 2, 3. For simplification purposes, we only present the analysis of p_i^{11}(k); the rest follow a similar line of thought.

As the conditional probability in Eq. (10) shows, p_i^{11}(k) denotes the probability that s_i(k+1) is state 1 given that s_i(k) is state 1, that is, O_i(k) belongs to the interval [0, a). When α_i · Z_i(k) is larger than a, O_i(k+1) must be equal to α_i · Q_{i,in}(k) according to Eqs. (5) and (6), where Q_{i,in}(k) is subject to a Poisson distribution with mean value λ_i(k). So the corresponding state probability is:

p_i^{11}(k) = Σ_{n=1}^{floor(a/α_i)} λ_i(k)^n · e^{−λ_i(k)} / n!    (11)

But when α_i · Z_i(k) is less than a, the road occupancy O_i(k+1) has two cases:

O_i(k+1) = { α_i · Q_{i,in}(k)                          if 0 ≤ O_i(k) < α_i · Z_i(k) < a
             O_i(k) + α_i · Q_{i,in}(k) − α_i · Z_i(k)  if 0 ≤ α_i · Z_i(k) < O_i(k) < a }    (12)

According to the law of total probability, p_i^{11}(k) can be written as follows on the premise that s_i(k) is equal to 1:

p_i^{11}(k) = Pr( s_i(k+1) = 1 | O_i(k) ∈ [0, α_i·Z_i(k)) ) · Pr( O_i(k) ∈ [0, α_i·Z_i(k)) )
            + Pr( s_i(k+1) = 1 | O_i(k) ∈ [α_i·Z_i(k), a) ) · Pr( O_i(k) ∈ [α_i·Z_i(k), a) )    (13)

Expanding the above equation, we get:

IFAC CTS 2016 70 May 18-20, 2016. Istanbul, Turkey

p11 i (k)

Yunwen Xu et al. / IFAC-PapersOnLine 49-3 (2016) 067–072

u n αi · Zi (k)  λi (k) · e−λi (k) · + = a n! n=1

a − αi · Zi (k) · a · (u − Zi (k))

u+Zi (k) min(n,u)





n=Zi (k) i=Zi (k)

n−i

λi (k) e−λi (k) (n − i)!

light, such as 1 : 1, 1 : 2, 1 : 3 for two phases, instead of continuous ratios. The action space is denoted as Aj = {aj1 , aj2 , · · · , ajl }, where for each state, there are l available actions, i.e., A(s) = Aj , s ∈ sj , hence, the policy space is denoted as D = ×s∈S A(s).

(14)

with u = f loor(a/αi ). The mean value of Qi (k + 1) for this event can be written as: u 1  ¯ 11 (k + 1) = Oi (k + 1) · Pr(Oi (k + 1) = n · αi )(15) Q i αi n=0 In similar fashion, we can calculate the whole state transform matrix Pi (k) and the average number of vehicles ¯ i (k + 1) for each probabilistic event, which is denoted Q as:  11  ¯ 12 (k + 1) Q ¯ 13 (k + 1) ¯ (k + 1) Q Q i i i ¯ i (k + 1) =  Q ¯ 21 (k + 1) Q ¯ 22 (k + 1) Q ¯ 23 (k + 1)  Q i i i 31 32 ¯ ¯ 33 (k + 1) ¯ Qi (k + 1) Qi (k + 1) Q i

j,di j,di i pj,d 33→11 (k) p33→12 (k) · · · p33→33 (k)

in which m 1 n1 2 n2 i (k) · pm (k) pj,d m1 n1 →m2 n2 (k) = p1 2

and if the current state si (k) is determinate, we can get ¯ si (k) (k + 1) for next interval by: the mean queue length Q i ¯ m (k + 1) = Q i

3 

n=1

mn ¯ mn Q i (k + 1) · pi (k)

The cost(or reward) matrix F j,di (k) represents the penalty for the congestion level, denoted as:  j,di T j,di j,di F j,di (k) = f11 (k) f12 (k) · · · f33 (k)


where we can calculate the penalty value of each state with policy d_i by:

f_{mn}^{j,d_i}(k) = δ_1 · (Q̄_1^m(k+1) + Q̄_2^n(k+1)) + δ_2 · |Q̄_1^m(k+1) − Q̄_2^n(k+1)|        (19)

Therefore, the Markov state transition matrix for one road is obtained; it needs to be updated according to the time-varying traffic flow and the different service times.

with m, n = 1, 2, 3, where δ_1 and δ_2 are weight coefficients that can be specified for a particular traffic control problem.
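A numerical sketch of Eq. (19) follows. The function name and argument names are illustrative, and reading the second (unbalance) term as an absolute difference is an assumption about the bracketed term in the extracted text.

```python
import numpy as np

def cost_matrix(q1_mean, q2_mean, delta1=1.0, delta2=0.5):
    """Sketch of Eq. (19): penalty f_mn for each pair of congestion levels,
    given the mean queue lengths of the two representative roads.
    q1_mean and q2_mean are length-3 vectors (one entry per level m, n = 1..3);
    the absolute-difference term penalises unbalanced queues."""
    q1 = np.asarray(q1_mean, dtype=float)[:, None]   # shape (3, 1), varies with m
    q2 = np.asarray(q2_mean, dtype=float)[None, :]   # shape (1, 3), varies with n
    f = delta1 * (q1 + q2) + delta2 * np.abs(q1 - q2)
    return f.reshape(-1)                             # flattened to [f_11, f_12, ..., f_33]
```

The flattened ordering matches the state ordering (1,1), (1,2), ..., (3,3) used for the intersection model.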

3.4 Markov state transition model for an intersection

Considering all the actions in A_j, we have the basic state transition matrix P^j(k) denoted as:

P^j(k) = [P^{j,d_1}(k), ···, P^{j,d_l}(k)]^T

The Markov state transition model for intersection j is based on the state transition matrices of its in-roads I_j, and its states are combinations of the states of the in-roads. Here we take a traffic intersection with four in-roads and two traffic signal phases, one for the north-south direction and the other for east-west, for illustration. We reduce the computational burden in two ways.

and the basic cost matrix F^j(k) defined as:

F^j(k) = [F^{j,d_1}(k), ···, F^{j,d_l}(k)]^T

For any other policy d ∈ D with d ≠ d_i, i = 1, 2, ..., l, the state transition matrix and cost matrix can be obtained from P^j(k) and F^j(k).

Since the dimension of the state space for intersection j depends on the number of in-roads, at each phase we choose only one representative road: the one with the maximum mean number of vehicles at the end of the red light during each traffic cycle. Hence, the states of intersection j are characterized as:

s_j(k) = (1,1)  if s_1(k) = 1, s_2(k) = 1
         (1,2)  if s_1(k) = 1, s_2(k) = 2
           ⋮
         (3,3)  if s_1(k) = 3, s_2(k) = 3        (16)




 (1, 1)    (1, 2) sj (k) = ..    . (3, 3)

According to subsection 3.3, for any action a_{ji} ∈ A_j, i = 1, 2, ···, l, we can calculate two state transition matrices, denoted P_1(k) and P_2(k), and the corresponding two average vehicle number matrices, denoted Q̄_1(k+1) and Q̄_2(k+1), for the two selected roads. Then the transition probability for intersection j under policy d_i = ×_{s∈S} a_{ji} can be written as:

P^{j,d_i}(k) = [ p_{11→11}^{j,d_i}(k)  p_{11→12}^{j,d_i}(k)  ···  p_{11→33}^{j,d_i}(k) ;
                p_{12→11}^{j,d_i}(k)  p_{12→12}^{j,d_i}(k)  ···  p_{12→33}^{j,d_i}(k) ;
                  ⋮                     ⋮                   ⋱        ⋮
                p_{33→11}^{j,d_i}(k)  p_{33→12}^{j,d_i}(k)  ···  p_{33→33}^{j,d_i}(k) ]        (17)

The procedure to control the traffic signal of intersection j can be briefly described as follows. At the beginning of each cycle T_m, the Markov state transition model, composed of P^j(k) and F^j(k), is calculated and updated based on the information available from the last M cycles with time interval T, either measured or estimated. Then, by applying the PI algorithm, an optimal policy that minimizes the long-run average cost (or reward) η^{j,d}, d ∈ D, is found; it is implemented only for the next cycle T_m, and at each time interval T within that cycle, the control signal can be easily chosen according to the real-time state information and the obtained optimal policy.
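The average-cost optimization in this procedure can be sketched with a textbook policy-iteration loop. The version below (illustrative names; policy evaluation by solving the gain/bias equations with the bias fixed at h[0] = 0) is a generic stand-in and does not reproduce the paper's sensitivity-based PI algorithm exactly.

```python
import numpy as np

def policy_iteration(P, c, max_iter=100):
    """Sketch of average-cost policy iteration for a finite unichain MDP.
    P[a] is the S x S transition matrix under action a, c[a] the length-S
    stage-cost vector; returns a policy minimising the long-run average cost
    eta, together with that eta."""
    n_actions, S = len(P), P[0].shape[0]
    policy = np.zeros(S, dtype=int)
    for _ in range(max_iter):
        # policy evaluation: solve eta + h = c_d + P_d h, with h[0] fixed to 0
        Pd = np.array([P[policy[s]][s] for s in range(S)])
        cd = np.array([c[policy[s]][s] for s in range(S)])
        A = np.zeros((S + 1, S + 1))
        A[:S, 0] = 1.0                      # column multiplying the gain eta
        A[:S, 1:] = np.eye(S) - Pd          # (I - P_d) h
        A[S, 1] = 1.0                       # normalisation row: h[0] = 0
        b = np.append(cd, 0.0)
        x = np.linalg.lstsq(A, b, rcond=None)[0]
        eta, h = x[0], x[1:]
        # policy improvement: greedy w.r.t. the bias values
        new_policy = np.array([
            int(np.argmin([c[a][s] + P[a][s] @ h for a in range(n_actions)]))
            for s in range(S)])
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```

In the paper's setting, S would be the 9 intersection states and the actions the candidate green-split ratios, with P and c rebuilt from traffic data at the start of every cycle T_m.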


where s_1(k) and s_2(k) represent the states of the two selected roads respectively, and the total number of states of intersection j is 3^2 = 9. But if we considered all the in-roads, the number of states would be 3^4 = 81!
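Under the assumption that the two representative roads evolve independently, the 9-state intersection matrix is exactly a Kronecker product of the two 3x3 road matrices; the sketch below uses illustrative names.

```python
import numpy as np

def intersection_transition(P1, P2):
    """Compose the 9x9 intersection transition matrix from the two 3x3
    road matrices, assuming the representative roads are independent:
    the entry for (m1,n1) -> (m2,n2) equals P1[m1,m2] * P2[n1,n2], with
    the pair states ordered (1,1), (1,2), ..., (3,3) as in the text."""
    return np.kron(P1, P2)
```

Row-stochasticity is preserved automatically, since each row sum of the product is the product of the two road-matrix row sums.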

For a traffic network, traffic signal control proceeds in a similar manner. Neighboring intersections are assumed not to interact; as a result, adjacent intersections can be "isolated" and their respective control policies calculated separately.

With respect to the action, considering the difficulty of solving a high-dimensional problem and the real-world implementation, we choose several typical ratios of the traffic light, such as 1:1, 1:2, 1:3 for two phases, instead of continuous ratios. The action space is denoted A_j = {a_{j1}, a_{j2}, ···, a_{jl}}, where for each state there are l available actions, i.e., A(s) = A_j, s ∈ s_j; hence, the policy space is denoted D = ×_{s∈S} A(s).


Fig. 1. The 9-node grid network

Fig. 2. Car number and total spend time for scenario 1: (a) car number in network; (b) total spend time

4. NUMERICAL CASE STUDY

Using the same set of traffic demands, the PI algorithm and the traditional fixed-time control are applied to the network shown in Fig.1. In the fixed-time control cases, the phases and the cycle time are the same as for the PI algorithm, and the traffic signal ratio is set to 1:1. In the PI algorithm, the objective is to minimize the total number of vehicles, and the reward matrix for each intersection is calculated according to Eq.(19), where the weight coefficients δ_1 and δ_2 are set to 1 and 0.5 for the first two scenarios, and to 1 and 0.05 for the third scenario, respectively. To provide statistical significance for the simulation results, a number of CORSIM runs (each with a different random seed) are collected for each scenario: twenty for the first scenario and seventy for the second and third scenarios, for both the PI algorithm and fixed-time control.

In this section, the proposed model is evaluated by simulation under different traffic scenarios. Moreover, the PI algorithm is used to control a typical urban traffic network, and the performance of our approach is compared with fixed-time control.

4.1 Simulation setting

In this paper, we use CORSIM, a widely used microscopic traffic simulator developed by the Federal Highway Administration, to simulate the real traffic environment, and the optimal policy found by our approach decides the control inputs for the traffic signals in CORSIM. The test network, containing 9 intersections, is shown in Fig.1. The road segments in the network have different lengths ranging from 306m to 428m, each with two lanes and three turning options. The average vehicle length plus the safety distance between vehicles is 6.5m. The free-flow speed on the road is 50km/h, and the minimum headway is assumed to be 1.8 seconds. The time interval for traffic light control of each intersection is set to 60s, with two traffic signal phases for the east-west and north-south directions. Hence, we choose seven actions: 15:45, 20:40, 25:35, 30:30, 35:25, 40:20 and 45:15. Furthermore, the cycle time for the model update is defined as 15min, i.e., T_m = 15T.

The simulations are carried out under three scenarios with different traffic demands, characterized by the network inputs through 12 source links, denoted I_1, ..., I_12, as shown in Fig.1. The first two scenarios have uniform (balanced) demand among all the source links and reflect low and high congestion levels respectively. The third scenario also represents a high congestion level but with unbalanced traffic demand. The simulation time for all scenarios is set to 4 hours; however, one or two hours with no demand input are added to the end of each simulation to make sure that the traffic can be cleared. The detailed traffic demand variations of the three scenarios are illustrated in Table 1.
Table 1. Network inflow for source links

Scenario    I_1−4, I_6−9, I_11 (veh/h)    I_5, I_10, I_12 (veh/h)
1           600-1000-800-200              600-1000-800-200
2           1000-1600-1200-500            1000-1600-1200-500
3           400-1500-1000-500             800-3000-2000-1000

Under the first scenario, the total vehicle number for the whole traffic network over time is shown in Fig.2(a), where the means and standard deviations (SD) of twenty simulations are demonstrated: "FT" stands for fixed-time control, and "+SD" and "−SD" denote the mean value plus and minus the SD, respectively. Another commonly used metric, the total spend time, which is the accumulation of the vehicle number over time, is depicted in Fig.2(b) with the same notation. From the results shown in Fig.2(a) and Fig.2(b), we can see that the PI algorithm performs slightly worse than fixed-time control under low traffic demand. This is mainly caused by the coarse division of the state space: because of the low traffic demand, the states of the intersections tend to stay in free flow for long periods, so the PI algorithm fails to play an active role in the evolution of the traffic. When the traffic demands are high and the traffic network is more crowded, the PI algorithm achieves better performance than fixed-time control in a statistical sense. Under scenario 2, with balanced traffic demand, the results are shown in Fig.3(a) and Fig.3(b): the upper bound (mean value plus SD) of both the total vehicle number and the total spend time of PI is better than the mean value of fixed-time control, and the total spend time is reduced dramatically, by about 30% on average. Fig.4(a) and Fig.4(b) show the performance under scenario 3, with unbalanced traffic demand. From the results, we can see that the upper bound of the total vehicle number and total spend time of PI is close to the mean value of fixed-time control, and the total spend time is reduced by about 18.75% on average.

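The signal-timing configuration from the simulation setting can be written down directly; the variable names below are illustrative, while the numbers (60s interval, 15min update cycle, seven green splits) come from the text.

```python
T = 60                     # traffic light control interval (s), equal to one signal cycle
Tm = 15 * T                # model-update cycle: 15 min
# seven candidate green splits (east-west : north-south), in seconds
actions = [(15, 45), (20, 40), (25, 35), (30, 30),
           (35, 25), (40, 20), (45, 15)]

def split_ratio(action):
    """Return the east-west green fraction of the cycle for one action."""
    ew, ns = action
    return ew / (ew + ns)
```

At the start of every Tm the model is refit and PI is rerun; within the cycle, each interval T simply looks up one of these seven actions from the current policy.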

4.2 Result analysis

The results demonstrated above indicate that the PI algorithm significantly outperforms fixed-time control on average in the cases of high traffic demand; for the low-demand case it does not perform as well, but an increase in total spend time of only 3.19% on average compared with fixed-time control is still acceptable.


Fig. 3. Car number and total spend time for scenario 2: (a) car number in network; (b) total spend time
Fig. 4. Car number and total spend time for scenario 3: (a) car number in network; (b) total spend time

5. CONCLUSION

This paper formulates the traffic signal control problem as an MDP based on the proposed Markov state transition model, which provides the transition probabilities and cost functions for each optimization. A sensitivity-based PI algorithm is employed to find the optimal policy for the controlled Markov process. Simulation results under three different scenarios for a small traffic network indicate that this approach performs better than traditional fixed-time control, especially under conditions of high traffic demand.