Reinforcement learning approach for coordinated passenger inflow control of urban rail transit in peak hours


Transportation Research Part C 88 (2018) 1–16




Zhibin Jiang a, Wei Fan a,b,⁎, Wei Liu c, Bingqin Zhu d, Jinjing Gu a

a College of Transportation Engineering, Key Laboratory of Road and Traffic Engineering of the State Ministry of Education, Tongji University, 4800 Cao'an Rd., Shanghai 201804, PR China
b USDOT Center for Advanced Multimodal Mobility Solutions and Education, Department of Civil and Environmental Engineering, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
c Technical Center of Shanghai Shentong Metro Group Co., Ltd., 909 Guilin Road, Xuhui, Shanghai 201103, PR China
d Operation Management Center of Shanghai Shentong Metro Group Co., Ltd., 222 Hengtong Road, Jing'an, Shanghai 200070, PR China

⁎ Corresponding author at: Department of Civil and Environmental Engineering, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA. E-mail address: [email protected] (W. Fan).

https://doi.org/10.1016/j.trc.2018.01.008
Received 6 April 2017; Received in revised form 27 December 2017; Accepted 6 January 2018; Available online 30 January 2018
0968-090X/ © 2018 Elsevier Ltd. All rights reserved.

ARTICLE INFO

Keywords: Urban rail transit; Train capacity; Operation safety; Coordinated passenger inflow control; Reinforcement learning

ABSTRACT

In peak hours, when the limited transportation capacity of urban rail transit is not adequate to meet travel demands, the density of the passengers waiting at the platform can exceed the critical density of the platform. A coordinated passenger inflow control strategy is required to adjust/meter the inflow volume and relieve some of the demand pressure at crowded metro stations, so as to ensure both operational efficiency and safety at such stations for all passengers. However, such a strategy is usually developed by the operation staff at each station based on their practical working experience. As such, the best strategy/decision cannot always be made and can sometimes even be highly undesirable, because this practice cannot account for the dynamic performance of all metro stations in the entire rail transit network. In this paper, a new reinforcement learning-based method is developed to optimize the inflow volume during a certain period of time at each station with the aim of minimizing the safety risks imposed on passengers at the metro stations. Basic principles and fundamental components of reinforcement learning, as well as the reinforcement learning-based problem-specific algorithm, are presented. A simulation experiment based on a real-world metro line in Shanghai is constructed to test the performance of the approach. Simulation results show that the reinforcement learning-based inflow volume control strategy is highly effective in minimizing the safety risks by reducing the frequency of passengers being stranded. Additionally, the strategy also helps to relieve the passenger congestion at certain stations.

1. Introduction

With the rapid development of urban rail transit (URT) in China, the limited transportation capacity is not adequate to meet the booming travel demands, especially during peak hours. It is common to see a number of people gather at the platform and fail to get aboard when the train arrives during the rush hours. In particular, if the density of the passengers waiting at the platform exceeds the critical density of the platform, it will be extremely challenging to ensure the safety of passengers and efficient daily operations. Increasing the capacity is a straightforward solution. However, it is not always feasible in practice due to potentially long construction periods, budget limitations, right-of-way constraints, and operational and safety restrictions including line capacity and the maximum possible amount of rolling stock. Under such circumstances, passenger inflow control can be a potentially effective short-term alternative to ensure operational efficiency and safety while also lowering the pressure on stations. By setting railings outside metro stations, shutting down part of the ticket vending machines (TVM) or entrance gates, and/or closing off some entrances, the speed and flow rate of passengers entering the metro stations during a certain period of time can be effectively limited, so that the number of passengers waiting at the platform per unit of time is under control. In fact, these measures for passenger inflow control have already been taken in the daily operation of metro lines in major cities of China such as Beijing, Shanghai and Guangzhou (see Table 1 below). Unfortunately, such present control strategies are usually developed based on the engineering judgement and subjective work experience of the operation staff at each station, without any help from mathematical programming and scientific methods, leading to potential deficiencies in overall and dynamic performance. Coordinated passenger inflow control, in which a quick response to the demand flow dynamics is typically required, is therefore highly needed to maintain the safety of all passengers.

Table 1
Passenger inflow control for URT in major cities in China.

City | Number of operating lines | Number of inflow control lines | Number of inflow control stations | Time periods for passenger inflow control
Beijing | 18 | 14 | 75 | Morning peak 7:00–9:00; evening peak 17:00–19:00
Shanghai | 14 | 13 | 50 | Morning peak 7:30–9:00; evening peak 17:30–19:00
Guangzhou | 9 | 6 | 38 | Morning peak 8:00–9:00; evening peak 17:30–19:00

Data source: Beijing Metro Operation Co., Ltd. & Guangzhou Metro Operation Co., Ltd.: collected in April 2016; Shanghai Metro Operation Co., Ltd.: collected in December 2015.

In this paper, a novel approach is developed which is based on reinforcement learning, more specifically Q-learning. Based on the simulation of the interaction between passengers and trains, the reinforcement learning algorithm automatically learns when and at which stations to enforce inflow control and the optimal control rates at each control station (measured per unit of time), with the aim of minimizing the safety risks at metro stations of a single line. As a powerful tool for solving complex sequential decision-making problems in control theory (Gosavi, 2009), reinforcement learning is able to make quick responses to the dynamic changes of the network environment and has already been successfully used to solve similar problems such as the traffic flow control optimization problem on highways. The main contributions of this paper can be summarized as follows:

1. Real-time data about some key indicators, including the number of people waiting at the platform and the frequency of passengers being stranded, are used to evaluate the safety risks. On the one hand, more people waiting at the platform can certainly pose great challenges while increasing the possibility of pushing some passengers off of the platform and/or stepping on those who may fall onto the ground under overcrowded situations, which may cause significant safety risks at the platform. On the other hand, a higher frequency of passengers being stranded can result in more anxious and impatient passengers, especially in morning peak hours, many of whom may force/rush into the train (thereby keeping the doors from closing and causing significant delays in train departure), and this can sometimes even lead to severe accidents. As such, both the number of people waiting at the platform and the frequency of passengers being stranded are used to evaluate the safety risks in this paper.

2. A model of coordinated passenger inflow control is formulated to minimize the penalty value of passengers being stranded on a whole metro line considering real-time passenger trip chains and states.
The interaction between passengers and trains allows for a quantitative evaluation of their respective available capacities. It can be applied to identify the entry rate of passengers at each station under dynamic passenger demand.

3. The reinforcement learning approach developed in this paper can be applied to formulate strategies dictating when and at which stations to enforce inflow control and the corresponding control rates, so as to ensure the safety of the entire metro line.

The rest of the paper is organized as follows. Section 2 reviews the previous studies on passenger inflow control and the application of reinforcement learning. Section 3 describes the coordinated passenger inflow control problem for a single metro line during peak hours. Section 4 builds an optimization model for the coordinated passenger inflow control problem on a whole line. Section 5 presents a unique approach to the coordinated passenger inflow control with reinforcement learning by systematically explaining the fundamental components and the rationale behind the specific design choices. Section 6 provides a real-world example and discusses how it works to support the coordinated passenger inflow control. Finally, conclusions are drawn and future research directions are given in Section 7.

2. Literature review

The coordinated passenger inflow control of URT is a complex and very challenging optimization problem which involves various essential components including passengers, trains, infrastructure and strategies.


Previous studies regarding passenger inflow control can be classified into two major aspects: (1) large passenger flow organization, and (2) the optimization model and corresponding solution approach.

In terms of the first dimension, existing research efforts can be classified into two levels: the macroscopic and the microscopic level. At the macroscopic level, research commonly developed a general plan for large passenger flow organization, such as the boarding limiting strategy (Delgado et al., 2012). At the microscopic level, studies mainly focused on the simulation and analysis of passenger travel behavior at a single metro station by using different microscopic models such as the social force model (Wan et al., 2014), cellular automata (Feng et al., 2013) and multi-agent models (Martinez-Gil et al., 2014; Chen et al., 2017). However, such simulation-based research work neglects the interaction between passengers and trains, is confined to a single metro station, and fails to account for the necessary coordination between stations along the same metro line.

The coordinated passenger inflow control of URT is a particular and newly arising problem in China. The optimization models formulated and the corresponding solution approaches developed to solve such models in previous studies are insufficient. Traditional operations research methods (e.g., multi-objective linear programming) are among the most commonly used. The control stations, the inflow control volume at each station and the time to implement control strategies are optimized and determined by establishing coordinated passenger inflow control models with different optimization objectives. Such objectives are usually relevant to the passengers, the operators or a combination of both, such as minimizing the number of delayed passengers, maximizing the matching degree of capacity and demand (Yao et al., 2015), minimizing the passenger loss delays, maximizing the passenger person-kilometers traveled (Zhao et al., 2014), or maximizing passenger profit (Jiang et al., 2017). It is important to note that this paper intends to minimize the safety risks by reducing the frequency of passengers being stranded. Perhaps most important, this paper uses the origin-destination passenger flow matrix collected from AFC data instead of the aggregated flows coming to and departing from each station at the station level. As such, more reliable computational results with a higher level of accuracy can be obtained compared to those from previous research efforts.

In fact, one type of passenger inflow control is to optimize the inflow volume at each station in a certain period of time. It is similar to ramp metering, which regulates the number of vehicles entering a segment of highway. Due to the similarity between the two problems, the related work on the various approaches to the traffic flow control problem is reviewed and summarized. Linear programming was commonly used in the control of freeway systems during peak traffic periods, with the objective of maximizing the total inflow volume under the constraints of the highway capacity (Wattleworth and Berry, 1965). However, it is not suitable for solving the problem on a large scale when the traffic flow fluctuates due to emergencies. Consequently, researchers tried to apply artificial intelligence in the case of coordinated responsive metering, including artificial neural networks (Spall and Chin, 1994; Kwon and Stephanedes, 1994) and reinforcement learning (Zhu and Ukkusuri, 2014; Walraven et al., 2016).
In addition, some research studies also focused on the effect of passenger inflow control. For example, a simulation of macroscopic pedestrian flows was conducted to test how passenger behavior evolves when passenger flow control is employed after the occurrence of special events (Bauer et al., 2007).

Since the coordinated passenger inflow control problem is essentially a large-scale nonlinear combinatorial optimization problem with a large number of decision variables, one cannot make dynamic decisions in a short time period by using traditional operational research methods. The review of the literature indicates the need to develop a new method for coordinated passenger inflow control. Reinforcement learning, as a powerful tool for solving complex sequential decision-making problems in control theory (Gosavi, 2009), has already been effectively used in several transportation problems. Examples include train rescheduling (Šemrov et al., 2016), traffic signal control (Khamis and Gomaa, 2014), route planning (Zolfpour-Arokhlo et al., 2014) and traffic flow control (Zhu and Ukkusuri, 2014; Walraven et al., 2016). According to the results of the research efforts mentioned above, reinforcement learning is at least equivalent and often superior to traditional operations research methods in terms of overall dynamic decision-making performance and computational speed. Meanwhile, the successful applications of reinforcement learning to ramp metering make it possible to deal with a similar problem, namely the coordinated passenger inflow control considered in this paper.

3. Problem description

The problem of passenger congestion in URT occurs if the passenger demand volume exceeds the transport capacity. Consequently, when the number of waiting passengers exceeds the capacity of the arriving train, those who cannot get aboard are stranded once, with a further surge in the number of passengers waiting at the platform. If they fail to get on the next train, they will be stranded twice. If there are several passengers being stranded more than twice at a station, it will lead to prolonged waiting times, which can pose serious problems. As mentioned in the introduction section, increasing the capacity is not always a feasible solution, and therefore other measures such as controlling the inflow volume at certain stations should be taken to mitigate such passenger congestion issues. Fig. 1 depicts the effect of passenger inflow control on keeping passengers in order at the station by setting railings outside metro stations and closing off some entrances.

It is important to note that one should not attribute the problem of passenger flow congestion at a station only to its own inflow volume. In fact, for a single metro line, congestion usually occurs when the train capacity is quickly reached at upstream stations, even if the station's own inflow volume is small. For illustration purposes, a hypothetical case is designed and used here to show the importance of coordinated passenger inflow control. As shown in Fig. 2(a), due to the large inflow volume of station B, the train capacity is fully occupied by waiting passengers at station B. If no passenger on board gets off the train at the downstream stations including C, D and E, those who wait at stations C, D and E will not be able to get aboard, leading to prolonged passenger boarding and congestion at these three stations.
In order to relieve the congestion, the inflow volume of station B should be controlled so that the train capacity is reserved for passengers at the downstream stations who otherwise would have suffered from the congestion; this is the so-called coordinated passenger inflow control.


Fig. 1. The effect of passenger inflow control on keeping passengers in order at the station.

Fig. 2. The effect of coordinated inflow control on balancing the utilization of the train capacity: (a) without control, stations C, D and E suffer from passenger congestion; (b) under control, the passenger congestion at stations C, D and E is relieved.

It is obvious that coordinated passenger inflow control can greatly help balance the utilization of the train capacity and improve the operational reliability, as shown in Fig. 2(b).

In this paper, the problem of formulating the coordinated passenger inflow control strategy for a single metro line can be visualized as in Fig. 3. For the purpose of addressing the congestion concern in a totally oversaturated condition, a virtual platform (named "Outside" in Fig. 3(b)) is added to the related original platform of each station under control along this single metro line. If the demand volume exceeds the transport capacity, specific control measures should be taken at certain stations. By doing so, the passengers arriving at these stations first wait outside the station and then enter the platform according to the inflow control rate and their arrival sequence. In other words, the inflow control rate determines the inflow control volume in a certain period of time, which represents those who are forbidden to enter the station and wait outside.

According to the passenger demand volume and its characteristics during morning peak hours on Shanghai Metro Line 6, as shown in Fig. 4 (data from the Automatic Fare Collection (AFC) system), a large proportion of passengers originate from the suburbs and head towards the city center, leading to a large flow in one direction of a single metro line. Fig. 4(a) shows the inbound volume of different stations in both the up and down directions, while Fig. 4(b) shows the load ratio of different sections in both directions. It can be concluded that the metro line suffers from a heavy flow in the up direction, and control strategies should be developed at the stations whose inbound volume in the up direction (i.e., heading towards the city center) is high. Under such circumstances, the passengers arriving at a control station in both directions are restricted from entering the platform. However, this may not have a strong effect on the passengers in the down direction, since the inbound volume of the down direction is much lower than that of the opposite direction.


Fig. 3. An example of coordinated passenger inflow control: (a) without control; (b) under control, with the inflow control volume held outside the station.

A coordinated passenger inflow control strategy aims to prevent passengers from being stranded several times and to minimize the relevant safety risks (as previously explained) by relieving the congestion at the platform. Therefore, the strategy for coordinated passenger inflow control can be evaluated based on the number of people waiting at the platform and the frequency of being stranded. In that regard, the problem to be solved in this paper is how to design the best possible coordinated passenger inflow control strategy for a single metro line consisting of several stations. More specifically, the decisions to be optimized are when and at which stations to take control measures and what the optimal control rate (measured per unit of time) is at each control station.
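To make the coordination effect concrete, the following sketch replays the hypothetical case of Fig. 2; all demand figures, the train capacity and the cap at station B are invented for illustration and are not data from this paper. Capping the inflow at the heavily loaded upstream station B reserves train capacity for the downstream stations.

```python
# Hypothetical illustration of the effect shown in Fig. 2 (all numbers are invented):
# a single train of fixed capacity serves stations A-F, nobody alights, and station B's
# demand alone can fill the train.

def stranded_per_station(demand, capacity, inflow_cap=None):
    """Return passengers left on the platform at each station for one train run."""
    onboard, stranded = 0, {}
    for station, waiting in demand.items():
        if inflow_cap is not None:
            # passengers above the cap are held outside the station, not on the platform
            waiting = min(waiting, inflow_cap.get(station, waiting))
        boarding = min(waiting, capacity - onboard)   # limited by residual train capacity
        onboard += boarding
        stranded[station] = waiting - boarding
    return stranded

demand = {"A": 100, "B": 900, "C": 100, "D": 100, "E": 100, "F": 0}
capacity = 1000

print(stranded_per_station(demand, capacity))              # C, D and E are all stranded
print(stranded_per_station(demand, capacity, {"B": 600}))  # capacity reserved downstream
```

In the capped run, the passengers held back at station B wait outside the station rather than on the platform, which is exactly the role played by the virtual "Outside" platform introduced with Fig. 3(b).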

4. Mathematical model

4.1. Sets and parameters

For a passenger j, the trip chain of interest spans from entering the origin station to leaving the destination platform, as shown in Fig. 5. Along the arrival and departure process of passenger j at a station, six key time points (T^A_{i,j} to T^F_{i,j}) and three different states of a passenger (i.e., waiting outside the origin station, waiting at the origin station platform, and remaining on the train) are identified. The passengers and their states at each time point jointly characterize the station loading capacity, the platform loading capacity, the cumulative number of passengers in each state, the train available capacity, the train timetable and the passenger inflow control strategies at each station. It is important to note that, different from models with a given total inbound passenger demand (Jiang et al., 2017), the model in this paper considers demand at the individual passenger level and matches such demand with the train capacity. In other words, the model keeps track of each passenger's travel time chain and the cumulative number of passengers in each state to represent the interaction between passengers and service facilities. For the convenience of model formulation, the relevant sets and parameters are listed in Table 2.

4.2. Variables

The variables used in this paper are listed in Table 3.

4.3. Objective function

For the coordinated passenger inflow control of URT, the objective is to minimize the penalty value of passengers being stranded on a whole metro line. The objective function is given by Eq. (1):


Fig. 4. Passenger demand characteristics: (a) inbound flow in both the up and down directions and (b) load ratio of different sections in both the up and down directions.

Fig. 5. Arrival and departure process of passenger j at a station.


Table 2
Passenger inflow control-related sets and notations.

Sets/parameters | Definition
I | Set of stations (platforms), indexed by i
J | Set of passengers, indexed by j
K | Set of trains, indexed by k
M | Set of times, indexed by t
N | Set of passenger inflow control time steps, indexed by n, n = 1, 2, ..., (T1 − T0)/t^c
[T0, T1] | Study period
t^c | Length of a passenger inflow control time step
s_j^O | Origin station of passenger j
p_j^O | Origin platform of passenger j
s_j^D | Destination station of passenger j
p_j^D | Destination platform of passenger j
r_{k,(i1,i2)} | Trip time it takes to travel from platform i1 to platform i2 by train k
T^A_{i,j} | Time when passenger j arrives at the station, i = s_j^O
t^E | Walking time from the origin station gate to the origin platform
T^G_{i,j} | Time when passenger j arrives at platform i without passenger inflow control, T^G_{i,j} = T^A_{i,j} + t^E, i = s_j^O
T^d_{i,k} | Departure time of train k at platform i
C | Train capacity
φ | Maximum load factor of trains
A_i | Capacity of platform i
θ | Maximum load factor of platforms
q_{i,j,t} | Binary value; if passenger j is inside station i at time t then q_{i,j,t} = 1, otherwise q_{i,j,t} = 0

Table 3
Passenger inflow control-related variables.

Variables | Definition
β_j | Amplification coefficient of passenger j
P^p_{i,t} | Cumulative number of passengers waiting at platform i by time t
P^e_{i,n} | Cumulative number of passengers entering station i at time step n
P^a_{i,n} | Cumulative number of passengers arriving at station i at time step n
P^r_{k,t} | Number of passengers remaining on train k at time t
T^B_{i,j} | Time when passenger j enters origin station i
T^C_{i,j} | Time when passenger j arrives at origin platform i
T^D_{j,k} | Time when passenger j boards train k
T^E_{i,j} | Time when passenger j alights from the train at destination platform i
T^F_{i,j} | Time when passenger j walks out of the destination station
q^1_{i,j,t} | Binary variable; if passenger j is outside origin station i at time t then q^1_{i,j,t} = 1, otherwise q^1_{i,j,t} = 0
q^2_{i,j,t} | Binary variable; if passenger j is at origin platform i at time t then q^2_{i,j,t} = 1, otherwise q^2_{i,j,t} = 0
q^3_{j,k,t} | Binary variable; if passenger j remains on train k at time t then q^3_{j,k,t} = 1, otherwise q^3_{j,k,t} = 0
q^4_{j,k} | Binary variable; if T^C_{i,j} ≤ T^d_{i,k}, i = p_j^O, and train k can load passenger j then q^4_{j,k} = 1, otherwise q^4_{j,k} = 0
R_{i,(T3,T4)} | Number of departing trains at platform i from time T3 to time T4
a_{i,n} | Passenger inflow control rate at station i at time step n
d_j | Frequency (i.e., times) of passenger j being stranded at the origin station

$$\min Z=\sum_{i\in I}\sum_{j\in J}\beta_j\,\bigl(T^{D}_{j,k}-T^{G}_{i,j}\bigr) \qquad (1)$$

$$\beta_j=\begin{cases}e^{2d_j}, & 0\le d_j\le 3,\ j\in J\\ e^{8}, & d_j\ge 4,\ j\in J\end{cases} \qquad (2)$$

$$d_j=R_{i,(T^{G}_{i,j},\,T^{D}_{j,k})}-1,\quad j\in J,\ k\in K,\ i=p^{O}_j \qquad (3)$$
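As a small numerical illustration of Eqs. (2) and (3), the sketch below computes d_j and β_j for one passenger. The departure times, headway and helper names are hypothetical; only the arithmetic follows the formulas above.

```python
import math
from bisect import bisect_left, bisect_right

def stranded_times(t_G, t_D, departures):
    """d_j per Eq. (3): trains departing the origin platform in [t_G, t_D], minus one."""
    # departures: sorted train departure times at the passenger's origin platform (seconds)
    return bisect_right(departures, t_D) - bisect_left(departures, t_G) - 1

def amplification(d_j):
    """beta_j per Eq. (2): grows exponentially with d_j and is capped at e^8."""
    return math.exp(2 * d_j) if d_j <= 3 else math.exp(8)

# Hypothetical example: the platform is reached at t = 0 s, the boarded train leaves at
# t = 420 s, and trains depart every 140 s (the Line 6 peak headway quoted in Section 6.1).
departures = [140, 280, 420, 560]
d = stranded_times(0, 420, departures)     # -> 2 (the 140 s and 280 s trains were missed)
penalty = amplification(d) * (420 - 0)     # one term of the objective in Eq. (1)
print(d, round(penalty, 1))
```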


4.4. Constraints

$$T^{C}_{i,j}=T^{B}_{i,j}+t^{E},\quad j\in J,\ i=s^{O}_j/p^{O}_j \qquad (4)$$

$$T^{D}_{j,k}=T^{d}_{i,k},\quad q^{4}_{j,k}=1,\ j\in J,\ k\in K,\ i=p^{O}_j \qquad (5)$$

$$T^{E}_{i,j}=T^{D}_{j,k}+r_{k,(p^{O}_j,\,p^{D}_j)},\quad q^{4}_{j,k}=1,\ j\in J,\ k\in K,\ i=p^{D}_j \qquad (6)$$

$$q^{1}_{i,j,t}=\begin{cases}1, & t\in[T^{A}_{i,j},\,T^{B}_{i,j}),\ j\in J,\ i=s^{O}_j\\ 0, & \text{otherwise}\end{cases} \qquad (7)$$

$$q^{2}_{i,j,t}=\begin{cases}1, & t\in[T^{C}_{i,j},\,T^{D}_{j,k}),\ j\in J,\ k\in K,\ i=p^{O}_j\\ 0, & \text{otherwise}\end{cases} \qquad (8)$$

$$q^{3}_{j,k,t}=\begin{cases}1, & t\in[T^{D}_{j,k},\,T^{E}_{i,j}),\ q^{4}_{j,k}=1,\ i=p^{O}_j/p^{D}_j,\ j\in J,\ k\in K\\ 0, & \text{otherwise}\end{cases} \qquad (9)$$

$$\sum_{k\in K}q^{4}_{j,k}=1,\quad j\in J \qquad (10)$$

$$P^{p}_{i,t}=\sum_{j\in J}q^{2}_{i,j,t},\quad i\in I,\ t\in M \qquad (11)$$

$$P^{p}_{i,t}\le A_i\cdot\theta,\quad i\in I,\ t\in M \qquad (12)$$

$$P^{r}_{k,t}=\sum_{j\in J}q^{3}_{j,k,t},\quad k\in K,\ t\in M \qquad (13)$$

$$P^{r}_{k,t}\le C\cdot\varphi,\quad k\in K,\ t\in M \qquad (14)$$

$$P^{e}_{i,n}=\sum_{j\in J}\ \sum_{t=T_0+(n-1)\cdot t^{c}}^{T_0+n\cdot t^{c}}q_{i,j,t},\quad i\in s^{O}_j,\ n\in N \qquad (15)$$

$$P^{a}_{i,n}=\sum_{j\in J}\ \sum_{t=T_0+(n-1)\cdot t^{c}}^{T_0+n\cdot t^{c}}q^{1}_{i,j,t},\quad i\in s^{O}_j,\ n\in N \qquad (16)$$

$$a_{i,n}=\frac{P^{a}_{i,n}-P^{e}_{i,n}}{P^{a}_{i,n}},\quad i\in I,\ n\in N \qquad (17)$$

Formula (3) defines d_j, the number of times passenger j is stranded, and relates it to the number of trains departing the origin platform p^O_j during the period between T^G_{i,j} and T^D_{j,k}. Constraints (4)–(6) indicate the times when passenger j arrives at the origin platform, boards the train, and alights from the train, respectively. These time points can be used to infer the trip chain between the origin station s^O_j and the destination platform p^D_j of passenger j. Constraints (7)–(10) denote the different passenger states, corresponding to the passenger's waiting time before entering the origin station, the waiting time at the origin platform, and the time remaining on the train before arriving at the destination platform p^D_j. Constraints (11) and (12) give the cumulative number of passengers waiting at platform i at time t while also ensuring the safety of passengers on the platforms. Constraints (13) and (14) specify the cumulative number of passengers remaining on trains, whose values should not exceed the maximum capacity of the trains. In constraints (15)–(17), the passenger inflow control rate at station i at time step n is determined.
5. Reinforcement learning for coordinated passenger inflow control

In order to approach the problem with reinforcement learning, the following subsections provide the principles and fundamental components of reinforcement learning for the coordinated passenger inflow control, including the environment and its states, the action set, the reward function and the algorithm.

5.1. Reinforcement learning

As an important branch of machine learning, reinforcement learning is an approach of learning what to do and how to map situations to actions so as to maximize a numerical reward signal (Sutton and Barto, 1998). More specifically, in reinforcement learning, the agent aims to find an optimal control policy that maximizes the expected rewards through trial-and-error interaction with a dynamic environment. During the learning process, the agent observes the environment's state and decides upon an action. After executing the action, the environment switches to the subsequent state and gives the reward or penalty incurred by the action to the agent. The value function is also updated accordingly. The reward or penalty indicates whether the selected action is good in an immediate sense, while the value function specifies what is good in the long run. The goal of the agent is to maximize such long-term reward by learning a good policy, which is a mapping from perceived states to actions (Arel et al., 2010).

Fig. 6. Reinforcement learning agent and environment interaction in the case of coordinated passenger inflow control.

Fig. 6 depicts the interaction between the agent and the environment in reinforcement learning for the particular case of coordinated passenger inflow control in this paper. The agent corresponds to each station on the single metro line, making decisions and taking actions on different control rates of inflow volume, while the environment is the condition of passenger demand at each station. The environment is passive, since all the decisions about the actions which change its states come from the agent. The reward given to each station after executing the action is inversely proportional to the prolonged waiting time incurred by the passenger inflow control.

Q-learning (Watkins and Dayan, 1992) is a typical reinforcement learning method for agents to learn how to act optimally in controlled Markovian domains. In the Q-learning algorithm, the Q-value represents the expected utility of an agent executing an action in the current state. The agent intends to select the action with the maximal value of the utility, i.e., the Q-value. The Q-value function for a particular action and state is updated according to the reward received from the environment after executing the action in the current state. For each station i, the Q-value is initialized as 0 at the beginning of the learning process. Then the Q-value is updated using the Q-learning update rule given by Eqs. (18) and (19):

$$Q(i,s_n,a_v)=(1-\alpha)\,Q(i,s_n,a_v)+\alpha\,[R_i(s_n,a_v)+\gamma V^{*}] \qquad (18)$$

$$V^{*}=\max_{a_v\in A}Q(i,s_{n+1},a_v) \qquad (19)$$

where α ∈ [0,1] is the learning rate, γ ∈ [0,1] is the discount rate, R_i(s_n, a_v) is the reward given to station i after executing action a_v in state s_n, and Q(i, s_n, a_v) and Q(i, s_{n+1}, a_v) are the values of the Q function of station i for the respective states and actions. In Q-learning, the agent needs to explore the environment by taking different actions at random to avoid local optima. However, it also needs to select the action with the highest Q-value to reach the goal. This process is called the balance between exploration and exploitation. Several types of probability distribution have been proposed to ensure a good balance between exploration and exploitation, and the Boltzmann distribution has been commonly used (Kaelbling et al., 1996). It is formulated as follows:

$$\pi[a_v\,|\,s_n]=\frac{\exp[Q(i,s_n,a_v)/\tau]}{\sum_{a_v\in A}\exp[Q(i,s_n,a_v)/\tau]} \qquad (20)$$

where τ is the temperature parameter used to control the degree of exploration. At the beginning of the learning process, the value of τ is large so that actions are selected largely at random (i.e., the agent explores). As learning deepens, the value of τ is reduced so that the agent increasingly selects the action with the highest Q-value to approach the goal (i.e., it exploits).
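A minimal sketch of the two ingredients just described, the Boltzmann action selection of Eq. (20) and the Q-value update of Eqs. (18) and (19), is given below. The station name, state labels and reward value are placeholders; α = 0.1 and γ = 0.7 follow the parameter setting reported later in Section 6.1.

```python
import math
import random
from collections import defaultdict

def boltzmann_choice(q_row, actions, tau):
    """Sample an action from the Boltzmann distribution of Eq. (20)."""
    weights = [math.exp(q_row[a] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def q_update(Q, station, state, action, reward, next_state, actions, alpha=0.1, gamma=0.7):
    """One Q-learning step per Eqs. (18)-(19) for a single station (agent)."""
    v_star = max(Q[(station, next_state, a)] for a in actions)           # Eq. (19)
    key = (station, state, action)
    Q[key] = (1 - alpha) * Q[key] + alpha * (reward + gamma * v_star)    # Eq. (18)

# Hypothetical usage with the action set of Section 5.4 and placeholder state labels.
actions = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
Q = defaultdict(float)                                   # Q-values initialized to 0
a = boltzmann_choice({act: Q[("Jufeng Road", "s0", act)] for act in actions}, actions, tau=5.0)
q_update(Q, "Jufeng Road", "s0", a, reward=-120.0, next_state="s1", actions=actions)
```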

5.2. Environment and simulation

Three main elements influence the coordinated passenger inflow control strategy. The first element is the real-time passenger demand, including the passenger demand of origin-destination (O-D) pairs on a single metro line, the inflow demand of each station and the number of passengers waiting at each platform. The second element comprises the trains, each of which is characterized by its arrival and departure times and its available capacity when arriving at each station. The third element refers to the stations, for which the platform capacity is of particular interest. In this paper, the above elements are designed and implemented with great accuracy to represent the interaction between passengers and trains (Jiang et al., 2016) and the coordination among stations on the same metro line (Jiang et al., 2012).

In particular, the following assumptions are made when implementing the environment simulator to ensure high computational efficiency while still guaranteeing an appropriate accuracy of the simulation. First, it is assumed that the original passenger demand over the entire peak period is not reduced by the passenger inflow control. The second assumption is related to the trains: all trains operate according to the given timetable (D'Acierno et al., 2017), which means that there are no train delays. Third, it is assumed that the passenger inflow control strategy can be implemented at any station on the single metro line. Fourth, it is assumed that transfer passengers from other lines are already converted to inbound passengers and that transfer passengers to other lines are converted to outbound passengers. The final assumption is that passengers follow the rule of "first-come, first-served".

For the purpose of implementing the reinforcement learning approach to solving the coordinated passenger inflow control problem, a simulation platform is built based on the assumptions mentioned above. Both the simulator and the reinforcement learning algorithm are implemented in the Microsoft Access and Visual Studio 2013 environment. The simulator implements the gathering and scattering process at each station on the single metro line. The gathering and scattering process at each station usually consists of three sub-procedures: the arrival, departure and alighting-boarding processes (Xu et al., 2014; Xu et al., 2016). In the context of the specific problem domain in this paper, the arrival process is separated into two steps: in the first step passengers get to the entrance of the station, and in the second step they arrive at the platform. The former step requires the arrival distribution of inbound passengers and their corresponding destinations, which are used as input to the simulation. With the help of AFC data, this information is known and fixed over the period of interest. The latter step is related to the control rate of inflow volume per unit of time. If the control rate is 0, passengers arrive at the platform following the arrival distribution of the first step. If the control rate is greater than 0, those who are not allowed to enter the platform wait outside and enter the platform according to the control rate of the next period and the arrival sequence, which includes the passengers already waiting outside from the previous period. The alighting-boarding process involves both passengers and trains and their interaction. As such, the train is an important simulator object.
The planned timetable specifies the time points of scheduled train arrivals and departures at each station and the dwell time at each station, while the maximum capacity and the number of passengers on board determine the number of boarding passengers at each station and the number of passengers being stranded. These basic objects of the environment simulator are made available to the reinforcement learning algorithm for coordinated passenger inflow control through the environment state which will be discussed in detail in the following subsection.
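A much simplified sketch of the alighting-boarding sub-procedure described above is shown below. The object and field names are illustrative and not the authors' implementation (which is built in the Microsoft Access and Visual Studio 2013 environment), but the sketch captures the first-come, first-served boarding up to the residual train capacity.

```python
from collections import deque

def serve_train(platform_queue, train_load, alighting, capacity):
    """Process one train call at one platform; return the updated load and the boarders."""
    train_load -= alighting                          # passengers leaving at this station
    boarders = []
    while platform_queue and train_load < capacity:  # FCFS boarding up to capacity
        boarders.append(platform_queue.popleft())
        train_load += 1
    # whoever remains in platform_queue is stranded and waits for the next train
    return train_load, boarders

# Hypothetical call: 1200 waiting passengers, 600 already on board, 150 alight,
# and the train capacity of 1008 used in the case study of Section 6.1.
queue = deque(f"p{j}" for j in range(1200))
load, boarded = serve_train(queue, train_load=600, alighting=150, capacity=1008)
print(load, len(boarded), len(queue))   # 1008 on board, 558 boarded, 642 stranded
```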

5.3. State description

Station states characterize the state of each station on the single metro line at a certain point in time. It is assumed that the control rate of inflow volume for each station is changed at so-called control time steps. We define t^c as the control time step size, based on which the total simulation time is divided into n control time steps. For example, if the control time step size is t^c = 15 min, then the control rate of inflow volume for the control stations may be changed every 15 min during the simulation. Considering the maneuverability in real cases, the control time step size should not be less than 15 min. Let s_n denote the state at control time step n. In the case of the passenger inflow control problem, the state characterizes the real-time condition of passenger demand for each station at different control time steps and is defined as:

$$s_n=\frac{D(n)}{D'(n)} \qquad (21)$$

The original inflow volume of the station at control time step n is defined as D(n). Due to the passenger inflow control, some passengers who arrive at the station at step n−1 may have to wait outside the station and are not allowed to enter it until the next step n. The sum of the volume of these passengers and D(n) is the current inflow volume of the station at control time step n, which is denoted by D'(n).
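The sketch below illustrates the bookkeeping behind Eq. (21), assuming hypothetical step demands and control rates: passengers held outside at one step are carried over into the current inflow volume D'(n) of the next step.

```python
# Minimal sketch of the D(n)/D'(n) bookkeeping of Eq. (21); demand figures and
# control rates are hypothetical.

def rollout_states(original_demand, control_rates):
    carry, states = 0, []
    for D, a in zip(original_demand, control_rates):
        D_prime = D + carry                    # current inflow volume at this step
        admitted = round((1 - a) * D_prime)    # passengers allowed in under rate a
        carry = D_prime - admitted             # held outside until the next step
        states.append((D, D_prime))            # the quantities that define s_n
    return states

print(rollout_states([500, 450, 400], [0.2, 0.2, 0.0]))
# -> [(500, 500), (450, 550), (400, 510)]
```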


Fig. 7. URT network topology in Shanghai.

5.4. Action set

The action set A contains the control rates of inflow volume for each station on a single metro line. The action a_v ∈ A is the action selected at each control time step n, which represents the percentage of passengers forbidden to enter the station at step n. Considering the maneuverability in real cases, the action space is defined as follows: A = {0%, 20%, 40%, 60%, 80%, 100%}. For example, assume that 100 people would enter station i at control time step n. If the action a_2 = 20% is executed at station i at step n, then the number of people allowed to enter the station at step n is reduced to 80. Accordingly, the action a_1 = 0% indicates that no one is forbidden to enter the station, while the action a_6 = 100% means that no one is allowed to enter the station.

5.5. Reward function

A reward function defines the goal in a reinforcement learning problem (Sutton and Barto, 1998). In this paper, the objective is to minimize the safety risks at metro stations. Since the frequency of passengers being stranded is the main factor of the safety risks, the reward function is directly related to the punishment for passengers being stranded both inside and outside the station. The reward for station i after executing action a_v in state s_n is defined as:

$$R_i(s_n,a_v)=-\left[\sum_{g\in I}W(g,s_n)\,\delta_i(g,s_n)+LR_i(s_n,a_v)\right] \qquad (22)$$

where LR_i(s_n, a_v) is the long-term efficiency.


It should be noted that the reward value consists of two parts. The first element, ∑_{g∈I} W(g, s_n) δ_i(g, s_n), denotes the reward value for station i after taking action a_v in state s_n during the passenger inflow control time periods. The second component, LR_i(s_n, a_v), represents the penalty value of ∑_{g∈I} W(g, s_n) δ_i(g, s_n) at station i when all of the passenger inflow control strategies are implemented. In particular, the long-term efficiency explicitly accounts for the impact of the passengers being stranded during the passenger inflow control time periods on the passenger flows during subsequent non-controlled time periods. Note that such long-term efficiency is jointly determined by both P^e_{i,n} and P^a_{i,n} (i.e., the cumulative numbers of passengers entering and arriving at station i at time step n). Due to the coordinated passenger inflow control and the limited platform capacity, passengers at each station are divided into two types, as shown in Fig. 3(b): those who are forbidden to enter the station and have to wait outside, and those who wait at the platform. The punishment for those who wait at the station is quantified by using Eq. (23):

$$W(g,s_n)=\beta\,[w_1(g,s_n)+w_2(g,s_n)] \qquad (23)$$

where β is the amplification coefficient, which increases nonlinearly and exponentially with the frequency of being stranded (i.e., d_j as shown in Eq. (3)), and w_1(g, s_n) is the length of time of being stranded at the platform. If the action a_v is executed at station g in state s_n, the punishment for those who wait outside the station refers to the extra waiting time spent outside station g and is defined as w_2(g, s_n). For example, assume that a passenger arrives at the station at 8:00 am (the moment he/she would enter according to the AFC data). However, due to the passenger inflow control, he/she has to wait outside and only arrives at the platform at 8:10 am; the time loss spent outside is therefore 10 min. After reaching the platform, he/she fails to get on the train that departs at 8:12 am. In other words, due to the limited train capacity, he/she is stranded twice and finally gets on the train which departs at 8:18 am. The punishment is then W(g, s_n) = e^{2×3} × [(10 − 0) + (18 − 12)] × 60 = 960e^6. Since the variable W(g, s_n) is related not only to the inflow volume of station g but also to the train capacity occupied at the upstream stations, δ_i(g, s_n), with i, g ∈ I, is defined as the proportion of the punishment of station g that station i should take.

5.6. Reinforcement learning algorithm

The reinforcement learning algorithm process is presented as follows:
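A minimal sketch of one possible training loop built from the components of Sections 5.1–5.5 is given below (Boltzmann exploration per Eq. (20), Q-value updates per Eqs. (18) and (19), and a simulator producing rewards per Eq. (22)). The env object, its reset/step interface, the temperature schedule and the state encoding are assumptions made for illustration rather than the authors' implementation.

```python
import math
import random
from collections import defaultdict

ACTIONS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # action set of Section 5.4

def train(stations, env, episodes=100, n_steps=12, alpha=0.1, gamma=0.7, tau0=10.0):
    """env.reset() -> {station: state}; env.step(actions) -> (rewards, next_states)."""
    Q = defaultdict(float)                           # Q-values initialized to 0
    for ep in range(episodes):
        tau = max(0.5, tau0 * (1 - ep / episodes))   # cool exploration down over episodes
        states = env.reset()
        for _ in range(n_steps):
            actions = {}
            for i in stations:                       # Boltzmann action selection, Eq. (20)
                w = [math.exp(Q[(i, states[i], a)] / tau) for a in ACTIONS]
                actions[i] = random.choices(ACTIONS, weights=w, k=1)[0]
            rewards, next_states = env.step(actions) # simulator of Section 5.2, reward per Eq. (22)
            for i in stations:                       # Q-value update, Eqs. (18)-(19)
                v_star = max(Q[(i, next_states[i], a)] for a in ACTIONS)
                key = (i, states[i], actions[i])
                Q[key] = (1 - alpha) * Q[key] + alpha * (rewards[i] + gamma * v_star)
            states = next_states
    return Q
```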


Fig. 8. The learning curve (i.e. change of the average total loss with increasing number of training episodes) obtained with reinforcement learning for the simulation scenario using 100 training episodes.

6. Example and analysis

In this paper, the use and performance of the reinforcement learning approach to coordinated passenger inflow control for URT are evaluated and illustrated using a real-world example. The example is based on a metro line in Shanghai, namely Line 6.

6.1. Simulation scenario

The topology of the URT network in Shanghai in mid-2016 consists of 14 lines with 366 stations, as shown in Fig. 7. Line 6, marked in red in Fig. 7, consists of 28 stations with a total length of 36.1 km. During morning peak hours, passenger congestion occurs in the direction from Gangcheng Road to Oriental Sports Center. In order to obtain the number of passengers in trains and on platforms, one needs to simulate the passenger flow during the morning peak hours. As such, passenger flow data with 95,850 O-D pairs on March 3, 2016, provided by the AFC system, is selected and used as the input in the experiment. The initial timetable was an actual weekday operation timetable of Line 6 on March 3, 2016. This timetable, named 622_1, is compiled by the train plan maker (TPM) software (Jiang and Xiao, 2014) with an average interval of 140 s in peak hours. The capacity of each train is set as 1008, which is the product of the number of passengers per train unit and the number of train units, adjusted to the maximum capacity using a maximum load factor (i.e., 210 × 4 × 1.2 = 1008). The simulation time horizon is defined from 8:00 to 11:00 a.m., covering the morning peak hours. The simulation is implemented in Visual Studio 2013 running on a personal computer with an Intel Core i5-3470M CPU at 3.20 GHz and 8 GB of RAM under Windows 8 64-bit. In addition, the following parameters are used: φ = 1.4, θ = 0.8, α = 0.1, γ = 0.7, t^E = 2 min, t^c = 15 min, A = {0%, 20%, 40%, 60%, 80%, 100%}, and 100 training episodes.

To properly evaluate the performance of reinforcement learning, the experiment conducted in this paper consists of 5 runs with 100 training episodes per run, using a different random-generator seed in each run. The average computational time for 100 training episodes is 50 s. Note that the coordinated passenger inflow control problem is essentially a large-scale nonlinear combinatorial optimization problem, as can easily be seen from the solution space of this simple case study. There are 28 stations on Line 6, 3 × 60/15 = 12 decision intervals within 3 h, and 6 potential actions for each interval at each station. As such, the space consists of (6^12)^28 ≈ 2.8762 × 10^261 possible solutions. In this regard, no global optimal solution can be found, and the use of the reinforcement learning approach for coordinated passenger inflow control of urban rail transit in peak hours is therefore well justified.
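The figures quoted above can be checked with a few lines of arithmetic; the snippet below only reproduces the calculations stated in the text (train capacity, number of decision intervals and solution-space size) and uses no data from the case study itself.

```python
# Quick check of the arithmetic quoted in Section 6.1.
train_capacity = 210 * 4 * 1.2                      # passengers/unit x units x max load factor
decision_intervals = 3 * 60 // 15                   # 3 h split into 15-min control steps
solution_space = (6 ** decision_intervals) ** 28    # 6 actions, 12 steps, 28 stations
print(train_capacity)                               # 1008.0
print(decision_intervals)                           # 12
print(f"{solution_space:.4e}")                      # about 2.876e+261, as quoted above
```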

6.2. Results and analysis

As mentioned, the experiment consists of 5 runs with 100 training episodes per run. Each training episode yields an improved inflow control strategy: over the episodes, the quality of the inflow control strategy is expected to improve, while the punishment for passengers being stranded is expected to decrease.

Table 4
The inflow control strategy for Line 6.

Inflow control station | Time to take control strategies | Inflow control rate | Corresponding inbound volume
Dongjing Road | 8:00–8:15 | 20% | 595
Jufeng Road | 8:00–8:15 | 20% | 1105
Wulian Road | 8:00–8:15 | 20% | 594
Boxing Road | 8:00–8:15 | 20% | 945
Beiyangjing Road | 8:15–8:30 | 20% | 253


Table 5
Number of passengers being stranded from 8 am to 9 am.

Frequency of being stranded | Once | Twice | Three times | More than three times
Yuanshen Sports Centre Stadium, without control | 153 | 15 | 0 | 0
Yuanshen Sports Centre Stadium, under control | 112 | 0 | 0 | 0
Minsheng Road, without control | 319 | 169 | 118 | 11
Minsheng Road, under control | 375 | 100 | 5 | 0
Beiyangjing Road, without control | 221 | 130 | 23 | 0
Beiyangjing Road, under control | 303 | 91 | 0 | 0
Deping Road, without control | 465 | 281 | 37 | 0
Deping Road, under control | 688 | 88 | 0 | 0
Yunshan Road, without control | 460 | 0 | 0 | 0
Yunshan Road, under control | 308 | 0 | 0 | 0
Jinqiao Road, without control | 176 | 0 | 0 | 0
Jinqiao Road, under control | 0 | 0 | 0 | 0
Boxing Road, without control | 59 | 0 | 0 | 0
Boxing Road, under control | 0 | 0 | 0 | 0
The whole line, without control | 2223 | 595 | 178 | 11
The whole line, under control | 2156 | 279 | 5 | 0

All the training episodes together present one learning run in which the agent learns the inflow control strategy for the oversaturated single metro line. Fig. 8 depicts the learning curve for the simulation scenario described in Section 6.1: the red dashed line labeled "FC_RL_avg" is a fitted curve representing the average trend of the 5 learning curves obtained from 5 learning runs with different random-seed values, while the solid line labeled "RL_stDev" corresponds to the standard deviation. The learning curve shows that the inflow control strategy improves with the training episodes, which means that the agents (stations) learn a better-performing strategy that leads to a lower total loss. It can also be seen that the standard deviation over the 5 runs decreases with the training episodes, because the exploration rate (i.e., the probability of selecting an action at random) decreases as learning deepens.

6.2.1. Inflow control strategy

With the help of reinforcement learning, the inflow control strategy for metro Line 6 is determined, including the control stations, the inflow control rates and the times to implement control strategies, as shown in Table 4. For all other stations on Line 6 which are not listed in Table 4, it is recommended that no inflow control strategies be taken.

It is very important to note that the strategy obtained by reinforcement learning is different from the one formulated from the subjective work experience of the operation staff, whose inflow control rates are usually subjective and insufficient. In the real case, the control stations of Line 6 include Dongjing Road, Jufeng Road, Wulian Road, Boxing Road, Jinqiao Road and Deping Road, and the time to take control strategies is 7:20–8:40. Compared with the strategy applied in the real case, the inflow control strategy obtained by the reinforcement learning approach is more accurate and has greater maneuverability: the number of control stations is reduced from 6 to 5 and the duration of the control strategy is also shortened greatly. By using the reinforcement learning approach, the inflow control rate and the corresponding inbound volume can be easily quantified, which overcomes the disadvantage of the real-case strategy, whose rates are hard to justify due to the sub-optimality and uncertainty of the operation staff's work experience.

6.2.2. The frequency of passengers being stranded in stations

The frequency of passengers being stranded is one of the performance indicators which can be used to evaluate the quality of the inflow control strategy obtained by reinforcement learning. As mentioned before, if several passengers are stranded more than twice at a station, serious hidden dangers can arise. The number of passengers being stranded in stations is compared with the result of non-inflow control in Table 5.

Before taking any control measures, some passengers are stranded more than twice at stations including Minsheng Road, Beiyangjing Road and Deping Road. After taking the control strategy obtained by using the reinforcement learning approach (as presented in Table 5), the average frequency of passengers being stranded is reduced to less than two times, and the number of passengers being stranded more than twice is significantly reduced from 189 to 5.


Fig. 9. The change in the number of waiting passengers at the platform without inflow control and under inflow control.

6.2.3. The number of passengers waiting at the platform

Another indicator used to evaluate the inflow control strategy is the number of passengers waiting at the platform. Fig. 9 presents the results at four stations of interest (Minsheng Road, Beiyangjing Road, Deping Road and Jinqiao Road) to evaluate the effect of the inflow control strategy. Since the train capacity is quickly occupied at the upstream stations (including Dongjing Road, Jufeng Road, Wulian Road and Boxing Road), the number of waiting passengers at these four stations of interest exceeds the critical density of the platform without passenger control. According to the control strategy obtained by using reinforcement learning, the upstream stations take control measures during the time period 8:00–8:15. The train capacity is thus reserved for the above four stations of interest, so that passengers waiting at these stations can get aboard and the number of waiting passengers is reduced. Considering the relative positions of these stations, the numbers of waiting passengers are greatly reduced during the time period 8:14–8:24 under passenger control, as shown in Fig. 9(c)–(d). In that regard, the results clearly indicate that the inflow control strategy obtained by reinforcement learning can greatly help relieve the platform congestion on Line 6 and can be effectively applied in practice.

7. Conclusion

In this paper, a model of coordinated dynamic passenger inflow control is built with the objective of minimizing the penalty value of passengers being stranded on a whole metro line, while explicitly accounting for the constraints of passenger trip chains and states, the capacity of the platform and the train, and the scheduled train timetable. A new method based on reinforcement learning is presented to optimize the inflow volume of each station in a certain period of time with the aim of minimizing the safety risks at metro stations. In order to determine the inflow control stations, the optimal inflow control volume of each control station and the times to take control strategies, a detailed presentation of the basic principles and fundamental components of reinforcement learning is given, including the environment and its states, the action set, the reward function and the algorithm. In the case study, metro Line 6 in Shanghai is used to demonstrate the performance of the model and approach. In comparison with the result of no inflow control, the number of passengers being stranded more than twice is significantly reduced and the passenger congestion at certain stations is also greatly relieved. The plan for coordinated passenger inflow control achieved by solving the model using reinforcement learning can be useful for the operation staff when determining the inflow control stations, the inflow control volume of each station and the time to take control strategies in practice.


The proposed approach can be considered a first step towards the use of reinforcement learning for coordinated passenger inflow control. As this line of research matures, further work should extend its application to multiple lines in an urban rail transit network. Moreover, train delays are not considered in the proposed approach; such delays may occur as a result of extended dwell times due to a large number of boarding passengers in practice and can therefore have some influence on the passenger inflow control strategy. It is also well understood that the passenger flow control strategy may be restricted by many external influencing factors, including the station layout, passenger transfer characteristics, origin and terminal stations, etc. Therefore, one may need to account for such factors accordingly. These issues will be addressed in future research.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61473210) and the Fundamental Research Funds for the Central Universities. The acquisition of the operations data in the paper was supported by the Shanghai Shentong Metro Management Center. The authors are grateful for this support.

References

Arel, I., Liu, C., Urbanik, T., Kohls, A.G., 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transport Syst. 4 (2), 128–135.
Bauer, D., Seer, S., Brandle, N., 2007. Macroscopic pedestrian flow simulation for designing crowd control measures in public transport after special events. In: Summer Computer Simulation Conference 2007, San Diego, California, USA, pp. 1035–1042.
Chen, X., Li, H., Miao, J., Jiang, S., Jiang, X., 2017. A multiagent-based model for pedestrian simulation in subway stations. Simul. Model. Pract. Theory 71, 134–148.
D'Acierno, L., Botte, M., Placido, A., Caropreso, C., Montella, B., 2017. Methodology for determining dwell times consistent with passenger flows in the case of metro services. Urban Rail Transit 3 (2), 73–89.
Delgado, F., Munoz, J.C., Giesen, R., 2012. How much can holding and/or limiting boarding improve transit performance? Transp. Res. Part B: Methodol. 46 (9), 1202–1217.
Feng, S., Ding, N., Chen, T., Zhang, H., 2013. Simulation of pedestrian flow based on cellular automata: a case of pedestrian crossing street at section in China. Phys. A: Stat. Mech. Appl. 392 (13), 2847–2859.
Gosavi, A., 2009. Reinforcement learning: a tutorial survey and recent advances. INFORMS J. Comput. 21 (2), 178–192.
Jiang, M., Li, H.Y., Xu, X.Y., Xu, S.P., Miao, J.R., 2017. Metro passenger flow control with station-to-station cooperation based on stop-skipping and boarding limiting. J. Cent. S. Univ. 24 (1), 236–244.
Jiang, Z., Hsu, C., Zhang, D., Zou, L., 2016. Evaluating rail transit timetable using big passengers' data. J. Comput. Syst. Sci. 82 (1), 144–155.
Jiang, Z., Li, F., Xu, R.H., Gao, P., 2012. A simulation model for estimating train and passenger delays in large-scale rail transit networks. J. Cent. S. Univ. 19 (12), 3603–3613.
Jiang, Z.B., Xiao, X.Y., 2014. A turn-back track constraint train scheduling algorithm on a multi-interval rail transit line. Comp. Rail. XIV 135, 151–162.
Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res. 4 (1), 237–285.
Khamis, M.A., Gomaa, W., 2014. Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Eng. Appl. Artif. Intell. 29, 134–151.
Kwon, E., Stephanedes, Y.J., 1994. Comparative evaluation of adaptive and neural-network exit demand prediction for freeway control. Transp. Res. Rec. 1446, 66–76.
Martinez-Gil, F., Lozano, M., Fernández, F., 2014. MARL-Ped: a multi-agent reinforcement learning based framework to simulate pedestrian groups. Simul. Model. Pract. Theory 47, 259–275.
Šemrov, D., Marsetič, R., Žura, M., Todorovski, L., Srdic, A., 2016. Reinforcement learning approach for train rescheduling on a single-track railway. Transp. Res. Part B: Methodol. 86, 250–267.
Spall, J.C., Chin, D.C., 1994. A model-free approach to optimal signal light timing for system-wide traffic control. In: Proceedings of the IEEE Conference on Decision and Control, pp. 1868–1875.
Sutton, R.S., Barto, A.G., 1998. Reinforcement learning: an introduction. IEEE Trans. Neural Netw. 9 (5), 1054.
Walraven, E., Spaan, M.T.J., Bakker, B., 2016. Traffic flow optimization: a reinforcement learning approach. Eng. Appl. Artif. Intell. 52, 203–212.
Wan, J., Sui, J., Yu, H., 2014. Research on evacuation in the subway station in China based on the Combined Social Force Model. Phys. A: Stat. Mech. Appl. 394, 33–46.
Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Mach. Learn. 8 (3–4), 279–292.
Wattleworth, J.A., Berry, D.S., 1965. Peak-period control of a freeway system: some theoretical investigations. Highway Res. Rec. 89, 1–25.
Xu, X., Liu, J., Li, H.Y., Hu, J.Q., 2014. Analysis of subway station capacity with the use of queueing theory. Transp. Res. Part C: Emerg. Technol. 38, 28–43.
Xu, X., Liu, J., Li, H.Y., Jiang, M., 2016. Capacity-oriented passenger flow control under uncertain demand: algorithm development and real-world case study. Transp. Res. Part E: Logist. Transp. Rev. 87, 130–148.
Yao, X.M., Zhao, P., Qiao, K., Yu, D., 2015. Modeling on coordinated passenger inflow control for urban rail transit network. J. Cent. S. Univ. 1, 342–350.
Zhao, P., Yao, X.M., Yu, D., 2014. Cooperative passenger inflow control of urban mass transit in peak hours. J. Tongji Univ. (Nat. Sci.) 42 (9), 1340–1346+1443.
Zhu, F., Ukkusuri, S.V., 2014. Accounting for dynamic speed limit control in a stochastic traffic environment: a reinforcement learning approach. Transp. Res. Part C: Emerg. Technol. 41 (2), 30–47.
Zolfpour-Arokhlo, M., Selamat, A., Hashim, S.Z.M., Afkhami, H., 2014. Modeling of route planning system based on Q value-based dynamic programming with multi-agent reinforcement learning algorithms. Eng. Appl. Artif. Intell. 29, 163–177.
