Transportation Research Part C 88 (2018) 1–16
Reinforcement learning approach for coordinated passenger inflow control of urban rail transit in peak hours
Zhibin Jiang a, Wei Fan a,b,⁎, Wei Liu c, Bingqin Zhu d, Jinjing Gu a

a College of Transportation Engineering, Key Laboratory of Road and Traffic Engineering of the State Ministry of Education, Tongji University, 4800 Cao'an Rd., Shanghai 201804, PR China
b USDOT Center for Advanced Multimodal Mobility Solutions and Education, Department of Civil and Environmental Engineering, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA
c Technical Center of Shanghai Shentong Metro Group Co., Ltd., 909 Guilin Road, Xuhui, Shanghai 201103, PR China
d Operation Management Center of Shanghai Shentong Metro Group Co., Ltd., 222 Hengtong Road, Jing'an, Shanghai 200070, PR China
ARTICLE INFO

Keywords: Urban rail transit; Train capacity; Operation safety; Coordinated passenger inflow control; Reinforcement learning

ABSTRACT
In peak hours, when the limited transportation capacity of urban rail transit is not adequate to meet travel demand, the density of the passengers waiting at the platform can exceed the critical density of the platform. A coordinated passenger inflow control strategy is required to adjust/meter the inflow volume and relieve some of the demand pressure at crowded metro stations, so as to ensure both operational efficiency and safety at such stations for all passengers. However, such a strategy is usually developed by the operation staff at each station based on their practical working experience. As such, the best strategy/decision cannot always be made and can sometimes even be highly undesirable, because such experience-based strategies cannot account for the dynamic performance of all metro stations in the entire rail transit network. In this paper, a new reinforcement learning-based method is developed to optimize the inflow volume during a certain period of time at each station, with the aim of minimizing the safety risks imposed on passengers at the metro stations. The basic principles and fundamental components of reinforcement learning, as well as the reinforcement learning-based problem-specific algorithm, are presented. A simulation experiment based on a real-world metro line in Shanghai is constructed to test the performance of the approach. Simulation results show that the reinforcement learning-based inflow volume control strategy is highly effective in minimizing the safety risks by reducing the frequency of passengers being stranded. Additionally, the strategy also helps to relieve the passenger congestion at certain stations.
1. Introduction

With the rapid development of urban rail transit (URT) in China, the limited transportation capacity is not adequate to meet the booming travel demand, especially during peak hours. It is common to see large numbers of people gather at the platform and fail to get aboard when the train arrives during rush hours. In particular, if the density of the passengers waiting at the platform exceeds the critical density of the platform, it becomes extremely challenging to ensure the safety of passengers and efficient daily operations. Increasing the capacity is a straightforward solution. However, it is not always feasible in practice due to potentially long construction time periods, budget limitations, right-of-way constraints, operational and safety restrictions including line capacity, the maximum possible amount of rolling stock, etc.

⁎ Corresponding author at: Department of Civil and Environmental Engineering, University of North Carolina at Charlotte, 9201 University City Boulevard, Charlotte, NC 28223, USA. E-mail address: [email protected] (W. Fan).
https://doi.org/10.1016/j.trc.2018.01.008
Received 6 April 2017; Received in revised form 27 December 2017; Accepted 6 January 2018; Available online 30 January 2018.
Table 1. Passenger inflow control for URT in major cities in China.

| City | Number of operating lines | Number of inflow control lines | Number of inflow control stations | Time periods for passenger inflow control |
| --- | --- | --- | --- | --- |
| Beijing | 18 | 14 | 75 | Morning peak 7:00–9:00; evening peak 17:00–19:00 |
| Shanghai | 14 | 13 | 50 | Morning peak 7:30–9:00; evening peak 17:30–19:00 |
| Guangzhou | 9 | 6 | 38 | Morning peak 8:00–9:00; evening peak 17:30–19:00 |

Data source: Beijing Metro Operation Co., Ltd. & Guangzhou Metro Operation Co., Ltd.: collected in April 2016; Shanghai Metro Operation Co., Ltd.: collected in December 2015.
Under such circumstances, passenger inflow control can be a potentially effective short-term alternative to ensure operational efficiency and safety while also lowering the pressure on stations. By setting railings outside metro stations, shutting down some of the ticket vending machines (TVMs) or entrance gates, and/or closing off some entrances, the speed and flow rate of passengers entering the metro stations during a certain period of time can be effectively limited, so that the number of passengers waiting at the platform per unit of time is kept under control. In fact, these passenger inflow control measures have already been taken in the daily operation of metro lines in major Chinese cities such as Beijing, Shanghai and Guangzhou (see Table 1). Unfortunately, current control strategies are usually developed based on the engineering judgement and subjective work experience of the operation staff at each station, without the help of mathematical programming or other scientific methods, leading to potential deficiencies in overall and dynamic performance. Coordinated passenger inflow control, which requires a quick response to demand flow dynamics, is therefore highly needed to maintain the safety of all passengers.

In this paper, a novel approach is developed based on reinforcement learning, more specifically Q-learning. Based on the simulation of the interaction between passengers and trains, the reinforcement learning algorithm automatically learns when and at which stations to enforce inflow control, as well as the optimal control rates at each control station (measured per unit of time), with the aim of minimizing the safety risks at the metro stations of a single line. As a powerful tool for solving complex sequential decision-making problems in control theory (Gosavi, 2009), reinforcement learning is able to respond quickly to dynamic changes in the network environment and has already been successfully used to solve similar problems such as traffic flow control optimization on highways.

The main contributions of this paper can be summarized as follows:

1. Real-time data on key indicators, including the number of people waiting at the platform and the frequency of passengers being stranded, are used to evaluate the safety risks. On the one hand, more people waiting at the platform poses great challenges and increases the possibility of passengers being pushed off the platform and/or stepped on if they fall to the ground under overcrowded conditions, which causes significant safety risks at the platform. On the other hand, a higher frequency of passengers being stranded results in more anxious and impatient passengers, especially in morning peak hours, many of whom may force/rush into the train (preventing the doors from closing and causing significant delays in train departure) and sometimes even lead to severe accidents. As such, both the number of people waiting at the platform and the frequency of passengers being stranded are used to evaluate the safety risks in this paper.
2. A model of coordinated passenger inflow control is formulated to minimize the penalty value of passengers being stranded over a whole metro line, considering real-time passenger trip chains and states.
The interaction between passengers and trains allows for a quantitative evaluation of their respective available capacities, and can be used to identify the entry rate of passengers at each station under dynamic passenger demand.
3. The reinforcement learning approach developed in this paper can be applied to formulate strategies dictating when and at which stations to enforce inflow control, together with the corresponding control rates, to ensure the safety of the entire metro line.

The rest of the paper is organized as follows. Section 2 reviews previous studies on passenger inflow control and applications of reinforcement learning. Section 3 describes the coordinated passenger inflow control problem for a single metro line during peak hours. Section 4 builds an optimization model for the coordinated passenger inflow control problem on a whole line. Section 5 presents the approach to coordinated passenger inflow control with reinforcement learning by systematically explaining the fundamental components and the rationale behind the specific design choices. Section 6 provides a real-world example and discusses how the approach supports coordinated passenger inflow control. Finally, conclusions and future research directions are given in Section 7.

2. Literature review

The coordinated passenger inflow control of URT is a complex and very challenging optimization problem that involves various essential components, including passengers, trains, infrastructure and strategies. Previous studies regarding passenger inflow
control can be classified along two major dimensions: (1) large passenger flow organization, and (2) the optimization model and corresponding solution approach.

In terms of the first dimension, existing research efforts can be classified into two levels: the macroscopic and the microscopic level. At the macroscopic level, research has commonly developed general plans for large passenger flow organization, such as the boarding limiting strategy (Delgado et al., 2012). At the microscopic level, studies have mainly focused on the simulation and analysis of passenger travel behavior at a single metro station using different microscopic models such as the social force model (Wan et al., 2014), cellular automata (Feng et al., 2013) and multi-agent models (Martinez-Gil et al., 2014; Chen et al., 2017). However, such simulation-based work neglects the interaction between passengers and trains, is confined to a single metro station, and fails to account for the necessary coordination between stations along the same metro line.

The coordinated passenger inflow control of URT is a particular and newly arising problem in China, and the optimization models formulated and the corresponding solution approaches developed in previous studies are insufficient. Traditional operations research methods (e.g., multi-objective linear programming) are among the most commonly used. The control stations, the inflow control volume at each station and the times to implement control strategies are optimized by establishing coordinated passenger inflow control models with different optimization objectives. Such objectives are usually relevant to the passengers, the operators or a combination of both, such as minimizing the number of delayed passengers, maximizing the matching degree of capacity and demand (Yao et al., 2015), minimizing passenger delays, maximizing passenger person-kilometers traveled (Zhao et al., 2014), or maximizing passenger profit (Jiang et al., 2017). It is important to note that this paper intends to minimize the safety risks by reducing the frequency of passengers being stranded. Perhaps most importantly, this paper uses the origin-destination passenger flow matrix collected from AFC data instead of the aggregated station-level inflows and outflows. As such, more reliable computational results with a higher level of accuracy can be obtained compared to those from previous research efforts.

In fact, one form of passenger inflow control is to optimize the inflow volume at each station during a certain period of time. It is similar to ramp metering, which regulates the number of vehicles entering a segment of highway. Due to the similarity between the two problems, related work on the various approaches to the traffic flow control problem is reviewed and summarized here. Linear programming was commonly used in the control of freeway systems during peak traffic periods, with the objective of maximizing the total inflow volume under highway capacity constraints (Wattleworth and Berry, 1965). However, it is not suitable for solving the problem on a large scale when traffic flow fluctuates due to emergencies. Consequently, researchers have applied artificial intelligence to coordinated responsive metering, including artificial neural networks (Spall and Chin, 1994; Kwon and Stephanedes, 1994) and reinforcement learning (Zhu and Ukkusuri, 2014; Walraven et al., 2016).
In addition, some research has also focused on the effect of passenger inflow control. For example, a simulation of macroscopic pedestrian flows was conducted to test how passenger behavior evolves when passenger flow control is employed after the occurrence of special events (Bauer et al., 2007).

Since the coordinated passenger inflow control problem is essentially a large-scale nonlinear combinatorial optimization problem with a large number of decision variables, dynamic decisions cannot be made within a short time period using traditional operations research methods. The review of the literature therefore indicates the need to develop a new method for coordinated passenger inflow control. Reinforcement learning, as a powerful tool for solving complex sequential decision-making problems in control theory (Gosavi, 2009), has already been effectively used in several transportation problems. Examples include train rescheduling (Šemrov et al., 2016), traffic signal control (Khamis and Gomaa, 2014), route planning (Zolfpour-Arokhlo et al., 2014) and traffic flow control (Zhu and Ukkusuri, 2014; Walraven et al., 2016). According to the results of the research efforts mentioned above, reinforcement learning is at least equivalent and often superior to traditional operations research methods in terms of overall dynamic decision-making performance and computational speed. Meanwhile, the successful applications of reinforcement learning to ramp metering make it possible to deal with the similar problem of coordinated passenger inflow control addressed in this paper.

3. Problem description

The problem of passenger congestion in URT occurs when the passenger demand volume exceeds the transport capacity. When the number of waiting passengers exceeds the capacity of the arriving train, those who cannot get aboard are stranded once, further increasing the number of passengers waiting at the platform. If they fail to get on the next train, they are stranded twice. If several passengers are stranded more than twice at a station, the prolonged waiting time can pose serious problems. As mentioned in the introduction, increasing the capacity is not always feasible, and therefore other measures, such as controlling the inflow volume at certain stations, should be taken to mitigate such passenger congestion issues. Fig. 1 depicts the effect of passenger inflow control on keeping passengers in order at the station by setting railings outside metro stations and closing off some entrances.

It is important to note that the problem of passenger flow congestion at one station should not be attributed only to that station's own inflow volume. In fact, for a single metro line, congestion usually occurs when the train capacity is quickly reached at upstream stations, even if the station's own inflow volume is small. For illustration purposes, a hypothetical case is used here to show the importance of coordinated passenger inflow control. As shown in Fig. 2(a), due to the large inflow volume at station B, the train capacity is fully occupied by passengers waiting at station B. If no passenger on board gets off the train at the downstream stations C, D and E, those who wait at stations C, D and E will not be able to get aboard, leading to prolonged passenger waiting and congestion at these three stations.
In order to relieve the congestion, the inflow volume at station B should be controlled so that train capacity is reserved for passengers at the downstream stations who otherwise would have suffered from the congestion; this is the so-called coordinated passenger inflow control.
Fig. 1. The effect of passenger inflow control on keeping passengers in order at the station.
(Fig. 2 panels: section passenger volume over stations A–F versus the train capacity; (a) without control, stations C, D and E suffer from passenger congestion; (b) under control, the inflow control volume at station B leaves capacity for the passengers boarding at stations C, D and E, and the congestion is relieved.)
Fig. 2. The effect of coordinated inflow control on balancing the utilization of the train capacity.
It is obvious that coordinated passenger inflow control can greatly help balance the utilization of the train capacity and improve operational reliability, as shown in Fig. 2(b).

In this paper, the problem of formulating the coordinated passenger inflow control strategy for a single metro line is visualized in Fig. 3. For the purpose of addressing the congestion concern in a totally oversaturated condition, a virtual platform (named "Outside" in Fig. 3(b)) is added to the original platform of each station under control along the line. If the demand volume exceeds the transport capacity, specific control measures should be taken at certain stations. Passengers arriving at these stations must first wait outside the station and then enter the platform according to the inflow control rate and their arrival sequence. In other words, the inflow control rate determines the inflow control volume in a certain period of time, i.e., the number of passengers who are forbidden to enter the station and must wait outside.

According to the passenger demand volume and its characteristics during morning peak hours on Shanghai Metro Line 6, as shown in Fig. 4 (data from the Automatic Fare Collection (AFC) system), a large proportion of passengers originate from the suburbs and head towards the city center, leading to a large flow in one direction of a single metro line. Fig. 4(a) shows the inbound volume of different stations in both the up and down directions, while Fig. 4(b) shows the load ratio of different sections in both directions. It can be concluded that the metro line suffers from a heavy flow in the up direction, and that control strategies should be developed at the stations whose inbound volume in the up direction (i.e., heading towards the city center) is high. Under such circumstances, passengers arriving at a control station in both directions are limited from entering the platform.
(Fig. 3 panels: station k on a line of stations 1, 2, …, k, …, n−1, n with trains in the up and down directions, showing inbound/outbound passengers and the alighting and boarding flows at the platform; (a) without control, inbound passengers proceed directly to the platform; (b) under control, a virtual "Outside" area is added before the platform and holds the inflow control volume before passengers are admitted to board.)
Fig. 3. An example of coordinated passenger inflow control.
However, such control may not have a strong effect on passengers in the down direction, since the inbound volume in the down direction is much lower than that in the opposite direction.

The coordinated passenger inflow control strategy is aimed at preventing passengers from being stranded several times and minimizing the relevant safety risks (as previously explained) by relieving the congestion at the platform. Therefore, the strategy for coordinated passenger inflow control can be evaluated based on the number of people waiting at the platform and the frequency of being stranded. In that regard, the problem to be solved in this paper is how to design the best possible coordinated passenger inflow control strategy for a single metro line consisting of several stations. More specifically, the decisions to be optimized are when and at which stations to take control measures and what the optimal control rates are at each control station (measured per unit of time).
4. Mathematical model

4.1. Sets and parameters

For a passenger $j$, the trip chain of interest spans from entering the origin station to leaving the destination platform, as shown in Fig. 5. Over the arrival and departure process of passenger $j$ at a station, six key time points ($T^{A}_{i,j}$ to $T^{F}_{i,j}$) and three different passenger states (i.e., waiting outside the origin station, waiting at the origin station platform, and remaining on the train) are identified. The passengers and their states at each time point jointly characterize the station loading capacity, the platform loading capacity, the cumulative number of passengers in each state, the train available capacity, the train timetable and the passenger inflow control strategies at each station. It is important to note that, unlike models with a given total inbound passenger demand (Jiang et al., 2017), the model in this paper considers demand at the individual passenger level and matches such demand with train capacity. In other words, the model keeps track of each passenger's travel time chain and the cumulative number of passengers in each state to represent the interaction between passengers and service facilities. For the convenience of model formulation, the relevant sets and parameters are listed in Table 2.
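To make the trip-chain bookkeeping concrete, the following minimal Python sketch defines a per-passenger record of the key time points and a helper that reports the passenger's state at a given time. The class, field and function names are illustrative assumptions, not part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PassengerTripChain:
    """Per-passenger record of the key time points of Section 4.1 (seconds from T0).

    The field names are illustrative stand-ins for T^A ... T^F in Tables 2 and 3.
    """
    t_arrive_station: float                    # T^A: arrives outside the origin station
    t_enter_station: Optional[float] = None    # T^B: passes the gate (after any inflow control)
    t_reach_platform: Optional[float] = None   # T^C: reaches the origin platform
    t_board_train: Optional[float] = None      # T^D: boards train k
    t_alight_train: Optional[float] = None     # T^E: alights at the destination platform
    t_exit_station: Optional[float] = None     # T^F: walks out of the destination station

    def state_at(self, t: float) -> str:
        """Return the passenger state at time t, matching the three states of Section 4.1."""
        if t < self.t_arrive_station:
            return "not yet in the system"
        if self.t_enter_station is None or t < self.t_enter_station:
            return "waiting outside the origin station"   # q^1 = 1 on [T^A, T^B)
        if self.t_board_train is None or t < self.t_board_train:
            # q^2 = 1 on [T^C, T^D); the short t^E walking interval is folded in here
            return "waiting at the origin platform"
        if self.t_alight_train is None or t < self.t_alight_train:
            return "remaining on the train"               # q^3 = 1 on [T^D, T^E)
        return "alighted / left the system"
```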
4.2. Variables

The variables used in this paper are listed in Table 3.
4.3. Objective function

For the coordinated passenger inflow control of URT, the objective is to minimize the penalty value of passengers being stranded over the whole metro line. The objective function is given by Eq. (1).
Fig. 4. Passenger demand characteristics: (a) inbound flow in both the up and down directions and (b) load ratio of different sections in both the up and down directions.
Fig. 5. Arrival and departure process of passenger j at a station.
Table 2. Passenger inflow control-related sets and notations.

| Sets/parameters | Definition |
| --- | --- |
| $I$ | Set of stations (platforms), indexed by $i$ |
| $J$ | Set of passengers, indexed by $j$ |
| $K$ | Set of trains, indexed by $k$ |
| $M$ | Set of times, indexed by $t$ |
| $N$ | Set of passenger inflow control time steps, indexed by $n$, $n = 1, 2, \dots, \frac{T_1 - T_0}{t^c}$ |
| $[T_0, T_1]$ | Study period |
| $t^c$ | Length of the passenger inflow control time step |
| $s_j^O$ | Origin station of passenger $j$ |
| $p_j^O$ | Origin platform of passenger $j$ |
| $s_j^D$ | Destination station of passenger $j$ |
| $p_j^D$ | Destination platform of passenger $j$ |
| $r_{k,(i_1,i_2)}$ | Trip time it takes to travel from platform $i_1$ to platform $i_2$ by train $k$ |
| $T_{i,j}^A$ | Time when passenger $j$ arrives at the station, $i = s_j^O$ |
| $t^E$ | Walking time from the origin station gate to the origin platform |
| $T_{i,j}^G$ | Time when passenger $j$ arrives at platform $i$ without passenger inflow control, $T_{i,j}^G = T_{i,j}^A + t^E$, $i = s_j^O$ |
| $T_{i,k}^d$ | Departure time of train $k$ at platform $i$ |
| $C$ | Train capacity |
| $\varphi$ | Maximum load factor of trains |
| $A_i$ | Capacity of platform $i$ |
| $\theta$ | Maximum load factor of platforms |
| $q_{i,j,t}$ | Binary value; if passenger $j$ is inside station $i$ at time $t$ then $q_{i,j,t} = 1$, otherwise $q_{i,j,t} = 0$ |
Table 3. Passenger inflow control-related variables.

| Variables | Definition |
| --- | --- |
| $\beta_j$ | Amplification coefficient of passenger $j$ |
| $P_{i,t}^p$ | Cumulative number of passengers waiting at platform $i$ by time $t$ |
| $P_{i,n}^e$ | Cumulative number of passengers entering station $i$ at time step $n$ |
| $P_{i,n}^a$ | Cumulative number of passengers arriving at station $i$ at time step $n$ |
| $P_{k,t}^r$ | Number of passengers remaining on train $k$ at time $t$ |
| $T_{i,j}^B$ | Time when passenger $j$ enters origin station $i$ |
| $T_{i,j}^C$ | Time when passenger $j$ arrives at origin platform $i$ |
| $T_{j,k}^D$ | Time when passenger $j$ boards train $k$ |
| $T_{i,j}^E$ | Time when passenger $j$ alights from the train at destination platform $i$ |
| $T_{i,j}^F$ | Time when passenger $j$ walks out of the destination station |
| $q_{i,j,t}^1$ | Binary variable; if passenger $j$ is outside origin station $i$ at time $t$ then $q_{i,j,t}^1 = 1$, otherwise $q_{i,j,t}^1 = 0$ |
| $q_{i,j,t}^2$ | Binary variable; if passenger $j$ is at origin platform $i$ at time $t$ then $q_{i,j,t}^2 = 1$, otherwise $q_{i,j,t}^2 = 0$ |
| $q_{j,k,t}^3$ | Binary variable; if passenger $j$ remains on train $k$ at time $t$ then $q_{j,k,t}^3 = 1$, otherwise $q_{j,k,t}^3 = 0$ |
| $q_{j,k}^4$ | Binary variable; if $T_{i,j}^C \le T_{i,k}^d$, $i = p_j^O$, and train $k$ can load passenger $j$ then $q_{j,k}^4 = 1$, otherwise $q_{j,k}^4 = 0$ |
| $R_{i,(T_3,T_4)}$ | Number of departing trains at platform $i$ from time $T_3$ to time $T_4$ |
| $a_{i,n}$ | Passenger inflow control rate at station $i$ at time step $n$ |
| $d_j$ | Frequency (i.e., times) of passenger $j$ being stranded at the origin station |
$$\min Z = \sum_{i \in I} \sum_{j \in J} \beta_j \cdot \left(T_{j,k}^D - T_{i,j}^G\right) \tag{1}$$

$$\beta_j = \begin{cases} e^{2 d_j}, & 0 \le d_j \le 3,\ j \in J \\ e^{8}, & d_j \ge 4,\ j \in J \end{cases} \tag{2}$$

$$d_j = R_{i,\left(T_{i,j}^G,\ T_{j,k}^D\right)} - 1, \quad j \in J,\ k \in K,\ i = p_j^O \tag{3}$$
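As a concrete reading of Eqs. (1)–(3), the sketch below computes the stranding count $d_j$, the amplification coefficient $\beta_j$ and one passenger's contribution to $Z$. It is a minimal illustration assuming times are expressed in minutes; the helper names are not from the paper.

```python
import math

def stranding_count(trains_departed: int) -> int:
    """d_j per Eq. (3): trains that left the origin platform between the passenger's
    (uncontrolled) platform arrival time T^G and the boarding time T^D, minus 1."""
    return trains_departed - 1

def amplification(d_j: int) -> float:
    """beta_j per Eq. (2): e^(2*d_j) for 0 <= d_j <= 3, capped at e^8 for d_j >= 4."""
    return math.exp(2 * d_j) if d_j <= 3 else math.exp(8)

def penalty_term(t_board: float, t_platform_arrival_uncontrolled: float, d_j: int) -> float:
    """One passenger's term beta_j * (T^D - T^G) in the objective of Eq. (1)."""
    return amplification(d_j) * (t_board - t_platform_arrival_uncontrolled)

# Example: a passenger who would have reached the platform at minute 0 boards at
# minute 16 after three trains have departed (i.e., stranded twice).
print(penalty_term(16.0, 0.0, stranding_count(3)))  # 16 * e^4
```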
4.4. Constraints

$$T_{i,j}^C = T_{i,j}^B + t^E, \quad j \in J,\ i = s_j^O / p_j^O \tag{4}$$

$$T_{j,k}^D = T_{i,k}^d, \quad q_{j,k}^4 = 1,\ j \in J,\ k \in K,\ i = p_j^O \tag{5}$$

$$T_{i,j}^E = T_{j,k}^D + r_{k,(p_j^O,\, p_j^D)}, \quad q_{j,k}^4 = 1,\ j \in J,\ k \in K,\ i = p_j^D \tag{6}$$

$$q_{i,j,t}^1 = \begin{cases} 1, & t \in [T_{i,j}^A, T_{i,j}^B),\ j \in J,\ i = s_j^O \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

$$q_{i,j,t}^2 = \begin{cases} 1, & t \in [T_{i,j}^C, T_{j,k}^D),\ j \in J,\ k \in K,\ i = p_j^O \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

$$q_{j,k,t}^3 = \begin{cases} 1, & t \in [T_{j,k}^D, T_{i,j}^E),\ q_{j,k}^4 = 1,\ i = p_j^O / p_j^D,\ j \in J,\ k \in K \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

$$\sum_{k \in K} q_{j,k}^4 = 1, \quad j \in J \tag{10}$$

$$P_{i,t}^p = \sum_{j \in J} q_{i,j,t}^2, \quad i \in I,\ t \in M \tag{11}$$

$$P_{i,t}^p \le A_i \cdot \theta, \quad i \in I,\ t \in M \tag{12}$$

$$P_{k,t}^r = \sum_{j \in J} q_{j,k,t}^3, \quad k \in K,\ t \in M \tag{13}$$

$$P_{k,t}^r \le C \cdot \varphi, \quad k \in K,\ t \in M \tag{14}$$

$$P_{i,n}^e = \sum_{j \in J} \sum_{t = T_0 + (n-1) \cdot t^c}^{T_0 + n \cdot t^c} q_{i,j,t}, \quad i \in s_j^O,\ n \in N \tag{15}$$

$$P_{i,n}^a = \sum_{j \in J} \sum_{t = T_0 + (n-1) \cdot t^c}^{T_0 + n \cdot t^c} q_{i,j,t}^1, \quad i \in s_j^O,\ n \in N \tag{16}$$

$$a_{i,n} = \frac{P_{i,n}^a - P_{i,n}^e}{P_{i,n}^a}, \quad i \in I,\ n \in N \tag{17}$$
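Before the constraint-by-constraint discussion, the following minimal sketch illustrates how the realized inflow control rate of Eq. (17) follows from the cumulative entering and arriving counts of Eqs. (15) and (16) for one station and one time step; the function name and inputs are illustrative, not from the paper.

```python
def inflow_control_rate(entered_in_step: int, arrived_in_step: int) -> float:
    """a_{i,n} per Eq. (17): the share of arriving passengers not admitted during step n.

    entered_in_step ~ P^e_{i,n} (Eq. 15): passengers counted inside station i during step n.
    arrived_in_step ~ P^a_{i,n} (Eq. 16): passengers counted arriving at station i during step n.
    """
    if arrived_in_step == 0:
        return 0.0
    return (arrived_in_step - entered_in_step) / arrived_in_step

# Example: 100 passengers arrive during a 15-min step and 80 are admitted -> control rate 0.2.
print(inflow_control_rate(80, 100))
```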
Formula (3) defines $d_j$, the number of times passenger $j$ is stranded, and relates it to the number of trains departing from the origin platform $p_j^O$ during the period between $T_{i,j}^G$ and $T_{j,k}^D$. Constraints (4)–(6) indicate the times when passenger $j$ arrives at the origin platform, boards the train and alights from the train, respectively; these time points can be used to infer the trip chain between the origin station $s_j^O$ and the destination platform $p_j^D$ of passenger $j$. Constraints (7)–(10) denote the different passenger states, corresponding to the waiting time before entering the origin station, the waiting time at the origin platform, and the time remaining on the train before arriving at the destination platform $p_j^D$. Constraints (11) and (12) give the cumulative number of passengers waiting at platform $i$ at time $t$ while ensuring the safety of passengers on the platforms. Constraints (13) and (14) specify the cumulative number of passengers remaining on the trains, whose values must not exceed the maximum train capacity. Constraints (15)–(17) determine the passenger inflow control rate at station $i$ at time step $n$.

5. Reinforcement learning for coordinated passenger inflow control

In order to approach the problem with reinforcement learning, the following subsections provide the principles and fundamental components of reinforcement learning for coordinated passenger inflow control, including the environment and its states, the action set, the reward function and the algorithm.

5.1. Reinforcement learning

As an important branch of machine learning, reinforcement learning is an approach to learning what to do and how to map situations to actions so as to maximize a numerical reward signal (Sutton and Barto, 1998).
Fig. 6. Reinforcement learning agent and environment interaction in the case of coordinated passenger inflow control.
More specifically, in reinforcement learning the agent aims to find an optimal control policy that maximizes the expected rewards through trial-and-error interaction with a dynamic environment. During the learning process, the agent observes the environment's state and decides upon an action. After the action is executed, the environment switches to the subsequent state and gives the agent the reward or penalty incurred by the action, and the value function is updated accordingly. The reward or penalty indicates whether the selected action is good in an immediate sense, while the value function specifies what is good in the long run. The goal of the agent is to maximize this long-term reward by learning a good policy, which is a mapping from perceived states to actions (Arel et al., 2010).

Fig. 6 depicts the interaction between the agent and the environment in reinforcement learning for the particular case of coordinated passenger inflow control in this paper. An agent corresponds to each station on the single metro line, making decisions and taking actions on different control rates of inflow volume, while the environment is the condition of passenger demand at each station. The environment is passive, since all the decisions about the actions that change its states come from the agents. The reward given to each station after executing an action is inversely proportional to the prolonged waiting time incurred by the passenger inflow control.

Q-learning (Watkins and Dayan, 1992) is a typical reinforcement learning method for agents to learn how to act optimally in controlled Markovian domains. In the Q-learning algorithm, the Q-value represents the expected utility of an agent executing a given action in the current state, and the agent intends to select the action with the maximal utility, i.e., the Q-value. The Q-value function for a particular action and state is updated according to the reward received from the environment after executing the action in the current state. For each station $i$, the Q-value is initialized to 0 at the beginning of the learning process. The Q-value is then updated using the Q-learning update rule given by Eqs. (18) and (19):
$$Q(i, s_n, a_v) = (1 - \alpha)\, Q(i, s_n, a_v) + \alpha \left[ R_i(s_n, a_v) + \gamma V^* \right] \tag{18}$$

$$V^* = \max_{a_v \in A} Q(i, s_{n+1}, a_v) \tag{19}$$
where $\alpha \in [0,1]$ is the learning rate, $\gamma \in [0,1]$ is the discount rate, $R_i(s_n, a_v)$ is the reward given to station $i$ after executing action $a_v$ in state $s_n$, and $Q(i, s_n, a_v)$ and $Q(i, s_{n+1}, a_v)$ are the values of the Q function of station $i$ for the respective states and actions.

In Q-learning, the agent needs to explore the environment by taking different actions at random to avoid local optima. However, it also needs to select the action with the highest Q-value to reach the goal. This process is called the balance between exploration and exploitation. Several types of probability distribution have been proposed to ensure a good balance between exploration and exploitation, and the Boltzmann distribution has been commonly used (Kaelbling et al., 1996), which is formulated as follows:
$$\pi[a_v \mid s_n] = \frac{\exp\left[Q(i, s_n, a_v)/\tau\right]}{\sum_{a_v \in A} \exp\left[Q(i, s_n, a_v)/\tau\right]} \tag{20}$$
where τ is the temperature parameter used to control the degree of exploration. At the beginning of the learning process, the value of τ is large so that the action is selected randomly (i.e., the agent explores). With the deepening of learning, the value of τ is reduced so
that the agent always selects the action with the highest Q-value to approach the goal (i.e., it exploits).
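To make Eqs. (18)–(20) concrete, the following is a minimal Python sketch of a single Q-value update and of the Boltzmann action probabilities; the dictionary-based Q-table and all names are illustrative assumptions, not the authors' implementation.

```python
import math

ALPHA, GAMMA = 0.1, 0.7   # learning rate and discount rate (the values later used in Section 6.1)

def q_update(Q, station, s_n, a_v, reward, s_next, actions):
    """One application of Eqs. (18)-(19) for a given station (agent)."""
    v_star = max(Q.get((station, s_next, a), 0.0) for a in actions)                   # Eq. (19)
    old = Q.get((station, s_n, a_v), 0.0)
    Q[(station, s_n, a_v)] = (1 - ALPHA) * old + ALPHA * (reward + GAMMA * v_star)    # Eq. (18)

def boltzmann_probabilities(Q, station, s_n, actions, tau):
    """Action-selection probabilities of Eq. (20); a large tau gives near-uniform exploration."""
    prefs = [math.exp(Q.get((station, s_n, a), 0.0) / tau) for a in actions]
    total = sum(prefs)
    return [p / total for p in prefs]

# Tiny example: two actions, a negative reward (penalty) for the chosen one.
Q, actions = {}, [0.0, 0.2]
q_update(Q, station="B", s_n=0.8, a_v=0.2, reward=-120.0, s_next=0.9, actions=actions)
print(Q)                                                    # {('B', 0.8, 0.2): -12.0}
print(boltzmann_probabilities(Q, "B", 0.8, actions, tau=5.0))
```

Lowering `tau` over the episodes concentrates the probability mass on the highest-valued action, which is how the exploration-to-exploitation transition described above can be realized.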
5.2. Environment and simulation

Three main elements influence the coordinated passenger inflow control strategy. The first element is the real-time passenger demand, including the passenger demand of origin-destination (O-D) pairs on a single metro line, the inflow demand at each station and the number of passengers waiting at each platform. The second element comprises the trains, each of which is characterized by its arrival and departure times and its available capacity when arriving at each station. The third element refers to the stations, for which the platform capacity is of particular interest. In this paper, the above elements are designed and implemented to represent, with great accuracy, the interaction between passengers and trains (Jiang et al., 2016) and the coordination among stations on the same metro line (Jiang et al., 2012).

In particular, the following assumptions are made when implementing the environment simulator to ensure high computational efficiency while still guaranteeing an appropriate accuracy of the simulation. First, it is assumed that the original passenger demand over the entire peak period will not be reduced by the passenger inflow control. Second, all the trains operate according to the given timetable (D'Acierno et al., 2017), which means there are no train delays. Third, it is assumed that the passenger inflow control strategy can be implemented at any station on the single metro line. Fourth, it is assumed that transfer passengers from other lines have already been converted to inbound passengers and that transfer passengers to other lines have been converted to outbound passengers. The final assumption is that passengers follow the rule of "first-come, first-served".

For the purpose of implementing the reinforcement learning approach to solve the coordinated passenger inflow control problem, a simulation platform is built based on the assumptions mentioned above. Both the simulator and the reinforcement learning algorithm are implemented in the Microsoft Access and Visual Studio 2013 environment. The simulator implements the gathering and scattering process at each station on the single metro line, which usually consists of three sub-procedures: the arrival, departure and alighting-boarding processes (Xu et al., 2014; Xu et al., 2016). In the context of the specific problem domain in this paper, the arrival process is separated into two steps: in the first step, passengers get to the entrance of the station; in the second step, they arrive at the platform. The former step requires the arrival distribution of inbound passengers and their corresponding destinations, which is used as input to the simulation; with the help of AFC data, this information is known and fixed over the period of interest. The latter step is governed by the control rate of inflow volume per unit of time. If the control rate is 0, passengers arrive at the platform following the arrival distribution of the first step. If the control rate is greater than 0, those who are not allowed to enter the platform wait outside and enter the platform according to the control rate of the next period and the arrival sequence, which includes the passengers who waited outside during the last period. The alighting-boarding process involves both passengers and trains and their interaction. As such, the train is an important simulator object.
The planned timetable specifies the time points of scheduled train arrivals and departures at each station and the dwell time at each station, while the maximum capacity and the number of passengers on board determine the number of boarding passengers at each station and the number of passengers being stranded. These basic objects of the environment simulator are made available to the reinforcement learning algorithm for coordinated passenger inflow control through the environment state which will be discussed in detail in the following subsection.
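As one possible reading of the two-step arrival process described above, the sketch below applies a control rate over one control time step: newly arriving passengers join those already queued outside, and only a fraction determined by the control rate is released to the platform in arrival order (first-come, first-served). The queue representation and function name are illustrative.

```python
from collections import deque

def release_to_platform(outside_queue: deque, new_arrivals: list, control_rate: float):
    """One control time step of the arrival process (Section 5.2).

    outside_queue: passengers already waiting outside from earlier steps (FIFO).
    new_arrivals:  passengers reaching the station gate during this step (AFC-based demand).
    control_rate:  fraction of would-be entrants held outside (0.0 means no control).
    Returns the passengers admitted to the platform this step; the rest stay queued outside.
    """
    outside_queue.extend(new_arrivals)            # earlier waiters keep their place in line
    n_total = len(outside_queue)
    n_admit = int(round(n_total * (1.0 - control_rate)))
    return [outside_queue.popleft() for _ in range(n_admit)]

# Example: 30 passengers left over from the last step, 100 new arrivals, 20% control rate.
queue = deque(range(30))
admitted = release_to_platform(queue, list(range(30, 130)), control_rate=0.2)
print(len(admitted), len(queue))   # 104 admitted, 26 still waiting outside
```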
5.3. State description

Station states characterize the state of each station on the single metro line at a certain point in time. It is assumed that the control rate of inflow volume for each station is changed at so-called control time steps. We define $t^c$ as the control time step size, based on which the total simulation time is divided into $n$ control time steps. For example, if the control time step size is $t^c = 15$ min, then the control rate of inflow volume for control stations may be changed every 15 min during the simulation. Considering maneuverability in real cases, the control time step size should not be less than 15 min. Let $s_n$ denote the state at control time step $n$. For the passenger inflow control problem, the state characterizes the real-time condition of passenger demand for each station at different control time steps and is defined as:
$$s_n = \frac{D(n)}{D'(n)} \tag{21}$$
The original inflow volume of the station at control time step $n$ is defined as $D(n)$. Due to the passenger inflow control, some passengers who arrive at the station at step $n-1$ may have to wait outside the station and are not allowed to enter until the next step $n$. The sum of the volume of these passengers and $D(n)$ is the current inflow volume of the station at control time step $n$, denoted by $D'(n)$.
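A minimal sketch of the state of Eq. (21), assuming (as the stacked layout of the extracted equation suggests) that the state is the ratio of the original inflow volume $D(n)$ to the current inflow volume $D'(n)$; the function and argument names are illustrative.

```python
def station_state(original_inflow: int, carried_over_outside: int) -> float:
    """State s_n of Eq. (21) for one station and control time step.

    D(n)  = original_inflow: passengers newly arriving at the station during step n.
    D'(n) = D(n) + carried_over_outside: current inflow including those held outside at step n-1.
    The ratio is 1.0 when no one is waiting outside and decreases as the backlog grows.
    """
    d_n = original_inflow
    d_prime_n = original_inflow + carried_over_outside
    return d_n / d_prime_n if d_prime_n > 0 else 1.0

print(station_state(100, 0))    # 1.0  -> no backlog outside the station
print(station_state(100, 25))   # 0.8  -> a fifth of the current demand is carried-over backlog
```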
Fig. 7. URT network topology in Shanghai.
5.4. Action set

The action set $A$ contains the control rates of inflow volume for each station on a single metro line. The action $a_v \in A$ selected at each control time step $n$ represents the percentage of passengers forbidden to enter the station at step $n$. Considering maneuverability in real cases, the action space is defined as $A = \{0\%, 20\%, 40\%, 60\%, 80\%, 100\%\}$. For example, assume that 100 people would enter station $i$ at control time step $n$. If the action $a_2 = 20\%$ is executed at station $i$ at step $n$, then the number of people allowed to enter the station at step $n$ is reduced to 80. Accordingly, the action $a_1 = 0\%$ indicates that no one is forbidden to enter the station, while the action $a_6 = 100\%$ means that no one is allowed to enter the station.

5.5. Reward function

A reward function defines the goal in a reinforcement learning problem (Sutton and Barto, 1998). In this paper, the objective is to minimize the safety risks at metro stations. Since the frequency of passengers being stranded is the main driver of the safety risks, the reward function is directly related to the punishment for passengers being stranded both inside and outside the station. The reward for station $i$ after executing the action $a_v$ in the state $s_n$ is defined as:
$$R_i(s_n, a_v) = -\left[ \sum_{g \in I} \left( W(g, s_n)\, \delta_i(g, s_n) \right) + LR_i(s_n, a_v) \right] \tag{22}$$
where $LR_i(s_n, a_v)$ is the long-term efficiency.
It should be noted that the reward value consists of two parts. The first element, $\sum_{g \in I} (W(g, s_n)\, \delta_i(g, s_n))$, denotes the reward value for station $i$ after taking the action $a_v$ in the state $s_n$ during the passenger inflow control time periods. The second component, $LR_i(s_n, a_v)$, represents the penalty value of $\sum_{g \in I} (W(g, s_n)\, \delta_i(g, s_n))$ at station $i$ when all of the passenger inflow control strategies have been implemented. In particular, the long-term efficiency explicitly accounts for the impact of the passengers stranded during the passenger inflow control time periods on the passenger flows during subsequent non-controlled time periods. Note that such long-term efficiency is jointly determined by both $P_{i,n}^e$ and $P_{i,n}^a$ (i.e., the cumulative numbers of passengers entering and arriving at station $i$ at time step $n$). Due to the coordinated passenger inflow control and the limited platform capacity, passengers at each station are divided into two types, as shown in Fig. 3(b): those who are forbidden to enter the station and have to wait outside, and those who wait at the platform. The punishment for those who wait at the station is quantified using Eq. (23):
$$W(g, s_n) = \beta \left[ w_1(g, s_n) + w_2(g, s_n) \right] \tag{23}$$
where $\beta$ is the amplification coefficient, which increases nonlinearly and exponentially with the frequency of being stranded (i.e., $d_j$ as given in Eq. (3)), and $w_1(g, s_n)$ is the length of time stranded at the platform. If the action $a_v$ is executed at station $g$ in the state $s_n$, the punishment for those who wait outside the station refers to the extra waiting time spent outside station $g$ and is defined as $w_2(g, s_n)$. For example, assume that a passenger arrives at the station at 8:00 am (the moment he/she enters the station according to the AFC data). Due to the passenger inflow control, however, he/she has to wait outside and only reaches the platform at 8:10 am, so the time loss outside the station is 10 min. After reaching the platform, he/she fails to get on the train that departs at 8:12 am; due to the limited train capacity, he/she is stranded twice and finally boards the train that departs at 8:18 am. The punishment is then $960e^6$ (i.e., $W(g, s_n) = e^{2 \times 3} \times [(10-0) + (18-12)] \times 60 = 960e^6$). Since $W(g, s_n)$ is related not only to the inflow volume of station $g$ but also to the train capacity occupied at the upstream stations, $\delta_i(g, s_n)$ with $i, g \in I$ is defined as the punishment proportion that station $i$ should bear for the punishment of station $g$.

5.6. Reinforcement learning algorithm

The reinforcement learning algorithm process is presented as follows:
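The algorithm appears in the published paper as a boxed listing that does not survive text extraction. The following is a minimal Python-style sketch of the training loop reconstructed from the components described in Sections 5.1–5.5; the environment interface (reset, state_of, apply_and_simulate, rewards_of) and all other names are assumptions, not the authors' code.

```python
import math
import random

def train_inflow_control(env, stations, actions, episodes=100,
                         alpha=0.1, gamma=0.7, tau0=10.0, tau_min=0.1, tau_decay=0.95):
    """Sketch of the Q-learning training loop behind Section 5.6 (illustrative only).

    `env` is assumed to expose reset(), num_control_steps, state_of(station, step),
    apply_and_simulate(rates, step) and rewards_of(step); these stand in for the
    simulator described in Section 5.2.
    """
    Q = {i: {} for i in stations}            # one Q-table per station (agent)
    tau = tau0                               # Boltzmann temperature of Eq. (20)
    for _ in range(episodes):
        env.reset()
        for n in range(env.num_control_steps):
            # 1. Observe each station's state and sample a control rate via Eq. (20).
            states, chosen = {}, {}
            for i in stations:
                s = env.state_of(i, n)
                weights = [math.exp(Q[i].get((s, a), 0.0) / tau) for a in actions]
                states[i], chosen[i] = s, random.choices(actions, weights=weights)[0]
            # 2. Apply the chosen rates and simulate one control time step t^c
            #    (passenger arrivals, alighting/boarding, train movements).
            env.apply_and_simulate(chosen, n)
            # 3. Collect the rewards of Eq. (22) and update Q-values via Eqs. (18)-(19).
            rewards = env.rewards_of(n)
            for i in stations:
                s, a = states[i], chosen[i]
                s_next = env.state_of(i, n + 1) if n + 1 < env.num_control_steps else s
                v_star = max(Q[i].get((s_next, b), 0.0) for b in actions)
                Q[i][(s, a)] = (1 - alpha) * Q[i].get((s, a), 0.0) \
                               + alpha * (rewards[i] + gamma * v_star)
        tau = max(tau * tau_decay, tau_min)  # exploration decreases as learning deepens
    return Q
```

After training, an inflow control strategy such as the one reported in Section 6.2.1 would correspond to reading each station's Q-table greedily: for the state observed at a control time step, choose the control rate with the highest Q-value.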
Fig. 8. The learning curve (i.e. change of the average total loss with increasing number of training episodes) obtained with reinforcement learning for the simulation scenario using 100 training episodes.
6. Example and analysis

In this paper, the use and performance of the reinforcement learning approach to coordinated passenger inflow control for URT are evaluated and illustrated using a real-world example based on a metro line in Shanghai, namely Line 6.

6.1. Simulation scenario

The topology of the URT network in Shanghai in mid-2016 consists of 14 lines with 366 stations, as shown in Fig. 7. Line 6, marked in red in Fig. 7, consists of 28 stations with a total length of 36.1 km. During morning peak hours, passenger congestion occurs in the direction from Gangcheng Road to Oriental Sports Center. In order to obtain the number of passengers on trains and on platforms, the passenger flow during the morning peak hours needs to be simulated. Passenger flow data with 95,850 O-D pairs on March 3, 2016, provided by the AFC system, was therefore selected and used as the input to the experiment. The initial timetable was an actual weekday operation timetable of Line 6 on March 3, 2016. This timetable, named 622_1, is compiled by the train plan maker (TPM) software (Jiang and Xiao, 2014) with an average interval of 140 s in peak hours. The capacity of each train is set to 1008 passengers, which is the product of the passengers per train unit and the number of train units, adjusted to a maximum capacity using a maximum load factor (i.e., 210 × 4 × 1.2 = 1008). The simulation time horizon is defined from 8:00 to 11:00 a.m., covering the morning peak hours. The simulation is implemented in Visual Studio 2013 running on a personal computer with an Intel Core i5-3470 CPU at 3.20 GHz and 8 GB of RAM under Windows 8 64-bit. In addition, the following parameter values are used: φ = 1.4, θ = 0.8, α = 0.1, γ = 0.7, $t^E$ = 2 min, $t^c$ = 15 min, A = {0%, 20%, 40%, 60%, 80%, 100%}, and 100 training episodes.

To properly evaluate the performance of reinforcement learning, the experiment consists of 5 runs with 100 training episodes per run, using a different random-generator seed in each run. The average computational time for 100 training episodes is 50 s. Note that the coordinated passenger inflow control problem is essentially a large-scale nonlinear combinatorial optimization problem, as can easily be seen from the solution space of this case study: there are 28 stations on Line 6, 3 × 60/15 = 12 decision intervals within 3 h, and 6 potential actions for each interval at each station, so the solution space consists of $(6^{12})^{28} \approx 2.8762 \times 10^{261}$ possible solutions. In this regard, searching exhaustively for the global optimum is computationally intractable, and the use of the reinforcement learning approach for coordinated passenger inflow control of urban rail transit in peak hours is therefore well justified.

6.2. Results and analysis

As mentioned, the experiment consists of 5 runs with 100 training episodes per run. The result of each training episode is an improved inflow control strategy: the quality of the inflow control strategy is expected to improve, while the punishment for passengers being stranded is expected to decrease.
Table 4. The inflow control strategy for Line 6.

| Inflow control station | Time to take control strategies | Inflow control rate | Corresponding inbound volume |
| --- | --- | --- | --- |
| Dongjing Road | 8:00–8:15 | 20% | 595 |
| Jufeng Road | 8:00–8:15 | 20% | 1105 |
| Wulian Road | 8:00–8:15 | 20% | 594 |
| Boxing Road | 8:00–8:15 | 20% | 945 |
| Beiyangjing Road | 8:15–8:30 | 20% | 253 |
Table 5. Number of passengers being stranded from 8 am to 9 am.

| Station | Strategy | Once | Twice | Three times | More than three times |
| --- | --- | --- | --- | --- | --- |
| Yuanshen Sports Centre Stadium | Without control | 153 | 15 | 0 | 0 |
| | Under control | 112 | 0 | 0 | 0 |
| Minsheng Road | Without control | 319 | 169 | 118 | 11 |
| | Under control | 375 | 100 | 5 | 0 |
| Beiyangjing Road | Without control | 221 | 130 | 23 | 0 |
| | Under control | 303 | 91 | 0 | 0 |
| Deping Road | Without control | 465 | 281 | 37 | 0 |
| | Under control | 688 | 88 | 0 | 0 |
| Yunshan Road | Without control | 460 | 0 | 0 | 0 |
| | Under control | 308 | 0 | 0 | 0 |
| Jinqiao Road | Without control | 176 | 0 | 0 | 0 |
| | Under control | 0 | 0 | 0 | 0 |
| Boxing Road | Without control | 59 | 0 | 0 | 0 |
| | Under control | 0 | 0 | 0 | 0 |
| The whole line | Without control | 2223 | 595 | 178 | 11 |
| | Under control | 2156 | 279 | 5 | 0 |
All training episodes together constitute one learning run in which the agents learn the inflow control strategy for the oversaturated single metro line. Fig. 8 depicts the learning curve for the simulation scenario described in Section 6.1: the red dashed line labeled "FC_RL_avg" is a fitted curve representing the average trend of the 5 learning curves obtained from 5 learning runs with different random-seed values, while the solid line labeled "RL_stDev" corresponds to the standard deviation. The learning curve shows that the inflow control strategy improves with the training episodes, which means the agents (stations) learn a strategy that leads to a lower total loss. It can also be seen that the standard deviation across the 5 runs decreases with the training episodes, because the exploration rate (i.e., the probability that an action is selected randomly) decreases as learning deepens.

6.2.1. Inflow control strategy

With the help of reinforcement learning, the inflow control strategy for metro Line 6 is determined, including the control stations, the inflow control rates and the times to implement the control strategies, as shown in Table 4. For all other stations on Line 6 that are not listed in Table 4, it is recommended that no inflow control strategies be taken.

It is very important to note that the strategy obtained by reinforcement learning differs from the one formulated from the subjective work experience of the operation staff, whose inflow control rates are usually subjective and insufficient. In practice, the control stations of Line 6 include Dongjing Road, Jufeng Road, Wulian Road, Boxing Road, Jinqiao Road and Deping Road, and the control strategies are applied from 7:20 to 8:40. Compared with the strategy applied in practice, the inflow control strategy obtained by the reinforcement learning approach reduces the number of control stations from 6 to 5 and greatly shortens the duration of the control strategy. Moreover, the inflow control rate and the corresponding inbound volume can be easily quantified, overcoming the drawbacks of the strategy used in practice, which suffers from the sub-optimality and uncertainty of the operation staff's work experience. It can be concluded that the inflow control strategy obtained by using the reinforcement learning approach is more accurate and has greater maneuverability.

6.2.2. The frequency of passengers being stranded in stations

The frequency of passengers being stranded is one of the performance indicators that can be used to evaluate the quality of the inflow control strategy obtained by reinforcement learning. As mentioned before, if several passengers are stranded more than twice at a station, serious hidden dangers can arise. The number of passengers being stranded in stations is compared with the result without inflow control in Table 5. Before taking any control measures, some passengers are stranded more than twice at stations including Minsheng Road, Beiyangjing Road and Deping Road. After taking the control strategy obtained by using the reinforcement learning approach (as presented in Table 5), the average frequency of passengers being stranded is reduced to less than twice, and the number of passengers being stranded more than twice is significantly reduced from 189 to 5.
Fig. 9. The change of waiting passengers at platform without inflow control and under inflow control.
6.2.3. The number of passengers waiting at the platform

Another indicator used to evaluate the inflow control strategy is the number of passengers waiting at the platform. Fig. 9 presents the results at four stations of interest (Minsheng Road, Beiyangjing Road, Deping Road and Jinqiao Road) to evaluate the effect of the inflow control strategy. Since the train capacity is quickly occupied at the upstream stations (Dongjing Road, Jufeng Road, Wulian Road and Boxing Road), the number of waiting passengers at these four stations of interest exceeds the critical density of the platform without passenger control. According to the control strategy obtained by using reinforcement learning, the upstream stations take control measures during the time period 8:00–8:15, so that train capacity is reserved for the four stations of interest, passengers waiting at these stations can get aboard, and the number of waiting passengers is reduced. Considering the relative positions of these stations, the numbers of waiting passengers are greatly reduced during the time period 8:14–8:24 under passenger control, as shown in Fig. 9(c)–(d). In that regard, the results clearly indicate that the inflow control strategy obtained by reinforcement learning can greatly help relieve the platform congestion on Line 6 and can be effectively applied in practice.

7. Conclusion

In this paper, a model of coordinated dynamic passenger inflow control is built with the objective of minimizing the penalty value of passengers being stranded over a whole metro line, while explicitly accounting for the constraints of passenger trip chains and states, the capacity of the platforms and trains, and the scheduled train timetable. A new method based on reinforcement learning is presented to optimize the inflow volume of each station during a certain period of time with the aim of minimizing the safety risks at metro stations. In order to determine the inflow control stations, the optimal inflow control volume of each control station and the times to take control strategies, a detailed presentation of the basic principles and fundamental components of reinforcement learning is given, including the environment and its states, the action set, the reward function and the algorithm. In the case study, metro Line 6 in Shanghai is used to demonstrate the performance of the model and approach. In comparison with the result without inflow control, the number of passengers being stranded more than twice is significantly reduced and the passenger congestion at certain stations is also greatly relieved. The plan for coordinated passenger inflow control achieved by solving the model using reinforcement learning can be useful for the operation staff when determining the inflow control stations, the inflow control volume of each station and the times to take control strategies in practice.
The proposed approach can be considered a first step towards the use of reinforcement learning for coordinated passenger inflow control. As this line of research matures, further work should extend its application to multiple lines in an urban rail transit network. Moreover, train delays are not considered in the proposed approach; they may occur in practice as a result of extended dwell times caused by large numbers of boarding passengers and can therefore influence the passenger inflow control strategy. It is also well understood that the passenger flow control strategy may be restricted by many external factors, including the station layout, passenger transfer characteristics, origin and terminal stations, etc., which may need to be accounted for accordingly. These issues will be addressed in future research.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61473210) and the Fundamental Research Funds for the Central Universities. The acquisition of the operations data in the paper was supported by the Shanghai Shentong Metro Management Center. The authors are grateful for this support.

References

Arel, I., Liu, C., Urbanik, T., Kohls, A.G., 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intel. Transport Syst. 4 (2), 128–135.
Bauer, D., Seer, S., Brandle, N., 2007. Macroscopic pedestrian flow simulation for designing crowd control measures in public transport after special events. In: Summer Computer Simulation Conference 2007, San Diego, California, USA, pp. 1035–1042.
Chen, X., Li, H., Miao, J., Jiang, S., Jiang, X., 2017. A multiagent-based model for pedestrian simulation in subway stations. Simul. Model. Pract. Theory 71, 134–148.
D'Acierno, L., Botte, M., Placido, A., Caropreso, C., Montella, B., 2017. Methodology for determining dwell times consistent with passenger flows in the case of metro services. Urban Rail Transit 3 (2), 73–89.
Delgado, F., Munoz, J.C., Giesen, R., 2012. How much can holding and/or limiting boarding improve transit performance? Transp. Res. Part B: Methodol. 46 (9), 1202–1217.
Feng, S., Ding, N., Chen, T., Zhang, H., 2013. Simulation of pedestrian flow based on cellular automata: a case of pedestrian crossing street at section in China. Phys. A: Stat. Mech. Appl. 392 (13), 2847–2859.
Gosavi, A., 2009. Reinforcement learning: a tutorial survey and recent advances. Inf. J. Comput. 21 (2), 178–192.
Jiang, M., Li, H.Y., Xu, X.Y., Xu, S.P., Miao, J.R., 2017. Metro passenger flow control with station-to-station cooperation based on stop-skipping and boarding limiting. J. Cent. S. Univ. 24 (1), 236–244.
Jiang, Z., Hsu, C., Zhang, D., Zou, L., 2016. Evaluating rail transit timetable using big passengers' data. J. Comput. Syst. Sci. 82 (1), 144–155.
Jiang, Z., Li, F., Xu, R.H., Gao, P., 2012. A simulation model for estimating train and passenger delays in large-scale rail transit networks. J. Cent. S. Univ. 19 (12), 3603–3613.
Jiang, Z.B., Xiao, X.Y., 2014. A turn-back track constraint train scheduling algorithm on a multi-interval rail transit line. Comp. Rail. XIV 135, 151–162.
Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res. 4 (1), 237–285.
Khamis, M.A., Gomaa, W., 2014. Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework. Eng. Appl. Artif. Intell. 29, 134–151.
Kwon, E., Stephanedes, Y.J., 1994. Comparative evaluation of adaptive and neural-network exit demand prediction for freeway control. Transp. Res. Rec. 1446, 66–76.
Martinez-Gil, F., Lozano, M., Fernández, F., 2014. MARL-Ped: a multi-agent reinforcement learning based framework to simulate pedestrian groups. Simul. Model. Pract. Theory 47, 259–275.
Šemrov, D., Marsetič, R., Žura, M., Todorovski, L., Srdic, A., 2016. Reinforcement learning approach for train rescheduling on a single-track railway. Transp. Res. Part B: Methodol. 86, 250–267.
Spall, J.C., Chin, D.C., 1994. A model-free approach to optimal signal light timing for system-wide traffic control. In: Proceedings of the IEEE Conference on Decision and Control, pp. 1868–1875.
Sutton, R.S., Barto, A.G., 1998. Reinforcement learning: an introduction. IEEE Trans. Neural Netw. 9 (5), 1054.
Walraven, E., Spaan, M.T.J., Bakker, B., 2016. Traffic flow optimization: a reinforcement learning approach. Eng. Appl. Artif. Intell. 52, 203–212.
Wan, J., Sui, J., Yu, H., 2014. Research on evacuation in the subway station in China based on the combined social force model. Phys. A: Stat. Mech. Appl. 394, 33–46.
Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Mach. Learn. 8 (3–4), 279–292.
Wattleworth, J.A., Berry, D.S., 1965. Peak-period control of a freeway system – some theoretical investigations. Highway Res. Rec. 89, 1–25.
Xu, X., Liu, J., Li, H.Y., Hu, J.Q., 2014. Analysis of subway station capacity with the use of queueing theory. Transp. Res. Part C: Emerg. Technol. 38, 28–43.
Xu, X., Liu, J., Li, H.Y., Jiang, M., 2016. Capacity-oriented passenger flow control under uncertain demand: algorithm development and real-world case study. Transp. Res. Part E: Logist. Transp. Rev. 87, 130–148.
Yao, X.M., Zhao, P., Qiao, K., Yu, D., 2015. Modeling on coordinated passenger inflow control for urban rail transit network. J. Cent. S. Univ. 1, 342–350.
Zhao, P., Yao, X.M., Yu, D., 2014. Cooperative passenger inflow control of urban mass transit in peak hours. J. Tongji Univ. (Nat. Sci.) 42 (9), 1340–1346+1443.
Zhu, F., Ukkusuri, S.V., 2014. Accounting for dynamic speed limit control in a stochastic traffic environment: a reinforcement learning approach. Transp. Res. Part C: Emerg. Technol. 41 (2), 30–47.
Zolfpour-Arokhlo, M., Selamat, A., Hashim, S.Z.M., Afkhami, H., 2014. Modeling of route planning system based on Q value-based dynamic programming with multi-agent reinforcement learning algorithms. Eng. Appl. Artif. Intell. 29, 163–177.