Data-driven model for passenger route choice in urban metro network


Accepted Manuscript Data-driven model for passenger route choice in urban metro network Jianjun Wu, Yunchao Qu, Huijun Sun, Haodong Yin, Xiaoyong Yan, Jiandong Zhao

PII: S0378-4371(19)30603-X
DOI: https://doi.org/10.1016/j.physa.2019.04.231
Reference: PHYSA 20995

To appear in: Physica A

Received date: 20 July 2018
Revised date: 14 February 2019

Please cite this article as: J. Wu, Y. Qu, H. Sun et al., Data-driven model for passenger route choice in urban metro network, Physica A (2019), https://doi.org/10.1016/j.physa.2019.04.231

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights:
1. A data-driven methodology is proposed for reconstructing metro passenger flow distributions from large-scale smart card data.
2. A similarity measure model is built to identify the possible routes according to the travel time clustering results.
3. The travel times of all OD pairs are calculated and clustered to obtain the maximum possible travel time groups.
4. Real data from the Beijing metro smart card system are applied to verify our approach.


Data-driven Model for Passenger Route Choice in Urban Metro Network

Jianjun Wu (a), Yunchao Qu (a), Huijun Sun (b), Haodong Yin (a), Xiaoyong Yan (b), Jiandong Zhao (b)

(a) State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China
(b) School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China

Abstract

Passenger flow distribution in the metro system is fundamental for many applications such as network planning and design, passenger flow forecasting, individual travel activity modeling and emergency response management. However, in most metro systems, including Beijing's, the smart card automated fare collection (AFC) equipment only records when and where a passenger enters and leaves the metro network. Therefore, accurately determining the passenger flow distribution over unknown travel routes remains a challenging task for managers. This paper presents a methodology for reconstructing metro passenger flow distribution from large-scale smart card data. A clustering method was first applied to group the travel times of passengers between origin-destination (OD) station pairs into different clusters. Then an approach that considered both uncertain walking time and transfer time was proposed to estimate the theoretical travel time of all possible routes between the OD pair. A similarity measure was further employed to match each travel time cluster to its most likely travel route, which finally yielded the passenger flow on every route. Compared with two classical methods, the proposed approach was more accurate and efficient.

Keywords: Smart card; Travel behavior; Metro network; Passenger flow

1 Introduction

Nowadays, rapid urbanization creates excessive travel demand [1], which makes it increasingly difficult to maintain ease of access [2-5] and may lead to traffic congestion in cities. Owing to their large capacity, low pollution and high reliability, urban metro systems have become an important transportation mode for relieving traffic congestion, especially in large cities. Take Beijing as an example, where the metro system carries more than 10 million passengers per day. Uncovering spatial-temporal patterns [6-9] to understand metro passenger travel behavior in such cities is of great significance for public transit agencies, from the day-to-day operation of the metro system to the long-term planning of the mass transit network.

In past decades, smart card automated fare collection (AFC) systems have been widely used in urban metro systems around the world. Large-scale smart card data are useful for collecting travel information and capturing the complexity behind the data. Mining metro passenger travel behavior from smart card data has attracted a wide range of research interest from various perspectives. Pelletier et al. [31] provided a detailed literature review that covers several aspects of smart card data use at three levels of management in the public transit context. Some successful applications include transit planning, improving operation services [10], travel behavior dynamics [11], traffic flow prediction [11-12], social interaction [13], detection of contagious outbreaks [14], urban planning [15], urban mobility patterns [16-18], station function identification [32], estimating behavioral attributes and adjustment of trips [33, 34], and predicting the impact of a new line on flow distribution [35].


Most of the AFC systems around the world record the time and location at the boarding station, the alighting station, or both. For example, the AFC systems in Chicago and New York record only the boarding time and boarding location, while the AFC system in the Beijing metro records the boarding time, boarding location, alighting time, and alighting location. However, up to now, none of them can record the passenger's travel route in the metro network. Because the time and location at both the boarding and alighting stations are very important for recognizing the route pattern, only the AFC systems that record all the boarding and alighting information are discussed in this paper. In the metro smart card record, we only know the travel time between an arbitrary origin station and destination station [9]. The route used in the metro is unknown because there are many routes for transfer passengers, especially in larger networks. Unknown trip routes make movement in the system a "black box". Operation service optimization, emergency incident management, and fare clearing among multiple operators in metro systems all require a travel behavior model that matches the route choice of passengers as accurately as possible. Accurately determining passenger flow distributions on metro systems under unknown travel routes remains a challenge.

Traditionally, discrete route choice models are often used to solve metro passenger flow assignment problems [19]. In the existing research, generalized cost functions for urban rail transit networks have been proposed. Logit-based models [36-38] and a Bayesian approach [39] have been proposed to estimate passenger route choice patterns and travel time reliability. However, such models must know the total cost of each route, including waiting time, in-vehicle time, dwell time at the station, transfer time, and choice habits (e.g., the number of transfers, the comfort, and penalties of transferring).
The effect of comfort on route choice is very hard to quantify [21-22], leading to inaccurate passenger flow prediction and estimation in real applications. Recently, smart card technology has helped to address this problem, and some data-driven models based on AFC data have been developed to discover the route choice behavior in urban metro systems and urban transit systems [9, 13, 24-27, 40-44]. Combined with train schedule data and human mobility constraints, the in-vehicle travel time and transfer time could be obtained from AFC data [40]. These components of travel time and marginal disutility were integrated into data mining models, including logit-based models [41, 42], a probabilistic maximum likelihood function [9], and synchronous clustering algorithms [43, 44], to further analyze route choice behaviors. However, the stochastic characteristics of individual travel time and the fuzzy properties of route choice behavior have not been fully considered in these models.

In this paper, a clustering approach based on individual travel time data was proposed to recognize the route set pattern and then reconstruct the passenger flow distribution in urban metro systems. The generalized Manhattan distance between each pair of points was calculated, and then the local density was determined by counting the number of points with distances smaller than the cutoff distance. The cluster centers were determined by finding the points with larger minimal distances, excluding the outliers. A cluster center represented a feasible path that was frequently used by passengers. To obtain the actual routes in the urban metro network, the K* algorithm was applied to obtain the available travel route set. The generalized travel cost of each route was calculated by summing up the preset train travel time and the free-flow walking time in transfer stations. Analysis of the actual travel data showed that the individual walking time approximately followed the normal distribution. The theoretical time-dependent travel time of each route was obtained from the mean value and variance of the normal distribution. Based on the fuzzy matching approach, a normal-distribution-based similarity function was proposed to match the routes to the cluster centers. Besides, the individual travel route was determined by finding the maximal similarity to the alternative cluster centers. The route flow and travel flow distribution can be further obtained by integrating the individual data. To validate our model, the smart card data of the Beijing metro system was selected to demonstrate the computational efficiency.

The remainder of this paper is organized as follows. Section 2 gives the data description of AFC data in the Beijing Metro system. Section 3 introduces the travel time cluster algorithm, the route travel time estimation and the fuzzy matching model. Section 4 illustrates the numerical results of the passenger flow distributions in the Beijing Metro network, and shows the computational comparison with two other models. Section 5 gives a further discussion of this model and explores future work.

2 Data description

In the Beijing mass transit railway, more than 80% of passengers pay for their trips with smart cards. Data on more than 10 million smart card transactions are created every day, which poses many challenges for travel behavior modelling and requires new techniques to process the data and mine useful information from it. The data used in this paper was recorded by AFC systems at the tap-in and tap-out stations and provided by the Beijing Municipal Commission of Transport (BMCT). It covers more than 2 months, from February 7, 2014 to April 14, 2014, with about 1 billion successive tap-in and tap-out records. In the dataset, key information included card ID, entry-time, exit-time, trip-origin-location, and trip-destination-location. Therefore, the OD of each trip record was determined. However, these records could not offer any travel route information for transfer passengers because there are many alternative routes between an arbitrary OD pair. Invalid data were discarded as follows: mismatched tap-in and tap-out times, too small travel times (less than 3 minutes), and too large travel times (greater than 4 hours). To keep consistent with the data transmission coming from AFC machines, the records were aggregated into 5-minute intervals from 5:00 to 24:00. Therefore, the total daily operation time was separated into 228 time intervals.

The entry-time and exit-time show the tap-in and tap-out time for each trip record, while the trip-origin-location and trip-destination-location uniquely identify each of the 237 stations of the network shown in Fig. 1(a). In a complex metro network, there may be more than one travel route between each OD station pair (see Fig. 1(b)), and different routes between an OD pair have different expected travel times. In the smart card data, the entry-time and exit-time of the passenger are marked for each trip from the origin station to the destination station (see Fig. 1(c)).
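The filtering and aggregation rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and constant names are hypothetical, and the record layout of the BMCT data is not given in the text, so each trip is reduced here to a pair of timestamps.

```python
from datetime import datetime

# Hypothetical constants mirroring the rules stated in the text.
SERVICE_START_HOUR = 5        # aggregation starts at 5:00
SERVICE_END_HOUR = 24         # and ends at 24:00
INTERVAL_MINUTES = 5          # width of one aggregation interval
MIN_TRAVEL_S = 3 * 60         # trips shorter than 3 minutes are discarded
MAX_TRAVEL_S = 4 * 60 * 60    # trips longer than 4 hours are discarded

# 19 hours of service split into 5-minute intervals.
N_INTERVALS = (SERVICE_END_HOUR - SERVICE_START_HOUR) * 60 // INTERVAL_MINUTES


def is_valid(entry: datetime, exit_: datetime) -> bool:
    """Keep a record only if tap-out follows tap-in by 3 minutes to 4 hours."""
    travel_s = (exit_ - entry).total_seconds()
    return MIN_TRAVEL_S <= travel_s <= MAX_TRAVEL_S


def interval_index(t: datetime) -> int:
    """0-based index of the 5-minute interval containing t, counted from 5:00."""
    minutes = (t.hour - SERVICE_START_HOUR) * 60 + t.minute
    return minutes // INTERVAL_MINUTES
```

A record with mismatched times (tap-out before tap-in) yields a negative travel time and is rejected by the same lower bound.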

Fig. 1. Illustration of the Beijing metro network. (a) Profile of the complete network. (b) As an example, we give 5 possible routes A1, A2, …, A5 between OD pair TTY (TianTongYuan) to DSST (DongSiShiTiao), marked with different colors. (c) Illustration of travel time composition. Deterministic in-vehicle time and random walking time including the time from AFC system to vehicle, time in the interchange station, and the time from the platform to the AFC system.

It should be noted that the travel time and travel demand change over time. In particular, the travel patterns on weekdays and weekends are quite different. To provide a clearer description, an overview of the daily travel demand between the OD pair TTY-DSST during the period is given in Fig. 2. For the 66-day dataset, the maximal travel demand is 625 persons/day, and the minimal travel demand is 141 persons/day. The result shows that the travel demand on weekdays is larger than on weekends. During the weekends, the travel demand is low and congestion in the trains is not serious. Most passengers will choose the shortest path, so it is less crucial for managers and researchers to obtain the route choice behavior. Therefore, we are only concerned with the travel pattern on weekdays. As shown in Fig. 2(a), on weekdays the minimal demand is 517 persons/day and the maximum demand is 625 persons/day. The mean value of the demand over 47 weekdays is 574.9 persons/day, and the standard deviation is 28.6 persons/day. The results show that the travel demand on weekdays is relatively steady. The individual travel time on weekdays is then analyzed, and the results are shown in Fig. 2(b). The median value of the travel time fluctuates near 2300 s, and the standard deviation of the median value is 167.3 s. The result also shows that the individual travel time is steady. Fig. 3 illustrates the daily travel demand and individual travel time in a specific week (weekdays only). Both the demand and the travel time show steady trends.

Fig. 2. Time-dependent daily travel demand and travel time between TTY to DSST stations from Feb. 7, 2014 to Apr. 14, 2014. a) time-dependent daily travel demand. b) Box graph of individual travel time.

Fig. 3. Overview of travel time and demand characteristics between TTY and DSST stations from March 24, 2014 to March 28, 2014. (a) Shows that the average travel time was about [2339, 2380.6] s. (b) Displays that demand was about [543, 600] persons/day.

3 Method

Because the route travel time was time-varying due to the different travel demand levels, all the travel data during a whole day were divided into several time intervals. The individual travel times of the passengers travelling in the same time interval were integrated into a group. To avoid the errors and fluctuations caused by too few data, the time interval was set to 2 hours. All the individual travel times were then processed by the cluster algorithm.

3.1 Travel Time Clustering

Assume that there were $N$ travel time data points, and the travel time set was $T = \{t_1, t_2, \ldots, t_N\}$. For two arbitrary points $t_l$ and $t_k$, we calculated the Manhattan distance $d_{lk} = |t_l - t_k|$. Due to the symmetrical property of the Manhattan distance, $d_{lk} = d_{kl}$ followed. So we only needed to calculate $N(N-1)/2$ distances between each two points, and sorted them to get a new distance series in ascending order. To give a brief and clear definition, the parameters were defined as follows:

$\rho_l$: the local density of the point $l$, which denotes the number of points that are closer than the cutoff distance to this point;
$\delta_l$: the minimal distance between the point $l$ and any other point with higher density;
$\gamma_l$: the product of density and distance, $\gamma_l = \rho_l \delta_l$, which is used to find cluster centers.

For each data point $l$, the local density can be calculated by Eq. (1), which was proposed by Ref. [28]:

$$\rho_l = \sum_{k \neq l} \chi(d_{lk} - d_c) \qquad (1)$$

where $\chi(x) = 1$ when $x < 0$; otherwise, $\chi(x) = 0$. Here $d_c$ is the cutoff distance, taken from the sorted distance series at a rounded index position that was chosen empirically. Besides, $\delta_l = \min_{k:\, \rho_k > \rho_l} d_{lk}$ is the minimum distance between the point $l$ and any other point with higher density.

As mentioned in Ref. [28], the cluster algorithm is based on the assumption that cluster centers are surrounded by neighbors with lower local density. The cluster centers are usually recognized as the points for which the value of $\delta_l$ is anomalously large. In some special situations, points may have a relatively high $\delta_l$ but a low $\rho_l$ because they are isolated; these can be regarded as outliers. A cluster center therefore has a large value of $\gamma_l$, and the cluster centers are recognized by finding the points with larger $\gamma_l$, excluding the outliers. By this approach, the two cluster centers $(\rho, \delta) = (32.1, 424.6)$ and $(26.4, 144.7)$, whose values of $\gamma_l$ were much larger than those of the other points, can be recognized. It should be noted that the points (0, 408.7) and (4.9, 210.3) are outliers.

The travel time of a passenger for the trip from the origin station to the destination station can be calculated as the difference between the exit time and the entry time. Using the clustering method proposed by [28], the passengers' travel times were clustered into different groups. Fig. 4 depicts an example of the clustering results of travel times between the two stations TTY and DSST. There are two cluster points (see Fig. 4(a)). Each cluster is represented by the travel time of its center, and the two cluster centers are identified as cluster center 1 and cluster center 2 (see Fig. 4(b)).

Fig. 4. Decision graph of clustering results. (a) Two cluster centers identified, marked with red and blue points. (b) According to Ref. [28], the point with the larger product of local density and distance corresponds to the cluster center.
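The decision-graph quantities of Section 3.1 can be illustrated with a minimal Python sketch for one-dimensional travel times. Several choices here are assumptions rather than the authors' implementation: the number of centers is passed in explicitly (the paper reads it off the decision graph), ties in density are broken by input order, and the point with the highest density receives its largest distance as its separation value. The travel times in the usage note are hypothetical.

```python
def density_peaks(times, cutoff, n_centers):
    """Return (center indices, density, separation, product) for 1-D travel times."""
    n = len(times)
    # Pairwise Manhattan distances; in one dimension this is |t_l - t_k|.
    dist = [[abs(a - b) for b in times] for a in times]
    # Local density: number of other points closer than the cutoff distance.
    rho = [sum(1 for k in range(n) if k != l and dist[l][k] < cutoff)
           for l in range(n)]
    # Visit points from high to low density (stable sort breaks ties by index).
    order = sorted(range(n), key=lambda l: -rho[l])
    delta = [0.0] * n
    # Assumption: the densest point gets its largest distance as separation.
    delta[order[0]] = max(dist[order[0]])
    for pos in range(1, n):
        l = order[pos]
        # Distance to the nearest point that ranks higher in density.
        delta[l] = min(dist[l][k] for k in order[:pos])
    gamma = [rho[l] * delta[l] for l in range(n)]
    # Cluster centers: the points with the largest density-distance product.
    centers = sorted(range(n), key=lambda l: -gamma[l])[:n_centers]
    return centers, rho, delta, gamma


# Hypothetical travel times (s): two groups of trips and one isolated record.
times = [2200, 2210, 2220, 2230, 2240, 2600, 2610, 2620, 2630, 2640, 3900]
centers, rho, delta, gamma = density_peaks(times, cutoff=25, n_centers=2)
```

In this toy data the isolated record at 3900 s has zero density but a large separation, which is the outlier signature described in the text, so its product is zero and it is never selected as a center.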

3.2 Route Travel Time Estimation

Because the AFC system only records the entrance time and exit time, the AFC-to-vehicle time, including transfer time and walking time, could not be directly derived from the original AFC data. The components of the travel time were estimated by the following approach. The theoretical travel time $T^{r}_{od}$ on route $r$ between OD pair $(o, d)$ consists of three items: the deterministic in-vehicle time $t^{v,r}_{od}$, the random walking time in the stations $t^{w}_{od}$ (the sum of the time from the AFC machine to the platform at the origin station and the time from the platform to the AFC machine at the destination station), and the total transfer time $t^{f,r}_{od}$. Thus, we have

$$T^{r}_{od} = t^{v,r}_{od} + t^{w}_{od} + t^{f,r}_{od} \qquad (2)$$

where $t^{v,r}_{od}$ can be directly calculated from the train timetable and is the sum of the running time between successive stations and the dwell time at each station; $t^{w}_{od}$ and $t^{f,r}_{od}$ can be estimated with the tap-in and tap-out data as follows.

Firstly, to estimate the walking time for OD pair $(o, d)$, we chose records in the tap-in and tap-out data for OD pairs that had only one alternative route and shared either the origin or the destination with $(o, d)$. For such an OD pair, the in-vehicle time could be calculated by summing the running time and the dwell time at each station along the route according to the train timetable. Here, the 'dwell time' represents the train dwell time at each subway station; the passenger's waiting time at the platform is included in the random walking time $t^{w}_{od}$ or the total transfer time $t^{f,r}_{od}$. The walking time $t^{w}_{od}$ consisted of the time from the AFC machine to the platform at the origin station and the time from the platform to the AFC machine at the destination station, and it followed a normal distribution $N(\mu^{w}_{od}, (\sigma^{w}_{od})^2)$. Because the observed travel time of these single-route records could be obtained from the historical travel records in the AFC data, the parameters $\mu^{w}_{od}$ and $\sigma^{w}_{od}$ could also be obtained.

Next, we used the AFC data which had only one interchange at a transfer station between an OD pair to estimate the transfer time. For simplicity, the transfer walking time of each passenger was assumed to be independent among different transfer stations. Let $t^{f,r}_{od} = \sum_{s} \theta^{r}_{od,s}\, t^{f}_{s}$, where $t^{f}_{s}$ represented the transfer walking time at station $s$. The binary value $\theta^{r}_{od,s} = 1$ when route $r$ between OD pair $(o, d)$ used the transfer station $s$; otherwise, $\theta^{r}_{od,s} = 0$. In general, the walking time followed the normal distribution, such as in the transfer station YHG between OD pair HPXQ (HePinXiQiao) and DSST (DongSiShiTiao) (see Fig. 5(a)). Therefore, we assumed that the transfer time at station $s$ also follows $N(\mu^{f}_{s}, (\sigma^{f}_{s})^2)$, where $\mu^{f}_{s}$ and $(\sigma^{f}_{s})^2$ are the mean value and the variance of the walking time in the transfer corridor. Selecting an OD pair with only one transfer route at station $s$, the total travel time follows Eq. (2) with a single transfer term, where the total travel time, the in-vehicle time and the walking time were known from the tap-in and tap-out records and the timetable. Thus, $\mu^{f}_{s}$ and $\sigma^{f}_{s}$ could also be determined. We estimated the transfer time at all transfer stations for the given OD pair from station TTY to station DSST using the above method, and the results are shown in Fig. 5(b). Furthermore, we estimated the theoretical travel time of different routes for a given OD pair.
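The subtraction logic behind this estimation can be sketched as follows. This is a hedged illustration, not the authors' procedure: the function names and all sample numbers are hypothetical, and the transfer variance is obtained by subtracting the walking variance, which assumes that walking and transfer times are independent.

```python
from statistics import mean, pvariance


def fit_normal(samples):
    """Mean and (population) variance of a sample, used as normal parameters."""
    return mean(samples), pvariance(samples)


def walking_params(travel_times, in_vehicle_time):
    """Single-route, no-transfer records: what remains after removing the
    timetable in-vehicle time is the walking time at both ends of the trip."""
    return fit_normal([t - in_vehicle_time for t in travel_times])


def transfer_params(travel_times, in_vehicle_time, walk_mean, walk_var):
    """Single-route, one-transfer records: remove in-vehicle time, then the
    walking component. Independence of walking and transfer is assumed."""
    res_mean, res_var = fit_normal([t - in_vehicle_time for t in travel_times])
    # Clamp at zero in case sampling noise makes the difference negative.
    return res_mean - walk_mean, max(res_var - walk_var, 0.0)


# Hypothetical observations in seconds.
walk_mean, walk_var = walking_params([1500, 1520, 1540], in_vehicle_time=1200)
tr_mean, tr_var = transfer_params([2000, 2040, 2080], 1300, walk_mean, walk_var)
```

With these made-up inputs the walking time has mean 320 s and the transfer time has mean 420 s; real estimates would use many records per time interval.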

Fig. 5. Estimated transfer time by AFC data. (a) Distribution of transfer time at YHG station. The fitted curve (red line) follows the normal distribution N (183.5, 87.32), with a coefficient of determination of 0.9907 and a sum of squared errors of 0.0009655. With 80% probability, the transfer time of a passenger interchanging at this station is about [302,482] s. The real transfer time surveyed by BMCT is about 390±13 s. Therefore, the estimation of transfer time by our method was effective. (b) The mean value and variance of the transfer time distribution for all possible interchange stations between TTY and DSST.

3.3 Route Matching and Flow Assignment

The final problem to be addressed was how to match the real passenger travel time to a certain route. The feasible route set between each OD pair was first determined. There is usually more than one path between each OD pair in the urban rail transit network. Using the traditional Dijkstra algorithm or the Bellman-Ford algorithm, the shortest path could be easily and efficiently obtained. The k shortest path problem is a generalization of the shortest path problem: the algorithm finds not only the shortest path, but also k − 1 other paths in non-decreasing order of cost. Many papers have been published on the k shortest path problem, and the K* algorithm recently proposed by Aljazzar and Leue [45] was applied to calculate the feasible routes between each OD pair. More details of the algorithm can be found in Ref. [45]. In a metro network, the travel time of a route is time-varying due to the uncertain travel demand and the time-dependent train timetable. It is also determined by the individual stochastic walking time and transfer time in the entrance, exit and transfer stations. It is difficult to calculate the personalized shortest path for each passenger under a certain situation; therefore, for simplicity, the k shortest paths were calculated under the deterministic free-flow train travel time. In this numerical example, the number of routes k was set to 5.

The second step was matching the alternative routes to the obtained cluster centers. After the feasible route set was obtained, the average travel time of each potential route was collected into a set. Due to the randomness and uncertainty in passenger travel, matching routes accurately becomes more and more difficult as the metro network expands. Here, a fuzzy matching method was developed to identify the route, following the matching approach proposed in Ref. [29]. Because the fuzzy set of the routes had only a single attribute, the travel time, a simple membership function based on the normal distribution assumption could be applied to calculate the similarity. For the j-th clustering center, the similarity to each route was calculated by Eq. (3) according to the statistical travel time distribution, in which the variance of the travel time on the route appeared as a parameter. An alternative route was then determined according to the principle of maximum membership degree: the clustering center had the largest matching probability with the route whose similarity to it was the largest. For convenience, these results are explained in Fig. 6 using an example from TTY to DSST with five given initial routes. Fig. 6(a) illustrates the theoretical travel time of the routes. Fig. 6(b) shows that the most probable routes of the two cluster centers were A1 and A2 for this case. The detailed route information is given in Fig. 6(c).
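To make the notion of k shortest paths concrete, the sketch below enumerates loopless paths in non-decreasing order of cost with a simple best-first search. It is not the K* algorithm of Ref. [45], which is considerably more efficient; it is a swapped-in, standard-library illustration, and the small graph with its costs is hypothetical.

```python
import heapq


def k_shortest_paths(graph, origin, destination, k):
    """Return up to k loopless (cost, path) pairs in non-decreasing cost order.

    graph maps a node to a list of (neighbor, non-negative cost) pairs.
    """
    heap = [(0, [origin])]          # partial paths ordered by accumulated cost
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == destination:
            found.append((cost, path))
            continue
        for neighbor, weight in graph.get(node, []):
            if neighbor not in path:  # never revisit a station on this path
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return found


# Hypothetical network: O is the origin, D the destination.
graph = {
    "O": [("A", 5), ("B", 2)],
    "A": [("D", 1)],
    "B": [("D", 6), ("A", 1)],
}
routes = k_shortest_paths(graph, "O", "D", k=2)
```

Because all costs are non-negative, a complete path is popped only after every cheaper partial path has been expanded, which is what guarantees the ordering. The search can grow exponentially on large networks, so it is suitable only for small examples.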

A1: TTY (Line 5) → YHG (Line 2) → DSST (Line 2)
A2: TTY (Line 5) → LSQ (Line 13) → DZM (Line 2) → DSST (Line 2)
A3: TTY (Line 5) → DS (Line 6) → CYM (Line 2) → DSST (Line 2)
A4: TTY (Line 5) → HXXJNK (Line 10) → SYJ (Line 13) → DZM (Line 2) → DSST (Line 2)
A5: TTY (Line 5) → HXXJNK (Line 10) → HJL (Line 6) → CYM (Line 2) → DSST (Line 2)

Fig. 6. Theoretical travel time and match degree. (a) Theoretical travel time between TTY and DSST for 5 possible routes. By using our method, the total theoretical travel time for different routes at peak hour was estimated as 2234 s, 2438 s, 3105 s, 2648 s, and 3187 s. (b) Match degree of 5 possible routes; the most probable routes of 2 cluster centers are A1 and A2. (c) Five possible routes between TTY and DSST. The red points represent the OD stations, while the grey points show the transfer station between two lines.
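The matching step can be illustrated with a short Python sketch. The exact form of the membership function in Eq. (3) is not recoverable from the text, so a Gaussian membership function is assumed here. The route means are the theoretical travel times quoted in Fig. 6(a) and 2300 s is the cluster center mentioned for the TTY-DSST example, while the common standard deviation of 100 s is purely hypothetical.

```python
import math


def similarity(center_time, route_mean, route_std):
    """Assumed Gaussian membership of a cluster center in a route's
    travel time distribution; equals 1.0 when the times coincide."""
    return math.exp(-((center_time - route_mean) ** 2) / (2.0 * route_std ** 2))


def best_route(center_time, routes):
    """Maximum-membership principle: pick the route with the largest similarity.

    routes maps a route name to (mean travel time, standard deviation).
    """
    scores = {name: similarity(center_time, m, s) for name, (m, s) in routes.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Means from Fig. 6(a); the 100 s standard deviation is a made-up placeholder.
routes = {
    "A1": (2234, 100.0),
    "A2": (2438, 100.0),
    "A3": (3105, 100.0),
    "A4": (2648, 100.0),
    "A5": (3187, 100.0),
}
name, score = best_route(2300, routes)
```

With route-specific variances, a route with a wider travel time distribution can win over a route whose mean is nearer, which is why the variance enters the similarity at all.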

The final task was to find the route for each passenger according to the time-dependent travel time. According to Ref. [28], once the cluster centers had been found, each remaining point was assigned to the same cluster as its nearest neighbor of higher density. For a passenger, the travel time was first assigned to a cluster center, and then matched to a certain route by the fuzzy matching approach. The estimated individual travel route was finally obtained according to the above steps. For instance, the travel time of a passenger traveling from TTY to DSST was 2268 s. According to the proposed model, the assigned cluster center was the one at 2300 s, and the travel route was TTY-YHG-DSST. It should be noted that the isolated outliers were assigned to none of the cluster centers. A naïve strategy for them was to apply the fuzzy matching model of the second step and directly match the outlier travel time to the alternative routes in the route set.

Let $R$ be the set of routes between the OD pair identified with the clustering algorithm mentioned above, and let $p_r$ be the ratio of passengers using route $r$ to the total number of passengers. Clearly, $\sum_{r \in R} p_r = 1$. According to the route matching results, the travel times of 387 passengers (7:00-9:00, 18 Feb. 2014) from TTY to DSST were analyzed by counting the number of passengers assigned to the two clusters. It was found that the split between the two routes was about 234:153 = 0.605:0.395 by using the method described above, which means that the two route ratios are 0.605 and 0.395.
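The route ratios follow directly from counting the passengers assigned to each route. The sketch below reproduces the 234:153 split quoted above; the route labels A1 and A2 and the function name are illustrative.

```python
from collections import Counter


def route_shares(assigned_routes):
    """Share of passengers on each route, given one route label per passenger."""
    counts = Counter(assigned_routes)
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()}


# 387 passengers: 234 matched to the first route, 153 to the second.
shares = route_shares(["A1"] * 234 + ["A2"] * 153)
```

The shares sum to one by construction, which is the normalization condition on the route ratios.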

4 Results and comparisons

Using the proposed approach, we calculated the overall map of the time-dependent passenger flow distribution over the whole network from 7:00 to 9:00 on Feb. 18, 2014, which is shown in Fig. 7. In these subfigures, the color of each line represents the average train load of the segment, and the width of each line represents the average segment passenger flow. The train load is the usage percentage of the train capacity, and the maximal value is 140%. The unit of passenger flow is thousand persons per hour. At the beginning of the period (7:00-7:15, Fig. 7(a)), the travel demand level was low, and the passenger flow of each line was small. With the continuous arrival of passengers (7:30-7:45, Fig. 7(b)), the total travel demand became larger, and the congestion became serious on some lines, e.g., Line 5 and Line 13. During the peak period (8:00-8:15, Fig. 7(c)), the train load on many segments of more lines exceeded 100%, and even reached 120%. During the period 8:30-8:45 (Fig. 7(d)), the total travel demand gradually decreased, and congestion on some segments began to vanish.

Fig. 7. Overall map of time-dependent passenger flow distribution over the whole network. (a) 7:00-7:15. (b) 7:30-7:45. (c) 8:00-8:15. (d) 8:30-8:45.

To better understand the effectiveness of our method, we compared the proposed approach with two other models: a travel-time-based model and a logit-based method. The research period was 7:00-9:00 on Feb. 18, 2014, and 10 OD pairs in the Beijing Metro network were chosen for the comparison. These OD pairs were TTYB-DWL (TianTongYuanBei-DaWangLu), DZM-FXM (DongZhiMen-FuXingMen), PGY-DSST (PingGuoYuan-DongSiShiTiao), LJY-DSST (LiuJiaYao-DongSiShiTiao), JST-DWL (JiShuiTan-DaWangLu), HLG-MDY (HuiLongGuan-MuDanYuan), DZM-XZM (DongZhiMen-XiZhiMen), TTY-DSST (TianTongYuan-DongSiShiTiao), SH-ZGC (SiHui-ZhongGuanCun), and QNL-FCM (QingNianLu-FuChengMen). Between each OD pair, there were two or three alternative routes for passengers to choose from.

(1) Travel-time-based model

We compared against the travel time model proposed by [30]. For an OD pair, the travel time using each route was known. In addition, the maximum and minimum travel times could be obtained from the collected AFC data. The range between the minimum and maximum travel times was divided into a number of segments, and the probability of the travel time falling in a segment was the number of passengers in that segment divided by the total number of passengers, both of which could be counted from the trip data. Additionally, the passengers in different segments were assumed to be independent. Then, the route ratio could be obtained from these segment probabilities.

(2) Logit-based method

Another traditional method for passenger flow assignment is the logit-based algorithm, which has been widely applied in urban rail transit networks. Here, an online logit-based passenger flow distribution system used in the Beijing Metro was selected for comparison. Because the actual passenger route choice pattern and route flow distributions are very difficult to obtain, the route flows calculated by the logit-based passenger flow distribution system of the Beijing Metro were approximately regarded as the 'true' values. Tab. 1 shows the flow distributions for the 10 OD pairs. The results show that the distributions calculated by the three models are nearly the same, and that routes with shorter travel times carry larger proportions of the flow. Passengers were also unwilling to make many transfers during their trips. The average relative error of our model with respect to the logit-based flows was 8.62%, smaller than that of the travel-time-based model (15.54%), which implies that our method obtains better solutions.

Tab. 1 Comparison results of route proportion by three models

| OD | Route index | Logit-based model: flow | Logit-based model: proportion | Travel time model: flow | Travel time model: proportion | Travel time model: relative error (%) | Our model: flow | Our model: proportion | Our model: relative error (%) |
|---|---|---|---|---|---|---|---|---|---|
| TTYB-DWL | 1 | 170 | 0.343 | 192 | 0.388 | 12.941 | 166 | 0.335 | 2.353 |
| TTYB-DWL | 2 | 175 | 0.354 | 164 | 0.331 | 6.286 | 182 | 0.368 | 4.000 |
| TTYB-DWL | 3 | 150 | 0.303 | 139 | 0.281 | 7.333 | 147 | 0.297 | 2.000 |
| DZM-FXM | 1 | 149 | 0.383 | 123 | 0.316 | 17.450 | 135 | 0.347 | 9.396 |
| DZM-FXM | 2 | 133 | 0.342 | 168 | 0.432 | 26.316 | 142 | 0.365 | 6.767 |
| DZM-FXM | 3 | 107 | 0.275 | 98 | 0.252 | 8.411 | 112 | 0.288 | 4.673 |
| PGY-DSST | 1 | 135 | 0.380 | 120 | 0.338 | 11.111 | 128 | 0.361 | 5.185 |
| PGY-DSST | 2 | 109 | 0.307 | 113 | 0.318 | 3.670 | 120 | 0.338 | 10.092 |
| PGY-DSST | 3 | 111 | 0.313 | 122 | 0.344 | 9.910 | 107 | 0.301 | 3.604 |
| LJY-DSST | 1 | 110 | 0.354 | 86 | 0.277 | 21.818 | 103 | 0.331 | 6.364 |
| LJY-DSST | 2 | 103 | 0.331 | 95 | 0.305 | 7.767 | 107 | 0.344 | 3.883 |
| LJY-DSST | 3 | 98 | 0.315 | 130 | 0.418 | 32.653 | 101 | 0.325 | 3.061 |
| JST-DWL | 1 | 148 | 0.398 | 113 | 0.304 | 23.649 | 122 | 0.328 | 17.568 |
| JST-DWL | 2 | 120 | 0.323 | 167 | 0.449 | 39.167 | 136 | 0.366 | 13.333 |
| JST-DWL | 3 | 104 | 0.280 | 92 | 0.247 | 11.538 | 114 | 0.306 | 9.615 |
| HLG-MDY | 1 | 120 | 0.513 | 108 | 0.462 | 10.000 | 116 | 0.496 | 3.333 |
| HLG-MDY | 2 | 36 | 0.154 | 54 | 0.231 | 50.000 | 48 | 0.205 | 33.333 |
| HLG-MDY | 3 | 78 | 0.333 | 72 | 0.308 | 7.692 | 70 | 0.299 | 10.256 |
| DZM-XZM | 1 | 34 | 0.515 | 36 | 0.545 | 5.882 | 38 | 0.576 | 11.765 |
| DZM-XZM | 2 | 32 | 0.485 | 30 | 0.455 | 6.250 | 28 | 0.424 | 12.500 |
| TTY-DSST | 1 | 265 | 0.634 | 268 | 0.641 | 1.132 | 253 | 0.605 | 4.528 |
| TTY-DSST | 2 | 153 | 0.366 | 150 | 0.359 | 1.961 | 165 | 0.395 | 7.843 |
| SH-ZGC | 1 | 107 | 0.430 | 120 | 0.482 | 12.150 | 115 | 0.462 | 7.477 |
| SH-ZGC | 2 | 66 | 0.265 | 35 | 0.141 | 46.970 | 73 | 0.293 | 10.606 |
| SH-ZGC | 3 | 76 | 0.305 | 94 | 0.378 | 23.684 | 61 | 0.245 | 19.737 |
| QNL-FCM | 1 | 102 | 0.436 | 95 | 0.406 | 6.863 | 108 | 0.462 | 5.882 |
| QNL-FCM | 2 | 66 | 0.282 | 77 | 0.329 | 16.667 | 59 | 0.252 | 10.606 |
| QNL-FCM | 3 | 66 | 0.282 | 62 | 0.265 | 6.061 | 67 | 0.286 | 1.515 |
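The relative errors in Tab. 1 are consistent with taking the logit-based flow as the reference and computing |model flow - logit flow| / logit flow. The following is a minimal sketch of that calculation, using the first three rows of Tab. 1 as input; the function and variable names are illustrative assumptions, not taken from the paper.

```python
def relative_error(reference_flow, model_flow):
    """Relative error (%) of a model flow against the reference (logit-based) flow."""
    return abs(model_flow - reference_flow) / reference_flow * 100.0


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Flows for the three routes of the first OD pair in Tab. 1.
logit_flows = [170, 175, 150]        # treated as the 'true' values
travel_time_flows = [192, 164, 139]  # travel time model
our_flows = [166, 182, 147]          # proposed model

tt_errors = [relative_error(r, m) for r, m in zip(logit_flows, travel_time_flows)]
our_errors = [relative_error(r, m) for r, m in zip(logit_flows, our_flows)]

print([round(e, 3) for e in tt_errors], round(mean(tt_errors), 3))
print([round(e, 3) for e in our_errors], round(mean(our_errors), 3))
```

Averaging the same quantity over all 28 routes of the table gives the two average relative errors quoted in the text.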

Although the flow distributions obtained by the three models are nearly the same, our approach needs fewer parameters and computational resources and is more robust than the other methods. Fig. 8 is an example for the OD pair TTY-DSST. Using the travel-time-based model, we calculated the passenger flow ratio on the two identified routes as 0.641:0.359. However, and can be and , when given with different numbers. For example, for the case study of TTY and DSST, when , the passenger flow ratio in two identified routes was very different, and . Therefore, this method is not robust enough for the real estimation of passenger flow distribution. The logit-based model requires the calibration of many parameters to ensure acceptable precision, such as comfort and the transfer penalty. Moreover, as the network expands, the parameters must be updated to capture the changed travel behavior. With our proposed model, the cluster centers can be computed efficiently. The most time-consuming part of the clustering algorithm is calculating the distance between each pair of points, and its computational complexity is O(n^2). Unlike other existing clustering algorithms, the cluster assignment is performed in a single step. In summary, the proposed route recognition method has good computational efficiency and accuracy.
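To illustrate the cost structure described above (an O(n^2) pairwise-distance step followed by a single-pass assignment), here is a minimal sketch of a density-peak style clustering of one-dimensional travel times, in the spirit of Rodriguez and Laio's "clustering by fast search and find of density peaks". It is an assumption that this matches the clustering step used in the paper; the cutoff `d_c`, the number of clusters `k`, and the travel times are hypothetical values chosen for illustration.

```python
def density_peak_clusters(times, d_c, k):
    """Cluster 1-D travel times; returns (labels, centre_times).

    d_c (cutoff distance) and k (number of clusters) are illustrative
    parameters, not values from the paper.
    """
    n = len(times)
    # Pairwise distances: the O(n^2) step.
    dist = [[abs(a - b) for b in times] for a in times]
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest point of higher density
    parent = [-1] * n   # that nearest higher-density point
    for pos, i in enumerate(order):
        if pos == 0:
            # The densest point has no denser neighbour: use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i] = dist[i][j]
        parent[i] = j
    # Centres: the k points with the largest rho * delta, relabelled so that
    # cluster 0 has the shortest travel time. The densest point always
    # qualifies, so every non-centre point has a valid parent.
    top = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:k]
    centres = sorted(top, key=lambda i: times[i])
    labels = [-1] * n
    for c, i in enumerate(centres):
        labels[i] = c
    # Single-pass assignment: each point inherits the label of its parent,
    # which has already been labelled because it comes earlier in `order`.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, [times[i] for i in centres]


# Hypothetical travel times (minutes) for one OD pair with two routes.
travel_times = [30, 31, 31, 32, 33, 45, 46, 46, 47]
labels, centre_times = density_peak_clusters(travel_times, d_c=2, k=2)
print(labels, centre_times)
```

No iterative re-assignment is needed: once densities and nearest denser neighbours are known, one pass in density order labels every point.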

Fig. 8. Comparisons of the flow ratio Route A1 : Route A2. Logit-based: 0.634:0.366; travel-time-based: 0.641:0.359; our approach: 0.605:0.395. The red star shows the results of the logit-based passenger distribution (7:00-9:00, 18 Feb. 2014) by the online passenger flow distribution system.

5 Conclusion

Smart card AFC systems are widely used in metro systems around the world. Smart card data is very useful for public transit agencies, from the day-to-day operation of the metro system to the long-term planning of mass transit networks. However, most AFC systems only record when and where a passenger enters and leaves the metro system, not the passenger's spatio-temporal movement trajectory. That is, the passenger's travel route in the metro network is unknown. In this study, we presented a methodology for reconstructing metro passengers' routes from smart card data. To identify a passenger's travel route, the travel times of passengers with the same OD pair were first clustered. According to the clustering results, several classes of travel time were distinguished. Such information is important because it allows us to determine the relationship between possible routes and travel time. Then, a measure of similarity was used to assign the clusters to the different travel routes between the OD pair and so obtain the most likely travel route for each passenger. In practice, this implies that the travel route of each passenger can be found from the smart card data. Finally, we compared this approach with two other methods and found that it was more efficient in real applications. In the future, we will develop a travel time regression model, with parameters estimated from smart card data collected from the fastest passengers on each travel route, by separating a typical metro trip into station-to-station segments. Based on the regression results, we could then assign observed passengers to individual trains and finally obtain the spatio-temporal trajectory of each passenger. Such trajectory information has great potential to improve station passenger management, train scheduling, and emergency response management for metro systems.

Acknowledgments

This paper was partly supported by the China National Funds for Distinguished Young Scientists (71525002), NSFC (71621001, 71601017, 71771018), and the Research Foundation of the State Key Laboratory of Rail Traffic Control and Safety (RCS2018ZT004, RCS2018ZQ001).
