Supporting real-world network-oriented mesoscopic traffic simulation on GPU


Simulation Modelling Practice and Theory 74 (2017) 46–63


Xiao Song a,∗, Ziping Xie a, Yan Xu b, Gary Tan b,∗, Wenjie Tang c, Jing Bi d, Xiaosong Li e

a School of Automation Science, Beihang University (BUAA), Beijing, China
b Singapore-MIT Alliance Research & Technology (SMART), Singapore
c College of Computer, National University of Defense Technology, China
d Beijing University of Technology, Beijing, China
e Nanyang Technological University, Singapore


Article history: Received 7 September 2016; Revised 12 February 2017; Accepted 13 February 2017

Keywords: Mesoscopic traffic simulation; GPU; Realistic road network; Grid

Abstract

Mesoscopic traffic simulation is a promising technology for handling large-scale dynamic traffic assignment problems. Meanwhile, the GPU has been a research hotspot in the last decade because of its massive parallel performance compared with the CPU. Although several works deal with GPU-based parallel performance, few have addressed the issue of real-world network applications, so the performance of the GPU on realistic road networks is still unclear. In this paper, an ideal grid network and a realistic Singapore expressway network are implemented and compared. The former achieves a speedup of 10.00, while the latter achieves only 2.37. The two main reasons for the lower efficiency on the real network are the more complex data structure caused by the network topology and the smaller number of threads due to the scale of the expressway network. Future work needs more effort to develop a system with a better data structure and better-designed concurrent threads. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

Traffic simulation is an appealing solution for traffic planners and engineers to solve dynamic traffic assignment (DTA) problems for offline system planning and online operation management. Technically, any traffic simulation consists of two components: demand and supply [1,2]. Modeling from the travelers' point of view, the former seeks to understand how travel decisions are made, such as mode choice, departure time choice, and route choice. Modeling from the traffic flow's point of view, the latter seeks to understand how traffic demand is assigned to available road resources. The two components are essential for mesoscopic simulations such as DynaMIT [2,3], DYNASMART [5,26,36], CONTRAM [27], Dynamo [28], Dynameq [1,29], Mezzo [30,31], MATSIM [32,33], POLARIS [37], DynusT [40,41], and DTALite [42].

✩ This manuscript is based on our paper presented at PADS 2014, which used the grid-based experimental case study. Here we apply the algorithm to a realistic Singapore road network. The source code for this paper is shared on the authors' GitHub and can be downloaded from http://github.com/xuyannus/traffic_simulation_on_gpu.
∗ Corresponding authors.
E-mail addresses: [email protected], [email protected] (X. Song), [email protected] (G. Tan).

http://dx.doi.org/10.1016/j.simpat.2017.02.003 1569-190X/© 2017 Elsevier B.V. All rights reserved.


Mesoscopic simulations are computationally more efficient than microscopic traffic simulators and compute traffic dynamics in more detail than macroscopic simulations [1,3]. In contrast with aggregated macroscopic demand models, disaggregated demand models are used in mesoscopic traffic simulations to gain the advantage of being consistent with various traveler behaviors such as route choice. Meanwhile, the supply models are simplified compared with microscopic traffic simulations: individual vehicle dynamics are approximated by a speed-density model in the moving part of a link and a queuing model in the queue part of a link [1].

Despite this performance gain, it is still a challenge for mesoscopic simulation to satisfy the intensive computational requirements of real-world large-scale traffic assignment applications, especially real-time applications in metropolitan areas [1,4,16]. As such, performance and scalability have emerged in the past decade as major concerns in the field of mesoscopic traffic simulation, where some useful works have been done [2,7,9,14,15,16,35,40–42]. These works can mainly be divided into two categories: structurally improving the supply framework [2,16,40] and enhancing performance with parallel computing [7,9,14,15,37,39]. For the former, our companion paper [16] proposed ETSF to lower the computational complexity of a supply framework, but parallel computing methods were not addressed. For the latter, MPI, OpenMP, and GPU have been used to accelerate mesoscopic simulations [7,9,14,15,25,34–39], and a 5.5x–60x speedup over serial applications was achieved by Strippgen and Nagel [15]. However, the new generation of mesoscopic simulators such as DynusT and DTALite is still mainly focused on CPU computing and memory optimization [40–42], and few works exploit GPU computing. Furthermore, one important issue less scrutinized by existing parallel-computing-based works [25,33] is real city-scale networks, which normally have links of various lengths and complex relations of intersections and lane connections.

To overcome this shortcoming of previous research, a GPU-based mesoscopic simulation framework is proposed in this paper. This framework employs CPU and GPU resources to compute the demand and supply components, respectively, with an asynchronous simulation step strategy. Moreover, an ideal grid network and a real-world Singapore network are implemented and experimented with using the proposed GPU parallel computing method, and the speedup factors are compared and the reasons for their difference analyzed.

This paper is organized as follows. In Section 2 the GPU is introduced and a brief overview of computational efficiency methods is provided. Section 3 introduces the mesoscopic traffic simulation framework on CPU/GPU. Section 4 provides comparison results for the ideal grid network and the realistic road network, and the reasons for the lower speedup of the latter are analyzed. Finally, we list conclusions and future works in Section 5.

2. Related works

To simulate the supply part of an entire metropolitan city such as Singapore [1] or Beijing [4] in peak hours, when a few hundred thousand to a few million vehicles are on the road, the amount of time a single von Neumann-style serial processor would need to track and compute the states of such large numbers of vehicles makes traffic simulations nearly infeasible on these architectures.
To tackle this performance problem, researchers have made efforts from two perspectives. First, to improve the efficiency of the traffic framework, mesoscopic traffic simulators such as DynaMIT [3,4], DynaSMART [5,36], POLARIS [37], DynusT [40,41], and DTALite [42] were created to reduce the computational requirements of microscopic traffic simulators. Compared with microscopic traffic simulators, individual vehicle dynamics in mesoscopic traffic simulators are approximated by a speed-density model in the moving part of a link and a queuing model in the queue part of a link [1]. At the same time, in contrast with aggregated macroscopic models [6], vehicles are modeled as agents in mesoscopic traffic simulations, with the advantage of being consistent with detailed demand models of traveler behaviors such as route choice. For their well-designed trade-off between performance and accuracy, mesoscopic traffic simulators have been widely used to support large-scale simulation-based DTA systems [3–5,40,42]. Our recent work is ETSF [16], which reduces the theoretical time complexity of the mesoscopic supply simulation under the assumption that vehicles on the same lane of a segment move at the same speed within a time step. Even with ETSF, however, mesoscopic traffic simulations are still not efficient enough [16] to satisfy the intensive computational requirements of real-world large-scale DTA applications [1]. Moreover, in most cases, tens or hundreds of runs are required for statistical analysis before any decision-making, which makes the problem more challenging.

Second, many researchers have enhanced traffic simulation performance from a parallel computing perspective, using multicore CPUs or CPU clusters to handle large computational loads [7–9,18]. A traffic network is typically decomposed into segments that are handled by different local/remote processors. Load balancing, inter-processor communication, and synchronization then become important considerations. For optimal use of computing resources, segments must be distributed evenly among processors based on the estimated computation cost of each segment. For this purpose, multicore parallel algorithms and data structures were developed in [8,9], and message passing interface (MPI)-based communication among CPU clusters was used in [2]. In fact, one dilemma of these traditional parallel methods is that, when more distributed processors are used to handle more computational needs, more communication costs such as synchronization and boundary management are generated, which means it is still challenging to efficiently run large-scale mesoscopic traffic simulations on CPU clusters. Moreover, the cost and complexity of maintaining such computing resources makes these approaches expensive and sometimes undesirable.


[Fig. 1 diagram omitted: a road network hierarchy of nodes, links, segments, and lanes.]

Fig. 1. Network-related terminologies used in this paper.

Fortunately, desktop graphics processing unit (GPU) solutions have emerged, and the GPU has recently become a promising parallel computing technology for this performance problem because of its massive parallelism compared with the CPU. While GPUs were primarily meant for 3D rendering in graphics applications, rapid developments in their architectures have enabled their use in scientific computing [11,22], computational finance [12], computational biology [12], simulations [17,19–21], and high-performance computing [13]. Moreover, the development of direct computing application protocol interfaces (APIs) such as CUDA (compute unified device architecture) [10] and OpenCL has significantly reduced programming effort. However, fundamental differences between GPU and CPU architectures mean that the traditional technique of converting serial implementations to parallel ones using standards such as OpenMP and MPI is inapplicable. Thus, it is important to study how to implement mesoscopic traffic simulations on the GPU.

Moreover, although existing works have made some progress in GPU-based simulations, only a few have focused on traffic simulation. Perumalla et al. [14] introduced a pioneering method to simulate vehicle movement on the GPU using a field-based model. This model maps real-world road data onto a 2D lattice, with each element in the lattice representing the possibility of turning either left/right or up/down. Using this possibility data, vehicles are directed from one position in the 2D array to another. The proposed field-based model is similar to the classic cellular automata model. However, this field-based modeling is not intuitively adaptive to standard flow characteristics such as lane speeds and link densities, and real-time dynamic demands are not considered. The MATSIM team also released a research work implementing an event-driven mesoscopic traffic simulation framework on the GPU [15]. The paper introduced two kernel functions (moveLink and moveNode) to implement the core queue simulation. The team also addressed three different implementations of the vehicle array in the GPU memory. Their work obtained a speedup over serial applications of between 5.5 and 60 times, depending on the data structure and the NVIDIA GPU series. However, the paper did not fully explain how to migrate a real-world mesoscopic traffic simulation from the CPU to the GPU considering both demand and supply components. Perumalla [44] is also a good reference work for us, but it does not provide a detailed data structure design or GPU profiling results.

Recent works include POLARIS [37], DynusT [40,41], and DTALite [42]. Auld [37] discusses a discrete event engine-based agent modeling software development kit; this framework allows the modeling and simulation of transportation systems in a high-performance and extensible manner. Tian [40] discusses a new anisotropic mesoscopic simulation (AMS) approach that carefully omits micro-scale details but nicely preserves critical traffic dynamics characteristics. Zhou [42] presents a computationally efficient and theoretically rigorous dynamic traffic assignment (DTA) model coupled with a computationally efficient emission estimation package, MOVES Lite, in a mesoscopic simulation-based dynamic network loading framework. All these works are useful, but they mainly discuss CPU-oriented parallel computing simulations without studying the massive parallel computing resources of the GPU.
To fully utilize GPU capabilities, we attempt to study how to coordinate the demand and supply workload between the CPU and GPU, how to map GPU kernel functions and data structures to road networks, and finally how to apply our approach to the Singapore expressway network.

3. Mesoscopic traffic simulation framework on CPU/GPU

As shown in Fig. 1, in this paper a road network is modeled as nodes, links, segments, and lanes. The nodes correspond to intersections of the actual road network, while links represent unidirectional pathways between nodes. Each link is divided into a number of segments according to geometry features. Each segment contains lanes, and each lane contains a number of vehicles located on it. Each lane has capacity constraints at the upstream end and the downstream end, referred to as the input capacity and the output capacity. A queue occurs in a lane if vehicles cannot pass through it. A spill-back occurs if a lane is blocked, which means the length of the queue on the lane equals the length of the lane.

According to a vehicle's status (moving or in queue), its location is updated using the following rules. If a vehicle is located in the moving part of a link, its speed is determined by a speed-density relationship based on the density of the link, and its location is then updated using that speed. Example speed-density relationships can be found in land authority publications such as the Transportation Research Board manual [43]. In this paper, the following equation from [16,43] is


used to compute the speed of a link:

v = v_0 \left[ 1 - \left( \frac{k}{k_{jam}} \right)^{\alpha} \right]^{\beta},    (1)

where v0 is the free-flow speed, k is the density, kjam is the jam density, and α, β are configurable parameters determined for each link through calibration. In this paper, α = 1.0 and β = 0.05. If a vehicle is located in the queue part of a link, there are two possible conditions:

• If the vehicle is at the head of the queue (at the exit of the link), it can leave the queue only if the current link has output capacity left and the downstream link has sufficient empty space and sufficient input capacity. If both conditions are satisfied, the vehicle passes from the current link to the next. Otherwise, the vehicle stays at the exit of the link.
• If the current vehicle is not first in the queue (i.e., there are other queuing vehicles ahead of it), it can only advance as far as the vehicles in front of it do (assuming no space is left between any two consecutive queuing vehicles). The distance is then determined by how many vehicles have left the head of the queue during the same time step.

3.1. The framework for CPU/GPU

In most existing GPU-based simulations [14,15,23], the CPU code plays the role of a master thread that controls the program flow. The GPU code, on the other hand, spawns a bunch of worker threads to execute the compute-intensive parts of the program in parallel, thus accelerating the overall program. However, this style of implementation has two shortcomings for traffic simulations. First, the demand part, including vehicle generation, departure time choice, and path choice, is often dynamically generated in real time and is thus not well suited to the GPU memory structure. Second, the central processing unit (CPU) is not fully used, which is a waste of resources to some extent.

To tackle these problems, we design a new simulation framework making full use of both CPU and GPU. In this framework, the GPU is responsible for the supply part of the mesoscopic traffic simulation, which includes speed calculation, vehicle movement on a road and between roads, and queue calculation. A key feature of the supply simulation is that the simulation of a road depends only on its surrounding roads, which fits the GPU's data-parallel requirement. Further, the CPU is responsible for the demand part and the I/O part of the mesoscopic traffic simulation, which includes vehicle generation, departure time choice, pre-trip route choice, en-trip route choice, and pushing simulation results to files. A key feature of the demand simulation is that vehicles make decisions based on information about the global road network.

Fig. 2 shows the mesoscopic traffic simulation framework on CPU/GPU, which explains the logic procedure and the simulation time management. This framework is suitable for general time-stepped mesoscopic traffic simulation [3–5]. The traditional simulation time, which controls the turnover of the system status, is divided into three components: a demand time step (td), a supply time step (ts), and an I/O time step (tio). A traffic simulation is completed only when td, ts, and tio have all reached the simulation end. In this framework, the multiple time steps make it possible to identify the exact progress of the different components in a traffic simulation. Note that, at an instantaneous time, the three time steps can differ. The time management in this framework is controlled by three rules:

• Rule 1: td >= ts
• Rule 2: ts >= tio
• Rule 3: td <= ts + DFD

First, td is never smaller than ts because the supply simulation at time t can start only after the vehicles entering the simulation at time t have been generated.
Second, ts is never smaller than tio because the simulation results at time t can be written to files only after the supply simulation at t is completed. The third rule involves a concept in traffic simulation: the demand feedback delay (DFD), which is a multiple of the simulation time tick Δt. Vehicles generated at time t require the simulated results at t − DFD for departure time choices and route choices. The minimum value of DFD is 1, which means vehicles have real-time instantaneous information about the global traffic status in the last time step (e.g., 1 s); DFD tends to be larger in real-world traffic systems.

In the logic procedure in Fig. 2, steps 1 and 2 initialize the required data structures on the CPU and the GPU, including the road network, traffic scenario configurations, and other parameters. After initialization, the CPU controls the simulation logic, in order to manage the simulation time and to make full use of computational resources. Without breaking the three time-management rules, the following tasks can be executed in parallel:

• Task 1: The supply simulation at time ts on the GPU (steps 3–5).
• Task 2: The demand simulation at time td on the CPU (steps 6–8).
• Task 3: Pushing simulation results at time tio to files (steps 9 and 10).

Within a loop of the logic procedure, the CPU first checks whether the GPU has finished the supply simulation at time ts. If yes, the simulation results on the GPU (e.g., road-based speed and density) are copied to the CPU, and the supply time ts is advanced.


[Fig. 2 flowchart omitted. After CPU initialization (simulation time tick = Δt, simulation end = T, td = ts = tio = 0, demand feedback delay DFD set) and GPU initialization (steps 1 and 2), the CPU loops until td >= T && ts >= T && tio >= T. If td >= (ts + Δt) && ts < T and the GPU supply simulation has returned, it copies the GPU results (speed, queue length, etc.) to CPU memory, advances ts, and starts the GPU supply simulation at ts (steps 3–5). Else, if td < T && td < (ts + DFD), it advances td, runs the demand simulation on the CPU, and copies new vehicles to GPU memory (steps 6–8). Else, if tio < T && tio < ts, it advances tio and outputs the results at tio to files (steps 9 and 10).]

Fig. 2. Mesoscopic traffic simulation framework on CPU-GPU.

Then, the supply simulation at the next time step is started on the GPU. Note that the CPU does not wait for the GPU supply simulation to finish. While the supply simulation on the GPU is ongoing, the CPU checks whether the demand simulation can be started; if the simulation results required for the demand simulation are available, the demand simulation is started on the CPU. Otherwise, the CPU checks whether there are available simulation results that need to be written to files. The CPU continues the loop until the three times ts, td, and tio have all reached the simulation end. The logic of the supply simulation and the demand simulation is explained in [1,16].
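To make this time management concrete, the following is a minimal C++ sketch of the CPU control loop. The helper functions are hypothetical placeholders for the tasks in Fig. 2, not names from the released code.

// Hypothetical task functions (placeholders for the framework's actual code):
bool supply_simulation_finished();
void copy_gpu_results_to_cpu();
void start_supply_on_gpu(int ts);
void run_demand_on_cpu(int td);
void copy_new_vehicles_to_gpu();
void write_results(int tio);

void simulation_loop(int T, int dt, int DFD) {
    int td = 0, ts = 0, tio = 0;              // demand, supply, and I/O times (step 1)
    while (!(td >= T && ts >= T && tio >= T)) {
        // Task 1 (steps 3-5): advance the supply simulation on the GPU.
        if (td >= ts + dt && ts < T && supply_simulation_finished()) {
            copy_gpu_results_to_cpu();        // e.g., road-based speed and density
            ts += dt;
            start_supply_on_gpu(ts);          // asynchronous launch; the CPU does not wait
        }
        // Task 2 (steps 6-8): advance the demand simulation on the CPU (Rule 3).
        else if (td < T && td < ts + DFD) {
            td += dt;
            run_demand_on_cpu(td);            // vehicle generation, route choices
            copy_new_vehicles_to_gpu();
        }
        // Task 3 (steps 9-10): write completed results to files (Rule 2).
        else if (tio < T && tio < ts) {
            tio += dt;
            write_results(tio);
        }
    }
}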


[Fig. 3 diagram omitted: the CPU-side object model. A RoadNetwork holds vectors of Link and Node objects; each Link stores start/end Node pointers and a vector of Segments; each Segment stores a vector of Lanes; each Lane stores lane connections, input/output capacities, and the Vehicles moving on it; each Vehicle stores its ID, path, and a pointer to the lane it is moving on.]

Fig. 3. Road network and vehicle modeling on the CPU.

3.2. Data structure of road network and vehicle on GPU

For parallel computing performance, the data structure in the GPU memory is completely different from the CPU's, due to the thread hierarchy and the memory hierarchy of the GPU. Fig. 3 shows the road network and vehicles in the CPU memory. A road network is composed of a list of links and a list of nodes. Each link consists of a number of segments, and each node keeps a list of upstream and downstream links. Each segment consists of multiple lanes, and each lane contains a number of lane connections. Each lane also has access to the vehicles that are moving on it.
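As a rough C++ illustration of this pointer-based CPU model (field names follow Fig. 3; the actual class definitions in the released code may differ):

#include <vector>

struct Lane;   // forward declaration

struct Vehicle {
    int vehicle_ID;
    std::vector<int> path;                  // precomputed route
    Lane* moving_on_lane;                   // the lane the vehicle is currently on
};

struct Lane {
    int lane_ID;
    std::vector<Lane*> lane_connections;    // reachable downstream lanes
    std::vector<Vehicle*> vehicles;         // vehicles moving on this lane
    double input_capacity, output_capacity;
};

struct Segment { int segment_ID; std::vector<Lane*> lanes; };

struct Node;

struct Link {
    int link_ID;
    Node *start_node, *end_node;
    std::vector<Segment*> segments;
};

struct Node {
    int node_ID;
    std::vector<Link*> upstream_links, downstream_links;
};

struct RoadNetwork {
    std::vector<Link*> links;
    std::vector<Node*> nodes;
};

Objects are allocated separately on the heap and reference each other through pointers and dynamically sized std::vector containers, which is flexible on the CPU but, as discussed next, unsuitable for the GPU.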

3.2.1. Data structure of CPU vs. GPU

From the implementation perspective, the data structures in the CPU memory and the GPU memory have three key differences.

First, in the CPU memory, the large number of road elements and vehicles are stored in separate, scattered memory spaces, and the objects are connected to each other using pointers. In the GPU memory, by contrast, these elements are kept in arrays in a contiguous memory space, and different elements reference each other using indexes into the arrays. The reasons for doing this in the GPU memory are to make it easy to copy the entire road network from the CPU memory to the GPU memory and, more importantly, to allow efficient coalesced memory access, which means a group of GPU threads in a warp tends to access contiguous memory.

Second, in the CPU memory, dynamic memory allocation (e.g., std::vector, which requests memory when it is immediately required) is widely used in the data structures of the road network and vehicles, because of its flexibility and efficiency. In the GPU memory, however, dynamic memory allocation has to be replaced by static memory allocation, which reserves a sufficiently large amount of memory at the beginning; this is a limitation of GPU programming, where random memory access is inefficient. In our GPU framework, "start & end indexes" are used to store the containment relationships.

Third, it would be good to employ shared memory where possible, as the matrix multiplication example in NVIDIA [10] does. However, it is not easy to use shared memory in our traffic simulation framework, for two reasons: (1) In the matrix multiplication example, the submatrix is stored in shared memory because it is used multiple times during multiplication, whereas in our traffic scenario each lane computes its speed and queue length without using other lanes' data, so there is little need for shared memory. (2) Considering the case in Section 4 with 100,000 vehicles, we need 99.95 MB for lanes, while shared memory per block is only 48 KB; using shared memory would lead to multiple sections with various thread indexes. As such, only global memory is employed in our kernel functions.

To illustrate these technical points, Fig. 4 shows the data structure of an ideal grid road network and vehicle modeling in the GPU memory.
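The following sketch contrasts the GPU-side layout: statically sized tables in contiguous memory, connected by indexes rather than pointers. It is a simplified rendering of Figs. 4 and 5; the array sizes and field subsets are illustrative.

#include <cuda_runtime.h>

#define LANE_SIZE    20200      // illustrative table sizes
#define VEHICLE_SIZE 100000

struct LaneOnGPU {
    int lane_connection_start_index, lane_connection_end_index;
    int vehicles_start_index, vehicles_end_index;   // slice of the global vehicle table
    int vehicle_size;                               // vehicles currently on this lane
    float input_capacity, output_capacity;
};

struct VehicleOnGPU {
    int vehicle_ID;
    int move_on_lane_index;                         // index into the lane table
};

struct RoadNetworkOnGPU {
    LaneOnGPU lanes[LANE_SIZE];
    VehicleOnGPU vehicles[VEHICLE_SIZE];
    // node, segment, and lane-connection tables are laid out the same way
};

// The whole network is one contiguous block, so a single cudaMemcpy suffices:
RoadNetworkOnGPU* copy_network_to_gpu(const RoadNetworkOnGPU* host_net) {
    RoadNetworkOnGPU* dev_net = nullptr;
    cudaMalloc(&dev_net, sizeof(RoadNetworkOnGPU));
    cudaMemcpy(dev_net, host_net, sizeof(RoadNetworkOnGPU), cudaMemcpyHostToDevice);
    return dev_net;
}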


[Fig. 4 diagram omitted: the GPU tables for the ideal grid network. RoadNetworkOnGPU holds fixed-size arrays Link[LINK_SIZE], Node[NODE_SIZE], Segment[SEGMENT_SIZE], and Lane[LANE_SIZE]. Links store start/end node indexes and segment indexes; nodes store upstream/downstream link indexes; lanes store connected lane indexes, vehicle indexes, buffered vehicle indexes, input/output capacities, and speed_history[TOTAL_TIME_STEPS(1000)]; vehicles store move_on_lane_index and path_code[kMaxRouteLength(500)].]

Fig. 4. Ideal grid road network and vehicle modeling on the GPU.

3.2.2. Data structure of ideal grid vs. real road network

Compared with the ideal grid network modeling in the GPU memory, the real-world network data structure is shown in Fig. 5. This design uses indexes in the data structures to save memory space. The design details are introduced as follows.

1. Realistic road network modeling is more complex. Each link of an ideal grid road network has only one segment and its lane, but there are often two or more lanes in a real network. Correspondingly, in the GPU memory design, a grid road network has only "lane_index," whereas there are "lane_start_index" and "lane_end_index" for each segment in a realistic road network. For example, if a segment has four lanes, this index pair can be 1 and 4 (if the segment is the first segment in the network) or, more generally, X and X + 3, where X is the index of the segment's first lane in the whole network.

2. In a grid network, all lanes have equal length, and the range of vehicle indexes is thus fixed; we only need "vehicle_index" on the lane to know the ID of a vehicle on the lane. In a real network, we need "vehicle_start_index" and "vehicle_end_index" for each lane because the lengths of the lanes differ. For instance, in the struct "LaneOnGPU," the attributes "vehicle_start_index" and "vehicle_end_index" define the global start and end indexes of the lane's vehicles in the network, i.e., the range of vehicle indexes on the lane: if a lane is 1000 m long and each vehicle is 5 m long, the index range is X to X + 199, where X depends on the other lanes in the network (see the sketch after this list).

3. Considering the complexity of real network topology, "lane_connection" is introduced in the road modeling; it does not exist in the ideal grid road network because those connections are easily predetermined. The struct "LaneConnection" is devised to contain localized lane-connection information; it answers the question "which upstream lanes are allowed to enter the downstream lane at an intersection?"

4. Global path selections of vehicles are stored in the CPU memory, and each vehicle has its own path. In the struct "LaneOnGPU," the attribute "vehicle_num" defines the number of vehicles at a time t. "vehicle_start_index" and "vehicle_end_index" define the physical limit on the total number of vehicles and are constant values, whereas "vehicle_num" changes as the simulation advances.

To further illustrate the above points, an example road network is used to explain the data structure in the GPU memory. Fig. 6(A) represents a real-world road network consisting of a main road with one on-ramp and one off-ramp. The road network is then modeled as six nodes and five segments in Fig. 6(B). The two nodes of interest are node 1 and node 2 [N1 and N2 in Fig. 6(B)]. Node 1 has two upstream segments and one downstream segment; node 2 has one upstream segment and two downstream segments. The main road has two lanes; the on-ramp and off-ramp roads have one lane each. The length of segment 3 (S3) is 200 m; the lengths of all other segments are 100 m. The length of a vehicle is 5 m.
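As a small illustration of item 2 above, the per-lane vehicle index ranges can be assigned with a running prefix sum over lane capacities. This is a sketch under the paper's 5 m vehicle-length assumption; the function and variable names are ours, not the released code's.

#include <cstddef>
#include <vector>

struct LaneRange { int vehicle_start_index; int vehicle_end_index; };

// Assign each lane a slice [vehicle_start_index, vehicle_end_index] of the
// global vehicle table, sized by how many 5 m vehicles fit on the lane.
std::vector<LaneRange> assign_vehicle_ranges(const std::vector<double>& lane_lengths_m,
                                             double vehicle_length_m = 5.0) {
    std::vector<LaneRange> ranges(lane_lengths_m.size());
    int next_free = 0;                                            // running prefix sum
    for (std::size_t i = 0; i < lane_lengths_m.size(); ++i) {
        int capacity = static_cast<int>(lane_lengths_m[i] / vehicle_length_m);
        ranges[i].vehicle_start_index = next_free;                // X
        ranges[i].vehicle_end_index   = next_free + capacity - 1; // X + 199 for a 1000 m lane
        next_free += capacity;
    }
    return ranges;
}

For a 1000 m lane this reserves 200 slots, matching the X to X + 199 range described above.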


[Fig. 5 diagram omitted: the GPU tables for the realistic network. RoadNetworkOnGPU holds links[LINK_SIZE], nodes[NODE_SIZE], segments[SEGMENT_SIZE], and lanes[LANE_SIZE]. SegmentOnGPU stores start/end node indexes and lane_start_index/lane_end_index; NodeOnGPU stores start/end indexes of its upstream segment connections; LaneOnGPU stores start/end indexes for lane connections, vehicles, and buffered vehicles, vehicle_size, buffered_vehicle_size, input/output capacities, and speed_history[kTotalTimeSteps(1000)]; LaneConnectionOnGPU is keyed by (from_lane_Index, to_lane_Index); VehicleOnGPU stores move_on_lane_index, move_on_link_index, path_indexes[MAX_ROUTE_LENGTH(800)], and next_path_index; VehicleSpace, BufferedVehicleSpace, and VehiclesLoader hold the vehicle ID tables and per-time-step loading data.]

Fig. 5. Realistic road network and vehicle modeling on the GPU.

Fig. 6. Example showing the arrays designed for node, segment, lane connection, and vehicle index for a typical road topology. [Panels (A) and (B) show the road network and its node/segment model; panel (C) shows the data structure in the GPU memory; * means the ID does not exist in the figure.]


[Fig. 7 diagram omitted: the example topology (nodes N1–N4; segments S1, S2, S3, S7, S9) divided into four GPU processing areas, one per node together with its upstream segments.]

Fig. 7. Four GPU processing areas in an example road topology.

The data structure of the road topology in the GPU memory is shown in Fig. 6(C). First, contiguous memory spaces (e.g., tables) are used to keep the nodes, segments, lanes, etc. Second, each element has a unique ID (or index), which indicates its location in its table; for example, segment S1 is the first element in the segment table. Third, a pair of start/end indexes is used to store the relationships between nodes and segments, segments and lanes, and lanes and vehicles. Moreover, complex lane-connection rules are directly modeled in our data structure. For example, as shown in the lane-connection table, the on-ramp segment 2 contains only one lane (lane 3), and lane 3 is connected to only one lane of segment 3 (lane 5); this means that, in this example, vehicles on segment 2 cannot pass to the other lane of segment 3 (lane 4). Meanwhile, each lane in segment 3 has space for 40 vehicles, while the other lanes have space for 20 vehicles; thus, each lane in segment 3 reserves 40 vehicle ID slots. An attribute "on_road_vehicle_num" is used to determine how many vehicles are on the lane at a time and which vehicle IDs are valid.

3.3. Thread and partition of road network

We assume that a traffic system can be divided into numerous disjoint small partitions and assign each GPU thread to simulate several partitioned traffic models. This approach can significantly reduce the total execution time if the partitioning solution is well designed. Normally, there are two choices for partitioning a traffic network. The first is to assign vehicles to threads: a new vehicle is assigned to the thread with the least workload. This is more appropriate for microscopic simulation, where each vehicle has rich behaviors and more computing load. The second is to assign road lanes or segments to threads. This approach suits our mesoscopic simulation because of its better cache performance, as vehicles on the same road segment are processed in order by the same thread. Besides, road-segment-based properties (e.g., the average speed) can be efficiently calculated in this approach. In other words, under the current GPU architecture the cost mainly comes from the memory accesses of traffic models on different threads, so we consider it a high priority to achieve coalesced memory access of vehicles on the same road segment. As such, we choose the latter approach, where each node, together with all upstream road segments connected to it, corresponds to one GPU thread. For instance, the road topology in Fig. 6(B) is computed with four GPU threads, as shown in Fig. 7.

In the proposed assignment method, all upstream road segments connected to the same node are processed together by the same GPU thread. Besides, all these upstream road segments are stored in a contiguous structure in the GPU memory (as in Section 3.2). This allows an efficient way to deal with conflicts among vehicles from different upstream road segments trying to enter and cross the node.

3.4. Kernel functions of supply simulation on the GPU

Applying the above data structure design and network partition method, the supply component of our mesoscopic simulation on the GPU consists of four key functions:

1) cpu_update()
2) kernel function pre_vehicle_passing()
3) kernel function vehicle_passing()
4) copy_simulation_results_to_cpu()

The first function allows the CPU to change the status of the traffic simulation before starting the supply simulation at the next time step. For example, if an incident happens, the road capacity is reduced. The CPU updates the new capacity to the road network on the GPU memory before simulating the next time step. Another example is en-trip route choices. En-trip route changing behavior is simulated on the CPU, and then the new routes are copied to the GPU.
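A minimal sketch of how these four functions might be sequenced for one supply time step is shown below. The kernel name SupplySimulationPreVehiclePassing and the 192-thread block size appear in Fig. 9 and Section 4.2.2; the second kernel's name, the argument lists, and the launch configuration are our simplifications, not the exact released code.

struct GPUMemory;    // opaque handles to the device tables (Section 3.2)
struct GPUVehicle;
void cpu_update(GPUMemory* gpu_data, int ts);
__global__ void SupplySimulationPreVehiclePassing(GPUMemory* gpu_data, int ts, GPUVehicle* vpool);
__global__ void SupplySimulationVehiclePassing(GPUMemory* gpu_data, int ts, GPUVehicle* vpool);
void copy_simulation_results_to_cpu(GPUMemory* gpu_data, int ts);

void supply_step(GPUMemory* gpu_data, GPUVehicle* vpool_gpu, int ts,
                 int num_lanes, int num_nodes) {
    const int kBlockSize = 192;   // block size used in the paper's experiments

    cpu_update(gpu_data, ts);     // e.g., incident-reduced capacities, en-trip route changes

    // One thread per lane: load buffered/new vehicles, update density and speed.
    SupplySimulationPreVehiclePassing<<<(num_lanes + kBlockSize - 1) / kBlockSize,
                                        kBlockSize>>>(gpu_data, ts, vpool_gpu);

    // One thread per node (with its upstream lanes): pass vehicles downstream.
    SupplySimulationVehiclePassing<<<(num_nodes + kBlockSize - 1) / kBlockSize,
                                     kBlockSize>>>(gpu_data, ts, vpool_gpu);

    copy_simulation_results_to_cpu(gpu_data, ts);   // speed, density, flow, queue length
}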


Algorithm 1. pre_vehicle_passing

Start n GPU threads (n is the number of lanes)
For each GPU thread:
    for each vehicle in the buffered space of the lane do
        if there is space on the lane do
            transfer the vehicle from the buffered space to the lane;
    if there are vehicles in the buffered space of the lane do
        shift vehicles in the buffered space;
    for each newly generated vehicle on the lane do
        if there is space on the lane do
            load the newly generated vehicle to the lane;
        else if there is space in the buffered space of the lane do
            load the newly generated vehicle to the buffered space;
        else
            cannot load the newly generated vehicle;
    update the density of the lane;
    update the speed of the lane;
    update other attributes (e.g., tp) of the lane.

[Fig. 8 diagram omitted: lanes 1 and 4 both feed node 1, so both upstream lanes are updated by the same GPU thread; lanes 2, 3, and 5 connect node 1 onward to node 2.]

Fig. 8. A node and its upstream links are updated on the same GPU thread.

Algorithm 2. vehicle_passing

Start m GPU threads (m is the number of nodes)
For each GPU thread:
    while the node has capacity left during the time step do
        obtain the vehicle on the upstream road segments of the node with the maximum waiting time;
        if the vehicle finishes its trip do
            remove the vehicle from the upstream road segment;
        else if the corresponding downstream road segment has buffered space left do
            transfer the vehicle to the buffered space of the downstream road segment;
            update the vehicle's status;
            update the node's status;
            update the upstream road segment's status;
        else
            mark the upstream road segment as blocked during the time step;
    end of while
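A compressed CUDA sketch of how Algorithm 1 maps onto threads (one lane per thread, global-memory tables as in Section 3.2) follows. The struct fields, constants, and density formula are illustrative simplifications, not the exact released code; the speed update follows Eq. (1).

#include <cuda_runtime.h>

// Minimal field subset assumed for illustration (cf. the Section 3.2.1 sketch).
struct LaneOnGPU {
    int   vehicle_size;            // vehicles currently on the lane
    float length_m, v0, k_jam;     // lane length, free-flow speed, jam density
    float speed_history[1000];     // kTotalTimeSteps
};
struct GPUMemory { int lane_size; LaneOnGPU* lanes; };
struct GPUVehicle;                 // vehicle pool, unused in this excerpt

#define kAlpha 1.0f                // Eq. (1) parameters used in the paper
#define kBeta  0.05f

__global__ void SupplySimulationPreVehiclePassing(GPUMemory* gpu_data, int time_step,
                                                  GPUVehicle* vpool_gpu) {
    int lane_index = blockIdx.x * blockDim.x + threadIdx.x;
    if (lane_index >= gpu_data->lane_size) return;   // guard for the last block
    LaneOnGPU* lane = &gpu_data->lanes[lane_index];

    // Algorithm 1, steps 1-2: move buffered vehicles onto the lane, then load
    // newly generated vehicles (loops over the lane's vehicle index slice omitted).

    // Algorithm 1, step 3: update density and speed via Eq. (1).
    float k = lane->vehicle_size / lane->length_m;   // density (vehicles per meter)
    float v = lane->v0 * powf(1.0f - powf(k / lane->k_jam, kAlpha), kBeta);
    lane->speed_history[time_step] = v;

    // Remaining attributes (e.g., tp from [16]) are updated similarly.
}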

The second function is a GPU kernel that updates the status of each lane (e.g., density, speed, and tp [16]) before passing vehicles to the downstream lanes. The update unit of this kernel function is a lane, meaning each lane is simulated on one GPU thread. This kernel function first loads the vehicles that were passed to this road at the previous time step; after that, it loads newly generated vehicles onto the lane. The kernel then calculates the speed of the lane based on the speed-density relationship [1,16] and finally calculates tp; the method for calculating tp is explained in our previous work [16]. The detailed procedure of this kernel function is shown in Algorithm 1.

The third function is also a GPU kernel; it scans the vehicles on each lane and passes a number of downstream vehicles to the next lane. The update unit of this kernel function is a node. As Section 3.3 discusses, each node and its upstream lanes are simulated on one GPU thread, because vehicles from different upstream lanes might conflict with each other when crossing the node into the same next lane. One example is shown in Fig. 8: in this small road network, node 1 has two upstream lanes, lane 1 and lane 4, so vehicles on lane 1 and lane 4 are processed on the same GPU thread to remove the potential conflicts. Moreover, as Fig. 7 showed, because each lane has only one upstream node from which vehicles might pass through, there is no conflict when updating nodes in parallel. There are four rules determining whether a vehicle can pass from a lane to the next lane, which are explained in our previous work [16]. Afterward, if a vehicle crosses from the current lane to the next lane, the output capacity of the current lane and the input capacity and empty space of the next lane are updated. Finally, the vehicle ID is removed from the current lane and inserted into the next lane. The details of this kernel function are further explained in Algorithm 2.

The divergence possibility in these two algorithms appears somewhat large, but it is almost inevitable given their traffic-processing logic; it is difficult to improve because we must guarantee precise results (such as lane speed and


Table 1
Demand simulation time and the time to copy data and results between the CPU and GPU.

Copy data from host (CPU) to device (GPU):
  Grid: 211 ms; 708.99 MB — vehicle pool (100,000 vehicles, 202 MB), node pool (10,201 nodes, 239.08 KB), lane pool (20,200 lanes, 99.95 MB), new vehicles in each time step (406.80 MB)
  Singapore: 161 ms; 479.18 MB — vehicle pool (100,000 vehicles, 321.97 MB), node pool (3179 nodes, 74.51 KB), lane pool (9419 lanes, 46.75 MB), new vehicles in each time step (110.38 MB)

Copy data from device (GPU) to host (CPU):
  Grid: 61 ms; 404.00 MB — 20,200 lanes: flow, speed, density, queue_length, vehicle_counts
  Singapore: 28 ms; 188.38 MB — 9419 lanes: flow, speed, density, queue_length, vehicle_counts

Total:
  Grid: 272 ms; 1112.99 MB
  Singapore: 189 ms; 667.56 MB

void supply_simulation_pre_vehicle_passing(int time_step) {
    for (unsigned int lane_index = 0; lane_index < the_network->lane_size; lane_index++) { ... }
}

(A) Computation of all the lanes on the CPU

__global__ void SupplySimulationPreVehiclePassing(GPUMemory* gpu_data, int time_step,
        int segment_length, GPUSharedParameter* data_setting_gpu, GPUVehicle* vpool_gpu) {
    int lane_index = blockIdx.x * blockDim.x + threadIdx.x;
    ...;
}

(B) Computation of all the lanes on the GPU

Fig. 9. Program comparison of the supply part.

queue length) of the traffic simulation. The branch-taken-ratio metric in Table 2 shows that the divergence ratio is indeed somewhat high. But we can still gain a 2.37 speedup (see Fig. 14) for the Singapore network running on the GPU, which is to some extent acceptable.

The last function copies the simulated results, which include the speed, density, flow, queue length, and empty space of each road, from the GPU memory to the CPU memory. As shown in Fig. 6(B) and (C), these data are stored in a contiguous GPU memory space in order to reduce the time cost of data transfer from the GPU memory to the CPU memory.

The two kernel functions above are developed as Fig. 9(B) shows, and the program comparison between CPU and GPU is illustrated in Fig. 9. In the serial program on the CPU, the computation of each lane is performed by a loop body: the second lane starts computing only when the computation of the first lane finishes. To reduce the time cost and improve efficiency, the loop body and the loop control variable disappear in the parallel program on the GPU; instead, each lane is assigned to a GPU thread so that all lanes are computed in parallel. Also, the __global__ prefix is added to the function, which tells the compiler to generate GPU code rather than CPU code when compiling this function and to make that GPU code callable from the CPU.

4. Experiments

The traffic scenario is simulated on two types of platforms: the CPU and CPU/GPU. Only the total time cost of the supply simulation during the 1000 simulation ticks is measured in this experiment. The CPU platform includes an Intel Core i5-4200H CPU @ 2.80 GHz, 8 GB memory, and a 500 GB SATA 7200 RPM disk. The GPU platform is a GeForce GTX 950M, which has 640 CUDA cores and 2 GB global memory. The supply simulations on the CPU and the GPU follow the same logic. The source codes are both implemented in C++ on Ubuntu 12.04 and compiled using g++ 4.6.3 and CUDA 6.5. The release-version executable is used to measure the time cost.

4.1. Experiment design and analysis

Two cases of experiments were carried out in order to evaluate the computational efficiency of CPU/GPU in artificial grid and real-world traffic scenarios.


Fig. 10. Singapore expressway road network.

Fig. 11. Distribution of segment lengths in Singapore expressway.

First, an artificial large grid road network (uniform rectangular grid), similar to the road network topology in our companion work [25], is designed with 10,201 nodes and 20,200 unidirectional links. The length of each link is 1000 m. Each node has an index from 0 to 10,200, indicating its storage location in the GPU memory; each link likewise has an index from 0 to 20,199. 100,000 vehicles are loaded into the road network during 1000 simulation ticks (each tick is 1 s). Vehicles are loaded into this network from nodes at the top and the left and move toward the bottom and the right. Each vehicle randomly picks a route from the precalculated candidate routes before starting a trip. En-trip route choice [1] is not included in this traffic scenario.

Second, a real-world Singapore expressway system traffic simulation is studied. As shown in Fig. 10, the expressway system consists of expressway segments and ramps connecting local roads with the expressway. The network has been modeled using a detailed representation of the length, geometry, and lanes of each segment. The expressway system is made up of 3179 nodes, 9419 lanes, and 3388 segments. The distribution of segment lengths is shown in Fig. 11. Most segment lengths are in the range of (300, 600) m; there are some short segments, mostly on-ramps or off-ramps, and also a few long segments. The demand is modeled as trips from 4106 OD pairs. Each origin is an on-ramp, where vehicles enter the expressway system from local roads, and each destination is an off-ramp, where vehicles leave the network. The configuration and calibration of the OD matrix use the same methods as DynaMIT [23,25], and the calibrated OD matrices for one period, the peak hour (7:00 a.m.–8:00 a.m.), are used in this section. In particular, a total of 100,000 vehicles are loaded into the peak traffic scenario during 1000 simulation ticks (each tick is 1 s). The routes of vehicles are precalculated using a path-size logit model [24], and routes are not changed during the traffic scenarios.

4.2. Results and analysis

The execution time results are shown in Figs. 12 and 13, and the corresponding speedups in Fig. 14. Also, the detailed profiling data of the two major kernel functions are listed in Table 2.


Fig. 12. Performance of Grid on the CPU and the GPU. The horizontal axis is the number of vehicles; the vertical axis is the execution time in seconds.

Fig. 13. Performance of Singapore Network on the CPU and the GPU. The horizontal axis is the number of vehicles; the vertical axis is the execution time in seconds.

Fig. 14. Speedup comparison of grid and Singapore network. The horizontal axis is the number of vehicles; the vertical axis is the speedup.

Table 2
Profile of major GPU/CUDA kernel functions. G1 = grid kernel 1 (pre_vehicle_passing); S1 = Singapore kernel 1 (pre_vehicle_passing); G2 = grid kernel 2 (vehicle_passing); S2 = Singapore kernel 2 (vehicle_passing).

1. Launched GPU threads: G1 20,352 (106 blocks); S1 9600 (50 blocks); G2 10,368 (54 blocks); S2 3264 (17 blocks)
2. GPU occupancy (theoretical/achieved): G1 94%/79.01%; S1 93.75%/76.5%; G2 94%/78.28%; S2 93.75%/23.63%
3. Registers (used/available): G1 4224/65,536; S1 4032/65,536; G2 3456/65,536; S2 2880/65,536
4. Transactions per request (load/store): G1 2.55/4.53; S1 2.32/5.26; G2 4.26/1.58; S2 4.97/1.14
5. Branch taken ratio (%): G1 60.62%; S1 60.05%; G2 66.44%; S2 27.62%
6. Instruction serialization: G1 0.988%; S1 1.12%; G2 3.698%; S2 1.209%
7. Instructions per clock (IPC) (measured/maximum): G1 0.355/4.0; S1 0.364/4.0; G2 0.307/4.0; S2 0.148/4.0
8. Warp issue efficiency (no eligible %): G1 90.35%; S1 91.43%; G2 93.22%; S2 96.06%
9. Issue stall reasons (execution dependency): G1 18.73%; S1 15.41%; G2 4.33%; S2 4.98%
10. CUDA achieved GFLOPS: G1 8.950; S1 6.257; G2 0.0104; S2 0.00012


[Fig. 15 diagram omitted: the grid vehicle struct (vehicle_Index, move_on_lane_index, vehicle_ID, path_code[kMaxRouteLength(500)]) versus the real-world VehicleOnGPU struct, which adds move_on_link_index and next_path_index and enlarges the route array to path_indexes[MAX_ROUTE_LENGTH(800)].]

Fig. 15. Struct of vehicles in grid and real-world network.

Two groups of configurations are investigated: the first for the artificial grid, the second for the Singapore network. Each configuration is executed 20 times, and the average time cost is shown in Figs. 12 and 13. The speedup is measured by comparing the time cost of the supply simulation on a GPU to the time cost of the same supply simulation on a CPU core. In Fig. 14, we can observe that when the number of vehicles is 100,000, the grid simulation on the GPU obtains a speedup of 10.00, whereas that of the realistic network is 2.37. Compared with the speedup on the artificial large grid road network, the speedup on the Singapore expressway is thus much lower. The main reasons are as follows.

First, due to the data structure differences discussed in Section 3.2, the number of memory accesses in the Singapore road network is larger than in the ideal grid road network. For instance, in the Singapore road network, each lane's length must be read from memory, whereas in the ideal grid road network every lane is 1000 m long, so no memory access is needed.

Second, in the artificial grid road network, roads have the same length, and vehicles on roads are stored directly inside roads. In the Singapore expressway, however, roads have different lengths, from 50 m to 2000 m (see Fig. 11). Instead of directly storing the vehicles, the "start & end indexes" introduced in Section 3.2 are used. These indexes make the usage of memory space more flexible, which is important for loading whole city-scale traffic simulations into the GPU memory. However, accessing a vehicle on a road then requires twice the memory accesses, which makes the proposed framework less efficient.

Third, the number of GPU threads launched on the Singapore expressway is smaller than on the grid road network, which lowers the GPU occupancy.

Fourth, the artificial grid road network is more structured; for example, most nodes have two downstream links and two upstream links. The Singapore expressway network topology is more complicated, which makes the memory access less coalesced.

Besides the above reasons, a more detailed analysis of memory copy time and execution time efficiency is given in the following two subsections, where both cases have 100,000 cars.

4.2.1. Comparison and analysis of the memory copy time

In Figs. 12 and 13, the time cost does not include the demand simulation time during which the vehicles are loaded into the network. In Table 1, this demand time is the time to copy data from host to device: it takes 211 ms to copy 708.99 MB of grid data, including the vehicle pool, node pool, lane pool, and the new vehicles generated in each time step, and 161 ms to copy 479.18 MB of Singapore data. We can observe that the rate of this copy is about 3.7 GB/s, which is the average rate in most scenarios.

Moreover, we need to clarify why the artificial network has more nodes and lanes than the real-world one, yet the latter does not always have fewer bytes to copy (e.g., the vehicle pool). The main reasons were noted in Section 3.2.2: the vehicle, node, lane, and lane-connection structs of the grid network are simpler than those of the real-world network, as shown in Figs. 15–17. For example, as Fig. 15 shows, the grid vehicle struct is simpler because the grid network has a fixed path moving direction, and it is easy to obtain the vehicle index and thus the current vehicle ID at a time. Also, the maximum route length of the grid is smaller than in Singapore because the latter has more complex routes. Meanwhile, as shown in Fig. 16, the grid has no segment struct because each link has only one segment with one lane, whereas the lengths of the segments in the Singapore expressway vary from under 100 m to over 1000 m, as shown in Fig. 11; this leads to a more complex data structure for the nodes of the realistic network. Also, as shown in Fig. 17, each lane of the grid case has the same length and fixed lane-connection relations, while those of the Singapore network are more complex, as discussed in Section 3.2 and illustrated in Fig. 6(A)–(C).

4.2.2. Comparison and analysis of the execution efficiency

Table 2 shows the profile of the two kernel functions in the framework: pre_vehicle_passing and vehicle_passing. For the former, threads are launched to compute the lanes; for the latter, threads compute the nodes. Note that a table explaining the profiling measures of Table 2 is provided in the appendix. The analysis of the main GPU performance profiling measurements of the four kernel instances, i.e., G1, G2, S1, and S2, is as follows.


NodeOnGPU

Node on GPU PK node_Index node_position uppersteam_link_Index[] downstream_link_Index[] node_ID

PK node_Index node_position upsteam_segment_connect_start_index upsteam_segment_connect_end_index downstream_segment_connect_start_index downstream_segment_connect_end_index node_ID

Fig. 16. Struct of nodes in grid and real-world network.

LaneOnGPU PK lane_Index Lane on GPU PK lane_Index connected_lane_Index[] vehicles_index[] buffered_vehicles_index[] lane_ID input_capacity output_capacity speed_history[TOTAL_TIME_STEPS(1000)]

lane_connection_start_index lane_connection_end_index vehicles_start_index vehicles_end_index buffered_vehicles_start_index buffered_vehicles_end_index vehicle_size buffered_vehicle_size input_capacity output_capacity speed_history[kTotalTimeSteps(1000)]

LaneConnection PK from_lane_Index PK to_lane_Index

Fig. 17. Struct of lanes in grid and real-world network.

First, the simulation thread units in these two kernel functions are lanes and nodes, respectively. The block size (blockDim.x) used in these experiments is 192. Across the four traffic scenarios, G1 launched 20,352 GPU threads and S2 launched 3264 threads; the former attains the highest occupancy and the latter the lowest.

Second, most of the occupancies of these kernel functions are high, which indicates the GPU cores are sufficiently utilized. Only the occupancy of the Singapore kernel function 2 (S2) is low, because the number of threads launched is the smallest. The main reason is that each thread of S2 computes and updates the status of one node, i.e., moves vehicles to the buffered space of the downstream lane, and the number of nodes is smaller than the number of lanes. Meanwhile, the computation cost of S2 is lower than that of S1, which is determined by their designed functions discussed in Section 3.4.

Third, even after moving internal variables from the global memory to registers, registers are not a bottleneck, which means there are enough registers for these four kernels.

Fourth, as proposed in Section 3.2, the data of lanes, nodes, and vehicles are stored in designed arrays; thus, threads in a warp can access a contiguous memory space, known as coalesced memory access. The number of memory transactions per request (both load and store) for these two kernel functions is small (<5.5), which indicates the memory access is well coalesced. In addition, G2 and S2 have fewer store transactions, meaning they have less data to store from the kernel to global memory: the vehicles at nodes are transferred to the buffered space of the downstream lane, so less data is written back to global memory to update the lanes.

Fifth, the branch-taken ratio (within threads of the same warp) varies across the kernel functions. The threads in these kernels do not take exactly the same branches because roads have different numbers of entering and passing vehicles, and the numbers of upstream links and of vehicles at nodes differ. The branch-taken ratio for kernel S2 is much lower (27.62%). This is expected because the divergence in Algorithm 2 (see Section 3.4) is less than in Algorithm 1; also, in S2 the number of nodes is smaller, and the number of vehicles passing to downstream roads varies across road topologies.

Sixth, the instruction serialization ratios of these kernel functions are low (<4%), which means the branch-taken ratio does not cause performance loss in these kernels.

Seventh, the instructions-per-clock (IPC) metric for these kernel functions is below 0.4, far below the hardware's peak value (4.0). In our framework, this is mainly because almost all data (e.g., the road network and vehicles) are stored in the global memory, which causes high memory access latency. As Table 1 shows, we need 546.98 MB for lanes and 75.35 MB for nodes in S1 and S2, whereas shared memory per block is only 48 KB; using it would require complex data-division techniques and programming skills to split the memory into sections and handle the various blocks. As


such, only global memory is employed in our kernel functions, and the memory access latency is not completely hidden by the thousands of threads.

Eighth and ninth, the no-eligible warp issue efficiency is high for all four kernels (>90%), and the execution-dependency issue stall reason accounts for less than 20%. This again points to memory access latency as the main cause, consistent with the low IPC: almost all data are loaded from and stored into global memory. It is difficult to employ the highly efficient shared memory because its size is small, only 48 KB per block; if we wanted to use it, the 546.98 MB of lane data would need to be divided into 11,396 sections, which would introduce many taken branches in the kernel functions and inevitably bring low efficiency.

Tenth, the achieved GFLOPS of the two kernel functions are also lower than the hardware's peak, which is again related to the above-mentioned dilemma of global memory latency.
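For reference, per-kernel metrics of the kind reported in Table 2 can be collected with NVIDIA's command-line profiler. A possible invocation is sketched below; nvprof and these metric names exist in CUDA-era toolkits, but the exact metric set varies by toolkit version and GPU architecture, and the executable name here is ours:

nvprof --metrics achieved_occupancy,ipc,branch_efficiency,\
gld_transactions_per_request,gst_transactions_per_request,\
stall_exec_dependency,flop_count_dp ./traffic_simulation_on_gpu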

4.3. Discussions

This section offers two further observations on running mesoscopic traffic simulations on the CPU/GPU platform.

First, it is beneficial to run the demand simulation on the CPU and the supply simulation on the GPU, with the data communication between the two performed asynchronously. In the proposed framework, the supply simulation on the GPU is the bottleneck, and the time costs of the other two tasks are almost entirely hidden.

Second, memory access latency is a bottleneck of the proposed mesoscopic traffic simulation framework. In mesoscopic models, the update of a road depends on its own status (e.g., road density and queue status) and requires only a small number of parameters from its downstream roads. There is little shared data access among nearby roads and nodes, which limits the use of the more efficient shared memory on the GPU. We believe this is what prevents the simulation from achieving a higher speedup on the GPU.
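As a hedged illustration of the first observation, the sketch below overlaps CPU demand generation with the asynchronous upload and GPU supply step of the previous simulation step, using one CUDA stream and pinned, double-buffered host memory. The kernel and the demand generator are hypothetical placeholders, not our released code.

// Sketch (assumed names) of the asynchronous CPU-demand / GPU-supply pattern:
// pinned, double-buffered demand arrays let the CPU generate step t+1
// while the GPU uploads and simulates step t.
#include <cuda_runtime.h>

__global__ void supply_step_kernel(const int* demand, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { /* placeholder: apply demand[i] and advance the supply state */ }
}

// Placeholder demand model standing in for the calibrated generator.
static void generate_demand_on_cpu(int* h, int n, int step)
{
    for (int i = 0; i < n; ++i) h[i] = (i + step) % 7;   // synthetic demand
}

void run_simulation(int n, int num_steps)
{
    int* h_demand[2];
    int* d_demand;
    cudaMallocHost((void**)&h_demand[0], n * sizeof(int));  // pinned memory,
    cudaMallocHost((void**)&h_demand[1], n * sizeof(int));  // required for async copies
    cudaMalloc((void**)&d_demand, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    generate_demand_on_cpu(h_demand[0], n, 0);              // prime the pipeline
    for (int t = 0; t < num_steps; ++t) {
        int cur = t & 1;
        // Enqueue upload and supply kernel for step t; both calls return at once.
        cudaMemcpyAsync(d_demand, h_demand[cur], n * sizeof(int),
                        cudaMemcpyHostToDevice, stream);
        supply_step_kernel<<<(n + 191) / 192, 192, 0, stream>>>(d_demand, n);
        // The CPU produces the next step's demand while the GPU is busy.
        if (t + 1 < num_steps)
            generate_demand_on_cpu(h_demand[1 - cur], n, t + 1);
        cudaStreamSynchronize(stream);                      // per-step barrier
    }
    cudaStreamDestroy(stream);
    cudaFree(d_demand);
    cudaFreeHost(h_demand[0]);
    cudaFreeHost(h_demand[1]);
}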

5. Conclusions and future work

In this paper, we use the CPU and the GPU to enhance the performance of a traditional time-stepped mesoscopic traffic simulation enabled by the entry-time-based supply framework (ETSF) described in [16]. The main contributions are listed as follows.

1. A comprehensive framework is proposed to run the demand and supply simulations on the CPU and GPU, respectively, with asynchronous simulation step management. For the demand part, real-time demands are periodically generated using calibrated data. For the supply part, road networks and vehicles are modeled and advanced by GPU kernel functions.
2. To support critical features of real-world networks, including variable lengths of segments and links, CPU linked-list structures are converted to more complex GPU array structures (see the sketch after this list). Moreover, to decrease global data access latency, kernel functions are carefully designed to exploit registers and memory bandwidth.
3. The proposed mesoscopic traffic simulation framework is demonstrated by simulating 100,000 vehicles moving on an ideal grid network: the supply traffic simulation on a GPU (GeForce GT 950M) achieves a 10.0 times speedup over running the same supply simulation on a CPU core (Intel i5). However, for the real-world Singapore expressway network, the speedup is only 2.37, mainly due to the data structure differences between the two networks.
4. The proposed traffic simulation framework on CPU/GPU offers an innovative and high-performance solution for reducing the computational cost of various DTA models.
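The following sketch illustrates the linked-list-to-array conversion mentioned in contribution 2, using a CSR-style offset array so that links with variable numbers of segments fit into one flat GPU buffer. All type and function names here (SegmentNode, LinkSoA, upload_links) are hypothetical, not the identifiers of the released code.

// Hypothetical sketch: flattening the per-link segment linked lists used on
// the CPU into CSR-style offset-indexed arrays that suit the GPU.
#include <vector>
#include <cuda_runtime.h>

struct SegmentNode {            // CPU-side linked-list node
    float length_m;
    SegmentNode* next;
};

struct LinkSoA {                // GPU-side flat layout
    int*   seg_offset;          // segments of link l live at indices
                                // seg_offset[l] .. seg_offset[l+1]-1
    float* seg_length;          // all segments of all links, back to back
};

LinkSoA upload_links(const std::vector<SegmentNode*>& links)
{
    std::vector<int>   offsets{0};
    std::vector<float> lengths;
    for (SegmentNode* n : links) {                   // walk each list once
        for (; n != nullptr; n = n->next)
            lengths.push_back(n->length_m);
        offsets.push_back((int)lengths.size());      // close this link's range
    }
    LinkSoA d{};
    cudaMalloc((void**)&d.seg_offset, offsets.size() * sizeof(int));
    cudaMalloc((void**)&d.seg_length, lengths.size() * sizeof(float));
    cudaMemcpy(d.seg_offset, offsets.data(), offsets.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d.seg_length, lengths.data(), lengths.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    return d;                   // a kernel thread for link l then reads
}                               // seg_length[seg_offset[l] .. seg_offset[l+1]-1]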

To extend GPU-based traffic simulation to real-world networks, future work should focus on optimizing the data structures so as to alleviate memory access latency.

Acknowledgments

The authors would like to thank Kakali Basak, Stephen Robinson, Lu Yang, Francisco Pereira, and Harish Loganathan for their comments on this paper. This work is supported by NSFC grant 61473013 of China.


Appendix. Brief introduction to GPU performance metrics

1. Launched GPU threads: equal to the block size × the number of blocks; the block size is chosen according to the number of parallel threads.
2. GPU occupancy (theoretical/achieved): the ratio of active warps on an SM to the maximum number of active warps supported by the SM. Occupancy varies over time as warps begin and end, and can differ across SMs.
3. Registers (used/available): the SM has a set of registers shared by all active threads. If this factor limits the number of active blocks, the number of registers per thread allocated by the compiler can be reduced to increase occupancy.
4. Transactions per request (load/store): the average number of L1 transactions required per executed global memory instruction, reported separately for load and store operations; the lower the better. How many transactions are required varies with the access pattern of the memory operation and with the compute capability of the target device.
5. Branch-taken ratio (%): the total number of executed branch instructions with a uniform control-flow decision.
6. Instruction serialization: a flow-control instruction is considered divergent if it forces the threads of a warp to take different execution paths. When this happens, the paths must be serialized, since all threads of a warp share a program counter, which increases the total number of instructions executed for the warp; the lower the better.
7. Instructions per clock (IPC) (measured/maximum): the average number of issued and executed instructions per cycle, accounting for every iteration of instruction replays. Higher values indicate more efficient usage of the available resources.
8. Warp issue efficiency (no eligible, %): shows how frequently issue stalls occur. A lower no-eligible percentage means the code runs more efficiently on the target device.
9. Issue stall reasons (execution dependency): captures why an active warp is not eligible, e.g., an input required by the instruction is not yet available. Execution-dependency stalls can potentially be reduced by increasing instruction-level parallelism.
10. CUDA achieved GFLOPS: the number of double-precision floating-point instructions executed per second; the higher the better.
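As a concrete companion to metric 2, theoretical occupancy can be queried at runtime with the CUDA occupancy API, as in the sketch below; the kernel is a placeholder, and block size 192 matches the experiments above.

// Sketch: querying the theoretical occupancy (metric 2) of a placeholder
// kernel at block size 192, via the CUDA occupancy API.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void some_supply_kernel(float* data) { /* placeholder body */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int max_active_blocks = 0;
    const int block_size = 192;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks, some_supply_kernel, block_size, 0 /* dynamic smem */);

    // occupancy = active warps per SM / maximum warps per SM
    float active_warps = max_active_blocks * block_size / (float)prop.warpSize;
    float max_warps    = prop.maxThreadsPerMultiProcessor / (float)prop.warpSize;
    printf("theoretical occupancy: %.2f\n", active_warps / max_warps);
    return 0;
}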

References

[1] J. Barcelo (Ed.), Fundamentals of Traffic Simulation, International Series in Operations Research & Management Science, Springer, New York, 2010.
[2] W. Yang, Scalability of Dynamic Traffic Assignment, Ph.D. thesis, Massachusetts Institute of Technology, 2009.
[3] M. Ben-Akiva, S. Gao, Z. Wei, W. Yang, A dynamic traffic assignment model for highly congested urban networks, Transp. Res. Part C 24 (2012) 62–82.
[4] M. Ben-Akiva, M. Bierlaire, D. Burton, H.N. Koutsopoulos, R. Mishalani, Network state estimation and prediction for real-time traffic management, Networks Spatial Econ. 1 (3/4) (2001) 293–318.
[5] H.S. Mahmassani, T. Hu, R. Jaykrishnan, Dynamic traffic assignment and simulation for advanced network informatics (DYNASMART), in: Proceedings of the 2nd International Capri Seminar on Urban Traffic Networks, Capri, Italy, 1992.
[6] A.K. Ziliaskopoulos, S.T. Waller, Y. Li, M. Byram, Large-scale dynamic traffic assignment: implementation issues and computational analysis, J. Transp. Eng. 130 (5) (2004) 585–593.
[7] D.B.C. Gordon, I.D.D. Gordon, Paramics - parallel microscopic simulation of road traffic, J. Supercomput. 10 (1996) 25–53.
[8] N. Çetin, Large-Scale Parallel Graph-Based Simulations, Ph.D. thesis, ETH Zurich, Switzerland, 2005.
[9] H. Aydt, X. Yadong, L. Michael, K. Alois, A multi-threaded execution model for the agent-based SEMSim traffic simulation, in: Proceedings of AsiaSim, vol. 402, 2013, pp. 1–12.
[10] Nvidia, CUDA C Programming Guide, 2014. Available at http://www.nvidia.com/CUDA.
[11] J. Michalakes, M. Vachharajani, GPU acceleration of numerical weather prediction, in: IPDPS 2008: IEEE International Symposium on Parallel and Distributed Processing, vol. 18, 2008, p. 531.
[12] I. Buck, GPU computing with NVIDIA CUDA, SIGGRAPH '07, New York, NY, USA, 2007.
[13] Z. Fan, F. Qiu, A. Kaufman, S. Yoakum-Stover, GPU clusters for high performance computing, in: Proceedings of SC'04, 2004, pp. 47–58.
[14] K.S. Perumalla, G.A. Brandon, B.Y. Srikanth, K.S. Sudip, GPU-based real-time execution of vehicular mobility models in large-scale road network scenarios, in: International Workshop on Principles of Advanced and Distributed Simulation, 2009, pp. 95–103.
[15] D. Strippgen, K. Nagel, Multi-agent traffic simulation with CUDA, in: Proceedings of International Conference on High Performance Computing & Simulation, Leipzig, 2009, pp. 106–114.
[16] X. Yan, S. Xiao, W. Zhiyong, T. Gary, An entry time based supply framework (ETSF) for mesoscopic traffic simulations, Simul. Modell. Pract. Theory 47 (6) (2014) 182–195.
[17] G. Denis, T. Jose-Juan, A. Samuel, M.D.S. Roshan, Graphics processing unit based direct simulation Monte Carlo, Simulation 88 (6) (2012) 680–693.
[18] A.R. Brodtkorb, R.H. Trond, L.S. Martin, Graphics processing unit (GPU) programming strategies and trends in GPU computing, J. Parallel Distrib. Comput. 73 (2013) 4–13.
[19] K.S. Perumalla, Discrete event execution alternatives on general purpose graphical processing units (GPGPUs), in: International Workshop on Principles of Advanced and Distributed Simulation, 2006, pp. 74–81.
[20] J. Passerat-Palmbach, C. Mazel, D.R.C. Hill, Pseudo-random number generation on GP-GPU, in: International Workshop on Principles of Advanced and Distributed Simulation, 2011, pp. 1–8.
[21] L. Xiaosong, C. Wentong, J.T. Stephen, GPU accelerated three-stage execution model for event-parallel simulation, in: International Workshop on Principles of Advanced and Distributed Simulation, 2013.
[22] H. Park, P.A. Fishwick, An analysis of queuing network simulation using GPU-based hardware acceleration, ACM Trans. Model. Comput. Simul. 21 (3) (2011).
[23] Singapore-MIT Alliance for Research and Technology (SMART), DynaMIT, 1 Mar 2014. Accessed at http://137.132.22.82:15014/w/index.php/Main_Page.
[24] M. Ben-Akiva, M. Bierlaire, Discrete choice methods and their application to short-term travel decisions, in: Handbook of Transportation Science, 1999, pp. 5–34.
[25] Y. Xu, G. Tan, X. Li, X. Song, Mesoscopic traffic simulation on CPU/GPU, in: Proceedings of ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, 2014, pp. 39–49.


[26] R. Jayakrishnan, H.S. Mahmassani, T.Y. Hu, An evaluation tool for advanced traffic information and management systems in urban networks, Transp. Res. Part C 2 (3) (1994) 129–147.
[27] N.B. Taylor, The CONTRAM dynamic traffic assignment model, Networks Spatial Econ. 3 (2003) 297–322.
[28] X. Yan, G. Tan, hMETIS-based offline road network partitioning, in: Proceedings of AsiaSim, vol. 323, Shanghai, 2012, pp. 221–229.
[29] INRO, Dynameq, 1 Mar 2015. Accessed at http://www.inro.ca/en/products/dynameq/.
[30] W. Burghout, H.N. Koutsopoulos, I. Andreasson, Hybrid Microscopic-Mesoscopic Traffic Simulation, Doctoral thesis, Royal Institute of Technology, Stockholm, 2006, pp. 218–225.
[31] W. Burghout, H.N. Koutsopoulos, I. Andreasson, A discrete-event mesoscopic traffic simulation model for hybrid traffic simulation, in: Intelligent Transportation Systems Conference, 2006.
[32] N. Cetin, Large-Scale Parallel Graph-Based Simulations, Dissertation, ETH Zurich, Switzerland, 2005.
[33] R.A. Waraich, D. Charypar, M. Balmer, K.W. Axhausen, Performance improvements for large scale traffic simulation in MATSim, in: Proceedings of 9th Swiss Transport Research Conference, 13, 2014, pp. 211–233.
[34] N.A. Kallioras, K. Kepaptsoglou, N.D. Lagaros, Transit stop inspection and maintenance scheduling: a GPU accelerated metaheuristics approach, Transp. Res. Part C 55 (2) (2015) 246–260.
[35] M.D. Gangi, G.E. Cantarella, R. Di Pace, S. Memoli, Network traffic control based on a mesoscopic dynamic flow model, Transp. Res. Part C 66 (2016) 3–26.
[36] M. Dell'Orco, M. Marinelli, M. Ali Silgu, Bee colony optimization for innovative travel time estimation, based on a mesoscopic traffic assignment model, Transp. Res. Part C 66 (2016) 48–60.
[37] J. Auld, M. Hope, H. Ley, V. Sokolov, B. Xu, K. Zhang, POLARIS: agent-based modeling framework specialized for high-performance transportation simulations, Transp. Res. Part C 64 (3) (2016) 101–116.
[38] M. Rickert, K. Nagel, Dynamic traffic assignment on parallel computers in TRANSIMS, Future Gen. Comput. Syst. 17 (2001) 637–648.
[39] H.X. Liu, W. Ma, R. Jayakrishnan, W. Recker, Large-Scale Traffic Simulation through Distributed Computing of PARAMICS, California PATH Research Report UCI-ITS-TS-WP-04-14, 2004.
[40] Y. Tian, Y.-C. Chiu, Anisotropic mesoscopic traffic simulation approach to support large-scale traffic and logistic modeling and analysis, in: Winter Simulation Conference, 2011, pp. 1500–1512.
[41] Y. Tian, Y.-C. Chiu, A computationally efficient approach to retaining zone pair travel time in dynamic traffic assignment for activity-based model integration, in: Proceedings of Transportation Research Board 93rd Annual Meeting, 2014, Paper 14-5716.
[42] X. Zhou, S. Tanvir, H. Lei, J. Taylor, B. Liu, N.M. Rouphail, H.C. Frey, Integrating a simplified emission estimation model and mesoscopic dynamic traffic simulator to efficiently evaluate emission impacts of traffic management strategies, Transp. Res. Part D 37 (2015) 123–136.
[43] W. Reilly, Highway Capacity Manual, Transportation Research Board, Washington, DC, USA, 2000.
[44] K.S. Perumalla, G.A. Brandon, B.Y. Srikanth, K.S. Sudip, Interactive, graphical processing unit based evaluation of evacuation scenarios at the state scale, Simulation 88 (6) (2012) 746–761.