Accident Analysis and Prevention 95 (2016) 266–273
Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap
Influential factors of red-light running at signalized intersection and prediction using a rare events logistic regression model
Yilong Ren, Yunpeng Wang (Professor), Xinkai Wu (Professor), Guizhen Yu (Associate Professor) ∗, Chuan Ding (Assistant Professor)
School of Transportation Science and Engineering, Beihang University, 37 Xueyuan Road, Haidian District, Beijing, China
Article info
Article history: Received 17 January 2016; received in revised form 9 July 2016; accepted 13 July 2016.
Keywords: Red light running; Influential factors; Rare events; Logistic regression; High-resolution traffic data
Abstract
Red light running (RLR) has become a major safety concern at signalized intersections. To prevent RLR-related crashes, it is critical to identify the factors that significantly affect drivers' RLR behavior and to predict potential RLR in real time. In this research, nine months of RLR events extracted from high-resolution traffic data collected by loop detectors at three signalized intersections were used to identify the factors that significantly affect RLR behavior. The data analysis indicated that occupancy time, time gap, used yellow time, time left to yellow start, whether the preceding vehicle runs through the intersection during yellow, and whether a vehicle is passing through the intersection in the adjacent lane were significant factors for RLR behavior. Furthermore, because RLR is a rare event, a modified rare events logistic regression model was developed for RLR prediction. The rare events logistic regression method has been applied to rare events studies in many fields and shows impressive performance, but so far no previous research has applied it to RLR. The results showed that the rare events logistic regression model performed significantly better than the standard logistic regression model. More importantly, the proposed RLR prediction method relies purely on data collected by a single advance loop detector located 400 feet upstream of the stop-bar. This offers great potential for field applications of the proposed method, since loop detectors are already installed at many intersections and can collect data in real time. This research is expected to contribute significantly to the improvement of intersection safety. © 2016 Elsevier Ltd. All rights reserved.
1. Introduction
Red light running (RLR) is a serious safety issue at signalized intersections across the nation and around the world. This careless and reckless behavior is responsible for a significant number of intersection crashes and has resulted in substantial numbers of severe injuries and significant property damage. According to NHTSA's Fatality Analysis Reporting System, RLR causes over 750 fatalities each year in the United States (NHTSA, 2012). This number is increasing at more than three times the rate of increase for all other fatal crashes (Retting et al., 1995; Retting and Williams, 1996). In China, a related study shows that RLR caused over 4227 severe injury crashes
∗ Corresponding author. http://dx.doi.org/10.1016/j.aap.2016.07.017 0001-4575/© 2016 Elsevier Ltd. All rights reserved.
and 789 fatalities based on data collected from January 2012 to October 2012 (Wang and Yu, 2016). To prevent RLR, many methods have been developed, such as variable message warning signs, signal timing adjustment, and even autonomous vehicle technologies (Bonneson et al., 2002; Ragland and Zabyshny, 2003; Neale et al., 2007). Before these methods can be applied, however, a critical first step is to predict potential RLR behaviors. Note that RLR in most situations is not an intentional decision but an unintentional behavior caused by uncertain surrounding conditions, such as the traffic light switching to yellow or the vehicle following a platoon (Wang et al., 2012). For RLR prediction, a general approach is first to analyze a large amount of traffic data (video data or loop detector data, where loop detectors are in-pavement devices that generally use magnetic field technology to detect vehicles) to statistically identify the factors that may significantly affect RLR behavior. From a statistical point of view, a driver's current driving conditions, together with surrounding traffic conditions and signal timing situations, will directly or indirectly lead to the driver's later
Y. Ren et al. / Accident Analysis and Prevention 95 (2016) 266–273
Fig. 1. A typical detector layout (the advance detector is the loop detector typically located 400 feet upstream of the stop-line; the stop-bar detector is located right behind the stop-line).
behavior of RLR. Therefore, by statistically analyzing a large amount of traffic data, the underlying correlation between these impact factors and drivers' RLR behaviors can be derived; based on that correlation, either a Probit (Sheffi and Mahmassani, 1981; Sharma et al., 2011) or a Logit (Newton et al., 1997; Papaioannou, 2007; Haque et al., 2015) model can be applied to predict RLR. To apply the above procedure, the first critical step is to identify the significant influential factors that affect drivers' RLR behaviors. Much research has worked on identifying these factors. For example, some studies found that safety belt use, driving records, ethnicity, etc. were critical for RLR (Retting and Williams, 1996; Porter and England, 2000), but such factors cannot be used for real-time RLR prediction. This research focuses on significant factors that can be collected in real time and used for RLR prediction. Some studies analyzed high-quality video data and found that vehicle characteristics (such as vehicle speed) and traffic operations factors (such as flow and signal timing) had a significant impact on RLR (Bonneson et al., 2001; Bonneson and Son, 2003; Gates et al., 2007; Yang and Najm, 2007; Wang et al., 2009; Zhang et al., 2009; Elmitiny et al., 2010). However, reliance on video data has limited their application to RLR prediction, since the quality of the data is constrained by the quality of the cameras and, more importantly, real-time video data analysis is time-consuming and costly. Because of these limitations, recent research has begun to explore loop detector data for RLR studies, since such data can be collected easily and automatically in real time at low cost. Also, because almost all signalized intersections are already equipped with loop detectors (see Fig. 1) for signal operations, using loop detector data to help analyze and prevent RLR is very attractive (Wang et al., 2012). However, loop detector data have usually been aggregated into 30-s, 5-min, or even 15-min intervals. Such aggregated data are too coarse to describe individual drivers' behaviors, such as RLR. With recent improvements in data collection methods (e.g. Smaglik et al., 2007; Liu et al., 2009), high-resolution traffic data (event-based or second-by-second data), which provide detailed vehicle arrivals at and departures from loop detectors, have become increasingly popular. Such data, combined with the signal phase changes provided by the signal control system, can be used to analyze RLR behavior and support RLR prediction. Another critical problem for RLR prediction is that traditional statistical methods such as Probit or Logit have difficulty predicting RLR accurately. These methods work well for Yellow-Light-Running (YLR) prediction (Lu et al., 2015), but when they are applied to RLR prediction the results are not satisfactory.
The reason is that RLR events are rare. This problem has been reported as an imbalanced class problem (Chawla et al., 2004; He and Garcia, 2009). As demonstrated by the data provided by Wu et al. (2013), of the 42,277 observations (including RLR, YLR, and First-to-Stop (FSTP, defined as the first vehicle that stops before the stop-line when green ends)) collected over three months at one signalized intersection, only 289 cases (0.7%) were RLR. With such a small number of RLR cases, applying standard classifiers, such as logistic regression, will sharply underestimate the probability of RLR (King and Zeng, 2001, 2002). This research aims to address the above issues by first exploring the influential factors that have a significant impact on drivers' RLR behaviors using loop detector data. In particular, nine months of RLR events extracted from high-resolution traffic data collected by loop detectors at three signalized intersections are used to identify the factors that significantly affect RLR behaviors. Then, based on the data analysis results and considering the rare events nature of RLR, this research proposes a modified version of the rare events logistic regression model originally developed by King and Zeng (2001, 2002) to predict RLR. King and Zeng's rare events logistic regression model has been applied in many fields and shows impressive performance, but to the best of our knowledge no previous research has applied this method to predict RLR. The proposed model is further evaluated using nine months of high-resolution traffic data collected by loop detectors. The results demonstrate that the new method outperforms the standard logistic regression method, with a significant improvement in the RLR prediction rate. The rest of the paper is organized as follows. Section 2 provides a brief description of data preparation, followed by the exploration of influential factors that significantly affect RLR behaviors using high-resolution traffic data in Section 3. 
Section 4 introduces the proposed rare events logistic regression model for RLR prediction and compares the standard and rare events logistic regression RLR prediction models. Finally, we conclude the paper with a few perspectives for future research.
2. Data preparation
The proposed regression models are based on the correlation between drivers' behaviors (i.e. RLR, YLR, or FSTP) and impact factors including velocity, time gaps (the time difference between the arrival time of the following vehicle and the departure time of the leading vehicle), etc. The first step is therefore to extract RLR, YLR, and FSTP events from high-resolution data. The detailed data extraction procedure has been published in Wu et al. (2013); we briefly present the basic idea in this paper.
2.1. High-resolution traffic data
High-resolution traffic data were collected by the SMART-SIGNAL (Systematic Monitoring of Arterial Road Traffic and SIGNAL) system (Liu et al., 2009), which is capable of continuously collecting and archiving high-resolution vehicle-detector actuation and signal phase change data. The raw data contain individual vehicles' arrival and departure times. From these, the occupancy time (i.e., the time the detector is occupied by a vehicle), individual vehicle speed, and time gap can be derived (Wu et al., 2013). The SMART-SIGNAL system has been installed on a major arterial (Trunk Highway 55) with six intersections in the Twin Cities area since July 2008. All intersections are equipped with vehicle-actuated signals, with advance detectors typically located 400 feet upstream of the stop-line for green extension and stop-bar detectors located right behind the stop-line for presence detection (see Fig. 2). For research purposes, link entrance detectors on major approaches are
Fig. 2. Detector layout (TH 55 at the Boone Ave., Winnetka Ave., and Rhode Island Ave. intersections; Phase 2 is eastbound (EB) and Phase 6 is westbound (WB); the legend distinguishes stop-bar, entrance, advance, and unused detectors; distance labels of 2635 ft, 1777 ft, 842 ft, and 400 ft are marked).
also installed. In this research, a total of 9 months of high-resolution data collected from three intersections (Boone Ave., Winnetka Ave., and Rhode Island Ave., see Fig. 2) were used for analysis, amounting to well over 10 million records.
2.2. RLR identification
A critical step is to identify drivers' behaviors of RLR, YLR, or FSTP from high-resolution traffic data. To do so, the data collected from stop-bar detectors are first used to directly identify RLR, YLR, and FSTP cases. The basic idea is intuitive: if a vehicle approaches the stop-bar at a relatively high speed (higher than a threshold value), we conclude that the driver has decided to run through the intersection; otherwise the driver has decided to stop. Among the stop cases, only the first vehicle to stop before the stop-line is defined as "FSTP". Among the running cases, a vehicle is an RLR if the signal indication is red when it passes over the stop-bar detector, and a YLR if the signal is yellow. If the stop-bar detector is installed directly behind the stop-bar (as at Int. Rhode Island, see Fig. 2), the threshold value is set to 10 mph. At intersections such as Int. Boone and Int. Winnetka, where the detectors are located about 50 feet behind the stop-line, a different threshold value of 20 mph is applied. The threshold values of 10 mph and 20 mph were verified experimentally and are relatively conservative, so that the identified RLR events are correct, at the cost of missing some RLR events (Wu et al., 2013).
2.3. Event matching between stop-bar and advance detectors
Another important step is to match the FSTP/YLR/RLR events identified by a stop-bar detector with the vehicle events recorded by the corresponding advance detector located in the same lane but 400 feet upstream (Fig. 3(a)). This step is important because only the information collected by the advance detectors is later used in our prediction models. 
The prediction must be made at the advance detector so that there is time to carry out any RLR prevention strategy before vehicles reach the intersection. A simple "window-searching" method is applied to match the events recorded by advance and stop-bar detectors (see Fig. 3(b)). This method first identifies a "time window" for each event recorded by the advance detector based on the possible maximum and minimum travel times required for a vehicle to travel from the advance detector to the stop-bar detector.
Mathematically, the "window-searching" method can be formulated as follows. Suppose that a vehicle $Veh_i^{ad}$ arrives at the advance detector at time $T_i^{ad}$ and that a set of vehicles is detected by the stop-bar detector during the time window $\Delta t$. If we define this set of vehicles as $S_{stopbar} = \{Veh_1^{sd}, \ldots, Veh_p^{sd}\}$, the matching problem can be defined as the process of finding the vehicle $Veh_j^{sd}$ in $S_{stopbar}$ that is the same vehicle as $Veh_i^{ad}$. Here $p$ is the total number of vehicles in the set, and $\Delta t$ is the "time window" calculated from the possible maximum and minimum travel times. Specifically, the maximum travel time ($TT_i^{max}$) for vehicle $i$ is estimated based on the assumption that the vehicle will fully stop at the stop-bar, and the minimum travel time ($TT_i^{min}$) is calculated by assuming a maximum acceleration rate ($a_{constant}$) of 6 feet/s$^2$, as suggested by Long (2000), as shown in the following equation:
$$
\begin{cases}
TT_i^{max} = \dfrac{2d}{v_i^{ad}} \\[2ex]
TT_i^{min} = \dfrac{-v_i^{ad} + \sqrt{\left(v_i^{ad}\right)^2 + 2\,d\,a_{constant}}}{a_{constant}}
\end{cases}
\tag{1}
$$
where $v_i^{ad}$ is the individual vehicle speed when this vehicle arrives at the advance detector. Thus $\Delta t$ can be defined as the time interval $[T_i^{ad} + TT_i^{min},\ T_i^{ad} + TT_i^{max}]$.
For vehicle $Veh_i^{ad}$ and any vehicle $Veh_j^{sd}$ in set $S_{stopbar}$ during $\Delta t$, we can define a match strength function, $m_{i,j}$, for these two vehicles:
$$
m_{i,j} = \frac{1}{\left|T_j^{sd} - T_i^{ad} - t_{i,j}\right|}
\tag{2}
$$
where $T_j^{sd}$ is the arrival time of $Veh_j^{sd}$ at the stop-bar detector and $t_{i,j}$ is the expected travel time of the vehicle from the advance detector to the stop-bar detector. If we assume that vehicles keep a constant acceleration or deceleration rate, $t_{i,j}$ can be calculated as
$$
t_{i,j} = \frac{2d}{V_i^{ad} + V_j^{sd}}
\tag{3}
$$
where $V_i^{ad}$ is the vehicle speed at the advance detector (written $v_i^{ad}$ in Eq. (1)), $V_j^{sd}$ is the vehicle speed when $Veh_j^{sd}$ arrives at the stop-bar detector, and $d$ is the distance between the advance detector and the stop-bar detector, as shown in Fig. 3(b). To search for the right match for vehicle $Veh_i^{ad}$, we calculate $m_{i,j}$ between $Veh_i^{ad}$ and every vehicle in set $S_{stopbar}$. The vehicle pair with the highest value of $m_{i,j}$ is considered the right match.
Considering that lane changing could bias our matching results, we further use the data collected by downstream link entrance detectors (see Fig. 2) to ensure that those "running" vehicles (i.e. RLR or YLR) travelled directly from the advance detector to the link entrance detectors. In such situations, the possibility of lane-changing activity within such a short distance (400 ft) is small. However, due to the limitations of loop detector data, our method cannot detect all RLR cases. For example, although it is rare, a vehicle that passes an advance detector may run the red light by overtaking the car in front and moving to the other lane. Our purpose is not to detect all RLR cases, but to ensure that all identified RLR cases are correct matches. Also, because we use a large amount of event data, missing some RLR cases will not have a significant impact on our results.
2.4. Data summary
Using 9 months of event data (3 for each intersection), the program identified a total of 109,774 cases, of which 18,426 are FSTP, 90,486 are YLR, and 862 are RLR (see Table 1). These data will be used for data analysis.
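The window-searching match of Section 2.3 (Eqs. (1)–(3)) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function names, the tuple layout of the stop-bar events, and the use of ft/s for speed are assumptions made here, and the advance-detector speed is assumed to be strictly positive.

```python
import math

A_MAX = 6.0   # maximum acceleration in ft/s^2 (value the text takes from Long, 2000)
D = 400.0     # advance-detector to stop-bar distance in ft

def travel_time_window(v_ad, d=D, a=A_MAX):
    """Return (TT_min, TT_max) from Eq. (1) for an advance-detector speed v_ad in ft/s."""
    tt_max = 2.0 * d / v_ad                                      # vehicle comes to a full stop
    tt_min = (-v_ad + math.sqrt(v_ad ** 2 + 2.0 * d * a)) / a    # vehicle accelerates at a
    return tt_min, tt_max

def match_event(t_ad, v_ad, stopbar_events, d=D):
    """Return the index of the stop-bar event with the highest match strength.

    stopbar_events is a hypothetical list of (t_sd, v_sd) tuples: arrival time (s)
    and speed (ft/s) at the stop-bar detector. Returns None if no event falls
    inside the time window.
    """
    tt_min, tt_max = travel_time_window(v_ad, d)
    best_idx, best_strength = None, -1.0
    for j, (t_sd, v_sd) in enumerate(stopbar_events):
        if not (t_ad + tt_min <= t_sd <= t_ad + tt_max):
            continue                                  # outside the time window
        expected = 2.0 * d / (v_ad + v_sd)            # Eq. (3)
        err = abs(t_sd - t_ad - expected)
        strength = math.inf if err == 0 else 1.0 / err   # Eq. (2)
        if strength > best_strength:
            best_idx, best_strength = j, strength
    return best_idx

# Hypothetical example: one advance event and three candidate stop-bar events.
candidates = [(103.0, 60.0), (106.5, 62.0), (112.0, 20.0)]
best = match_event(t_ad=100.0, v_ad=60.0, stopbar_events=candidates)
```

In this hypothetical example the first candidate arrives before the earliest feasible time and is discarded, and the second candidate is chosen because its arrival time is closest to the expected travel time.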
Fig. 3. (a) STOP or RUN event matches; (b) "Window-searching" method. (GLR: Green-Light Running; YLR: Yellow-Light Running; RLR: Red-Light Running; FSTP: First-to-Stop.)
Table 1 Vehicles' behavior for each intersection.
Intersection    Total events   RLR events   YLR events   FSTP events
Boone           42,277         289          36,155       5833
Winnetka        34,127         307          28,788       5032
Rhode Island    33,370         266          25,543       7561
Total           109,774        862          90,486       18,426
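The identification rule of Section 2.2 that produces the counts in Table 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the argument names, and the string labels for the signal state are hypothetical, and only the 10 mph and 20 mph thresholds come from the text.

```python
def classify_event(speed_mph, signal_state, first_to_stop, threshold_mph=10.0):
    """Label one stop-bar event as 'RLR', 'YLR', 'FSTP', or None.

    signal_state is 'green', 'yellow', or 'red' at the moment the vehicle passes
    the stop-bar detector; first_to_stop flags the first vehicle that stops
    before the stop-line. threshold_mph is 10 when the detector sits directly
    behind the stop-bar and 20 when it is about 50 feet behind the stop-line.
    """
    if speed_mph > threshold_mph:        # the driver decided to run
        if signal_state == "red":
            return "RLR"
        if signal_state == "yellow":
            return "YLR"
        return None                      # running on green is not one of the three classes
    return "FSTP" if first_to_stop else None   # only the first stopped vehicle is kept

# Hypothetical events at an intersection that uses the 20 mph threshold.
labels = [
    classify_event(32.0, "red", False, threshold_mph=20.0),
    classify_event(28.0, "yellow", False, threshold_mph=20.0),
    classify_event(4.0, "red", True, threshold_mph=20.0),
]
```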
3. Influential factors for RLR
The large amount of matched event data presented in Table 1 records the detailed driving conditions of each individual vehicle as well as its behavior (i.e. RLR, YLR, or FSTP). From a statistical point of view, a driver's current driving conditions (i.e. speed and the time gap between the target and leading vehicles), together with surrounding traffic conditions (i.e. the driving behaviors of surrounding vehicles) and signal timing situations (i.e. signal status of green, yellow, or red and their durations), will directly or indirectly lead to the driver's later behavior of RLR or Non-RLR (note that YLR and FSTP are combined into the Non-RLR class). Therefore, by statistically analyzing a large amount of event data, the underlying correlation between these impact factors and drivers' RLR and Non-RLR behavior can be derived, and this correlation can then be used to predict RLR through statistical methods such as logistic regression.
3.1. Binary logistic regression model
Since a driver's RLR behavior is a binary variable, a binary logistic regression model is applied to describe the correlation between the impact factors and the driver's behavior (RLR or Non-RLR). If we define RLR and Non-RLR as "1" and "0", respectively, a standard logistic regression model for RLR can be described as:
$$
\mathrm{logit}[\pi_i] = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha + \beta_c x_i
\tag{4}
$$
where $\pi_i$ is the probability that target vehicle $i$ is an RLR and $1 - \pi_i$ is the probability that target vehicle $i$ is a Non-RLR; $x_i$ represents a vector of all control factors that affect the behavior of vehicle $i$ (the control factors are discussed in the next section); $\alpha$ is an intercept parameter; and $\beta_c$ is a vector of the coefficients of the corresponding control factors. $\alpha$ and $\beta_c$ are estimated by maximum likelihood estimation (MLE). The likelihood function is constructed as in Eq. (5). By maximizing the log likelihood expression shown in Eq. (6), the estimates of the intercept parameter and coefficients vector can be obtained. Note that the "logit" function in Stata (v. 10) was used to obtain all the results.
$$
l(\beta_c) = \prod_{i=1}^{n} \left\{ \pi(x_i)^{y_i} \left[1 - \pi(x_i)\right]^{1 - y_i} \right\}
\tag{5}
$$
$$
LL(\beta_c) = \ln\left(l(\beta_c)\right) = \sum_{i=1}^{n} \left\{ y_i \ln\left[\pi(x_i)\right] + (1 - y_i) \ln\left[1 - \pi(x_i)\right] \right\}
\tag{6}
$$
where $y_i$ indicates whether vehicle $i$ runs the red light and takes a value of either 0 or 1; $i = 1, 2, \ldots, n$, and $n$ is the total number of observed vehicles.
3.2. Influential factors and significance
To apply Eq. (4), the first critical step is to determine the influential or control factors (i.e. $x_i$ in Eq. (4)) that directly or indirectly affect drivers' RLR or Non-RLR behaviors. Much previous research has studied this problem (e.g. Bonneson et al., 2001; Gates et al., 2007; Yang and Najm, 2007; Elmitiny et al., 2010). In this research, we consider all the potential factors extracted from the high-resolution traffic and signal event data that could affect a driver's RLR behavior, as shown in Table 2. This list includes the occupancy time and the time gap between consecutive vehicles, which essentially indicate the vehicle's velocity and the (time) distance between the target and leading vehicles at the moment the target vehicle passes the detector. Signal timing is also included in the list. Two types of signal timing information that significantly affect drivers' behaviors are used here: (a) the used yellow time, that is, the portion of yellow time that has elapsed before the vehicle arrives at the advance detector; and (b) the time to yellow start, that is, the time left until the signal changes to yellow. In addition, to analyze whether the status of surrounding vehicles has an effect on the target vehicle, data for the three preceding vehicles are collected. These data include occupancy times, time gaps, and behaviors (i.e., RLR or YLR), as well as whether there is a vehicle driving in the adjacent lane. In summary, this list covers most of the potential factors, including the driving conditions of the target vehicle itself, the driving conditions of surrounding vehicles, and signal timing information. 
The significance of these factors will be determined during the regression process. After identifying all potential influential factors shown in Table 2, we use the data extracted in Section 2 to statistically test their significance. Note that three months of data were collected for each intersection, but only two months of data were used to derive the intercept parameters and coefficient vectors; the remaining month of data was used later to evaluate RLR prediction. To test the significance of the influential factors, all factors described in Table 2 were used to fit a binary logistic regression model. The
Table 2 Descriptive statistics of the potential variables.
Variable                              Description
Occ A                                 Occupancy time for the target vehicle when passing advance detector (s)
Gap A                                 Time gap between the target vehicle and the nearest preceding vehicle (s)
Yt Used A                             The yellow time that elapsed before vehicle arrives at advance detector (s)
G2Y Start A                           The green time left until signal changes to yellow (s)
Occ A1, Occ A2, Occ A3                Occupancy time for three preceding vehicles (A1 is the nearest one to the target vehicle) (s)
Gap A1, Gap A2, Gap A3                Time gap for three preceding vehicles (A1 is the nearest one to the target vehicle) (s)
YLR A1 Dm, YLR A2 Dm, YLR A3 Dm       Vehicles' YLR behaviors for three preceding vehicles (A1 is the nearest one to the target vehicle); a binary variable with "1" represents yes (i.e. YLR) and "0" no.
RLR A1 Dm, RLR A2 Dm, RLR A3 Dm       Vehicles' RLR behaviors for three preceding vehicles (A1 is the nearest one to the target vehicle); a binary variable with "1" represents yes (i.e. RLR) and "0" no.
AA Dm                                 Presence of running vehicles on the adjacent lane, a binary variable with "1" represents yes and "0" represents no.
Table 3 Standard binary logistic regression model.
               Int. Boone Ave.         Int. Winnetka           Int. Rhode Island
Variable       Parameter   P-value     Parameter   P-value     Parameter   P-value
Constant       −2.663      P < 0.01    −2.140      P < 0.01    −2.263      P < 0.01
Occ A          −1.275      P < 0.01    −0.707      P < 0.01    −1.867      P < 0.01
Gap A          −0.004      P < 0.05    −0.002      P < 0.05    −0.010      P < 0.01
Yt Used A      0.282       P < 0.01    0.296       P < 0.01    0.526       P < 0.01
G2Y Start A    −3.052      P < 0.01    −1.616      P < 0.01    −3.437      P < 0.01
Occ A1         /           /           1.146       P < 0.01    1.947       P < 0.01
RLR A1 Dm      /           /           −3.371      P < 0.01    /           /
Gap A2         0.008       P < 0.01    /           /           /           /
YLR A2 Dm      0.670       P < 0.01    0.965       P < 0.01    0.266       P < 0.05
Gap A3         /           /           /           /           0.010       P < 0.01
YLR A3 Dm      0.628       P < 0.01    /           /           0.591       P < 0.05
AA Dm          0.481       P < 0.05    0.213       P < 0.05    0.354       P < 0.01
Note: *: Significance of model = 0.000; "/" indicates that variable is not significant.
results are shown in Table 3. The "logit" function in Stata (v. 10) was used to obtain all the results. Only variables that are statistically significant at the 0.05 level are included in the table. The results show that Occ A, Yt Used A, and G2Y Start A are common significant factors for all three intersections. This is consistent with our understanding of RLR behaviors. Occ A indicates the speed of the target vehicle. This factor, together with the used yellow time (Yt Used A) and the time to yellow start (G2Y Start A), certainly has a significant impact on whether a driver decides to pass through the intersection within the remaining yellow time. The results also show that Gap A (i.e. time gap) is a common significant factor for all three intersections. This essentially reflects car-following behavior: the shorter the time gap, the more likely the following vehicle is to follow the leading vehicle and run through the intersection even when the traffic light is red. Interestingly, the results also show that the YLR behavior of the second preceding vehicle (i.e. YLR A2 Dm) is significant for all three intersections. This probably indicates platoon behavior: when the first vehicle in a platoon runs through the intersection during yellow, the following several vehicles are very likely to end up as either YLR or RLR, since a long yellow time (i.e. 5 s) is provided at these three intersections. In addition, whether a vehicle is passing through in the adjacent lane (AA Dm) has a significant effect on RLR at all three intersections. This demonstrates that drivers' "following" behaviors are affected not only by preceding vehicles but also by adjacent vehicles. Note that all fittings are statistically significant.
4. RLR prediction using rare events logistic regression
Based on Eq. (4), the probability that the target vehicle is an RLR can be described by the logistic distribution shown in the following equation:
$$
p(y_i = 1 \mid x_i) = \pi_i = \frac{\exp\left(\alpha + \beta_c x_i\right)}{1 + \exp\left(\alpha + \beta_c x_i\right)}
\tag{7}
$$
where $y_i$ denotes the behavior of vehicle $i$; $y_i = 1$ means that vehicle $i$ is an RLR. The above logistic regression model works well for YLR prediction (Lu et al., 2015). However, when it is applied to RLR prediction, the prediction accuracy is very low because RLR events are rare. As shown in Table 1, the proportions of RLR events are low at all three intersections: 0.7% at Int. Boone, 0.9% at Int. Winnetka, and 0.8% at Int. Rhode Island. With such small proportions of RLR, applying the standard binary logistic regression method sharply underestimates the probability of RLR. To address this issue, a modified version of the rare events binary logistic regression originally developed by King and Zeng (2001, 2002) is applied in this research to improve RLR predictions.
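The logistic probability of Eq. (7) and the log likelihood of Eq. (6) can be written out directly. The following is a minimal standard-library sketch, not the Stata "logit" routine used in the paper; the function names and the list-based data layout are assumptions made for illustration.

```python
import math

def rlr_probability(alpha, beta, x):
    """Eq. (7): probability that a vehicle with control factors x is an RLR."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)                 # equivalent form that avoids overflow for very negative z
    return ez / (1.0 + ez)

def log_likelihood(alpha, beta, X, y):
    """Eq. (6): log likelihood of the 0/1 labels y given the factor rows X."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = rlr_probability(alpha, beta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

# With all parameters at zero every vehicle has probability 0.5,
# so the log likelihood of n observations is n * ln(0.5).
ll = log_likelihood(0.0, [0.0], [[1.0], [2.0], [3.0]], [1, 0, 0])
```

An MLE routine would search for the `alpha` and `beta` that maximize `log_likelihood`; the search itself is omitted here.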
4.1. Rare events logistic regression method & fitting results
The proposed rare events logistic regression method essentially consists of the following three-step correction procedure (King and Zeng, 2001, 2002):
Table 4 Rare events logistic regression model.
               Int. Boone Ave.         Int. Winnetka           Int. Rhode Island
Variable       Parameter   P-value     Parameter   P-value     Parameter   P-value
Constant       −1.998      P < 0.01    −1.885      P < 0.01    −1.537      P < 0.01
Occ A          −1.371      P < 0.01    −1.685      P < 0.01    −1.830      P < 0.01
Gap A          −0.001      P < 0.01    −0.010      P < 0.01    −0.017      P < 0.01
Yt Used A      0.342       P < 0.01    0.176       P < 0.01    0.721       P < 0.01
G2Y Start A    −3.331      P < 0.01    −1.745      P < 0.01    −3.074      P < 0.01
Occ A1         /           /           1.068       P < 0.01    1.035       P < 0.01
RLR A1 Dm      /           /           −2.337      P < 0.01    /           /
Gap A2         0.039       P < 0.01    /           /           /           /
YLR A2 Dm      0.640       P < 0.01    1.106       P < 0.01    0.169       P < 0.05
Gap A3         /           /           /           /           0.011       P < 0.01
YLR A3 Dm      1.202       P < 0.01    /           /           1.263       P < 0.05
AA Dm          0.803       P < 0.05    0.436       P < 0.05    0.162       P < 0.01
Note: *: Significance of model = 0.000; "/" indicates that variable is not significant.
Table 5 RLR predictions.
                            Int. Boone            Int. Winnetka         Int. Rhode
Regression model*           Standard   Rare       Standard   Rare       Standard   Rare
Total events                19960      19960      14730      14730      11400      11400
Actual RLR number           119        119        128        128        74         74
Predicted RLR number        247        405        211        357        169        293
Correct RLR number          58         90         63         99         32         52
False alarm rate            0.9%       1.6%       1.0%       1.7%       1.2%       2.3%
Accurate prediction rate    48.7%      75.6%      49.2%      77.3%      43.2%      70.3%
Note: *: Standard and Rare represent standard binary logistic regression model and rare events binary logistic regression model, respectively.
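The two rates in Table 5 follow from the counts in the same table, using the definitions given in the text: the accurate prediction rate is the number of correctly predicted RLR events over the actual RLR number, and the false alarm rate is the number of falsely predicted RLR events over the total event number. A small sketch (the function name is hypothetical) shows the arithmetic for the Int. Boone columns:

```python
def prediction_rates(total_events, actual_rlr, predicted_rlr, correct_rlr):
    """Return (false_alarm_rate, accurate_prediction_rate) as fractions."""
    false_alarms = predicted_rlr - correct_rlr     # predicted as RLR but not actually RLR
    return false_alarms / total_events, correct_rlr / actual_rlr

# Int. Boone counts taken from Table 5.
far_std, apr_std = prediction_rates(19960, 119, 247, 58)    # standard model
far_rare, apr_rare = prediction_rates(19960, 119, 405, 90)  # rare events model

print(f"standard: false alarm {far_std:.1%}, accurate prediction {apr_std:.1%}")
print(f"rare:     false alarm {far_rare:.1%}, accurate prediction {apr_rare:.1%}")
```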
4.1.1. Step 1
The first step is to apply a choice-based data collection strategy that randomly selects Non-RLR cases. This step forms a new dataset that includes all RLR cases but only a randomly selected portion of the Non-RLR cases. In this new dataset, the recommended ratio of RLR events to Non-RLR events is around 1:10, as suggested by other rare events studies (e.g. Beguería, 2006; Guns and Vanacker, 2012). With the new dataset, the MLE technique can be applied to estimate a new intercept parameter $\hat{\alpha}$ and coefficients vector $\hat{\beta}_c$.
4.1.2. Step 2
However, the use of a choice-based sampling strategy could significantly bias the estimate of the intercept parameter $\hat{\alpha}$. Therefore, the second step is to apply a prior correction to avoid sampling bias. In this step, the corrected intercept parameter $\tilde{\alpha}$ is calculated from the uncorrected intercept parameter $\hat{\alpha}$, as in Eq. (8):
$$
\tilde{\alpha} = \hat{\alpha} - \ln\left[\left(\frac{1 - \tau}{\tau}\right)\left(\frac{\varepsilon}{1 - \varepsilon}\right)\right]
\tag{8}
$$
where $\tau$ is the actual fraction of "1"s (i.e. RLRs) in the whole population, and $\varepsilon$ is the observed fraction of "1"s in the new dataset.
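Steps 1 and 2 can be sketched as follows. This is an illustrative sketch only: the function names, the fixed random seed, and the list-based data layout are assumptions, and the prior correction is written in the form given by King and Zeng (2001), with the population fraction and the sample fraction passed as `pop_fraction` and `sample_fraction`.

```python
import math
import random

def undersample(events, labels, ratio=10, seed=0):
    """Step 1: keep every RLR (label 1) and about `ratio` random Non-RLR cases per RLR."""
    rng = random.Random(seed)
    ones = [e for e, lab in zip(events, labels) if lab == 1]
    zeros = [e for e, lab in zip(events, labels) if lab == 0]
    kept = rng.sample(zeros, min(len(zeros), ratio * len(ones)))
    return ones + kept, [1] * len(ones) + [0] * len(kept)

def prior_correct_intercept(alpha_hat, pop_fraction, sample_fraction):
    """Step 2: remove the intercept bias introduced by choice-based sampling."""
    odds_ratio = ((1.0 - pop_fraction) / pop_fraction) * (
        sample_fraction / (1.0 - sample_fraction)
    )
    return alpha_hat - math.log(odds_ratio)

# Hypothetical illustration: the population share is the Int. Boone share from
# Table 1 (289 of 42,277) and the sample share corresponds to a 1:10 ratio.
shift = prior_correct_intercept(0.0, 289 / 42277, 1 / 11)
```

Because RLR events are over-represented in the new dataset, the correction lowers the intercept; when the sample share equals the population share the correction vanishes.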
4.1.3. Step 3
With the corrected intercept $\tilde{\alpha}$ and the updated coefficients vector $\hat{\beta}_c$, Eq. (7) can be used to calculate the updated probabilities $\hat{P}_i$. However, $\hat{P}_i$ is still an underestimated probability, because the estimation uncertainty in the coefficients vector $\hat{\beta}_c$ is neglected. The third step is therefore to correct $\hat{P}_i$ by adding a correction factor $C_i$. The final corrected probability $P_i$ is obtained as:
$$
\begin{aligned}
P_i &= \hat{P}_i + C_i \\
C_i &= \left(0.5 - \hat{P}_i\right) \hat{P}_i \left(1 - \hat{P}_i\right) X_i\, V(\beta)\, X_i'
\end{aligned}
\tag{9}
$$
where $\beta = [\tilde{\alpha}, \hat{\beta}_c]$, $X_i = [1, x_i]$, $\beta$ and $X_i$ have the same dimensions, $X_i'$ is the transpose of $X_i$, and $V(\beta)$ is the estimated variance-covariance matrix of the estimated coefficients.
The rare events logistic regression model is applied to fit the data collected in Section 2. Note that the same two months of data were used to derive the intercept parameter and coefficients vector, and the remaining month of data was used for evaluation. To implement the rare events logistic regression model, we use the "relogit" function in Stata (v. 10). The regression fitting results are shown in Table 4. The significant impact factors of the rare events model are consistent with those of the standard logistic regression model. In addition, no factor changed sign after the rare events logistic regression model was applied. This further indicates the consistency between the standard and rare events logistic regression models. However, the values of all significant factors have changed.
4.2. Prediction comparison
For comparison, the regression models presented in Tables 3 and 4 are both used to predict RLR based on the event data collected in the third month. The variables that are statistically significant at the 0.05 level are included in the prediction models. As mentioned before, three months of data were collected for each intersection, but only two months of data were used to derive the intercept parameters and coefficient vectors; the remaining month of data was used to evaluate prediction accuracy. Table 5 presents all the prediction results. Note that "standard" and "rare" represent the standard binary logistic regression model and the rare events binary logistic regression model, respectively. Table 5 shows that the rare events regression models perform significantly better than the standard models. The accurate prediction rate (defined as the ratio between the correct RLR prediction number and the actual RLR number) jumps from
[Fig. 4 panels: (a) ROC curves (Int. Boone); (b) ROC curves (Int. Winn); (c) ROC curves (Int. Rhode). Axes: false positive rate (horizontal), true positive rate (vertical). Legend AUC values, in order of appearance: logit model 0.701, rare events model 0.858; logit model 0.706, rare events model 0.852; logit model 0.670, rare events model 0.779.]
Fig. 4. ROC plots and AUC values for standard and rare events models.
around 45% to near 80% after applying the rare events binary logistic regression methods. Although the false alarm rate (defined as the number of falsely predicted RLR events over the total number of events) has increased, an overall 2.0% false alarm rate is acceptable for real applications. To further compare the performance of these models, the Receiver Operating Characteristic (ROC) curves are plotted and the areas under the curve (AUC) are calculated. ROC curves and AUC are frequently used to compare the performance of different methods (Ling et al., 2003). The ROC curve shows directly which model dominates, and a larger AUC indicates better performance. Fig. 4(a)–(c) show the ROC curves and AUC values of the standard and rare events binary logistic regression models for the three intersections, respectively. The figures clearly demonstrate that the rare events logistic regression model is significantly superior to the standard binary logistic regression model.

5. Concluding remarks

To predict and prevent potential RLR, it is important to gain a better understanding of the relationship between RLR and the factors that contribute to drivers’ RLR behaviors. This requires collecting a large amount of data. This research uses a large amount of high-resolution traffic and signal data collected from loop detectors to extract nine months of RLR events from three signalized intersections, and then identifies the influential factors that significantly affect RLR behaviors. The data analysis indicated that occupancy time, time gap, used yellow time, time left to yellow start, whether the preceding vehicle runs through the intersection during yellow, and whether there is a vehicle passing through the intersection on the adjacent lane are significant factors for RLR behaviors. Furthermore, this research addresses the rare events issue of RLR prediction by developing a rare events binary logistic regression model.
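As an illustration of the evaluation measures defined in Section 4.2, the following minimal Python sketch computes the accurate prediction rate, the false alarm rate, and the AUC from a list of observed outcomes and predicted probabilities. It is not the code used in this study: the function names, the toy data, and the 0.5 decision threshold are assumptions made for illustration only, and the AUC is obtained from its rank interpretation (the probability that a randomly chosen RLR event receives a higher score than a randomly chosen non-RLR event).

```python
from typing import Sequence


def prediction_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Correctly predicted RLR events divided by the actual number of RLR events."""
    actual = sum(1 for t in y_true if t == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return correct / actual if actual else float("nan")


def false_alarm_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Falsely predicted RLR events divided by the total number of events."""
    false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return false_alarms / len(y_true) if len(y_true) else float("nan")


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (RLR, non-RLR) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 1 = RLR event, 0 = non-RLR event.
    y_true = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
    scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.2, 0.7, 0.1]
    threshold = 0.5  # assumed decision threshold, not taken from the paper
    y_pred = [1 if s >= threshold else 0 for s in scores]
    print(prediction_rate(y_true, y_pred))
    print(false_alarm_rate(y_true, y_pred))
    print(auc(y_true, scores))
```

Note that the false alarm rate here follows the definition in the text (false alarms over all events), which differs from the false positive rate on the ROC axis (false alarms over non-RLR events only).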
To the best of our knowledge, this is the first application of rare events logistic regression to the study of RLR. The results show that the rare events logistic regression model performs significantly better than the standard logistic regression model. The accurate prediction rate jumps from about 45% for standard logistic regression models to near 80% for rare events regression methods. Although the false alarm rate has increased, an overall 2.0% false alarm rate is still acceptable for real applications. More importantly, the proposed RLR prediction methods are based purely on loop detector data collected from single advance detectors located 400 feet away from the stop-bar. This demonstrates that the proposed models have great potential for future field applications, since loops have been widely implemented at most intersections and can automatically collect data in real time. Therefore, this research is expected to contribute significantly to the improvement of intersection safety.

However, although the rare events logistic regression model is superior to the standard logistic regression model, the RLR prediction rate is still not very high, and the false alarm rate is relatively high. One potential reason is that some important factors which could not be collected in real time (e.g., the gender of the driver, age, vehicle type) were not included in our analysis. In addition, other classification models such as decision trees, linear discriminant analysis (LDA), and support vector machines (SVM) could be applied and may generate better prediction results. These are left for our future research.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (Grants 61371076 and 51278021). We would like to thank the editors and anonymous reviewers for their valuable support and comments.

References

Beguería, S., 2006. Changes in land cover and shallow landslide activity: a case study in the Spanish Pyrenees. Geomorphology 74 (1), 196–206, http://dx.doi.org/10.1016/j.geomorph.2005.07.018.
Bonneson, J., Son, H., 2003. Prediction of expected red-light-running frequency at urban intersections. Transp. Res. Rec.: J. Transp. Res. Board 1830, 38–47, http://dx.doi.org/10.3141/1830-06.
Bonneson, J., Brewer, M., Zimmerman, K., 2001. Review and evaluation of factors that affect the frequency of red-light-running. No. FHWA/TX-02/4027-1. http://trid.trb.org/view.aspx?id=713767.
Bonneson, J., Middleton, D., Zimmerman, K., Charara, H., Abbas, M., 2002. Intelligent detection-control system for rural signalized intersections. No. FHWA/TX-03/4022-2. http://trid.trb.org/view.aspx?id=731149.
Chawla, N.V., Japkowicz, N., Kotcz, A., 2004. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explor. Newslett. 6 (1), 1–6, http://dx.doi.org/10.1145/1007730.1007733.
Elmitiny, N., Yan, X., Radwan, E., Russo, C., Nashar, D., 2010. Classification analysis of driver’s stop/go decision and red-light running violation. Accid. Anal. Prevent. 42 (1), 101–111, http://dx.doi.org/10.1016/j.aap.2009.07.007.
Gates, T., Noyce, D., Laracuente, L., Nordheim, E., 2007. Analysis of driver behavior in dilemma zones at signalized intersections. Transp. Res. Rec.: J. Transp. Res. Board 2030, 29–39, http://dx.doi.org/10.3141/2030-05.
Guns, M., Vanacker, V., 2012. Logistic regression applied to natural hazards: rare event logistic regression with replications. Nat. Hazards Earth Syst. Sci. 12, 1937–1947, http://dx.doi.org/10.5194/nhess-12-1937-2012.
Haque, M.M., Ohlhauser, A.D., Washington, S., Boyle, L.N., 2015. Decisions and actions of distracted drivers at the onset of yellow lights. Accid. Anal. Prevent., http://dx.doi.org/10.1016/j.aap.2015.03.042.
He, H., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Transact. Knowl. Data Eng. 21 (9), 1263–1284, http://dx.doi.org/10.1109/TKDE.2008.239.
King, G., Zeng, L., 2001. Logistic regression in rare events data. Polit. Anal. 9 (2), 137–163, http://dx.doi.org/10.18637/jss.v008.i02.
King, G., Zeng, L., 2002. Estimating risk and rate levels, ratios and differences in case-control studies. Stat. Med. 21 (10), 1409–1427, http://dx.doi.org/10.1002/sim.1032.
Ling, C.X., Huang, J., Zhang, H., 2003. AUC: a better measure than accuracy in comparing learning algorithms. Adv. Artif. Intell., 329–341, http://dx.doi.org/10.1007/3-540-44886-1_25 (Springer).
Liu, H.X., Wu, X., Ma, W., Hu, H., 2009. Real-time queue length estimation for congested signalized intersections. Transp. Res. Part C: Emerg. Technol. 17 (4), 412–427, http://dx.doi.org/10.1016/j.trc.2009.02.003.
Long, G., 2000. Acceleration characteristics of starting vehicles. Transp. Res. Rec.: J. Transp. Res. Board 1737, 58–70, http://dx.doi.org/10.3141/1737-08.
Lu, G., Wang, Y., Wu, X., Liu, H., 2015. Analysis of yellow-light running at signalized intersections using high-resolution traffic data. Transp. Res. Part A: Policy Pract. 73, 39–52, http://dx.doi.org/10.1016/j.tra.2015.01.001.
NHTSA, 2012. NHTSA’s Fatality Analysis Reporting System (FARS) Reports. National Highway Traffic Safety Administration, Washington DC, USA.
Neale, V.L., Perez, M.A., Lee, S.E., Doerzaph, Z.R., 2007. Investigation of driver-infrastructure and driver-vehicle interfaces for an intersection violation warning system. J. Intell. Transp. Syst. 11 (3), 133–142, http://dx.doi.org/10.1080/15472450701410437.
Newton, C., Mussa, R.N., Sadalla, E.K., Burns, E.K., Matthias, J., 1997. Evaluation of an alternative traffic light change anticipation system. Accid. Anal. Prevent. 29 (2), 201–209, http://dx.doi.org/10.1016/S0001-4575(96)00073-5.
Papaioannou, P., 2007. Driver behavior, dilemma zone and safety effects at urban signalized intersections in Greece. Accid. Anal. Prevent. 39 (1), 147–158, http://dx.doi.org/10.1016/j.aap.2006.06.014.
Porter, B.E., England, K.J., 2000. Predicting red-light running behavior: a traffic safety study in three urban settings. J. Saf. Res. 31 (1), 1–8, http://dx.doi.org/10.1016/S0022-4375(99)00024-9.
Ragland, D.R., Zabyshny, A.A., 2003. Intersection decision support project: Taxonomy of crossing-path crashes at intersections using GES 2000 data. Safe Transportation Research & Education Center. https://escholarship.org/uc/item/0201j0v2.
Retting, R.A., Williams, A.F., 1996. Characteristics of red light violators: results of a field investigation. J. Saf. Res. 27 (1), 9–15, http://dx.doi.org/10.1016/0022-4375(95)00026-7.
Retting, R.A., Williams, A.F., Preusser, D.F., Weinstein, H.B., 1995. Classifying urban crashes for countermeasure development. Accid. Anal. Prevent. 27 (3), 283–294, http://dx.doi.org/10.1016/0001-4575(94)00068-W.
Sharma, A., Bullock, D., Peeta, S., 2011. Estimating dilemma zone hazard function at high speed isolated intersection. Transp. Res. Part C: Emerg. Technol. 19 (3), 400–412, http://dx.doi.org/10.1016/j.trc.2010.05.002.
Sheffi, Y., Mahmassani, H., 1981. A model of driver behavior at high speed signalized intersections. Transp. Sci. 15 (1), 50–61, http://dx.doi.org/10.1287/trsc.15.1.50.
Smaglik, E., Sharma, A., Bullock, D., Sturdevant, J., Duncan, G., 2007. Event-based data collection for generating actuated controller performance measures. Transp. Res. Rec.: J. Transp. Res. Board 2035, 97–106, http://dx.doi.org/10.3141/2035-11.
Wang, X., Yu, R., 2016. A field investigation of red-light-running in Shanghai, China. Transp. Res. Part F: Traffic Psychol. Behav. 37, 144–153, http://dx.doi.org/10.1016/j.trf.2015.12.010.
Wang, L., Zhang, L., Zhang, W.B., Zhou, K., 2009. Red light running prediction for dynamic all-red extension at signalized intersection. Intelligent Transportation Systems. ITSC’09. 12th International IEEE Conference on (pp. 1–5), IEEE, http://dx.doi.org/10.1109/ITSC.2009.5309545.
Wang, L., Zhang, L., Zhou, K., Zhang, W., Wang, X., 2012. Prediction of red-light running on basis of inductive-loop detectors for dynamic all-red extension. Transp. Res. Rec.: J. Transp. Res. Board 2311, 44–50, http://dx.doi.org/10.3141/2311-04.
Wu, X., Vall, N., Liu, H., Cheng, W., Jia, X., 2013. Analysis of drivers’ stop-or-run behavior at signalized intersections with high-resolution traffic and signal event data. Transp. Res. Rec.: J. Transp. Res. Board 2365, 99–108, http://dx.doi.org/10.3141/2365-13.
Yang, C.D., Najm, W.G., 2007. Examining driver behavior using data gathered from red light photo enforcement cameras. J. Saf. Res. 38 (3), 311–321, http://dx.doi.org/10.1016/j.jsr.2007.01.008.
Zhang, L., Zhou, K., Zhang, W., Misener, J.A., 2009. Prediction of red light running based on statistics of discrete point sensors. Transp. Res. Rec. 2128, 132–142, http://dx.doi.org/10.3141/2128-14.