Accident prediction model for public highway-rail grade crossings


Accident Analysis and Prevention 90 (2016) 73–81

Contents lists available at ScienceDirect

Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap

Pan Lu a,∗, Denver Tolliver b

a Assistant Professor of Transportation, Upper Great Plains Transportation Institute, Dept. 2880, North Dakota State University, Fargo, ND 58108-6050, USA

b Director and Professor of Transportation, Upper Great Plains Transportation Institute, Dept. 2880, North Dakota State University, Fargo, ND 58108-6050, USA

Article info

Article history: Received 25 June 2015; received in revised form 15 January 2016; accepted 18 February 2016; available online 27 February 2016.

Keywords: Accident prediction; Railroad grade crossing; Poisson; Quasi-Poisson; Gamma; Binomial; Bernoulli; Under-dispersion

Abstract

Considerable research has focused on roadway accident frequency analysis, but relatively little research has examined safety evaluation at highway-rail grade crossings. Highway-rail grade crossings are critical spatial locations of utmost importance for transportation safety because traffic crashes at highway-rail grade crossings are often catastrophic with serious consequences. The Poisson regression model has been employed for many years as a good starting point for analyzing vehicle accident frequency. The most commonly applied variations of the Poisson model include the negative binomial and zero-inflated Poisson models. These models are used to deal with common crash data issues such as over-dispersion (sample variance is larger than the sample mean) and preponderance of zeros (low sample mean and small sample size). On rare occasions traffic crash data have been shown to be under-dispersed (sample variance is smaller than the sample mean), and traditional distributions such as the Poisson or negative binomial cannot handle under-dispersion well. The objective of this study is to investigate and compare various alternate highway-rail grade crossing accident frequency models that can handle the under-dispersion issue. The contributions of the paper are two-fold: (1) application of probability models to deal with under-dispersion issues and (2) insights into vehicle crashes at public highway-rail grade crossings. © 2016 Elsevier Ltd. All rights reserved.

1. Introduction

More than 97% of highway-rail crossings are railroad grade crossings (RGCs), where the highway and railroad tracks are at the same elevation and traffic alternates between railroads and highways. RGCs are critical spatial locations of utmost importance for transportation safety because traffic crashes at RGCs are often catastrophic, with serious consequences including fatalities, injuries, extensive property damage, and delays in railway and highway traffic (Raub, 2009). The consequences can be further exacerbated if collisions involve freight trains carrying hazardous materials, which could spill, resulting in environmental disaster or increased danger to those nearby. From 1996 to 2014, 26% of RGC accidents in North Dakota involved hazardous material. The need to improve traffic safety has been a major social concern in the United States for decades. Transportation agencies and other stakeholders must identify the factors that contribute to the likelihood of RGC collisions in order to better predict the probability of crashes and provide direction for RGC designs and policies that will reduce the number of crashes.

A large amount of literature exists on the evaluation of vehicle accidents at roadway intersections or on roadway segments (Young and Liesman, 2007; Campbell, 1991; Chen and Chen, 2011; Schoor et al., 2001; Chen et al., 2011; Zhang et al., 2013, 2012; Wang and Abdel-Aty, 2008; Xie et al., 2014; Kwon et al., 2015; Jung et al., 2014; Chen and Xie, 2014; Kashani et al., 2014). Although several studies have investigated accidents at RGCs (Belle et al., 1975; Gitelman and Hakkert, 1997; Yan et al., 2010; Austin and Carson, 2002; Raub, 2009; Oh et al., 2006), our literature review found that relatively little research effort has focused on RGC accidents compared to roadway accidents (Oh et al., 2006). Salmon et al. (2013) pointed out that, because of the limited research effort, various aspects of RGC performance remain poorly understood. Therefore, safety evaluation (i.e., accident frequency prediction) of RGCs is needed to re-examine both prediction methods and contributing factors at RGCs (Austin and Carson, 2002).

This paper seeks to investigate RGC crashes and contributing factors through the application of statistical models to crash data. Following a review of earlier highway-rail crossing accident prediction methods, the authors suggest improved accident prediction model alternatives that allow for handling the under-dispersion issue and provide the greatest insight into public RGC-related crashes, especially in North Dakota. This paper is divided into six sections. Section 2 presents a literature review on crash models and prediction methods that deal with under-dispersion issues at highway-rail grade crossings. Section 3 describes the statistical methodologies used for estimating the models. Section 4 provides details about the data used in the paper. Section 5 covers the application of the models to the dataset and describes the results of the analysis and comparisons. Section 6 summarizes the key results of the study and provides recommendations for further work.

∗ Corresponding author. Fax: +1 701 231 1945. E-mail addresses: [email protected] (P. Lu), [email protected] (D. Tolliver). http://dx.doi.org/10.1016/j.aap.2016.02.012 0001-4575/© 2016 Elsevier Ltd. All rights reserved.

2. Background and literature review

Because crash data are random, discrete, and non-negative in nature, generalized linear models (GLMs) (known as the frequentist approach in the statistical literature) have frequently been used to investigate the relationship between crash frequency and contributing factors. Poisson regression has been commonly used to model crash frequency because of the discrete and non-negative nature of crash data. GLMs are believed to be better suited for discrete and non-negative crash frequency data (Zhang et al., 2012). However, Lord and Mannering (2010) pointed out that although GLMs possess desirable elements for describing accidents, these models face various data challenges that can lead to incorrectly specified statistical models and, in turn, to incorrect predictions and explanatory factors.

The most common crash data issue is over- or under-dispersion. Over-dispersion occurs when the sample variance is greater than the sample mean. In many collision databases, the variance in accident frequency is higher than the mean. Over-dispersion arises from the unmeasured uncertainties associated with the observed or unobserved variables (Lord and Park, 2008). On rare occasions, crash data can display under-dispersion, where the sample variance is less than the sample mean (Oh et al., 2006). These issues are problematic (Lord and Mannering, 2010) because the most common count-data modeling approach requires that the variance be equal to the mean. Over- and under-dispersed data would lead to inconsistent standard errors for parameter estimates when using the traditional Poisson distribution (Cameron and Trivedi, 1998). Nevertheless, Poisson regression is usually a good modeling starting point (Oh et al., 2006).

When data show over-dispersion, modifications to the standard Poisson model are available, such as the negative binomial (NB) model (Lord and Mannering, 2010). When under-dispersion arises, less common models such as the gamma probability count model are believed to be capable of handling it (Oh et al., 2006). Poisson and NB models have been shown to have great limitations when applied to under-dispersed crash data (Oh et al., 2006). This research explores the potential GLM options for handling under-dispersed RGC crash data by (1) demonstrating the general forms of various models and (2) investigating and comparing models that may handle under-dispersed data using ND public RGC crash data.

It is worthwhile to note that Bayesian methods have gained popularity over the last few years in highway safety research, mainly due to their ability to handle the so-called regression-to-mean (RTM) problem, which is often associated with traditional frequentist approaches such as maximum likelihood estimation of the parameters in GLMs. RTM refers to the tendency of high or low crash counts in one time period to regress, or return, to the mean in subsequent time periods. The distinction between the frequentist approach and the Bayesian approach lies in the assumption about the parameters. The frequentist method assumes the response variable is random and the parameters and variance are fixed and unknown, while the Bayesian approach assumes that both the response variable and the parameters are random. Thus, many researchers have developed empirical or full Bayes methods in highway safety analysis (Lord and Park, 2008; Aguero-Valverde, 2013; Hauer, 2001) and found promising results in terms of better predictions and estimates and smaller errors. However, the Bayes method focuses on the randomness of parameters and does not assume an under- or over-dispersed distribution; thus, the Bayes approach is not considered in this study.

3. Statistical methodology

3.1. Poisson model

Non-negative integer count data are often approximated well by the Poisson regression model. In Poisson regression the dependent variable is an observed count that follows the Poisson distribution. The probability of RGC $i$ having $y_i$ crashes ($y_i = 0, 1, 2, \ldots$) is given by:

$$P(y_i) = \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!} \tag{1}$$

where $\lambda_i$ is the predicted count or Poisson parameter for RGC $i$, which is equal to RGC $i$'s expected number of crashes per year, $E[y_i]$ or $\mu$. In other words, if the number of crashes follows a Poisson distribution with parameter $\lambda$, the mean of the crashes $y_i$ is that parameter. The Poisson parameter $\lambda$ is the total number of events divided by the number of units in the data. The Poisson model specifies the Poisson parameter $\lambda_i$ as a function of explanatory variables; more specifically, it lets the logarithm of the mean depend on a vector of explanatory variables $x$. That is, for a given set of predictors $x_i$, the count outcome follows a Poisson distribution with rate $\lambda_i$. The most commonly selected functional form (or link function) is the log-linear form:

$$\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_m x_{im} \tag{2}$$

where the $\beta$s are the estimated regression coefficients and the $x$s are the explanatory variables. In this model, the regression coefficient $\beta_j$ represents the expected change in the log of the mean per unit change in the predictor $x_j$. The log link is the canonical link function for the Poisson distribution. Maximum likelihood techniques can be applied to find estimates $\hat{\beta}$ of the regression coefficients $\beta$, and the expected value of the response is modeled. One important property of the Poisson model is that it restricts the mean and variance of the distribution to be equal:

$$\mathrm{Var}[Y] = E[Y] = \mu \tag{3}$$

If the mean is not equal to the variance of the crash counts, the data are said to be either over- or under-dispersed and the resulting parameter estimates will be biased (Cameron and Trivedi, 1998). Crash data have often been found to exhibit over-dispersion due to unmeasured variances associated with the observed or unobserved variables (Lord and Park, 2008).

3.2. Negative binomial model

The negative binomial model, a special case of the Poisson–gamma mixture model, is a variant of the Poisson model designed to deal with over-dispersed data. The negative binomial model relaxes the constraint of equal mean and variance. The model assumes that the Poisson parameter $\lambda_i$ follows a gamma probability distribution:

$$\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_m x_{im} + \varepsilon_i \tag{4}$$

where $\exp(\varepsilon_i)$ is a gamma-distributed error term and the other variables are as defined earlier. The addition of this term allows the mean to differ from the variance:

$$\mathrm{Var}[Y] = E[Y]\,(1 + kE[Y]) = E[Y] + kE[Y]^2 = \mu + k\mu^2 \tag{5}$$

where $k$ is a second ancillary or heterogeneity parameter, often referred to as the dispersion parameter (Saha and Paul, 2005). If $k$ equals 0, the negative binomial model reduces to the Poisson model. In an NB model, counts are gamma distributed as they enter into the model, and $k$ enters into the model as a measure of over-dispersion in the data. The NB model is probably the most frequently used model in traffic crash data analysis; however, it is limited in its ability to handle under-dispersed data. If the dispersion parameter $k$ is set to a negative value to try to handle under-dispersion, the counts are no longer gamma distributed, which leads to unreliable parameter estimates, especially when the sample mean is low and the sample size is small (Saha and Paul, 2005; Oh et al., 2006).
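As a small numerical illustration of Eqs. (1), (3) and (5), the following Python sketch evaluates the Poisson probability mass function, the NB variance function, and the sample dispersion index (variance divided by mean). The crash counts are hypothetical values chosen only to show an index below 1; they are not the study data, and the function names are ours.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of observing y crashes, Eq. (1)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def nb_variance(mu, k):
    """Negative binomial variance mu + k * mu^2, Eq. (5)."""
    return mu + k * mu ** 2

def dispersion_index(counts):
    """Sample variance divided by sample mean.

    Roughly 1 under the Poisson assumption of Eq. (3), above 1 for
    over-dispersed data and below 1 for under-dispersed data.
    """
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

# Hypothetical sample: 100 crossings, 6 of which recorded one crash.
counts = [0] * 94 + [1] * 6
index = dispersion_index(counts)
print("dispersion index:", round(index, 4))
```

With 0/1 counts such as these, the index can never exceed 1, which is why an NB model (whose variance in Eq. (5) is at least the mean for k ≥ 0) cannot reproduce this pattern.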

3.3. The gamma model

The gamma model has been proposed by Oh et al. (2006) to handle under-dispersed RGC crash data. The gamma count model for count data is given as:

$$\Pr(y_i = j) = \mathrm{Gamma}(\alpha j, \lambda_i) - \mathrm{Gamma}(\alpha j + \alpha, \lambda_i) \tag{6}$$

where $\lambda_i = \exp(\beta X_i)$ and $\lambda_i$ is the mean of the crashes, $\mu$, and

$$\mathrm{Gamma}(\alpha j, \lambda_i) = \begin{cases} 1, & \text{if } j = 0 \\ \dfrac{1}{\Gamma(\alpha j)} \displaystyle\int_0^{\lambda_i} u^{\alpha j - 1} e^{-u}\, du, & \text{if } j > 0 \end{cases} \tag{7}$$

where $\alpha$ is the dispersion parameter: if $\alpha < 1$, there is over-dispersion; if $\alpha > 1$, there is under-dispersion; and if $\alpha = 1$, the gamma model reduces to the Poisson model. The model can therefore handle both under- and over-dispersion. Although the model is flexible enough to handle crash data, its limitations have restricted its applications (Lord and Mannering, 2010). These include the time-dependent observation assumption (Cameron and Trivedi, 1998) and the two-state characteristic needed to handle zero observations separately (Lord and Mannering, 2010), because the general gamma distribution restricts discrete responses to positive values.

3.4. The Conway–Maxwell–Poisson model

The Conway–Maxwell–Poisson (CMP) distribution is a generalization of the Poisson distribution that enables it to model both under- and over-dispersed data. The CMP is defined to be the distribution with probability mass function:

$$P(y_i) = \frac{1}{Z(\lambda_i, \nu_i)}\, \frac{\lambda_i^{y_i}}{(y_i!)^{\nu_i}} \quad \text{for } y_i = 0, 1, 2, \ldots \tag{8}$$

Note that, except for the normalization factor $Z(\lambda_i, \nu_i)$, which equals $\sum_{n=0}^{\infty} \lambda_i^{n} / (n!)^{\nu_i}$, the CMP distribution is very similar to the Poisson distribution with an extra parameter, $\nu_i$, which can take any non-negative value. The CMP distribution overcomes the requirement that mean and variance be equal by introducing $\nu$ to allow flexibility in modeling the tail behavior of the distribution. If $\nu = 1$, the distribution is the Poisson distribution. If $\nu < 1$, the distribution has longer tails than the Poisson distribution and can model over-dispersed data. A special case in this situation is $\nu = 0$ and $\lambda < 1$, where the distribution is the geometric distribution, an extreme case of over-dispersion. If $\nu > 1$, the distribution has shorter tails than the Poisson distribution and can model under-dispersed data. Another special case in this situation is $\nu \to \infty$ and $\lambda < 1$, where the distribution is the Bernoulli distribution, an extreme case of under-dispersion, and the data can only take the values 0 and 1.

The CMP distribution does not have closed-form expressions for its moments in terms of the parameters. As approximated by Shmueli et al. (2005), the mean and variance are estimated by:

$$E[Y] \approx \lambda^{1/\nu} + \frac{1}{2\nu} - \frac{1}{2} \tag{9}$$

$$\mathrm{Var}[Y] \approx \frac{1}{\nu}\, \lambda^{1/\nu} \tag{10}$$

Guikema and Coffelt (2008) propose a re-parameterization of the Shmueli et al. (2005) model (with $\mu = \lambda^{1/\nu}$), and the mean and variance can be approximated as:

$$E[Y] \approx \mu + \frac{1}{2\nu} - \frac{1}{2} \tag{11}$$

$$\mathrm{Var}[Y] \approx \frac{\mu}{\nu} \tag{12}$$

The dispersion is defined as:

$$D[Y] = \frac{\mathrm{Var}(Y)}{E(Y)} \approx \frac{1}{\nu} \tag{13}$$

3.5. The Bernoulli model

The Bernoulli distribution model is a logistic model in which the response follows the Bernoulli distribution and has only two possible outcomes, namely 'failure' or 'success' (0 or 1). The distribution is given by:

$$P(y_i) = \mu^{y_i} (1 - \mu)^{(1 - y_i)} \quad \text{for } y_i = 0, 1 \tag{14}$$

For the Bernoulli distribution, the response is a binomial event. The variance is:

$$\mathrm{Var}[Y] = \mu\,(1 - \mu) \tag{15}$$
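The role of the CMP shape parameter ν described in Section 3.4 can be checked numerically. The sketch below is a minimal illustration rather than the estimation procedure used in the paper: it truncates the normalizing sum Z at a fixed number of terms (an assumption that is adequate for small λ), and it uses a hypothetical rate λ = 0.5. It evaluates the dispersion Var(Y)/E(Y) for ν = 0 (geometric), ν = 1 (Poisson) and ν = 32.

```python
import math

def cmp_probabilities(lam, nu, terms=200):
    """Return P(Y = 0), ..., P(Y = terms - 1) for a CMP(lam, nu) variable.

    The weights lam**n / (n!)**nu of Eq. (8) are formed in log space and
    rescaled by the largest one, so the truncated normalizing sum cannot
    overflow. Truncating at `terms` is an approximation of the infinite sum.
    """
    log_w = [n * math.log(lam) - nu * math.lgamma(n + 1) for n in range(terms)]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    z = sum(weights)
    return [w / z for w in weights]

def dispersion(lam, nu, terms=200):
    """Variance-to-mean ratio of the truncated CMP distribution."""
    probs = cmp_probabilities(lam, nu, terms)
    mean = sum(n * p for n, p in enumerate(probs))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(probs))
    return var / mean

for nu in (0.0, 1.0, 32.0):
    print("nu =", nu, "dispersion =", round(dispersion(0.5, nu), 4))
```

For large ν almost all probability mass sits on 0 and 1, so the ratio approaches the Bernoulli value 1 − p with p = λ/(1 + λ), which is always below 1.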

The Bernoulli model can handle only under-dispersed data, because the variance of a Bernoulli variable never exceeds its mean.

3.6. The hurdle Poisson model

The hurdle Poisson model allows for a systematic difference in the statistical process governing observations with zero counts and those with a positive number of counts. One part of the hurdle model is a binary outcome model (logistic) governing whether an observation has a zero or a positive count, and the second part is a truncated-at-zero Poisson count model for observations with positive counts. The hurdle model is not only flexible enough to handle both under- and over-dispersion but can also account for "excess zeros." The probability distribution of the hurdle model is given by:

$$f(y_i) = \begin{cases} f_1(z_i), & y_i = 0 \\ \dfrac{1 - f_1(z_i)}{1 - f_2(0)}\, f_2(y_i) = \Phi f_2(y_i), & y_i = 1, 2, \ldots \end{cases} \tag{16}$$

where $f_1(z_i)$, the probability of extra zeros, $\pi$, is the density function of the logistic model with explanatory variables $z_i$ and parameters $\gamma$, and $f_2(y_i)$ is the probability density function of a truncated Poisson regression model. The numerator of $\Phi$ is the probability of crossing the hurdle. If the numerator is the same as the denominator, which is the sum of the probabilities of positive counts, then $\Phi = 1$ and the hurdle model reduces to a one-stage parent model. When excess zeros exist, $f_1(z_i)$ exceeds $f_2(0)$ and therefore $\Phi < 1$. The variance is defined as:

$$\mathrm{Var}[Y] = \Phi \lambda_2 (\lambda_2 + 1) - (\Phi \lambda_2)^2 \tag{17}$$

$$\mathrm{Var}[Y] = E[Y] + \frac{1 - \Phi}{\Phi}\, (E[Y])^2 \tag{18}$$

where $\lambda_2$ is the expected value of the un-truncated parent distribution. Under-dispersion is obtained for $1 < \Phi < (\lambda_2 + 1)/\lambda_2$, and over-dispersion is obtained when $0 < \Phi < 1$.

Table 1 Independent variable description.

| Variable | Property | Description | Minimum | Maximum | Mean | SD |
|---|---|---|---|---|---|---|
| Crossing Warning Devices | Category | Cross buck, stop sign, gate, flashing, and no control | N/A | N/A | N/A | N/A |
| AADT | Numeric | Annual average daily traffic | 1 | 13,000 | 329 | 1387 |
| Train per Day | Numeric | Daily total train movements | 0 | 43 | 5 | 10 |
| Track Numbers | Numeric | Number of rail tracks | 0 | 6 | 1 | 0.59 |
| Paved Highway | Binary | Is the highway paved? 1 = yes and 0 = no | 0 | 1 | 0.18 | 0.38 |
| Max Train Speed | Numeric | Maximum allowable train speed | 0 | 79 | 31 | 19 |
| No Historical Accident | Binary | Did the crossing have a historical accident? 1 = yes and 0 = no | 0 | 1 | 0.06 | 0.23 |

3.7. Zero-inflated Poisson model (ZIP)

Zero-inflated models are theorized to account for "excess zeroes," meaning that the zeroes observed in the database exceed the number of zeroes predicted by Poisson models. The model is a mixed model with two components and is a dual-state model like the hurdle model. The difference between the ZIP and hurdle models is that only one component of the hurdle model corresponds to the zero-generating process, whereas both components of the ZIP model govern the zero-generating process. One part of the ZIP model is a binary outcome model (logistic) that separates excess zeros from the remaining counts. The second part of the model is governed by a Poisson distribution that generates counts, some of which can be zero. ZIP assumes that the events $Y = y_i$ are independent, and the probability distribution of the ZIP model is given by:

$$f(y_i) = \begin{cases} f_1(z_i) + (1 - f_1(z_i))\, f_2(0), & y_i = 0 \\ (1 - f_1(z_i))\, f_2(y_i), & y_i = 1, 2, \ldots \end{cases} \tag{19}$$

where $f_1(z_i)$ is the probability of extra zeros, $\pi$; $f_2(y_i) = \lambda_i^{y_i} e^{-\lambda_i} / y_i!$ is the Poisson probability density; and $\lambda_i$ is the expected Poisson count for the $i$th individual. The mean and variance are defined as:

$$E[Y] = \lambda\,(1 - \pi) \tag{20}$$

$$\mathrm{Var}[Y] = \lambda\,(1 - \pi)(1 + \lambda\pi) \tag{21}$$
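The two dual-state distributions of Sections 3.6 and 3.7 can be compared with a short Python sketch. It is an illustration under stated assumptions: the zero probability `p_zero` (playing the role of f1(z_i) in the hurdle model), the extra-zero probability `p_extra` (π in the ZIP model) and the Poisson rate `lam` are fixed hypothetical numbers rather than fitted regression outputs, and moments are computed by summing over a truncated support.

```python
import math

def poisson_pmf(y, lam):
    """Parent Poisson probability f2(y)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def hurdle_pmf(y, p_zero, lam):
    """Hurdle Poisson probability: zeros come only from the binary part."""
    if y == 0:
        return p_zero
    phi = (1.0 - p_zero) / (1.0 - poisson_pmf(0, lam))
    return phi * poisson_pmf(y, lam)

def zip_pmf(y, p_extra, lam):
    """Zero-inflated Poisson probability: zeros come from both parts."""
    if y == 0:
        return p_extra + (1.0 - p_extra) * poisson_pmf(0, lam)
    return (1.0 - p_extra) * poisson_pmf(y, lam)

def mean_and_variance(pmf, upper=60):
    """Moments from a support truncated at `upper` (ample for small rates)."""
    probs = [pmf(y) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

lam = 0.5  # hypothetical Poisson rate
hurdle_mean, hurdle_var = mean_and_variance(lambda y: hurdle_pmf(y, 0.5, lam))
zip_mean, zip_var = mean_and_variance(lambda y: zip_pmf(y, 0.3, lam))
print("hurdle:", round(hurdle_mean, 4), round(hurdle_var, 4))
print("ZIP:   ", round(zip_mean, 4), round(zip_var, 4))
```

In this example the hurdle zero probability (0.5) is smaller than the parent Poisson zero probability, so the ratio Φ exceeds 1 and the hurdle variance falls below its mean, whereas the ZIP variance is always at least as large as its mean. This is consistent with the later observation that ZIP does not address under-dispersion.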

4. Data sources

With consideration given to data size and the need for local RGC crash analysis for North Dakota (ND), data for this investigation were extracted from public RGC data from 1996 to 2014 for ND. Data to support the research came from two sources: (1) the FRA's Office of Safety accident/incident database and (2) the FRA's Office of Safety highway-rail crossing inventory. The accident/incident database provides accident-related information such as location, time, weather and road conditions for each accident occurrence, while the highway-rail crossing inventory database describes each crossing's location, traffic conditions, infrastructure equipment, and historical accident information. A new data set was generated by matching records on the highway-rail grade crossing identification number, so that each crossing carries the data elements from both datasets.
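The matching step can be sketched in a few lines of Python. This is only an illustration of joining two record sets on a shared crossing identifier: the field names (`crossing_id`, `aadt`, `year`, `accident_count`, `had_accident`) and the sample records are hypothetical and do not reflect the actual FRA file layouts.

```python
from collections import Counter

def build_crossing_dataset(inventory, accidents):
    """Attach accident counts to inventory records by crossing identifier.

    Every inventory record is kept; crossings with no matching accident
    record receive a count of 0 and a binary indicator of 0.
    """
    counts = Counter(record["crossing_id"] for record in accidents)
    merged = []
    for crossing in inventory:
        row = dict(crossing)  # copy so the input records are not modified
        n_accidents = counts.get(crossing["crossing_id"], 0)
        row["accident_count"] = n_accidents
        row["had_accident"] = 1 if n_accidents > 0 else 0
        merged.append(row)
    return merged

# Hypothetical sample records.
sample_inventory = [
    {"crossing_id": "A1", "aadt": 329},
    {"crossing_id": "B2", "aadt": 1200},
]
sample_accidents = [
    {"crossing_id": "A1", "year": 1999},
    {"crossing_id": "A1", "year": 2005},
]
merged = build_crossing_dataset(sample_inventory, sample_accidents)
print(merged)
```

Keeping every inventory record, including crossings without accidents, matters here because the zero-count crossings make up most of the sample that the count models are fitted to.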

A total of 16 explanatory variables were investigated and tested: warning devices, highway pavement condition, presence of pavement markings, presence of interconnection/pre-emption, smallest crossing angle, presence of a pullout lane, functional classification of the highway, train traffic density, highway user types, weather conditions, track conditions, highway traffic density, maximum train speed, location, accident history, and commercial power availability. The database contains all the historical crash information at highway-rail grade crossings (HRGCs) in North Dakota during 1996–2014. A total of 344 public highway-rail grade crossing accidents occurring from 1996 to 2014 were selected from a sample of 5551 highway-rail grade crossing records. With the intent to study crash-associated factors, a binary target variable (ACCIDENT) is defined with two classes: a value of 1 indicates that there was a crash, while a value of 0 represents a crossing with no crash.

The sixteen input variables can be divided into three categories: traffic characteristics, highway characteristics, and crossing characteristics. Traffic characteristics contain traffic information at crossings; they describe highway and railway traffic volume and traffic speed. Highway characteristics contain highway design information at crossings, such as number of highway lanes, pavement, and highway system levels. Crossing characteristics describe warning devices, crash history and other crossing-related characteristics. Table 1 lists the seven independent variables selected in the GLM models.

5. Data analysis

5.1. Results of the analysis

As suggested by many researchers, the Poisson regression model is a good modeling starting point (Oh et al., 2006; Lord and Mannering, 2010) for exploring the dispersion issue and potential explanatory variables, because accident frequencies are discrete and non-negative integer values.

ND RGC crash data from 1996 to 2014 were fitted with a Poisson regression model. A 99% confidence level was selected as the cutoff for statistical significance. The best-fitted Poisson model was selected through an explicit enumeration of all 16 possible independent variables with several commonly used criteria, including Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). They can be mathematically expressed as Eqs. (22) and (23):

$$\mathrm{AIC} = n \ln\left(\frac{SSE}{n}\right) + 2k \tag{22}$$

$$\mathrm{BIC} = n \ln\left(\frac{SSE}{n}\right) + \frac{2(k + 2)\, n \sigma^2}{SSE} - \frac{2 n^2 \sigma^4}{SSE^2} \tag{23}$$

where $n$ is the sample size, $SSE$ is the sum of squared errors, $e_i$ is the validation error, $k$ is the number of independent variables, and $\sigma^2$ is the error variance. The model with the smallest AIC and BIC is deemed the "best" model because it minimizes the difference between the given model and the "true" model. In other words, smaller AIC or BIC values indicate a better-fitted model (UCLA, 2007).
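A minimal Python sketch of the two selection criteria follows. It assumes that Eqs. (22) and (23) take the SSE-based forms AIC = n ln(SSE/n) + 2k and BIC = n ln(SSE/n) + 2(k + 2)q − 2q², with q = nσ²/SSE; the function names and the numbers in the usage lines are hypothetical and are not the study's values.

```python
import math

def aic_sse(n, sse, k):
    """AIC = n * ln(SSE / n) + 2k (assumed form of Eq. 22)."""
    return n * math.log(sse / n) + 2 * k

def bic_sse(n, sse, k, sigma2):
    """BIC = n * ln(SSE / n) + 2(k + 2)q - 2q^2 with q = n * sigma2 / SSE
    (assumed form of Eq. 23)."""
    q = n * sigma2 / sse
    return n * math.log(sse / n) + 2 * (k + 2) * q - 2 * q * q

# Hypothetical comparison of two candidate models on the same 100 records.
small_model = aic_sse(n=100, sse=50.0, k=3)
large_model = aic_sse(n=100, sse=49.5, k=6)
print("prefer small model:", small_model < large_model)
```

In this made-up comparison the three extra variables reduce SSE only slightly, so the penalty term 2k outweighs the gain in fit and the smaller model has the lower AIC.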

The estimation results of the saturated Poisson model are presented in Table 2. Seven variables are found to be statistically significant in determining accident likelihood, and under-dispersion may exist according to the lack-of-fit ratio, value/DF = 0.69, which is well below 1. This reveals that the data may be under-dispersed. As shown in Table 2, highway traffic, train traffic, number of tracks, and maximum train speed all significantly and positively influence the accident likelihood. Certain types of warning devices decrease the accident likelihood compared with a "stop sign" warning device only. When the highway is paved, the accident likelihood increases compared with an unpaved highway. Likewise, when a crossing has no historical accident record, the likelihood of an accident decreases. All the results indicate that the Poisson model is a good first choice for investigating crash data.

Table 2 Poisson estimation results.

| Variable | Estimated coefficient | Pr > ChiSq |
|---|---|---|
| Intercept | −1.3266 | 0.0001 |
| Cross buck | −0.6576 | 0.024 |
| Gate | −0.2664 | 0.3895 |
| No control | −2.9525 | 0.0046 |
| Flashing | −1.2687 | 0.057 |
| Stop sign | – | – |
| AADT | 0.0001 | <0.0001 |
| Train per day | 0.0267 | <0.0001 |
| Track numbers | 0.1763 | 0.0079 |
| Paved highway | 0.5764 | 0.0001 |
| Max train speed | 0.0149 | <0.0001 |
| No historical accident | −2.4058 | <0.0001 |
| AIC | 1939 | |
| BIC | 2012 | |
| Pearson chi-square | 3888.66 | |
| Value/DF | 0.69 | |
| Log likelihood | −958.579 | |

Type 3 analysis P-values (row alignment lost in the source): <0.0001, <0.0001, 0.0004, <0.0001, 0.01, 0.0002, <0.0001, <0.0001.

However, as mentioned earlier in this paper, the negative binomial model, which is suggested by many researchers as a modeling method for crash data, is not appropriate for the under-dispersed data that may occur in this research dataset. To continue to improve the crash frequency regression model while suspecting possible under-dispersion, attempts were made to analyze accident frequency from under-dispersed data with Zero-Inflated Poisson (ZIP), NB, Poisson Hurdle (PH), Bernoulli, and Conway–Maxwell–Poisson models using least squares regression techniques. The ZIP and NB models were selected for their popularity in the literature, and the PH, Bernoulli, and CMP models were selected because they target under-dispersed data.

5.2. Model comparison

Table 3 compares statistically significant contributing variables and model goodness-of-fit (GOF) statistics, including AIC, BIC, the Pearson chi-square statistic, degrees of freedom (DF) and the log-likelihood (LL) statistic. As described earlier, smaller AIC and BIC values indicate a better fit. If the model fits the data perfectly without any dispersion, the Pearson chi-square is roughly equal to the model degrees of freedom. In other words, the closer the Pearson chi-square is to the DF, the better the model fits the data (SAS, 2015). The log-likelihood statistic (LL) is calculated by taking the logarithm of the estimated likelihood for each observation and summing those log-likelihoods. The larger (closer to zero) the LL, the better the model (UCLA, 2007). The statistics are defined mathematically as:

$$\text{chi-sq} = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \tag{24}$$

$$LL = \sum_{i=1}^{n} \log(P_i) \tag{25}$$

$$DF = n - (p + 1) \tag{26}$$

where $O_i$ is the observed frequency of count equal to $i$, $E_i$ is the expected frequency of count equal to $i$, $P_i$ is the expected likelihood of count equal to $i$, $n$ is the sample size used in fitting the model, and $p$ is the number of parameters used in fitting the model.

As Table 3 indicates, the Poisson, CMP, Bernoulli, NB and Poisson hurdle models all identify the same significant contributory variables, whereas the ZIP model identifies fewer significant contributory variables. Some findings are: (1) All GOF statistics provide consistent model preferences: for AIC and BIC, the Bernoulli model is the first preference, followed by the CMP, PH, ZIP, Poisson, and NB models; for Pearson chi-square/DF and LL, the Bernoulli and PH models are tied as the first preference, followed by the CMP, ZIP, Poisson, and NB models. (2) All three proposed models that can potentially handle under-dispersed data, CMP, Bernoulli, and PH, fit better than the Poisson model. The three models perform almost equally well for under-dispersed data, since the GOF values generated by those three models are very close. (3) If a wrong

model, such as the NB model, is selected to fit under-dispersed data, the GOF criteria values will indicate poor model fit compared to the Poisson model. And (4) if a model, such as the ZIP model, is selected to solve other issues but not under-dispersion, the GOF criteria values may show improved fit compared to the Poisson model, reflecting the resolution of a previously unnoticed issue, but the improvement is limited because the under-dispersion is not solved by ZIP.

Table 3 GOF model comparison results.

| Variable | Poisson | CMP | Bernoulli | PH | ZIP | NB |
|---|---|---|---|---|---|---|
| Intercept | X | X | X | X | … | X |
| Cross buck | | | | | … | |
| Gate | | | | | … | |
| No control | X | X | X | X | … | X |
| Flashing | | | | | … | |
| Stop sign | – | – | – | – | – | – |
| AADT | X | X | X | X | … | X |
| Train per day | X | X | X | X | … | X |
| Track numbers | X | X | X | X | … | X |
| Paved highway | X | X | X | X | … | X |
| Max train speed | X | X | X | X | … | X |
| No historical accident | X | X | X | X | … | X |
| AIC | 1939(5) | 1609(2) | 1601(1) | 1623(3) | 1825(4) | 1941(6) |
| BIC | 2012(5) | 1709(2) | 1674(1) | 1769(3) | 1971(4) | 2021(6) |
| Pearson chi-square | 3889(5) | 4732(3) | 4753(1) | 4753(1) | 4482(4) | 3889(5) |
| DF | 5609 | 5605 | 5605 | 5605 | 5609 | 5609(5) |
| Log likelihood | −958.58(5) | −789.49(3) | −789.5(1) | −789.5(1) | −890.7(4) | −958.58(5) |

'X' indicates a significant parameter and "–" indicates the variable is the reference variable for warning devices. Numbers in parentheses are model ranks. '…' marks ZIP cells whose row position could not be recovered from the source, which shows four significant ZIP variables among the rows after "Stop sign".

Fig. 1. CURE plot for Poisson model against highway traffic.

Fig. 2. CURE plot for Poisson model against railway traffic.

Fig. 3. CURE plot for Bernoulli model against highway traffic.

Fig. 4. CURE plot for Bernoulli model against railway traffic.

Note that the parameter $\nu$ in the CMP model is estimated as 32 for the dataset. Recall that if $\nu > 1$, CMP can model under-dispersed data, and when $\nu \to \infty$ the distribution reduces to the Bernoulli distribution. The estimate of $\nu = 32$ and the close GOF values indicate that the dataset used in this research may be extremely under-dispersed and that the CMP model performs as well as the Bernoulli model.

To further show the model performance and the impact on estimated contributing variables of the three proposed models, CMP, Bernoulli, and PH, Table 4 summarizes the parameter estimates.

Table 4 Model parameter estimator comparison results.

| Variable | Poisson | CMP | Bernoulli | PH |
|---|---|---|---|---|
| Intercept | −1.3266 | 0.133448 | 3.5165 | 3.5178 |
| Cross buck | −0.6576 | −0.025279 | −0.7970 | −0.7975 |
| Gate | −0.2664 | −0.005391 | −0.1698 | −0.1704 |
| No control | −2.9525 | −0.097882 | −3.0864 | −3.0876 |
| Flashing | −1.2687 | −0.061455 | −1.9378 | −1.9384 |
| Stop sign | – | – | – | – |
| AADT | 0.0001 | 0.005071 | 0.000160 | 0.000160 |
| Train per day | 0.0267 | 0.001485 | 0.04684 | 0.04683 |
| Track numbers | 0.1763 | 0.009971 | 0.3144 | 0.3144 |
| Paved highway | 0.5764 | 0.021940 | 0.6917 | 0.6918 |
| Max train speed | 0.0149 | 0.000549 | 0.0173 | 0.01731 |
| No historical accident | −2.4058 | −0.244862 | −7.7216 | −7.7225 |

Note that the Poisson and CMP models have log link functions, and the Bernoulli and PH models have logit link functions. Consequently, the parameter estimates cannot be directly compared between the two groups. However, the signs and the relative relationships can be assessed, and the estimates can be compared between models within the same link function group. Several observations can be drawn from Table 4: (1) All models provide the same parameter signs for all the variables. (2) The Bernoulli regression model and the Poisson hurdle model provide almost identical parameter estimates, indicating that both can handle under-dispersed crash data well and provide consistent model results. (3) The Conway–Maxwell–Poisson model generates quite different parameter estimates for the explanatory variables compared to the Poisson regression model, indicating that even though the models provide a consistent relative relationship between contributory variables and crash likelihood, the absolute parameter estimates can be significantly affected by model type. (4) Crossing warning devices, highway traffic, rail traffic, maximum train speed, number of tracks, presence of a paved highway, and historical accident records are all significant contributing variables to RGC accident likelihood. And (5) AADT, trains per day, number of tracks, paved highway, historical crashes and maximum train travel speed all positively affect highway-rail grade crash likelihood. Most of these effects meet intuitive expectations. For example, the more highway or railway traffic there is, the more likely a crash becomes.
The higher the number of tracks, the higher the potential rail traffic and the longer the distance required to traverse the crossing, and hence the higher the crash likelihood. A paved highway can indicate higher highway travel speeds than an unpaved highway, which in turn increases the probability that a vehicle fails to stop and collides with a train. However, the association of higher train speed with higher crash likelihood is somewhat counterintuitive. If a crossing has faster trains, one might assume that highway travelers would be more cautious when traversing it, which would lead to a lower crash likelihood. One potential rationale for the positive relationship between train speed and crash likelihood is that most drivers are unfamiliar with the crossing and do not know that it carries faster trains. Thus, a detailed analysis of contributory variable effects is needed to further understand these relationships.

Hauer and Bamfo (1997) proposed a cumulative residuals (CURE) method to investigate how well a model fits the data with respect to each individual covariate, in which the cumulative residuals are plotted in ascending order of a specific covariate. Hauer and Bamfo (1997) stated that if a model equation fits the data along the whole range of values of a variable, one can expect the cumulative residual graph to be a random walk that oscillates around 0 and ends close to 0. The CURE method proposed by Hauer and Bamfo (1997) defines upper and lower bounds for the cumulative residuals at ±2 standard deviations (±2σ*). Their method is a graphical test. Other researchers have used the same CURE analysis but extended it with a simulation tool to make both graphical and numerical tests possible (Lin et al., 2002; SAS, 2015). Their method does not define ±2 standard deviation boundaries but instead performs a resampled simulation to capture simulated cumulative residual paths. If the simulation size is large enough, the simulated paths not only cover the ±2 standard deviation lines, but the associated p-value for the Kolmogorov-type supremum test can also be calculated from the simulations.

This paper applies the CURE method to two models, Poisson and Bernoulli, against two exposure variables: highway traffic (AADT) and rail traffic (trains per day). The Poisson model is the basic modeling starting point, and the Bernoulli model is believed to fit under-dispersed data better. Moreover, the Poisson hurdle model developed in this research is almost identical to the Bernoulli model and would therefore generate an almost identical CURE plot. Details regarding the statistics and the construction of the plots may be found in other resources (Lin et al., 2002; SAS, 2015). Figs. 1 and 2 are CURE plots for the Poisson model against highway and rail traffic, respectively. Highway traffic is in units of 1000. Figs. 3 and 4 are CURE plots for the Bernoulli model against highway and rail traffic, respectively. Rail traffic is in units of 10.
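The graphical CURE construction described above can be sketched in a few lines. This is only an illustration, not the procedure used to produce Figs. 1–4: the function name `cure_curve` and the sample inputs are hypothetical, and the bound uses the form commonly attributed to Hauer and Bamfo, σ*(n) = σ(n)·sqrt(1 − σ²(n)/σ²(N)), where σ²(n) is the running sum of squared residuals. That formula is an assumption here, since the paper does not reproduce it.

```python
import math
from itertools import accumulate


def cure_curve(covariate, observed, predicted):
    """Return (sorted covariate, cumulative residuals, +/-2 sigma* half-widths).

    Assumed bound: sigma*(n) = sigma(n) * sqrt(1 - sigma^2(n) / sigma^2(N)),
    with sigma^2(n) taken as the running sum of squared residuals.
    """
    # Order crossings by the covariate of interest (e.g. AADT or trains per day).
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    residuals = [observed[i] - predicted[i] for i in order]

    # Cumulative residuals: the solid line of a CURE plot.
    cumulative = list(accumulate(residuals))

    # Running sum of squared residuals, used for the assumed bound.
    cum_sq = list(accumulate(r * r for r in residuals))
    total = cum_sq[-1]

    bounds = []
    for s2 in cum_sq:
        inner = s2 * (1.0 - s2 / total) if total > 0 else 0.0
        bounds.append(2.0 * math.sqrt(max(inner, 0.0)))  # half-width of the band

    xs = [covariate[i] for i in order]
    return xs, cumulative, bounds


# Hypothetical toy data: three crossings with observed counts and model predictions.
xs, cum, band = cure_curve([3, 1, 2], [1, 0, 0], [0.5, 0.2, 0.3])
for x, c, b in zip(xs, cum, band):
    print(f"x={x}  cumulative residual={c:+.3f}  band=+/-{b:.3f}")
```

A well-fitting model should give a `cum` path that wanders around zero and stays mostly inside `±band`; a long upward or downward drift over a range of the covariate points to a functional-form problem in that range.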
If the CURE value steadily increases within a range of values of the independent variable, the model predicts fewer accidents than were observed in that range; a decreasing CURE line indicates that the model predicts more accidents than were observed. The main idea of the plots is that the actual cumulative residuals (solid bold line) should oscillate around zero, end close to zero, and fall within a set of bounds (dashed lines) found by resampling the crash data several times. The graphical results in the four figures show that the actual cumulative residuals oscillate around zero and end close to zero, and that most parts of the solid lines fall within the boundaries, which indicates correct model fits against both traffic variables. According to the numerical results shown in Figs. 1 and 2, the p-values for the Kolmogorov-type supremum test under the null hypothesis of a correct model fit, based on 10,000 resamples, are 0.0168 and 0.0002 for highway traffic and rail traffic, respectively. These values lead to rejection of the null hypothesis of correct model fit and suggest that a more adequate model might be possible. The corresponding p-values for the Bernoulli model against highway and railway traffic are 0.0491 and 0.0778, respectively, indicating that the Bernoulli model is a much improved functional form for crash data. Based on both graphical and numerical tests, it can be concluded that the Bernoulli models fit the data better than the Poisson models.
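To make the resampling idea concrete, the following is a deliberately simplified sketch of a supremum-type test on cumulative residuals. It perturbs each residual with an independent standard normal multiplier and compares the largest absolute cumulative residual of each simulated path with the observed one. Note the assumption: the method of Lin et al. (2002) also includes a term that accounts for the estimation of the regression parameters, which is omitted here, so this sketch will not reproduce the p-values reported above. The function name `supremum_test` and the inputs are hypothetical.

```python
import random
from itertools import accumulate


def supremum_test(covariate, observed, predicted, n_sim=1000, seed=0):
    """Simplified Kolmogorov-type supremum test on cumulative residuals.

    Assumption: the parameter-estimation correction used by Lin et al. (2002)
    is left out; only the residuals are perturbed.
    Returns (observed supremum, simulated p-value).
    """
    # Order residuals by the covariate, as in a CURE plot.
    order = sorted(range(len(covariate)), key=lambda i: covariate[i])
    residuals = [observed[i] - predicted[i] for i in order]

    # Observed test statistic: largest absolute cumulative residual.
    observed_sup = max(abs(c) for c in accumulate(residuals))

    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        # One simulated path: residuals times independent N(0, 1) multipliers.
        perturbed = (r * rng.gauss(0.0, 1.0) for r in residuals)
        simulated_sup = max(abs(c) for c in accumulate(perturbed))
        if simulated_sup >= observed_sup:
            exceed += 1

    # Share of simulated paths at least as extreme as the observed path.
    return observed_sup, exceed / n_sim


# Hypothetical example: a model that under-predicts at every crossing.
sup, p_value = supremum_test(list(range(50)), [1.0] * 50, [0.0] * 50, n_sim=500)
print(sup, p_value)
```

A small p-value means that the observed cumulative residual path drifts further from zero than the simulated paths typically do, which is the evidence against correct fit that the text reports for the Poisson model.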

6. Summary and conclusion

The paper presents an empirical analysis of RGC accidents in ND. A range of models were investigated and estimated with the same set of data, which displays under-dispersion. The modeling results are compared based on model goodness-of-fit. The investigation was intended to address two main issues. Highway-rail grade crossing safety is a major social and economic focus for transportation system operators because of the catastrophic and serious consequences of RGC accidents. Relatively little research has been conducted on RGC accidents compared to highway and intersection accidents. Additional research on RGC accidents is needed to better understand and identify the factors that contribute to them. RGC crashes can show significantly different distributions from motor vehicle crashes because train operation is significantly different from motor vehicle operation. For example, most motor vehicle crash data show over-dispersion, whereas the RGC crash data used in this research show under-dispersion. Consequently, existing popular safety evaluation methods may not be suitable for assessing highway-rail grade crossing safety.

After the application and investigation of several statistical models for ND public highway-rail grade crossing crashes, the following general conclusions can be drawn. Of the six tested prediction models, the Poisson model is suggested as the starting model because of its equal-dispersion assumption and because crash counts are discrete, non-negative integers. The Poisson model's equal-dispersion assumption can help assess data dispersion. The Conway–Maxwell–Poisson, Bernoulli, and Poisson hurdle models are suitable for assessing ND RGC accident data, which exhibited under-dispersion with respect to the Poisson distribution. The Conway–Maxwell–Poisson and Poisson hurdle models are flexible models that can accommodate both under- and over-dispersion. The Bernoulli distribution regression model is only appropriate for under-dispersion. All three proposed models have rarely been used in the transportation safety literature.

For future highway-rail grade crossing safety research, detailed assessments of model fit, performance, and robustness should be conducted for each model individually on various under-dispersed data sets, i.e., at various mean and variance levels. Other non-parametric, evolutionary, and data mining techniques should be investigated and compared with the fit and performance of traditional regression methods.
The strengths and weaknesses of each type of modeling method should be compared for RGC safety evaluation. More detailed data investigation should be conducted to include additional driver and geometric factors. For example, driver behavior and near-miss crash analyses should be conducted for RGC safety evaluations.

Acknowledgments

The research for this paper was financially supported by the U.S. Department of Transportation (USDOT)/University Transportation Center (UTC). We gratefully acknowledge their financial support, without which the present study could not have been completed.

References

Aguero-Valverde, J., 2013. Full Bayes Poisson gamma, Poisson lognormal, and zero inflated random effects models: comparing the precision of crash frequency estimates. Accid. Anal. Prev. 50, 289–297.
Austin, R.D., Carson, J.L., 2002. An alternative accident prediction model for highway-rail interfaces. Accid. Anal. Prev. 34, 31–42.
Belle, G.V., Meeter, D., Farr, W., 1975. Influencing factors for railroad-highway grade crossing accidents in Florida. Accid. Anal. Prev. 7, 103–112.


Cameron, A.C., Trivedi, P.K., 1998. Regression Analysis of Count Data. University Press, Cambridge, UK.
Campbell, K.L., 1991. Fatal accident involvement rates by driver age for large trucks. Accid. Anal. Prev. 23 (4), 287–293.
Chen, F., Chen, S., 2011. Injury severities of truck drivers in single- and multi-vehicle accidents on rural highways. Accid. Anal. Prev. 43 (5), 1677–1688.
Chen, S., Chen, F., Wu, J., 2011. Multi-scale traffic safety and operational performance study of large trucks on mountainous interstate highway. Accid. Anal. Prev. 43 (1), 429–438.
Chen, C., Xie, Y., 2014. Modeling the safety impacts of driving hours and rest breaks on truck drivers considering time-dependent covariates. J. Saf. Res. 51, 57–63.
Hauer, E., 2001. Overdispersion in modeling accidents on road sections and in empirical Bayes estimation. Accid. Anal. Prev. 33, 799–808.
Gitelman, V., Hakkert, A.S., 1997. The evaluation of road-rail crossing safety with limited accident statistics. Accid. Anal. Prev. 29, 171–179.
Guikema, S.D., Coffelt, J.P., 2008. Modeling count data in risk analysis and reliability engineering. In: Misra, K.B. (Ed.), Handbook of Performability Engineering. Springer, London.
Hauer, E., Bamfo, J., 1997. Two tools for finding what function links the dependent variable to the explanatory variables. In: Proceedings of the ICTCT 1997 Conference, Lund, Sweden.
Jung, S., Jang, K., Yoon, Y., Kang, S., 2014. Contributing factors to vehicle to vehicle crash frequency and severity under rainfall. J. Saf. Res. 50, 1–10.
Kashani, A.T., Rabieyan, R., Besharati, M.M., 2014. A data mining approach to investigate the factors influencing the crash severity of motorcycle pillion passengers. J. Saf. Res. 51, 93–98.
Kwon, O.H., Rhee, W., Yoon, Y., 2015. Application of classification algorithms for analysis of road safety risk factor dependencies. Accid. Anal. Prev. 75, 1–15.
Lord, D., Mannering, F., 2010. The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives. Transp. Res. Part A Policy Pract. 44 (5), 291–305.
Lord, D., Park, P.Y., 2008. Investigating the effects of the fixed and varying dispersion parameters of Poisson-gamma models on empirical Bayes estimates. Accid. Anal. Prev. 40, 1441–1457.
Lin, D.Y., Wei, L.J., Ying, Z., 2002. Model-checking techniques based on cumulative residuals. Biometrics 58, 1–12.
Oh, J., Washington, S.P., Nam, D., 2006. Accident prediction model for railway-highway interfaces. Accid. Anal. Prev. 38, 346–356.
Raub, R., 2009. Examination of highway-rail grade crossing collisions nationally from 1998 to 2007. Transp. Res. Rec. 2122 (1), 63–71.
Salmon, P.M., Lenne, M.G., Young, K.L., Walker, G.H., 2013. An on-road network analysis-based approach to studying driver situation awareness at rail level crossings. Accid. Anal. Prev. 58, 195–205.
Saha, K., Paul, S., 2005. Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics 61 (1), 179–185.
SAS, (accessed 07.12.15.) http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#genmod_toc.htm.
Schoor, O.V., Niekerk, J.L.V., Grobbelaar, B., 2001. Mechanical failures as a contributing cause to motor vehicle accidents—South Africa. Accid. Anal. Prev. 33 (6), 713–721.
Shmueli, G., Minka, T.P., Kadane, J.B., Borle, S., Boatwright, P., 2005. A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. J. R. Stat. Soc. C 54 (1), 127–142.
UCLA, 2007. Regression Models with Count Data (accessed 12.07.15.) http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm.
Wang, X., Abdel-Aty, M., 2008. Modeling left-turn crash occurrence at signalized intersections by conflicting patterns. Accid. Anal. Prev. 40, 76–88.
Xie, K., Ozbay, K., Yang, H., 2014. Spatial analysis of highway incident durations in the context of Hurricane Sandy. Accid. Anal. Prev. 74, 77–86.
Yan, X., Richards, S., Su, X., 2010. Using hierarchical tree-based regression model to predict train-vehicle crashes at passive highway-rail grade crossing. Accid. Anal. Prev. 42, 64–74.
Young, R.K., Liesman, J., 2007. Estimating the relationship between measured wind speed and overturning truck crashes using a binary logit model. Accid. Anal. Prev. 39 (3), 574–580.
Zhang, G., Yau, K.K.W., Chen, G., 2013. Risk factors associated with traffic violations and accident severity in China. Accid. Anal. Prev. 49, 18–25.
Zhang, Y., Xie, Y., Li, L., 2012. Crash frequency analysis of different types of urban roadway segments using generalized additive model. Accid. Anal. Prev. 43, 107–114.