Accident Analysis and Prevention 41 (2009) 202–209
Contents lists available at ScienceDirect
Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap
Correcting erroneous crash locations in transportation safety analysis Robert Tegge, Yanfeng Ouyang ∗ Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, United States
a r t i c l e
i n f o
Article history: Received 21 August 2008 Received in revised form 19 October 2008 Accepted 24 October 2008 Keywords: Crash location error Transportation safety analysis Lagrangian relaxation Network design model
a b s t r a c t Errors and inconsistencies in crash location records are realistic problems that often compromise accuracy of safety model outcomes. This paper proposes a new safety analysis framework where an optimal network design module that estimates the most-probable site for each crash is combined with the standard regression analysis. An effective solution algorithm based on Lagrangian relaxation is developed for the network design model, and an iterative computation approach is proposed for location estimation and statistical regression. The proposed models are implemented with empirical data and numerical results are presented. © 2008 Elsevier Ltd. All rights reserved.
1. Introduction In transportation safety analysis, a roadway network is often decomposed into individual sites, such as intersections, homogeneous or unit–length segments, or corridors. Statistical count models are normally developed to determine the expected crash frequencies at these sites, as well as how roadway, traffic, and environmental characteristics affect crash frequencies. Results from such models enable public agencies to locate high-crash sites in the roadway network, to identify significant causal factors that contribute to crashes, and to improve roadway safety via engineering measures, education, and law enforcements. The development of count models usually requires multiple types of data, such as crash-related information (e.g., severity, location, time), roadway site-related information (e.g., length, geometric design features, traffic volume), and environment-related information (e.g., weather, surface condition). These various types of raw data must then be temporally and spatially matched for each site in order to obtain input data for the dependent variable (e.g., the number of certain type of crashes at a site within the time horizon) and explanatory variables (e.g., segment length, traffic volume, weather). The accuracy of spatial information of both the crashes and the roadway sites is critical for the matching process. In practice, different types of data are usually collected and maintained by different departments or divisions of the transportation agencies. For example, crash information may be reported by the police department to the traffic safety agency, roadway
∗ Corresponding author. Tel.: +1 217 333 9858. E-mail address:
[email protected] (Y. Ouyang). 0001-4575/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.aap.2008.10.013
information may be available at the engineering or maintenance agency, and environmental information may be stored at a separate organization (e.g., national weather service). Errors and inconsistencies in spatial locations may occur for numerous reasons. First, crash location information is generally extracted from paper-based police reports whose accuracy highly depends on the experience of the police officer and data preparation personnel; e.g., an officer unfamiliar with the area may put down a wrong milepost. Second, crash information and road geometry datasets are often developed on separate software platforms that utilize different spatial coordinate systems (e.g., mileposts versus GIS stationing or GPS coordinates). Errors may occur when the datasets are being matched, since these spatial coordinates are often incompatible. This issue is especially exacerbated if the respective coordinate systems have gone through independent updates or redefinitions (e.g., roadway geometry realignment) over time. Finally, in practice many crashes are located at the borderline of two or more segments, probably not because they occur precisely there, but because such locations can be easily identified by police officers. In most safety analyses, each crash is usually assigned to one roadway site based on the researcher’s best judgment, or rather arbitrarily; this is likely to introduce data errors that may lead to flawed model output and erroneous policy implications. Hence, when crash or roadway locations are subject to error, a process to improve the accuracy of the location data is needed. To the authors’ best knowledge, however, no procedure exists to systematically and effectively correct for possible location errors in crash and roadway data. The closest literature might be Hausman et al. (1998), which proposes a maximum likelihood estimator that will correct for misclassifications of the dependent variables in
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
203
discrete-choice models. This study tries to fill this gap by formulating an optimal network design (location–allocation) module that estimates the most-probable site for each crash. An effective solution algorithm based on Lagrangian relaxation is also developed. This location–estimation module is an addition to the standard safety analysis framework, so that the likelihood of observing the potentially erroneous crash locations is approximated and maximized. The new framework conducts location estimation and statistical regression iteratively to improve safety analysis results. The remainder of the paper is organized as follows. Section 2 introduces the new safety analysis framework, including the network optimization model for crash assignment, the statistical regression models, and the iterative computation approach. Section 3 explains the Lagrangian relaxation algorithm for solving the network optimization problem. Section 4 presents numerical results of an empirical case study. Section 5 provides conclusions and suggestions for future research.
The second equality holds by assuming that pmn , the probability for a crash observed at location zn to actually have occurred on site m, is independent of m . This assumption is reasonable because most of the sources for location errors are unlikely to be correlated with the characteristics of the sites. The value of probability pmn shall be determined for all crash-site combinations, such that
2. Methodology
pmn =
pmn = 1, ∀n
As aforementioned, in crash frequency analysis, the standard practice is to define a roadway site (e.g., segments or intersections) m = 1, 2, . . ., M, and to obtain the observed count of crashes that have occurred on this site, yˆ m . Each of the M sites in the roadway network1 is spatially located around a center point, cm , and has an expected crash frequency m (unknown). Statistical regression models (e.g., Poisson, negative binomial or other variants)2 will be developed to estimate the set of {m } (as a function of various causal factors and attributes) that maximizes the following conditional probability function: Pr{ˆy1 , . . . , yˆ M |1 , . . . , M }. Suppose the observed (i.e., recorded) locations for N crashes are represented by a set of spatially distributed points on the roadway network, {z1 , . . ., zN }. The distance between any two points on the network, zn and cm , can be defined by a metric dmn ; common metric choices include Euclidean distance or network shortest-path distance. Since the observed crash locations are not necessarily accurate, a crash observed at location zn may actually occur on any site m (even if location zn appears to be outside of site m). Assume that the actual number of crashes on site m is ym , which is unknown. The probability for the observed locations, conditional on the expected crash occurrences, is given by Pr{z1 , . . . , zN |1 , . . . , M } =
exp(−dmn /lm )
M
Pr{y1 , . . . , yM |1 , . . . , M }
Pr{z1 , . . . , zN |y1 , . . . , yM }
∀{y1 ,...,yM }
Pr{y1 , . . . , yM |1 , . . . , M }
exp(−djn /lj )
, ∀m, n,
(2)
where lm is the “spatial size” of site m (e.g., segment length) that normalizes the distance metric between crash locations and segment centers. For the given observed crash locations, probability (1) should ideally be maximized with regard to {m }. This is generally very difficult. To simplify this, we propose to maximize Pr{z1 , . . . , zN |y1 , . . . , yM } Pr{y1 , . . . , yM |1 , . . . , M },
(3)
and estimate {m } with the following iterative method. Step 0: Initialize input parameters {dmn } and {pmn }. Let k = 1. Set tolerance ε. Obtain 01 , . . . , 0M by regression using erroneous or arbitrarily-determined observations, yˆ 1 , . . . , yˆ M . k based on z , . . ., z and Step 1: Estimate the most-likely y1k , ..., yM N 1
, . . . , k−1 . k−1 M 1 Step 2: Update k1 , . . . , kM by regression, using the updated crash k . counts y1k , . . . , yM
k−1 Step 3: If |km − k−1 m |/(m ) ≤ ε for all m, terminate the loop and output the most recent regression results; otherwise, let k = k + 1 and go to Step 1.
This new framework is also illustrated by the solid arrows in Fig. 1. The traditional statistical modeling approach is depicted by the dashed arrow. The following sections discuss Steps 1 and 2 in more detail; for notation simplicity, the superscripts (e.g., k) will be omitted. 2.2. (Steps 0 and 2) Determining 1 , . . ., M based on y1 , . . ., yM
Pr{z1 , . . . , zN |y1 , . . . , yM ; 1 , . . . , M }
∀{y1 ,...,yM }
=
pmn ≥ 0, ∀m, n.
Based on the empirical setting, different rules for defining pmn may be adopted. For example, if the observed location zn is known to be rather accurately located within site m1 , then we may simply / m1 . If location zn is right on let pm1 n = 1, and pmn = 0 for all m = the borderline between sites m1 and m2 , then we may let pm1 n = / m1 ,m2 . If location zn is off the pm2 n = 1/2, and pmn = 0 for all m = network (e.g., does not spatially overlap with any of the sites), we may choose to define pmn as a function of the distance metric dmn only; e.g., we may choose to use the following formula:
j=1
2.1. Model framework
and
m
(1)
1 Depending on the focus of the safety analysis, the roadway network may be defined to include a certain subset of the roadway sites (e.g. limited-access highway segments only). 2 A brief review of recent literature on count-data models can be found in Section 2.2.
This section provides basic background information on the standard statistical modeling techniques for crash frequency analysis. Washington et al. (2003) provide a detailed description of these modeling techniques. Readers with ample knowledge on this subject may skip this subsection without losing continuity. When y1 , . . ., yM are observed for a set of roadway sites, count models are usually developed to estimate the safety performance of the roadway sites as a function of explanatory variables. Most standard regression models assume that for roadway site m, the logarithm of the expected crash frequency can be written as the following linear function of q explanatory variables; i.e., ln(m ) = ˇ0 + ˇ1 Xm1 + ˇ2 Xm2 +, · · ·, +ˇq Xmq ,
(5)
204
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
Tarko (2005), and Lord (2006). The probability of observing ym crashes is (ym + 1/K) ym ! (1/K)
Pr(ym ) =
K ym m 1 + Km
1 1 + Km
1/K
,
and the log-likelihood function for all sites is LL(1 , 2 , . . . , M , K)
⎡ M y m −1 ⎣ = ln(1 + Kj) + ym ln(m ) m=1
j=0
− ym +
1 K
⎤ ln(1 + Km ) − ln(ym !)⎦
⎡ ⎤ M ym −1 ⎣ or LL(ˇ0 , ˇ1 , ..., ˇq , K) = ln(1 + Kj) − ln(ym !)⎦ m=1 M
+
j=0
ym (ˇ0 + ˇ1 Xm1 + ˇ2 Xm2 + · · · + ˇq Xmq )
m=1 M
−
m=1
Fig. 1. Traditional and proposed methodology to estimate {m }.
exp(−m )(m ) ym !
ym
and the log-likelihood function for all sites, using (5), is LL(1 , 2 , . . . , M ) =
M m=1
ym ln(m ) −
M
m −
m=1 M
LL(ˇ0 , ˇ1 , , . . . , ˇq )=
M
ln(ym !),
−
m=1
ln(1 + K exp(ˇ0 + ˇ1 Xm1 (7)
Numerical procedures (such as the Newton–Raphson method) are often used to estimate the regression coefficients ˇ0 , ˇ1 , ˇ2 , . . ., ˇq and K that maximize the log-likelihood (6) or (7), which also yields the expected crashes, {m }, from (5). Building on the abovementioned Poisson and negative binomial models, other variants of count models have also been developed for transportation safety analysis. Some of the wellknown examples include zero-inflated negative binomial models (e.g., Shankar et al., 1997), negative binomial with random effects models (e.g., Shankar et al., 1998), negative binomial with random parameters (Anastasopoulos and Mannering, 2008), and Conway–Maxwell–Poisson generalized linear model (Lord et al., 2008). 2.3. (Step 1) Estimating y1 , . . ., yM based on given z1 , . . ., zN and 1 , . . ., M
or
m=1
ym (ˇ0 + ˇ1 Xm1 + ˇ2 Xm2 + · · · + ˇq Xmq )
m=1 M
1 K
+ˇ2 Xm2 + · · · + ˇq Xmq ))
where Xm1 , Xm2 , . . ., Xmq are the explanatory variables at site m and ˇ0 , ˇ1 , ˇ2 , . . ., ˇq are the unknown regression coefficients. Popular crash count models include the Poisson regression model and the negative binomial regression model. For Poisson regression, the crash count variable is assumed to follow a Poisson distribution with mean m ; see Jovanis and Chang (1987), Jones et al. (1991), Wang et al. (1997), Vogt and Bared (1998) for some empirical applications. The probability of observing ym crashes is Pr(ym ) =
ym +
M
exp(ˇ0 + ˇ1 Xm1 + ˇ2 Xm2 + · · · + ˇq Xmq )−
We estimate y1 , . . ., yM from any given z1 , . . ., zN and 1 , . . ., M by optimizing (3), which, due to the monotonicity of logarithm functions, is equivalent to max
∀{y1 ,...,yM }
ln(ym !)
ln[Pr{z1 , ..., zN |y1 , ..., yM }] + ln[Pr{y1 , ..., yM |1 , ..., M }] (8)
(6)
m=1
Poisson distribution requires that the mean of the count variable is equal to the variance, which is often found to be restrictive. Negative binomial regression allows for overdispersion (i.e., variance larger than the mean) by assuming that the count variable follows a negative binomial distribution with mean m and variance i + K(i )2 , where K is the overdispersion parameter. There have been extensive studies that are based on negative binomial regression; see for example, Bonneson and McCoy (1993), Hadi et al. (1995), Shankar et al. (1995), Poch and Mannering (1996), Council and Stewart (1999), Abdel-Aty and Radwan (2000), Savolainen and
Define location assignment variable
xmn =
1, 0,
if the observation at zn actually occurs on site m . otherwise
Then, conditional on any y1 , . . . , yM , Pr{z1 , ..., zN |y1 , ..., yM } = pmn xmn , or equivalently
∀m,n
ln[Pr{z1 , ..., zN |y1 , ..., yM }] =
M N n=1 m=1
ln(pmn )xmn
(9)
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
This is subject to the constraint that the total number of crash assignments to site m should equal to the actual crash number on site m; i.e., ym =
xmn , ∀m.
n
We assume that, for given 1 , . . ., M , the actual occurrence of crashes on each site is mutually independent and each follows a Poisson distribution. Then, Pr{y1 , . . . , yM |1 , . . . , M } =
M e−m ym m
m=1
ym !
.
Obtaining exact solutions for complicated optimization problems such as (11a)–(11f) can take excessive computation time. Lagrangian relaxation based algorithms have been known to solve similar problems efficiently (Fisher, 1981; Daskin, 1995). The basic idea of the Lagrangian relaxation method is to relax the original problem by removing certain hard constraints while adding penalty terms to the objective function for any violations to the removed constraints. The solution to the relaxed problem will provide a near-optimum solution (sometimes exactly optimal) and an optimality gap for the original problem. The following section shows the implementation of Lagrangian relaxation to solve problem (11a)–(11f).
ln[Pr{y1 , ..., yM |1 , ..., M }] M
[−m + ln(m ) ym − ln(ym !)]
m=1
≈
M
[−m + ln(m ) ym − (ym ln(ym ) − ym + 1)]
m=1
=
M
known to be computationally intractable (i.e., NP-hard). For practical applications, the number of crashes N and the number of sites M are both quite large. Due to such a large scale, and nonlinearity in the objective function, existing commercial software such as CPLEX (or MINOS) may have difficulty solving this problem efficiently. We hence develop a Lagrangian relaxation (LR)-based heuristic algorithm to solve this optimization problem. This will be introduced in the next section. 3. Lagrangian relaxation approach for location estimation
Thus, by Stirling’s approximation,3
=
205
[−m − 1 + ym + ym ln(m ) − ym ln(ym )]
(10)
3.1. Lagrangian relaxation subproblems
m=1
Substituting (9) and (10) into (8), and omitting constant terms in the objective, we now have the following mixed integer nonlinear optimization problem. max y1 , ..., yM {xmn }
M N
ln(pmn )xmn +
n=1 m=1
M
[ym + ym ln(m ) − ym ln(ym )]
m=1
We relax the M constraints in (11b) and move them to the objective function (11a) with a set of unrestricted multipliers u: = {um : ∀m}. This yields the following LR subproblem: max z(u) = y1 , ..., yM {xmn }
M N
(11a) + s.t.
ym =
M
[ym + ym ln(m ) − ym ln(ym )]
m=1
xmn , ∀m
(11b) +
n
ym = N xmn = 1, ∀n
(11d)
xmn ∈ {0, 1}, ∀m, n
(11e)
ym ∈ {0, 1, 2, . . .}, ∀m
(11f)
3 This approximation may not be accurate when ym is very small, especially when ym = 0. Alternatively, we could avoid the approximation and directly use ln(ym !); the proposed modeling framework will still work. Note that ln(ym !) is a convex function of ym and hence the objective function (8) will remain concave.
um
ym −
xmn
n
s.t. (11c)–(11f). Note that, for any given u, the subproblem can be separated into the following two problems, which can be solved independently.
m
Note that constraints (11d) ensure that each observed crash is assigned to exactly one site. Constraints (11c) are redundant given (11b), but Section 3.1 will show that we shall keep them in the formulation for computational efficiency. This mathematical program, (11a)–(11f), resembles a generalized version of the discrete facility location problem, where ln(pmn ) corresponds to the “access cost” for customer at n to be served by a “facility” at m. The nonlinear cost for operating the facility depends on the number of “customers” served by it. This type of problem is
m
(11c)
m
ln(pmn )xmn
n=1 m=1
max {xmn }
N M
[ln(pmn ) − um ]xmn
s.t. (11d) − (11e),
(S1)
m=1 n=1
and max
y1 ,...,yM
M
[ym + ym ln(m ) − ym ln(ym ) + ym um ]
m=1
s.t. (11c) and (11f).
(S2)
Both (S1) and (S2) can be solved efficiently. Problem (S1) is a straightforward optimization problem. For each n, we pick xmn = 1 only for m = arg max[ln(pjn ) − uj ], and break ties arbitrarily. Ignorj
ing integrality constraints (11f), (S2) is equivalent to a simple constrained convex optimization problem, where a concave function is maximized over a convex set. The first-order conditions are
206
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
⎧ ) + um = constant, ∀m, ⎪ ⎨ ln(m ) − ln(ym M ym = N, ⎪ ⎩
accordingly. The sequential heuristic algorithm can be summarized as follows.
m=1
and it is to see that trivial the (non-integer) optimal solution is ym = N eum m / k euk k , ∀m. For simplicity, we then round ym to the nearest integer while ensuring that
M
ym = N is satisfied.
Step H0: Define the set of unassigned crashes R: = {1, 2, . . ., N}; initialize temporary crash counts, wm : = 0 for all m. Step H1: If cardinality |R| > 0, find (m, n):=arg max{pmn |n ∈ R, wm < ym } and break ties arbitrarily; let xmn :=1, wm :=wm + 1, R:=R\{n}; repeat Step H1. Step H2: If |R| = 0, terminate and output {xmn }.
m=1
This is reasonable because the objective function in (S2) is obtained by Stirling’s approximation, and it is likely to be flat around the optimum. 3.2. Lagrangian multiplier updates For each given u, the optimization of the LR subproblem z(u) yields an upper bound of the optimal objective value of the original problem. It is therefore desirable to seek the smallest possible upper bound z(u) in the hope that it will possibly equal (or be very close to) the true optimum. Mathematically, we solve another optimization problem regarding the nonnegative multipliers, u: Min z(u) u
M s.t. u ∈ R+
In the above minimization, although the multipliers are allowed to be any real numbers, we force them to be non-negative for computational efficiency (Daskin, 1995). We solve this problem using the conventional subgradient method. The major steps are as follows: Set the initial multipliers um = 0 for all m. At each iteration j, Let uj be the current multiplier and tj > 0 the current step size: (1) Update the multipliers:
j+1
j
0, um − t j
um = max
ym −
N
xmn
, ∀m.
n=1
(2) Update step size: t j+1 =
j (z(uj ) − z LB )
M m=1
ym −
N
2 ,
x n=1 mn
where j is a control parameter normally starting with a value of 2, and zLB is the best lower bound since the first iteration (i.e., the objective value of the best feasible solution to the original problem). If z(u) does not improve in a specified number of iterations, decrease the value of j slightly. Within each iteration, the optimum solution to the relaxed LR subproblem automatically gives an upper bound of the optimal objective value of the original problem. If this solution (to the LR subproblem) turns out to satisfy constraints (11b), i.e., be feasible to the original problem, then it is also the optimal solution to the original problem; the iteration stops. Otherwise, the multipliers’ values are updated and the iteration continues. After each iteration, if the exact optimum solution is not yet found, a heuristic feasible solution can be constructed to provide (or update) a lower bound to the original problem. To do this, we simply keep the values of all {ym } variables that are determined from the LR subproblem, then re-assign values to the {xmn } variables
This solution is feasible for the original problem and must yield a lower bound. If the lower bound equals the upper bound at any iteration, then the optimal solution is found. Otherwise, the difference between these bounds provides an optimality gap—the difference between the true optimum and the feasible solution is sure to be no larger than this gap, since the true optimal solution must be between the upper and lower bounds. 4. Case study The modeling framework and solution approaches proposed in Sections 2 and 3 are tested with empirical roadway and crash data from the Illinois Department of Transportation (IDOT). IDOT used to maintain safety-related data at separate divisions, on different database platforms. Although many efforts have been made in recent years to improve data quality, the datasets are still subject to multiple sources of errors. Preliminary analysis has demonstrated many incorrect crash locations that will clearly reduce the accuracy of safety analysis. The following sections further describe the empirical setting, the implementation of the proposed framework, and the numerical results. 4.1. Empirical setting The IDOT IRIS (Illinois Roadway Information System) database keeps records of 60,238 roadway segments (total length 13,607.2 miles) on state-maintained interstates, United States (US) routes, and Illinois (IL) routes. In this study, roadway sites are defined as segments with homogeneous geometry (with variable lengths).4 The dataset incorporates information on various roadway characteristics, traffic conditions, and administrative details. The summary of quantitative and categorical variables are presented in Tables 1 and 2, respectively. For demonstration purposes, this study focuses on 3120 roadway segments in five central Illinois counties (Champaign, Douglas, Edgar, Piatt, and Vermilion), which includes 604 Interstate segments, 1515 US route segments, and 1001 IL route segments. During 2001–2005, 14,010 crashes are documented on the state-maintained routes in the five central Illinois counties. IDOT provides historical records of both quantitative and categorical variables. The quantitative variables include crash location, number of vehicles, number of injuries, number of fatalities, time, and year. The dataset contains a significant amount of categorical variables that includes collision type, weather, lighting, county, township, route name, traffic control, vehicle type, vehicle direction, vehicle maneuver, event, case-ID, month, and day of week. Preliminary data matching based on the roadway and crash databases reveals that 55.6% of the crashes either are located on the borderline of multiple sites, or do not overlap with the existing roadway network.
4 The sites can be defined in various ways (e.g. equal-length segments, intersections). It shall be obvious that the choice of site definition will not affect the implementation of the proposed framework.
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
207
Table 1 Descriptive statistics of continuous variables. Variable
Mean
Standard deviation
Minimum
Maximum
Average annual daily traffic (veh/day) Inside type 1 shoulder width (feet) Inside type 2 shoulder width (feet) Outside type 1 shoulder width (feet) Outside type 2 shoulder width (feet) IRI (in./mile) Number of lanes Lane width (feet) Median width (feet) Rut depth (inch) Speed limit (mph) Number of intersections per segment Average daily traffic on intersecting roads Segment length (miles)
9810.00 1.64 0.02 4.49 0.73 105.21 2.75 12.51 13.14 0.09 50.87 0.68 1023.00 0.22
9156.00 3.18 0.18 3.96 1.43 3.71 0.97 2.75 23.69 0.06 10.57 1.20 5424.00 0.39
650 0 0 0 0 0 1 9 0 0 25 0 0 0.01
54,700 11 2 13 10 230 4 27 88 0.27 65 24 267,618 5.19
4.2. Implementation Negative binomial regression is implemented in SAS 9.2 GENMOD. Since the purpose of this example is not to present detailed safety implications, we choose to stay focused and include only some explanatory variables in the analysis: segment length, AADT, area type (urban or rural), access control (fully controlled or not Table 2 Categorical variables. Access control Uncontrolled Partially controlled Fully controlled Functional classification Interstate Urban freeway and expressway Other principal arterial Minor arterial (rural) Major collector (rural) Minor collector (rural) Local road or street (rural) Minor arterial (urban) Collector (urban) Local road or street (rural) Surface Portland cement concrete Asphalt concrete Area type Urban Rural
fully controlled), median type (no median or with median), and speed limit. These selected variables are widely used in the literature, and have been found to be the most statistically significant for Illinois (Tegge and Ouyang, 2007). The model specification is m = exp(ˇ0 )(segment length)m (AADTm )ˇ1 exp[ˇ2 (area type)m + ˇ3 (access control)m + ˇ4 (median type)m + ˇ5 (speed limit)m ]
where m is the expected crash frequency per five years per mile on segment m and ˇ0 , ˇ1 , . . ., ˇ5 are parameters. The optimization component of the procedure is conducted in MATLAB 7.0. Probability {pmn } is computed based on the rules suggested in Section 2.1. While applying (2) to roadway segments, up to three adjacent segments (i.e., upstream and downstream) of the observed crash locations are considered. 4.3. Numerical results Implementation of the proposed methodology produces significant convergence after just a few iterations, when the maximum percentage change of expected crash frequency between consecutive iteration drops below ε = 2%. Fig. 2 shows the performance of the Lagrangian relaxation procedure, where the upper and lower bounds after the third iteration remain relatively constant. Fig. 3 illustrates the log-likelihood values of the negative binomial regression models over iterations. The value increases from −7687 (for the initial “arbitrary” crash location assignment) to about −6700 after approximately only four iterations.
One or two way Shoulder type No shoulder Earth Sod Aggregate Surface treated Bituminous Concrete (untied) Concrete (tied) “V” gutter Curb and gutter Median type No median Unprotected Curbed Positive barrier Rumble strip Painted High tension cable M-2.12 transversable median
(12)
Fig. 2. Upper and lower bounds from the Lagrangian relaxation method.
−0.8041 (0.1501) −0.0546 (0.0029) −0.0543 (0.0030) −0.8724 (0.0985) −0.0526 (0.0030) −0.0521 (0.0030) −0.0522 (0.0030) −0.0526 (0.0030)
Speed limit
3.4789 (0.0960) 1.4067 (0.0410) 1.5966 (0.0467) 1.5270 (0.0450) 1.5724 (0.0463) 1.5600 (0.0461) 1.5901 (0.0469) 1.5395 (0.0455)
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209 Overdispersion parameter K (per mile)
208
−0.8041 (0.1501) −0.8390 (0.0952) −0.8172 (0.0997) −0.8724 (0.0985) −0.8859 (0.0997) −0.9039 (0.0993) −0.9041 (0.1006) −0.9159 (0.0990) 0.2794 (0.0990) 0.1690 (0.0653) 0.1258 (0.0700) 0.1274 (0.0687) 0.1160 (0.0694) 0.1052 (0.0694) 0.1121 (0.0700) 0.1123 (0.0688) 0.9448 (0.0545) 1.2443 (0.0374) 1.3236 (0.0401) 1.3661 (0.0395) 1.3822 (0.0402) 1.3966 (0.0403) 1.3916 (0.0406) 1.4045 (0.0401) −2.7266 (0.5298) −5.2732 (0.3556) −5.9879 (0.3789) −6.4104 (0.3741) −6.5880 (0.3801) −6.7349 (0.3812) −6.6818 (0.3838) −6.7775 (0.3790) 0 1 2 3 4 5 6 7
Area type (0 = rural, 1 = urban) Log(AADT) Intercept Iteration number
Table 3 Regression coefficients across iterations.
Fig. 4. Maximum percentage change in m .
Median type (0 = no median, 1 = with median)
Fig. 4 illustrates the maximum percentage change in expected crash frequencies. The results indicate that crash re-assignments during the first few iterations have the largest impact on the estimated crash frequencies. After the sixth iteration, the algorithm converged as the percentage change dropped below the tolerance, 2%. As the algorithm converges, so do the predictive capabilities of the regression model. Table 3 lists the parameter estimates and the standard deviations (in parentheses) across iterations. Obviously, the parameter estimates become quite stable after only four iterations; this is consistent with the algorithm convergence shown in Figs. 3 and 4. On the other hand, Table 3 clearly shows significant changes in parameter estimates before and after the correction of location errors (i.e., iteration 0 versus iteration 7). For example, before the correction, the coefficients of the log(AADT) and the median type variables are 0.9448 and 0.2794, respectively. However, after the correction, the coefficient of log(AADT) increases significantly to 1.4045, while the coefficient of the median type variable almost drops by half to 0.1123. This verifies our intuition that the proposed modeling framework has the potential to significantly impact safety analysis results. Without the error correction, the statistical model output could overestimate or underestimate the impacts of various safety factors on the accident likelihood. Although our regression model is not intended for complete safety implications, the parameter estimates at convergence are consistent with our expectations. All coefficient estimates are sta-
0.4437 (0.0960) 0.3724 (0.0613) 0.3644 (0.1258) 0.3599 (0.0631) 0.3591 (0.0642) 0.3575 (0.0639) 0.3507 (0.0646) 0.3435 (0.0634)
Access control (0 = no control, 1 = full control)
Fig. 3. Log-likelihood values from the negative binomial regression.
R. Tegge, Y. Ouyang / Accident Analysis and Prevention 41 (2009) 202–209
tistically significant. Traffic volume has a positive impact on crash frequency as expected. The positive parameter estimate for the area type variable indicates that urban areas have higher crash expectations than rural areas. Similarly, the existence of a median increases the likelihood of crashes, while the presence of access control lowers it. Finally, higher speed limits seem to decrease the expected crash frequency. This is probably because the speed limit variable is usually highly correlated with the AADT and access control variables. 5. Conclusions Safety analysis models are usually built upon spatially and temporally matched crash and roadway datasets. Often, location errors occur due to problems in data collection, management and processing procedures. Such errors may reduce the accuracy of the estimated regression parameters. This study presents a systematic methodology to correct for errors in crash locations so as to create more accurate safety models for crash frequency estimation. It formulates an optimal network design module and develops a Lagrangian relaxation-based solution algorithm to estimate the most-probable location for each crash. Standard statistical regression models can be constructed based on the estimated crash locations in an iterative manner. The numerical example demonstrates fast convergence and satisfactory results. As explained in the paper, the proposed framework is very flexible with respect to the choices of roadway types, crash types, site definitions, and even statistical regression models. The iterative location correction procedure can be integrated with any chosen statistical modeling module. In other words, the introduction of the proposed modeling framework shall not restrict the application of statistical analysis models that a researcher would choose. The proposed framework may be extended or applied in several directions. For example, the crash location re-assignment module is very flexible. Similar ideas may be potentially applicable to other safety analysis contexts (e.g., severity analysis). The proposed rules for computing {pmn }, e.g. (2), may not be the suitable for all empirical applications. Eq. (2) for example may not yield satisfactorily high probabilities for segments that are adjacent to the observed crash location. The best formulas for {pmn }, and the possibility of estimating {pmn } within the location reassignment framework may be worthy of further investigation. Acknowledgments This work was partially supported by the Illinois Department of Transportation through the Illinois Center for Transportation research project ICT-R27-20. The contents of this paper reflect the view of the authors, who are responsible for the facts and the accuracy of the data presented herein. The contents do not necessarily reflect the official views or policies of the Illinois Center for Trans-
209
portation, the Illinois Department of Transportation, or the Federal Highway Administration. This report does not constitute a standard, specification, or regulation. The helpful comments of Professor Fred Mannering (Purdue University) and two anonymous reviewers are gratefully acknowledged. References Abdel-Aty, M.A., Radwan, A.E., 2000. Modeling traffic accident occurrence and involvement. Accident Analysis and Prevention 32 (5), 633–642. Anastasopoulos, P.C., Mannering, F., 2008. A note on modeling vehicle-accident frequencies with random-parameters count models. Working Paper, Purdue University. Bonneson, J., McCoy, P., 1993. Estimation of safety at two-way STOP-controlled intersections on rural highways. Transportation Research Record 1401, 83–89. Council, F., Stewart, J., 1999. Safety effects of conversion of rural two-lane to fourlane roadways based on cross-sectional models. Transportation Research Record 1665, 35–43. Daskin, M.S., 1995. Network and Discrete Location: Models, Algorithms and Applications. Wiley, New York, USA. Fisher, M.L., 1981. The Lagrangian relaxation method for solving integer programming problems. Management Science 27, 1–18. Hadi, M.A., Aruldhas, J., Lee–Fang Chow, Wattleworth, J.A., 1995. Estimating safety effects of cross-section design for various highway types using negative binomial regression. Transportation Research Record 1500, 169–177. Hausman, J.A., Abrevaya, J., Scott-Morton, F.M., 1998. Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics 87, 239–269. Jones, B., Janssen, L., Mannering, F., 1991. Analysis of the frequency and duration of freeway accidents in Seattle. Accident Analysis and Prevention 23 (2), 239–255. Jovanis, P.P., Chang, H.L., 1987. Modeling the relationship of accidents to miles traveled. Transportation Research Record 1068, 42–51. Lord, D., 2006. Modeling motor vehicle crashes using Poisson-gamma models: examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis and Prevention 38 (4), 751–766. Lord, D., Guikema, S.D., Geedipally, S.R., 2008. Application of the Conway–Maxwell–Poisson generalized linear model for analyzing motor vehicle crashes. Accident Analysis and Prevention 40 (3), 1123–1134. Poch, M., Mannering, F., 1996. Negative binomial analysis of intersection–accident frequencies. Journal of Transportation Engineering 122 (2), 105–113. Savolainen, P.T., Tarko, A.P., 2005. Safety impacts at intersections on curved segments. Transportation Research Record 1908, 130–140. Shankar, V., Mannering, F., Barfield, W., 1995. Effect of roadway geometrics and environmental factors on rural accident frequencies. Accident Analysis and Prevention 27 (3), 371–389. Shankar, V., Milton, J., Mannering, F., 1997. Modeling accident frequencies as zeroaltered probability processes: an empirical inquiry. Accident Analysis and Prevention 29, 829–837. Shankar, V., Albin, R., Milton, J., Mannering, F., 1998. Evaluating median cross-over likelihoods with clustered accident counts: an empirical inquiry using the random effects negative binomial model. Transportation Research Record 1635, 44–48. Tegge, R.A., Ouyang, Y., 2007. Multivariate analysis of crash data for Illinois. Research Report, ICT R27-20 (Phase II). Illinois Department of Transportation. Vogt, A., Bared, J., 1998. Crash Models for Two-Lane Rural Roads: Segments and Intersections. FHWA-RD-98-133, Federal Highway Administration, Washington, DC. Wang, J., Hughes, W., Stewart, J., 1997. Safety effects of cross-section design on rural multilane highways. HSIS summary report, FHWA Publication FHWA-RD-97027, Federal Highway Administration, Washington, DC. Washington, S.P., Karlaftis, M.G., Mannering, F.L., 2003. Statistical and Econometric Methods for Transportation Data Analysis. Chapman & Hall/CRC.