Data mining of tree-based models to analyze freeway accident frequency


Journal of Safety Research 36 (2005) 365 – 375 www.elsevier.com/locate/jsr

www.nsc.org

Li-Yen Chang *, Wen-Chieh Chen

Graduate Institute of Transportation and Logistics, National Chia-Yi University, 300 University Road, Chia-Yi, 600, Taiwan

Received 18 October 2004; received in revised form 25 April 2005; accepted 15 June 2005

Abstract

Introduction: Statistical models, such as Poisson or negative binomial regression models, have been employed to analyze vehicle accident frequency for many years. However, these models have their own model assumptions and pre-defined underlying relationship between dependent and independent variables. If these assumptions are violated, the model could lead to erroneous estimation of accident likelihood. Classification and Regression Tree (CART), one of the most widely applied data mining techniques, has been commonly employed in business administration, industry, and engineering. CART does not require any pre-defined underlying relationship between target (dependent) variable and predictors (independent variables) and has been shown to be a powerful tool, particularly for dealing with prediction and classification problems.

Method: This study collected the 2001–2002 accident data of National Freeway 1 in Taiwan. A CART model and a negative binomial regression model were developed to establish the empirical relationship between traffic accidents and highway geometric variables, traffic characteristics, and environmental factors.

Results: The CART findings indicated that the average daily traffic volume and precipitation variables were the key determinants for freeway accident frequencies. By comparing the prediction performance between the CART and the negative binomial regression models, this study demonstrates that CART is a good alternative method for analyzing freeway accident frequencies.

Impact on industry: By comparing the prediction performance between the CART and the negative binomial regression models, this study demonstrates that CART is a good alternative method for analyzing freeway accident frequencies.

© 2005 National Safety Council and Elsevier Ltd. All rights reserved.

Keywords: Accident frequency; Freeway; Data mining; Classification and regression trees (CART); Negative binomial regression

1. Introduction

Traffic accidents have been one of the top 10 leading causes of death and injury in Taiwan. Each year about 3,000 people are killed and thousands are injured in traffic accidents in Taiwan. In addition, traffic accidents often result in enormous costs to society, including excessive delay for roadway users and public property damage. Therefore, each year traffic authorities invest considerable effort in reducing traffic accidents (e.g., geometry improvement, traffic control, and enforcement). However, the annual number of traffic accidents remains nearly the same from year to year. Clearly, there is an increasing need for efficient methodologies for identifying the risk factors for accidents.

* Corresponding author. Tel.: +886 5 2717982; fax: +886 5 2717981. E-mail address: [email protected] (L.-Y. Chang).

Regression analysis (e.g., linear regression models, Poisson regression and/or negative binomial regression models) has been the most popular technique in traffic safety analysis because the relationship between accidents and risk factors can be clearly identified. With this information, the accident-prone areas can be located by the traffic engineers and mitigation or safety measures, such as illumination and enforcement, can then be effectively applied. However, most regression models have their own model assumptions and pre-defined underlying relationship between dependent and independent variables. If these assumptions are violated, the model could lead to erroneous estimation of accident likelihood. Classification and Regression Tree (CART), one of the most widely applied data mining techniques, has been commonly employed in business administration, medicine, industry, and engineering (Hui & Jha, 2000; Marshall, 2001; Breault, Goodall, & Fos, 2002; Bevilacqua, Braglia, & Montanari, 2003; Yang et al.,

0022-4375/$ - see front matter © 2005 National Safety Council and Elsevier Ltd. All rights reserved. doi:10.1016/j.jsr.2005.06.013


L.-Y. Chang, W.-C. Chen / Journal of Safety Research 36 (2005) 365 – 375

2003; Pendharkar, 2004; Fu, 2004). CART is a nonparametric model that requires neither a specified functional form nor the assumption of additivity of predictors; it has also been shown to be a powerful tool, particularly for dealing with prediction and classification problems. However, applications of CART to traffic safety problems have been relatively few. Therefore, the objective of this study is to examine whether the CART model can be employed to analyze the relationship between risk factors and accidents. This is done by comparing the analysis results of the CART model with those of a negative binomial regression model. The paper begins with a brief review of previous literature on modeling accident frequencies and then presents the methodological approaches, followed by a description of the available data and an assessment of the model estimation results. The paper concludes with a summary and directions for future research.

2. Literature review

Past research on analyzing accident frequencies mainly relied on statistical models because the occurrence of accidents on a highway section can be regarded as a random variable. From a methodological perspective, the analysis approaches ranged from multiple linear regression models to generalized linear models such as Poisson and negative binomial regression models. Because of the distributional properties (i.e., random, discrete, and nonnegative) of vehicle accidents on a specified roadway segment, Poisson or negative binomial regression models have been commonly employed to model accident frequencies (Miaou, 1994; Poch & Mannering, 1996; Hadi, Aruldhas, Chow, & Wattleworth, 1995; Shankar, Mannering, & Barfield, 1995; McCarthy, 1999; Carson & Mannering, 2001; Martin, 2002). More recently, zero-inflated Poisson and zero-inflated negative binomial models have also been employed to analyze accident frequencies and to deal with the overdispersion problem potentially caused by excess zeros in traffic accident data (Shankar, Milton, & Mannering, 1997; Lee & Mannering, 2002; Lee, Stevenson, Wang, & Yau, 2002). The application of zero-altered counting processes allows roadway accident frequencies to be modeled in two states: a zero-accident state (where no accidents will ever be observed) and an accident state (where accident frequencies follow a Poisson or negative binomial distribution). The findings show that the zero-altered probability process provides great flexibility in uncovering processes affecting accident frequencies on roadway sections with zero accidents and those observed with accidents.

From an empirical standpoint, many researchers (Poch & Mannering, 1996; Shankar et al., 1995; Milton & Mannering, 1998; Ivan, Wang, & Bernardo, 2000; Carson & Mannering, 2001) attempted to estimate the accident frequency (or rate) for a highway section or intersection by identifying the various nonbehavioral factors affecting the

accident frequency. These nonbehavioral factors included highway geometric variables (e.g., horizontal and vertical alignments, median types, or shoulder width), traffic characteristics (e.g., hourly volume, AADT, or percentage of trucks), and environmental conditions (e.g., land use, pavement conditions, light conditions, or weather conditions). For example, Ivan et al. (2000) investigated the effects of traffic density, land use, light conditions, and time of day on single and multi-vehicle crash rates on two-lane highways. The findings indicate that time of day, volume/ capacity ratio, percent of no passing zone, shoulder width, number of intersections, and driveways have significant influence on single-vehicle crashes while time of day, number of intersections, and driveways significantly affect multi-vehicle crashes. Carson and Mannering (2001) examined the effect of warning signs on ice-accident frequency and estimated three separate models to analyze ice-accident frequency for interstate freeways, principal arterial state highways, and minor arterial state highways. The findings indicate that spatial (e.g., urban), roadway (e.g., shoulder width, grade), and traffic characteristics (e.g., AADT, truck percentage) have significant effects on the accident frequency. McCarthy (1999), on the other hand, focused on the effectiveness of public policy (e.g., traffic regulations, alcohol availability, and enforcement) in reducing fatal accidents. The results indicated that traffic enforcement has larger beneficial effects on the incidence of fatal accidents. Data mining has been an active analytical technique in many scientific areas for many years. These areas range from business, industry, medicine, and agriculture to engineering (Shaw, Subramaniam, Tan, & Welge, 2001; Rygielski, Wang, & Yen, 2002; Valafar & Valafar, 2002; Breault et al., 2002). 
Among the data mining techniques, decision trees and rules, nonlinear regression and classification methods, example-based methods, probabilistic graphical dependency models, and relational learning models have been the most popular (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996). However, applications of data mining techniques to transportation-related problems are relatively few. In the field of safety analysis, some studies applied tree-based models to analyze accident rates and injury severity problems. For example, Kuhnert, Do, and McClure (2000) employed logistic regression, CART, and multivariate adaptive regression splines (MARS) to analyze motor-vehicle injury data. By comparing their results with those from logistic regression, one of the most widely used analysis techniques in epidemiological studies, they demonstrated that CART and MARS, which are capable of graphically displaying the analysis results and identifying the groups of people with potentially high accident risk, are informative and attractive models for motor-vehicle accident analysis. They also suggested that CART and MARS can be used as a precursor to a more detailed logistic regression. Karlaftis and Golias (2002) applied hierarchical tree-based regression (HTBR) to analyze the effects of road geometric and traffic characteristics on accident rates for rural two-lane


and multilane roads. Their study concluded that HTBR, a nonparametric model that makes no assumption about the functional form of the model, has both theoretical and applied advantages over multiple linear and negative binomial regression models (parametric models) in analyzing highway accident rates.

3. Methodology

3.1. Classification and regression trees (CART)

CART analysis is critical for prediction problems. When the target variable is discrete, a classification tree is developed, whereas a regression tree is developed for a continuous target variable. Since this study aims to model the number of accidents occurring on a highway section over a one-year period, a classification tree is developed. The development of a CART model consists of three main steps. The first step is tree growing. The principle behind tree growing is to recursively partition the target variable to minimize "impurity" in the terminal nodes. The impurity of a node for a classification tree can be defined as

$$ i(t) = \phi\big(p(1|t),\, p(2|t),\, \ldots,\, p(j|t)\big) \tag{1} $$

where $i(t)$ is a measure of impurity of node $t$, $p(j|t)$ is the node proportion (i.e., the proportion of cases in node $t$ belonging to class $j$), and $\phi$ is a nonnegative function. The measure of node impurity by the Gini criterion, CART's default, is defined as

$$ i(t) = \sum_{j \neq i} p(i|t)\, p(j|t). \tag{2} $$
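As a minimal sketch of the Gini criterion in Eq. (2), the Python function below computes the impurity of a single node from the class labels of its cases. It uses the identity that the sum over pairs of different classes equals one minus the sum of squared class proportions. The function name `gini_impurity` and the two small label lists are assumptions made for illustration and are not taken from the study data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: sum over i != j of p(i|t) * p(j|t).

    This equals 1 - sum_j p(j|t)^2, which is the form computed here.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(p * p for p in proportions)


# Hypothetical nodes: one containing a single class, one evenly mixed.
pure_node = [0, 0, 0, 0]
mixed_node = [0, 0, 1, 1]
print(gini_impurity(pure_node))
print(gini_impurity(mixed_node))
```

A node holding cases of a single class has zero impurity, while an even two-class mix gives the two-class maximum of 0.5.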

The partitioning is done by searching all possible threshold values for all input variables (splitters) to find the threshold that leads to the greatest improvement in the purity score of the resultant nodes. In other words, the goodness of a split is defined as the impurity decrease between the parent node and its children:

$$ \Delta i(s,t) = i(t) - p_R\, i(t_R) - p_L\, i(t_L) \tag{3} $$

where $s$ is a candidate split, and $p_L$ and $p_R$ are the proportions of observations of the parent node $t$ that go to the child nodes $t_L$ and $t_R$, respectively. The best splitter is the one that maximizes $\Delta i(s,t)$. For a binary target variable, such as a highway section with or without accident(s), tree growing aims to group all the highway sections into two groups: a group of highway sections with accidents and the other group without accidents. As shown in Fig. 1, the best split can be reached by first splitting the subjects into groups, depending on whether or not the traffic volume of the highway section is more than c1 vehicles, and the section with traffic volume more than c1 vehicles is further split into two groups, depending on whether or not the vertical alignment of

Fig. 1. The principle of splitting.
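The threshold search behind Eq. (3) and Fig. 1 can be sketched for a single predictor as follows. This is an illustrative simplification, not the CART software used in the study: candidate thresholds are taken as midpoints between consecutive distinct values, and the names `gini`, `best_threshold`, `volume`, and `accident` as well as the data values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node (Eq. (2))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, impurity_decrease) maximizing Eq. (3) for one predictor."""
    parent_impurity = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        c = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= c]
        right = [y for x, y in zip(values, labels) if x > c]
        decrease = (parent_impurity
                    - len(left) / n * gini(left)
                    - len(right) / n * gini(right))
        if decrease > best[1]:
            best = (c, decrease)
    return best


# Hypothetical sections: traffic volume (in 1000's) and accident indicator.
volume = [10, 12, 15, 30, 35, 40]
accident = [0, 0, 0, 1, 1, 1]
threshold, decrease = best_threshold(volume, accident)
print(threshold, decrease)
```

In a full tree-growing procedure the same search would be repeated over every predictor and then applied recursively to each child node.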


highway sections is more than c2%. Likewise, for a target variable with j categories, CART tries to obtain nodes that contain as many subjects as possible belonging to only one of the j categories. With the splitting rule, the target variable (root node) is first split into two child nodes, and the child nodes themselves are split again. At the end of tree growing, a saturated tree is obtained, in which each terminal node contains one case or cases of a single category.

The second step is tree "pruning." Pruning is a mechanism to create a sequence of simpler trees by cutting off increasingly important nodes. The pruning process starts with the saturated tree; selectively pruning upward produces a sequence of sub-trees of the saturated tree, which eventually collapses to the tree of the root node. The pruning relies on a complexity parameter, which can be calculated through a cost function of the misclassification of data and the size of the tree. To define the complexity parameter and the cost-complexity measure, one can start by defining the misclassification cost (or rate) for a node and a tree. The node misclassification cost can be defined as

$$ r(t) = 1 - p(j|t), \tag{4} $$

and the tree misclassification cost can be defined as

$$ R(T) = \sum_{t \in \tilde{T}} r(t)\, p(t). \tag{5} $$

The cost-complexity measure for each subtree $T$, $R_\alpha(T)$, can be defined as

$$ R_\alpha(T) = R(T) + \alpha\, |\tilde{T}| \tag{6} $$

where $|\tilde{T}|$ is the tree complexity, which is equal to the number of terminal nodes of the subtree, and $\alpha$ is the complexity parameter, which measures how much additional accuracy must be added to the tree to warrant additional complexity. During the pruning process, the value of $\alpha$ is gradually increased from 0 to 1. For each value of $\alpha$, a subtree $T(\alpha)$ can be found that minimizes $R_\alpha(T)$. The larger $\alpha$ becomes, the smaller $|\tilde{T}|$ must be to minimize $R_\alpha(T)$. Therefore, by gradually increasing $\alpha$, a sequence of pruned subtrees of the saturated tree can be generated.

The last step is to select a tree of the right size from the pruned trees. Overly large trees could result in higher misclassification when applied to new data sets. The goal is thus to select the right-sized tree with respect to a measure of misclassification cost on an independent dataset, so that the tree does not overfit the original learning dataset. To do this, the data are usually divided into two subsets, one for learning and the other for testing. The learning sample is used to split nodes, while the testing sample is used to compare the misclassification cost of all the sub-trees. As the tree grows larger and larger, the misclassification cost decreases monotonically for the learning data, whereas that for the testing data reaches a minimum and then increases. This indicates that the saturated tree will give the best fit to the learning data but could overfit when applied to independent data. The right-sized tree can be determined

when the misclassification costs reach a minimum for both learning and testing data. A more detailed description of CART analysis and its applications can be found in Breiman, Friedman, Olshen, and Stone (1998).

3.2. Negative binomial regression

Statistical modeling techniques have been used to analyze the relationship between accidents and risk factors for many years. Among these techniques, Poisson and negative binomial regression have been extensively employed in recent years because of the nature (i.e., discrete and non-negative integer) of accident frequency on a highway section or intersection. In applying a Poisson regression to model accident frequency on a highway section over a one-year time period, the probability of highway section $i$ having $n_i$ accidents per year is given by

$$ P(n_i) = \frac{\lambda_i^{n_i} \exp(-\lambda_i)}{n_i!} \tag{7} $$

where $P(n_i)$ is the probability of $n_i$ accidents occurring on highway section $i$ per year, and $\lambda_i$ is the Poisson parameter for highway section $i$, which is equal to the expected accident frequency (i.e., $E(n_i)$) for highway section $i$. When applying the Poisson regression model, the expected accident frequency is assumed to be a function of explanatory variables such that

$$ \lambda_i = \exp(\beta X_i) \tag{8} $$

where $X_i$ is a vector of explanatory variables including the geometric, traffic, and weather characteristics of highway section $i$ that determine accident frequency, and $\beta$ is a vector of estimable coefficients. The coefficient vector $\beta$ can then be estimated by the maximum likelihood method.

An important characteristic of the Poisson probability distribution is that its mean and variance are equal. However, past research indicated that accident frequency data were likely to be overdispersed (i.e., having a variance that exceeds the mean). If the overdispersion problem exists, Poisson regression may result in biased and inefficient coefficient estimates. To overcome this problem, negative binomial regression has been commonly applied in past research as an alternative that relaxes the assumption that the mean of accident frequencies is equal to the variance. To do this, an error term is added to the expected accident frequency ($\lambda_i$) such that Eq. (8) becomes

$$ \lambda_i = \exp(\beta X_i + \varepsilon_i) \tag{9} $$

where $\exp(\varepsilon_i)$ is a gamma-distributed error term with mean one and variance $\alpha$. The formulation of the negative binomial distribution is

$$ P(n_i) = \frac{\Gamma(\theta + n_i)}{\Gamma(\theta)\, n_i!}\; u_i^{\theta}\, (1 - u_i)^{n_i} \tag{10} $$

where $u_i = \theta / (\theta + \lambda_i)$, $\theta = 1 / \alpha$, and $\Gamma(\cdot)$ is the gamma function. The corresponding likelihood function is

$$ L(\lambda_i) = \prod_{i=1}^{N} \frac{\Gamma(\theta + n_i)}{\Gamma(\theta)\, n_i!} \left[ \frac{\theta}{\theta + \lambda_i} \right]^{\theta} \left[ \frac{\lambda_i}{\theta + \lambda_i} \right]^{n_i} \tag{11} $$

where $N$ is the total number of highway sections. The coefficient estimates can be obtained by the maximum likelihood method. This model structure allows the mean to differ from the variance such that

$$ \mathrm{var}[n_i] = E[n_i]\,\big[1 + \alpha E[n_i]\big] \tag{12} $$

where $\alpha$ is used as a measure of dispersion. If $\alpha$ is not significantly different from zero, the negative binomial model simply reduces to a Poisson model with $\mathrm{var}[n_i] = E[n_i]$. If $\alpha$ is significantly different from zero, the negative binomial model is the correct choice. A more detailed description of negative binomial regression analysis can be found in Washington, Karlaftis, and Mannering (2003).

4. The data

National Freeway 1, which connects most of the major cities and metropolitan areas and serves as the most important transportation corridor in Taiwan, was selected as the study area. National Freeway 1 is a 373-kilometer tolled freeway with 47 interchanges and 10 mainline toll plazas. To investigate the relationship between vehicle accidents and highway geometric, traffic, and environmental conditions, data were collected from a number of sources. The vehicle accident data were taken from National Traffic Accident Investigation Reports provided by the Ministry of Transportation and Communications (MOTC). Each reported accident record contains detailed information on accident location, involved driver's characteristics, environmental conditions at the time of accident, primary causes of the accident, and injury levels of occupants. Information regarding the accidents that occurred during 2001–2002 was extracted for this study. The highway geometric design information and traffic data were supplied by the Taiwan Area National Freeway Bureau. The highway geometric design information includes number of lanes, horizontal curvature, vertical grade, and shoulder width, while traffic information includes ADT of various vehicle types, peak hour factors, and traffic distribution over lanes. The weather information was taken from the annual report of climatological data. This report records detailed weather information of cities and towns along National Freeway 1, including pressure, temperature, humidity, precipitation, wind speed, and cloudiness.

To analyze the effects of highway geometric characteristics on accident frequencies, the study area was divided into fixed-length (1 km long) sections. Because of the opposite values of vertical alignment, northbound and southbound roadway sections had to be considered separately. After screening out the north and south end sections due to different operational characteristics (e.g., reduced speed limits and signal control at the end of freeway), 373 kilometers of freeway were disaggregated into 742 sections. The geometric characteristics of each highway section were then determined according to the characteristic with the largest proportion. For example, if a highway section has a composite grade of 1% for 800 meters and 3% for 200 meters, the 1% grade is selected as the vertical alignment of this highway section. During the 2001–2002 period, there were 1,075 accidents resulting in deaths and/or injuries. The summary statistics of these 1,484 highway sections (i.e., each section produces two observations) are presented in Table 1.

In order to compare the predictions of the CART model and the statistical model, the collected data were also randomly divided into two subsets, one for training and the other for testing. The number of cases used for model training and testing is 1,113 (75% of the total observations) and 371, respectively. A chi-squared test shows that the accident frequency distributions of the two sub-samples are not significantly different.

Table 1 Sample summary statistics of characteristics of road sections

| Variable | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|
| Accident frequency (per year) | 0 | 6 | 0.72 | 0.95 |
| Degree of horizontal curve (angle, in degrees, subtended by a 100 m arc, equal to 18,000/(π · Radius)) | 0 | 14.3 | 1.66 | 1.99 |
| Vertical grade (percent) | -5.3 | 5.3 | 0 | 1.44 |
| ADT (in 1000's of vehicles) | 29.31 | 116.56 | 52.52 | 18.98 |
| Truck ADT (in 1000's of vehicles) | 1.00 | 13.56 | 5.99 | 2.07 |
| Tractor-trailer ADT (in 1000's of vehicles) | 1.36 | 8.29 | 4.96 | 1.39 |
| Bus ADT (in 1000's of vehicles) | 0.37 | 11.76 | 1.84 | 1.16 |
| Peak hour factor | 0.77 | 0.97 | 0.91 | 0.04 |
| Number of days with precipitation | 49 | 174 | 93.5 | 26.6 |
| Annual precipitation (millimeters) | 1195 | 3749 | 1669.1 | 692.2 |
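The random 75/25 division of the 1,484 observations and the chi-squared comparison of the two accident frequency distributions can be sketched as below. This is only an illustration under stated assumptions: the function names, the random seed, and the class counts passed to `chi_squared_statistic` are hypothetical, and the study does not report which software or procedure was used for the test.

```python
import random


def split_observations(n_obs, train_share=0.75, seed=0):
    """Randomly split observation indices into training and testing subsets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_train = round(n_obs * train_share)
    return indices[:n_train], indices[n_train:]


def chi_squared_statistic(train_counts, test_counts):
    """Pearson chi-squared statistic for homogeneity of two frequency distributions."""
    total_train, total_test = sum(train_counts), sum(test_counts)
    grand_total = total_train + total_test
    statistic = 0.0
    for a, b in zip(train_counts, test_counts):
        column_total = a + b
        if column_total == 0:
            continue  # an empty category contributes nothing
        for observed, row_total in ((a, total_train), (b, total_test)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


train_idx, test_idx = split_observations(1484)
print(len(train_idx), len(test_idx))  # 1113 371

# Hypothetical class counts with identical proportions in both subsets.
print(chi_squared_statistic([60, 30, 10], [6, 3, 1]))  # 0.0
```

With five accident frequency categories and two sub-samples the statistic has 4 degrees of freedom, so a value below the 5% critical value of 9.488 would not reject the hypothesis that the two distributions are the same.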


5. Analysis results

5.1. CART model estimation and interpretation

Twelve predictor variables were used with the categorical target variable of freeway accident frequency in an attempt to identify the important patterns that traffic engineers wish to understand. Because the proportion of freeway sections with two or more accidents is relatively small (e.g., 17.8% in the training data), treating the freeway sections with one or more accidents as a single category would significantly simplify the analysis (i.e., the target variable collapses to a binary variate: 0 vs. 1 or more). However, if a binary target variable is applied, some valuable information, such as which variables contribute to a high accident frequency, will not be obtained. Therefore, in the present study, freeway sections with four or more accidents were treated as a category for the target variable. Table 2 gives the definition of the variables.

Fig. 2 shows the classification tree produced by CART. The interpretation of results is straightforward. The initial split in node 1 is based on the ADT of 20,622 vehicles/lane. CART sends the freeway sections with ADT less than or equal to 20,622 vehicles to the left, forming a terminal node, and those with ADT greater than 20,622 vehicles to the right. This indicates that the single best variable to classify the accident frequency on the freeway sections is ADT. For the freeway sections with ADT less than or equal to 20,622 vehicles/lane, the tree predicts 66.7% (217/325) of them with zero accidents. Conditioned on the ADT greater than 20,622 vehicles, the second best variable to classify accident frequency on the freeway sections is the number of rainy days. For the freeway sections with rainy days less than or equal to 81, the split is made on the annual precipitation amount of 887.25 millimeters, producing terminal nodes 2 and 3. In terminal node 2, the tree predicts 50% (7/14) of freeway sections with one accident. In terminal node 3, the tree predicts 63% (108/172) of freeway sections with zero accidents. For the freeway sections with rainy days greater than 81, CART further splits the freeway sections with bus volume greater than 4,677 buses/day to the left, forming terminal node 16, which predicts 50% (9/18) of freeway sections with two accidents. With this splitting rule, the prediction of freeway accident frequency can be obtained by continuing down the tree branches until a terminal node is reached.

As indicated by the terminal nodes 5, 6, 9, 13, 14, 15, and 16 in Fig. 2, freeway sections with higher traffic volume (in terms of ADT/lane, bus volume, truck volume, and semitractor volume), higher precipitation (in terms of days and amount), and certain geometric alignments (grade > 3.85% and degree of horizontal curvature > 0.40) have a greater tendency to be classified as having relatively more accidents. For the traffic variables, the conflicts between vehicles and the exposure to potential risk of accidents are expected to increase with increasing number of vehicles. In addition, rainy conditions contribute more to accidents because visibility is poor and it is also difficult to maneuver the vehicle on wet pavement. As for the geometric design variables, grade can significantly influence vehicle operating speed, particularly that of large trucks and buses. The effect of speed differentials can play an important role in accident occurrence. The classification tree indicates that freeway sections with grade greater than 3.85% have a higher tendency to be classified as having one accident, conditioned on the traffic conditions of ADT/lane > 20,622 vehicles and bus ADT > 4677 buses, and environmental conditions of the number of days with precipitation > 81 days (terminal node 16). In general, the effects of traffic variables, environmental variables, and geometric design variables on traffic accidents identified by the classification tree are consistent with the findings of previous studies (Miaou, 1994; Karlaftis & Golias, 2002; Carson & Mannering, 2001).

5.2. Comparisons of CART and negative binomial regression models

Table 3 summarizes the estimation results of the negative binomial regression model. Sixteen variables are found

Table 2 Description of variables

| Variable | Symbol | Type | Description |
|---|---|---|---|
| Accident frequency | FREQUENC | Qualitative | The target variable (0–4, 4 representing 4 or more crashes) |
| Horizontal alignment | CURVE | Continuous | Angle, in degrees, subtended by a 100 m arc, equal to 18,000/(π · Radius) |
| Vertical alignment | GRADE | Continuous | Grade in percent |
| Fog zone | FOGZONE | Qualitative | 1, fog zone; 0 otherwise |
| Section characteristic | INTCHGOR | Qualitative | 1, interchange; 2, toll plaza; 3, military section; 0 otherwise |
| Number of lanes | NOLANES | Count | Number of moving traffic lanes in the section |
| ADT (per lane) | VOL_LANE | Continuous | Average daily traffic volume per lane |
| Bus ADT | BUSES | Continuous | Average daily bus volume |
| Truck ADT | TRUCKS | Continuous | Average daily single-unit truck volume |
| Tractor-trailer ADT | SEMI | Continuous | Average daily tractor-trailer volume |
| PHF | PHF | Continuous | Peak hour factor |
| Annual precipitation | P | Continuous | The total amount of precipitation in one year |
| Precipitation day | PDAY | Continuous | Number of days with precipitation |
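The coding of the target variable FREQUENC in Table 2, where category 4 stands for four or more crashes, can be illustrated with a short helper. The function name `accident_class` and the example counts are hypothetical and serve only to show the top-coding rule.

```python
def accident_class(count, top_class=4):
    """Map a yearly accident count to the categorical target 0..4 (4 = four or more)."""
    if count < 0:
        raise ValueError("accident count cannot be negative")
    return min(count, top_class)


# Hypothetical yearly accident counts for five sections.
yearly_counts = [0, 1, 2, 6, 4]
print([accident_class(c) for c in yearly_counts])  # [0, 1, 2, 4, 4]
```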

Fig. 2. The output of CART tree.

[Classification tree diagram. Root: Node 1 (N = 1113), split on VOL_LANE at 20622.25. Further splits: Node 2 (N = 788) on PDAY at 81; Node 3 (N = 186) on P at 887.25; Node 4 (N = 602) on BUSES at 4677; Node 5 (N = 584) on GRADE at 3.85; Node 6 (N = 564) on INTCHGOR (interchange or toll station vs. military or otherwise). Remaining internal nodes: Node 7 (N = 402), Node 8 (N = 174), Node 9 (N = 140), Node 10 (N = 77), Node 11 (N = 228), Node 12 (N = 162), Node 13 (N = 145), Node 14 (N = 141), and Node 15 (N = 78), with split thresholds 0.4 (CURVE), 1.75, 2096.5, 3060, 4942.5, 8054.5, 8217.5, 24348.688, and 29833.25.]

Terminal node class counts (C0 to C4: sections with 0, 1, 2, 3, and 4 or more accidents):

| Terminal node | N | C0 | C1 | C2 | C3 | C4 |
|---|---|---|---|---|---|---|
| 1 | 325 | 217 | 79 | 23 | 6 | 0 |
| 2 | 14 | 3 | 7 | 3 | 1 | 0 |
| 3 | 172 | 108 | 42 | 17 | 3 | 2 |
| 4 | 35 | 20 | 7 | 8 | 0 | 0 |
| 5 | 42 | 16 | 21 | 5 | 0 | 0 |
| 6 | 63 | 21 | 36 | 4 | 2 | 0 |
| 7 | 34 | 19 | 8 | 5 | 1 | 1 |
| 8 | 221 | 115 | 59 | 28 | 14 | 5 |
| 9 | 7 | 1 | 5 | 1 | 0 | 0 |
| 10 | 17 | 3 | 9 | 3 | 1 | 1 |
| 11 | 63 | 29 | 9 | 15 | 4 | 6 |
| 12 | 58 | 27 | 14 | 9 | 6 | 2 |
| 13 | 20 | 4 | 12 | 4 | 0 | 0 |
| 14 | 4 | 1 | 0 | 0 | 3 | 0 |
| 15 | 20 | 5 | 11 | 1 | 3 | 0 |
| 16 | 18 | 3 | 4 | 9 | 1 | 1 |
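The way a terminal node of Fig. 2 turns into a prediction can be sketched in a few lines: the predicted category is the class with the most training cases in the node, and the reported percentage is that class's share of the node. The helper name `predict_from_counts` is an assumption for illustration; the class counts are those printed in Fig. 2 for terminal nodes 1, 2, 3, and 16.

```python
def predict_from_counts(class_counts):
    """Return (predicted_class, share) for a terminal node given counts C0..C4."""
    total = sum(class_counts)
    predicted = max(range(len(class_counts)), key=lambda k: class_counts[k])
    return predicted, class_counts[predicted] / total


# Class counts (C0..C4) of four terminal nodes as printed in Fig. 2.
terminal_nodes = {
    1: [217, 79, 23, 6, 0],
    2: [3, 7, 3, 1, 0],
    3: [108, 42, 17, 3, 2],
    16: [3, 4, 9, 1, 1],
}
for node, counts in terminal_nodes.items():
    predicted, share = predict_from_counts(counts)
    print(f"terminal node {node}: class {predicted}, share {share:.3f}")
```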


Table 3 Negative binomial estimation results

| Variable | Estimated coefficient | t-statistic |
|---|---|---|
| Constant | 1.576 | 5.27 |
| Degree of horizontal curve | 0.050 | 2.20 |
| Severe horizontal curve indicator (1 if degree of horizontal curve ≥ 8°; 0 otherwise) | -0.562 | -1.37 |
| Severe upgrade indicator (1 if grade > 3%; 0 otherwise) | 0.560 | 1.81 |
| Severe descent grade indicator (1 if grade ≤ -4%; 0 otherwise) | 0.214 | 1.80 |
| Level indicator (1 if 0% ≤ grade ≤ 1%; 0 otherwise) | -0.139 | -1.59 |
| Interchange indicator (1 if section contains an interchange; 0 otherwise) | 0.376 | 4.33 |
| Toll plaza indicator (1 if section contains a toll plaza; 0 otherwise) | 0.329 | 1.79 |
| Military section indicator (1 if section is a military section; 0 otherwise) | 0.632 | 2.06 |
| Number of lanes | 0.155 | 2.12 |
| ADT per lane (in 1000's of vehicles) | 0.024 | 2.38 |
| Low ADT indicator (1 if ADT per lane ≤ 16,000; 0 otherwise) | -0.473 | -2.06 |
| High truck ADT indicator (1 if truck ADT > 7,000; 0 otherwise) | 0.174 | 1.85 |
| Low bus ADT indicator (1 if bus ADT < 1,000; 0 otherwise) | 0.145 | 1.03 |
| High PHF indicator (1 if PHF ≥ 0.95; 0 otherwise) | -0.209 | -1.33 |
| High annual precipitation indicator (1 if annual precipitation > 2,000 mm; 0 otherwise) | 0.095 | 1.09 |
| Low precipitation day indicator (1 if number of days with precipitation < 80; 0 otherwise) | -0.232 | -2.07 |
| α (dispersion coefficient) | 0.199 | 2.49 |

Number of observations: 1113
Restricted log likelihood (constant only): -1243.090
Log likelihood at convergence: -1237.941

statistically significant or marginally significant in determining accident likelihood. It is noteworthy that the dispersion parameter, α, is significantly different from zero. This confirms the appropriateness of the negative binomial model relative to the Poisson model. Variables with a positive sign are associated with increased accident likelihood. For example, the positive coefficients of the high ADT and high truck ADT indicator variables imply that conflicts between vehicles and the exposure to potential risk of accidents increase with increasing numbers of vehicles and trucks. Other variables associated with increased accident likelihood include the presence of horizontal curve, grade greater than 3% or less than −4%, interchange, toll plaza, number of lanes, low bus ADT, and high annual precipitation. In contrast, variables with a negative sign are associated with reduced accident likelihood. These variables include a degree of horizontal curve greater than 8°, level section, low ADT, high PHF, and fewer than 80 days with precipitation. The reduced accident likelihood for sections with a degree of horizontal curve greater than 8° seems counterintuitive but is consistent with previous findings (Milton & Mannering, 1998). One explanation is that drivers are more likely to drive cautiously at sharp horizontal curves.

Compared with the negative binomial results, the CART model relies more on traffic and environmental variables than on geometric and location variables to classify accident frequencies on the freeway sections. For example, ADT per lane is used three times in splitting the nodes, whereas the number of lanes and the toll plaza do not appear to have any effect on accident occurrence in the CART model.

To further understand the performance of the CART model in analyzing freeway accident frequency, the prediction performance of the CART and the negative binomial regression models is compared. To predict the accident probability for each individual freeway section using the negative binomial regression model, the average accident frequency (i.e., E defined by Eq. (9)) has to be determined first. Given the average accident frequency of a freeway section, the probability of each frequency category can be computed, and the section is assigned to the category with the largest probability. For example, suppose the negative binomial regression model predicts that the accident probabilities of a particular freeway section are 30%, 40%, 20%, 7%, and 3% for 0, 1, 2, 3, and 4 or more crashes, respectively. This freeway section is then classified as having one crash. For the CART model, the classification of freeway accident frequency is obtained by continuing down the tree branches until a terminal node is reached.

The prediction results are summarized in Tables 4 and 5. By combining the freeway sections with two or more accidents into one group, chi-squared tests of 3 × 3 contingency tables are conducted to compare the predicted and the observed frequencies for the CART model. Although the test results show significant differences between the predicted and the observed results for the training data (p-value < 0.001) and the testing data (p-value = 0.026), Tables 4 and 5 still provide valuable information on the prediction performance of the negative binomial regression and the CART models. For the CART model, the overall prediction accuracy is about 58.2% for the training data and about 52.6% for the testing data. For the negative binomial model, the overall prediction accuracies for the training data and the testing data are 52.9% and 52.3%, respectively. The CART model performs slightly better than the negative binomial regression model on the training data. As for predicting each accident frequency category, the CART model performs better than the negative binomial model for highway sections with one or more accidents, but the accuracy is relatively low. The negative binomial model performs slightly better than the CART model for sections with zero accidents, but it cannot predict sections with accident frequencies greater than one. Although both models correctly predict over 90% of the zero-accident freeway sections, the prediction results should be interpreted with caution because the percentage correctly classified has a number of limitations. The zero-accident freeway sections make up the largest share of the observations, which by itself leads to a higher percentage of correct predictions. Based on these comparisons, it is difficult to determine which model has the better prediction performance.

Table 4. Prediction results of the CART model (rows: observed accident frequency; columns: predicted accident frequency)

Training data

| Observed | 0 | 1 | 2 | 3 | 4 or more | Row marginal total |
|---|---|---|---|---|---|---|
| 0 | 535 | 53 | 3 | 1 | 0 | 592 |
| 1 | 218 | 101 | 4 | 0 | 0 | 323 |
| 2 | 105 | 21 | 9 | 0 | 0 | 135 |
| 3 | 34 | 7 | 1 | 3 | 0 | 45 |
| 4 or more | 16 | 1 | 1 | 0 | 0 | 18 |
| Column marginal total | 908 | 183 | 18 | 4 | 0 | 1113 |

(The overall prediction accuracy is 58.2% for training data)

Testing data

| Observed | 0 | 1 | 2 | 3 | 4 or more | Row marginal total |
|---|---|---|---|---|---|---|
| 0 | 174 | 15 | 3 | 1 | 0 | 193 |
| 1 | 93 | 20 | 3 | 1 | 0 | 117 |
| 2 | 36 | 9 | 1 | 0 | 0 | 46 |
| 3 | 6 | 2 | 1 | 0 | 0 | 9 |
| 4 or more | 4 | 2 | 0 | 0 | 0 | 6 |
| Column marginal total | 313 | 48 | 8 | 2 | 0 | 371 |

(The overall prediction accuracy is 52.6% for testing data)

Table 5. Prediction results of the negative binomial regression model (rows: observed accident frequency; columns: predicted accident frequency)

Training data

| Observed | 0 | 1 | 2 | 3 | 4 or more | Row marginal total |
|---|---|---|---|---|---|---|
| 0 | 585 | 7 | 0 | 0 | 0 | 592 |
| 1 | 319 | 4 | 0 | 0 | 0 | 323 |
| 2 | 127 | 8 | 0 | 0 | 0 | 135 |
| 3 | 41 | 4 | 0 | 0 | 0 | 45 |
| 4 or more | 18 | 0 | 0 | 0 | 0 | 18 |
| Column marginal total | 1090 | 23 | 0 | 0 | 0 | 1113 |

(The overall prediction accuracy is 52.9% for training data)

Testing data

| Observed | 0 | 1 | 2 | 3 | 4 or more | Row marginal total |
|---|---|---|---|---|---|---|
| 0 | 192 | 1 | 0 | 0 | 0 | 193 |
| 1 | 115 | 2 | 0 | 0 | 0 | 117 |
| 2 | 45 | 1 | 0 | 0 | 0 | 46 |
| 3 | 9 | 0 | 0 | 0 | 0 | 9 |
| 4 or more | 6 | 0 | 0 | 0 | 0 | 6 |
| Column marginal total | 367 | 4 | 0 | 0 | 0 | 371 |

(The overall prediction accuracy is 52.3% for testing data)
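The classification rule described in this section for the negative binomial model (compute the probability of each frequency category, then pick the largest) can be illustrated with a minimal sketch. It assumes the common NB2 parameterisation with a mean and a dispersion α, which may differ in detail from the study's own equations; the dispersion value 0.199 is taken from Table 3, while the mean values and all function names are hypothetical and serve only as illustration.

```python
import math

CATEGORIES = ["0", "1", "2", "3", "4 or more"]


def nb_pmf(k, mean, alpha):
    """P(Y = k) for a negative binomial with the given mean and dispersion.

    Assumes the NB2 form, in which the variance is mean + alpha * mean**2.
    """
    r = 1.0 / alpha  # inverse dispersion
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )
    return math.exp(log_p)


def category_probabilities(mean, alpha):
    """Probabilities of 0, 1, 2, 3 and 4-or-more accidents."""
    probs = [nb_pmf(k, mean, alpha) for k in range(4)]
    probs.append(max(0.0, 1.0 - sum(probs)))  # remaining mass: 4 or more
    return probs


def classify_by_largest_probability(probs):
    """Index of the frequency category with the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


if __name__ == "__main__":
    # Example from the text: 30%, 40%, 20%, 7%, 3% is classified as one crash.
    text_example = [0.30, 0.40, 0.20, 0.07, 0.03]
    print(CATEGORIES[classify_by_largest_probability(text_example)])

    alpha = 0.199  # dispersion estimate reported in Table 3
    for mean in (0.5, 1.5):  # hypothetical mean accident frequencies
        probs = category_probabilities(mean, alpha)
        label = CATEGORIES[classify_by_largest_probability(probs)]
        print(mean, [round(p, 3) for p in probs], label)
```

Under these assumptions a section is assigned to the zero-accident category whenever its predicted mean is small, which is in line with the concentration of negative binomial predictions in the first column of Table 5.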

6. Discussion

In the present study, the CART model and the negative binomial regression model provide similar prediction performance on the training data and the testing data. This demonstrates that CART analysis is an appropriate methodology for analyzing traffic accidents. Although it is difficult to determine from these results which modeling approach is better, several points may be of interest for future research.

The CART analysis provides both theoretical and applied advantages relative to negative binomial regression or other parametric models (e.g., classical multiple linear regression models). From the theoretical perspective, the advantage of the CART analysis is that it requires neither a functional form to be specified in advance nor the assumption of an additive relationship between risk factors. In this application, if the relationship between risk factors and traffic accidents is not additive, the accident likelihood estimated by the negative binomial regression could be erroneous. Another advantage is that the CART analysis can effectively handle collinearity problems. When independent variables are strongly correlated, the variance of the estimated coefficients is inflated and the relationship between the independent variables and the dependent variable becomes difficult to interpret. When the CART analysis is applied, correlation between independent variables is not a great concern. Compared to the regression models widely applied in traffic safety analysis, this is an important advantage of employing CART models to analyze traffic accidents, because an accident is rarely due to a single risk factor but is the outcome of a series of factors. In addition, outliers can adversely affect the coefficient estimates of the negative binomial regression model and other parametric models.
In the CART model, outliers are isolated into a node and do not contribute to splitting.

From the practical perspective, the first advantage of the CART model is its capability of displaying the results graphically. This makes the analysis results easy to understand across disciplines, including for non-statisticians. The classification tree is structured as a sequence of “if-then” questions. By answering these questions and tracing a path down the tree to a terminal node, traffic engineers can easily predict the accident likelihood of a freeway section. Another advantage of the CART model is that it automatically searches for the best cut-point to split the nodes. In regression analysis, by contrast, if an indicator variable is used instead of a continuous variable (e.g., grade), the cut-point value of the indicator variable is generally determined through experimentation or based on the findings of past studies. Therefore, past studies have suggested that the CART model could be used as a precursor to a parametric model (Kuhnert et al., 2000). This study also estimated a negative binomial model based on the variables found in the CART model, but it did not result in a better fit of the data.

Despite these advantages, the CART model has its own disadvantages. First, as discussed by Harrell (2001), CART does not utilize continuous and ordinal variables effectively and is prone to overfitting in three directions: searching for predictors, the best splits, and multiple searches. Second, CART analysis does not provide a probability level or confidence interval for the risk factors and predictions. When the CART model is employed to analyze a new data set, the lack of formal statistical inference procedures and the risk of overfitting are critical problems. In addition, the simple binary tree appears to have difficulty in handling the interactions between risk factors. Although in this application the interactions between risk factors such as environmental and geometric factors were found to be statistically insignificant in determining accident likelihood in the negative binomial model, past studies have indicated that the interactions between risk factors could have a significant effect on accident occurrence. A further drawback of the CART method is the difficulty of conducting elasticity analysis or sensitivity analysis. Elasticity analysis (or sensitivity analysis) provides valuable information on the marginal effects of the variables on accident frequency. It is particularly important for traffic authorities in allocating limited resources for mitigation.
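To illustrate what an elasticity analysis offers in the parametric case, the following minimal sketch assumes the usual log-linear mean function of a negative binomial model, in which the expected frequency is the exponential of a linear combination of the variables. Under that assumption the elasticity of the expected frequency with respect to a continuous variable is simply its coefficient times its value. The two coefficients are the Table 3 estimates for ADT per lane and number of lanes; the section values and the function names are hypothetical.

```python
import math


def expected_frequency(coefficients, values):
    """Expected accident frequency under an assumed log-linear mean function."""
    return math.exp(sum(coefficients[name] * values[name] for name in coefficients))


def elasticity(beta, x):
    """Analytical elasticity of the expected frequency for a continuous variable."""
    return beta * x


def numerical_elasticity(coefficients, values, name, rel_step=1e-6):
    """Finite-difference check: percent change in frequency per percent change in x."""
    base = expected_frequency(coefficients, values)
    bumped = dict(values)
    bumped[name] = values[name] * (1.0 + rel_step)
    return (expected_frequency(coefficients, bumped) - base) / base / rel_step


if __name__ == "__main__":
    # Coefficients from Table 3 (ADT per lane is measured in thousands of vehicles).
    coefficients = {"adt_per_lane_thousands": 0.024, "number_of_lanes": 0.155}
    # Hypothetical freeway section used only for illustration.
    section = {"adt_per_lane_thousands": 20.0, "number_of_lanes": 3.0}

    beta = coefficients["adt_per_lane_thousands"]
    x = section["adt_per_lane_thousands"]
    print(round(elasticity(beta, x), 3))
    print(round(numerical_elasticity(coefficients, section, "adt_per_lane_thousands"), 3))
```

For this hypothetical section, a 1% increase in ADT per lane corresponds to roughly a 0.48% increase in expected accident frequency. A classification tree offers no comparably direct marginal-effect measure, which is the drawback noted above.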
A final disadvantage of CART is the cost of the software. Although Salford Systems provides a trial version of the CART software free of charge, the free software works only for a short period of time.
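The automatic cut-point search discussed in this section can be sketched as follows. The sketch assumes Gini impurity as the splitting criterion, a common choice for classification trees; the criterion and search details of the software used in this study may differ. The data values and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_cut_point(values, labels):
    """Search all midpoints between distinct sorted values.

    Returns (cut, weighted_impurity) for the rule `value <= cut` that gives the
    lowest weighted Gini impurity, or (None, impurity of the unsplit node) if no
    split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut-point fits between identical values
        cut = (low + high) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (cut, score)
    return best


if __name__ == "__main__":
    # Hypothetical ADT-per-lane values and accident classes for six sections.
    adt_per_lane = [3000, 3500, 4000, 5000, 5200, 6000]
    accident_class = [0, 0, 0, 1, 1, 1]
    print(best_cut_point(adt_per_lane, accident_class))
```

In a regression model the analyst would have to choose such a threshold by hand when building an indicator variable, whereas the tree derives it from the data at every node.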

7. Conclusions

A non-parametric tree-based (CART) model was proposed to establish the empirical relationship between traffic accidents and highway geometric variables, traffic characteristics, and environmental factors. By comparing its analysis and prediction results with those of a negative binomial regression model, this study demonstrated that CART is a good alternative for analyzing freeway accident frequency. This represents an important methodological step in analyzing traffic accident frequency. The results obtained here, by exploring a broad range of variables including highway geometry, traffic, and environmental characteristics, provide valuable insight into the underlying relationship between risk factors and vehicle accidents.

In terms of future work, applying the methodological approaches used in this paper to analyze accident severity would be interesting. As discussed previously, statistical models such as logit and ordered probit models have been the commonly employed techniques in safety analysis. Further exploration with non-parametric modeling techniques might provide a better understanding of the risk factors that influence the injury severity of traffic accidents. It would also be interesting to employ other data mining techniques, such as association rules or artificial neural networks, to explore the factors that affect accident frequency and to see whether more potential risk factors can be uncovered and the prediction results improved. In addition, neither the CART model nor the negative binomial model provides satisfactory prediction accuracy for highway sections with accident frequencies greater than one. It would be important for future studies to seek analysis tools that can effectively identify or predict high-accident locations so that traffic engineers can further improve highway design.
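The weak prediction accuracy for sections with more than one accident can be read directly from the testing-data counts in Tables 4 and 5. The sketch below recomputes the overall accuracy, the accuracy within each observed category, and the accuracy of always predicting the most frequent category; the last of these is an added comparison, not a figure reported in the study. The matrices are copied from the tables (rows observed, columns predicted), and the function names are illustrative.

```python
CATEGORIES = ["0", "1", "2", "3", "4 or more"]

# Testing data, rows = observed frequency, columns = predicted frequency.
CART_TEST = [
    [174, 15, 3, 1, 0],
    [93, 20, 3, 1, 0],
    [36, 9, 1, 0, 0],
    [6, 2, 1, 0, 0],
    [4, 2, 0, 0, 0],
]
NB_TEST = [
    [192, 1, 0, 0, 0],
    [115, 2, 0, 0, 0],
    [45, 1, 0, 0, 0],
    [9, 0, 0, 0, 0],
    [6, 0, 0, 0, 0],
]


def overall_accuracy(matrix):
    """Share of all sections whose predicted category equals the observed one."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_category_accuracy(matrix):
    """Share of correctly predicted sections within each observed category."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def majority_baseline(matrix):
    """Accuracy obtained by always predicting the most frequent observed category."""
    row_totals = [sum(row) for row in matrix]
    return max(row_totals) / sum(row_totals)


if __name__ == "__main__":
    for name, matrix in (("CART", CART_TEST), ("Negative binomial", NB_TEST)):
        print(name, "overall:", round(100 * overall_accuracy(matrix), 1))
        for label, acc in zip(CATEGORIES, per_category_accuracy(matrix)):
            print("  observed", label, "->", round(100 * acc, 1))
    print("Always predicting zero:", round(100 * majority_baseline(CART_TEST), 1))
```

On the testing data the overall accuracies come out at 52.6% and 52.3%, matching the tables, while always predicting zero accidents would already score about 52.0%. This makes concrete why percent correctly classified should be read with caution when one category dominates.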

Acknowledgments

This study is sponsored by the National Science Council, Taiwan (NSC93-2211-E-415-003).

References

Bevilacqua, M., Braglia, M., & Montanari, R. (2003). The classification and regression tree approach to pump failure rate analysis. Reliability Engineering and System Safety, 79(1), 59–67.
Breault, J. L., Goodall, C. R., & Fos, P. J. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26(1–2), 37–54.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1998). Classification and regression trees. London: Chapman & Hall/CRC.
Carson, J., & Mannering, F. (2001). The effect of ice warning signs on ice-accident frequency and severity. Accident Analysis and Prevention, 33(1), 99–109.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Boston: The MIT Press.
Fu, C. Y. (2004). Combining loglinear model with classification and regression tree (CART): An application to birth data. Computational Statistics and Data Analysis, 45(4), 865–874.
Hui, S. C., & Jha, G. (2000). Data mining for customer service support. Information and Management, 38(1), 1–13.
Hadi, M., Aruldhas, J., Chow, L-F., & Wattleworth, J. (1995). Estimating safety effects of cross-section design for various highway types using negative binomial regression. Transportation Research Record, 1500, 169–177.
Harrell, F. E. (2001). Regression modeling strategies. New York: Springer.
Ivan, J. N., Wang, C., & Bernardo, N. R. (2000). Explaining two-lane highway crash rates using land use and hourly exposure. Accident Analysis and Prevention, 32(6), 787–795.
Karlaftis, M. G., & Golias, I. (2002). Effects of road geometry and traffic volumes on rural roadway accident rates. Accident Analysis and Prevention, 34(3), 357–365.
Kuhnert, P. M., Do, K.-A., & McClure, R. (2000). Combining nonparametric models with logistic regression: An application to motor vehicle injury data. Computational Statistics and Data Analysis, 34(3), 371–386.
Lee, A. H., Stevenson, M. R., Wang, K., & Yau, K. (2002). Modeling young driver motor vehicle crashes: Data with extra zero. Accident Analysis and Prevention, 34(4), 515–521.
Lee, J., & Mannering, F. (2002). Impact of roadside features on the frequency and severity of run-off-roadway accidents: An empirical analysis. Accident Analysis and Prevention, 34(2), 149–161.
Marshall, R. J. (2001). The use of classification and regression trees in clinical epidemiology. Journal of Clinical Epidemiology, 54(6), 603–609.
Martin, J. L. (2002). Relationship between crash rate and hourly traffic flow on interurban motorways. Accident Analysis and Prevention, 34(5), 619–629.
McCarthy, P. S. (1999). Public policy and highway safety: A citywide perspective. Regional Science and Urban Economics, 29(3), 231–244.
Miaou, S. P. (1994). The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis and Prevention, 26(4), 471–482.
Milton, J., & Mannering, F. (1998). The relationship among highway geometrics, traffic-related elements and motor-vehicle accident frequencies. Transportation, 25(4), 395–413.
Pendharkar, P. C. (2004). An exploratory study of object-oriented software component size determinants and the application of regression tree forecasting models. Information and Management, 42(1), 61–73.
Poch, M., & Mannering, F. (1996). Negative binomial analysis of intersection-accident frequencies. Journal of Transportation Engineering, 122(2), 105–113.
Rygielski, C., Wang, J.-C., & Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in Society, 24(4), 483–502.
Shankar, V. N., Mannering, F., & Barfield, W. (1995). Effect of roadway geometrics and environmental factors on rural freeway accident frequencies. Accident Analysis and Prevention, 27(3), 371–389.
Shankar, V. N., Milton, J., & Mannering, F. (1997). Modeling accident frequencies as zero-altered probability processes: An empirical inquiry. Accident Analysis and Prevention, 29(6), 829–837.
Shaw, M. J., Subramaniam, C., Tan, G. W., & Welge, M. E. (2001). Knowledge management and data mining for marketing. Decision Support Systems, 31(1), 127–137.
Valafar, H., & Valafar, F. (2002). Data mining and knowledge discovery in proton nuclear magnetic resonance (1H-NMR) spectra using frequency to information transformation. Knowledge-Based Systems, 15(4), 251–259.
Washington, S., Karlaftis, M. G., & Mannering, F. L. (2003). Statistical and econometric methods for transportation data analysis. London: Chapman & Hall/CRC Press.
Yang, C-C., Prasher, S. O., Enright, P., Madramootoo, C., Burgess, M., Goel, P. K., & Callum, I. (2003). Application of decision tree technology for image classification using remote sensing data. Agricultural Systems, 76(3), 1101–1117.

Li-Yen Chang is an assistant professor in the Graduate Institute of Transportation and Logistics at National Chia-Yi University in Taiwan. He received his PhD from the Department of Civil and Environmental Engineering at the University of Washington in 1997. His research interests include vehicular safety, statistical and econometric methodologies, travel demand, and traffic operations.

Wen-Chieh Chen received his MS from the Graduate Institute of Transportation and Logistics at the National Chia-Yi University. His research interests include road safety, travel demand, and supply chain management.