Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models



Safety Science 49 (2011) 1314–1320

Contents lists available at ScienceDirect

Safety Science journal homepage: www.elsevier.com/locate/ssci

Ali Tavakoli Kashani, Afshin Shariat Mohaymany ⇑
School of Civil Engineering, Iran University of Science and Technology, Narmak, Tehran, Iran

Article info

Article history: Received 24 December 2009; received in revised form 17 April 2011; accepted 28 April 2011; available online 31 May 2011.

Keywords: Data mining; Classification and regression trees (CART); Injury severity; Rural roads

Abstract

Two-lane, two-way roads constitute a major portion of the rural roads in most countries of the world. This study identifies the factors influencing crash injury severity on these roads in Iran. Classification and regression trees (CART), which is one of the most common methods of data mining, was employed to analyze the traffic crash data of the main two-lane, two-way rural roads of Iran over a 3-year period (2006–2008). In the analysis procedure, the problem of three-class prediction was decomposed into a set of binary prediction models, which resulted in a higher overall accuracy of the predictions of the model. In addition, the prediction accuracy of the fatality class, which was nearly 0% in some of the previous studies, increased significantly. The results indicated that improper overtaking and not using a seatbelt are the most important factors affecting the severity of injuries. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

A significant proportion of the rural road network of Iran consists of two-lane, two-way roads. The 21,579 km of main two-lane rural roads constitute 30% of the road network of Iran, and more than 90% of Iranian passengers choose the road transportation mode for their intercity travel (R.M.T.O, 2008). Iran has one of the highest rates of traffic crash fatalities and injuries. In each of the last 3 years, 24,000 people on average (i.e., about 3 persons per hour) died in traffic crashes (F.M.O.I, 2009). Seventy percent of these fatalities occurred on rural roads and 30% on urban ones. The number of injuries is almost ten times higher, around 240,000 cases per year. These high rates, together with the high share of passenger travel carried by road, increase the need for a comprehensive study of passenger safety.

The main objective of this study was to find the significant factors influencing the injury severity of vehicle occupants (excluding drivers) involved in crashes on two-lane, two-way rural roads in Iran. The data mining approach was employed to identify the significant factors. Data mining is the exploration and analysis of large amounts of data to find meaningful models and patterns (Han and Kamber, 2006). Considering the large amount of data pertaining to crashes on rural roads in Iran, data mining was considered a suitable approach for this study. It is the first time that such a study has been carried out on data from Iran.

⇑ Corresponding author. Tel.: +98 021 77240098; fax: +98 021 77240398. E-mail address: [email protected] (A.S. Mohaymany). 0925-7535/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.ssci.2011.04.019

Identification of factors that affect the crash severity has attracted the attention of many researchers in the field of traffic safety. Recognizing such elements can help not only to reduce the number of deaths in traffic crashes but also to decrease the number of the crashes with severe injuries. Most studies done in this field have used regression-type generalized linear models, which presume a linear relationship between injury severity and the causes of crashes. However, this presumption can lead to errors in estimating the probability of injury severity (Mussone et al., 1999). Classification and regression trees (CART), which is an important data mining technique, is a non-parametric model with no presumed relationships between the dependent and independent variables. Decision tree models can identify and easily explain the complex patterns associated with crash risk and do not need to specify a functional form. CART is widely employed in commerce, industry, engineering, and other sciences and is a useful method to determine the most important independent variables and solve prediction and classification problems. In this study, the CART method was employed to identify the most important independent variables affecting the injury severity of vehicle occupants. The models found revealed that the cause of the crash and the use of seatbelts are the most significant factors influencing injury severity. The next section summarizes a review of previous literature on the injury severity in traffic crashes. Section 3 gives the methodology, an introduction to the CART method and the computation of the variable importance. The data description is given in Section 4, the results and discussions are presented in Section 5, and, finally, the conclusion is described in Section 6.


2. Literature review

Many studies concerning the process of predicting and modeling traffic crashes and their consequences have been conducted. Most of these studies have been carried out to determine the factors that affect crash severity the most so that elimination or control of these factors may prevent the occurrence of serious crashes and, hence, severe injuries and fatalities.

From a methodological standpoint, logit models are among the most practical ones used in the analysis of crash severity and/or injury severity (Al-Ghamdi, 2002; Bedard et al., 2002; Pai, 2009; Pai and Saleh, 2008; Valent et al., 2002; Wood and Simms, 2002). As an example, Al-Ghamdi (2002) showed that the location and the cause of the accident were important factors for the crash severity. Wood and Simms (2002) recognized car size as the significant factor for injury severity. Bedard et al. (2002) and Valent et al. (2002) found seatbelt and helmet use to be the main factors.

Ordered probits are another group of models that have become popular in crash severity modeling (Abdel-Aty and Keller, 2005; Kweon and Kockelman, 2003; Pai and Saleh, 2007; Renski et al., 1999; Zajac and Ivan, 2003). For example, Zajac and Ivan (2003) investigated the effect of roadway and area features on the severity of pedestrian crashes in rural areas. Kweon and Kockelman (2003) showed that using a seatbelt decreases the risk of injury in crashes, and Renski et al. (1999) determined that an increase in the speed limit causes an increase in injury severity.

In recent years, some researchers have analyzed crashes using non-parametric methods and data mining techniques (Chong et al., 2005; Pande and Abdel-Aty, 2009; Sohn and Lee, 2003; Sohn and Shin, 2001; Tesema et al., 2005). Many of them have tried to classify crash severity with the help of these techniques.
Such a classification finds patterns and models to sort each record from a large amount of data to its related class of crash severity, whether non-injury, injury or fatality. In this way, the conditions causing a crash to be in a specific class of crash severity, depending on its own related independent variables, are determined. Sohn and Shin (2001) classified crashes in Korea with respect to their severities using the CART method, neural networks and regression analysis and showed that protective devices (seatbelts and helmets) were the most important factors that determined the crash severity. Artificial neural networks are other techniques for data mining and are non-parametric methods, with which some researchers have analyzed crash severity and the injury severity of occupants involved in crashes (Abdelwahab and Abdel-Aty, 2001, 2002; Delen et al., 2006). For example, Delen et al. (2006) analyzed the injuries of 30,358 American drivers involved in crashes and showed that the use of a seatbelt, drinking alcohol and the age and the gender of the drivers were some of the most important factors influencing injury severity. The CART method is one of the popular tools in classification and prediction whose results have been used by some researchers in studies related to traffic safety (Chang and Wang, 2006; Chong et al., 2005; Magazzù et al., 2006; Sohn and Shin, 2001; Stewart, 1996; Tesema et al., 2005). Using the CART method, Chang and Wang (2006) analyzed 12,604 crash data in Taiwan and showed that vulnerable users (pedestrians, bicyclists, motorcyclists and passengers) are at more risk for severe injuries in crashes. They also concluded that mixed traffic operation of motorcycles and bicycles with other traffic results in many dangerous situations for motorcyclists and bicyclists, such as increased turning conflicts at intersections. 
Another study with 4658 crash data for the Ethiopian capital was carried out with the same method and indicated that speeding and denying pedestrian priorities were the most significant causes of increased injury severity (Tesema et al., 2005).

This study differs from some previous ones in its geographical extent and its attention to the injuries of car occupants (except drivers). The data of the injured occupants on main two-lane, two-way rural roads in Iran over the past 3 years (2006–2008) were analyzed using the CART method. The high share of road transportation in passenger travel in Iran and the high rate of road crashes provided a large amount of data, and the hope is that crash injury severity can be decreased by taking proper, effective measures. The injury severity of car occupants (excluding drivers) has been given little attention in previous studies, which is why it is investigated specifically in this study.

3. Methodology

The "decision tree" was used to classify the target variable, and the variables that have more important roles in the classification of the target variable were found on the basis of relations called "variable importance". The terms "decision tree" and "variable importance" are explained below.

3.1. Decision tree

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts so that the model can be used to predict the classes of objects (Han and Kamber, 2006). Classification models are generated on the basis of training data whose independent variables and target variables are known, and then they are used on a new dataset to predict the target variable. There are different methods of classification: neural networks, decision trees and regression. When a decision tree is used to classify a nominal target variable, it is called a "classification tree", and when it is used to predict a continuous one, it is called a "regression tree". The CART method can be used for both types of target variable. In this study, the target variable is nominal (injury severity), and thus, the "classification tree" was used.

Fig. 1 shows the principle of the CART method in developing the classification tree. First, all of the data are concentrated at a node located at the top of the tree. Then this "root node" is divided into two child nodes on the basis of an independent variable (splitter) that creates the best homogeneity. In fact, the data in each child node are more homogeneous than those in the upper parent node. This process is repeated for each child node until the data in each node have the greatest possible homogeneity. Such a node is called a terminal node or "leaf" and has no branches.

The following example is given for further explanation. Fig. 2 shows the process of growing a tree on the assumption that the target variable is binary and has two classes: injury and fatality. The classification tree divides the answer area into rectangular areas. The hypothesized tree in Fig. 2 shows that an occupant who does not use a seatbelt is classified as a fatality (terminal node T1), whereas an occupant who uses a seatbelt is classified as injured if male (terminal node T3) and as a fatality if female (terminal node T2). There are different indexes for splitting. The most famous of them for nominal data is the Gini index, shown as follows:

$$\mathrm{Gini}(m) = 1 - \sum_{j=1}^{J} p^{2}(j \mid m) \tag{1}$$

$$p(j \mid m) = \frac{p(j, m)}{p(m)}, \qquad p(j, m) = \frac{p(j)\, N_j(m)}{N_j}, \qquad p(m) = \sum_{j=1}^{J} p(j, m) \tag{2}$$


Fig. 1. General structure of a decision tree.

Fig. 2. An example of the basic rules of the CART method.

where J is the number of classes of the target variable, p(j) is the prior probability for class j, p(j|m) is the conditional probability of a record being in class j given that it is in node m, N_j(m) is the number of records of class j in node m, N_j is the number of records of class j in the root node, and Gini(m), the Gini index, is a measure of the impurity of node m. For example, if all of the observations in a node belong to one class, Gini(m) equals zero, which means the least impurity, or greatest purity, in that node. The maximum value of Gini(m) is obtained when all classes are equally represented in the node. The prior probability shows the proportion of observations in each class in the population. Tree growth, based on the Gini index, starts from the root node, which contains all of the observations. For each tree created, the "misclassification error rate" or "misclassification cost", in other words the "goodness of fit" index, is calculated as

$$\text{Misclassification error rate} = \sum_{m=1}^{M} p(m)\left[1 - \sum_{j=1}^{J} p^{2}(j \mid m)\right] \tag{3}$$

where p(m) is the proportion of all observations that fall in terminal node (leaf) m and M is the number of terminal nodes. The above relation measures the data that are misclassified into unrelated classes (e.g., injury observations that are placed in a fatality terminal node). In the CART method, tree growth continues until there are only similar observations in each terminal node. In such a case, the maximal tree, which overfits the training data, is created. To lessen the complexity of the maximal tree and create simpler trees, pruning is performed according to the cost-complexity algorithm. A simpler tree has fewer terminal nodes and a greater misclassification error rate. After pruning a branch, if the increase in the misclassification cost is sufficiently lower than the decrease in the complexity cost, that branch is pruned, and a new tree is created. Fig. 3 shows how an optimal tree is selected from the trees or models created. With an increase in complexity (more terminal nodes), the misclassification cost for the training data decreases steadily. For the test data, however, it first decreases and then increases. The optimal tree is the one that has the least misclassification cost for the test data. A more detailed description of CART analysis and its applications can be found in Breiman (1998).

Fig. 3. A hypothetical plot of the misclassification error rates for both training and testing data as a function of tree complexity (e.g., the number of terminal nodes in the tree).

Table 1. Variable description.

Variable | Description
Injury severity | Target variable: 1. Light injury, 2. Serious injury, 3. Fatality
Gender | 1. Male, 2. Female
Seatbelt | 1. Used, 2. Not used, 3. Unknown
Cause of crash (a) | 1. Following too closely, 2. Ignoring proper lateral distance, 3. Ignoring right of way, 4. Inattention to ahead, 5. Inability to drive, 6. Failure to vehicle control, 7. Speeding, 9. Improper overtaking, 11. Straying to the right, 13. Improper turning, 14. Crossing prohibited place, 15. Driving on the wrong side of the road, 16. Improper backing, 17. Vehicle defect, 19. Swerving, 20. Pedestrian violation, 22. Improper packing, 23. Improper towing, 24. Red light running, 25. Turning in no-turn zone, 26. Other
Collision type | 1. Collision with motorcycle/bicycle, 2. Two vehicle collision, 3. Multi vehicle collision, 4. Collision with pedestrian, 5. Collision with animal, 6. Fixed object collision, 7. Overturn, 8. Fire/explosion, 9. Motorcycle collision with ped/bicycle, 10. Two motorcycle collision, 11. Other
Location type | 1. Segment, 2. Intersection, 3. Bridge, 4. Tunnel, 5. Roundabout, 6. Other
Lighting condition | 1. Day light, 2. Dark, 3. Dusk/dawn
Weather condition | 1. Clear, 2. Fog, 3. Rain, 4. Snow, 5. Stormy, 6. Cloudy, 7. Dusty
Road surface condition | 1. Dry, 2. Wet, 3. Icy, 4. Gravel/sand, 5. Slush/mud, 6. Standing oil, 7. Other
Occurrence | 1. On roadway, 2. On shoulder, 3. In median, 4. On roadside, 5. Outside traffic way, 6. Other
Shoulder type | 1. None, 2. Stabilized gravel, 3. Paved

(a) Cases Nos. 8, 10, 12, 18 and 21 are not in the dataset.

3.2. Variable importance

The identification of the variables that have major importance in the prediction of the target variable is one of the most crucial steps in modeling. One of the outputs of the CART method is the variable importance index, on whose basis the relative importance of variable x_j is calculated through the following equation:

$$\mathrm{VIM}(x_j) = \sum_{t=1}^{T} \frac{n_t}{N}\, \Delta\mathrm{Gini}\big(S(x_j, t)\big) \tag{4}$$

where ΔGini(S(x_j, t)) is the reduction in the Gini index at node t that is achieved by splitting on variable x_j, n_t/N is the proportion of the observations in the dataset that belong to node t, T is the total number of nodes and N is the total number of observations. This value is calculated for all of the independent variables and is scaled such that its sum over all variables is one. The most important variable has the largest value.
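As a rough illustration of Eq. (4), the sketch below sums the weighted Gini reductions over the nodes that split on each variable and then rescales the totals so that they add up to one. The split records are made-up numbers, and the function name `variable_importance` is hypothetical; it is not the software used in this study.

```python
from collections import defaultdict


def variable_importance(splits, n_total):
    """Relative importance per Eq. (4), rescaled so the values sum to one.

    splits: iterable of (variable, n_t, delta_gini), one entry per split node.
    """
    raw = defaultdict(float)
    for variable, n_t, delta_gini in splits:
        # Weight each Gini reduction by the share of observations in the node.
        raw[variable] += (n_t / n_total) * delta_gini
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: score / total for name, score in raw.items()}


# Made-up split nodes: (splitting variable, observations in node, Gini reduction).
splits = [
    ("seatbelt", 1000, 0.10),
    ("cause_of_crash", 600, 0.05),
    ("cause_of_crash", 400, 0.025),
    ("weather", 200, 0.05),
]
vim = variable_importance(splits, n_total=1000)
for name, score in sorted(vim.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A variable that is never chosen as a splitter receives no contribution in this sketch, which is consistent with the zero entries that can appear in a scaled importance table.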

4. Data description

The data for the rural traffic crashes in Iran over the last 3 years (2006–2008) were collected from the Traffic Secretary of the Iran Traffic Police. Because the scope of the present study encompasses the entire network of main two-lane roads of Iran, the data pertaining to freeways, multi-lane highways, and minor roads were excluded. The Iranian traffic crash record form, known as KAM 114, contains important data concerning each traffic crash. It contains general characteristics of each crash, such as the time and place of the crash; the road type (e.g., freeway, multi-lane highway, etc.); collision type (e.g., multi-vehicle, collision with a fixed object, etc.); the cause of the crash (e.g., improper overtaking, speeding, etc.); the conditions of the road surface (e.g., dry, wet, etc.); shoulder type (e.g., none, paved, stabilized gravel); data regarding all the drivers involved, such as age, gender, driving license type, and injury severity (i.e., no-injury, injury, fatality); information concerning the pedestrians injured in the crash; and data relating to the injured occupants involved.

Because this study focused on the analysis of the factors affecting the injury severity of occupants (excluding drivers) involved in crashes, all of the crashes involving pedestrians and cyclists were excluded. The injury severity of the occupants involved was recorded in terms of three levels: light injury, serious injury, and fatality. A dataset with 21,025 records, in which each record represented a single injured occupant, was created by combining the dataset pertaining to the general characteristics of the crashes with the dataset related to the injured occupants. To prevent bias toward crashes with more injured occupants, crashes with more than one injured occupant were omitted, and as a result, a new dataset with 7241 records was created. It should be noted that data regarding occupants in the no-injury class are not recorded on the traffic crash record form. In statistical methods such as logit and ordered probit, random effects or Generalized Estimating Equations (GEE) are used to avoid bias due to multiple observations from the same crash, but there was no possibility of applying them in the CART model.

Table 1 shows the description of each variable. Eventually, ten independent variables and one target variable were defined and entered into the CART model. As Stewart (1996) suggested, part of the data should be assigned to training and another part to testing. In this study, seventy percent of the data were randomly assigned to train the model, and the remaining thirty percent were allocated to testing.
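The record filtering and the random 70/30 split described in this section can be sketched as follows. This is only an illustration under assumed names: the field names `crash_id` and `severity`, the helper functions, and the toy records are hypothetical and do not reflect the actual layout of the KAM 114 data.

```python
import random
from collections import Counter


def keep_single_occupant_crashes(records):
    """Drop every crash that has more than one injured-occupant record."""
    per_crash = Counter(r["crash_id"] for r in records)
    return [r for r in records if per_crash[r["crash_id"]] == 1]


def split_train_test(records, train_fraction=0.7, seed=0):
    """Randomly assign a fraction of the records to training, the rest to testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy data: crashes 0-4 each have two injured occupants, crashes 10-29 have one.
records = [
    {"crash_id": i // 2 if i < 10 else i, "severity": 1 + i % 3}
    for i in range(30)
]
single = keep_single_occupant_crashes(records)
train_set, test_set = split_train_test(single, train_fraction=0.7, seed=42)
print(len(records), len(single), len(train_set), len(test_set))
```

Fixing the seed makes the split reproducible, which matters because the trees grown from different random partitions can differ.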

5. Results and discussion

5.1. Reducing the multi-class problem to a series of two-class (binary) problems

As mentioned before, in this study, the target variable (injury severity) was placed in three classes (categories), but these classes were not balanced, and the number of serious injuries was seven times the number of fatalities (number of light injuries: 3171, serious injuries: 3556, fatalities: 514). In cases of imbalance among target variable classes, the prediction accuracy of the overall model or of each class is decreased. For example, after running a model with a three-class target variable, only 23.32% of the testing data actually belonging to the serious injury class were classified correctly (Table 2).

Table 2. Prediction accuracy of three-class model.

Class | Training (%) | Testing (%)
Light injury | 47.31 | 42.85
Serious injury | 23.63 | 23.32
Fatality | 68.10 | 54.86
Overall | 37.20 | 34.06

To increase the prediction accuracy in problems with multi-class target variables, it has been suggested to convert them to problems with two-class (binary) target variables (Allwein et al., 2001; Delen et al., 2006; Dissanayake and Lu, 2002; Tax and Duin, 2002), which can be done in two ways. In the first method, called one-vs-all (OVA), each class is compared to the others separately. Thus, for a problem with N classes, N problems with a binary classifier are created. In the second method, called all-vs-all (AVA), all possible binary combinations are created. In this case, for a problem with N classes, $\binom{N}{2}$ problems with binary target variables are generated. In this study, the target variable has three classes of increasing severity, so the OVA and AVA methods cannot be used because, as an example, placing fatalities and light injuries into one class and serious injuries in another is quite meaningless. Therefore, to reduce these classes, some researchers have suggested performing this task in such a way that lower injury severity levels are placed versus higher ones (Delen et al., 2006; Dissanayake and Lu, 2002). Fig. 4 shows graphically the target variable combinations for the four models examined in this analysis. In Model 1.1, for example, the data whose target variables are serious and light injuries are placed together in one class with label 0, and fatalities are placed in another one with label 1. In Model 1.2, fatality data are not entered into the CART classification model at all, and only light and serious injuries are compared.

Fig. 4. Graphical representation of two-class model configurations.

5.2. Prediction accuracy of models

The CART method was employed to analyze all four binary models. The Gini index was used for tree growth. Prior probabilities, p(j), were set to be equal for all models. The prior probability shows the proportion of each class in the population, but in cases where the proportion of one class is much greater than that of another class (such as Model 1.1, in which Class Label 0 is approximately thirteen times Class Label 1), if the prior probabilities are also set on the basis of the proportion of each class in the training data, the resulting model will assign all of the data to the dominant class, and thus the overall accuracy of the model is increased. In studies related to crash severity or injury severity, because the proportion of fatality data is generally smaller than that of property-damage-only or injury data, the prediction accuracy of the fatality class decreases. To solve this problem, in cases where the levels of the target variable have unbalanced proportions but the same importance in terms of prediction accuracy, it has been suggested to set equal prior probabilities so that the classes with a lower proportion are also taken into consideration in predictions (Steinberg and Golovnya, 2007). Although the overall accuracy of the model decreases, the prediction accuracy of the class with the smallest proportion increases, which is more important for decision makers in most cases.

Table 3 shows the prediction accuracy of the models for the training and testing data and also for the overall model. The results show that, in all models, the overall testing accuracy improved by between 10.37 and 25 percentage points with respect to the three-class model (Table 2).
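To make the role of the prior probabilities concrete, the sketch below evaluates Eqs. (1) and (2) for a single node of a Model 1.1 style problem, once with priors proportional to the class sizes and once with equal priors. The class totals are the full-data counts quoted in Section 5.1 (the training-set counts are not reported), the node counts are invented for illustration, and the function names are hypothetical.

```python
def node_class_probabilities(node_counts, root_counts, priors):
    """Eq. (2): p(j, m) = p(j) * N_j(m) / N_j, then p(j | m) = p(j, m) / p(m)."""
    joint = [p * n_m / n for p, n_m, n in zip(priors, node_counts, root_counts)]
    p_m = sum(joint)
    return [value / p_m for value in joint]


def gini(cond_probs):
    """Eq. (1): Gini(m) = 1 - sum_j p(j | m)^2."""
    return 1.0 - sum(p * p for p in cond_probs)


# Model 1.1 class sizes from the text: label 0 = light + serious, label 1 = fatality.
root_counts = [3171 + 3556, 514]
total = sum(root_counts)

# Hypothetical node (not taken from the trees of this study).
node_counts = [300, 60]

proportional = node_class_probabilities(
    node_counts, root_counts, [n / total for n in root_counts]
)
equal = node_class_probabilities(node_counts, root_counts, [0.5, 0.5])

print(f"proportional priors: p(1|m) = {proportional[1]:.3f}, Gini = {gini(proportional):.3f}")
print(f"equal priors:        p(1|m) = {equal[1]:.3f}, Gini = {gini(equal):.3f}")
```

With proportional priors the node's fatality probability is simply its raw share of fatalities (60 of 360), so the node is labeled 0. With equal priors the same node gives a fatality probability of about 0.72, so it is labeled 1, which is how equal priors let the rare class influence the predictions.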

Table 3. Prediction accuracy of models with binary class labels.

Model | Data | Class Label 0 (%) | Class Label 1 (%) | Overall (%)
Model 1.1 | Testing | 56.80 | 58.33 | 56.91
Model 1.1 | Training | 57.56 | 67.83 | 58.32
Model 1.2 | Testing | 58.12 | 59.88 | 59.06
Model 1.2 | Training | 59.02 | 62.64 | 60.94
Model 2.1 | Testing | 57.55 | 57.62 | 57.59
Model 2.1 | Training | 62.10 | 59.64 | 60.72
Model 2.2 | Testing | 40.70 | 69.51 | 44.43
Model 2.2 | Training | 44.56 | 72.28 | 48.02

5.3. Variable importance

The relative importance of variables for all four of the binary models was calculated and is presented in Table 4. In all of the models, the cause of the crash is one of the first two variables in terms of importance. This result confirms the study of Al-Ghamdi (2002) in Saudi Arabia, where the cause of the crash was an important factor in increasing the crash severity. Studying the decision trees resulting from the models reveals that, in all cases when the cause of crash is improper overtaking, the related branch of the tree predicts more severe crashes (Class Label 1). Seatbelt use is another important variable. In Models 1.1–2.1, it is among the first two important variables. Looking at the trees created, when an occupant did not use a seatbelt, the probability of his/her sustaining a more severe injury is higher. This result has also been shown in many previous studies (Bedard et al., 2002; Delen et al., 2006; Kweon and Kockelman, 2003; Sohn and Shin, 2001; Valent et al., 2002). Unfortunately, using a seatbelt is not mandatory for back seat occupants of a vehicle in Iran. This study reveals the importance of seatbelts for these occupants as well.

Table 4. Relative importance of variables.

Rank | Model 1.1 | Model 1.2 | Model 2.1 | Model 2.2
1 | Cause of crash (0.61) | Seatbelt (0.59) | Seatbelt (0.53) | Gender (0.31)
2 | Seatbelt (0.11) | Cause of crash (0.2) | Cause of crash (0.23) | Cause of crash (0.26)
3 | Road surface condition (0.07) | Weather condition (0.06) | Road surface condition (0.06) | Occurrence (0.14)
4 | Location type (0.07) | Road surface condition (0.05) | Collision type (0.06) | Location type (0.14)
5 | Collision type (0.07) | Shoulder type (0.04) | Weather condition (0.04) | Collision type (0.14)
6 | Weather condition (0.07) | Collision type (0.03) | Lighting condition (0.03) | Weather condition (0)
7 | Lighting condition (0) | Occurrence (0.02) | Shoulder type (0.02) | Road surface condition (0)
8 | Occurrence (0) | Location type (0.02) | Occurrence (0.02) | Seatbelt (0)
9 | Gender (0) | Gender (0) | Location type (0.02) | Shoulder type (0)
10 | Shoulder type (0) | Lighting condition (0) | Gender (0) | Lighting condition (0)

The second determining variable was the cause of the crash. This study shows that, for main two-lane, two-way rural roads, improper overtaking is one of the most important factors in increasing the severity of injury. Because a driver has to use the opposite lane for overtaking on such roads, the consequence is usually a severe crash with many casualties. The construction of passing lanes at required places and special attention to police enforcement by mobile patrol vehicles are some relatively less costly measures to lower the risk of improper overtaking.

The patrol police report the injuries from a crash as they observe them at the site, but many of those who have been reported as injured die at or on the way to the hospital. Because the present data bank was taken from the police, it can be concluded that, of the four models above, Models 1.2 and 2.1 are more important because they show the borderline between the life and death of an injured occupant. As shown in Table 4, the order of importance of the first two variables is the same for these models, which shows their conceptual similarity. However, Model 2.2 shows some differences compared to the others. The target variable classes of this model are severe injury and fatality, and because most of the severely injured occupants die on the way to the hospital, there is not a distinct differentiation between its target variable classes. Therefore, its prediction accuracy is very low (44.43% for the testing data, Table 3), and in addition, the important variables identified are different from those identified by the other models.

5.4. Decision tree

One of the beneficial characteristics of a decision tree compared to other modeling methods is that it gives decision makers some rules to answer "if-then" questions efficiently. Here only the decision tree of Model 1.2, which is displayed in Fig. 5, is presented. The interpretation of the tree is as follows.

Node 0, which is the first and the root node, is divided into 2 child nodes based on the variable of seatbelt usage. It indicates that the best variable to classify and predict the injury severity in a traffic crash is seatbelt usage. Node 2, on the right side, contains the data related to using seatbelts or unknown conditions of using seatbelts. On the right branch of the tree, there are four terminal nodes (nodes 5, 9, 13 and 14). In all of these terminal nodes, except for node 14, the occupants are predicted to receive light injuries (Class Label 0), not severe injuries (Class Label 1), which implies that, if an occupant uses safety equipment, he/she will more probably sustain light injuries, regardless of any other variables. Based on the cause of the crash variable, node 2 is split into node 6 and terminal node 5. Terminal node 5 shows that, in the case of using seatbelts, if a crash takes place due to crash causes Nos. 1, 3, 5, 14, 15, 16, 19, 23 and 25, there is a 68% probability that the occupant will sustain a light injury (Class Label 0). Node 6 is further split into node 10 and terminal node 9 on the basis of the weather condition variable; terminal node 9 predicts that light injury is more probable than serious injury in rainy, stormy and dusty weather. The last terminal nodes of the right branch, 13 and 14, also show that, if seatbelts are used, the occupants are predicted to sustain light injuries.
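The "if-then" reading of the right branch can be written down directly. The sketch below encodes only the single rule that the text states in full (terminal node 5), using the variable codes of Table 1; the function name is hypothetical, and the deeper nodes of the branch are deliberately left unencoded because their split values are not given completely in the text.

```python
# Seatbelt codes from Table 1.
SEATBELT_USED, SEATBELT_NOT_USED, SEATBELT_UNKNOWN = 1, 2, 3

# Cause-of-crash codes that lead to terminal node 5 according to the text.
NODE5_CAUSES = {1, 3, 5, 14, 15, 16, 19, 23, 25}


def right_branch_rule(seatbelt, cause):
    """Return (node, class label, probability) for the terminal node 5 rule, else None.

    None means the record either follows the left branch (seatbelt not used)
    or falls through to deeper right-branch nodes that are not encoded here.
    """
    if seatbelt in (SEATBELT_USED, SEATBELT_UNKNOWN) and cause in NODE5_CAUSES:
        return ("terminal node 5", 0, 0.68)  # Class Label 0 = light injury
    return None


print(right_branch_rule(SEATBELT_USED, 3))      # ignoring right of way, belt used
print(right_branch_rule(SEATBELT_NOT_USED, 3))  # same cause, belt not used
```

Each root-to-leaf path of a classification tree corresponds to one such conjunction of conditions, which is what makes the model easy to communicate to decision makers.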

Fig. 5. Decision tree for Model 1.2.


On the left branch of the tree, node 1 is divided into node 4 and terminal node 3 on the basis of the cause of the crash. As shown by terminal node 3, in the case of not using seatbelts, a crash will probably result in serious injuries to the occupants if it takes place due to following too closely, inattention ahead, failure of vehicle control, speeding, improper overtaking, improper backing, vehicle defect, swerving or improper towing. Because this study is focused on two-lane two-way rural roads, improper overtaking can result in serious injuries (as confirmed in terminal node 3). Terminal node 7, created from node 4 by considering the shoulder type, indicates that, if the shoulder is paved, only light injuries are likely. Finally, in terminal nodes 11 and 12, which are generated by the weather conditions, clear, snowy and foggy weather are associated with serious injuries and other conditions are associated with light injuries. In conclusion, what is important from the traffic safety viewpoint is paying attention to the underlying causes of situations in which serious injuries happen. In the tree in Fig. 5, node 3 is of greater importance because it reveals the considerable significance of the simple but critical use of seatbelts and avoiding improper overtaking on such roads. The cause of crash variable has 26 states. It is initially unknown which of these states leads to more severe injuries. After creating the tree, it was found that, among these cases, improper overtaking has a significant role in increasing the severity. From the beginning, improper overtaking could not be included as an independent variable because it was one of the states of the cause of crash variable. 6. Conclusion The CART model can be easily understood and interpreted because of the graphical nature of its results. The tree structure of this model makes it possible to answer ‘‘if-then’’ questions easily. It can also show the complex relations between the variables. 
Traffic injury data often show strong correlation among the independent variables (e.g., weather conditions, pavement conditions, driver/vehicle action and collision type). When CART analysis is applied, however, correlation between independent variables is not a great concern. The model is practical and efficient when there are sufficient data with many independent variables, and it can easily find the important variables.

CART also has disadvantages. Tree models are often unstable: the tree that is grown depends on the random seed number, so different runs do not necessarily produce the same tree, and the outcomes can vary. In this research, the variation was small, and the deduction that "improper overtaking has a significant role in increasing the severity" remained the same.

This study, carried out with the CART model, showed that not using a seatbelt is the most important factor in increasing the injury severity of occupants. It indicates that, in developing countries such as Iran, which have a high rate of traffic crash fatalities owing to a weak culture of seatbelt use, not using a seatbelt is still the most important factor in increasing injury severity. Therefore, it is necessary to make the use of seatbelts mandatory for back seat occupants in Iran (presently, it is only mandatory for drivers and front seat occupants).

The cause of the crash was found to be the second most important variable. Because this study pertained to two-lane, two-way rural roads, this result was not unexpected. Constructing passing lanes and intensifying law enforcement by the police seem necessary to reduce improper overtaking maneuvers on such roads.

Acknowledgments

The authors gratefully acknowledge Mr. Sabbaq and Mr. Mishani from the Traffic Secretary of the Iran Traffic Police for providing the crash data.
