Accident Analysis and Prevention 31 (1999) 705 – 718 www.elsevier.com/locate/aap
An analysis of urban collisions using an artificial intelligence model Lorenzo Mussone *, Andrea Ferrari, Marcello Oneta Department of Transport Systems and Mobility, Polytechnic of Milan, Piazza Leonardo da Vinci 32, I-20133 Milan, Italy Received 20 July 1998; received in revised form 23 February 1999; accepted 2 March 1999
Abstract Traditional studies on road accidents estimate the effect of variables (such as vehicular flows, road geometry, vehicular characteristics), and the calculation of the number of accidents. A descriptive statistical analysis of the accidents (those used in the model) over the period 1992–1995 is proposed. The paper describes an alternative method based on the use of artificial neural networks (ANN) in order to work out a model that relates to the analysis of vehicular accidents in Milan. The degree of danger of urban intersections using different scenarios is quantified by the ANN model. Methodology is the first result, which allows us to tackle the modelling of urban vehicular accidents by the innovative use of ANN. Other results deal with model outputs: intersection complexity may determine a higher accident index depending on the regulation of intersection. The highest index for running over of pedestrian occurs at non-signalised intersections at night-time. © 1999 Elsevier Science Ltd. All rights reserved. Keywords: Artificial neural networks; Accidents; Intersections; Safety; Urban environment; Models
1. Introduction A widely used approach in describing accident phenomena involves the creation and calibration of a functional relationship capable of treating the above-mentioned variables. The model is calibrated by samples of accidents which determine both its structure and its forecasting capabilities. The latter is a fundamental property which a model must possess, since intervention policies aimed at increasing road safety depend on the forecasting capabilities. The artificial neural network (ANN) model in this paper follows up on an approach which we have used before (Mussone and Rinelli, 1996; Ferrari et al., 1997). Though similar to traditional analyses of accidents, the ANN has different capabilities and characteristics. The making of accurate predictions depends on the collection and manipulation of accident data and on the choice of variables used in the model structure. Instead of working out an analytical functional relationship, usually a complex and laborious task, the neural network model reconstructs, simply by learning from the real data of accidents, the weights by which each variable must be weighted. The study focuses only on accidents that occurred at intersections and not on links, which would require a different approach. The data, therefore, have been disaggregated in order to have a clear view of the kind and number of accidents at different types of intersections. We examine accidents on links only superficially. In fact, the aim of this paper is to identify the most significant parameters that determine the possibility of an accident occurring at an intersection. Attention has been paid to the identification of possible causes of accidents: roadway conditions; visibility; weather; and the characteristics of vehicles and drivers. A descriptive analysis enables us to clarify the distribution of the values of the variables noted by the authorities, but it does not enable us to say anything about the presumed causes of accidents or the influence of the various factors involved; in general, it cannot single out any relationships between the causes and effects of accidents. For this purpose it is necessary to use mathematical models in order to quantify the degree of danger linked to particular flow and environmental conditions.
* Corresponding author. Tel.: +39-2-2399-6612; fax: +39-2-23996606. E-mail address: [email protected] (L. Mussone)
0001-4575/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0001-4575(99)00031-7
L. Mussone et al. / Accident Analysis and Prevention 31 (1999) 705–718
The paper uses a database of accidents that occurred in the city of Milan. The first part contains a descriptive statistical analysis that delineates the principal characteristics of the 83 000 accidents recorded by the police over the period from 1992 to 1995. We analyse the principal circumstances and causes of accidents. Almost 46 000 accidents occurred at intersections and for each of these the police furnished the circumstances and characteristics of the crashes including: the environmental variables present at the scene of the accident (meteorology, visibility, conditions of the road surface); crash variables (type of accident, type of vehicles involved, human factors, etc.); and geometrical variables (presence of traffic lights and night-time signalization, presence of horizontal road markings and vertical signs, type of road surface, vehicular flow, number and characteristics of conflict points).
2. Accident models The generalised linear model (GLM), based on multiple linear regression, is capable of offering an appropriate method in the study of accidents: in England these models have been recently used by the Transport Research Laboratory in studies about road accidents at intersections in urban areas (Maher and Summersgill, 1996) and by the Foundation for Road Safety Research as reported by Graham (1996), on rural and highway roads and intersections. The application of the GLM revealed certain technical problems related to the use of data with a non homogenous distribution: a low mean value and a high variance are the main problems to be faced. Therefore data must be carefully evaluated and corrected in order to make results more reliable and substantial. Statistical problems arise when the intercorrelation between the variables is great or when the results of the model furnish low to middle values in comparison to the real values used for the parameters. In such cases, the model generally contains a greater error than desirable. In order to decrease the error it is necessary to reduce the number of variables used and to eliminate those that are less meaningful. It must simply be remembered that the desegregation of data, such as when the data are divided on the basis of accident type, flow on each segment leading to an intersection and the time period, can cause the increase of statistical uncertainty linked to sparse events. An analysis of these problems has led to different development and changes of the basic methodology of GLM. Maycock and Hall (1984), for example, show how a negative binomial model can be used, instead of the Poisson model, to study the distribution of accidents in a fixed time period at a certain intersection.
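The problem of a low mean combined with a high variance can be checked numerically before choosing between a Poisson and a negative binomial model. The following is a minimal sketch, not the procedure of the cited studies: the counts are hypothetical, and the moment-based shape estimate assumes the common parameterisation in which the negative binomial variance is m + m^2/k.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, above 1 if overdispersed."""
    return pvariance(counts) / mean(counts)


def nb_shape_moments(counts):
    """Method-of-moments shape k, assuming variance = m + m**2 / k."""
    m = mean(counts)
    v = pvariance(counts)
    if v <= m:
        return None  # no overdispersion detected: a Poisson model may suffice
    return m * m / (v - m)


# Hypothetical accident counts at ten intersections (illustrative only).
counts = [0, 0, 1, 0, 2, 0, 7, 1, 0, 9]
ratio = dispersion_index(counts)
shape = nb_shape_moments(counts)
print(f"variance/mean = {ratio:.2f}, estimated k = {shape:.3f}")
```

A ratio well above 1, as in this made-up sample, is the situation in which a negative binomial distribution is usually preferred to the Poisson one.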
In the last ten years the interrelation between accidents, the volume of traffic and the geometry of the road has been the object of many studies. Zeeger et al. (1990), Okamoto and Koshi (1989), Joshua and Garber (1990), Miaou and Hu (1992) and Shankar et al. (1995) have, for instance, studied the incidence of accidents on road segments using linear regression to establish the relations between the type of accident, the vehicle and the geometric elements of the road (width of the road centre, curvature radius). The researchers at the Transport Research Laboratory have also considered the nature of accidents at the different types of urban and non-urban road intersections. Recent applications use multiple linear regression and a Poisson distribution to represent the frequency of accidents within a certain period and at a particular place. For instance, the relationship between frequency of accidents, environmental variables and route length has been modelled (Jovanis and Chang, 1986). The effects of the physical conditions of the drivers of heavy vehicles in relation to accidents (Dionne et al., 1993) have also been studied. The same forms of relationship derived from the formulation of Poisson models can be developed by using the technique of generalised linear models (Miaou and Hu, 1992). Log-linear models, by means of an analysis of contingency tables, are able to work out a mathematical model of the relationships that occur between data. Tables can be n-dimensional, and this technique allows us to discover both the dependence or independence of a variable or group of variables and second-order relationships. The log-linear formulation, when the contingency table has two dimensions, is a simple linear equation relating the observed frequency for two variables to the average level of the process. So it is possible to single out the combination of variables which best characterises the process. 
By increasing the dimension of the contingency tables the complexity of the formulation increases, therefore less significant terms can be cancelled from the model. Applications of this method are in Salvatore (1992) for studying accidents on motorways and Kim et al. (1995a,b) for singling out the relationships between different types of accidents and between driver behaviour and the seriousness of the accident.
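For reference, the two-dimensional log-linear formulation mentioned above is usually written in the following textbook form. The notation is an assumption introduced here for illustration and is not taken from the cited studies: m_{ij} is the expected frequency in cell (i, j) of the contingency table, \mu is the average level of the process, \lambda_i^{A} and \lambda_j^{B} are the effects of the two variables, and \lambda_{ij}^{AB} is their interaction.

```latex
\[
\log m_{ij} = \mu + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}
\]
```

Dropping the interaction term \lambda_{ij}^{AB} gives the model of independence between the two variables, which is the sense in which less significant terms can be cancelled as the table dimension grows.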
3. Data The data used have been furnished by the police of Milan and include all of the accidents that occurred in the city in the years from 1992 to 1995. Data have been organised and used by the Informatic Accident Section of the Urban Police, and are stored in ten files (integrated by 27 tables of codes and descriptions). The database contains the following types of information:
1. Incident, contains the only key to which the other files can be connected. It holds a description of the dynamics of the accident and descriptions of the road, environmental conditions and traffic;
2. Luogoi, contains a precise description of the place where the accident occurred;
3. Datiper, describes, for each vehicle involved in the accident, personal data of passengers, correct use of safety devices and any injuries suffered by passengers or pedestrians;
4. Archferit, describes the severity of injuries and hospitalisation for each person involved;
5. Arcdocum, describes the identity of the people involved;
6. Inc-veic, describes in detail the characteristics of the vehicles involved;
7. Incveico, reports insurance policies, the final position of the vehicles involved and the damage to them;
8. Infrazio, describes possible traffic violations for the driver of each vehicle;
9. Frenata, describes traces of braking or skid marks, if any, including their shape, length and intensity;
10. Datostat, contains information relating to the filling out of an accident card for ISTAT (National Institute of Statistics).
The files can be linked to one another by using the primary key, the progressive number of the vehicles involved and the code number of the people other than the driver (seated either in the front or in the back). For each accident, there are 112 descriptive items. In this paper we used only part of this information: we chose the items relevant to the causes of the accident rather than those concerning the consequences. To simplify matters, the various relationships (worked out in the DBASEIII PLUS® environment) were compacted into a single database called TABTOT. The purpose of TABTOT is to provide a database with all the necessary information relating to an accident on a single line, without having to search different files. 
In order to construct the table TABTOT, the authors started from the relationship Incident, selecting those fields whose function is explained in Table 1. In the absence of a sketch of the accident, which was not always available in our studies, we considered only two vehicles in collision in which at least one violation was attributed. In this way, unless the violation report identifies the true cause of the accident and the guilty driver (e.g. traffic light violation), we can reconstruct the probable cause of the accident from the accident type and dynamics, if any; otherwise the record is discarded. In file 8 (Infrazio), three fields, for both vehicle 1 and vehicle 2, describe (if notified) the violation committed.
The fields NIncidents and AI (accident index) were introduced and calculated with the purpose of being able to identify a classification of the most dangerous intersections. The creation of this index is related to the necessity of having an explicit relationship in order to compare intersection features. It gives an indicative value and percentage of the degree of danger at every intersection in relation to the most dangerous one over a period of four years:
\[ \mathrm{AI} = \frac{N_i}{N_{\max}} \tag{1} \]
where N_i is the number of accidents in the i-th intersection and N_max the number of accidents at the most dangerous intersection.
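Eq. (1) can be illustrated with a few lines of code. This is a minimal sketch: the intersection labels and accident counts are hypothetical, and the function name is introduced here only for illustration.

```python
def accident_index(accidents_by_intersection):
    """Return AI = N_i / N_max for every intersection, as in Eq. (1)."""
    n_max = max(accidents_by_intersection.values())
    if n_max == 0:
        raise ValueError("no accidents recorded, AI is undefined")
    return {name: n / n_max for name, n in accidents_by_intersection.items()}


# Hypothetical four-year accident counts (illustrative only).
counts = {"intersection A": 35, "intersection B": 70, "intersection C": 7}
ai = accident_index(counts)
for name, value in sorted(ai.items(), key=lambda item: -item[1]):
    print(f"{name}: AI = {value:.2f}")
```

By construction the most dangerous intersection has AI = 1, and every other intersection is expressed as a fraction of it.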
4. Data transformation In this initial phase of the study, after we had counted the parameters relative to accidents on road segments and at intersections, we focused on the geographical distribution of the statistically most dangerous intersections and developed a general investigation that underlines the influence and the presence of the various effects assembled in TABTOT. In order to separate the accidents on links from those at intersections, the database was filtered (by using the ACCESS® software). Among the more than 80 000 accidents in the database, only certain records were selected. These concern the fields PARTSTRD and LOC2 present in the file TABTOT. The field PARTSTRD provides a number which indicates the localisation at which the accident occurred; this number is translated through the chart TBPARTIC that furnishes the translation codes (Table 2). Among all records, for the purpose of this work, only those with code PARTSTRD equal to: 1, 2, 3, 4, 104, 5 were selected. The fields that compose the final database TABTOT are contained in Table 1. Further investigations were conducted in a restricted area of Milan; effects relative, for instance, to the traffic flow measured in vehicles per hour, or to the characteristics of circulation at every intersection, were added to the preceding information. Flow values are calculated by an assignment algorithm by using O/D matrices for the Milan network worked out each year by ATM, the Milan Transport Company. These matrices have an hourly resolution and allow us to reconstruct hourly average flow on a single road or path. Other variables such as the width of a road or the number of lanes could be included in the NN model (and probably will in subsequent research) in order to study their effects on some types of accidents (for example on pedestrian risk of injury) but in this initial approach the selected variables can be considered a good subset of evaluation; observe that the conflict
points can be used as an indirect measure of certain geometrical road variables.
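The separation of intersection accidents from link accidents described in this section can be sketched as a simple filter on the PARTSTRD code. The original work used ACCESS® queries; the version below is only an illustrative stand-in with hypothetical records, and it ignores the LOC2 field that the text also mentions.

```python
# PARTSTRD codes selected in the text as intersection locations.
INTERSECTION_CODES = {1, 2, 3, 4, 104, 5}


def split_by_location(records):
    """Separate accident records at intersections from those on links."""
    at_intersections = [r for r in records if r["PARTSTRD"] in INTERSECTION_CODES]
    on_links = [r for r in records if r["PARTSTRD"] not in INTERSECTION_CODES]
    return at_intersections, on_links


# Hypothetical records; only the fields needed for the filter are shown.
records = [
    {"PROTOCOLLO": 101, "PARTSTRD": 3},    # signalised intersection
    {"PROTOCOLLO": 102, "PARTSTRD": 7},    # rectilinear stretch (link)
    {"PROTOCOLLO": 103, "PARTSTRD": 104},  # intersection with traffic warden
    {"PROTOCOLLO": 104, "PARTSTRD": 8},    # curve with good visibility (link)
]
at_intersections, on_links = split_by_location(records)
print(len(at_intersections), "at intersections,", len(on_links), "on links")
```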
5. Descriptive statistics of the accidents The first phase in the analysis of road accidents consists of an analysis that uses descriptive statistics. The principal variables are analysed by taking into consideration different combinations of each. In this phase, the exact localisation of accidents in order to determine dangerous spots is of fundamental importance. The extrapolation of data relating to the whole sample of accidents was carried out through procedures of ‘filtering’, ‘arrangement’ and, in general, instructions in SQL language by Access® programmes.
The geographical distribution of the 263 intersections at which 30 or more accidents occurred over the 4 years (30 being a large enough number to be statistically significant) does not reveal a particularly critical zone, but accidents occurred more frequently at intersections belonging to the most important radial arteries of penetration into the city, as well as at the external ring road. The total number of accidents on the whole road network amounts to 83 335; they are distributed linearly over the years, with a constant rate of growth of about 540 cases per year. The average number of accidents is about 20 834 per year. Of the total, 37 363 cases (45%) occurred on links, while the remaining 45 972 cases (55%) occurred at intersections. The number of persons involved increases year by year;
Table 1 The fields of database TABTOT
Field number | Field name | Field type | Digits | Description
1 | ANNO | Numerical | 2 | Year in which the accident occurred
2 | PROTOCOLLO | Numerical | 6 | Number of the protocol of the accident
3 | DATAINC | Date | 8 | Date of the accident
4 | TSFESTIVSN | Character | 1 | s, working day; n, not working day
5 | ORAINCIDEN | Numerical | 5 | Time of the accident
6 | LOC1 | Numerical | 5 | Numerical code of the first street that composes the intersection
7 | LOC2 | Numerical | 5 | Numerical code of the second street that composes the intersection
8 | INCROCIO | Character | 10 | Numerical code obtained from the couple LOC1 and LOC2
9 | NOME | Character | 60 | Name of the intersection
10 | CDSEMAFO | Numerical | 1 | Traffic lights flag. 0, not operative; 1, operative; 2, off; 3, lighting; 4, bad operating
11 | NATURINC | Numerical | 3 | Type of accident
12 | SNATURINC | Character | 60 | Description of type of accident
13 | PARTSTRD | Numerical | 3 | Type of road
14 | TIPOSTRD | Numerical | 3 | Number of lanes
15 | PAVIMENT | Numerical | 3 | Type of road bed
16 | CONDSTR | Numerical | 3 | Type of signalisation
17 | FONDSTRD | Numerical | 3 | Condition of paving (slippery, dry...)
18 | CONDATMO | Numerical | 3 | Atmospheric conditions
19 | CONDTRAF | Numerical | 3 | Traffic volume [veh./h]
20 | VEIC1 | Character | 1 | Presence of a first vehicle that has committed an infringement
21 | ETA1 | Date | 8 | Age of driver of vehicle 1
22 | TIPOVEIC1 | Numerical | 3 | Type of vehicle
23 | CILINDRAT1 | Numerical | 4 | Capacity of vehicle 1 in cc.
24 | IMMATRICO1 | Numerical | 2 | Age of vehicle 1
25 | VEIC2 | Character | 1 | Presence of a second vehicle that has committed an infringement
26 | ETA2 | Date | 8 | Age of driver of vehicle 2
27 | TIPOVEIC2 | Numerical | 3 | Type of vehicle
28 | CILINDRAT2 | Numerical | 4 | Capacity of vehicle 2 in cc.
29 | IMMATRICO2 | Numerical | 2 | Age of vehicle 2
30 | CODICE1 | Numerical | 1 | Violation of vehicle 1
31 | ARTICOE1 | Numerical | 4 | Article of violation (veh. 1)
32 | COMMAE1 | Character | 2 | Paragraph of article (veh. 1)
33 | CODICE2 | Numerical | 1 | Violation of vehicle 2
34 | ARTICOE2 | Numerical | 4 | Article of violation (veh. 2)
35 | COMMAE2 | Character | 2 | Paragraph of article (veh. 2)
36 | N.INCIDENTI | Numerical | 3 | The total number of accidents that occurred at that intersection over the four years considered
37 | AI | Numerical | 3 | Accident index (see Eq. (1))
Table 2 The field TBPARTIC and its codes for accident localisation
Numeric code pointed out by PARTSTRD | Description in TBPARTIC
1 | Intersection
2 | Roundabout
3 | Signalised intersection
4 | Intersection regulated by flashing yellow lamp
104 | Intersection regulated by traffic warden
5 | Not regulated/not signalised intersection
6 | Passage to attended level crossing
106 | Passage to unattended level crossing
7 | Rectilinear stretch
8 | Curve with good visibility
108 | Curve without visibility
9 | Artificial hillock
109 | Bump
209 | Narrow passage
309 | Bridge
10 | Slope
11 | Illuminated gallery
111 | Illuminated subway
12 | Not illuminated gallery
112 | Not illuminated subway
from 58 636, in 1992, to 63 007 in 1995, with an increase of 1092 persons involved each year. The number of fatal accidents decreases in accordance with the trend observed in Europe. About 33% of accidents occur at night and the remaining 67% in the daytime. This shows that the
number of night-time accidents is not negligible in comparison to those that occurred in the daytime. The most frequent accident types at intersections are frontal/lateral (56% of the cases), collision (15% of the cases), side (11% of the cases) and accidents involving pedestrians (6% of the cases).
6. The model
6.1. The area of investigation The area selected for this investigation of accidents at intersections is located just outside the historical inner centre of Milan (Fig. 1). We selected this area due to its particular road geometry. With a few exceptions, intersections have four arms at right angles. This sub-area was chosen rather than the whole city to simplify the next phase, which is designed to enrich the accident database with further information concerning the specific geometrical and control characteristics of intersections. This task would have been difficult to perform (given the resources at our disposal) over the whole urban area. We define a conflict point as the area inside the intersection where traffic flows coming from different directions intersect: it is called virtual if traffic light signalisation is active and cancels the conflict; it is called real if the conflict is regulated only by a give-way sign. For all the 217 conflict points that compose the area under investigation (an intersection generally has more than one conflict point) new variables are added
Fig. 1. The urban area where data were collected.
Fig. 2. Example of representation of conflict points in a complex intersection.
to the TABTOT database, describing the site where the accident occurred with special regard to traffic flow values over the day and the number of conflict points involved. It must be noted that in signalised intersections flow is generally greater than in not signalised intersections and consequently the accident index could be higher. This fact doesn’t affect the significance of accident index because the NN model is capable, through its non linear approximation capability, of identifying directly this relationship by means of the data. Conflict points were identified directly in the field. The virtual points are considered real if, at night, traffic lights are not active (only flashing yellow). Types of conflict points considered in the calculation are: merging (or converging), diverging and intersecting. The first two do not represent a real conflict if the signals prevent or
reduce the likelihood of collisions. The number of conflict points indirectly provides a measure of the area as well as data on the complexity of the intersection. In fact, the number of conflict points is based on the total parallel flow lines that cross the intersection. In Fig. 2 an example of intersection layout can be seen with the conflict points marked. A more complete analysis would estimate the conflict points relative to possible manoeuvres not permitted by horizontal and vertical signals. Despite the fact that it was observed how frequent is the non observance of these dispositions and the subsequent situations of potential danger, it was decided not to count the conflict points linked to such manoeuvres since their wide variety would have considerably complicated both the survey and the subsequent analyses.
There are several assumptions regarding the number of the lines of vehicles turning at intersections and in circulation on the main roads. It is assumed that a single line of vehicles turns 90° at an intersection. Other criteria were used to improve the database which is the basis on which the neural net model is elaborated. The purpose was to integrate the information contained in TABTOT, with the field survey of flow circulation and complexity of intersections.
6.2. The NN methodology The methodology used to build the model is based on artificial neural networks. We used feed-forward neural networks with a back-propagation learning paradigm. This type of neural network is well known and has been described in Cybenko (1989) paper and in other papers (Hornik et al., 1990; Hornik, 1991; Girosi and Poggio, 1991; Leshno et al., 1993). These papers describe the ANNs capability of approximating with minimum error any function belonging to L2 space (the Lebesgue two space). Applications regarding transport, planning and control fields are numerous (Dougherty, 1995; Mussone, 1995). This approximation capability is a fascinating tool which does not need any a priori assumption on relationships between variables, linear or non linear. In problems where the phenomena are not well known and an analytical approach could be time consuming, NN offers the opportunity to investigate and create the first discriminant analysis. Essentially the methodology in building an ANN model consists of four phases. The first is a random extraction from the entire data set of a subset. The number may vary from a few 100 to 1000 cases. The subset must satisfy the law of large numbers. A very large subset may slow down the learning procedure. When data are grouped into classes, the model should recognise a particular classification pattern. Data extraction moreover should be as homogenous as possible for each class both for train and test set and an equal number of cases should be randomly extracted for each class. If not, because of the too low number of cases used which don’t cover all possible combinations, ANN performance (evaluated on the test set) can worsen; this may happen when many variables are used and data are not homogeneously distributed. Therefore a lesser number of input variables can improve data extraction and case distribution between test and train set. 
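The balanced extraction described above, with an equal number of randomly drawn cases per class shared between the training and test sets, can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function name, the `label_of` callback, the fixed seed and the toy cases are all assumptions introduced here.

```python
import random
from collections import defaultdict


def balanced_split(cases, label_of, per_class, seed=0):
    """Draw per_class random cases from each class and share them
    between a training set and a test set, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for case in cases:
        by_class[label_of(case)].append(case)

    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} cases")
        drawn = rng.sample(members, per_class)  # sampling without replacement
        half = per_class // 2
        train.extend(drawn[:half])
        test.extend(drawn[half:])
    return train, test


# Toy data: 10 hypothetical cases for each of three accident classes.
cases = [{"id": i, "crash": c}
         for i, c in enumerate(["frontal", "side", "pedestrian"] * 10)]
train, test = balanced_split(cases, lambda case: case["crash"], per_class=4)
print(len(train), "training cases,", len(test), "test cases")
```

Splitting inside each class, rather than shuffling the whole sample once, keeps the class proportions identical in the two sets, which is one simple way to meet the homogeneity requirement stated in the text.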
In the second phase the data subset is randomly subdivided into two sets, the training and test sets, which will be used for the learning and validation of the ANN. The third and fourth phase consist in the learning and testing of the ANN, iteratively until the optimal network is determined. Optimal means that ANN minimises model error. The number of inputs and outputs of models does not have any theoretical limit.
However the exact relationships between input/output vectors and the optimal network are not known. Therefore, it is not known how many cases are needed for the learning phase. When the number of inputs and outputs increases the complexity of the process, the more complicated networks require more data. The heuristic approach in all four phases increases the amount of computational time needed in order to work out the optimal network. If the heuristic procedure fails to recognise certain network configurations that lead to better performance, results obtained by other means may be considered valid. To build our ANN we first used the neural network toolbox in Matlab to single out the optimal network. The best ten network configurations were analysed using NeuralWorks (a neural network shell) to single out the best model based on an evaluation of their performance. The structure of the ANN used to work out the model is drawn in Fig. 3. The input layer has ten neurones for eight variables: in fact the Crash variable is represented by three binary neurones (0 or 1) to consider the six types of accident (Table 3). The hidden layer has four neurones and the output has one neurone. The neurone transfer functions of the input layer are linear, hyperbolic tangent for the hidden layer and sigmoid for the output one. It is not easy to explain why the transfer functions differ from the hidden to the output layer but it is the result of the learning phase aimed at the reduction of RMSE (root mean square error). Other transfer functions such as sine were tried but without success. Each neurone belonging to the ‘outer’ layer represents a variable while those on the hidden layers are directly connected to the values of non physical variables, so that they haven’t a physical sense. The input variables of the neural network have either a numerical value (generally normalised) or a binary code such as for the Crash variable. 
In the latter case, if the whole range of the possible values cannot be coded by a unique binary variable (0/1), it is necessary to use more than one neurone. In general, with n binary neurones, 2^n values of a variable can be represented. The variables related to the ten neurones of the input layer are described in detail in Table 3. The output layer of the model has one neurone representing the Accident Index (AI) as described by Eq. (1). The AI that the model furnishes as an answer represents the degree of danger associated with a particular set of variables described by the input layer. The value Nmax in Eq. (1) is the number of accidents that occurred at the most dangerous intersection of the zone under examination (69 accidents, rounded up to 70). This means that, if the AI for a certain set of input values is, for instance, 0.5, the degree of danger is about 35 accidents in four years. The safety of intersections with the same conditions can be estimated with a certain margin of error. It can be noted that, along with the descriptive variables in the input layer, there is also the accident index in the output layer. This is a characteristic of the type of ANN used. Back-propagation provides a solution to the problem of learning with a feed-forward ANN. This type of ANN has the capacity to reconstruct any continuous function regardless of whether the input–output sequence is known. The process of iterative learning used many sets of data that represent the real associations between the causes and effects of intersection collisions. The whole set of data is divided into two equal subsets: one for training and another for testing. The reduction in the number of variables in comparison to those contained in TABTOT was essential for performance. By increasing the number of variables, the number of local minima increases. Therefore the error, which is minimised by searching along an n-dimensional surface and stopping where the reliability of the results obtained by the simulations increases, can be higher. The smaller the number of variables used, the better the generalisation of the results. Dropping those variables loses some information but simplifies the overall execution of the ANN.
Fig. 3. Structure of the model.
Table 3 Characteristics of the input neurones of the model
Number | Variable name | Binary/numerical code
1 | Day or night | 0, night; 1, day
2 | Flows circulating in the intersection | Numerical value (normalisation flow 10 000 [v/h])
3 | Number of virtual conflict points | Numerical value (normalisation value 40)
4 | Number of real conflict points | Numerical value (normalisation value 60)
5 | Type of intersection | 0, not signalised; 1, signalised
6 | Crash 1 | 0/1
7 | Crash 2 | 0/1
8 | Crash 3 | 0/1
6–8 | Type of accident (combination of the three variables crash 1, 2 and 3) | 1-0-0, frontal impact; 1-1-0, frontal/side impact; 0-1-0, side impact; 0-0-1, bump; 0-0-0, pedestrian running over; 1-1-1, other types
9 | Road bed | 0, dry; 1, wet
10 | Weather conditions | 0, clear; 1, cloudy or rainy
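The input coding of Table 3 and the 10-4-1 structure described above (linear input layer, hyperbolic tangent hidden layer, sigmoid output) can be sketched in a few lines. This is only an illustration under stated assumptions: the weights are random placeholders, not the trained values of the paper; the bias terms, the function names and the short crash labels are introduced here; and the example case borrows the flow and conflict-point values of the first row of Table 4 with an assumed accident type.

```python
import math
import random

# Three binary neurones code the six accident types (Table 3).
CRASH_CODES = {
    "frontal": (1, 0, 0),
    "frontal/side": (1, 1, 0),
    "side": (0, 1, 0),
    "bump": (0, 0, 1),
    "pedestrian": (0, 0, 0),
    "other": (1, 1, 1),
}


def encode_case(day, flow_veh_h, virtual_cp, real_cp, signalised,
                crash, wet, bad_weather):
    """Build the ten input values in the order of Table 3."""
    c1, c2, c3 = CRASH_CODES[crash]
    return [
        1.0 if day else 0.0,
        flow_veh_h / 10_000,   # normalisation flow
        virtual_cp / 40,       # normalisation value for virtual points
        real_cp / 60,          # normalisation value for real points
        1.0 if signalised else 0.0,
        float(c1), float(c2), float(c3),
        1.0 if wet else 0.0,
        1.0 if bad_weather else 0.0,
    ]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Linear inputs, tanh hidden layer, sigmoid output (the AI estimate)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder weights: random values, NOT the trained network.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(4)]
b_hidden = [rng.uniform(-1, 1) for _ in range(4)]
w_out = [rng.uniform(-1, 1) for _ in range(4)]
b_out = rng.uniform(-1, 1)

x = encode_case(day=True, flow_veh_h=4610, virtual_cp=9, real_cp=3,
                signalised=True, crash="frontal/side", wet=False,
                bad_weather=False)
ai = forward(x, w_hidden, b_hidden, w_out, b_out)
print(f"AI = {ai:.3f}, about {ai * 70:.1f} accidents in four years")
```

The last line applies the conversion given in the text: multiplying the AI by Nmax = 70 gives the expected number of accidents over the four-year period.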
The number of cases used for the learning process of the network is 2073, of which 1037 were used for the training phase and 1036 for the testing phase. The best solution, which minimises the RMSE calculated on the testing set, was found after 50 000 learning cycles; the RMSE is 18.24%, which corresponds to an error on AI of about 3.2 accidents per year. The course of the RMSE in relation to the number of learning cycles is shown in Fig. 4; it is also possible to notice the phenomenon of ‘overfitting’ that occurs when the number of learning cycles exceeds the zone of least error. An example of the test results is reported in Table 4. Each case is characterised by an input data vector, the real AI value and that obtained by the model; the last column is the error. The table also shows how the NN model can be used to obtain answers for a specific input data vector. The particular choice of variables gives a good fit of the model to reality. It remains difficult to simulate an AI close to unity. This is a consequence of the ‘distance’ of such values from the middle value of the index in relation to all accidents. Since the ANN minimises the RMSE for all accidents simultaneously, certain isolated cases cannot be accurately estimated.
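The figures quoted above can be cross-checked with a short calculation. The sketch below reads the RMSE of 18.24% as 0.1824 on the normalised AI scale (an interpretation, but one that reproduces the stated 3.2 accidents per year with Nmax = 70 over four years). The learning curve used to illustrate the choice of the stopping point is hypothetical, apart from the value at 50 000 cycles.

```python
import math

N_MAX = 70   # accidents at the most dangerous intersection (rounded up)
YEARS = 4    # observation period, 1992 to 1995


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def ai_error_to_accidents_per_year(ai_error, n_max=N_MAX, years=YEARS):
    """Convert an error on the normalised AI into accidents per year."""
    return ai_error * n_max / years


def best_cycle(test_rmse_by_cycle):
    """Pick the number of learning cycles with the lowest test-set RMSE."""
    return min(test_rmse_by_cycle, key=test_rmse_by_cycle.get)


per_year = ai_error_to_accidents_per_year(0.1824)
print(f"{per_year:.2f} accidents per year")

# Hypothetical test-set curve showing overfitting after the minimum.
curve = {10_000: 0.25, 50_000: 0.1824, 90_000: 0.21}
print("stop at", best_cycle(curve), "cycles")
```

Selecting the network at the minimum of the test-set curve, rather than at the end of training, is what avoids the overfitting zone mentioned in the text.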
6.3. Application of techniques other than NN Other techniques are applied to the same data used in the above-mentioned model, to compare achievable performance. In particular, multiple linear and exponential regressions are tested using the tools in Excel 7. Exponential regression is in the form $y = b \prod_i m_i^{x_i}$, where $b$ is a coefficient, $x_i$ are the independent variables and $m_i$ the coefficients calculated by the regression. The main difficulties in including some variables in a regression model arise with class variables such as day/night, intersection, road-bed, weather and, above all, type of accident. The latter is resolved by dividing the data according to each class present in type of accident; this can lead to a greater computing effort but, above
Table 4 Examples of model outputs for some cases of the test file. Flow in the intersection is normalised to 10 000 veh/h, virtual conflict points to 40, real conflict points to 60, and the actual and calculated numbers of accidents to 70.
Day or night | Flow | Virtual conflict points | Real conflict points | Type of intersection | Type of accident | Road surface conditions | Meteorological conditions | Actual number of accidents | Calculated number of accidents | Error
Day | 0.461 | 0.225 | 0.050 | Signalised | Frontal/side | Dry | Clear | 0.229 | 0.2378 | 0.0088
Day | 0.674 | 0.175 | 0.050 | Signalised | Frontal/side | Dry | Clear | 0.129 | 0.1269 | −0.0021
Night | 0.181 | 0.225 | 0.050 | Signalised | Other | Dry | Clear | 0.257 | 0.2630 | 0.0060
Night | 0.186 | 0.225 | 0.050 | Signalised | Other | Dry | Clear | 0.271 | 0.2621 | −0.0089
Day | 0.617 | 0.375 | 0.067 | Signalised | Frontal/side | Dry | Clear | 0.457 | 0.4661 | 0.0091
Day | 0.343 | 0.000 | 0.083 | Not signalised | Pedestrian running over | Dry | Clear | 0.114 | 0.1184 | 0.0044
Day | 0.188 | 0.000 | 0.083 | Not signalised | Pedestrian running over | Dry | Clear | 0.129 | 0.1289 | −0.0001
Day | 0.964 | 0.250 | 0.083 | Signalised | Bump | Wet | Clear | 0.100 | 0.0964 | −0.0036
Day | 0.229 | 0.000 | 0.083 | Not signalised | Bump | Dry | Clear | 0.114 | 0.1206 | 0.0066
Day | 0.013 | 0.000 | 0.083 | Not signalised | Bump | Dry | Clear | 0.114 | 0.1230 | 0.0090
Day | 0.288 | 0.000 | 0.083 | Not signalised | Side | Dry | Clear | 0.114 | 0.1137 | −0.0003
Day | 0.157 | 0.000 | 0.083 | Not signalised | Frontal | Dry | Clear | 0.157 | 0.1625 | 0.0055
Night | 0.156 | 0.000 | 0.083 | Not signalised | Frontal/side | Wet | Clear | 0.143 | 0.1377 | 0.0053
Day | 0.238 | 0.000 | 0.083 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1417 | −0.0013
Day | 0.122 | 0.000 | 0.083 | Not signalised | Frontal/side | Dry | Clear | 0.157 | 0.1564 | −0.0006
Day | 0.299 | 0.000 | 0.083 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1349 | −0.0081
Day | 0.695 | 0.425 | 0.100 | Signalised | Frontal/side | Dry | Clear | 0.471 | 0.4742 | 0.0032
Day | 0.695 | 0.425 | 0.100 | Signalised | Pedestrian running over | Dry | Clear | 0.471 | 0.4647 | −0.0063
Day | 0.750 | 0.275 | 0.100 | Signalised | Frontal | Dry | Clear | 0.286 | 0.2939 | 0.0079
Day | 0.269 | 0.000 | 0.100 | Not signalised | Frontal | Wet | Cloudy/rainy | 0.143 | 0.1484 | 0.0054
Day | 0.682 | 0.425 | 0.100 | Signalised | Frontal/side | Dry | Clear | 0.471 | 0.4738 | 0.0028
Night | 0.010 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1893 | 0.0033
Night | 0.007 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1898 | 0.0038
Night | 0.104 | 0.000 | 0.100 | Not signalised | Frontal/side | Wet | Cloudy/rainy | 0.171 | 0.1727 | 0.0017
Day | 0.299 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1357 | −0.0073
Day | 0.471 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Cloudy/rainy | 0.129 | 0.1385 | 0.0095
Night | 0.004 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.200 | 0.1903 | −0.0097
Day | 0.682 | 0.425 | 0.100 | Signalised | Frontal/side | Dry | Clear | 0.471 | 0.4738 | 0.0028
Day | 0.234 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1431 | 0.0001
Night | 0.142 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.171 | 0.1683 | −0.0027
Night | 0.013 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1889 | 0.0029
Table 4 (Continued). Flow in the intersection is normalised to 10 000 veh/h, virtual conflict points to 40, real conflict points to 60, and the actual and calculated numbers of accidents to 70.
Day or night | Flow | Virtual conflict points | Real conflict points | Type of intersection | Type of accident | Road surface conditions | Meteorological conditions | Actual number of accidents | Calculated number of accidents | Error
Day | 0.225 | 0.000 | 0.100 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1441 | 0.0011
Day | 0.531 | 0.425 | 0.100 | Signalised | Frontal/side | Dry | Clear | 0.471 | 0.4699 | −0.0011
Day | 0.402 | 0.100 | 0.133 | Signalised | Frontal/side | Dry | Clear | 0.157 | 0.1657 | 0.0086
Night | 0.340 | 0.100 | 0.133 | Not signalised | Frontal/side | Dry | Clear | 0.157 | 0.1522 | −0.0048
Day | 0.381 | 0.100 | 0.133 | Signalised | Frontal/side | Wet | Cloudy/rainy | 0.157 | 0.1600 | 0.0030
Day | 0.381 | 0.100 | 0.133 | Signalised | Frontal/side | Wet | Cloudy/rainy | 0.157 | 0.1600 | 0.0030
Day | 0.119 | 0.000 | 0.150 | Not signalised | Pedestrian running over | Dry | Clear | 0.129 | 0.1383 | 0.0093
Day | 0.120 | 0.000 | 0.150 | Not signalised | Pedestrian running over | Dry | Clear | 0.143 | 0.1382 | 0.0048
Night | 0.273 | 0.525 | 0.150 | Signalised | Frontal/side | Dry | Clear | 0.357 | 0.3487 | −0.0083
Day | 0.175 | 0.000 | 0.167 | Not signalised | Frontal/side | Dry | Clear | 0.157 | 0.1571 | 0.0001
Day | 0.166 | 0.000 | 0.167 | Not signalised | Frontal/side | Dry | Clear | 0.157 | 0.1584 | 0.0014
Day | 0.171 | 0.000 | 0.183 | Not signalised | Pedestrian running over | Wet | Cloudy/rainy | 0.129 | 0.1309 | 0.0019
Day | 0.701 | 0.575 | 0.183 | Signalised | Bump | Dry | Clear | 0.314 | 0.3138 | −0.0002
Night | 0.782 | 0.575 | 0.183 | Signalised | Bump | Dry | Clear | 0.314 | 0.3137 | 0.0003
Day | 0.234 | 0.000 | 0.183 | Not signalised | Side | Dry | Clear | 0.114 | 0.1208 | 0.0088
Day | 0.344 | 0.500 | 0.200 | Signalised | Frontal/side | Dry | Clear | 0.429 | 0.4215 | −0.0075
Day | 0.364 | 0.000 | 0.200 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1367 | −0.0063
Day | 0.357 | 0.000 | 0.200 | Not signalised | Frontal/side | Dry | Clear | 0.143 | 0.1375 | −0.0055
Day | 0.332 | 0.500 | 0.200 | Signalised | Frontal/side | Dry | Clear | 0.429 | 0.4198 | 0.0092
Day | 0.090 | 0.000 | 0.217 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1874 | 0.0014
Day | 0.102 | 0.000 | 0.217 | Not signalised | Frontal/side | Wet | Cloudy/rainy | 0.186 | 0.1829 | −0.0031
Day | 0.113 | 0.000 | 0.217 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1825 | 0.0035
Day | 0.095 | 0.000 | 0.217 | Not signalised | Frontal/side | Dry | Clear | 0.186 | 0.1863 | 0.0003
Day | 0.196 | 0.000 | 0.217 | Not signalised | Frontal/side | Dry | Clear | 0.171 | 0.1660 | 0.0050
Day | 0.330 | 0.000 | 0.283 | Not signalised | Frontal/side | Dry | Clear | 0.171 | 0.1766 | 0.0056
Day | 0.406 | 0.000 | 0.867 | Signalised | Frontal/side | Wet | Cloudy/rainy | 0.557 | 0.5531 | −0.0039
all, in some cases to a reduction of statistical significance. The remaining class variables are transformed into binary variables. For example, the day/night variable is transformed into two variables: day is 1 when the accident occurred in the daytime, and night is 1 when the accident occurred at night-time. The output is the accident index as previously calculated in Eq. (1). Only some results are reported, but they are sufficient to evaluate model performance. Multiple linear regression, applied to the sets of data, gives an r² (r is the coefficient of correlation) ranging from 0.20 to 0.40. The RMSE (root mean square error) between the real data and the data calculated by the regression model ranges from 0.5 to 0.6. In some cases, when the output must be close to zero, the regression model also gives negative values. Multiple exponential regression gives an r² ranging from 0.17 to 0.41, and the RMSE between the real data and the data calculated by the regression model ranges from 0.5 to 0.7. Given these poor results, no further analysis was carried out using these methods.
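The exponential form of Section 6.3 can be fitted with the same least-squares machinery as the linear one once logarithms are taken. The short derivation below is a sketch under the assumption that all outputs are strictly positive ($y > 0$); the symbols $d_{\mathrm{day}}$ and $d_{\mathrm{night}}$ are introduced here only to illustrate the binary transformation of the day/night variable.

```latex
% Exponential regression and its log-linear form
y = b \prod_{i} m_i^{x_i}
\quad\Longrightarrow\quad
\ln y = \ln b + \sum_{i} x_i \ln m_i

% Binary transformation of the day/night class variable
d_{\mathrm{day}} =
\begin{cases}
1 & \text{accident in the daytime} \\
0 & \text{otherwise}
\end{cases}
\qquad
d_{\mathrm{night}} = 1 - d_{\mathrm{day}}
```

In this form $\ln b$ plays the role of the intercept and $\ln m_i$ that of the linear coefficients, so the fitted output $y$ is always positive, unlike the linear model, which can return negative values near zero.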
7. Results and discussion The ANN model can include all possible permutations of variables, discrete or continuous. For continuous variables a classification is carried out only in order
to reduce the number of possible values and to lessen computational time. As previously mentioned, Table 4 represents a set of possible enquiries and outputs. If these enquiries are organised by type, particular graphs can be obtained. Some examples are drawn in Figs. 5 and 6. The following results are the most significant for the discussion:
1. the degree of danger is the highest in night-time collisions, for any type of accident, at signalised intersections;
2. the degree of danger is the highest, with respect to all types of accident, for the running over of pedestrians at night-time at non-signalised intersections;
3. meteorological conditions and the state of the road surface are not determinant for the number of accidents at each intersection; that is, at the level of a single intersection there is no statistically relevant correlation between the accident index and meteorological conditions or the state of the road surface, but this does not mean that a significant relationship cannot be singled out when the accident data of several intersections are grouped together;
4. the accident index is greater for an intersection of average complexity (number of real and virtual conflict points) if not signalised, compared to a signalised one with the same complexity; on the other hand, for signalised intersections of average complexity at night-time it is smaller than at non-signalised ones or ones with flashing yellow lights;
Fig. 4. Learning cycles necessary to minimise the root mean square error.
Fig. 5. Accident Index for different types of intersection and regulation.
5. the accident index is greater for an intersection of average dimensions compared to more or less complex ones;
6. the accident index is greater for a small intersection if signalised compared to a small intersection that is not signalised;
7. real conflict points (which do not depend on traffic lights) are more important than virtual conflict points;
8. the accident index of a single intersection does not depend on accident type;
9. the rate of increase of average flow at a single intersection has no influence on the accident index; this means that, on average, accidents at the same intersection occurred with both high and low flow, so that no rule can be singled out. It can usually be observed that increasing the average flow (hourly flow, for example) on a road network increases the number of accidents, but it is rather difficult to establish whether the accidents occurred under precisely those flow conditions.
In general the ability of the ANN to simulate road accidents can be attributed, as with any other deterministic or statistical model, to the following factors:
1. the choice and number of the variables used. The ANN requires a classification of the variables to be used. The choice is both theoretical (supported by analogous studies already conducted) and linked to the availability of data already gathered. For instance, it may be easy to survey an intersection or obtain data on climatic conditions, but the same cannot be said for many human factors. Reducing the number of variables used to describe the problem increases the potential error of the model. Simple ANNs cannot fully explain multi-dimensional phenomena.
2. data collection factors. Once the variables for the ANN are chosen, collecting them can frequently introduce other biases and errors. Data collection errors affect the ANN model results in an unpredictable manner;
3. stochastic aspect of the phenomenon. Even when all the variables necessary to describe the process are employed and collected with a minimal degree of bias, the deterministic nature of the ANN does not guarantee a negligible model error when variables have a large stochastic component.
This last factor needs further clarification. When an ANN model is used, it learns from reality as expressed by real data. The accident index associated with a particular set of values of the variables that describe the accident is heavily influenced by the distribution of values in the actual set of accidents the ANN receives for the learning process. If a particular value of one variable in the learning set has a high accident index value, the ANN will assign a stronger link to this factor than to one with a lower value. Whether this link is randomly high or low, the ANN will output an average value. To solve this statistical problem, it is necessary to make an appropriate extraction from the real data that guarantees an equal distribution across all factors. For this purpose, a database should be built for every learning process which contains a uniform distribution of the different values of the most meaningful variables. For example, accidents that occurred in daytime are around 70% of the total. The consequence of this imbalance may be that the model associates a greater index with daytime cases. Regardless of whether a uniform database is constituted, the ANN classifies the most dangerous conditions only at those places
where crashes most frequently occur. Such a selection of data would reduce the sets of real data available for learning and consequently the capacity to understand the nature of collisions. For the purposes of this paper it was not possible to satisfy such a requirement because it would have been necessary to increase the number of intersections. The fact that the model does not work perfectly (that is, in terms of the variance the model actually explains, as distinct from the stochastic character of the output, as shown in Section 6.2) may be due to other variables that should have been used. Intuitively this effect may be more serious in those cases exhibiting accident indexes with low values. This explains the apparently random nature of certain situations. For instance, during the night-time of festive occasions young drivers adopt high speeds. Unfortunately speed and flow data were not collected for all collisions at all times. The absence of these variables has contributed to making the daytime cases less ‘clear’: they are more linked to a factor that varies in ways our ANN cannot capture. All these aspects help to explain some of the discrepancies between real values and those estimated by our ANN. Another way of testing the validity of results is to apply the findings to other situations, that is, to establish whether a model trained with a certain set of data is capable of producing reliable solutions for contexts different from those where the data were collected. Generally, the greater the quantity of data used for learning, the better the possibility of generalising the results and extending them to other urban contexts. It
is necessary to emphasise, however, that if we want to extend the results of simulations from a model ‘educated’ with data from a certain urban context, the similarity of the contexts must be evaluated since:
1. the association of the same variables does not necessarily give rise to the same results when a different context is examined; it is probable that in another city or environment the same variable has a different weight as regards the actual number of accidents at intersections;
2. omitting variables considered not fundamental for designing the model in one context could lead to errors when the model is applied to other places. If, for instance, driving in rainy weather represents a normal condition in one city, the same ANN cannot be used for a city where rain is rare.
If we want to extend the results obtained by the ANN in a specific context to other urban realities, attention should be paid to the choice of a ‘similar’ context. This applies not only to the values of the variables used in the ANN model, but also to those variables that were discarded during the learning phase. Another approach is to create a large database containing descriptions of all possible cases. Such a database is more likely to have a similar distribution of values across variables.
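The extraction of a learning set with a uniform distribution over one variable, discussed in this section for the day/night imbalance, can be sketched as follows. This is a minimal illustration and not the procedure used in the paper: the function name `balanced_sample`, the field name `period` and the 100 synthetic cases (70 daytime, 30 night-time) are assumptions made for the example.

```python
import random
from collections import defaultdict


def balanced_sample(cases, key, seed=0):
    """Downsample so that every value of `key` appears equally often."""
    if not cases:
        return []
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case)
    # Every class is reduced to the size of the rarest one.
    size = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the extraction is repeatable
    sample = []
    for value in sorted(groups):
        sample.extend(rng.sample(groups[value], size))
    rng.shuffle(sample)
    return sample


# Hypothetical learning set in which about 70% of the cases are daytime accidents.
cases = [{"period": "day", "id": i} for i in range(70)]
cases += [{"period": "night", "id": i} for i in range(70, 100)]

subset = balanced_sample(cases, "period")
print(len(subset))
```

Downsampling to the rarest class discards 40 of the 100 cases here, which illustrates the reduction of available learning data mentioned above.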
8. Conclusions To understand which of the three subset factors, human, vehicular or road, is the most important and
Fig. 6. Accident Index for intersections of average dimensions and not signalised at night-time according to the type of accident.
which one it is necessary to intervene on, is not a simple task. We need to create tools of investigation capable of answering this question with a reasonably low margin of error. Our paper shows one method to follow in the use of the ANN. The results are clear in terms of which factors explain accidents and which contribute to a higher degree of danger. The capability of neural networks to extract information from data makes the proposed methodology of analysis still more interesting, above all because it can use databases that already exist or were built for other purposes. It is necessary to continue work on these types of models, to tackle and solve problems connected to the distribution of data with a low average value. In order to make this factor less preponderant, we suggest a procedure that envisages a preliminary extraction of data (for example by a cluster analysis) so as to improve the training phase of ANNs. Another problem not yet resolved in technical papers concerns the different recurrence of phenomena which may cause an accident; this can lead to the collection of data that are not homogeneous in time and affect results in an unpredictable way. To avoid these difficulties, studies on driver behaviour and peculiarities have already been carried out in order to understand the best (minimum) sample interval for the various accident scenarios. The use of the accident index to compare the degree of danger of the intersections may be criticised. It is a macroscopic value which could lead to a misunderstanding of microscopic effects occurring at the intersection.
Acknowledgements Thanks are due to the City Hall for providing data and other help. This paper is partially funded by CNR (Murst 40%).
References
Cybenko, G., 1989. Approximation by superpositions of sigmoidal functions. Math. Control Signals Syst. 2, 303–314.
Dionne, G., Desjardins, D., Laberge-Nadeau, C., Maag, U., 1993. Medical conditions, risk exposure and truck drivers’ accidents: an analysis with regression models. The 37th Annual Meeting of The Association for the Advancement of Automotive Medicine, San Antonio, Texas.
Dougherty, M., 1995. A review of neural networks applied to transport. Transpn. Res. 3 (4), 247–260.
Ferrari, A., Mussone, L., Oneta, M., 1997. Evaluation of accident gravity in urban crossroads (in Italian). The 6th SIDT (Italian Society of Transport Professors) Annual Meeting Proceedings, Bologna, Italy.
Girosi, F., Poggio, T., 1991. Networks for learning. In: Antognetti, P., Milutinovic, V.V. (Eds.), Neural Networks: Concepts, Applications and Implementation. Prentice Hall, Inc, Englewood Cliffs, NJ, pp. 110–154.
Graham, A., 1996. An application of generalized linear modelling to the analysis of traffic accidents. Traffic Engng. Control 37 (12), 691–696.
Hornik, K., Stinchcombe, M., White, H., 1990. Universal approximation of an unknown mapping and its derivative using multilayer feedforward networks. Neural Netw. 3, 551–560.
Hornik, K., 1991. Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257.
Joshua, S.C., Garber, N.J., 1990. Estimating truck accident installments and involvements using linear and Poisson regression models. Transp. Plan. Technol. Record 15, 41–58.
Jovanis, P.P., Chang, H.L., 1986. Modelling the relationship of accidents to miles travelled. Transp. Res. 106, 42–51.
Kim, K., Nitz, L., Richardson, J., Li, L., 1995a. Personal and behavioural predictors of automobile crash and injury severity. Accid. Anal. Prev. 27 (4), 1–13.
Kim, K., Nitz, L., Richardson, J., Li, L., 1995b. Analyzing the relationship between crash types and injuries in motor vehicle collisions in Hawaii. Transp. Res. Record 1467, 9–13.
Leshno, M., Lin, V.Ya., Pinkus, A., Schocken, S., 1993. Multilayer feedforward networks with a non polynomial activation function can approximate any function. Neural Netw. 6, 861–867.
Maher, M.J., Summersgill, I., 1996. A comprehensive methodology for the fitting of predictive accident models. Accid. Anal. Prev. 28 (3), 281–296.
Maycock, G., Hall, R.D., 1984. Accidents at 4-arm roundabouts. Laboratory Report LR1120, Transportation Research Laboratory, Crowthorne, Berks, UK.
Miaou, S.P., Hu, P.S., 1992. Relationship between truck accidents and highway geometric design: a Poisson regression approach. Transp. Res. Lab. Record 1376, 10–18.
Mussone, L., 1995. Capabilities and first applications of neural networks in transports (in Italian). In: Cascetta, E., Salerno, G. (Eds.) (Italian Society of Transport Professors), Developments of Research in Transport Systems, Collana Trasporti, Franco Angeli, Milano, Italy, pp. 536–565.
Mussone, L., Rinelli, S., 1996. An accident analysis for urban vehicular flow, Urban Transport 96, CMP, pp. 583–592.
Okamoto, H., Koshi, M., 1989. The method to cope with the random errors of observed accident rates in regression analysis. Accid. Anal. Prev. 21, 317–332.
Salvatore, F., 1992. Trattamento statistico dei dati di incidente (in Italian). Autostrade 1, 27–31.
Shankar, V., Mannering, F., Barfield, W., 1995. Effect of roadway geometrics and environmental factors on rural freeway accident frequencies. Accid. Anal. Prev. 27 (3), 371–389.
Zegeer, C.V., Stewart, R., Reinfurt, D., et al., 1990. Effective geometric improvements for safety upgrading horizontal curves. Prepared for the Federal Highway Administration by the University of North Carolina.