A framework of boundary collision data aggregation into neighbourhoods


Accident Analysis and Prevention 83 (2015) 1–17

Contents lists available at ScienceDirect

Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap

Ge Cui a, Xin Wang a,*, Dae-Won Kwon b

a Department of Geomatics Engineering, University of Calgary, Calgary, Canada
b Office of Traffic Safety, City of Edmonton, Edmonton, Canada

ARTICLE INFO

Article history: Received 20 January 2015; Received in revised form 4 May 2015; Accepted 5 June 2015; Available online xxx

Keywords: Boundary collision aggregation; Entropy; Boundary zone; Collision density ratio

ABSTRACT

A large portion of all motor vehicle collisions can be boundary collisions; therefore, overestimated or underestimated counts of boundary collisions aggregated into neighbourhoods may hamper road safety analysis and management. In this paper, we propose a systematic framework for boundary collision aggregation. First, an entropy-based histogram thresholding method is used to determine the boundary zone size and identify boundary collisions. Next, the collision density probability distribution is established based on the collisions in each neighbourhood. Last, an effective boundary collision aggregation method, called the collision density ratio (CDR), is used to aggregate boundary collisions into neighbourhoods. The proposed framework is applied to collision data from the City of Edmonton as a case study. The experimental results show that the proposed entropy-based histogram thresholding method identifies boundary collisions with high precision and recall, and that the proposed CDR method aggregates boundary collisions into neighbourhoods more effectively than the existing half-to-half ratio and one-to-one ratio methods. © 2015 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: [email protected] (X. Wang).
http://dx.doi.org/10.1016/j.aap.2015.06.003
0001-4575/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Road traffic accidents are social and public health challenges, as they almost always result in injuries and/or fatalities (World Health Organization, 2013). The World Health Organization reports that road collisions, the ninth leading cause of death in 2004, are projected to be the fifth leading cause in 2030 (World Health Organization, 2013). It has been estimated that over one million people are killed each year in road collisions, which is equivalent to 2.1% of annual global mortality and results in an estimated social cost of $518 billion (Peden et al., 2004).

Previous traffic safety studies have shown that the occurrences of motor vehicle accidents are rarely random in space and time; therefore, the macro-level analysis of collision data is a substantial component of traffic planning and traffic management. Examples include the geostatistical analysis of LaScala et al. (2001), which examined the relationship between neighbourhood characteristics and alcohol-related pedestrian injury collisions; the use of macro-level collision prediction models (CPMs) in road safety evaluation and planning (Lovegrove and Sayed, 2006); and studies showing that neighbourhood street patterns have a significant impact on traffic collision frequency (Rifaat et al., 2009). A critical step in conducting a macro-level analysis of collision data in road safety is the effective aggregation of collisions into neighbourhoods, because the aggregation of collision data has a large impact on traffic analysis and management.

Boundary collisions are motor vehicle accidents that occur on the boundaries of neighbourhoods. Boundaries are usually defined by conspicuous natural or artificial ground objects, such as main roads, rail lines, trails and rivers. As the boundaries of neighbourhoods are often main roads, where most collisions happen, boundary collisions can account for a large proportion of the total collisions (Siddiqui and Abdel-Aty, 2012; Wang et al., 2012; Lee et al., 2014). Traffic analysis and management are therefore considerably affected by these boundary collisions; this phenomenon is referred to as the boundary effect. Boundary collisions can, however, be difficult to identify after the digitalization and geocoding process, for the following reasons:

• Boundaries may not coincide with the corresponding roads on the map, so that boundary collisions are not located on the boundary lines.
• A geographic information system (GIS) often represents a roadway, for example one with a width of 10 m, as a single line, so geocoded collisions may deviate from the road features.


G. Cui et al. / Accident Analysis and Prevention 83 (2015) 1–17

Fig. 1. Boundary collision identification.

Fig. 1 illustrates these two cases using real collision data from the City of Edmonton in Alberta, Canada. In the figure, 111 Avenue NW is a roadway that serves as the boundary between the neighbourhoods “INGLEWOOD” and “WESTMOUNT”. Three boundary collisions, ‘a’, ‘b’ and ‘c’, are on the boundary road 111 Avenue NW, and one non-boundary collision, ‘d’, is close to 129 Street NW. In Fig. 1, the road and the boundary do not coincide after digitalization. Moreover, although collision ‘b’ is located very close to the boundary after digitalization, it lies neither on the road nor on the boundary.

Manual inspection of boundary collisions is time-consuming and requires significant human resources. A better solution is to generate a boundary zone that includes nearby collisions. The boundary zone is a buffer centred on the boundaries of neighbourhoods. Collisions located within a boundary zone are assumed to be boundary collisions, which can then be assigned to neighbourhoods through aggregation methods. Fig. 2 shows a 6-m boundary zone for the example in Fig. 1. The boundary zone contains collisions ‘a’, ‘b’ and ‘c’, which are treated as boundary collisions. Collision ‘d’ lies beyond the boundary zone and is treated as a non-boundary collision.
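The boundary-zone test described above can be illustrated with a minimal sketch. It assumes planar coordinates in metres, a boundary given as a polyline, and a zone size interpreted as the maximum allowed distance from the boundary line; the function names and the coordinates are hypothetical and are not taken from the Edmonton data.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to segment a-b (all in metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(p, boundary):
    """Distance from p to a boundary given as a polyline (list of vertices)."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(boundary, boundary[1:]))

def is_boundary_collision(p, boundary, zone_size):
    """True if p lies within zone_size metres of the boundary (assumed reading)."""
    return distance_to_boundary(p, boundary) <= zone_size

# Hypothetical layout loosely mirroring collisions 'a' to 'd' in Figs. 1 and 2.
boundary = [(0.0, 0.0), (100.0, 0.0)]
collisions = {"a": (10.0, 1.0), "b": (40.0, -4.5), "c": (80.0, 0.0), "d": (60.0, 35.0)}
labels = {name: is_boundary_collision(p, boundary, zone_size=6.0)
          for name, p in collisions.items()}
print(labels)
```

In this made-up layout, ‘a’, ‘b’ and ‘c’ fall inside the 6-m zone and ‘d’ falls outside it, which matches the qualitative situation described for Fig. 2.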

Fig. 2. Boundary zone.


One critical issue in generating a boundary zone is the determination of its buffer size. If the size is too large, some non-boundary collisions may be included by mistake. If it is too small, some boundary collisions may be overlooked. Few studies have addressed this problem. Siddiqui and Abdel-Aty (2012) determined a buffer size of 100 ft for boundary pedestrian crashes by examining the slope of the curve of buffer distance versus percentage of total pedestrian crashes. Some researchers suggest using a fixed buffer size, such as 200 ft (Ivan et al., 2006). However, the appropriate size of the boundary zone may differ among neighbourhoods and roads. Therefore, a quantitative method for setting a proper boundary zone size is required.

Another challenge in boundary effect research is the aggregation of collisions into neighbourhoods. Non-boundary collisions are usually aggregated into the neighbourhoods where they are located. Boundary collisions, however, are subject to the interactions of all the neighbourhoods around them and must therefore be assigned effectively to adjacent neighbourhoods. The one-to-one and the half-to-half ratio methods are two widely used aggregation methods (Sun, 2009; Wei, 2010); however, both treat all neighbourhoods equally, without considering the spatial distribution of collisions around neighbourhood boundaries. One observation from real collision data is that boundary collisions are not evenly distributed around the boundary. An example is shown in Fig. 3, where more collisions are located around the boundary (i.e. 34 Avenue NW) in the north neighbourhood (STRATHCONA INDUSTRIAL PARK) than in the south neighbourhood (PARSONS INDUSTRIAL). One possible explanation is that the traffic conditions (e.g. traffic volume, number of traffic signals and signs, road surface, etc.) of the north neighbourhood are poorer than those of the south neighbourhood, which suggests that the traffic conditions of neighbourhoods may have different degrees of impact on a boundary collision. Therefore, aggregating boundary collisions into adjacent neighbourhoods without considering their distribution is not an appropriate solution.
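To make the contrast concrete, the two existing methods can be sketched as follows, using the definitions given in Section 2: the one-to-one ratio method counts a boundary collision once for every adjacent neighbourhood, and the half-to-half ratio method is read here as splitting one collision equally among its adjacent neighbourhoods. The function name, the data layout and the neighbourhood labels are hypothetical.

```python
from collections import defaultdict

def aggregate(boundary_collisions, method):
    """Aggregate boundary collisions into neighbourhoods.

    boundary_collisions: one list of adjacent neighbourhood names per collision.
    method: "one-to-one" or "half-to-half".
    """
    if method not in ("one-to-one", "half-to-half"):
        raise ValueError(f"unknown method: {method}")
    totals = defaultdict(float)
    for adjacent in boundary_collisions:
        # one-to-one: full count for each neighbour; half-to-half: equal split.
        share = 1.0 if method == "one-to-one" else 1.0 / len(adjacent)
        for name in adjacent:
            totals[name] += share
    return dict(totals)

# Two collisions on a boundary shared by A and B, one at a four-way intersection.
sample = [["A", "B"], ["A", "B"], ["A", "B", "C", "D"]]
one_to_one = aggregate(sample, "one-to-one")
half_to_half = aggregate(sample, "half-to-half")
print(one_to_one, sum(one_to_one.values()))
print(half_to_half, sum(half_to_half.values()))
```

With three collisions in the sample, the one-to-one totals sum to 8 while the half-to-half totals sum to 3. Neither method uses any information about where collisions concentrate, which is the limitation discussed above.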

Fig. 3. Collisions are concentrated on one side of the boundary.

The objective of this paper is to develop solutions to the aforementioned problems in boundary collision aggregation. We treat boundary collisions as a distinct object of study and explore their definition, methods for identifying them, methods for aggregating them, and the resulting effect on road safety. Specifically, the contributions of the paper can be summarized as follows:

• Proposal of a framework for the aggregation of boundary collisions. This framework includes three main steps: determination of the boundary zone size, construction of the collision density distribution, and aggregation of boundary collisions.
• Examination of the distribution of spatial distances from boundary collisions, non-boundary collisions and outliers to neighbourhood boundaries, and use of the entropy-based histogram thresholding method to determine the size of the boundary zone. The method is implemented as an automated programme, which can effectively and efficiently differentiate boundary data from non-boundary data and outliers.
• Proposal of a collision density ratio (CDR) method to aggregate boundary data into neighbourhoods, taking into account the distribution of the boundary collisions across the neighbourhoods. In the CDR method, the degree of impact of a neighbourhood on a specific location is represented by the collision density probability estimated from that neighbourhood.
• Results from a case study on real datasets from the City of Edmonton. The experimental results show that the proposed method outperformed other boundary data aggregation methods in statistical analysis. A comparison of the boundary collision identification results with the results of manual inspection shows that the proposed entropy-based histogram thresholding method identifies boundary collisions with high precision and recall. In addition, macro-level boundary collision prediction models (CPMs) were developed, and the results show that the models for all three aggregation methods fit successfully. However, a chi-square test between the boundary collision aggregation results and the truth data indicates that the proposed CDR method better reflects the number of boundary collisions aggregated into neighbourhoods, while the results of the half-to-half ratio method and the one-to-one ratio method show a large discrepancy with the truth data.

This paper is organized as follows: Section 2 introduces previous works related to boundary collisions. Section 3 proposes a framework for the aggregation of boundary data. Section 4 discusses a case study of two areas in the City of Edmonton. Section 5 evaluates the effectiveness of the proposed boundary collision identification method and three boundary collision aggregation methods. Section 6 presents conclusions and suggests future work.

2. Related works

Some research studies have suggested that the boundary effect has little influence on traffic analysis. Ladron de Guevara et al. (2004) and Khondakar et al. (2010) examined the issue when they developed macro-level collision models for Tucson, Arizona, USA and Greater Vancouver, British Columbia, Canada, respectively. They found that the percentage of collisions on the boundaries of traffic analysis zones (TAZs) was about 5%, which did not significantly affect the results of their collision models.

Other studies, however, have indicated that the boundary effect may affect traffic analysis. Fotheringham and Wegener (1999) observed that spatial data located near zone boundaries may have an inter-zonal influence. Lovegrove (2007) aggregated boundary data in an automatic, geospatially precise way in developing a macro-level collision distribution model for the Greater Vancouver Regional District. It was supposed that a large number of boundary collisions existed in the district and that the aggregation result of boundary collisions would affect collision model estimation. Siddiqui and Abdel-Aty (2012) conducted a study on boundary pedestrian crashes at traffic analysis zones. They first determined a buffer threshold to identify boundary collisions by examining the slope of the curve of buffer distance versus percentage of total pedestrian crashes.
Secondly, they considered the effects of neighbours on boundary collisions and calculated explanatory variables for interior collisions and boundary collisions separately. Thirdly, they built three collision models: an interior collision model, a boundary collision model and a traditional collision model. Among them, the interior and boundary collision models fit better than the traditional collision model. Wang et al. (2012) noted that a considerable number of collisions happened at the TAZ boundaries in Orange County, Florida. They aggregated the boundary collisions into TAZs manually and proposed a new collision model by dividing collisions into on-system road and off-system road collisions. Lee et al. (2014) developed a new zone system for macro-level crash analysis and argued that it could reduce the inaccuracy resulting from boundary collisions through a regionalization process that combines small TAZs with similar traffic collision patterns into sufficiently large zones. All of the above studies contribute different collision models to boundary collision analysis, but they did not focus on the buffer zone determination discussed in Section 1.

Very few studies have addressed the aggregation methods for boundary collisions. Wei (2010) made use of five boundary data aggregation approaches (the one-to-one ratio method, the half-to-half ratio method, the geospatial method, the vehicle kilometres travelled (VKT) ratio method and the total lane kilometres (TLKM) ratio method) for geocoded data on TAZ boundaries. The one-to-one ratio method counts a boundary collision once for every neighbourhood adjacent to the location of the collision. For example, if a collision is located at a four-way intersection with four neighbourhoods around it, the collision is counted four times. This method therefore exaggerates the total number of collisions. The half-to-half ratio method works similarly to the one-to-one ratio method; however, it counts each collision once in total and assigns the average collision value to the neighbourhoods. The geospatial method aggregates a boundary collision into a neighbourhood if the collision is located within the boundary of that neighbourhood, and it is thus affected by the accuracy of reported collisions and surveyed neighbourhood boundaries. The VKT and TLKM ratio methods aggregate boundary data into adjacent zones according to the ratios of VKT and TLKM, respectively. However, the causes of collisions are complicated, and considering only the ratio of the VKT or TLKM variables among the neighbourhoods does not work well. Wei (2010) constructed groups of collision prediction models (CPMs), concluded that the different aggregation methods for boundary data significantly affected the CPM results, and recommended the half-to-half ratio method as the best one.

3. Methodology

In this paper, we propose a new framework to handle boundary collision aggregation. The input data of the boundary collision aggregation in this study include the collision data, neighbourhoods and road networks. This framework consists of three main steps: (1) determine the boundary zone size, (2) construct the collision density distribution for neighbourhoods, and (3) aggregate the boundary collisions using the collision density ratio (CDR) method. A flow chart of the proposed framework is shown in Fig. 4. The size of the boundary zone is determined so that boundary collisions can be identified. A collision probability density distribution is constructed for each neighbourhood, and the collision density value of a location in the boundary zone can then be estimated from collisions within nearby neighbourhoods. Boundary collisions are assigned to neighbourhoods by the ratios of the collision density probability values estimated from the neighbourhoods. Each step of the framework is discussed in detail in the following subsections.

Fig. 4. Framework of boundary collision aggregation.

3.1. Boundary zone size determination

The first step in the framework is the determination of the boundary zone size. As previously mentioned, a suitable boundary zone size is critical for boundary collision identification and aggregation. In this paper, boundary collisions are defined as the collisions within the boundary zone. As most collisions are on or close to the road, one possible approach is to use the width of the road and clear zone to set the boundary zone size. However, the widths of roads and clear zones vary. In addition, main roads and neighbourhood boundaries generally do not coincide and may deviate from their true locations. Thus, road and clear zone widths are not a reliable basis for setting the boundary zone size. Instead, the distribution of the distance between boundary collisions and neighbourhood boundaries is used to infer the boundary zone size.

The first procedure in boundary zone size determination is the removal of potential non-boundary collisions, namely those whose distance to the neighbourhood boundary is greater than their distance to the closest non-boundary roadway. However, the remaining collisions may still include outliers. For example, in Fig. 5, although the outlier collision marked as X is closer to the neighbourhood boundary than to its nearest road (i.e. 97 Street NW), it is not likely to be a boundary collision because its distance to the boundary is more than 40 m, which is far from the boundary. Outliers may result from the process of data collection or geocoding and should be filtered out to obtain the correct boundary collisions.
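The first procedure can be sketched in a few lines. The sketch assumes that, for each collision, the distance to the closest neighbourhood boundary and the distance to the closest non-boundary roadway have already been computed; the field names `d_boundary` and `d_road` and all values are hypothetical.

```python
def prefilter(collisions):
    """Drop collisions that are farther from the neighbourhood boundary
    than from their closest non-boundary roadway; keep the rest."""
    return [c for c in collisions if c["d_boundary"] <= c["d_road"]]

# Hypothetical distances in metres. 'X' mimics the outlier in Fig. 5:
# it passes this filter although it is more than 40 m from the boundary.
candidates = [
    {"id": "a", "d_boundary": 0.8, "d_road": 60.0},
    {"id": "d", "d_boundary": 25.0, "d_road": 2.0},
    {"id": "X", "d_boundary": 41.0, "d_road": 55.0},
]
kept = [c["id"] for c in prefilter(candidates)]
print(kept)
```

The record ‘X’ surviving the filter shows why a second, distance-based step is still needed to remove outliers.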
Once outliers are excluded, the boundary zone size can be inferred from the distribution of spatial distances between the remaining boundary collisions and the neighbourhood boundaries. Therefore, the second step in the determination of the boundary zone size is to filter out outliers from the remaining potential boundary collisions. To distinguish outliers, a distance-level histogram is developed. The histogram describes the distribution of distances from collisions to their closest boundaries. The horizontal axis of the histogram represents the distance interval (e.g., scale 0 representing 0–1 m, scale 1 representing 1–2 m, etc.) between collision and boundary. The vertical axis represents the number of collisions located within each distance interval. As boundary collisions are close to boundaries and outliers are much farther from boundaries, the distance threshold of the histogram that discriminates between the two kinds of collisions corresponds to the boundary zone size.

Fig. 6 shows an example of a distance-level histogram of boundary collisions and outliers. As can be seen, the distribution is skewed to the right. However, it is difficult to determine the threshold, as the histogram is not bimodal. In the histogram, the number of boundary collisions is large, and they are concentrated around the boundary, whereas the number of outliers is smaller, and they lie farther from the boundary than boundary collisions, spreading smoothly along the horizontal axis. A reasonable threshold to distinguish boundary collisions from outliers needs to be identified from the histogram.

In this paper, an entropy-based histogram thresholding method is used to set the distance threshold. Entropy is well known as a measure of uncertainty about sources of information and is widely used in histogram thresholding (Kapur et al., 1985; Wong and Sahoo, 1989). In this research, boundary collisions and outliers are considered as two different sources of information. The prior data of the two sources are the frequencies of collisions in specific distance intervals. According to the principle of maximum entropy, the probability distribution that best represents the knowledge is the one that maximizes the entropy subject to the given prior data.
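As an illustration, the distance-level histogram and a maximum-entropy threshold search in the style of Kapur et al. (1985), formalised in Eqs. (1)–(6), might be sketched as below. The authors report a C# implementation; this Python sketch is not that programme. The function names, the 1-m bin width and the reading that the zone size equals the number of leading bins times the bin width are assumptions, and the histogram counts are made up.

```python
import math

def distance_histogram(distances, bin_width=1.0):
    """Count collisions per distance interval [0, w), [w, 2w), ..."""
    n_bins = int(max(distances) // bin_width) + 1
    counts = [0] * n_bins
    for d in distances:
        counts[int(d // bin_width)] += 1
    return counts

def _entropy(freqs):
    """Shannon entropy (natural log) of the normalised frequencies.
    Normalising counts gives the same ratios as normalising the p_i."""
    total = sum(freqs)
    if total == 0:
        return 0.0  # an empty class contributes no entropy (assumption)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

def max_entropy_threshold(counts):
    """Return t in 1..n maximising E_b + E_o, where the first t bins are
    treated as boundary collisions and the remaining bins as outliers."""
    best_t, best_score = 1, -math.inf
    for t in range(1, len(counts) + 1):
        score = _entropy(counts[:t]) + _entropy(counts[t:])
        if score > best_score:
            best_t, best_score = t, score
    return best_t

hist = distance_histogram([0.2, 0.7, 1.5, 3.9])  # -> [2, 1, 0, 1]
counts = [5, 5, 1, 1, 1, 1]                      # made-up histogram
t = max_entropy_threshold(counts)                # -> 2
zone_size = t * 1.0                              # metres, under the stated reading
print(hist, t, zone_size)
```

For the made-up counts, the summed entropy is largest when the first two bins form the boundary class (ln 2 + ln 4), so the sketch returns a threshold of two bins.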
Therefore, the distance threshold discriminating between boundary collisions and outliers should lead to the maximum entropy of the distribution of collisions. If there are $n$ distance intervals (0–1 m, 1–2 m, ...), $N$ is the total number of boundary collisions and outliers, and $f_1, f_2, \ldots, f_n$ are the observed numbers of boundary collisions and outliers in the corresponding intervals, the probability of a collision appearing in a given interval $i$ is $p_i$:

$$p_i = \frac{f_i}{N}, \qquad \sum_{i=1}^{n} p_i = 1, \qquad i = 1, \ldots, n \tag{1}$$

Let $t$ be the distance selected as the threshold differentiating between boundary collisions and outliers. The distribution of boundary collisions is then:

$$\frac{p_1}{\sum_{j=1}^{t} p_j}, \; \frac{p_2}{\sum_{j=1}^{t} p_j}, \; \ldots, \; \frac{p_t}{\sum_{j=1}^{t} p_j} \tag{2}$$

The entropy associated with the distribution of boundary collisions is:

$$E_b = -\sum_{i=1}^{t} \frac{p_i}{\sum_{j=1}^{t} p_j} \ln \frac{p_i}{\sum_{j=1}^{t} p_j} \tag{3}$$

The distribution of outliers is:

$$\frac{p_{t+1}}{\sum_{j=t+1}^{n} p_j}, \; \frac{p_{t+2}}{\sum_{j=t+1}^{n} p_j}, \; \ldots, \; \frac{p_n}{\sum_{j=t+1}^{n} p_j} \tag{4}$$

The entropy associated with the distribution of outliers can be calculated as:

$$E_o = -\sum_{i=t+1}^{n} \frac{p_i}{\sum_{j=t+1}^{n} p_j} \ln \frac{p_i}{\sum_{j=t+1}^{n} p_j} \tag{5}$$

In Eq. (6), the histogram threshold discriminating between boundary collisions and outliers ranges from 1 to $n$, and the aim is to find the threshold $T$ that maximizes the sum of the entropy of the distribution of boundary collisions, $E_b$, and that of outliers, $E_o$:

$$T = \underset{t = 1, \ldots, n}{\arg\max} \, (E_b + E_o) \tag{6}$$

As $T$ can effectively discriminate between the distributions of boundary collisions and outliers, it is assigned as the boundary zone size. The collisions whose distances to the boundaries are beyond the threshold are then filtered out.

In summary, the determination of the boundary zone size proceeds as follows: first, if the distance from a collision to the neighbourhood boundary is greater than the distance to its closest roadway, the collision is removed; then, a distance-level histogram of the remaining collisions is constructed; next, the threshold of the histogram is calculated using the maximum entropy; last, the threshold is used as the size of the boundary zone to filter out the outliers. We have implemented the above entropy-based histogram thresholding method in the C# programming language. Note that the collisions removed in the first step, on the criterion that they are closer to their closest roadway than to the closest neighbourhood boundary, may nevertheless lie inside the boundary zone. The number of such collisions is usually small and does not affect the threshold determination of the histogram; experiments supporting this statement are reported in Section 5.1.

Fig. 5. Outlier in the potential boundary collisions.

Fig. 6. Distance-level histogram of potential boundary collision data.

3.2. Construction of collision density distribution

The second step of the framework is the construction of the collision density distribution for each neighbourhood. As previously discussed, collisions can be unevenly distributed on the two sides of a neighbourhood boundary; that is, for two neighbourhoods sharing one boundary, boundary collisions are generally concentrated on one side of the boundary. Based on this observation, two assumptions are made.

Assumption 1. The traffic conditions of a neighbourhood have an impact on the occurrence of its boundary collisions.

The occurrence of boundary collisions is influenced by the traffic conditions of the neighbourhoods around them (e.g. driver commuter density, traffic volume, traffic signals, signs, and road surfaces). A boundary collision is more likely to result from a neighbourhood with poorer traffic conditions than from one with better traffic conditions. For instance, commercial zones generally have higher traffic volumes and heavier traffic loads than low-density residential areas. Therefore, neighbourhoods with poorer traffic conditions should be held more accountable for the occurrence of boundary collisions than those with better traffic conditions.

Assumption 2. The traffic conditions of a location are reflected by the number of collisions at this location and also by the number of collisions near it.

The traffic conditions of a neighbourhood can be inferred from its number of collisions. On this basis, the traffic conditions of a certain location can be estimated from the number of collisions that occur there. Meanwhile, according to the first law of geography, “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970); therefore, the occurrence of collisions may also be affected by nearby traffic conditions. It is very common for collisions to agglomerate. In other words, the area around a location with a high number of collisions usually has a higher probability of collisions as well. Based on these assumptions, the collision density probability value can be used to estimate the traffic conditions at a location.

The collision density values obtained from the kernel density estimation (KDE) method satisfy these two assumptions. First, the collision probability density is a variable that reflects the traffic conditions, and it is usually higher for locations with poorer traffic conditions. Second, the collision density value estimated by the KDE method complies with Assumption 2, as the KDE method treats each collision spot as a kernel that is a probability density function.

Let a set of neighbourhoods $NEI = \{N_1, N_2, \ldots, N_m\}$ contain $m$ neighbourhoods. For any neighbourhood $N_i$, let $n_i$ be the total number of collisions in $N_i$ and $LC(N_i) = \{l_1, l_2, \ldots, l_t\}$ be the set of $t$ locations where collisions happened in or around neighbourhood $N_i$. Each location $l_j = (x_j, y_j)^T$ is a two-dimensional column vector of geographic coordinates $x, y$. The collision probability density value $\hat{f}(l \mid N_i)$ at location $l = (x, y)$ estimated from $N_i$ is given by:

$$\hat{f}(l \mid N_i) = \frac{1}{n_i h_i^2} \sum_{j=1}^{n_i} K\!\left( \frac{1}{h_i} (l - l_j) \right) \tag{7}$$

where $K(\cdot)$ is a kernel function and $h_i$ is the bandwidth, set as $h_i = \left( 4\sigma_i^5 / (3 n_i) \right)^{1/5}$ according to Silverman's rule of thumb (1986, p. 45), where $\sigma_i$ is the standard deviation of the collisions in the neighbourhood $N_i$. The occurrence of one collision can, therefore, reflect not only the traffic conditions of the location, but also those of the area around the location. The closer a location is to a collision location, the greater the probability of an accident occurring there. In this paper, the kernel function is the quartic approximation of a true Gaussian kernel function:

$$K(l) = \begin{cases} 3\pi^{-1} (1 - l^T l)^2 & \text{if } l^T l < 1 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

The advantage of this kernel is that it has higher differentiability properties and can be calculated more quickly than the normal kernel (Silverman, 1986; p. 76). For the probability density function of each collision, the density value rises to its peak at the spot of the collision and decreases with increasing distance from the collision point, reaching zero at a distance of one bandwidth from the collision point. A continuous smooth surface is fitted over each collision. Fig. 7 shows the collision probability density distribution of two adjacent neighbourhoods that share a mutual boundary. The estimated collision density probabilities of locations on this boundary differ between the neighbourhoods, which indicates that the neighbourhoods have different impacts on the occurrence of boundary collisions.

Fig. 7. Collision probability density distribution of neighbourhoods.

3.3. Aggregation of boundary collisions by density probability ratio

The third step of the framework is the aggregation of boundary collisions using the CDR method. Each neighbourhood has a collision probability density distribution constructed from the collisions in its area. A boundary collision may be regarded as the result of the mutual interaction of all adjacent neighbourhoods. As different neighbourhoods have distinct spatial distributions of collisions around a boundary, any location in the boundary zone also has a distinct probability density value estimated from each neighbourhood. If a collision location has a higher collision density value from a neighbourhood, that neighbourhood has a larger influence on the occurrence of collisions at this location. Therefore, the CDR method is proposed to aggregate the boundary data into adjacent neighbourhoods by the ratio of the collision density of a location estimated from the corresponding neighbourhoods. Furthermore, aggregating boundary collisions by the ratio of collision density values assigns larger numbers of boundary collisions to neighbourhoods with poorer traffic conditions and smaller numbers to neighbourhoods with better traffic conditions.

The collision density distribution of each neighbourhood is the mutual contribution from the inner and boundary zones. As shown in Fig. 8, the inner zone is the area of a neighbourhood after removing the boundary zone.
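The density estimate of Eq. (7) with the quartic kernel of Eq. (8), and the ratio-based split that the CDR method applies to it, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the standard deviation used in the bandwidth is supplied by the caller because the text does not say how it is computed for two-dimensional points, and raising an error when every density is zero is a choice made here, not part of the paper. The density values in the usage example are those of the worked example given for the CDR method in this section.

```python
import math

def quartic_kernel(u):
    """Eq. (8): K(u) = (3 / pi) * (1 - u.u)^2 if u.u < 1, else 0 (u is 2-D)."""
    s = u[0] * u[0] + u[1] * u[1]
    return 3.0 / math.pi * (1.0 - s) ** 2 if s < 1.0 else 0.0

def silverman_bandwidth(sigma, n):
    """Bandwidth h = (4 * sigma^5 / (3 * n))^(1/5), as stated in the text."""
    return (4.0 * sigma ** 5 / (3.0 * n)) ** 0.2

def density(l, points, h):
    """Eq. (7): density at location l from a neighbourhood's collision points."""
    n = len(points)
    total = sum(quartic_kernel(((l[0] - x) / h, (l[1] - y) / h)) for x, y in points)
    return total / (n * h * h)

def cdr_shares(densities):
    """Split one collision among neighbourhoods in proportion to the density
    each neighbourhood yields at the collision location."""
    total = sum(densities.values())
    if total == 0.0:
        raise ValueError("all densities are zero; the ratio is undefined")
    return {name: value / total for name, value in densities.items()}

# Density values taken from the paper's worked example (locations l1, l2, l3).
example = {
    "c1": {"N1": 0.003, "N2": 0.007},
    "c2": {"N1": 0.0065, "N2": 0.0035},
    "c3": {"N1": 0.004, "N2": 0.001},
}
totals = {"N1": 0.0, "N2": 0.0}
for dens in example.values():
    for name, share in cdr_shares(dens).items():
        totals[name] += share
print(totals)  # approximately N1 = 1.75, N2 = 1.25
```

Each collision contributes a total value of one across its adjacent neighbourhoods, so the aggregated values over three collisions sum to three, unlike the one-to-one ratio method.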
The collision probability density distributions of neighbourhoods 1 and 2 ($N_1$ and $N_2$) are established based on collisions located in both the inner and boundary zones. After the collision probability density distribution has been constructed for all neighbourhoods, the collision probability density value of each collision location is determined from the designated neighbourhoods.

The principle of the CDR method in the aggregation of boundary collisions is as follows: for any location $l$ in the mutual boundary zones of $v$ neighbourhoods $\{N_1, N_2, \ldots, N_v\}$, its estimated collision density value from neighbourhood $N_i$ is $\hat{f}(l \mid N_i)$, $i = 1, 2, \ldots, v$, and $l = (x, y)$ is the geographic coordinates of the location. Any collision $c$ at the location is assigned to the neighbourhoods based on the ratio of the collision probability densities estimated from the corresponding neighbourhoods. The collision value $V(c \to N_i)$ aggregated from collision $c$ to the neighbourhood $N_i$ is given by:

$$V(c \to N_i) = \frac{\hat{f}(l \mid N_i)}{\sum_{u=1}^{v} \hat{f}(l \mid N_u)}, \qquad \sum_{u=1}^{v} V(c \to N_u) = 1 \tag{9}$$

For instance, suppose collisions $c_1$, $c_2$ and $c_3$ occurred at three locations $l_1$, $l_2$ and $l_3$, respectively (Fig. 8), in the boundary zone of neighbourhoods 1 and 2. If the estimated collision density probability $\hat{f}(l_1 \mid N_1)$ of location $l_1$ from neighbourhood 1 is 0.003 and the probability $\hat{f}(l_1 \mid N_2)$ from neighbourhood 2 is 0.007, the collision value from collision $c_1$ to neighbourhood 1 is $V(c_1 \to N_1) = 0.003 / (0.003 + 0.007) = 0.3$, and the collision value from collision $c_1$ to neighbourhood 2 is $V(c_1 \to N_2) = 0.007 / (0.003 + 0.007) = 0.7$. This means that collision $c_1$ will be aggregated into neighbourhood 1 with the value of 0.3 and into neighbourhood 2 with the value of 0.7. Similarly, if $\hat{f}(l_2 \mid N_1) = 0.0065$ and $\hat{f}(l_2 \mid N_2) = 0.0035$, then $V(c_2 \to N_1) = 0.0065 / (0.0065 + 0.0035) = 0.65$ and $V(c_2 \to N_2) = 0.0035 / (0.0065 + 0.0035) = 0.35$; and if $\hat{f}(l_3 \mid N_1) = 0.004$ and $\hat{f}(l_3 \mid N_2) = 0.001$, then $V(c_3 \to N_1) = 0.004 / (0.004 + 0.001) = 0.8$ and $V(c_3 \to N_2) = 0.001 / (0.004 + 0.001) = 0.2$. Therefore, for the aggregation of boundary collisions $c_1$, $c_2$ and $c_3$, the aggregated collision value into neighbourhood 1 is 0.3 + 0.65 + 0.8 = 1.75, and the aggregated collision value into neighbourhood 2 is 0.7 + 0.35 + 0.2 = 1.25.

4. Case study for the City of Edmonton

4.1. Data description

This section presents a case study of boundary collision aggregation in Edmonton, the capital city of Alberta, Canada. Collisions cause an enormous social and economic burden for the City of Edmonton every year (OTS, 2013). For example, there were 25,454 collisions in 2009 and 24,461 collisions in 2010. Furthermore, collisions around neighbourhood boundaries account for a large proportion of the total collisions in Edmonton, as shown in Table 1.

Two areas in Edmonton were selected for this case study: the downtown area and the south industrial/commercial area. These two areas are covered by dense road networks, and collisions are frequent. The neighbourhoods adjacent to the two study areas were also considered, because they are involved in the aggregation of boundary collisions.

• Ten neighbourhoods are in the downtown area: McCauley, Queen Mary Park, Central McDougall, Westmount, Downtown, Oliver, Inglewood, Prince Rupert, Alberta Avenue and Boyle Street. Another 18 neighbourhoods adjacent to the downtown area are also involved in boundary collision aggregation. Fig. 9 shows the neighbourhoods and road network in the downtown area.
• Ten neighbourhoods are in the south industrial/commercial area: Parsons Industrial, Calgary Trail South, Rideau Park, Duggan, Empire Park, Steinhauer, Royal Gardens, Greenfield, Sweet Grass and Ermineskin. Another 20 neighbourhoods adjacent to the south industrial/commercial area are also involved in boundary collision aggregation. Fig. 10 shows the neighbourhoods and road network in the south industrial/commercial area.

Table 2 shows the number of collisions in the downtown and south areas by year. It can be observed that the number of collisions in both areas varied by around 100 every year and that there is a

Fig. 8. Aggregation of boundary collisions.


Table 1. Number (proportion) of collisions around neighbourhood boundaries in Edmonton.

| Year | Total collision number | 0 m (on the boundary) | Within 10 m | Within 20 m | Within 30 m |
|---|---|---|---|---|---|
| 2009 | 25,454 (100%) | 2,774 (10.9%) | 13,457 (52.9%) | 14,495 (56.9%) | 15,032 (59.1%) |
| 2010 | 24,461 (100%) | 2,784 (11.4%) | 13,238 (54.1%) | 14,161 (57.9%) | 14,711 (60.1%) |

drastic decline in 2011 and 2012. In addition, there were more collisions in the downtown area than in the south area every year. Table 3 shows the distribution of the distances between collisions and boundaries. Overall, 25% of the collisions in the downtown area were less than 4.043 m from a boundary. For the south area, 25% of the collisions were within 2.079 m of a boundary. The five-number summary (from minimum to maximum) shows how the collisions are dispersed: they are dense close to a boundary and sparse far away from it. Therefore, boundary collisions accounted for a large proportion of the total collisions in the two areas.

4.2. Boundary zone size

As previously mentioned, the distance between a non-boundary collision and a boundary should be greater than the distance between that collision and its nearest non-boundary road. Based on this criterion, non-boundary collisions can be removed, with the remainder treated as potential boundary collisions. After identifying the potential boundary collisions, histograms were constructed to present the distribution of the distances between the potential boundary collisions and the boundaries for the two areas, as shown in Figs. 11 and 12. The horizontal axis represents the distance interval, and the vertical axis represents the number of collisions whose distances to the nearest boundary fall into each interval. The distance interval used for this case study is 1 m. As shown in these figures, the majority of boundary collisions are densely concentrated within 1 m of the boundary. The number of boundary collisions declines with increasing distance from the boundary. Beyond a certain distance, the boundary data become very rare and can be considered outliers. With the entropy-based thresholding method, the optimal boundary zone size is selected by maximizing the entropy of the distribution of boundary collisions and outliers.
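The threshold search described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the two-class maximum-entropy criterion of the cited histogram thresholding work (Kapur et al., 1985), in which the histogram is split at a candidate threshold and the entropies of the two normalised parts are summed. The function names and the toy histogram are hypothetical.

```python
import math


def class_entropy(counts):
    """Shannon entropy of one histogram segment after normalising it."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def max_entropy_threshold(counts):
    """Return (t, entropy): bins [0, t) are boundary collisions, [t, n) outliers.

    Every split point is tried and the one with the largest summed
    entropy of the two segments is kept (first one wins on ties).
    """
    best_t, best_h = None, float("-inf")
    for t in range(1, len(counts)):
        h = class_entropy(counts[:t]) + class_entropy(counts[t:])
        if h > best_h:
            best_t, best_h = t, h
    return best_t, best_h


# Toy histogram with 1 m bins: many collisions near the boundary, few beyond.
toy_counts = [40, 40, 40, 2, 2, 2]
threshold, entropy = max_entropy_threshold(toy_counts)
print(threshold)  # 3
```

With 1 m bins the returned index can be read as a boundary zone size in metres. The search returns only the global maximum; choosing between several peaks in the entropy curve still requires a judgement based on domain knowledge such as road width.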
The entropy values for different thresholds are shown in Fig. 13. For the downtown area, the entropy reaches its first peak at a threshold distance of 6 m and its second peak at 23 m. In this case, 23 m is not an appropriate threshold, since roads in the downtown area are far narrower than this value; therefore, the boundary zone size was set to 6 m for the downtown area. The entropy values in the south area reach their maximum at a threshold distance of 9 m. A possible explanation for the different boundary zone sizes is the difference in road widths between the two areas. In most cases, roads in the downtown area are narrower than roads in other areas; therefore, the boundary zone size in the downtown area is smaller than that in the south area. One advantage of the maximum entropy-based method is that it determines the boundary zone size from the distribution of the collisions, so it is reasonable to obtain different boundary zone sizes for different study areas. With the established boundary zones for the downtown and south areas, boundary collisions were about 25.73% and 43.23% of total collisions, respectively.

4.3. Boundary collision aggregation results

In this paper, three boundary collision aggregation methods (the CDR, one-to-one ratio and half-to-half ratio methods) were applied and compared using data from the City of Edmonton. For every neighbourhood in our two study areas, the aggregation results from 2006 to 2012 were calculated. Figs. 14 and 15 show the aggregation results of boundary collisions for the downtown and south areas, respectively. As shown in the figures, the number of aggregated collisions from the one-to-one ratio method was greater than that from the CDR and half-to-half ratio methods. The reason is that the one-to-one ratio method double counts boundary collisions, exaggerating the total number of collisions. The half-to-half ratio method divides each boundary collision equally among its adjacent neighbourhoods; therefore, the aggregated number in every neighbourhood using the one-to-one ratio method is nearly twice that of the half-to-half ratio method. However, as some boundary collisions are located at the intersections of three or four neighbourhoods, these collisions are aggregated into neighbourhoods as a third or a quarter of one collision, and the aggregated collision number in the half-to-half ratio method is slightly smaller than half of the aggregated number using the one-to-one ratio method. In some neighbourhoods, such as Boyle Street, Oliver, Alberta Avenue, Rideau Park, and Prince Rupert, the aggregated number of boundary collisions from the CDR method was larger than that from the half-to-half ratio method. This means that there were many collisions around the boundaries of these neighbourhoods, which gave them larger ratios when aggregating boundary collisions.

5. Evaluation

In this section, the boundary collision identification method and the different aggregation methods are evaluated. First, the ground truth data of boundary collisions were extracted from the collision dataset by manual inspection. The ground truth data are used to check the precision of the boundary collisions identified by the entropy-based histogram thresholding method. Second, a neighbourhood-based macro-level collision prediction model was used to construct a relationship between boundary collisions and explanatory variables in the boundary zone of neighbourhoods. The goodness-of-fit of the model could reflect the effectiveness of the aggregation results. Third, boundary collisions were aggregated into adjacent neighbourhoods through manual inspection to check the effectiveness of the three boundary collision aggregation methods.

5.1. Evaluation for boundary collision identification

The ground truth of boundary collisions in the two study areas was extracted manually to evaluate the effectiveness of boundary collision identification with the proposed entropy-based histogram thresholding method. Table 4 shows the result of comparing the proposed method with the ground truth. In Table 4, the 'Ground truth' column shows the number of boundary collisions identified


Fig. 9. Illustrations of the downtown area: (a) neighbourhoods; (b) road network.

Fig. 10. Illustrations of the south area: (a) neighbourhoods; (b) road network.


Table 2. Number of collisions in the downtown and south areas (2006–2012).

| Year | Downtown area | South area |
|---|---|---|
| 2006 | 3,354 | 1,386 |
| 2007 | 3,434 | 1,487 |
| 2008 | 3,663 | 1,450 |
| 2009 | 3,748 | 1,447 |
| 2010 | 3,621 | 1,464 |
| 2011 | 2,682 | 1,120 |
| 2012 | 2,767 | 1,204 |
| Total | 23,269 | 9,558 |

by visual inspection, collision addresses and information from the police reports. The numbers shown in this column are considered the true values of the number of boundary collisions in the study areas. The 'Histogram thresholding' column is the number of boundary collisions identified by the proposed entropy-based histogram thresholding method; the 'Correctly identified' column is the number of boundary collisions identified correctly by the entropy-based histogram thresholding method with respect to the ground truth data. Two measurements were chosen for the evaluation. 'Precision' measures the percentage of the boundary collisions identified by the entropy-based histogram thresholding method that are correct; 'Recall'

Table 3. Distance statistics (m) between collisions and boundaries in the downtown and south areas.

Downtown area

| Year | Min. | 1st Quartile | Median | 3rd Quartile | Max. |
|---|---|---|---|---|---|
| 2006 | 0 | 4.084 | 132.255 | 245.970 | 608.891 |
| 2007 | 0 | 5.555 | 135.337 | 254.844 | 608.418 |
| 2008 | 0 | 5.992 | 138.986 | 260.737 | 610.3748 |
| 2009 | 0 | 6.754 | 134.095 | 260.82 | 606.965 |
| 2010 | 0 | 5.998 | 128.091 | 245.613 | 608.891 |
| 2011 | 0 | 1.744 | 121.638 | 243.537 | 606.965 |
| 2012 | 0 | 1.478 | 120.978 | 242.883 | 608.418 |
| Total | 0 | 4.043 | 131.150 | 249.034 | 610.375 |

South area

| Year | Min. | 1st Quartile | Median | 3rd Quartile | Max. |
|---|---|---|---|---|---|
| 2006 | 0 | 2.101 | 12.766 | 54.788 | 737.460 |
| 2007 | 0 | 2.101 | 16.341 | 54.788 | 737.460 |
| 2008 | 0 | 2.079 | 12.766 | 49.564 | 737.460 |
| 2009 | 0 | 2.101 | 14.577 | 54.788 | 737.460 |
| 2010 | 0 | 2.646 | 24.957 | 53.180 | 737.460 |
| 2011 | 0 | 2.079 | 16.341 | 65.027 | 737.460 |
| 2012 | 0 | 2.079 | 12.766 | 54.788 | 737.460 |
| Total | 0 | 2.079 | 16.341 | 54.788 | 737.460 |

Fig. 11. Downtown area: histogram of the frequency distribution of boundary collisions and outliers.

Fig. 12. South area: histogram of the frequency distribution of boundary collisions and outliers.


Fig. 13. Entropy with different boundary zone sizes for the downtown and south areas.

measures the percentage of the true boundary collisions identified by the entropy-based histogram thresholding method. As shown in the table, 99.55% and 94.36% of the boundary collisions identified by the entropy-based histogram thresholding method were true boundary collisions in the downtown area and south area, respectively, and 96.01% and 86.49% of the true boundary collisions were correctly identified by the method. Although manual inspection can identify boundary collisions more accurately, it is a very time-consuming process. In our case, it took more than 16 h to examine the boundary collisions by manual inspection, whereas the programmed entropy-based histogram thresholding method took only a few seconds to finish the boundary collision identification. Therefore, the proposed entropy-based histogram thresholding method is an effective and efficient way to identify boundary collisions.

In addition, as mentioned, some boundary collisions are removed as potential non-boundary collisions if their distance to the closest roadway is less than the distance to the closest boundary. However, the removed boundary collisions do not affect the determination of the boundary zone size. In this experiment, the removed boundary collisions within 6 m of a neighbourhood boundary in the downtown area and within 9 m of a neighbourhood boundary in the south area were added back to the dataset to construct the distance-level histogram, and then to

Fig. 14. Aggregation results of ten neighbourhoods in the downtown area with a buffer of 6 m.


Fig. 15. Aggregation results of ten neighbourhoods in the south area with a buffer of 9 m.

Table 4. Evaluation of boundary collision identification by entropy-based histogram thresholding.

Downtown area

| Year | Ground truth | Histogram thresholding | Correctly identified | Precision | Recall |
|---|---|---|---|---|---|
| 2006 | 1,224 | 1,190 | 1,181 | 99.24% | 96.49% |
| 2007 | 1,280 | 1,228 | 1,224 | 99.67% | 95.63% |
| 2008 | 1,281 | 1,238 | 1,229 | 99.27% | 95.94% |
| 2009 | 1,319 | 1,272 | 1,266 | 99.53% | 95.98% |
| 2010 | 1,338 | 1,284 | 1,283 | 99.92% | 95.89% |
| 2011 | 1,063 | 1,028 | 1,026 | 99.81% | 96.52% |
| 2012 | 1,086 | 1,045 | 1,039 | 99.43% | 95.67% |
| Total | 8,591 | 8,285 | 8,248 | 99.55% | 96.01% |

South area

| Year | Ground truth | Histogram thresholding | Correctly identified | Precision | Recall |
|---|---|---|---|---|---|
| 2006 | 1,128 | 1,032 | 985 | 95.44% | 87.32% |
| 2007 | 1,093 | 1,000 | 949 | 94.90% | 86.83% |
| 2008 | 1,078 | 966 | 919 | 95.13% | 85.25% |
| 2009 | 1,026 | 928 | 879 | 94.72% | 85.67% |
| 2010 | 900 | 824 | 785 | 95.27% | 87.22% |
| 2011 | 844 | 800 | 736 | 92.00% | 87.20% |
| 2012 | 909 | 846 | 782 | 92.43% | 86.03% |
| Total | 6,978 | 6,396 | 6,035 | 94.36% | 86.49% |

The first three numeric columns give the number of boundary collisions; precision and recall are the evaluation measures.
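The precision and recall figures in the 'Total' rows of Table 4 follow directly from the three counts. The short sketch below recomputes them; the function name `precision_recall` and the dictionary layout are illustrative choices, while the counts are the totals reported in the table.

```python
def precision_recall(ground_truth, identified, correct):
    """Precision = correct / identified; recall = correct / ground truth."""
    precision = correct / identified
    recall = correct / ground_truth
    return precision, recall


# (ground truth, histogram thresholding, correctly identified) totals, Table 4.
table4_totals = {
    "Downtown": (8591, 8285, 8248),
    "South": (6978, 6396, 6035),
}

for area, counts in table4_totals.items():
    p, r = precision_recall(*counts)
    print(f"{area}: precision {p:.2%}, recall {r:.2%}")
# Downtown: precision 99.55%, recall 96.01%
# South: precision 94.36%, recall 86.49%
```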


Fig. 16. Entropy with different boundary zone sizes after adding removed boundary collisions.

determine the threshold by the entropy-based method. The result is shown in Fig. 16. The thresholds were the same as when these collisions were excluded from the histogram, i.e. still 6 m and 9 m for the downtown area and south area, respectively. Therefore, the removal of this small number of boundary collisions does not affect the determination of the boundary zone size.

5.2. Evaluation for boundary collision aggregation

5.2.1. Neighbourhood-based macro-level collision prediction model

Neighbourhood-based macro-level collision prediction models (CPMs) are widely used in road safety evaluation and planning. Generally, neighbourhood-based macro-level CPMs, taking neighbourhoods as elements, construct a relationship between the number of collisions and other variables in neighbourhoods, such as exposures (vehicle kilometres travelled, total road lane kilometres, etc.), transportation demand management related variables (total commuters, total commuter density, etc.), network related variables (signal density, intersection density per unit area, etc.) and sociodemographic related variables (job density, population density, etc.) (Lovegrove and Sayed, 2006; Siddiqui and Abdel-Aty, 2012; Wang et al., 2012; Karim et al., 2013). This section investigates the influence of boundary collision aggregation on CPMs and evaluates the effectiveness of the three boundary collision aggregation methods based on the fit of the models. In this paper, we used the negative binomial (NB) regression method to build a neighbourhood-based macro-level CPM to study the relationship between the number of boundary collisions and some independent variables in road safety, and to evaluate whether the results of the three aggregation methods fit the negative binomial collision model. NB regression is a generalized linear regression model that has been proven effective in the modelling of random, discrete and sporadic collision data.
An NB regression model commonly assumes a log-linear relationship between the NB parameter $\lambda_i$ and explanatory variables:

$$\lambda_i = E(y_i) = e^{\beta X_i + \varepsilon_i} \tag{10}$$

where $y_i$ is the observed number of boundary collisions at neighbourhood $i$, $E(y_i)$ is the expected number of boundary collisions at neighbourhood $i$, $X_i$ is a vector of explanatory variables, $\beta$ is a vector of unknown coefficients, and $e^{\varepsilon_i}$ is a gamma distributed error term with a mean of 1 and a variance of $k^2$. Under the NB regression model, the relationship between the mean and the variance is presented as:

$$\mathrm{VAR}(y_i) = E(y_i) + k E(y_i)^2 \tag{11}$$

The form of the NB probability distribution is given as:

$$P(y_i) = \frac{e^{-\lambda_i e^{\varepsilon_i}} \left(\lambda_i e^{\varepsilon_i}\right)^{y_i}}{y_i!} \tag{12}$$

In this paper, six explanatory variables (shown in Table 5) were used in the CPM construction for boundary collisions. The statistics of the variables are listed in Table 5. In this table, 'NUM_CDR', 'NUM_HALF' and 'NUM_ONE' are the numbers of boundary collisions aggregated into neighbourhoods by the CDR method, the half-to-half ratio method and the one-to-one ratio method, respectively. The six explanatory variables are the total lane kilometres in the neighbourhood (TLKM), the total lane kilometres in the boundary zone (TLKMB), the number of intersections in the boundary zone (INTB), the length of the neighbourhood boundary (BL), the area of the neighbourhood (AON) and the total lane kilometres per area in the neighbourhood (TPA). The form of the CPM is as follows:

$$E(y_i) = \alpha e^{\beta X_i} \tag{13}$$

where $\alpha$ and $\beta$ are model parameters that can be estimated by the standard maximum likelihood method, $E(y_i)$ is the expected number of boundary collisions at neighbourhood $i$, and $X_i$ is a vector of explanatory variables at neighbourhood $i$.

Table 5. Statistics of variables.

| Variable | Definition | Mean | SD | Min. | Max. |
|---|---|---|---|---|---|
| Response variables | | | | | |
| NUM_CDR | The number of boundary collisions aggregated by CDR method | 66.57 | 44.72 | 0.80 | 206.53 |
| NUM_HALF | The number of boundary collisions aggregated by half-to-half ratio method | 68.25 | 39.20 | 11 | 212.83 |
| NUM_ONE | The number of boundary collisions aggregated by one-to-one ratio method | 154.90 | 95.57 | 22 | 524 |
| Explanatory variables | | | | | |
| TLKM | Total lane kilometres in neighbourhood (km) | 18.34 | 6.50 | 7.64 | 31.67 |
| TLKMB | Total lane kilometres in boundary zone (km) | 4.24 | 1.40 | 2.07 | 7.00 |
| INTB | Intersections in the boundary zone | 24.52 | 8.27 | 11 | 46 |
| BL | Length of neighbourhood boundary (km) | 5.42 | 1.01 | 4.19 | 7.84 |
| AON | Area of neighbourhood (km²) | 14.66 | 5.56 | 6.54 | 26.77 |
| TPA | Total lane kilometres per area in neighbourhood (km⁻¹) | 1.28 | 0.21 | 0.76 | 1.64 |

Pearson $\chi^2$ and scaled deviance (SD) are two goodness-of-fit measures for the negative binomial regression model (McCullagh and Nelder, 1989) and are defined as follows:

$$SD = 2 \sum_{i=1}^{n} \left[ y_i \ln\!\left(\frac{y_i}{E(y_i)}\right) - (y_i + k) \ln\!\left(\frac{y_i + k}{E(y_i) + k}\right) \right] \tag{16}$$

where $k$ is the over-dispersion parameter for the model. SD approximates a $\chi^2$ distribution with $(n - p)$ degrees of freedom (DOF), where $n$ is the number of observations and $p$ is the number of parameters.

$$\text{Pearson } \chi^2 = \sum_{i=1}^{n} \frac{\left[y_i - E(y_i)\right]^2}{\mathrm{Var}(y_i)} \tag{17}$$

where $y_i$ is the observed collision frequency in the boundary zone of neighbourhood $i$ and $\mathrm{Var}(y_i)$ is the variance for the boundary zone of neighbourhood $i$. In addition, the Akaike information criterion (AIC) is a measure of model quality that balances model fit against model simplicity:

$$AIC = -2 \ln L + 2k \tag{18}$$

where $L$ is the maximized value of the likelihood function for the model and $k$ here is the number of parameters estimated in the model (distinct from the over-dispersion parameter above).

In this study, a 95% confidence level was used; if the Pearson $\chi^2$ and SD values of an established model were smaller than $\chi^2(0.05, n - p)$, the model was considered to fit successfully. Variable selection was a step-by-step procedure. First, in every step, the variable that decreased the AIC of the model most was selected into the model. Second, the z-statistic of each variable parameter needed to be significant at the 95% confidence level. Last, the added variable was not allowed to have a high correlation with other variables in the model. Table 6 presents the CPM results and a summary of the related statistics. All CPMs in the table were built successfully with the collision data from the downtown and south areas. INTB, TPA and TLKMB decreased the AIC of the CPMs for all three aggregation methods, with INTB producing the largest drop in AIC. However, TLKMB was not added to the final CPMs because it had a high correlation (0.54) with INTB. Therefore, INTB and TPA were selected for the CPMs.

Table 6. Goodness-of-fit statistics for CPMs based on boundary collisions in two study areas.

| Aggregation method | Statistic | Con | INTB | TPA |
|---|---|---|---|---|
| Collision density ratio | Coefficient | 3.75 | 0.05 | 0.08 |
| Collision density ratio | z-value | 13.08 | 9.68 | 3.25 |
| Half-to-half ratio | Coefficient | 4.37 | 0.05 | 0.12 |
| Half-to-half ratio | z-value | 18.57 | 11.52 | 6.08 |
| One-to-one ratio | Coefficient | 5.29 | 0.06 | 1.33 |
| One-to-one ratio | z-value | 20.50 | 11.25 | 6.26 |

| Model | DOF | k | SD | Pearson χ² | χ² | AIC | Log-likelihood |
|---|---|---|---|---|---|---|---|
| Collision density ratio (model 1) | 137 | 0.23 | 145.54 | 129.83 | 165.32 | 1,336.58 | −664.29 |
| Half-to-half ratio (model 2) | 137 | 0.16 | 145.63 | 137.97 | 165.32 | 1,309.94 | −650.97 |
| One-to-one ratio (model 3) | 137 | 0.20 | 145.23 | 142.71 | 165.32 | 1,551.30 | −771.65 |
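The goodness-of-fit measures of Eqs. (16)–(18) can be sketched as plain functions. This is a minimal illustration under stated assumptions, not the software used in the study: the function names are hypothetical, the formulas are implemented as printed with k passed in as an argument (so the parameterisation of k should be checked against the fitting software), and the AIC example takes the log-likelihood of model 1 as −664.29 with four estimated parameters, which is the combination under which Eq. (18) reproduces the reported AIC of 1,336.58.

```python
import math


def nb_variance(mean, k):
    """Eq. (11): variance implied by the mean and over-dispersion k."""
    return mean + k * mean ** 2


def scaled_deviance(observed, expected, k):
    """Eq. (16), summed over neighbourhoods."""
    total = 0.0
    for y, mu in zip(observed, expected):
        # The term y * ln(y / mu) is taken as 0 when y == 0.
        first = y * math.log(y / mu) if y > 0 else 0.0
        second = (y + k) * math.log((y + k) / (mu + k))
        total += first - second
    return 2.0 * total


def pearson_chi2(observed, expected, k):
    """Eq. (17), with Var(y_i) taken from Eq. (11)."""
    return sum(
        (y - mu) ** 2 / nb_variance(mu, k)
        for y, mu in zip(observed, expected)
    )


def aic(log_likelihood, n_params):
    """Eq. (18): -2 ln L + 2 * (number of estimated parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params


print(round(aic(-664.29, 4), 2))  # 1336.58
```

A model is then judged to fit when both `scaled_deviance` and `pearson_chi2` fall below the critical chi-square value for n − p degrees of freedom.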

As shown in Table 6, the SD and Pearson $\chi^2$ values of all three models were less than the corresponding critical $\chi^2$ value, which means that all three models were fitted successfully. However, it is difficult to compare the performance of the three boundary collision aggregation methods using the three CPMs, because the various measures of model fit point to different conclusions. For instance, the Pearson $\chi^2$ value of the CDR method was smaller than those of the two other methods, while in terms of AIC the half-to-half ratio method had the smallest value and therefore the best goodness of fit among the three methods. The scaled deviances of the CPMs from the three aggregation methods were quite close to each other. One possible reason is that some important explanatory variables, such as traffic volume data, are not included in this study due to data limitations, and these variables have a large impact on the occurrence of boundary collisions. Thus, we conducted a chi-square test on the three aggregation results, as described in the next subsection.

5.2.2. Chi-square test

In this section, the collision assignment results from the proposed CDR method, the half-to-half ratio method and the one-to-one ratio method are compared with the manual assignment, which is assumed to be the most accurate assignment (i.e. the true value). A critical criterion for collision aggregation is that the number of collisions aggregated into neighbourhoods should not be exaggerated or underestimated compared with the true value, so the three boundary collision aggregation methods are assessed to see whether there is a large discrepancy between the aggregation result and the true value. A chi-square test is used to evaluate the effectiveness of the boundary collision aggregation methods. Based on geocoding information of collisions and police reports, the true value was obtained by aggregating boundary collisions into neighbourhoods manually. If a boundary collision location from the police reports is located in a neighbourhood, then the collision is assigned to that neighbourhood.
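The comparison against the manually assigned counts can be sketched as a Pearson-type chi-square statistic. The exact form of the statistic is not spelled out in the text, so the sum of squared differences divided by the true value used here is an assumption, as are the function names. The critical value uses the Wilson-Hilferty approximation so that only the standard library is needed; a statistics package would give the exact quantile.

```python
from statistics import NormalDist


def chi_square_statistic(aggregated, true_values):
    """Sum of (aggregated - true)^2 / true over all neighbourhood-years.

    Assumes every true value is positive.
    """
    return sum((a - t) ** 2 / t for a, t in zip(aggregated, true_values))


def chi_square_critical(dof, alpha=0.05):
    """Approximate upper-alpha critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


critical = chi_square_critical(139)
print(round(critical, 1))  # 167.5
```

An aggregation method whose statistic stays below the critical value shows no significant discrepancy from the manual assignment at the chosen significance level.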
It took us more than 30 h to finish the assignment for the study areas, which is not feasible for large amounts of collision data. The result of the chi-square test is shown in Table 7. As 20 neighbourhoods over 7 years were taken from the two study areas, 140 neighbourhood-year observations in total were used by the aggregation methods. Therefore, the chi-square distribution has 139 degrees of freedom, and the critical value for the 0.05 significance level is

Table 7. The chi-square test for the three aggregation methods (number of boundary collisions aggregated into neighbourhoods).

| Method | Mean | Std. | Degrees of freedom | Test statistic | P-value |
|---|---|---|---|---|---|
| True value | 66.856 | 48.243 | 139 | – | – |
| Collision density ratio | 66.565 | 44.715 | 139 | 165.971 | 0.06 |
| Half-to-half ratio | 68.252 | 39.198 | 139 | 1,752.664 | 5.16E-277 |
| One-to-one ratio | 154.898 | 95.569 | 139 | 25,441.806 | 0.000 |


167.514. In Table 7, the test statistic of CDR was 165.971, which was smaller than the critical value; therefore, the aggregation result of CDR has very little discrepancy with the true value. In addition, the test statistics of the half-to-half ratio and one-to-one ratio methods were 1,752.664 and 25,441.806, respectively. Both values were larger than the critical value, which indicates a significant difference between the true value and the number of boundary collisions aggregated by the one-to-one ratio and half-to-half ratio methods.

6. Conclusions and future work

The analysis of collisions in neighbourhoods is very useful for transportation management and road safety planning. This paper focused on the study of boundary collisions. We proposed a framework for the aggregation of boundary collisions, including determination of the boundary zone size, construction of the collision density distribution, and aggregation of boundary collisions. We made use of a boundary zone for the identification of boundary collisions and of an entropy-based histogram thresholding method for the determination of the boundary zone size. We also proposed a collision density ratio (CDR) method to aggregate boundary collisions into neighbourhoods. We conducted a case study of two areas in the City of Edmonton (the downtown and south areas) and compared the proposed CDR method with two other commonly used methods (the half-to-half and one-to-one ratio methods) by using a macro-level CPM and a chi-square test. The following results were determined:

• With the proposed entropy-based thresholding method, we suggest the use of 6 m as the boundary zone size for the downtown area and 9 m for the south area. Based on the established boundary zones, boundary collisions were 25.73% and 43.23% of the total collisions in the downtown and south areas, respectively.
• The aggregation results from the three methods showed that the aggregated collision number of the one-to-one ratio method was much larger than those of the CDR and half-to-half ratio methods, greatly exaggerating the boundary collision numbers.
• Compared with the ground truth data, the proposed entropy-based thresholding method can identify boundary collisions with high precision and recall.
• We used a CPM to test the effectiveness of the three aggregation methods. All CPMs from the three aggregation methods were established successfully. Moreover, the results showed that INTB and TPA had a large influence on the occurrence of boundary collisions.
• We used a chi-square test to evaluate the effectiveness of the three aggregation methods by comparing the aggregation results with the true values from the manual inspection. The test result showed that there is no significant difference between the aggregation result of the CDR method and the true value, while the one-to-one ratio and half-to-half ratio methods greatly exaggerated or underestimated the true value.

In this paper, we utilized six variables to establish CPMs due to the limitations of the data. In future, we will consider more relevant variables such as traffic volume data to build the CPM and explore the relationship between boundary collisions and their


explanatory variables. In addition, the boundary zone size may be correlated with speed limit and traffic volume. We will study this relationship in future work to improve the determination of the boundary zone size.

Acknowledgment

The research is supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant to the second author, MITAC internship with City of Edmonton to the first author, and National Natural Science Foundation of China (Grant No. 41271387).

References

City of Edmonton, 2013. Motor Vehicle Collisions 2012. Office of Safety, Edmonton.
Fotheringham, S., Wegener, M., 1999. Spatial Models and GIS: New and Potential Models. CRC Press, London.
Ivan, J.N., Deng, Z., Jonsson, T., 2006. Procedure for allocating zonal attributes to link network in GIS environment. Transportation Research Board 85th Annual Meeting (No. 06-2561).
Kapur, J.N., Sahoo, P.K., Wong, A.K., 1985. A new method for gray-level picture thresholding using the entropy of the histogram. Comput. Vision Graphics Image Process. 29 (3), 273–285.
Karim, M.A., Wahba, M.M., Sayed, T., 2013. Spatial effects on zone-level collision prediction models. Transp. Res. Rec.: J. Transp. Res. Board 2398 (1), 50–59.
Khondakar, B., Sayed, T., Lovegrove, G., 2010. Transferability of community-based collision prediction models for use in road safety planning applications. J. Transp. Eng. 136 (10), 871–880.
Ladron de Guevara, F., Washington, S.P., Oh, J., 2004. Forecasting crashes at the planning level: simultaneous negative binomial crash model applied in Tucson, Arizona. Transp. Res. Rec.: J. Transp. Res. Board 1897 (1), 191–199.
LaScala, E.A., Johnson, F.W., Gruenewald, P.J., 2001. Neighborhood characteristics of alcohol-related pedestrian injury collisions: a geostatistical analysis. Prev. Sci. 2 (2), 123–134.
Lee, J., Abdel-Aty, M., Jiang, X., 2014. Development of zone system for macro-level traffic safety analysis. J. Transp. Geogr. 38, 13–21.
Lovegrove, G.R., 2007. Road Safety Planning: New Tools for Sustainable Road Safety and Community Development. AV Akademikerverlag, Saarbrücken.
Lovegrove, G.R., Sayed, T., 2006. Using macrolevel collision prediction models in road safety planning applications. Transp. Res. Rec.: J. Transp. Res. Board 1950 (1), 73–82.
McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, vol. 2. Chapman and Hall, London.
Peden, M., Scurfield, R., Sleet, D., Mohan, D., Hyder, A.A., Jarawan, E., Mathers, C., 2004. World Report on Road Traffic Injury Prevention. World Health Organization, Geneva. http://cdrwww.who.int/violence_injury_prevention/publications/road_traffic/world_report/intro.pdf (accessed 12.01.14).
Rifaat, S., Tay, R., Perez, A., Barros, A.D., 2009. Effects of neighborhood street patterns on traffic collision frequency. J. Transp. Saf. Secur. 1 (4), 241–253.
Siddiqui, C., Abdel-Aty, M., 2012. Nature of modeling boundary pedestrian crashes at zones. Transp. Res. Rec.: J. Transp. Res. Board 2299 (1), 31–40.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London; New York.
Sun, J., 2009. Sustainable Road Safety: Development, Transference and Application of Community-based Macro-level Collision Prediction Models. Thesis (Master). University of British Columbia.
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 46, 234–240.
Wang, X., Jin, Y., Abdel-Aty, M., Tremont, P.J., Chen, X., 2012. Macrolevel model development for safety assessment of road network structures. Transp. Res. Rec.: J. Transp. Res. Board 2280 (1), 100–109.
Wei, F., 2010. Boundary Effects in Developing Macro-level CPMs: A Case Study of City of Ottawa. University of British Columbia. http://www.cite7org/scholarships_awards/documents/2010_StudentPaper_FengWei.pdf (accessed 06.08.13).
Wong, A.K., Sahoo, P.K., 1989. A gray-level threshold selection method based on maximum entropy principle. IEEE Trans. Syst. Man Cybern. 19 (4), 866–871.
World Health Organization, 2013. WHO Global Status Report on Road Safety 2013: Supporting a Decade of Action. World Health Organization. http://apps.who.int/violence_injury_prevention/road_safety_status/2013/report/en/index.html (accessed 16.09.14).