Accident Analysis and Prevention 40 (2008) 1013–1017
Methodology for estimating the variance and confidence intervals for the estimate of the product of baseline models and AMFs Dominique Lord ∗ Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 77843-3136, USA Received 9 August 2007; received in revised form 8 November 2007; accepted 21 November 2007
Abstract This manuscript describes a methodology for estimating the variance and 95% confidence intervals (CI) for the estimate of the product between baseline models and Accident Modification Factors (AMFs). This methodology is provided for the upcoming Highway Safety Manual (HSM) currently in development in the United States (U.S.). The methodology is separated into two parts. The first part covers the proposed approach for estimating the variance of the estimate of the product between baseline models and AMFs. The second part presents the method for estimating the variance of baseline models. Several examples are presented to illustrate the application of the methodology. © 2007 Elsevier Ltd. All rights reserved. Keywords: Crash prediction models; Baseline models; Variance estimation; Confidence intervals; Highway Safety Manual
1. Introduction This manuscript describes a methodology for estimating the variance and 95% confidence intervals (CI) for the estimate of the product between baseline models and Accident Modification Factors (AMFs). This methodology is provided for the upcoming Highway Safety Manual (HSM) (see Hughes et al., 2005, for additional information) currently in development in the United States (U.S.). The HSM, which is near completion, is a document that will serve as a tool to help practitioners make planning, design, and operations decisions based on safety. The document will serve the same role for safety analysis that the Highway Capacity Manual (HCM) (TRB, 2000) serves for traffic-operations analyses. The purpose of the HSM is to provide the best factual information and tools in a useful and widely accepted form to facilitate explicit consideration of safety in the decision making process. The technique described in the HSM consists of first developing baseline models using data that meet specific nominal conditions, such as 12-ft. lane and 8-ft. shoulder widths for segments or no turning lanes for intersections. These conditions usually reflect design or operational variables most commonly
∗ Tel.: +1 979 458 3949; fax: +1 979 845 6481. E-mail address: [email protected].
0001-4575/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.aap.2007.11.008
used by state transportation agencies (defined as state DOTs). Consequently, baseline models typically only include traffic flow as covariates (e.g., $\mu = \beta_0 F_1^{\beta_1} F_2^{\beta_2}$). The second component of the technique consists of multiplying the output of such models by AMFs to capture changes in geometric design and operational characteristics (Hughes et al., 2005). An important assumption about using this technique is that the AMFs are considered independent, which may not always be true in practice. The formulation of the technique is given by the following:

$$\mu_{\text{final}} = \mu_{\text{baseline}} \times \text{AMF}_1 \times \cdots \times \text{AMF}_n \qquad (1)$$
where $\mu_{\text{final}}$ is the final predicted number of crashes per unit of time; $\mu_{\text{baseline}}$ is the baseline predicted number of crashes per unit of time (via a regression model); and $\text{AMF}_1, \ldots, \text{AMF}_n$ are the accident modification factors, assumed to be independent. Recent discussions at various meetings related to the production of the Manual have shown that estimating the uncertainty associated with baseline models, AMFs, and the estimate of the product between the two has become very important in the eyes of the Task Force members, the committee responsible for the implementation of the Manual, as well as for potential HSM users. So far, the work in this area has focused only on estimating the uncertainty associated with AMFs (Bahar et al., 2007) and, to a lesser degree, with regression models
D. Lord / Accident Analysis and Prevention 40 (2008) 1013–1017
(Wood, 2005; Agrawal and Lord, 2006) (the latter, however, not in the context of the HSM). There is therefore a need to fill this gap by providing a methodology for estimating the variance of the estimate of the product of baseline models and AMFs. This document is divided into three sections. The second section describes the first part of the methodology, which consists of estimating the variance of the product of baseline models and AMFs. This section covers the background material on the product of random variables and presents two examples describing different combinations of baseline models, AMFs and their associated uncertainties. The third section describes the second part of the methodology and explains how to compute the variance of baseline models. An example illustrates the computation of the variance of baseline models as well as the 95% CI for the final product.
2. Estimating the variance of baseline models and AMFs

This section describes the first part of the methodology and is divided into two sub-sections. The first sub-section provides details about the theory behind the multiplication of independent random variables. The second sub-section presents the application of the proposed method for estimating the variance of the estimate of the product of baseline models and AMFs. Two examples are provided.

2.1. Computing the product of random variables

The estimation of the variance can be accomplished using the theory behind the multiplication of independent random variables (Ang and Tang, 1975; Browne, 2000). This theory states that the equations presented below are exact regardless of the type of distribution to which each random variable belongs. For the purpose of this description, we define z as the product of independent random variables:

$$z = x_1 x_2 x_3 \cdots \qquad (2)$$

where z is the product of independent random variables and each $x_i$ is a random variable taken from any kind of distribution. It should be pointed out that the mean and variance are defined as $E[x] = \lambda$ and $E[(x - \lambda)^2] = \nu$ (second central moment), respectively.

2.1.1. Mean of a product

The mean of the product is the direct application of Eq. (2):

$$z = x_1 x_2 x_3 \cdots$$
$$E[z] = E[x_1]\,E[x_2]\,E[x_3] \cdots$$
$$\lambda_z = \lambda_{x_1} \lambda_{x_2} \lambda_{x_3} \cdots \qquad (3)$$

The mean or average of the product is simply the product of the mean values of the random variables.

2.1.2. Variance of a product

The variance of a product is obtained by taking the expectation of the square of z:

$$z^2 = x_1^2 x_2^2 x_3^2 \cdots$$
$$E[z^2] = E[x_1^2]\,E[x_2^2]\,E[x_3^2] \cdots$$
$$(\lambda_z^2 + \nu_z) = (\lambda_{x_1}^2 + \nu_{x_1})(\lambda_{x_2}^2 + \nu_{x_2})(\lambda_{x_3}^2 + \nu_{x_3}) \cdots \qquad (4)$$

Note: $E[x^n] = E[((x - \lambda) + \lambda)^n]$; for n = 2, $E[x^2] = E[((x - \lambda) + \lambda)^2] = E[(x - \lambda)^2] + E[2\lambda(x - \lambda)] + E[\lambda^2] = \nu + \lambda^2$.

The variance $\nu_z$ is computed by first calculating the product on the right-hand side of Eq. (4) and then subtracting the square of the mean, $\lambda_z^2$, where $\lambda_z$ is given by Eq. (3):

$$\nu_z = (\lambda_{x_1}^2 + \nu_{x_1})(\lambda_{x_2}^2 + \nu_{x_2})(\lambda_{x_3}^2 + \nu_{x_3}) \cdots - \lambda_z^2 \qquad (5)$$

Note that if all $\nu_x$'s equal zero, the variance $\nu_z$ will also equal zero:

$$\nu_z = (\lambda_{x_1}^2)(\lambda_{x_2}^2)(\lambda_{x_3}^2) \cdots - \lambda_z^2 = \lambda_z^2 - \lambda_z^2 = 0 \qquad (6)$$

2.2. Application of the theory

This section describes the application of the theory behind the multiplication of independent random variables. Two examples with different predicted values, AMFs, and associated uncertainties are presented. The uncertainty associated with the baseline models can be estimated using the method described in the next section. For estimating the uncertainty related to AMFs, the reader is referred to the work of Bahar et al. (2007), which will be incorporated into the forthcoming HSM.

2.2.1. Example 1: one AMF

This example shows the application of a single AMF. Let $x_1$ represent the predicted value of a baseline model and $x_2$ an AMF:

$x_1$ = 5.0 crashes/year (standard deviation or S.D. = 2.0 crashes/year)
$x_2$ = 0.80 (S.D. = 0.10)

The mean is given by:

$$\lambda_z = 5.0 \times 0.80 = 4.0$$

The variance is given by:

$$\nu_z = (5.0^2 + 4.0)(0.80^2 + 0.01) - 4.0^2 = (29)(0.65) - 16.0 = 18.85 - 16.0 = 2.85$$

The final value is estimated to be: 4.0 crashes/year (S.D. = 1.69 crashes/year).

2.2.2. Example 2: two AMFs

In this example, two AMFs are used. Let $x_1$ represent the predicted value of a baseline model and $x_2$ and $x_3$ independent AMFs:

$x_1$ = 5.0 crashes/year (S.D. = 2.0 crashes/year)
$x_2$ = 0.95 (S.D. = 0.10)
$x_3$ = 0.90 (S.D. = 0.20)

The mean is given by:

$$\lambda_z = 5.0 \times 0.95 \times 0.90 = 4.275$$
The variance is given by:

$$\nu_z = (5.0^2 + 4.0)(0.95^2 + 0.01)(0.90^2 + 0.04) - 4.275^2 = (29)(0.9125)(0.85) - 18.276 = 22.493 - 18.276 = 4.217$$

The final value is estimated to be: 4.275 crashes/year (S.D. = 2.05 crashes/year).

It should be pointed out that adding (independent) AMFs in Eq. (1) increases the uncertainty associated with the final estimate, especially if the uncertainty for each AMF is large. This will have an influence on the comparison of different highway design alternatives based on safety, as discussed below. The next section explains how to estimate the variance associated with baseline models.

3. Estimating the variance of baseline models

This section briefly describes the second part of the methodology, which consists of estimating the variance of predictive baseline models. It should be pointed out that the methodology is not limited to baseline models, since it can also be applied to any type of regression model (e.g., general average daily traffic or ADT models, models with covariates, etc.). The first sub-section describes the basic method for estimating the variance. The second sub-section presents an application of the method and provides details about estimating the 95% CI for the product of baseline models and AMFs.

3.1. Method for computing the variance

There are different methods for estimating the variance (and confidence intervals) of predicted values generated from generalized linear models (Cameron and Trivedi, 1998; Myers et al., 2002). The most recent and relevant method has been proposed by Wood (2005), who specifically developed a procedure for computing the variance and CI for crash prediction models. He worked out the procedure for both Poisson and Poisson–gamma models. Table 1 shows the expressions for the variance of the estimates of the Poisson mean μ, gamma mean m, and predicted response y for Poisson and Poisson–gamma models.

This table shows that the variances used to estimate the uncertainty of the gamma mean and the predicted response for the Poisson–gamma model both incorporate the inverse dispersion parameter, φ. It should be pointed out that in most cases the predicted response will be of interest, since the model will be applied to an observation (or site) that was not used for developing the statistical models.

Table 1
Variance estimation for Poisson and Poisson–gamma models (Wood, 2005)

Parameter   Variance     Poisson                                               Poisson–gamma
$\mu$       $\nu_\mu$    $\hat{\mu}^2\,\text{Var}(\hat{\eta})$                 $\hat{\mu}^2\,\text{Var}(\hat{\eta})$
$m$         $\nu_m$      –                                                     $\dfrac{\hat{\mu}^2\,\text{Var}(\hat{\eta}) + \hat{\mu}^2}{\phi}$
$y$         $\nu_y$      $\hat{\mu}^2\,\text{Var}(\hat{\eta}) + \hat{\mu}$     $\hat{\mu}^2\,\text{Var}(\hat{\eta}) + \dfrac{\hat{\mu}^2\,\text{Var}(\hat{\eta}) + \hat{\mu}^2}{\phi} + \hat{\mu}$

Note: $\text{Var}(\hat{\eta}) = x_0^{\top}(X^{\top}WX)^{-1}x_0$; $\hat{\mu}$: mean estimate of the Poisson or Poisson–gamma model (see Eq. (1) above); $\phi$: inverse dispersion parameter of the Poisson–gamma model; $\nu_\mu$: variance of the Poisson mean; $\nu_m$: variance of the gamma mean; $\nu_y$: variance of the predicted response.
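The expressions in Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function name `table1_variances` and its argument names are assumptions introduced here, and the inputs are the rounded values reported for the Toronto example in Section 3.2 (estimated mean 1.30, Var(η̂) = 0.000166, φ = 6.46). With these rounded inputs the results are close to, but not identical with, the values listed in Table 5, presumably because the table was computed from an unrounded mean estimate.

```python
def table1_variances(mu_hat, var_eta, phi=None):
    """Variances of the Poisson mean, gamma mean, and predicted response.

    Follows the expressions of Table 1 (Wood, 2005).
    mu_hat  : estimated mean from the fitted model
    var_eta : Var(eta_hat), variance of the linear predictor at the site
    phi     : inverse dispersion parameter; None selects the Poisson column
    """
    nu_mu = mu_hat ** 2 * var_eta
    if phi is None:
        # Poisson model: no gamma mean, so nu_m is not defined.
        return {"nu_mu": nu_mu, "nu_m": None, "nu_y": nu_mu + mu_hat}
    # Poisson-gamma model: both nu_m and nu_y involve phi.
    nu_m = (mu_hat ** 2 * var_eta + mu_hat ** 2) / phi
    nu_y = nu_mu + nu_m + mu_hat
    return {"nu_mu": nu_mu, "nu_m": nu_m, "nu_y": nu_y}


# Rounded inputs taken from the Toronto example (assumed here for illustration).
poisson_gamma = table1_variances(mu_hat=1.30, var_eta=0.000166, phi=6.46)
poisson = table1_variances(mu_hat=1.30, var_eta=0.000166)

for name, value in poisson_gamma.items():
    print(f"Poisson-gamma {name} = {value:.6f}")
```

Returning `None` for the gamma mean under the Poisson model mirrors the dash in Table 1; a caller that needs a number in every cell would have to handle that case explicitly.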
3.2. Application of the method

For this application, data collected at 800 four-legged signalized intersections in Toronto were used (a subset of the original data found in Lord, 2000). The data included intersection and intersection-related crashes as well as entering flows at major and minor approaches for the year 1995. These data have been used extensively by previous researchers and have been found to be of relatively good quality (Lord, 2000; Miaou and Lord, 2003; Miranda-Moreno and Fu, 2007). Table 2 describes the summary statistics of the data. In this application, the model developed is classified as a general ADT model. These kinds of models are flow-only models estimated using average conditions reflected in the database. General ADT models have the same functional form as baseline models and are given by the following:

$$\mu = \beta_0 F_{\text{Major}}^{\beta_1} F_{\text{Minor}}^{\beta_2} \qquad (7)$$
where $\mu$ is the mean number of crashes per year; $F_{\text{Major}}$ is the entering vehicles per day (ADT) for the major approaches; $F_{\text{Minor}}$ is the entering vehicles per day (ADT) for the minor approaches; and $\beta_0$, $\beta_1$, $\beta_2$ are the regression coefficients.

As reported by Miaou and Lord (2003), the functional form described above is not the most adequate for describing the relationship between crashes and exposure, since the form does not appropriately fit the data near the boundary conditions (i.e., when $F \to 0$ and for $F_{\max}$). Nonetheless, it is still relevant for this application, as it is considered an established functional form in the highway safety literature. In addition, the most adequate functional form proposed by Miaou and Lord (2003), a model with two distinct mean functions, cannot be estimated via a generalized linear modeling (GLM) framework, as was done herein. Tables 3 and 4 summarize the modeling output and the variance–covariance matrix, $(X^{\top}WX)^{-1}$, respectively. The data were fitted using a Poisson–gamma regression model with a fixed inverse dispersion parameter (i.e., $\text{Var}(Y) = \mu + \mu^2/\phi$) (see Miaou and Lord, 2003; Mitra and Washington, 2007, about this assumption). The coefficients of the model were estimated using Genstat (Payne, 2000).

Table 2
Summary statistics

Variable             Minimum    Maximum    Average (standard deviation)
F_Major (ADT)        5,296      72,310     27,471 (14,299)
F_Minor (ADT)        52         42,644     10,768 (1,779)
Crashes per year     0          44         9.49 (7.82)

Table 3
Model output

Variable                              Coefficients (std. err.)
Intercept (ln β0)                     −11.120 (0.204)
F_Major (β1)                          0.558 (0.020)
F_Minor (β2)                          0.650 (0.009)
Inverse dispersion parameter (φ)      6.46 (0.22)
Goodness-of-fit statistics            Deviance: 5602 (F = 3,807), deviance ratio: 1.076

Table 4
Variance–covariance matrix

Variable      Intercept     F_Major       F_Minor
Intercept     0.041513      −0.003735     −0.000377
F_Major       −0.003735     0.000404      −0.000043
F_Minor       −0.000377     −0.000043     0.000089

In this application, we assume that the signalized intersection investigated, which was not used for developing the model, has the following characteristics: $F_{\text{Major}}$ = 35,500 vehicles/day (ln 35,500 = 10.48) and $F_{\text{Minor}}$ = 5000 vehicles/day (ln 5000 = 8.52). The mean number of crashes is estimated to be 1.30 crashes/year ($\mu = 0.00001481 \times 35{,}500^{0.558} \times 5000^{0.650}$). The variance $\text{Var}(\hat{\eta})$ can be computed as follows:

$$\text{Var}(\hat{\eta}) = x_0^{\top}(X^{\top}WX)^{-1}x_0 = \begin{pmatrix} 1 & 10.48 & 8.52 \end{pmatrix} \begin{pmatrix} 0.041513 & -0.003735 & -0.000377 \\ -0.003735 & 0.000404 & -0.000043 \\ -0.000377 & -0.000043 & 0.000089 \end{pmatrix} \begin{pmatrix} 1 \\ 10.48 \\ 8.52 \end{pmatrix} = 0.000166 \qquad (8)$$

Using the value of $\text{Var}(\hat{\eta})$ computed above, the variances of the Poisson mean μ, gamma mean m, and predicted response y can be calculated using the equations in Table 1 (using the column for Poisson–gamma models). The results are shown in Table 5.

Table 5
Estimated variance for the Poisson mean, gamma mean, and predicted response

Parameter    Variance     Estimate
$\mu$        $\nu_\mu$    0.000281
$m$          $\nu_m$      0.2622
$y$          $\nu_y$      1.5639

Note: $\hat{\mu}$ = 1.30; $\text{Var}(\hat{\eta})$ = 0.000166.

Table 6
95% confidence interval for the estimate of the product of baseline model and AMFs (adapted from Wood, 2005)

Parameter    Intervals
Poisson model
$\mu$        $\left[\lambda_z / e^{1.96\sqrt{\nu_{z\mu}}},\ \lambda_z\, e^{1.96\sqrt{\nu_{z\mu}}}\right]$
$y$          $\left[0,\ \lfloor \lambda_z + \sqrt{19\nu_{zy}} \rfloor\right]$
Poisson–gamma model
$\mu$        $\left[\lambda_z / e^{1.96\sqrt{\nu_{z\mu}}},\ \lambda_z\, e^{1.96\sqrt{\nu_{z\mu}}}\right]$
$m$          $\left[\max\{0,\ \lambda_z - 1.96\sqrt{\nu_{zm}}\},\ \lambda_z + 1.96\sqrt{\nu_{zm}}\right]$
$y$          $\left[0,\ \lfloor \lambda_z + \sqrt{19\nu_{zy}} \rfloor\right]$

Note:
• $\lambda_z$ = baseline model output × AMF1 × · · · × AMFn, or $\lambda_z = \mu_{\text{baseline}} \times \lambda_{\text{AMF}_1} \times \cdots \times \lambda_{\text{AMF}_n}$.
• $\nu_{z\mu}$ = variance estimated from the variance of the Poisson mean ($\nu_\mu$) (first row in Table 1) and the variance of AMFs.
• $\nu_{zm}$ = variance estimated from the variance of the gamma mean ($\nu_m$) (second row in Table 1) and the variance of AMFs.
• $\nu_{zy}$ = variance estimated from the variance of the predicted response ($\nu_y$) (third row in Table 1) and the variance of AMFs; this is dependent upon whether a Poisson or a Poisson–gamma model was used for estimating the baseline model.
• $\lfloor x \rfloor$ denotes the largest integer less than or equal to x.
• For computing any other % CI for y, $19\nu_{zy}$ can be substituted with $((1.0 - \alpha)/\alpha)\nu_{zy}$, where α = 1 − percentile in %; for instance, for estimating the upper boundary of the 90% CI, the upper value becomes $((1.0 - 0.1)/0.1)\nu_{zy} = 9\nu_{zy}$.

Now, let us assume that two AMFs (i.e., hypothetical values) can be used to estimate changes associated with the introduction of left-turn and right-turn lanes, both located on the major approaches (both sides), respectively: AMF1 = 0.90 (S.D. = 0.05) and AMF2 = 0.95 (S.D. = 0.10). Also assume that the model represents a baseline model, where the sites do not have right- and left-turning lanes. Using Eqs. (3) and (5), one can, for instance, estimate the predicted value and associated uncertainty as follows (note: we changed the notation from $\mu_{\text{final}}$, as shown in Eq. (1), to $\lambda_z$ for this example):

$$\lambda_z = \mu \times \lambda_{\text{AMF}_1} \times \lambda_{\text{AMF}_2} = 1.30 \times 0.90 \times 0.95 = 1.1115 \qquad (9)$$

$$\nu_{zy} = (\mu^2 + \nu_y)(\lambda_{\text{AMF}_1}^2 + \nu_{\text{AMF}_1})(\lambda_{\text{AMF}_2}^2 + \nu_{\text{AMF}_2}) - \lambda_z^2 = (1.30^2 + 1.564)(0.90^2 + 0.0025)(0.95^2 + 0.01) - 1.1115^2 = (3.2540)(0.8125)(0.9125) - 1.2354 = 1.1771 \qquad (10)$$
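The arithmetic of Eqs. (9) and (10), and the Table 6 upper limit for the predicted response, can be checked with a short Python sketch. The function names `product_mean_variance` and `upper_bound_y` are hypothetical helpers introduced here for illustration, not part of the paper or of any library; the inputs are the values used in Example 1 of Section 2.2.1 and in this application.

```python
import math


def product_mean_variance(means, variances):
    """Mean and variance of a product of independent random variables.

    Implements Eq. (3) for the mean and Eq. (5) for the variance.
    """
    mean_z = math.prod(means)
    # Product of second moments E[x_i^2] = lambda_i^2 + nu_i, as in Eq. (4).
    second_moment = math.prod(m ** 2 + v for m, v in zip(means, variances))
    return mean_z, second_moment - mean_z ** 2


def upper_bound_y(mean_z, var_zy, level=0.95):
    """Upper limit floor(mean + sqrt(((1 - a) / a) * var)) for y (Table 6)."""
    alpha = 1.0 - level
    return math.floor(mean_z + math.sqrt((1.0 - alpha) / alpha * var_zy))


# Example 1: baseline prediction 5.0 (S.D. 2.0) and one AMF of 0.80 (S.D. 0.10).
ex1_mean, ex1_var = product_mean_variance([5.0, 0.80], [2.0 ** 2, 0.10 ** 2])

# Section 3.2: mu = 1.30 with nu_y = 1.564, AMFs 0.90 (S.D. 0.05) and 0.95 (S.D. 0.10).
lam_z, nu_zy = product_mean_variance([1.30, 0.90, 0.95],
                                     [1.564, 0.05 ** 2, 0.10 ** 2])
upper = upper_bound_y(lam_z, nu_zy)

print(f"Example 1: mean = {ex1_mean:.3f}, variance = {ex1_var:.3f}")
print(f"Section 3.2: mean = {lam_z:.4f}, variance = {nu_zy:.4f}, "
      f"S.D. = {math.sqrt(nu_zy):.3f}, 95% interval for y = [0, {upper}]")
```

Passing the variances rather than the standard deviations keeps the helper aligned with the notation of Eq. (5); the standard deviations quoted in the text are squared at the call site.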
(Note: $\nu_{zy}$ denotes that the variance was estimated using the variance associated with the predicted response of the baseline model ($\nu_y$) and the variance of the AMFs.) Therefore, the estimated value after the AMFs are multiplied with the baseline model output is 1.112 crashes/year for the signalized intersection under investigation. The standard deviation of the estimate of the product is equal to 1.085. If the 95% CI is the value of central interest, it can be computed using the equations listed in Table 6 (see Wood, 2005, for additional information). Confidence intervals can be used for screening highway design alternatives based on safety (e.g., Kononov and Allery, 2004) or identifying hazardous sites (Miranda-Moreno et al., 2005). For the value above, the 95% confidence interval for the predicted response is located between 0 and 5: $\left[0,\ \lfloor 1.112 + \sqrt{19 \times 1.177} \rfloor\right] = \left[0,\ \lfloor 5.841 \rfloor\right] = [0, 5]$. In this example, the confidence interval boundaries appear to be very wide given the estimated value, but the range is what one would expect for the predicted response (see Wood, 2005; Agrawal and Lord, 2006; Geedipally and Lord, 2008, for other examples). As noted in Wood (2005), a tighter CI for y could be computed using the equations presented in Appendix A of his paper when $\lambda_z \le 1$.

4. Summary

This manuscript has presented a methodology for computing the variance and 95% CI for the estimate of the product between baseline models and AMFs when the uncertainty is known both for the models and the modification factors. The first section explained the technique proposed in the HSM to estimate the safety performance of segments or intersections using baseline models and AMFs. The second section described the first part of the methodology. This section covered the background material on the product of independent random variables. The third section presented the second part of the methodology, which focused on estimating the variance of baseline models. This section also explained how to compute the 95% CI when the product involves the estimate of the Poisson mean, gamma mean, or predicted response.

Acknowledgments

The author would like to thank Drs. Simon Washington and Graham Wood for providing very useful comments on an earlier draft of this document. The paper also benefited from the input of two anonymous reviewers. This work was performed as part of the research project NCHRP 17-29 funded by the National Academy of Science.

References

Agrawal, R., Lord, D., 2006. Effects of sample size on the goodness-of-fit statistic and confidence intervals of crash prediction models subjected to low sample mean values. Transport. Res. Rec. 1950, 35–43.
Ang, A.H.-S., Tang, W.H., 1975. Probability Concepts in Engineering Planning and Design: Basic Principles, vol. 1. John Wiley & Sons, New York, NY.
Bahar, G., Parkhill, M., Hauer, E., Council, F.M., Persaud, B., Zegger, C., Elvik, R., Smiley, A., Scotty, B., 2007. Prepare Parts I and II of the Highway Safety Manual. Final Report for NCHRP Project 17-27. iTRANS Consulting, Richmond Hill, Ont. (Note: Future Parts I and II of the HSM).
Browne, J.M., 2000. Probabilistic Design. Class Notes. School of Engineering and Science, Swinburne University of Technology, Hawthorn, Australia, http://www.ses.swin.edu.au/homes/browne/probabilisticdesign/coursenotes/notespdf/4ProbabilisticDesign.pdf.
Cameron, A.C., Trivedi, P.K., 1998. Regression Analysis of Count Data. Cambridge University Press, Cambridge, UK.
Geedipally, S.R., Lord, D., 2008. Effects of the varying dispersion parameter of Poisson–gamma models on the estimation of confidence intervals of crash prediction models. In: Paper Presented at the 87th Annual Meeting of the Transportation Research Board, Washington, DC.
Hughes, W., Eccles, K., Harwood, D., Potts, I., Hauer, E., 2005. Development of a Highway Safety Manual. Appendix C: Highway Safety Manual Prototype Chapter: Two-Lane Highways. NCHRP Web Document 62 (Project 17-18(4)). Washington, DC (http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_w62.pdf accessed October 2007).
Kononov, J., Allery, B.K., 2004. Explicit consideration of safety in transportation planning and project scoping. Transport. Res. Rec. 1897, 116–125.
Lord, D., 2000. The Prediction of Accidents on Digital Networks: Characteristics and Issues Related to the Application of Accident Prediction Models. Ph.D. Dissertation. Department of Civil Engineering, University of Toronto, Toronto, Ontario.
Miaou, S.-P., Lord, D., 2003. Modeling traffic-flow relationships at signalized intersections: dispersion parameter, functional form and Bayes vs. empirical Bayes. Transport. Res. Rec. 1840, 31–40.
Miranda-Moreno, L.F., Fu, L., 2007. Traffic Safety Study: Empirical Bayes or Full Bayes? In: Paper 07-1680 Presented at the 86th Annual Meeting of the Transportation Research Board, Washington, DC.
Miranda-Moreno, L.F., Fu, L., Saccomanno, F.F., Labbe, A., 2005. Alternative risk models for ranking locations for safety improvement. Transport. Res. Rec. 1908, 1–8.
Mitra, S., Washington, S., 2007. On the nature of over-dispersion in motor vehicle crash prediction models. Accident Anal. Prevention 39 (3), 459–468.
Myers, R.H., Montgomery, D.C., Vining, G.G., 2002. Generalized Linear Models: with Applications in Engineering and the Sciences. Wiley Publishing Co., New York, NY.
Payne, R.W. (Ed.), 2000. The Guide to Genstat. Lawes Agricultural Trust. Rothamsted Experimental Station, Oxford, UK.
TRB, 2000. Highway Capacity Manual 2000. Transportation Research Board, Washington, DC.
Wood, G.R., 2005. Confidence and prediction intervals for generalized linear accident models. Accident Anal. Prevention 37 (2), 267–273.