Bayesian analysis of multivariate crash counts using copulas

Bayesian analysis of multivariate crash counts using copulas

Accident Analysis and Prevention xxx (xxxx) xxxx Contents lists available at ScienceDirect Accident Analysis and Prevention journal homepage: www.el...

1MB Sizes 0 Downloads 66 Views

Accident Analysis and Prevention xxx (xxxx) xxxx

Contents lists available at ScienceDirect

Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap

Bayesian analysis of multivariate crash counts using copulas Eun Sug Parka, Rosy Ohb, Jae Youn Ahnc, Man-Suk Ohc,* a b c

Texas A&M Transportation Institute, Texas A&M University System, 3135 TAMU, College Station, TX, 77843-3135, United States Institute of Mathematical Sciences, Ewha Womans University, Seoul, 03760, South Korea Department of Statistics, Ewha Womans University, Seoul, 03760, South Korea

ARTICLE INFO

ABSTRACT

Keywords: Highway safety Multivariate crash counts Crash types Crash severity Unobserved heterogeneity Overdispersion

There has been growing interest in jointly modeling correlated multivariate crash counts in road safety research over the past decade. To assess the effects of roadway characteristics or environmental factors on crash counts by severity level or by collision type, various models including multivariate Poisson regression models, multivariate negative binomial regression models, and multivariate Poisson-Lognormal regression models have been suggested. We introduce more general copula-based multivariate count regression models with correlated random effects within a Bayesian framework. Our models incorporate the dependence among the multivariate crash counts by modeling multivariate random effects using copulas. Copulas provide a flexible way to construct valid multivariate distributions by decomposing any joint distribution into a copula and the marginal distributions. Overdispersion as well as general correlation structures including both positive and negative correlations in multivariate crash counts can easily be accounted for by this approach. Our copular-based models can also encompass previously suggested multivariate count regression models including multivariate Poisson-Gamma mixture models and multivariate Poisson-Lognormal regression models. The proposed method is illustrated with crash count data of five different severity levels collected from 451 three-leg unsignalized intersections in California.

1. Introduction Crash data are often collected by different crash types and severity levels. In evaluations of safety effectiveness of various countermeasures, it is important to consider the effects across different crash types or severity levels to fully understand their effects. It is possible that a countermeasure that is effective in reducing one type of crashes may increase a chance of the other types of crashes unintentionally. For instance, red light cameras (RLC) may increase rear end crashes while decreasing angle crashes (Council et al., 2005) or transverse rumble strips (TRSs) may have different effects on fatal, incapacitating and nonincapacitating injury crashes and property damage only (PDO) crashes (Srinivasan et al., 2010). In such cases, an evaluation of the countermeasure based on multivariate crash data for different crash types or severity levels will be beneficial rather than conducting separate analyses for each crash type or severity. Especially, the analysis of rare crashes such as pedestrian crashes or fatal crashes at intersections (or segments) can benefit from the joint analysis of multiple crash types or severity levels by borrowing information from other crash types or severity levels at those intersections (or segments). There has been considerable interest in jointly analyzing



multivariate crash counts by severity or by crash type over the past decade (see, e.g., Ma and Kockelman, 2006; Park and Lord, 2007; Park et al., 2010; Mannering and Bhat, 2014 and references for bivariate/ multivariate models of Table 1 therein). However, in terms of regression models for multivariate crash counts, a few specific forms of models such as multivariate Poisson regression models, multivariate Poisson-Lognormal regression models, multivariate Poisson gammamixture models, and multivariate random parameter models with spatial heterogeneity have been frequently used in safety analyses. Yasmin and Eluru (2018) provides an extensive summary on modeling crash counts by severity. More recently, copula-based methods for modeling dependence in multivariate crash data have been proposed by several authors. See Eluru et al. (2010); Yasmin et al. (2014), Wali et al. (2018) among others for copula modelling of disaggregate level discrete data, and Nashad et al. (2016) and Yasmin et al. (2018) for copula modelling of aggregate level crash count data. A copula is a flexible probabilistic tool for modeling the joint distribution of a random vector in two separate steps: selection of the marginal distribution and specification of a copula function which captures the dependence structure among the vector components. See

Corresponding author at: Department of Statistics, Ewha Womans University, Seoul, 03760, South Korea. E-mail addresses: [email protected] (E.S. Park), [email protected] (R. Oh), [email protected] (J.Y. Ahn), [email protected] (M.-S. Oh).

https://doi.org/10.1016/j.aap.2019.105431 Received 14 August 2019; Received in revised form 30 December 2019; Accepted 31 December 2019 0001-4575/ © 2020 Elsevier Ltd. All rights reserved.

Please cite this article as: Eun Sug Park, et al., Accident Analysis and Prevention, https://doi.org/10.1016/j.aap.2019.105431

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

Joe (1997) and Nelson (2006) for review of copula models and their properties. Copulas have proven to be a useful way of modeling dependence in multivariate data and are now used in a diverse range of applications. While copula-based approaches to modeling multivariate continuous random variables have been successful in many applications, studies on copula-based approaches to modeling multivariate discrete random variables are relatively few. The main reason for this can be summarized in two points. First, one of the biggest barriers in using copulas on discrete data is nonidentifiability of copula (see, e.g., Yang et al., 2019). The nonidentifiability issue of copulas on discrete outcomes results in difficulties in modeling and interpretation. The second obstacle is the difficulty in calculating the likelihood function and estimation. For continuous data, when parametric copula functions are employed, the copula parameters as well as marginal parameters can often be estimated easily using maximum likelihood or other methods. When the data are discrete, however, estimation can be cumbersome and this has limited the use of copulas for multivariate discrete data (Genest and Neslehova, 2007; Smith et al., 2012). Specifically, repetitive evaluations of copula functions are required to compute the likelihood for discrete data. For J-dimensional discrete random variables, the joint probability mass function for each observation requires 2J evaluations of copulas, so that computation of the likelihood for I observations involves I×2J copula evaluations. The number of computations increases rapidly with J although it may not be much of a problem when J is 6 or less (Yasmin et al., 2018) for copulas with closed form expressions (e.g., Archimedean copulas). When copula function is not given in a closed form, however, the computation can be burdensome. Also, even when J is small it may be difficult to maximize the likelihood particularly when the copula and marginal parameters are estimated jointly (Smith et al., 2012; Smith, 2013). As a matter of fact, previous copula–based studies on crash count data mostly deal with bivariate data (with the exception of Yasmin et al., 2018) and/or use Archimedean copulas to reduce a computational burden. While Archimedean copulas have been predominantly used in previous bivariate or multivariate copula-based methods, they assume the same correlation for all pairs of dependent variables and allow only positive associations, which may be too restrictive in some cases. Also, almost all of the currently available copula-based crash analysis methods employ maximum likelihood methods for parameter estimation. It is well-known that iterative maximum likelihood methods do not guarantee a global maximum especially in a complicated high dimensional space. There is a need for a more flexible approach, while computationally feasible, to model general correlation structures (not restricted to the same correlation and/or positive correlations only) in multivariate crash counts. There has been growing interest in developing Bayesian approaches to copula modeling because Bayesian approaches using Markov chain Monte Carlo (MCMC) may overcome computational burdens in estimating copula models for high dimensional data, discrete marginal distributions, or joint estimation of marginal distributions and copula (Smith et al., 2012). Another advantage of Bayesian approach is that it can use Bayesian selection for parsimonious modelling of dependence structure (although it will not be explored in this paper). Bayesian selection idea can also be employed for pair‐copula constructions (Min and Czado, 2011; Smith and Khaled, 2012). See Smith (2013) for excellent review on Bayesian approaches to copula modelling, and LoaizaMaya and Smith (2019); Guo et al. (2019); Mohammadi and Abegaz (2019) for recent works on Bayesian approaches to copula modeling. Other research works on Bayesian estimation of copula models for discrete data include Pitt et al. (2006); Smith et al. (2012), and Smith and Khaled (2012). These studies adopted the Bayesian data augmentation idea of Tanner and Wong (1987) to get around computational difficulties in estimation of copula models for discrete data. Specifically, they augmented the likelihood with continuous latent variables and a copula model is assumed for those continuous latent variables. Despite the aforementioned advantages and other advantages of Bayesian analysis such that it can easily incorporate useful prior

information into parameter estimation as well as performing uncertainty estimation simultaneously, Bayesian approaches to copulabased modeling of multivariate crash counts have not been reported in safety literature. The objective of this paper is to introduce a general and flexible Bayesian copula-based modeling approach into multivariate crash count data analysis. Specifically, we propose multivariate regression models with continuous random effects (random intercepts) for crash counts, using copulas for modeling the joint distribution of the random effects (copula-based random effects models). In copula-based random effects models, observations are assumed to be independent conditioning on the random effects, preventing repetitive computation of the copula functions for the evaluation of the joint probability mass function. Note that multivariate Poisson-Gamma mixture models, independent multivariate Poisson-Gamma mixture models (which are equivalent to univariate Poisson-Gamma mixture models), and multivariate Poisson lognormal models can all be viewed as a special case of copula-based random effects models (see Section 2.2). While our Bayesian approach does not require any restriction on the copula in principle, in this paper we adopt a multivariate Gaussian copula among many other choices of copulas. Because the Gaussian copula is parametrized using a correlation matrix, it provides intuitive interpretation of the dependence structure as well as great flexibility in modeling the dependence structure (Joe, 2014). We perform Bayesian inference using Markov chain Monte Carlo (MCMC) methods. In MCMC, parameters and random effects are generated from their full conditional posterior distributions and the complexity of computing the likelihood function (the joint density of the multivariate count variables) can be avoided. The remainder of this paper is organized as follows. Section 2 presents our copula-based multivariate count regression models and Bayesian estimation coupled with Markov chain Montel Carlo implementation. In Section 3, we illustrate the proposed method by applying it to the California intersection crash data. Finally, concluding remarks are provided in Section 4. 2. Method We provide a brief introduction of copula in Section 2.1 and the modeling framework in Section 2.2. Sections 2.3 and 2.4 contain priors and parameter estimation implemented by a Markov chain Monte Carlo method, respectively. 2.1. Copula A copula is the joint distribution of random variables, each of which is marginally uniformly distributed as Unif(0,1). A J-dimensional copula C (u) = C (u1, …, uJ ) for u (0, 1) J is a cumulative distribution function (cdf) over the J-dimensional unit cube (0, 1) J with standard uniform marginal distributions. For given marginal cdfs F1, …, FJ and a J-dimensional copula C, if we define a function H of (X1, …, XJ ) as

H (x1, …, xJ ) = C (F1 (x1), …, FJ (xJ )), then, by Sklar’s theorem (Sklar, 1959), H can be a joint cdf of (X1, …, XJ ) with marginal cdfs F1, …, FJ . On the other hand, for a given J-dimensional continuous joint cdf H having marginals F1, …, FJ , the copula can be uniquely retrieved as

C (u) = H (F1 1 (u1), …, F J 1 (uJ )), u = (u1, …, uJ )

(0, 1) J .

See Sklar (1959) for more details. There exists a rich class of copulas: Archimedean copulas, elliptical copulas, vine copulas, and so on. Among these copulas, elliptical copulas such as Gaussian copula and t-copula have important practical advantages of providing intuitive interpretation on the dependence 2

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

structure in the form of correlation matrix (Joe, 2014). The Gaussian copula can be constructed from multivariate Gaussian distribution as follows:

CRG (u) =

R(

1 (u ), 1

h ( i ) = exp

where R is a correlation matrix, R is the cdf of NJ (0 , R) , the J-dimensional multivariate normal distribution with mean 0 and covar1 iance matrix R, and and are the cdf and the inverse cdf of the standard normal distribution N(0,1), respectively. For a set of variables X = (X1 , ,XJ ) with each Xj having the marginal cdf Fj , if we assume the Gaussian copula on the joint distribution of X with Uj = Fj (Xj ), then the joint distribution of X can be expressed as follows:

H (x1, ,xJ ) = CRG (F1 (x1), ,FJ (xJ )) = R ( 1 (F1 (x1)), , 1 (FJ (xJ ))).

(1)

1 (F (x )) and z = (z , Let z j = j j 1 (1) with respect to x gives

,z J ) Differentiating both sides of J

J

h (x1,

,xJ ) = [

R (z1,

,z J )

(zj )] j=1

= exp[

1 z (R 2

f j (x j ) j =1

J 1

I)z ]

f j (xj ),

(2)

j=1

i

j

(3)

(4)

exp (Xi j).

Here, i handles (incorporates) both site- and outcome-specific heterogeneity. Let i j = exp (Xi j) . We account for the dependence among the multivariate crash counts of J different crash types (or severity levels) at site i by modeling the random effects i = ( i1, i2 , , iJ ) using copulas, i.e., j

i

C (F1,

= ( i1,

2 i ,

,

J i )

CRG (F1,

where Fj is the marginal cdf of be given as

j

i

. From Eq. (2), the joint pdf of

j h j i )] + cov[E (yi | i ), 0 + cov( i j i j , ih ih) = i j ih cov( i j , ih)

E (yih |

h i )]

(7)

yih )

NK + 1 (b0j , B0 1), j = 1,

,J,

Gamma (c0j, d0j ), (R) I {Rjk : Rjk = 1(j = k ), |Rjk | < 1(j

j

, FJ ), j

(6)

1 (F ( j )) , j i

Bayesian modeling requires prior specification for all unknown parameters in the model. For the priors, we assume that parameters ( 1 , 2, …, J , 1, 2, …, J , R) independently follow the following distributions:

where C is a copula and Fj is the marginal cdf of i j . Although any choice of copula can be made for i , we focus on the Gaussian copula in this paper because the Gaussian copula is easy to handle while being able to incorporate a general dependence structure. For random effects i , we assume i

j

2.3. Priors

(5)

, FJ ),

z iJ ) ,

From Eq. (7), it can be clearly seen that cov(yi , is positive (negative) if and only if cov( i j , ih ) is positive (negative), hence both positive and negative correlations can be accommodated through the Gaussian copula. It should be noted that our model is more general than the previously suggested multivariate Poisson-Gamma mixture models requiring the covariances to be non-negative (see, e.g., p. 213 of Winkelmann, 2008). Also, if we assume the Gaussian copula with R=I (an identity matrix) and gamma marginal distributions for i j , j = 1, , J , then we obtain independent multivariate Possion-Gamma mixture models which are equivalent to univariate Poisson-Gamma mixture models. Note that Multivariate Poisson-Lognormal (MVPLN) models (Chib and Winkelmann, 2001) can be viewed as a special case of (6) as well in which f j is given as a lognormal distribution. In a similar way to our aforementioned model assuming the Gaussian copula with a general correlation matrix and gamma marginals, the MVPLN models can also incorporate both positive and negative correlation as well as accounting for overdispersion. For MVPLN models, cov( i j , ih ) is given in a closed form and hence cov(yi j , yih ) can be computed analytically. When cov( i j , ih ) is not given in a closed form, one may estimate cov( i j , ih ) and cov(yi j , yih ) using posterior samples of i j and ih obtained from Markov chain Monte Carlo (MCMC) methods.

where

µi j =

f j ( i j ),

j

, yiJ ) denote a multivariate observation conLet yi = ( yi1 , yi2 , , I ). sisting of counts of J different types of crashes at site i (i = 1, That is, yi j is the number of crashes of type j occurred at site i. Let K be , XiK ) be a (K + 1)-dithe number of covariates and Xi = (1, Xi1 , , Kj ) denote mensional row vector of covariates. Let j = ( 0j , 1j , the (K + 1)-dimensional column vector of the regression coefficients for the crash count of the jth type. Let i = ( i1, i2 , , iJ ) denote a vector of site and outcome-specific random effects corresponding to site i for J different types of crashes (or severity levels). Suppose that, conditional on i j and βj, the crash count of type j at site i, yi j , follows a Poisson distribution with mean µi j , i.e.,

Poisson (µi j ),

i j =1

z i2,

=

2.2. Modeling framework

j

I)z

cov(yi j , yih ) = E [cov(yi j , yih | i j ,

where h (x1, ,xJ ) is the joint pdf of X, ϕR is the pdf of NJ (0 , R) , ϕ is the pdf of N(0,1), and f j is the marginal pdf of Xj . It is clearly seen in (2) that the joint pdf of X is decomposed into a term related to the dependence structure and a term related to the marginal distributions.

yi j | i j ,

J 1

zi = where z i = and f j is the marginal , pdf of i j . Remark1. Although we presented our copular-based multivariate random effects models using a Gaussian copula, an extension to t copulas or skew t copulas is easy because they can be represented as Gaussian copulas conditional upon latent variables following a gamma distribution (see Demarta and McNeil, 2005; Smith et al., 2012). The marginal distribution f j of i j can be chosen freely, separately from the dependence structure. An appropriate choice of the marginal distribution for random effects can handle overdispersion while preserving the mean of yi j . The common choices for the marginal distribution are those having mean 1 for i j , which gives E [yi j ] = i j . In this paper, we assume that the marginal distribution of i j is a gamma distribution with mean 1 and variance 1 j , i.e. j Gamma ( j , 1 j ) , so that the marginal distribution of yi j is given i as a negative binomial (NB) distribution with mean i j and variance j j j] . Note that in this case overdispersion can be taken into i [1 + i account through a parameter j while allowing a general dependence , yiJ through the correlation matrix R of structure among yi1 , yi2 , the Gaussian copula. The covariance between yi j and yih (given j ’ s) can be obtained by applying the conditional covariance formula as follows: 1 ( zi ,

1 (u )), J

,

1 z i (R 2

k )andRis positive definite}, (8)

i

where NK + 1 (b0j , B0 1) is the (K + 1)-variate normal distribution with mean vector b0j and precision matrix B0, Gamma (c0j, d 0j ) is the Gamma

can 3

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

distribution with a shape parameter c0j and a scale parameter d0j , i.e., with mean c0j d 0j . The prior for R given in (8) is a non-informative prior which is uniform on the restricted space where R satisfies the condition for a correlation matrix (Barnard et al., 2000). Here, (b0j , B0 1, c0j, d 0j ), j = 1, , J , are pre-specified hyper-parameters.

min

J

[{ i=1

J

×

j =1

j

2)

ˆj

2)

) fT ( j *| , V ˆ j,

J

I

j= 1

i=1

J

f (yi j | j ,

i

j

)} cRG (F1 ( i1), F2 ( i2),

J i ))

,FJ (

j =1

,1 .

R 1) zi } exp {(

[exp {0. 5zi (I

likelihood × prior I

( j|y j ,

j

) fT ( j| ˆ , V ˆ j,

from the full conditional

( | , R)

The joint posterior distribution of the parameters is given as follows:

=

j

Sampling = ( 1, 2 , , J ). The MH algorithm can be used to sample distribution

2.4. MCMC implementation

Posterior

j|y j ,

(

f j ( i j )]

×exp (c0j

j

j

1) log

j

1) log

i

j

j

i

j

+

j log j

log ( j )}]

.

d 0j

( j|b0j , B0 1) fG ( j|c 0j, d 0j ),

(R) j =1

where cRG (u) is the density of the Gaussian copula given as R

cRG (u) =

1 (u ), 1

(

J j =1

u = (u1,

1 (u )) J

,

1 (u )) j

(

A truncated normal distribution fTN ( j | ˆ j , 22 V ˆ j ) is used as a proposal distribution where the location parameter ˆ j and the scale parameter V ˆ j are obtained by using the Newton-Raphson algorithm, and 2 is a tuning parameter. A proposal j is drawn from fTN ( j | ˆ j , 22 V ˆ j ). Then is accepted with probability = min { *, 1} , where

,

(0, 1) J ,

,uJ )

R is the joint density of J-variate normal distribution with a mean zero 1 vector and correlation matrix R , is the quantile function, is the density function of the standard normal distribution, and Fj and f j are

the cdf and pdf of the marginal distribution of i , respectively, and fG ( j |c0j, d 0j ) is the pdf of Gamma (c0j, d 0j ) distribution. Parameter estimation for ( 1 , 2 , , J , 1, 2 , , J , R, 1 , , I ) , 1 2 = , iJ ) , can be performed by using posterior samples i ( i, i, generated from the following Markov chain Monte Carlo (MCMC) algorithm. MCMC algorithm: Sampling i = ( i1, i2 , , iJ ) , i = 1, ,I . The full conditional posterior density function of 1 2 , iJ ) is given as i = ( i , i ,

( i |yi , , , R)

exp[0. 5z i (I +(

j

R 1)z i] ×

1)log µ ij

J j =1

J *j j =1 i

where

*=

i=1

where µi j =

j

V ˆ j ))

2

V ˆ j ))

(

.

i

|R, ) (R) , FJ (

J

i

)) (R)

i=1 I R (z i )

( *i | yi , , , R)

×

j J j=1 i j .Note J j =1 i

exp

(R)

that

b0j )}],

j

the Newton-Raphson algorithm. A proposal j fT ( j| ˆ , V ˆ j, 2 ) is then accepted with probability

j

drawn

I log |R| 2

I

z i R 1z i I (A),

0.5

(9)

i=1

where z i = ( 1F1 ( i1), , 1FJ ( iJ )) , I( ) is the indicator function, and A = {R: Rjk = 1(j = k ), |Rjk | < 1(j k ), R is positive definite}. Sampling R from its conditional posterior distribution is challenging especially due to the constraints on the correlation matrix (positive definiteness and fixed diagonal elements). We get around this difficulty by employing the parameter-expanded reparameterization and Metropolis-Hasting (PX-RPMH) algorithm of Liu and Daniels (2006) for sampling the correlation matrix R. In this algorithm, the difficulty of sampling R can be overcome by simulating covariance matrix from an inverse Wishart distribution, and then transforming it back to correlation matrix through the reduction function (R = D 1 D 1) where D is a diagonal matrix. The algorithm consists of two steps. Step 1: (z*, ) be a one-to-one mapping where z*i = Dzi , Let T:(z, R) I = DRD, is a J × J positive definite matrix, and i = 1 zij*2 = 1 for each j = 1, .., J . Choose 2 (R) |R| (J + 1) 2 I (A) as a proposal prior for R, and I denote i = 1 R (z i) in Eq. (9) as f (z |R) . Note that the parameters and are related to R only through z. Combining f (z |R) with the proposal prior 2 (R) gives

j yIj ) , and j = ( 1j , exp (Xi j ) , y j = ( y1j , I ). A j j ˆ multivariate t-distribution fT ( | , V ˆ j, 2 ) is used as a proposal distribution where the degrees of freedom 2 is a tuning parameter, the j location parameter ˆ and the scale parameter V ˆ j are obtained by using i

2

ˆ j)

i=1

( i | yi , , , R)

j

f(

cRG (F1 ( i1),

j µ j], i

b0j ) B0 (

ˆ j) (

j

I

exp{ µi j + yi j log(µi j )}

× exp[ 0.5{(

, R) ((

j

i=1

I

)

(

, R) ((

I

exp[ µ ij + yij log(µ ij)

J j j =1 i

j

j|

(R| , )

is the proposal density ratio of i and i*. Sampling = ( 1 , 2 , .., J ) . j j j = , Kj ) , j = 1, ,J is a (K + 1)-dimensional ( 0, 1, column vector. We use the same strategy as that in Chib and Winkelmann (2001) to sample j from the full conditional distribution

( j|y j ,

j|

(

Sampling R . The full conditional posterior density function of R is given as

1F ( j ) , where = ( 1 , 2, …, J) , = ( 1 , , J ) , z i = (z i1, ,z iJ ) , z iJ = j i j j and µi = i exp (Xi j ) . Obviously, the conditional posterior distribution of i is not given in a convenient form for sample generation. We use a random walk Metropolis Hastings (RW-MH) algorithm to generate MCMC samples of i . A proposal i* is drawn from LN ( i , 12 IJ ), Jvariate log-normal distribution with parameters i and 12 IJ where i is the current value of i , IJ is the J-dimensional identify matrix, and 1 is a tuning parameter. The proposal i* is accepted with probability

= min { *, 1} ,

J j=1 J j=1

*=

j

(z, R)

from

exp

(I + J + 1) log |R| 2

1 2

I

z i R 1z i I (A) . i=1

Now, using the one-to-one mapping T:(z, R) 4

(z*, ) ,

(10)

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

(z*, )

exp exp

(I + J + 1) log |R| + (I + J + 1)log |D 1| 2 (I + J + 1) log | | 2

1 tr (z* z* 2

1)

1 2

Table 2 Descriptive Statistics for California Intersections.

I

zi*

1z* i

i=1

.

Hence, the conditional density of given z* is IW (S, I ) , the Inverse I Wishart distribution with scale matrix S = i = 1 zi*zi* and the degrees of freedom I. Step 2: Draw a proposal * IW (S , I ) and let R* = D* 1 *D* 1 where * , , JJ* ) , jj* ( j = 1, .., J ) is the jth diagonal element of D* = diag ( 11 * . The proposal correlation matrix R* is accepted with probability = min { *, 1} , where

*=

(R*| , ) × (R| , )

(z , R) J+1 = exp ( log |R*| (z, R*) 2

Levels

Frequency

Lighting)

0 (No) 1 (Yes) 0 (No) 1 (Yes) 0 (No) 1 (Yes) 0 (No) 1 (Yes) 2 4 0 (No) 1 (Yes) 0 (No) 1 (Yes) Min 7.7956 2.3026

290 159 273 176 389 60 386 63 71 378 387 62 288 161 Max 11.2683 10.0481

Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Terrain

log |R|) .

Rolling Terrain Log (Major AADT) Log (Minor AADT)

3. Application to California intersection crash data

Sum

Mean

Variance

Minimum

Maximum

Sev1 Sev2 Sev3 Sev4 Sev5

77 202 738 865 2857

0.17 0.45 1.64 1.92 6.33

0.27 0.92 6.33 12.65 98.99

0 0 0 0 0

5 6 20 28 88

Mean 9.4195 4.9193

64.6 % 35.4 % 60.8 % 39.2 % 86.6 % 13.4 % 86.0 % 14.0 % 15.8 % 84.2 % 86.2 % 13.8 % 64.1 % 35.9 % SD 0.7514 1.5148

painted left-turn bay), which is consistent with the observations made by Park and Lord (2007). Recall that MVPLN models can also be viewed as a special case of copula-based random effects models assuming the Gaussian copula with a general correlation matrix and lognormal marginal distributions for random effects. This suggests that some variables may have a different effect given the severity outcome of the crash. Note that some coefficients appear to be counterintuitive, which was also observed in Park and Lord (2007). For example, the presence of lighting is associated with more crashes for Sev4 and Sev5 (under both multivariate and univariate models), which might have been caused by some confounding factors such as the speed limit or other unmeasured intersection characteristic variables. While we recognize the limitation of this database, we would like to emphasize that the purpose of this paper is to present a general method that can encompass previously proposed multivariate approaches for accounting for correlations in the multivariate crash counts as well as independent multivariate approaches (i.e., univariate approaches) and illustrate the method on the real crash counts of different severity levels. Table 4 contains the posterior means of the correlation matrix of the random effects, which illustrates the correlation structure estimated by our multivariate model. Although correlations are positive in this case (which was resulted by the nature of the data not by the restrictions imposed by the model), the correlation structure is not homogeneous. This heterogeneous correlation structure would not have been modeled by Archimedean copulas. It can be seen that crashes of Severity 4 and Severity 5 are even more closely related than crashes of other severity levels. As discussed in Chib and Winkelmann (2001), the model can also be used to obtain predictive distributions and the proper modeling of the correlation structure is expected to be more important in predicting joint probabilities. Following the procedure given in Chib and Winkelmann (2001), we performed in-sample predictions of joint probabilities of several events to compare the performance of a multivariate model and a univariate model. Table 5 presents the empirical frequency distributions of the data and probabilities predicted by the joint model (from a multivariate approach accommodating a general correlation structure) and the independent model (from a univariate approach). It can be seen that overall the joint model outperforms the independence model in predicting probabilities of the events in Table 5. Specifically, for the event of no crashes of any severity level over 10 years occurring at an intersection (the first row) and the event of at least one crashes of each severity level over 10 years occurring at an intersection (the last row), the predictions of the joint model significantly superior to the predictions under independence.

Table 1 Summary Statistics of Multivariate Crash Counts for California Intersection Data. Dependent Variables

% of Total

Note: ML Lanes denotes Number of Main Lanes.

The proposed Bayesian copula modeling approach was applied to the crash count data of five different severity levels (Sev1: fatal (K), Sev2: incapacitating-injury (A), Sev3: non-incapacitating injury (B), Sev4: minor injury (C), Sev5: property damage only (O)) collected from 451 three-legged unsignalized intersections in California. These data were previously analyzed by a multivariate Poisson-Lognormal (MVPLN) modeling approach in Park and Lord (2007). There were 77 fatal injuries, 202 accidents of severity 2, 738 accidents of severity 3, 865 accidents of severity 4, and 2,857 PDO accidents at those 451 intersections for 10 years. Table 1 contains summary statistics of multivariate crash frequency by severity. In the table, the unit of crash frequency is the number of crashes per intersection for 10 years. Table 2 presents the descriptive statistics for the roadway characteristic variables used as predictor variables in the regression analysis. The major and minor roads of the intersection are defined as a function of the entering traffic flow. The legs with the highest entering flows are defined as major AADT. We fitted the proposed Bayesian multivariate Poisson-Gamma mixture models of Section 2, assuming the Gaussian copula with a general correlation matrix. Table 3 gives the estimates (posterior means and standard deviations) of the regression coefficients . For comparison, we also present the estimates obtained from applying Bayesian independent multivariate Poisson-Gamma mixture models with the Gaussian copula having a correlation matrix equal to the identity matrix (which are equivalent to univariate Bayesian Poisson-Gamma mixture models) in the table. In general, the joint multivariate PoissonGamma mixture model and the independent multivariate PoissonGamma mixture model (i.e., univariate Poisson-Gamma mixture models) give consistent results in terms significance of model coefficients and effect sizes. A number of goodness of fit measures (MSE, MAE, and MPB) for the joint and independent models are also given in the table, which indicate that the joint model performs slightly better than the independent model (in terms of RMSE and MAE). The mean prediction bias (MPB) is close to 0 for both models. It can be seen from Table 3 that the coefficients sometimes change from positive to negative values and vice versa for different crash severity levels (e.g.,

(Y1) (Y2) (Y3) (Y4) (Y5)

Variables

5

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

Table 3 Estimates of Regression Coefficients Obtained by Applying Multivariate Poisson-Gamma Mixture Regression Models with Gaussian Copulas.

Severity Sev1

Goodness-of-fit measure Sev2

Goodness-of-fit measure Sev3

Goodness-of-fit measure Sev4

Goodness-of-fit measure Sev5

Goodness-of-fit measure

Model Copula

Joint Multivariate Poisson-Gamma mixture model GCG

Independent Multivariate Poisson-Gamma mixture model GCI

Variable Constant Lighting Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Rolling Log(Major AADT) Log(Minor AADT) RMSE MAE MPB Constant Lighting Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Rolling Log(Major AADT) Log(Minor AADT) RMSE MAE MPB Constant Lighting Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Rolling Log(Major AADT) Log(Minor AADT) RMSE MAE MPB Constant Lighting Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Rolling Log(Major AADT) Log(Minor AADT) RMSE MAE MPB Constant Lighting Painted Left Turn Curb Med Left Turn Right Turn Channel ML Lanes Mountain Rolling Log(Major AADT) Log(Minor AADT) RMSE MAE MPB

−13.5054 (1.5107) −0.5202 (0.3370) 0.4819 (0.2882) 0.4864 (0.3574) 0.2636 (0.3296) 0.3390 (0.2888) −0.0916 (0.3859) −0.4647 (0.2828) 0.9471 (0.1629) 0.2025 (0.0891) 0.3959 0.2052 −0.0002 −12.3576 (1.1543) 0.3055 (0.2080) 0.5277 (0.1925) 0.1618 (0.2764) 0.2153 (0.2355) 0.1813 (0.1684) 0.4281 (0.2714) 0.4402 (0.1900) 0.9294 (0.1230) 0.2051 (0.0606) 0.6415 0.3891 0.0005 −9.3322 (0.7629) 0.2599 (0.1307) 0.0846 (0.1266) 0.0460 (0.1792) 0.0998 (0.1597) 0.0562 (0.0915) 0.5138 (0.1704) 0.0534 (0.1230) 0.8702 (0.0804) 0.1715 (0.0400) 1.2369 0.8319 0.0004 −11.1559 (0.7786) 0.5932 (0.1353) −0.0103 (0.1345) −0.1817 (0.1912) 0.2317 (0.1660) 0.0228 (0.0941) 0.4579 (0.1755) 0.0380 (0.1305) 1.0498 (0.0851) 0.2149 (0.0412) 1.3461 0.8488 −0.0006 −9.2792 (0.5491) 0.4643 (0.0992) −0.2315 (0.1013) −0.2003 (0.1407) 0.0693 (0.1269) 0.1285 (0.0649) 0.6068 (0.1354) 0.1079 (0.0956) 0.9437 (0.0603) 0.2356 (0.0318) 2.5147 1.6613 −0.0025

−13.9783 (1.5189) −0.6223 (0.3344) 0.4707 (0.3014) 0.5463 (0.3595) 0.2543 (0.3282) 0.2644 (0.2823) −0.1831 (0.3924) −0.4709 (0.2856) 1.0208 (0.1624) 0.2248 (0.0927) 0.4188 0.2150 −0.0004 −13.0553 (1.2041) 0.2470 (0.2084) 0.5430 (0.2012) 0.1839 (0.2835) 0.2371 (0.2318) 0.1568 (0.1669) 0.3802 (0.2721) 0.4514 (0.1874) 1.0161 (0.1256) 0.2026 (0.0618) 0.6744 0.4049 −0.0001 −10.1926 (0.8149) 0.1985 (0.1344) 0.1215 (0.1253) 0.0826 (0.1794) 0.0441 (0.1672) 0.0531 (0.0907) 0.5682 (0.1710) 0.0879 (0.1264) 0.9639 (0.0858) 0.1676 (0.0398) 1.2892 0.8676 −0.0004 −11.2137 (0.8189) 0.5390 (0.1409) 0.0291 (0.1370) −0.1185 (0.2004) 0.3253 (0.1756) 0.0050 (0.0978) 0.4583 (0.1916) 0.0470 (0.1354) 1.0715 (0.0888) 0.1998 (0.0428) 1.4138 0.8967 −0.0006 −9.6529 (0.5973) 0.4996 (0.1068) −0.2276 (0.1061) −0.1919 (0.1450) 0.0982 (0.1325) 0.1417 (0.0699) 0.5883 (0.1345) 0.0618 (0.1003) 0.9833 (0.0656) 0.2282 (0.0325) 2.5484 1.6858 0.0017

Notes: 1. GCG: Gaussian Copula with a general correlation matrix, GCI: Gaussian Copula with a correlation matrix equal to the identity matrix; 2. Both of multivariate models and univariate models were implemented by MCMC; 3. Numbers in parentheses represent posterior standard deviations; 4. Statistically significant effects at the 95 % level (i.e., those for which the corresponding 95 % posterior intervals do not contain 0) are shown in bold; 5. RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MPB (Mean Prediction Bias) are defined as RMSE =

ˆ

1 N

N i=1

(yi

ˆy ) i

2

, MAE =

1 N

N is the number of observations, yi is the observed crash count, and yi is the predicted crash count..

6

N i=1

|yi

ˆy |, and MPB = i

1 N

N i=1

(yi

ˆy ), respectively, where i

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al.

levels through joint modeling. The real advantage of the joint model over the independence model (univariate models) was also demonstrated in the prediction of the joint probabilities of crash events at an intersection over the sampling period (e.g., the joint probability that an intersection experiences no crashes over the 10-year period or the joint probability that an intersection experiences at least one crash of each of five severity levels over the 10-year period, etc.). The empirical implication of the proposed modeling exercise is that unequal correlations among multivariate crash counts of different severity levels were able to be estimated (as the correlation structure of the random effects is taken to be fully general), which would not have been possible by the use of Archimedean copulas which can model only equal correlations. Another implication is that the prediction of joint probabilities of crash events is deemed simpler and more accurate in the context of our approach compared to other approaches utilizing Archimedean copulas because, conditional on the latent effects, the multivariate counts (of different severity levels) have independent Poisson distributions in our model and the joint probabilities can easily be obtained through multiplication based on the output from the MCMC algorithm. Also, as mentioned earlier, the proper modeling of the general correlation structure (not artificially restricted to homogeneous correlations) is important for more accurate predictions of joint probabilities. We are currently working on generalization of a copula-based multivariate count regression model to before-after safety evaluation as well as extension to implementing a t-copula to account for tail dependence in multivariate crash counts. Future efforts will also include employing different marginal distributions for crashes of different severities or types. For example, to cope with the high frequency of zeros for fatal crashes, one may use a zero-inflated Poisson distribution for Severity 1 while using negative binomial distributions for other severity levels.

Table 4 Estimated Correlation Matrix of the Random Effects.

Sev1 Sev2 Sev3 Sev4 Sev5

Sev1

Sev2

Sev3

Sev4

Sev5

1.0000 0.5606 0.6122 0.6175 0.6244

1.0000 0.6972 0.6618 0.6639

1.0000 0.8092 0.7805

1.0000 0.8744

1.0000

Table 5 In-Sample Predictions of Joint Probabilities. Event

Actual (Empirical)

Joint (Multivariate)

Independent (Univariate)

(0, 0, 0, 0, 0) (1+, 0, 0, 0, 0) (0, 1+, 0, 0, 0) (0, 0, 1+, 0, 0) (0, 0, 0, 1+, 0) (0, 0, 0, 0, 1+) (1+,1+,1+,1+,1+)

0.0998 0.0044 0.0067 0.0244 0.0133 0.1441 0.0554

0.1007 0.0027 0.0061 0.0269 0.0180 0.1460 0.0500

0.0623 0.0031 0.0070 0.0321 0.0267 0.1335 0.0367

Notes: 1. The values in the Event column represent the values for (Sev1, Sev2, Sev3, Sev4, Sev5); 2. ‘0’ represent ‘no crash’ and ‘1+’ represents ‘at least one crash’.

4. Summary and discussion In this paper, we have presented a Bayesian copula-based multivariate approach for jointly modeling multivariate crash count data. We used a multivariate count regression model with correlated random effects modeled by a Gaussian copula along with gamma marginals to account for overdispersion. Unlike the previously proposed copulabased models that used Archimedean copulas under some restrictive assumptions on the correlation structure (such as positive and homogeneous correlations), our approach which models multivariate random effects using Gaussian copulas allows for a general correlation structure among multivariate crash counts. Our modeling approach coupled with Bayesian estimation has a great flexibility in the choice of marginal distributions. Although in this paper we used a gamma distribution as the marginal distribution of random effects (which subsequently leads to a negative binomial marginal distribution for crash counts) as a way of accommodating overdispersion, any marginal distribution can be chosen in principle. The proposed approach can also encompass previously suggested popular multivariate count regression models including multivariate PoissonGamma mixture models and multivariate Poisson-Lognormal regression models as well as independent multivariate Poisson-Gamma mixture models. We implemented parameter estimation by MCMC within a Bayesian framework. The most challenging part of MCMC implementation is simulating a correlation matrix of a Gaussian. We dealt with this problem by employing an efficient two-stage parameter expanded re-parameterization and Metropolis-Hastings (PX-RPMH) algorithm of Liu and Daniels (2006) for sampling a correlation matrix from its conditional posterior distributions. Our multivariate model assuming the Gaussian copula with a general correlation matrix along with gamma distributions for the marginal distributions of random effects was applied to the multivariate crash counts from 451 intersections in California and was compared with univariate Poisson-Gamma mixture models (i.e.,the independent multivariate Poisson-Gamma mixture model assuming the Gaussian copula with a correlation matrix equal to an identity matrix). It was observed that an advantage of a multivariate model in parameter estimation manifests especially for fatal crashes (Severity 1), which seems to be a consequence of borrowing the strength from crashes of other severity

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgments Man-Suk Oh’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korean Government (NRF-2016R1A2B4008914). Jae Youn Ahn’s research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (NRF-2017R1D1A1B03032318). Rosy Oh’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant No. 2019R1A6A1A11051177). The authors gratefully acknowledge many helpful comments from the reviewers. References Barnard, J., McCulloch, R., Meng, X., 2000. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Stat. Sin. 10 (October (4)), 1281–1311 (31 pages). Chib, S., Winkelmann, R., 2001. Markov chain Monte Carlo analysis of correlated count data. J. Bus. Econ. Stat. 19 (2001), 428–435. Council, F., Persaud, B., Eccles, K., Lyon, C., Griffith, M., 2005. Safety Evaluation of Redlight Cameras. Report No. FHWA HRT-05-049. Federal Highway Administration, Washington, DC. Demarta, S., McNeil, A.J., 2005. The t copula and related copulas. Int. Stat. Rev. 73, 111–129. Eluru, N., Paleti, R., Pendyala, R., Bhat, C.R., 2010. Modeling injury severity of multiple occupants of vehicles: copula-based multivariate approach. Transp. Res. Rec. 2165, 1–11. Genest, C., Neslehova, J., 2007. A primer on copulas for count data. Astin Bull. 37, 475–515.

7

Accident Analysis and Prevention xxx (xxxx) xxxx

E.S. Park, et al. Guo, C., Khan, F., Imtiaz, S., 2019. Copula-based Bayesian network model for process system risk assessment. Process. Saf. Environ. Prot. 123, 317–326. Joe, H., 1997. Multivariate Models and Dependence Concepts. Chapman & Hall, London. Joe, H., 2014. Dependence Modeling with Copulas. CRC Press. Loaiza-Maya, R., Smith, M.S., 2019. Time series copulas for heteroskedastic data. J. Appl. Econ. 33, 332–354. Liu, X., Daniels, M.J., 2006. A new algorithm for simulating a correlation matrix based on parameter expansion and reparameterization. J. Comput. Graph. Stat. 15 (4), 897–914. Ma, J., Kockelman, K.M., 2006. Bayesian multivariate Poisson regression for models of injury count, by severity. Transportation Research Record: Journal of the Transportation Research Board, No. 1950. Transportation Research Board of the National Academies, Washington, D.C, pp. 24–34. Mannering, F.L., Bhat, C.R., 2014. Analytic methods in accident research: methodological frontier and future directions. Anal. Methods Accid. Res. 1, 1–22. Min, A., Czado, C., 2011. Bayesian model selection for D‐vine pair‐copula constructions. Can. J. Stat. 39, 239–258. Mohammadi, A., Abegaz, F., 2019. Bayesian modelling of Dupuytren disease by using Gaussian copula graphical models. J. R. Stat. Soc.-Ser. C 66, 629–645. Nashad, T., Yasmin, S., Eluru, N., Lee, J., A. Abdel-Aty, M., 2016. Joint modeling of pedestrian and bicycle crashes. Transportation Research Record: Journal of the Transportation Research Board, No. 2601. Transportation Research Board, Washington, D.C, pp. 119–127. Nelson, R., 2006. An Introduction to Copulas, 2nd ed. Springer, Berlin. Park, E.S., Lord, D., 2007. Multivariate Poisson-Lognormal models for jointly modeling crash frequency by severity. Transp. Res. Rec. 2019, 1–6. Park, E.S., Park, J., Lomax, T.J., 2010. A fully bayesian multivariate approach to beforeafter safety evaluation. Accid. Anal. Prev. 42, 1118–1127.

Pitt, M., Chan, D., Kohn, R., 2006. Efficient Bayesian inference for Gaussian copula regression models. Biometrika 93, 537–554. Sklar, M., 1959. Fonctions de répartition à n dimensions et leurs marges 8. Institute of Statistics of the University of Paris, pp. 229–231. Smith, M.S., Gan, Q., Kohn, R., 2012. Modeling dependence using skew t copulas: Bayesian inference and applications. J. Appl. Econom. 27, 500–522. Smith, M.S., Khaled, M.A., 2012. Estimation of copula models with discrete margins via Bayesian data augmentation. J. Am. Stat. Assoc. 107, 290–303. Smith, M.S., 2013. Bayesian approaches to copula modelling. In: Damien, P., Dellaportas, P., Polson, N.G., Stephens, D.A. (Eds.), Bayesian Theory and Applications. Oxford University Press, pp. 336–360. Srinivasan, R., Baek, J., Council, F., 2010. Safety evaluation of transverse rumble strips on approaches to stop-controlled intersections in rural areas. J. Transp. Saf. Secur. 2 (3), 261–278. Wali, B., Khattak, A.J., Xu, J., 2018. Contributory fault and level of personal injury to drivers involved in head-on collisions: application of copula-based bivariate ordinal models. Accid. Anal. Prev. 110, 101–114. Winkelmann, R., 2008. Econometric Analysis of Count Data, 5th ed. Springer, Berlin Heidelberg. Yang, L., Frees, E.W., Zhang, Z., 2019. Nonparametric estimation of copula regression models with discrete outcomes. J. Am. Stat. Assoc. https://doi.org/10.1080/ 01621459.2018.1546586. Yasmin, S., Eluru, N., Pinjari, A.R., Tay, R., 2014. Examining driver injury severity in two vehicle crashes – a copula based approach. Accid. Anal. Prev. 66, 120–135. Yasmin, S., Eluru, N., 2018. A joint econometric framework for modeling crash counts by severity. Transp. A Transp. Sci. 14 (3), 230–255. Yasmin, S., Momtaz, S.U., Nashad, T., Eluru, N., 2018. Multivariate copula-based macrolevel crash count model. Transp. Res. Rec. 2672 (30), 64–75.

8