Approximate uniform shrinkage prior for a multivariate generalized linear mixed model


Journal of Multivariate Analysis 145 (2016) 148–161


Hsiang-Chun Chen a,∗, Thomas E. Wehrly b

a Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA
b Department of Statistics, Texas A&M University, College Station, TX, USA

Article info

Article history: Received 5 August 2014; Available online 29 December 2015.

Keywords: Bayesian inference; Multivariate generalized linear model; Noninformative prior distribution; Uniform shrinkage prior

Abstract

Multivariate generalized linear mixed models (MGLMMs) are used for jointly modeling clustered mixed outcomes obtained when two or more responses are repeatedly measured on each individual in scientific studies. Bayesian methods are widely used for analyzing MGLMMs, and the need for noninformative priors arises when there is insufficient prior information on the model parameters. The main aim of the present study is to propose an approximate uniform shrinkage prior for the random-effect variance components in the Bayesian analysis of the MGLMM. This prior extends the approximate uniform shrinkage prior proposed by Natarajan and Kass (2000); it is easy to apply and is shown to possess several desirable properties. Its use is illustrated with both a simulation study and osteoarthritis data.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Clustered mixed outcomes arise in scientific studies, such as longitudinal trials, in which more than one response is repeatedly measured on each individual. Several methods have been proposed for modeling such outcomes. The multivariate generalized linear mixed model (MGLMM) is one of the most widely used models for accommodating these measurements when they are assumed to independently follow distributions in the exponential family, depending on fixed effects and subject-specific correlated random effects [3,10,9,15,1]. Approaches such as adaptive Gauss–Hermite quadrature, the Monte Carlo EM algorithm, the generalized estimating equations approach, and penalized quasi-likelihood (PQL) have been developed for maximum likelihood estimation in the MGLMM. Since generalized linear mixed models can be viewed as hierarchical models containing two stages, a Bayesian approach [23,22,7] has been widely used in estimating the joint posterior distributions of the fixed-effect parameters and the variance components of the random effects.

Several assumptions for the prior distribution on the fixed-effect parameters and the variance components of the random effects have been studied. The standard noninformative prior, or Jeffreys prior [21], is one of the most widely used assumptions in the Bayesian approach; its drawback is that it may lead to an improper joint posterior distribution for the fixed effects and the variance components of the random effects [12,17]. For the univariate generalized linear mixed model, Natarajan and Kass [16] introduced the approximate uniform shrinkage prior as an alternative prior for the variance components of the random effects. The main idea of the approximate uniform shrinkage prior is that the weight of the prior mean in the approximate shrinkage estimate is assumed to be componentwise uniformly distributed. Using the transformation theorem, we can then find the distribution of the variance structure



Corresponding author. E-mail address: [email protected] (H.-C. Chen).

http://dx.doi.org/10.1016/j.jmva.2015.12.004 0047-259X/© 2015 Elsevier Inc. All rights reserved.


of the random effects. This prior has several desirable properties. Kass and Natarajan [13] also proposed a default conjugate prior for the variance components: an inverse Wishart prior with one degree of freedom, whose scale matrix is an estimate of the variance components obtained from the approximate shrinkage estimate and a GLM fit.

When clustered mixed outcomes are considered, the MGLMM is applied and the random effects are assumed to follow a multivariate normal distribution; Bayesian analysis can again be applied [5,6,18]. The inverse Wishart distribution is one of the most widely used priors for the covariance matrix of the random effects, since it is the conjugate prior for the covariance of the multivariate normal distribution. However, the estimation is very sensitive to the choice of the scale matrix in the prior distribution. Thus, the need for noninformative priors arises when the prior information on the model parameters is insufficient.

The rest of this study is organized as follows. Section 2 briefly introduces Bayesian methods for the univariate and multivariate generalized linear mixed models. In Section 3, the approximate uniform shrinkage prior for the multivariate generalized linear mixed model is derived, and model specification examples are provided. Section 4 presents properties of the approximate uniform shrinkage prior distribution and its corresponding posterior distributions. Section 5 explains how the posterior simulation is performed. Section 6 evaluates the performance of the approximate uniform shrinkage prior in a simulation study. Section 7 provides an example illustrating the application of the approximate uniform shrinkage prior. Lastly, Section 8 concludes with implications for future research.

2. Review of Bayesian methods for GLMM and MGLMM

2.1. Generalized linear mixed model

Consider data composed of N subjects, with Ti repeated measurements on the ith subject.
Let y_{it} be the tth univariate measurement on the ith subject. Conditional on a subject-specific random effect b_i, the measurements {y_{i1}, . . . , y_{iT_i}} are assumed to independently follow a distribution with density in the exponential family,

f(y_{it} | b_i) ∝ \exp\left\{ \frac{y_{it} θ_{it} − a(θ_{it})}{φ} \right\},

where the dispersion parameter φ is assumed known and θ_{it} is the canonical parameter. Assume the conditional mean is related to the linear predictor by the link function

g(µ_{it}) = x_{it}^T β + z_{it}^T b_i = β_1 x_{1,it} + · · · + β_p x_{p,it} + b_{i1} z_{1,it} + · · · + b_{iq} z_{q,it},

where g(·) is a monotonic differentiable link function, x_{it} = (x_{1,it}, . . . , x_{p,it})^T is a vector of covariates, β = (β_1, . . . , β_p)^T is a vector of fixed-effect parameters, and z_{it} = (z_{1,it}, . . . , z_{q,it})^T is a vector of covariates corresponding to the random effect b_i = (b_{i1}, . . . , b_{iq})^T. The random effect b_i is shared by the repeated measurements within the same subject and is assumed to follow a multivariate normal distribution with mean 0 and covariance matrix D. This model is known as the GLMM [23,2]. The maximum likelihood estimates for β and D cannot be evaluated in closed form; because of the complexity of the GLMM likelihood, several numerical methods, such as Gauss–Hermite quadrature and the Bayesian approach, have been proposed for analyzing data with a GLMM.

The Bayesian approach is very popular in the analysis of the GLMM. A GLMM can be thought of as a two-stage hierarchical model: at the first stage, the measurements conditional on the subject-specific random effects are assumed to follow a particular distribution from the exponential family, whereas at the second stage the random effects are assumed to follow a multivariate normal distribution. This requires the specification of prior distributions for the fixed-effect parameters β and the random-effect variance components D; we assume that β and D are a priori independent throughout this study. When there is no subjective prior information about the fixed-effect coefficients, the most widely used noninformative prior for β is the improper uniform distribution, which we use throughout this study.
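As a concrete illustration of the model just defined, the following sketch simulates data from a Poisson GLMM with a random intercept under the canonical (log) link. All dimensions and parameter values are hypothetical choices for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters (not from the paper).
N, T = 50, 4                           # N subjects, T repeated measurements each
beta = np.array([0.5, -0.3])           # fixed effects: intercept and slope
sigma_b = 0.8                          # random-intercept standard deviation

x = rng.normal(size=(N, T))            # covariate x_it
b = rng.normal(scale=sigma_b, size=N)  # subject-specific random intercepts b_i

# Canonical (log) link: g(mu_it) = beta_0 + beta_1 * x_it + b_i
mu = np.exp(beta[0] + beta[1] * x + b[:, None])
y = rng.poisson(mu)                    # y_it | b_i ~ Poisson(mu_it)

print(y.shape)
```

Within each subject the T measurements share the same draw of b_i, which induces the within-cluster correlation the GLMM is designed to capture.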
Various noninformative prior distributions for the variance components of the random effects, D, have been suggested in the literature, including the Jeffreys prior, a proper conjugate prior, and the approximate uniform shrinkage prior.

The standard noninformative prior, or Jeffreys prior, π(D) ∝ |D|^{-(q+1)/2} [21,23], is one of the most widely used prior assumptions in the Bayesian approach. It is obtained by applying Jeffreys' rule to the second-stage random-effect distribution. The posterior distribution of D corresponding to the Jeffreys prior is an inverse Wishart distribution IW(N, S) with scale matrix S = \sum_{i=1}^{N} b_i b_i^T and N degrees of freedom. The advantage of the Jeffreys prior is that the posterior distribution has a known form and is easy to implement; the disadvantage is that it may lead to an improper joint posterior distribution for β and D [12,17].

Another popular choice is a proper conjugate prior. The inverse Wishart distribution IW(λ, Ψ), with scale matrix Ψ and λ degrees of freedom, is a conjugate prior for D. Since the univariate specialization of the inverse Wishart distribution is the inverse gamma distribution, the prior reduces to an inverse gamma distribution when the dimension of D is one. The most popular choice is to set λ = q and Ψ = qD_0, where D_0 is a prior guess of D [19]. This conjugate prior is computationally easy to implement, but the estimation results can be very sensitive to the choice of D_0 [16].

Natarajan and Kass [16] introduced the approximate uniform shrinkage prior as an alternative prior for D. It is a generalization of the uniform shrinkage prior proposed by Strawderman [20]. The main idea in the approximate uniform


shrinkage prior is to place a componentwise uniform distribution on the weight given to the prior mean in the approximate shrinkage estimate of the random effects. Specifically, the approximate shrinkage estimate \hat{b}_i has the form

\hat{b}_i = D Z_i^T (W_i^{-1} + Z_i D Z_i^T)^{-1} (y_i^* - X_i \hat{β}) = S_i \, 0_q + (I_q - S_i) D Z_i^T W_i (y_i^* - X_i \hat{β}),

where y_i^* is a working dependent variable and W_i is the GLM weight matrix obtained by replacing the random effects with 0. The weight S_i = (D^{-1} + Z_i^T W_i Z_i)^{-1} Z_i^T W_i Z_i is a function of D and varies with i. Replacing the individual weights with their average across clusters, they define the overall weight matrix as

S = \left( D^{-1} + \frac{1}{N} \sum_{i=1}^{N} Z_i^T W_i Z_i \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W_i Z_i \right).

Assuming that S is componentwise uniformly distributed and using the transformation theorem, we can find the distribution of D,

π_{us}(D) ∝ \left| I_q + \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W_i Z_i \right) D \right|^{-(q+1)},

which is the approximate uniform shrinkage prior. The advantages of the approximate uniform shrinkage prior are that it is proper, the corresponding posterior distribution is proper in some situations, and the calculation is quite simple.

2.2. Multivariate generalized linear mixed model

The MGLMM can accommodate clustered data when measurements are repeatedly obtained by two or more methods on each subject. Consider data comprising N subjects and T_i repeated measurements within the ith subject, measured by L methods. Measurements for different subjects are assumed to be independent, and the numbers of replications may differ from subject to subject. Let the measurement for the ith subject be Y_i = (Y_{i1}^T, . . . , Y_{iL}^T)^T, where Y_{ij}^T = (Y_{ij1}, . . . , Y_{ijT_i}) are the repeated measurements from the jth method, j = 1, . . . , L. Given the random effects b_i, the measurements {Y_{ij1}, . . . , Y_{ijT_i}} are assumed conditionally independent for each method j and subject i, and Y_{ijt} has a density f_j(·) in the exponential family. Let µ_{ij} = (µ_{ij1}, . . . , µ_{ijT_i})^T be the conditional mean of (Y_{ij1}, . . . , Y_{ijT_i})^T given the random effects b_i. The covariates are X = (X_1, . . . , X_N)^T, where X_i = (x_{1,i11}, . . . , x_{p_1,i1T_i}, . . . , x_{1,iL1}, . . . , x_{p_L,iLT_i})^T are the covariates for the ith subject. Assume the model has correlated random effects which follow a multivariate normal distribution. The multivariate generalized linear mixed model is defined as follows: for j = 1, . . . , L,

Y_{ijt} | b_{ij} is from a particular distribution F_j in the exponential family with mean µ_{ijt} and density ∝ \exp\left\{ \frac{y_{ijt} θ_{ijt} − a_j(θ_{ijt})}{φ_j} \right\},
g_j(µ_{ijt}) = x_{ijt}^T β_j + z_{ijt}^T b_{ij},

and, conditional on D,

b_i^T = (b_{i1}^T, . . . , b_{iL}^T) ∼ i.i.d. multivariate normal(0, D),

where φ_j is the dispersion parameter; θ_{ijt} is the canonical parameter; g_j(·) is the link function; x_{ijt} = (x_{1,ijt}, . . . , x_{p_j,ijt})^T is a vector of covariates; β_j = (β_{j1}, . . . , β_{jp_j})^T is a vector of fixed-effect parameters; z_{ijt} = (z_{1,ijt}, . . . , z_{q,ijt})^T is a vector of covariates corresponding to the random effects b_{ij} = (b_{1,ij}, . . . , b_{q,ij})^T; and D = [σ_{ij}]_{i=1,...,L; j=1,...,L} is the covariance matrix.

Similar to the GLMM, the maximum likelihood estimates of the MGLMM cannot be obtained in closed form. Various approximate methods have been developed for model fitting of the MGLMM, such as the Bayesian approach, penalized quasi-likelihood, and Monte Carlo EM algorithms [9]. In the Bayesian approach, prior distributions for the fixed-effect parameters β and the random-effect variance components D are required; β and D are assumed to be a priori independent. One of the most widely used prior distributions for β is the uniform. Several choices of prior distribution can be used for D; one of the most widely used is the inverse Wishart distribution [6]. Assume the prior distribution of D is inverse


Wishart IW(m, Ψ); then the posterior distribution of D is inverse Wishart IW(m + N, Ψ + S), where S = \sum_{i=1}^{N} b_i b_i^T. Thus it is a conjugate prior for the covariance matrix of a multivariate normal distribution. However, the estimation is very sensitive to the choice of the scale matrix Ψ in the inverse Wishart prior. Therefore, the need for an objective prior arises in the multivariate case. In this study, we modify and extend the approximate uniform shrinkage prior proposed by Natarajan and Kass for the MGLMM.
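To make the bivariate special case of this model concrete, the sketch below simulates clustered mixed outcomes from a joint Poisson–gamma model with correlated random intercepts, the same structure used later in the simulation study. All numerical settings (N, T, β, D, ν) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical settings for a bivariate Poisson-gamma MGLMM (not from the paper).
N, T = 30, 5
nu = 2.0                                   # gamma shape parameter
D = np.array([[0.5, 0.2],
              [0.2, 0.4]])                 # random-effect covariance matrix
beta1 = np.array([0.3, 0.5])               # Poisson submodel fixed effects
beta2 = np.array([-0.2, 0.4])              # gamma submodel fixed effects

b = rng.multivariate_normal(np.zeros(2), D, size=N)  # (b_i1, b_i2), correlated
x = rng.normal(size=(N, T))                          # shared covariate x_it

mu1 = np.exp(beta1[0] + beta1[1] * x + b[:, [0]])    # log link, count outcome
mu2 = np.exp(beta2[0] + beta2[1] * x + b[:, [1]])    # log link, continuous outcome

y1 = rng.poisson(mu1)                                # Y_i1t | b_i1 ~ Poisson(mu1)
y2 = rng.gamma(shape=nu, scale=mu2 / nu)             # mean mu2, variance mu2^2 / nu

print(y1.shape, y2.shape)
```

The off-diagonal entry of D links the two outcomes: a subject with an elevated count level also tends to have an elevated continuous response.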

3. Approximate uniform shrinkage prior for MGLMM

The approximate uniform shrinkage prior for the multivariate generalized linear mixed model is derived in this section.

3.1. Model

The model in Section 2.2 can be re-expressed as

y^* | b^* ∼ \prod_{i=1}^{N} \prod_{j=1}^{L} \prod_{t=1}^{T_i} \exp\left\{ \frac{y_{ijt} θ_{ijt} − a_j(θ_{ijt})}{φ_j} \right\},
g^*(µ^*) = X^* β^* + Z^* b^*,
b^* | D^* ∼ N(0, D^*).

Here the multivariate measurements are expressed as y^* = (y_{111}, . . . , y_{N1T_N}, . . . , y_{1L1}, . . . , y_{NLT_N})^T; the conditional mean can be expressed as µ^* = (µ_{111}, . . . , µ_{N1T_N}, . . . , µ_{1L1}, . . . , µ_{NLT_N})^T; the link function g^* can be expressed as g^*(t) = (g_1(t_{111}), . . . , g_1(t_{N1T_N}), . . . , g_L(t_{1L1}), . . . , g_L(t_{NLT_N}))^T; the fixed-effect parameter can be expressed as β^* = (β_{11}, . . . , β_{1p_1}, . . . , β_{Lp_L})^T; the covariate matrix can be rewritten as X^* = \bigoplus_{j=1}^{L} X_j = diag(X_1, . . . , X_L), where X_j is the matrix of covariates for the jth method and \bigoplus denotes the direct sum; the random-effects design matrix can be expressed as Z^* = \bigoplus_{j=1}^{L} \bigoplus_{i=1}^{N} Z_{ij}, where Z_{ij} is the design matrix; the random effects can be expressed as b^* = (b_{1,11}, . . . , b_{q,11}, . . . , b_{q,N1}, . . . , b_{1,1L}, . . . , b_{q,1L}, . . . , b_{q,NL})^T; and the covariance matrix of the random effects is D^* = [D^*_{ij}]_{i=1,...,L; j=1,...,L}, where D^*_{ij} = \bigoplus_{k=1}^{N} σ_{ij} = σ_{ij} I_N.

3.2. Motivation

We will later show that the weight matrix given to the prior mean of the random effects in the approximate shrinkage estimate is equal to

S = \left( D^{-1} + \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right),

where D is the covariance matrix of the random effects, W is the GLM weight matrix [14], and Z_i is the random-effects design matrix for subject i. The matrix S defined above is a function of D. To obtain the approximate uniform shrinkage prior, we assume that S is componentwise uniformly distributed; then, using the transformation theorem, we find that D has probability density function

π_D(D) ∝ \left| I_{Lq} + \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right) D \right|^{-2q-1}.

This is defined as the approximate uniform shrinkage prior for D. Since only a positive-semidefinite matrix can be a covariance matrix, we define the approximate uniform shrinkage prior distribution on matrices that are real-valued positive definite with probability one.

3.3. Derivation of the weight matrix

The weight matrix S is obtained by finding the approximate shrinkage estimate \hat{b}^* [2], which can be derived from the likelihood function for the parameters β and D:

L(β, D) ∝ |D|^{-N/2} \int \left[ \prod_{i=1}^{N} \prod_{t=1}^{T_i} \prod_{j=1}^{L} \exp\left\{ \frac{y_{ijt} θ_{ijt} − a_j(θ_{ijt})}{φ_j} \right\} \right] \exp\left\{ −\frac{1}{2} \sum_{i=1}^{N} b_i^T D^{-1} b_i \right\} db.


Since the quasi-likelihood method generates efficient estimators without making precise distributional assumptions, it is used here to estimate the parameters in the above model. The integrated quasi-likelihood function is

L(β, D) ∝ |D^*|^{-1/2} \int \exp\left\{ −\sum_{j=1}^{L} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \frac{d_j(y_{ijt}; µ_{ijt})}{2φ_j} − \frac{1}{2} b^{*T} D^{*-1} b^* \right\} db^*,

where d_j(y; µ) = −2 \int_y^{µ} \frac{y − u}{a_j''(u)} \, du is the quasi-deviance function [14]. Then the log quasi-likelihood function is

ql(β, D) ≈ −\frac{1}{2} \log |D^*| + \log \int e^{−κ(b^*)} \, db^*,

where κ(b^*) = \sum_{j=1}^{L} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \frac{d_j(y_{ijt}; µ_{ijt})}{2φ_j} + \frac{1}{2} b^{*T} D^{*-1} b^*. Laplace's method, based on a second-order Taylor series expansion, can be used to approximate the high-dimensional integral in the likelihood, which gives

ql(β, D) ≈ −\frac{1}{2} \log \left| I_{NLq} + Z^{*T} W^* Z^* D^* \right| − \sum_{j=1}^{L} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \frac{d_j(y_{ijt}; µ_{ijt})}{2φ_j} − \frac{1}{2} \tilde{b}^{*T} D^{*-1} \tilde{b}^*,

where \tilde{b}^* = \tilde{b}(α, θ) is chosen such that κ'(\tilde{b}^*) = 0 and

W^* = diag\left( \left[ φ_1 a_1''(µ_{111}) \{g_1'(µ_{111})\}^2 \right]^{-1}, \, . . . , \, \left[ φ_L a_L''(µ_{NLT_N}) \{g_L'(µ_{NLT_N})\}^2 \right]^{-1} \right)

is the block-diagonal GLM weight matrix. Laplace approximations perform well when the sample size is sufficiently large. Since ql(β, D) may not have a closed-form solution and cannot be used directly to estimate the variance–covariance structure, the penalized quasi-likelihood (PQL) [8,2] was developed. Assuming that the GLM iterative weights vary slowly as functions of the mean, the PQL is defined by adding the penalty −\frac{1}{2} b^{*T} D^{*-1} b^* to the quasi-likelihood:

PQL = −\sum_{j=1}^{L} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \frac{d_j(y_{ijt}; µ_{ijt})}{2φ_j} − \frac{1}{2} b^{*T} D^{*-1} b^*.

The maximum penalized quasi-likelihood equations are obtained by differentiating the PQL with respect to β^* and b^*. Using the Fisher scoring algorithm, these score equations are modified into an iterative weighted least squares problem [11,8]:

\begin{pmatrix} X^{*T} W^* X^* & X^{*T} W^* Z^* D^* \\ Z^{*T} W^* X^* & I + Z^{*T} W^* Z^* D^* \end{pmatrix} \begin{pmatrix} β^* \\ ν \end{pmatrix} = \begin{pmatrix} X^{*T} W^* Y^{0*} \\ Z^{*T} W^* Y^{0*} \end{pmatrix},

where b^* = D^* ν, Y^{0*} is the working vector, and W^* is the block-diagonal GLM weight matrix obtained by replacing b^* with 0. Solving this system gives \hat{b}^* = D^* Z^{*T} V^{*-1} (Y^{0*} − X^* \hat{β}^*) and \hat{β}^* = (X^{*T} V^{*-1} X^*)^{-1} X^{*T} V^{*-1} Y^{0*}, where V^* = W^{*-1} + Z^* D^* Z^{*T}. In this model the prior mean of b^* is a vector of zeros, 0, and a frequentist estimate of b^* is D^* Z^{*T} W^* (Y^{0*} − X^* \hat{β}^*). Thus the approximate shrinkage estimate of b^* can be expressed as a weighted average of its prior mean and the frequentist estimate:

\hat{b}^* = D^* Z^{*T} (W^{*-1} + Z^* D^* Z^{*T})^{-1} (Y^{0*} − X^* \hat{β}^*) = S^* · 0 + (I_{NLq} − S^*) · D^* Z^{*T} W^* (Y^{0*} − X^* \hat{β}^*),

where S^* = (D^{*-1} + Z^{*T} W^* Z^*)^{-1} Z^{*T} W^* Z^* is the weight given to the prior mean, and the kth row of S^* contains the weights on the prior mean of the kth element of b^*. S^* amounts to a shrinkage factor: the approximate shrinkage estimate shrinks the frequentist estimate toward the prior mean. Notice that the weight matrix S^* is an NLq × NLq block matrix comprising weights for every subject, so its dimension depends on the number of subjects, which may result in high-dimensional problems. Therefore, we define an overall weight matrix as follows:

S = \left( D^{-1} + \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right).
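A minimal numerical sketch of the two quantities just derived — the overall weight matrix S and the kernel of the approximate uniform shrinkage prior — is given below. The design matrices Z_i, the GLM weight matrix W, and the test matrix D are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: N subjects, Lq = 2 stacked random effects per subject,
# 6 observation rows per subject, and a fixed diagonal GLM weight matrix W.
N, Lq, rows = 20, 2, 6
Z = [rng.normal(size=(rows, Lq)) for _ in range(N)]
W = np.diag(rng.uniform(0.5, 2.0, size=rows))

A = sum(Zi.T @ W @ Zi for Zi in Z) / N        # (1/N) * sum_i Z_i^T W Z_i

def overall_weight(D):
    """Overall weight matrix S = (D^{-1} + A)^{-1} A."""
    return np.linalg.solve(np.linalg.inv(D) + A, A)

def shrinkage_prior_kernel(D, q=1):
    """Kernel of the approximate uniform shrinkage prior, |I + A D|^{-(2q+1)}."""
    return np.linalg.det(np.eye(len(D)) + A @ D) ** (-(2 * q + 1))

D = np.array([[0.5, 0.1],
              [0.1, 0.3]])
S = overall_weight(D)
print(S.shape, shrinkage_prior_kernel(D) > 0)
```

As D grows the weight S on the prior mean shrinks toward zero and the prior kernel decays, matching the behavior described in the derivation above.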

3.4. Generalization: default conjugate prior

Based on the overall weight matrix derived in Section 3.3, a default conjugate inverse Wishart prior can be defined by extending the default conjugate prior proposed by Kass and Natarajan [13]. The overall weight matrix S measures the relative impact of the prior mean 0 and the frequentist estimate D^* Z^{*T} W^* (Y^{0*} − X^* \hat{β}^*) on the approximate shrinkage estimate \hat{b}^*. The prior


mean and frequentist estimate will have equal weights on the approximate shrinkage estimate if

D = \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right)^{-1}.

Since the GLM weight matrix W depends on β, the GLM estimate \hat{β} of β can be substituted into the above guess to make it independent of β, giving the weight matrix \hat{W}. Thus, we can define a default conjugate prior for the MGLMM as

IW\left( 1, \; c \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T \hat{W} Z_i \right)^{-1} \right),

where the degrees of freedom equal 1, the scale matrix is the equal-weight guess for D given above, and the constant c is chosen to adjust the variability.

3.5. Illustrative examples: bivariate and trivariate cases

To illustrate the approximate uniform shrinkage prior, we consider a bivariate clustered mixed model with a random intercept, a bivariate clustered mixed model with both a random intercept and a random slope, and a trivariate clustered mixed model with a random intercept. The extension to higher-dimensional models or to models with more than one random slope is straightforward.

Example 1 (A Bivariate Clustered Mixed Model with a Random Intercept). A bivariate clustered mixed model, where the responses are assumed to be conditionally independent with a Poisson distribution and a gamma distribution, is given below:

Y_{i1t} | b_{i1} ∼ Poisson distribution with mean µ_{i1t} and variance µ_{i1t},
Y_{i2t} | b_{i2} ∼ gamma distribution with mean µ_{i2t} and variance µ_{i2t}^2 / ν,
log(µ_{i1t}) = β_{10} + β_{11} x_{1,i1t} + · · · + β_{1p_1} x_{p_1,i1t} + b_{i1},
log(µ_{i2t}) = β_{20} + β_{21} x_{1,i2t} + · · · + β_{2p_2} x_{p_2,i2t} + b_{i2}.

Conditional on D,

bi = (bi1 , bi2 )T ∼ i.i.d. multivariate normal (0, D)

where D = [σ_{ij}]_{i=1,2; j=1,2}, i = 1, . . . , N, and t = 1, . . . , T_i. Therefore, the approximate uniform shrinkage prior is

π_D(D) ∝ \left[ \left( 1 + \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} µ_{i1t} \, σ_{11} \right) \left( 1 + \frac{1}{N} \sum_{i=1}^{N} T_i ν \, σ_{22} \right) − \left( \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} µ_{i1t} \right) \left( \frac{1}{N} \sum_{i=1}^{N} T_i ν \right) σ_{12}^2 \right]^{-3}.
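The Example 1 kernel can be evaluated directly once the GLM weight sums are computed. In the sketch below the cluster sizes T_i, the Poisson means µ_{i1t}, and the gamma shape ν are hypothetical placeholders, not estimates from real data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical summaries: cluster sizes T_i, Poisson means mu_i1t, gamma shape nu.
N, nu = 25, 2.0
T = rng.integers(3, 6, size=N)
mu1 = [rng.uniform(0.5, 3.0, size=t) for t in T]

a1 = sum(m.sum() for m in mu1) / N      # (1/N) sum_i sum_t mu_i1t
a2 = (T * nu).sum() / N                 # (1/N) sum_i T_i * nu

def prior_example1(s11, s12, s22):
    """Kernel [(1 + a1*s11)(1 + a2*s22) - a1*a2*s12^2]^(-3) from Example 1."""
    det = (1 + a1 * s11) * (1 + a2 * s22) - a1 * a2 * s12 ** 2
    return det ** -3

print(prior_example1(0.5, 0.1, 0.4) > 0)
```

Because the kernel is a decreasing function of the bracketed determinant, larger variance components receive smaller prior density, which is the shrinkage behavior the prior is designed to encode.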

Example 2 (A Bivariate Clustered Mixed Model with Both a Random Intercept and Random Slope). Consider the bivariate clustered mixed model with both a random intercept and a random slope:

Y_{i1t} | b_{i1} ∼ Poisson distribution with mean µ_{i1t} and variance µ_{i1t},
Y_{i2t} | b_{i2} ∼ gamma distribution with mean µ_{i2t} and variance µ_{i2t}^2 / ν,
log(µ_{i1t}) = β_{10} + β_{11} x_{1,i1t} + · · · + β_{1p_1} x_{p_1,i1t} + b_{i10} + b_{i11} z_{it},
log(µ_{i2t}) = β_{20} + β_{21} x_{1,i2t} + · · · + β_{2p_2} x_{p_2,i2t} + b_{i20} + b_{i21} z_{it}.

Conditional on D,

bi = (bi10 , bi11 , bi20 , bi21 )T ∼ i.i.d. multivariate normal (0, D)

where D = [σ_{ij}]_{i=1,...,4; j=1,...,4}, i = 1, . . . , N, and t = 1, . . . , T_i. In this case, the approximate uniform shrinkage prior is

π_D(D) ∝ \left| \begin{matrix}
1 + \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{11} & \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{12} & \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{13} & \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{14} \\
\frac{1}{N}\sum_{i=1}^{N} S_2(i) σ_{21} & 1 + \frac{1}{N}\sum_{i=1}^{N} S_2(i) σ_{22} & \frac{1}{N}\sum_{i=1}^{N} S_2(i) σ_{23} & \frac{1}{N}\sum_{i=1}^{N} S_2(i) σ_{24} \\
\frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{31} & \frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{32} & 1 + \frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{33} & \frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{34} \\
\frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_i} z_{i1t}^2 ν σ_{41} & \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_i} z_{i1t}^2 ν σ_{42} & \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_i} z_{i1t}^2 ν σ_{43} & 1 + \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_i} z_{i1t}^2 ν σ_{44}
\end{matrix} \right|^{-5},

where S_1(i) = \sum_{t=1}^{T_i} µ_{i1t} and S_2(i) = \sum_{t=1}^{T_i} z_{i1t}^2 µ_{i1t}.

Example 3 (A Trivariate Clustered Mixed Model with Random Intercept). Assume measurements are repeatedly taken by three different methods and follow a Poisson, a gamma, and a normal distribution,


respectively. The model is shown below:

Y_{i1t} | b_{i1} ∼ Poisson distribution with mean µ_{i1t} and variance µ_{i1t},
Y_{i2t} | b_{i2} ∼ gamma distribution with mean µ_{i2t} and variance µ_{i2t}^2 / ν,
Y_{i3t} | b_{i3} ∼ normal distribution with mean µ_{i3t} and variance σ_N^2,
log(µ_{i1t}) = β_{10} + β_{11} x_{1,it} + · · · + β_{1p_1} x_{p_1,it} + b_{i1},
log(µ_{i2t}) = β_{20} + β_{21} x_{1,it} + · · · + β_{2p_2} x_{p_2,it} + b_{i2},
µ_{i3t} = β_{30} + β_{31} x_{1,it} + · · · + β_{3p_3} x_{p_3,it} + b_{i3}.

Conditional on D,

bi = (bi1 , bi2 , bi3 )T ∼ i.i.d. multivariate normal (0, D)

where D = [σ_{ij}]_{i=1,...,3; j=1,...,3}, i = 1, . . . , N, and t = 1, . . . , T_i. Therefore, the approximate uniform shrinkage prior is

π_D(D) ∝ \left| \begin{matrix}
1 + \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{11} & \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{12} & \frac{1}{N}\sum_{i=1}^{N} S_1(i) σ_{13} \\
\frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{21} & 1 + \frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{22} & \frac{1}{N}\sum_{i=1}^{N} T_i ν σ_{23} \\
\frac{1}{N}\sum_{i=1}^{N} T_i \frac{1}{σ_N^2} σ_{31} & \frac{1}{N}\sum_{i=1}^{N} T_i \frac{1}{σ_N^2} σ_{32} & 1 + \frac{1}{N}\sum_{i=1}^{N} T_i \frac{1}{σ_N^2} σ_{33}
\end{matrix} \right|^{-3},

where S_1(i) = \sum_{t=1}^{T_i} µ_{i1t}.

4. Properties of the approximate uniform shrinkage prior

Several properties of the approximate uniform shrinkage prior are shown in this section. Natarajan and Kass [16] showed that in the univariate GLMM the approximate uniform shrinkage prior is proper and leads to a proper posterior under some circumstances. In this section, we show that the extended approximate uniform shrinkage prior in the multivariate GLMM is proper and is a probability density function. We also give sufficient conditions under which the corresponding posterior is proper. Some of the following proofs are adaptations of the proofs in the appendix of Natarajan and Kass [16].

Theorem 1. The approximate uniform shrinkage prior in the MGLMM is a proper prior.

Proof. As described in Section 3, the approximate uniform shrinkage prior is

π_D(D) ∝ \left| I_{Lq} + \left( \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i \right) D \right|^{-2q-1},

where D ranges over positive definite matrices. Consider the weight matrix S given to the prior mean in the shrinkage estimate, defined in the previous section. Let R = {S : all principal minors are less than one} and R_1 = {S : all first-order and second-order principal minors are positive and less than one}; then R is a subset of R_1. Hence

\int_R dS ≤ \int_{R_1} dS = \int_0^1 · · · \int_0^1 \prod_{i<j} \left\{ \int I\left( s_{i,j}^2 ≤ s_{i,i} s_{j,j} \right) ds_{i,j} \right\} ds_{1,1} · · · ds_{Lq,Lq} = 2^{Lq(Lq-1)/2} \int_0^1 · · · \int_0^1 \prod_{i=1}^{Lq} s_{i,i}^{(Lq-1)/2} \, ds_{i,i} < ∞.

Thus a uniform prior for S is integrable, and hence the approximate uniform shrinkage prior π_D(D) is integrable; that is, \int π_D(D) \, dD < ∞. Therefore, the approximate uniform shrinkage prior in the MGLMM is proper.

Denote π_D(D) = C \left| I_{Lq} + AD \right|^{-2q-1}, where C is a constant and A = \frac{1}{N} \sum_{i=1}^{N} Z_i^T W Z_i. Since A is a positive diagonal matrix, there exists a diagonal matrix A^{1/2} such that A = (A^{1/2})^2 and (A^{1/2})^T = A^{1/2}. For any nonzero column vector x of Lq real numbers and y = A^{1/2} x, we have x^T (A^{1/2})^T D A^{1/2} x = y^T D y > 0 since D is positive definite. Thus A^{1/2} D A^{1/2} is positive definite, and hence I_{Lq} + A^{1/2} D A^{1/2} is positive definite, i.e., |I_{Lq} + A^{1/2} D A^{1/2}| > 0. By Sylvester's determinant theorem, |I_{Lq} + AD| = |I_{Lq} + A^{1/2} D A^{1/2}| > 0, and hence |I_{Lq} + AD|^{-2q-1} > 0. Let C = \left( \int |I_{Lq} + AD|^{-2q-1} \, dD \right)^{-1}, a positive constant; C is finite since 0 < \int π_D(D) \, dD < ∞. Therefore \int π_D(D) \, dD = 1 and π_D(D) = C |I_{Lq} + AD|^{-2q-1} > 0, so π_D(D) is a probability density function. □
H.-C. Chen, T.E. Wehrly / Journal of Multivariate Analysis 145 (2016) 148–161

155

Theorem 2. Assume the data follows a MGLMM. Also assume that β and D are a priori independent. We also assume that conditional on D, β and b are independent, and the prior distribution of β is uniform. If there exists p full rank vectors xTk , where k = 1, . . . , p, such that L=



   k





−∞

f (yk |r1,k , r2,k , b)dr1,k dr2,k f (b|D)dbπ (D)dD < ∞ −∞

where r1,k = xTk β1 and r2,k = xTk β2 , then the posterior distribution corresponding to the approximate uniform shrinkage prior for D is proper. Proof. First we can show that the marginal probability density function of the data m(y) is finite. Assume there exist r1,k = xTk β1 and r2,k = xTk β2 for any p full rank design vectors, then the Jacobian of the transformation is |J | = (det(X ∗ ))−1 , where X ∗ is a p × p full rank matrix with rows xTk . Then, m(y) =

  Ti N  2 

f (yijt |β, bi )f (bi |D)dbi dβπD (D)dD

i=1 j=1 t =1



f (yk |rk , b)dr1,k dr2,k f (bi |D)dbi πD (D)dD ∝ L < ∞.

∝ |J |

Notice that the second equation holds since individual components in the likelihood are bounded and thus can be ignored. Hence, m(y) is bounded above. Since the joint posterior distribution π (β, D) is proper if and only if m(y) is finite, the posterior distribution corresponding to the approximate uniform shrinkage prior for D is proper when the prior distribution for β is uniform.  Corollary 3. Suppose that the conditions of Theorem 2 hold. The posterior distribution corresponding to the uniform prior for β and the approximate uniform shrinkage prior for D is proper if measurements of each method are assumed to follow, but not limited to, any of the following distributional families: Poisson distribution with a canonical link when yk corresponding to the full rank xTk are nonzero, gamma distribution with a canonical link or log link, and Gaussian distribution and inverse Gaussian distribution with a canonical link. Proof. First, assume the measurements are from a joint model of Poisson distribution with log link and gamma distribution with log link, then log(µ1,k ) = xTk β1 + z1T,k b1,k and log(µ2,k ) = xTk β2 + z2T,k b2,k . Let r1,k = xTk β1 and r2,k = xTk β2 , then

f (yk |r1,k , r2,k , bk ) ∝ exp[− exp(r1,k + z1T,k b1,k ) + y1,k (r1,k + z1T,k b1,k ) − ν y2,k exp{−(r2,k + z2T,k b2,k )} − ν(r2,k + z2T,k b2,k )], where ν is the shape parameter in the gamma distribution. Thus,    ∞  ∞ f (yk |rk , b)dr1,k dr2,k f (b|D)dbπ (D)dD I = k



−∞ ∞

   k

−∞

−∞ ∞



−∞

exp[− exp(r1,k + z1T,k b1,k ) + y1,k (r1,k + z1T,k b1,k )

− ν y2,k exp{−(r2,k + z2T,k b2,k )} − ν(r2,k + z2T,k b2,k )]dr1,k dr2,k f (b|D)dbπ (D)dD    ∞  ∞ ∝ exp{−s1,k + y1,k log(s1,k )} exp{−ν y2,k s2,k + ν log(s2,k )}s1,k s2,k ds1,k ds2,k f (b|D)dbπ (D)dD k

=

0

k

0



   0

y1,k +1

e−s1,k s1,k



 ds1,k 0

1 e−ν y2,k s2,k sν+ 2,k ds2,k f (b|D)dbπ (D)dD

where the transformations s1,k = exp(r1,k + z1T,k b1,k ) and s2,k = exp{−(r2,k + z2T,k b2,k )} are made in the last two equations. Then I is finite when yi2t are all nonzero. Thus, the corresponding posterior distribution is proper. The posterior distributions for measurements from a joint model of distributions described above can be shown to be proper analogously.  5. Posterior distributions simulation In this section the Markov chain Monte Carlo (MCMC) algorithm is outlined for estimating the joint posterior distribution of fixed effect parameters and the variance components of random effects. MCMC for univariate GLMM has been discussed in several studies [23,4,24]. Zeger [23] provided a detailed illustration of this approach. Suppose that all observations in a dataset are independent. Assume that β and D are a priori independent. Since MGLMM can be viewed as a hierarchical Bayesian model, the Bayesian inference is obtained by sampling from the full distribution

156

H.-C. Chen, T.E. Wehrly / Journal of Multivariate Analysis 145 (2016) 148–161

of each variable conditioned on all other variables. That is, we sample from f(β|y, b), f(D|b) and f(b_i|β, D, y), respectively. The full conditional distributions of β, D, and b_i are given by

$$f(\beta \mid y, b) \propto \left[\prod_{i=1}^{N}\prod_{j=1}^{L}\prod_{t=1}^{T_i} f(y_{ijt} \mid \beta, b_i)\right] \pi(\beta),$$

$$f(D \mid b) \propto \left[\prod_{i=1}^{N}\prod_{j=1}^{L} f(b_{ij} \mid D)\right] \pi(D),$$

$$f(b_i \mid y, \beta, D) \propto \left[\prod_{j=1}^{L}\prod_{t=1}^{T_i} f(y_{ijt} \mid \beta, b_i)\right] \exp\left(-\frac{1}{2}\, b_i^T D^{-1} b_i\right).$$
Gibbs sampling can be used to estimate the desired posterior distributions. Let β(0), b(0) and D(0) be arbitrary starting points. Given the full conditional distributions defined above, β(1) is drawn from f(β|y, b(0)), D(1) is drawn from f(D|b(0)), and then b_i(1) is drawn from f(b_i|y, β(1), D(1)). Samples are generated iteratively and, after convergence, collected to form the empirical distribution and compute the posterior summaries of interest. Rejection sampling with a normal proposal distribution based on maximum likelihood estimation can be used to sample from the posteriors of β and the b_i. The main computational difficulty is that the posterior of D is no longer an inverse Wishart distribution when the approximate uniform shrinkage prior is used; methods such as the Metropolis–Hastings algorithm can be adopted to generate realizations from f(D|b), with the inverse Wishart distribution as the proposal distribution.

6. Simulation study

Simulation studies are conducted in this section to evaluate the performance of the approximate uniform shrinkage prior and compare the results with other commonly used priors. Independent bivariate mixed outcomes are considered, and a joint model of clustered count and continuous responses is fitted. These outcomes are assumed to independently follow distributions in the exponential family, depending on fixed effects and subject-specific correlated random effects. First, for each subject the fixed covariates x_{i1} and x_{i2} are generated independently and identically from a standard normal distribution, and the random effects b_i = (b_{i1}, b_{i2})^T are generated independently from a bivariate normal distribution with mean 0 and covariance matrix D. Then, given the random effect b_i, the measurements Y_{i1t} and Y_{i2t} are generated independently from a Poisson distribution with mean μ_{i1t} and a gamma distribution with mean μ_{i2t} and variance μ²_{i2t}/ν, respectively.
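The data-generating scheme just described can be sketched as follows. This is a minimal sketch under stated assumptions: it lets the two responses share the covariates, and the sizes, parameter values and seed are illustrative.

```python
import numpy as np

def simulate_bivariate(N=100, T=7, beta1=(0.5, 0.3, 0.7), beta2=(0.5, 0.3, 0.7),
                       D=((1.0, 0.9), (0.9, 1.0)), nu=1.0, seed=1):
    """Simulate clustered Poisson/gamma outcomes driven by correlated random intercepts."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal((2, N, T))                          # fixed covariates
    b = rng.multivariate_normal(np.zeros(2), np.asarray(D), size=N)  # (b_i1, b_i2)
    mu1 = np.exp(beta1[0] + beta1[1] * x1 + beta1[2] * x2 + b[:, [0]])
    mu2 = np.exp(beta2[0] + beta2[1] * x1 + beta2[2] * x2 + b[:, [1]])
    y1 = rng.poisson(mu1)                     # count response, mean mu1
    y2 = rng.gamma(shape=nu, scale=mu2 / nu)  # mean mu2, variance mu2**2 / nu
    return y1, y2
```

With `nu=1.0` the gamma draws reduce to exponential responses, matching the setting below.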
For the sake of simplicity, the shape parameter in the fitted gamma distribution is set to ν = 1, which produces an exponential distribution. The multivariate generalized linear mixed model with the natural logarithm as the link function is considered. That is, we consider the following model:

$$Y_{i1t} \mid b_{i1} \sim \text{Poisson}(\mu_{i1t}), \qquad Y_{i2t} \mid b_{i2} \sim \text{Exponential}(\mu_{i2t}),$$
$$\log(\mu_{i1t}) = \beta_{10} + \beta_{11} x_{1,i1t} + \beta_{12} x_{2,i1t} + b_{i1},$$
$$\log(\mu_{i2t}) = \beta_{20} + \beta_{21} x_{1,i2t} + \beta_{22} x_{2,i2t} + b_{i2},$$

and, conditional on D,

$$b_i = (b_{i1}, b_{i2})^T \sim \text{i.i.d. multivariate normal}(0, D),$$

where D = [σ_{ij}]_{i=1,2; j=1,2}, i = 1, ..., N and t = 1, ..., T_i. This situation may arise when two measurements are observed repeatedly from each subject in a longitudinal study. In the simulation study, each dataset consists of N = 50 or N = 100 clusters of size T_i = 1 or T_i = 7, respectively. The true values of the fixed effect parameters are β = (β_{10}, β_{11}, β_{12}, β_{20}, β_{21}, β_{22}) = (0.5, 0.3, 0.7, 0.5, 0.3, 0.7), and the covariance matrix of the random effects is

$$D = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix},$$

which implies strongly positively correlated random effects. An improper uniform prior distribution is placed on the fixed effect parameters β, and the covariance matrix D of the random effects is assumed to have the approximate uniform shrinkage prior. In addition, the conjugate inverse Wishart(2, 2I) and inverse Wishart(2, 2D) priors are considered; note that the second prior is included for comparison only, since the true value of D is not available in practice. The simulation studies are implemented with the MCMC procedure in the SAS software (SAS Institute, Cary, NC). For each situation, 500 MCMC runs are performed, each consisting of 10 000 iterations after a burn-in of 3000 iterations, with only every 20th sample collected. Inferences are based on the 500 samples generated from the full posterior distribution. Trace plots of parameter estimates versus the simulation index demonstrate that the chains mix well and converge to their stationary distributions. For each situation, the posterior summaries for the fixed effect parameters β and the variance components D are shown in Tables 1–4, respectively. In each table, the fourth column is the average of the 500 posterior means; the fifth column is the


Table 1
Simulation results for fixed effect parameters and the variance components of the random effects when each dataset consists of 100 clusters of size 7.

Parameter  Method     True value  Post Mean  Post Sd  SE of Post Mean  Relative bias  HPD interval width  Coverage
β10        AUS        0.5         0.4726     0.1014   0.1035           −5.50%         0.3868              91.20%
           IW(2, 2I)  0.5         0.4918     0.1043   0.1027           −1.60%         0.3983              93.40%
           IW(2, 2D)  0.5         0.4966     0.1032   0.1043           −0.70%         0.3932              93.60%
β11        AUS        0.3         0.3016     0.0227   0.0227            0.50%         0.0873              95.00%
           IW(2, 2I)  0.3         0.3019     0.0228   0.0228            0.60%         0.0876              95.00%
           IW(2, 2D)  0.3         0.3019     0.0227   0.0229            0.60%         0.0871              94.20%
β12        AUS        0.7         0.7002     0.0233   0.0240            0.00%         0.0896              93.80%
           IW(2, 2I)  0.7         0.7006     0.0234   0.0240            0.10%         0.0899              94.60%
           IW(2, 2D)  0.7         0.7005     0.0233   0.0241            0.10%         0.0898              94.00%
β20        AUS        0.5         0.4805     0.1024   0.1066           −3.90%         0.3928              91.80%
           IW(2, 2I)  0.5         0.4958     0.1060   0.1049           −0.80%         0.4069              94.00%
           IW(2, 2D)  0.5         0.5006     0.1046   0.1070            0.10%         0.4005              93.20%
β21        AUS        0.3         0.3006     0.0414   0.0414            0.20%         0.1592              94.20%
           IW(2, 2I)  0.3         0.3005     0.0419   0.0418            0.20%         0.1606              94.40%
           IW(2, 2D)  0.3         0.3004     0.0414   0.0412            0.10%         0.1594              93.20%
β22        AUS        0.7         0.6974     0.0413   0.0388           −0.40%         0.1589              95.40%
           IW(2, 2I)  0.7         0.6980     0.0417   0.0391           −0.30%         0.1605              95.80%
           IW(2, 2D)  0.7         0.6977     0.0412   0.0387           −0.30%         0.1586              96.20%
D11        AUS        1.0         1.0081     0.1626   0.1623            0.80%         0.6122              94.40%
           IW(2, 2I)  1.0         1.0515     0.1710   0.1627            5.20%         0.6445              95.60%
           IW(2, 2D)  1.0         1.0366     0.1676   0.1616            3.70%         0.6306              95.80%
D12        AUS        0.9         0.8981     0.1465   0.1427           −0.20%         0.5534              93.80%
           IW(2, 2I)  0.9         0.8866     0.1508   0.1425           −1.50%         0.5699              94.40%
           IW(2, 2D)  0.9         0.9264     0.1515   0.1429            2.90%         0.5727              95.60%
D22        AUS        1.0         0.9905     0.1637   0.1609           −0.90%         0.6175              93.60%
           IW(2, 2I)  1.0         1.0437     0.1721   0.1598            4.40%         0.6496              96.20%
           IW(2, 2D)  1.0         1.0211     0.1679   0.1602            2.10%         0.6349              95.80%

Table 2
Simulation results for fixed effect parameters and the variance components of the random effects when each dataset consists of 100 clusters of size 1.

Parameter  Method     True value  Post Mean  Post Sd  SE of Post Mean  Relative bias  HPD interval width  Coverage
β10        AUS        0.5         0.4495     0.1476   0.1506           −10.10%        0.5662              93.60%
           IW(2, 2I)  0.5         0.4573     0.1480   0.1459            −8.60%        0.5677              95.00%
           IW(2, 2D)  0.5         0.4744     0.1453   0.1448            −5.10%        0.5572              95.40%
β11        AUS        0.3         0.2964     0.1295   0.1366            −1.20%        0.4982              94.00%
           IW(2, 2I)  0.3         0.3047     0.1349   0.1423             1.60%        0.5187              93.20%
           IW(2, 2D)  0.3         0.3007     0.1319   0.1383             0.20%        0.5072              92.80%
β12        AUS        0.7         0.6905     0.1328   0.1255            −1.40%        0.5105              96.00%
           IW(2, 2I)  0.7         0.7108     0.1384   0.1315             1.50%        0.5327              95.00%
           IW(2, 2D)  0.7         0.7062     0.1360   0.1293             0.90%        0.5223              95.80%
β20        AUS        0.5         0.4689     0.1570   0.1559            −6.20%        0.6043              93.80%
           IW(2, 2I)  0.5         0.4573     0.1595   0.1554            −8.50%        0.6148              94.40%
           IW(2, 2D)  0.5         0.4896     0.1558   0.1539            −2.10%        0.5997              94.60%
β21        AUS        0.3         0.2885     0.1539   0.1610            −3.80%        0.5935              93.80%
           IW(2, 2I)  0.3         0.2910     0.1583   0.1631            −3.00%        0.6104              92.80%
           IW(2, 2D)  0.3         0.2927     0.1542   0.1632            −2.40%        0.5941              92.20%
β22        AUS        0.7         0.6831     0.1537   0.1477            −2.40%        0.5923              93.80%
           IW(2, 2I)  0.7         0.6906     0.1590   0.1489            −1.30%        0.6131              94.60%
           IW(2, 2D)  0.7         0.6943     0.1547   0.1510            −0.80%        0.5967              94.60%
D11        AUS        1.0         1.0848     0.2535   0.2720             8.50%        0.9358              90.00%
           IW(2, 2I)  1.0         1.1381     0.2704   0.2561            13.80%        1.0028              94.00%
           IW(2, 2D)  1.0         1.0923     0.2540   0.2458             9.20%        0.9443              93.20%
D12        AUS        0.9         0.8855     0.2214   0.2273            −1.60%        0.8341              90.20%
           IW(2, 2I)  0.9         0.8051     0.2273   0.2183           −10.60%        0.8619              89.60%
           IW(2, 2D)  0.9         0.9459     0.2272   0.2143             5.10%        0.8592              95.60%
D22        AUS        1.0         1.0781     0.3091   0.3153             7.80%        1.1325              93.40%
           IW(2, 2I)  1.0         1.1777     0.3289   0.2917            17.80%        1.2264              96.00%
           IW(2, 2D)  1.0         1.0815     0.2969   0.2641             8.20%        1.1047              96.80%

average of the 500 posterior standard deviations; the sixth column is the standard error of the 500 posterior means; the seventh column is the ratio of the difference between the posterior mean and the true value to the true value (the relative bias); the eighth column is the average width of the 95% highest posterior density (HPD) intervals over the 500 simulations; and the ninth column is the percentage of times that the 95% HPD interval includes the true value. As seen in these tables, most of the Bayesian estimates recover the true values well, especially when there are more replicates within each subject. The posterior means are more sensitive to the number of replicates than to the sample size. In many situations, posterior estimates under the approximate


Table 3
Simulation results for fixed effect parameters and the variance components of the random effects when each dataset consists of 50 clusters of size 7.

Parameter  Method     True value  Post Mean  Post Sd  SE of Post Mean  Relative bias  HPD interval width  Coverage
β10        AUS        0.5         0.4478     0.1445   0.1565           −10.40%        0.5513              89.60%
           IW(2, 2I)  0.5         0.4874     0.1498   0.1583            −2.50%        0.5736              92.40%
           IW(2, 2D)  0.5         0.4927     0.1489   0.1535            −1.50%        0.5701              92.60%
β11        AUS        0.3         0.2993     0.0321   0.0328            −0.20%        0.1235              93.60%
           IW(2, 2I)  0.3         0.2996     0.0324   0.0329            −0.10%        0.1249              94.20%
           IW(2, 2D)  0.3         0.2995     0.0323   0.0329            −0.20%        0.1246              95.00%
β12        AUS        0.7         0.7000     0.0333   0.0323             0.00%        0.1281              95.00%
           IW(2, 2I)  0.7         0.7005     0.0334   0.0323             0.10%        0.1288              94.60%
           IW(2, 2D)  0.7         0.7004     0.0333   0.0324             0.10%        0.1283              94.80%
β20        AUS        0.5         0.4629     0.1462   0.1589            −7.40%        0.5613              89.00%
           IW(2, 2I)  0.5         0.4970     0.1527   0.1601            −0.60%        0.5869              93.00%
           IW(2, 2D)  0.5         0.4997     0.1515   0.1567            −0.10%        0.5831              93.80%
β21        AUS        0.3         0.3005     0.0584   0.0600             0.20%        0.2251              92.60%
           IW(2, 2I)  0.3         0.3002     0.0597   0.0600             0.10%        0.2299              94.80%
           IW(2, 2D)  0.3         0.3005     0.0587   0.0600             0.20%        0.2259              94.00%
β22        AUS        0.7         0.6971     0.0587   0.0572            −0.40%        0.2261              93.60%
           IW(2, 2I)  0.7         0.6975     0.0599   0.0579            −0.40%        0.2305              95.40%
           IW(2, 2D)  0.7         0.6974     0.0590   0.0570            −0.40%        0.2263              93.40%
D11        AUS        1.0         1.0106     0.2334   0.2285             1.10%        0.8627              93.60%
           IW(2, 2I)  1.0         1.0916     0.2551   0.2303             9.20%        0.9394              96.20%
           IW(2, 2D)  1.0         1.0681     0.2482   0.2254             6.80%        0.9147              95.60%
D12        AUS        0.9         0.9000     0.2102   0.2118             0.00%        0.7793              91.60%
           IW(2, 2I)  0.9         0.8844     0.2211   0.2130            −1.70%        0.8196              93.80%
           IW(2, 2D)  0.9         0.9529     0.2252   0.2127             5.90%        0.8316              94.80%
D22        AUS        1.0         0.9987     0.2365   0.2402            −0.10%        0.8742              91.20%
           IW(2, 2I)  1.0         1.0930     0.2571   0.2399             9.30%        0.9489              95.40%
           IW(2, 2D)  1.0         1.0591     0.2507   0.2375             5.90%        0.9242              94.60%

Table 4
Simulation results for fixed effect parameters and the variance components of the random effects when each dataset consists of 50 clusters of size 1.

Parameter  Method     True value  Post Mean  Post Sd  SE of Post Mean  Relative bias  HPD interval width  Coverage
β10        AUS        0.5         0.3995     0.2204   0.2444           −20.10%        0.8460              90.20%
           IW(2, 2I)  0.5         0.4333     0.2177   0.2311           −13.30%        0.8348              91.80%
           IW(2, 2D)  0.5         0.4554     0.2107   0.2207            −8.90%        0.8078              92.00%
β11        AUS        0.3         0.2934     0.1889   0.1990            −2.20%        0.7301              91.20%
           IW(2, 2I)  0.3         0.3079     0.1995   0.2135             2.60%        0.7695              92.00%
           IW(2, 2D)  0.3         0.3064     0.1954   0.2040             2.10%        0.7529              91.40%
β12        AUS        0.7         0.6905     0.1902   0.1911            −1.40%        0.7316              93.40%
           IW(2, 2I)  0.7         0.7300     0.2050   0.2067             4.30%        0.7906              94.00%
           IW(2, 2D)  0.7         0.7261     0.1980   0.2041             3.70%        0.7650              94.20%
β20        AUS        0.5         0.4672     0.2257   0.2375            −6.60%        0.8678              91.00%
           IW(2, 2I)  0.5         0.4755     0.2316   0.2344            −4.90%        0.8948              92.20%
           IW(2, 2D)  0.5         0.5050     0.2242   0.2316             1.00%        0.8635              93.00%
β21        AUS        0.3         0.2859     0.2240   0.2235            −4.70%        0.8632              93.80%
           IW(2, 2I)  0.3         0.2943     0.2328   0.2310            −1.90%        0.8980              93.80%
           IW(2, 2D)  0.3         0.2960     0.2253   0.2264            −1.30%        0.8691              92.80%
β22        AUS        0.7         0.6756     0.2235   0.2205            −3.50%        0.8644              94.40%
           IW(2, 2I)  0.7         0.6901     0.2312   0.2249            −1.40%        0.8918              93.60%
           IW(2, 2D)  0.7         0.6984     0.2236   0.2234            −0.20%        0.8637              93.00%
D11        AUS        1.0         1.1432     0.4039   0.3937            14.30%        1.4477              92.60%
           IW(2, 2I)  1.0         1.2143     0.4234   0.3542            21.40%        1.5150              97.40%
           IW(2, 2D)  1.0         1.1490     0.3872   0.3214            14.90%        1.3922              97.60%
D12        AUS        0.9         0.8647     0.3262   0.3159            −3.90%        1.2198              91.80%
           IW(2, 2I)  0.9         0.7644     0.3354   0.2973           −15.10%        1.2570              90.40%
           IW(2, 2D)  0.9         0.9735     0.3370   0.2870             8.20%        1.2486              96.40%
D22        AUS        1.0         1.1339     0.4570   0.4174            13.40%        1.6389              93.60%
           IW(2, 2I)  1.0         1.2570     0.4890   0.3876            25.70%        1.7599              98.00%
           IW(2, 2D)  1.0         1.1372     0.4301   0.3390            13.70%        1.5455              97.60%

uniform shrinkage prior have the smallest relative biases among the three priors. In particular, when estimating the variance components of the random effects, its absolute relative bias is markedly smaller than under the two inverse Wishart priors, and this difference grows as the number of replicates increases. On the other hand, there is no notable difference between the standard errors of the posterior means and the average posterior standard deviations under any situation; both decrease not only as the sample size increases but also as the number of replicates increases.
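The per-parameter summaries reported in Tables 1–4 can be computed from a vector of posterior draws. A minimal sketch follows; the shortest-sorted-window HPD computation assumes a unimodal marginal posterior.

```python
import numpy as np

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing `prob` of the sorted draws (valid for unimodal posteriors)."""
    x = np.sort(np.asarray(draws))
    n = len(x)
    k = int(np.ceil(prob * n))           # points each candidate interval must contain
    widths = x[k - 1:] - x[:n - k + 1]   # width of every window of k consecutive draws
    j = int(np.argmin(widths))
    return x[j], x[j + k - 1]

def summarize(draws, true_value):
    """Posterior mean/sd, relative bias, HPD width and coverage flag, as in Tables 1-4."""
    lo, hi = hpd_interval(draws)
    mean = float(np.mean(draws))
    return {
        "post_mean": mean,
        "post_sd": float(np.std(draws, ddof=1)),
        "relative_bias": (mean - true_value) / true_value,
        "hpd_width": hi - lo,
        "covers": lo <= true_value <= hi,
    }
```

Averaging these summaries over 500 simulated datasets reproduces the tabulated columns.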


The approximate uniform shrinkage prior usually has the smallest HPD interval widths, resulting in lower 95% HPD interval coverage probabilities; this may be because its posterior standard deviations are smaller than under the other priors, so it tends to give tighter, less conservative intervals. Since almost all of the coverage probabilities are above 90%, however, the approximate uniform shrinkage prior still seems competitive.

Posterior inferences can also be evaluated in terms of squared error risk. Hence, in order to assess the accuracy and precision of the estimators of β, D and b_i, Table 5 reports the risks

$$E\bigl[(\hat\beta - \beta)^T(\hat\beta - \beta)\bigr], \quad E\bigl[(\hat D_{11} - D_{11})^2\bigr], \quad E\bigl[(\hat D_{12} - D_{12})^2\bigr], \quad E\bigl[(\hat D_{22} - D_{22})^2\bigr],$$
$$\sum_{i=1}^{N} E\bigl[(\hat b_{i1} - b_{i1})^2\bigr] \quad \text{and} \quad \sum_{i=1}^{N} E\bigl[(\hat b_{i2} - b_{i2})^2\bigr].$$

The approximate uniform shrinkage prior has similar

risks to the inverse Wishart(2, 2I) and inverse Wishart(2, 2D) priors when estimating β, D and b_i. The risks for β and the components of D decrease as the sample size or the number of replicates increases, whereas the risks for the b_i (which are summed over subjects) decrease as the sample size decreases or the number of replicates increases. In conclusion, the approximate uniform shrinkage prior has good overall performance in the simulation studies: it yields less biased estimators, especially for the variance components of the random effects.

7. Example: OAI data

To illustrate the methodology, we consider osteoarthritis data from the Osteoarthritis Initiative (OAI) database, which is available for public access at http://www.oai.ucsf.edu/ and is described in detail by McCulloch [15]. The Osteoarthritis Initiative is a cohort study of the determinants of knee osteoarthritis in people aged 45 years and above. The data were collected from persons at high risk for knee osteoarthritis at baseline, 12 months, 24 months, 36 months and 48 months, resulting in five measurements per individual. We restrict our study to the complete cases, which reduces the data to 1499 individuals. The outcomes of interest are the Western Ontario and McMaster Universities Arthritis Index (WOMAC) disability scores and the numbers of workdays missed in the past 3 months. The WOMAC is a numeric score used to assess pain, stiffness, and physical function in patients with hip and/or knee osteoarthritis, while the number of days of work missed due to knee pain, aching or stiffness in the past 3 months is a count variable. In this study, we use the average of the WOMAC scores for the left and right knees as the WOMAC disability score. The predictor variables of primary interest are age, sex and body mass index (BMI), with age and BMI continuous and sex categorical. To accommodate such clustered mixed outcome data, we consider a multivariate generalized linear mixed model with subject-specific random effects.
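The negative binomial specification used below for the count outcome can be simulated as a gamma–Poisson mixture. A short sketch under that standard parameterization (shape parameter δ, so the variance is μ + μ²/δ); the function name is illustrative, not from the paper:

```python
import numpy as np

def rnegbin(mu, delta, size=None, rng=None):
    """Negative binomial draws with mean mu and variance mu + mu**2 / delta,
    generated as a Poisson draw whose rate is gamma-mixed (shape delta, mean mu)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.gamma(shape=delta, scale=mu / delta, size=size)  # E[lam] = mu
    return rng.poisson(lam)
```

As δ grows, the gamma mixing distribution concentrates at μ and the draws approach a Poisson(μ), matching the limit noted in the text.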
Assume that, conditional on the random effects, the WOMAC disability scores follow a normal distribution and the numbers of workdays missed in the past 3 months follow a negative binomial distribution, since the negative binomial distribution can accommodate overdispersion in count data. The dispersion parameter of the negative binomial distribution is the inverse of its shape parameter, say δ_N; the negative binomial distribution approaches a Poisson distribution as the shape parameter approaches infinity. The normal and negative binomial means are related to the covariates via the identity link and logarithm link, respectively. More specifically, the data are accommodated by the following model:

$$\mathrm{WOMAC}_{i1t} \mid b_{i1} \sim \text{normal}(\mu_{i1t}, \sigma_N^2), \qquad \mathrm{MISSW}_{i2t} \mid b_{i2} \sim \text{negative binomial}(\mu_{i2t}, \delta_N),$$
$$\mu_{i1t} = \beta_{10} + \beta_{11}\,\mathrm{AGE}_{it} + \beta_{12}\,\mathrm{SEX}_{i} + \beta_{13}\,\mathrm{BMI}_{it} + b_{i1},$$
$$\log(\mu_{i2t}) = \beta_{20} + \beta_{21}\,\mathrm{AGE}_{it} + \beta_{22}\,\mathrm{SEX}_{i} + \beta_{23}\,\mathrm{BMI}_{it} + b_{i2},$$

and, conditional on D,

$$b_i = (b_{i1}, b_{i2})^T \sim \text{i.i.d. multivariate normal}(0, D),$$
where D = [σ_{ij}]_{i=1,2; j=1,2}, i = 1, ..., N and t = 1, ..., T. In this case, N = 1499 and T = 5. Assume that the prior for the fixed effect coefficients β is uniform, the prior for the variance σ_N² of the normal distribution is inverse gamma(α_N, β_N), and the prior for the overdispersion parameter δ_N of the negative binomial distribution is inverse gamma(α_D, β_D). We set α_N = 10, β_N = 1, α_D = 100, and β_D = 1 in this study. The posterior distribution of σ_N² is

$$\text{inverse gamma}\left(\frac{NT}{2} + \alpha_N,\; \sum_{i=1}^{N}\sum_{t=1}^{T_i} \frac{(y_{i1t} - \mu_{i1t})^2}{2} + \beta_N\right).$$

However, unlike σ_N², it is difficult to sample directly from the posterior distribution of δ_N, so sampling techniques are needed. We set the prior for the variance components of the random effects to be either the approximate uniform shrinkage prior or the inverse Wishart(2, 2I) prior. Under these circumstances, the approximate uniform shrinkage prior for D is

$$\pi_D(D) \propto \left[\left(1 + \frac{\sigma_{11}}{N}\sum_{i=1}^{N}\frac{n_i}{\sigma_N^2}\right)\left(1 + \frac{\sigma_{22}}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\frac{\delta_N \mu_{b_{i2t}}}{\delta_N + \mu_{b_{i2t}}}\right) - \frac{\sigma_{12}^2}{N^2 \sigma_N^2}\left(\sum_{i=1}^{N} n_i\right)\left(\sum_{i=1}^{N}\sum_{t=1}^{T_i}\frac{\delta_N \mu_{b_{i2t}}}{\delta_N + \mu_{b_{i2t}}}\right)\right]^{-3}.$$
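The closed-form inverse-gamma full conditional for σ_N² given above can be sampled directly within the MCMC loop. A short sketch, with hypothetical residuals; the inverse-gamma(shape, scale) draw is taken as the reciprocal of a gamma variate:

```python
import numpy as np

def draw_sigma2(resid, alpha_n=10.0, beta_n=1.0, rng=None):
    """Draw sigma_N^2 from its inverse-gamma(NT/2 + alpha_N, sum(resid**2)/2 + beta_N)
    full conditional, where resid collects y_i1t - mu_i1t over all subjects and times."""
    rng = np.random.default_rng() if rng is None else rng
    shape = resid.size / 2.0 + alpha_n
    scale = np.sum(resid ** 2) / 2.0 + beta_n
    # X ~ inverse-gamma(shape, scale)  <=>  1/X ~ gamma(shape, rate=scale)
    return 1.0 / rng.gamma(shape, 1.0 / scale)
```

No Metropolis step is needed for this component; only δ_N and D require proposal-based sampling.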
Analysis is performed using 500 samples obtained from a single chain, retaining every 20th iteration of 10 000 iterations after a burn-in of 2000 iterations. The posterior simulation results, including the posterior means, posterior standard deviations and 95% highest posterior density intervals, are presented in Table 6. For comparison, the data are also fitted by maximum likelihood using adaptive Gaussian quadrature; the parameter estimates, their standard errors and 95% confidence intervals are reported as well.

Table 5
Risk for the fixed effect coefficients β, the components of the covariance matrix D11, D12, D22, and the random effects b·1 and b·2; that is, E[(β̂ − β)^T(β̂ − β)], E[(D̂11 − D11)²], E[(D̂12 − D12)²], E[(D̂22 − D22)²], Σ_{i=1}^N E[(b̂_{i1} − b_{i1})²] and Σ_{i=1}^N E[(b̂_{i2} − b_{i2})²].

Method          β        D11      D12      D22      b·1       b·2
N = 100, m = 7
  AUS           0.0287   0.0259   0.0202   0.0259    7.7441   10.8635
  IW(2, 2I)     0.0270   0.0285   0.0203   0.0269    7.7257   10.8783
  IW(2, 2D)     0.0275   0.0272   0.0211   0.0259    7.6974   10.8110
N = 100, m = 1
  AUS           0.1329   0.0810   0.0518   0.1053   34.0333   39.5811
  IW(2, 2I)     0.1354   0.0845   0.0566   0.1165   34.4263   41.3977
  IW(2, 2D)     0.1306   0.0688   0.0479   0.0763   33.7481   38.6195
N = 50, m = 7
  AUS           0.0627   0.0522   0.0448   0.0576    4.6146    6.1263
  IW(2, 2I)     0.0598   0.0613   0.0455   0.0661    4.5986    6.1725
  IW(2, 2D)     0.0571   0.0554   0.0479   0.0598    4.4571    5.9452
N = 50, m = 1
  AUS           0.3023   0.1752   0.1008   0.1918   18.7341   21.8448
  IW(2, 2I)     0.3062   0.1711   0.1066   0.2160   19.1468   23.2090
  IW(2, 2D)     0.2890   0.1253   0.0876   0.1335   18.3096   20.8748

Table 6
Parameter estimates for the osteoarthritis study based on the normal–negative binomial model.

       Approximate uniform shrinkage prior              Inverse Wishart(2, 2I)                           Adaptive Gaussian quadrature
Para   Post. Mean  Post. Std  95% HPD                   Post. Mean  Post. Std  95% HPD                   Estimate   Std. error  95% CI
β10    −0.4387     0.0745     (−0.5806, −0.3134)        −0.4207     0.176      (−0.7608, −0.0807)        −0.4550    0.1846      (−0.8172, −0.0928)
β11    −0.00269    0.00184    (−0.00636, 0.00087)       −0.00289    0.00244    (−0.00767, 0.00146)       −0.0023    0.0026      (−0.0073, 0.0027)
β12     0.2148     0.0411     (0.1394, 0.2926)           0.2126     0.0401     (0.134, 0.2912)            0.2077    0.0390      (0.1312, 0.2843)
β13     0.0514     0.00298    (0.0454, 0.0567)           0.0511     0.00378    (0.0448, 0.0594)           0.0513    0.0038      (0.0438, 0.0587)
β20    −10.7088    0.2757     (−11.2364, −10.1901)      −8.9397     1.6514     (−12.1228, −5.9527)       −10.0150   1.8614      (−13.6662, −6.3637)
β21    −0.0051     0.0172     (−0.0351, 0.0305)         −0.0124     0.0233     (−0.0571, 0.0370)         −0.0408    0.0258      (−0.0913, 0.0097)
β22     0.1594     0.2709     (−0.2877, 0.6628)          0.1359     0.3406     (−0.5779, 0.7207)          0.0577    0.3638      (−0.6560, 0.7713)
β23     0.2375     0.0313     (0.1738, 0.2946)           0.2062     0.0321     (0.1493, 0.2689)           0.2213    0.0369      (0.1490, 0.2936)
σ11     0.4083     0.0211     (0.367, 0.4446)            0.4128     0.0221     (0.3737, 0.4580)           0.4062    0.0210      (0.3649, 0.4475)
σ12     1.0684     0.1678     (0.7777, 1.3909)           0.8956     0.1326     (0.6411, 1.1484)           1.1670    0.1597      (0.8538, 1.4802)
σ22     6.7426     1.4102     (4.1907, 9.757)            4.6397     1.3006     (2.5209, 7.4504)           11.9830   2.2314      (7.6059, 16.3600)
σN2     0.8064     0.0138     (0.7792, 0.8336)           0.8037     0.0149     (0.7759, 0.8343)           0.8081    0.0148      (0.7791, 0.8371)
δN      0.0173     0.00169    (0.0137, 0.0201)           0.0165     0.00189    (0.0134, 0.0204)           0.0586    0.0090      (0.0410, 0.0762)
Most Bayesian estimates using the approximate uniform shrinkage prior are similar to the Bayesian estimates using the inverse Wishart prior and to the maximum likelihood estimates using adaptive Gaussian quadrature, but the estimates differ considerably for σ22. As shown in the simulation study, estimates using the approximate uniform shrinkage prior are less biased than those using the inverse Wishart prior, especially for the variance components of the random effects, and they are closer to the estimates obtained by adaptive Gaussian quadrature. The age effect is not significant for either the WOMAC disability scores or the numbers of workdays missed in the past 3 months, the sex effect is significant only for the WOMAC disability scores, and the BMI effect is significant for both outcomes. The subject-specific random effects are significant, and the two measurements are moderately correlated.

8. Concluding remarks

In the Bayesian approach to GLMMs, the choice of prior distribution may greatly influence inferences, especially when the number of subjects is small. Noninformative priors are needed when prior information on the model parameters is insufficient. In this study, we introduced the approximate uniform shrinkage prior for the multivariate generalized linear mixed model. This prior is obtained by placing a uniform distribution on the weight given to the prior mean in the approximate shrinkage estimate of the random effects and then transforming to find the distribution of the variance components of the random effects. It is worth noting that when the two methods are assumed to be independent and identically distributed, or when there is only one method, the MGLMM reduces to an ordinary GLMM. Thus, the


proposed approximate uniform shrinkage prior becomes the approximate uniform shrinkage prior of Natarajan and Kass [16]. In this study we have shown that the approximate uniform shrinkage prior is not only easy to implement but also has several attractive properties: it is proper and leads to a proper posterior distribution for numerous common distributions under an MGLMM. In addition to the bivariate generalized linear mixed model with random intercepts, we have shown that the approximate uniform shrinkage prior can be applied to more complicated models, such as the bivariate generalized linear mixed model with both random intercepts and random slopes and the trivariate generalized linear mixed model with random intercepts; the extension to higher-dimensional models is straightforward, so the prior is flexible across diverse models. Simulation studies indicate that the use of the approximate uniform shrinkage prior results in less biased estimates, especially for the variance components of the random effects. It performs as well as the commonly used inverse Wishart prior, and better under some circumstances. It also tends to have smaller posterior standard deviations and thus smaller HPD interval widths. The posterior estimates are close to the frequentist estimates when the sample sizes are large. In summary, the approximate uniform shrinkage prior is very competitive and flexible.

References

[1] L. An, S. Nkurunziza, K.Y. Fung, D. Krewski, I. Luginaah, Shrinkage estimation in general linear models, Comput. Statist. Data Anal. 53 (2009) 2537–2549.
[2] N.E. Breslow, D.G. Clayton, Approximate inference in generalized linear mixed models, J. Amer. Statist. Assoc. 88 (1993) 9–25.
[3] B.A. Coull, A. Agresti, Random effects modeling of multiple binomial responses using the multivariate binomial logit-normal distribution, Biometrics 56 (1) (2000) 73–80.
[4] P. Damlen, J. Wakefield, S. Walker, Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables, J. R. Stat. Soc. Ser. B Stat. Methodol. 61 (2) (1999) 331–344.
[5] S.C. Dass, C.Y. Lim, T. Maiti, Default Bayesian analysis for multivariate generalized CAR models, Statist. Sinica 22 (2012) 231–248.
[6] D.B. Dunson, Bayesian latent variable models for clustered mixed outcomes, J. R. Stat. Soc. Ser. B Stat. Methodol. 62 (2000) 355–366.
[7] A. Gelman, Prior distributions for variance parameters in hierarchical models, Bayesian Anal. 1 (3) (2006) 515–533.
[8] P.J. Green, Penalized likelihood for general semi-parametric regression models, Internat. Statist. Rev. 55 (3) (1987) 245–259.
[9] R. Gueorguieva, A multivariate generalized linear mixed model for joint modelling of clustered outcomes in the exponential family, Stat. Model. 1 (3) (2001) 177–193.
[10] R.V. Gueorguieva, A. Agresti, A correlated probit model for joint modeling of clustered binary and continuous responses, J. Amer. Statist. Assoc. 96 (455) (2001) 1102–1112.
[11] D.A. Harville, Maximum likelihood approaches to variance component estimation and to related problems, J. Amer. Statist. Assoc. 72 (358) (1977) 320–338.
[12] J.G. Ibrahim, P.W. Laud, On Bayesian analysis of generalized linear models using Jeffreys's prior, J. Amer. Statist. Assoc. 86 (416) (1991) 981–986.
[13] R.E. Kass, R. Natarajan, A default conjugate prior for variance components in generalized linear mixed models (comment on article by Browne and Draper), Bayesian Anal. 1 (3) (2006) 535–542.
[14] P. McCullagh, J.A. Nelder, Generalized Linear Models, Chapman and Hall, London, 1989.
[15] C. McCulloch, Joint modelling of mixed outcome types using latent variables, Stat. Methods Med. Res. 17 (1) (2008) 53–73.
[16] R. Natarajan, R.E. Kass, Reference Bayesian methods for generalized linear mixed models, J. Amer. Statist. Assoc. 95 (449) (2000) 227–237.
[17] R. Natarajan, C.E. McCulloch, A note on the existence of the posterior distribution for a class of mixed models for binomial responses, Biometrika 82 (3) (1995) 639–643.
[18] S.M. O'Brien, D.B. Dunson, Bayesian multivariate logistic regression, Biometrics 60 (3) (2004) 739–746.
[19] D.J. Spiegelhalter, A. Thomas, N.G. Best, W.R. Gilks, BUGS 0.5: Bayesian inference using Gibbs sampling, Medical Research Council Biostatistics Unit, Cambridge, 1996.
[20] W.E. Strawderman, Proper Bayes minimax estimators of the multivariate normal mean, Ann. Math. Statist. 42 (1) (1971) 385–388.
[21] G.C. Tiao, W.Y. Tan, Bayesian analysis of random-effect models in the analysis of variance. I. Posterior distribution of variance components, Biometrika 52 (1–2) (1965) 37–53.
[22] L. Tierney, Markov chains for exploring posterior distributions, Ann. Statist. 22 (4) (1994) 1701–1728.
[23] S.L. Zeger, M.R. Karim, Generalized linear models with random effects: a Gibbs sampling approach, J. Amer. Statist. Assoc. 86 (413) (1991) 79–86.
[24] Y. Zhao, J. Staudenmayer, B.A. Coull, M.P. Wand, General design Bayesian generalized linear mixed models, Statist. Sci. 21 (1) (2006) 35–51.