Computational Statistics and Data Analysis 117 (2018) 162–181
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Dual-semiparametric regression using weighted Dirichlet process mixture Peng Sun a , Inyoung Kim a, *, Ki-Ahm Lee b,c a b c
Department of Statistics, Virginia Polytechnic Institute and State University, USA Department of Mathematical Sciences, Seoul National University, Republic of Korea Korea Institute for Advanced Study, Republic of Korea
article
info
Article history: Received 16 April 2016 Received in revised form 4 July 2017 Accepted 10 August 2017 Available online 31 August 2017 Keywords: Additive model Bayes factor Cubic splines Dual-semiparametric regression Generalized pólya urn Gibbs sampling Metropolis–Hastings Nonparametric Bayesian model Ordinal data Semiparametric regression Weighted Dirichlet process
a b s t r a c t An efficient and flexible Bayesian approach is proposed for a dual-semiparametric regression model that models mean function semiparametrically and estimates the distribution of the error term nonparametrically. Using a weighted Dirichlet process mixture (WDPM), a Bayesian approach has been developed on the assumption that the distributions of the response variables are unknown. The WDPM approach is especially useful for real applications that have heterogeneous error distributions or come from a mixture of distributions. In the mean function, the unknown functions are estimated using natural cubic smoothing splines. For the error terms, several different WDPMs are proposed using different weights that depend on the distances between the covariates. Their marginal likelihoods are derived, and the computation of marginal likelihood for WDPM is provided. Efficient Markov chain Monte Carlo (MCMC) algorithms are also provided. The Bayesian approaches based on different WDPMs are compared with the parametric error model and the Dirichlet process mixture (DPM) error model in terms of the Bayes factor using a simulation study, suggesting better performance of the Bayesian approach based on WDPM. The advantage of the proposed Bayesian approach is also demonstrated using the credit rating data. © 2017 Elsevier B.V. All rights reserved.
1. Introduction Semiparametric regression (Ruppert et al., 2003) has been widely used in many fields, including economics (Chib and Greenberg, 2010), finance (Hannah et al., 2011; Jensen and Maheu, 2013), environmetrics (Mahmoud et al., 2016) and biostatistics (Kim et al., 2003; Kim and Cohen, 2004; Kim et al., 2011; Pang and Zhao, 2012; Ortega Villa et al., 2017), in which some covariates have a known relationship with a response while others do not. It mixes the parametric and nonparametric parts together in the regression. The nonparametric part can be estimated using cubic smoothing splines (Durrleman and Simon, 1989; Green and Silverman, 1994; Pagan and Ullah, 1999; Marsh and Cormier, 2002; Li and Racine, 2006; Chib and Greenberg, 2010). Although semiparametric regression has the flexibility of modeling both parametric and nonparametric parts, parametric distributions often impose strong assumptions about the distribution of the unobserved error or the distribution of the underlying latent variable. The error distribution is often assumed to be parametric, as with normal distribution and t-distribution, when the outcomes are continuous. In ordinal models for categorical outcomes, the model is almost always specified with logit or probit links. Chu and Ghahramani (2005) proposed Gaussian processes (GP) for ordinal regression and introduced a generalization of the probit function.
*
Correspondence to: Department of Statistics, Virginia Tech., Blacksburg, VA 24061, USA E-mail address:
[email protected] (I. Kim).
http://dx.doi.org/10.1016/j.csda.2017.08.005 0167-9473/© 2017 Elsevier B.V. All rights reserved.
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
163
Parametric assumptions of error distributions are often not satisfied in real applications that have heterogeneous error distributions or come from a mixture of a heterogeneous distribution. As a result, the model may easily turn out to be misspecified, which thus influences the statistical inference. Therefore, we adopt a nonparametric Bayesian approach to allow the semiparametric regression to be more flexible. Our nonparametric Bayesian approach can apply to both continuous and ordinal responses. Because we are interested in the functional relationship between the covariate and response variable, in this paper, we model each covariate using the natural cubic smoothing splines. First, let us consider that we have p + ns covariates and a continuous response variable y. The first p covariates can be parametrically modeled with unknown parameters βp , and the other ns covariates w1 , . . . , wns are nonparametrically modeled with gζ (·), ζ = 1, . . . , ns. Our semiparametric model can be written as: yi = x′i βp + g1 (wi1 ) + g2 (wi2 ) + · · · + gnp (wi,np ) + εi
i = 1, . . . , n,
(1.1)
where xi βp is the parametric function, gζ (·) (ζ ≤ ns) are unknown functions that are estimated using the cubic smoothing splines by adopting the basis illustrated in Lancaster and Šalkauskas (1986) and Wood (2006), and ε is the error term whose distribution shape needs to be captured. In the parametric Bayesian approach, it is typical to assume that the distribution of εi has some known form fe (εi |θi ), where θi denotes the parameters of the distribution of the error term. The prior distribution of θi is also in some known form G0 . However, it is relatively challenging to derive convincing evidence about the distribution of the error. On the other hand, in the nonparametric Bayesian approach, the Dirichlet process, which was formally introduced by Ferguson (1973), enables us to construct the prior of θ in a nonparametric way. If we assume that the prior distribution of θ , say G, is sampled from a Dirichlet process, this Dirichlet process mixture (DPM) model allows more flexibility than does assuming that G belongs to some known family of distributions. Therefore, many existing studies have adopted DPM (Chib and Greenberg, 2010; Zhang et al., 2014; Dunson et al., 2007; Dunson and Stanford, 2005; Ghosal, 2009; Pati and Dunson, 2014; Chae et al., 2016) in various problems such as density estimation, the asymptotic distribution, the outlier problem, and high-dimensional analysis. Ghosal (2009) reviewed the role of the Dirichlet process (DP) and related prior distributions in nonparametric Bayesian inference. Ghosal (2009) also discussed various properties of the Dirichlet process and provided the asymptotic properties of posterior distributions. In addition, Pati and Dunson (2014) proposed robust Bayesian nonparametric regression which automatically downweights outliers and influential observations using the varying residual density based on probit stickbreaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Recently, Chae et al. (2016) proposed Bayesian sparse linear regression with unknown symmetric error. They studied full Bayesian procedures for sparse linear regression when errors have a symmetric but otherwise unknown distribution. The unknown error distribution is endowed with a symmetrized Dirichlet process mixture of Gaussian. Although DPM has flexibility, it does not take into account the covariates’ information into the prior of θ . DPM assumes that all of the θi s follow the same prior distribution G. Therefore, DPM is not flexible enough when the error distributions are a mixture of heterogeneous distributions. To overcome this limitation, we can use the weighted Dirichlet process mixture (WDPM). WDPM can be viewed as the dependent Dirichlet process (DDP), which extended DP by allowing cluster parameters to vary with some covariates (MacEachern, unpublished). The idea of WDPM was proposed by Zellner (1986) and applied by Dunson et al. (2007). The concept of WDPM comes from the idea that one can add the information provided by covariates into the construction of prior distributions. Such a concept relaxes the constraint that all of the observations share the same prior for the error-term parameter. Instead of a single prior, there are multiple candidate priors. The observations with similar predictor values (covariates) are more likely to share the same prior, which is one of the available candidate priors. Therefore, we can see that WDPM allows a higher degree of heterogeneity for the observations compared to DPM or the parametric Bayesian model. In summary, we have shown that, in the parametric Bayesian model, we have only one prior for all of the θi′ s, and such a prior must belong to some known distribution family. In DPM, there is only one prior as well, although it is not necessarily some known distribution. However, when the observations do not share the same prior distribution for θ , DPM is not appropriate. WDPM can be an efficient alternative approach when a model based on a single prior-distribution assumption fails to produce an adequate degree of accuracy. In this paper, we incorporate WDPM in semiparametric regressions and propose the dual-semiparametric model because we adopt semiparametric regression for the mean part and nonparametric Bayesian approaches for the error terms. We also propose several WDPM models using different weight functions. We derive their marginal likelihood for statistical inference. Efficient Markov chain Monte Carlo (MCMC) algorithms are provided. To the best of our knowledge, our approach is the first to incorporate WDPM in semiparametric regression, propose efficient weights, and provide the marginal likelihood derivation. Our weight functions in WDPM are compared with the weight function proposed by Dunson et al. (2007). We further compare our WDPM models with parametric and DPM models in terms of the Bayes factor using both a simulation study and real data and suggest some outperformances of our approach. This article is organized as follows. In Section 2, we briefly review the basic idea of WDPM and propose several weight functions for WDPM. In Section 3, we introduce the potential reason why WDPM can produce a better marginal likelihood based on the Pólya urn for WDPM. In Section 4, we explain how to incorporate WDPM into the semiparametric regression. In Section 5.1, the posterior computation is illustrated. In Section 5.2, the marginal likelihood computation is explained so that it can be used for statistical inference. In Section 6, we conduct a simulation study to understand the performance of our approach. In Section 7, we apply our method to the credit-rating data illustrated in Verbeek (2008). Finally, Section 8 provides our concluding remarks. ′
164
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
2. Weighted Dirichlet process mixture WDPM can be treated as a generalized DPM method that includes the covariates’ information into the prior of parameter θ . Instead of having only one unique G for all of the θi ’s, we provide multiple candidate priors (Gc1 , . . . , GcQ ) for each observation from which to select. We need to specify a location in the covariate space XS , say xcq , for each candidate prior Gcq , and all of the Gcq ’s are independently sampled from the same DP(α G0 ), where α is the constant precision parameter for the base measure G0 . Therefore, for a given observation with covariate vector x, the probability of this observation selecting Gcq as its true prior can be modeled by some function b(·, ·) of x and xcq : Pr(prior = Gcq ) = b(x, xcq ).
(2.1)
Such a function is usually built on the idea that if the covariates’ values from two observations are exactly the same or very close to each other, the probability that the two error terms parameters follow the same prior should be higher than that of the case in which the covariates of the two observations are very far away from each other. We know that having the same prior of the parameter means having the same marginal distribution on the error terms. We call this function as the weight function. Therefore, the weight function is constructed in a way that favors the assumption that error terms sharing similar covariate informations tend to share similar distributions. We can consider the following to be the weight function: c
b(x,
xcq )
2
γq e−ψ ∥x−xq ∥ = ∑ c 2 Q γl e−ψ ∥x−xl ∥
q = 1, . . . , Q ,
∀x ∈ X S ,
(2.2)
l=1
where both γl and ψ are positive parameters of the weight function. We can see that the weight is a decreasing function of the distance in the covariate vector between the given observation and the location of the qth candidate prior, which follows the basic idea in the construction of the weight function. { } Since both the number of candidate priors and the locations of these candidates affect the weights biq ≡ b(xi , xcq ) n×Q , which are finally calculated based on the n observations in the given data, how to choose the proper number and locations of the candidate priors is worthy of consideration. If the number and locations are poorly chosen, they can produce misleading weights such as in (2.2). For example, consider that most of the observations’ covariate vectors are very close to one another and form a cluster chosen while several outliers are far away from this cluster but very close among themselves, forming another cluster. In this case, the weight function will lose the power to reflect the covariate information of these data if all of the candidate locations are around the middle point of the two clusters. Dunson et al. (2007) proposed choosing Q = n and using the n observations’ covariate vectors as the locations of the candidate priors. We apply this approach in our empirical study and find that the model performance does not change significantly if we randomly sample the candidate locations rather than selecting the locations given by the observations. We also explore how the number of candidate priors affects the model performance, and we find that a small Q leads to a poor result, while the model performance becomes insensitive as Q approaches n. In other words, neither a large Q nor the randomization of the candidate locations leads to significant model improvement. Therefore, we turn to different specifications of weights and priors of hyperparameters in the weight function. The following WDPM models have been studied:
• Dunson’s WDPM: 2
c
γq e−ψ ∥x−xq ∥ c 2 γl e−ψ ∥x−xl ∥
b(x, xcq ) = ∑ n
q = 1, . . . , n,
∀x ∈ XS ,
(2.3)
l=1
which will be denoted as WDPMD;
• Efficient WDPM: In (2.3), γl and ψ are not identifiable under one of the following situations: (i) γl → 0; (ii) 1/ψ → 0 and γl ∼ O(ψ c ) for any positive c; or (iii) ψ → 0. To avoid this identifiable situation, we propose the following weight function c
2
e−ψ ∥x−xq ∥ b(x, xcq ) = ∑ , 2 n −ψ ∥x−xcl ∥ e l=1
(2.4)
which will be called EWDPM (short for efficient WDPM) in the rest of our paper;
• WDPMG: We can also consider the weight that depends on variable ψj rather than a fixed ψ for all of the observations: c
b(x,
xcq )
e−ψq ∥x−xq ∥
= ∑ n
l=1
2
c
e−ψl ∥x−xl ∥
2
.
(2.5)
The hyperparameters enter the weight function linearly in (2.3); in contrast, they enter the function exponentially in (2.5). The types of hyper priors for ψl in (2.5) have a gamma distribution. We refer to this WDPM as WDPMG; • WDPME: This is the same as (2.5) except that the hyper prior follows the exponential distribution. We refer to this WDPM as WDPME;
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
165
• WDPMH: This is the same as (2.5), except the hyper prior follows the half-Cauchy distribution (Carvalho and Polson, 2010), which is also called the horseshoes prior distribution. The horseshoes prior is a one of shrinkage priors, like with double exponential distributions, but is a robust prior on large signals from heavy-tailed distributions. We refer to this WDPM as WDPMH. The weighted DPM is a generalization of DPM in the sense that there is only one candidate prior in DPM, while there are {multiple candidates in the weighted method. We have to allocate the n observations to the Q candidate priors based } on biq n×Q in weighted models. But in DPM, we have bi1 = 1 and biq = 0 for i = 1, . . . , n and q > 1 and ignore the information provided by the predictive covariates. We will compare the performances of these WDPM models with that of the DPM model based on both simulated and empirical data. 3. Pólya Urn for WDPM For MCMC sampling, we need to generalize the Pólya urn result (Blackwell and MacQueen, 1973; MacEachern and Möller, 1998) to the WDPM model. In DPM, the conditional prior for parameter θi given the base distribution of G0 and all of the other values in θ is:
π (θi |θ −i , α ) =
α α+n−1
G0 +
1
∑
α+n−1
j ̸ =i
δθj ,
(3.1)
where θ −i is the vector of parameters excluding the ith element, and δθj is a zero-point mass at the j parameter. Let Z = (Z1 , Z2 , . . . , Zn ) be the latent vector, which indicates the allocation of the n observations to the Q candidate priors. In the DPM model, there is only one candidate prior, and Z = 1n . In the WDPM model, the number of possible outcomes of Z is Q n . The conditional prior in the WDPM model is:
π (θi |Z , θ −i , α ) =
α α+
∑
j ̸ =i
1(Zj = Zi )
G0 +
1
α+
∑
j ̸ =i
1(Zj = Zi )
∑
1(Zj = Zi )δθj .
(3.2)
j ̸ =i
Hence, the advantage is that the WDPM model can produce more flexibility in comparing (3.1) and (3.2). Suppose that θi represents the precision of the normal distribution that εi follows, i.e., εi |θi ∼ N(0, θi−1 )). We know that in DPM, θi |θ −i either comes from G0 or repeats one of the values in θ −i . However, in WDPM, θi |θ −i will only repeat one of the values that are assigned to the same candidate prior as itself. Based on the weight function, we know that if two observations have close covariate values, the chance that their error term parameters follow the same candidate prior will be higher compared to the case in which they are far away from each other. Therefore, it is more likely that θi |θ −i focuses only on G0 and the values in θ −i , whose observations are close to the ith one in terms of covariate values. This may be a more reasonable mechanism than simply taking all the values in θ −i into consideration, because we know that it is a common assumption that the two observations with the same covariate values will follow the normal distribution with the same variance. Hence, by incorporating the prior (3.2) of θi |θ −i into the derivation of the posterior of θi |θ −i for the MCMC sampling, the idea of including covariate information and favoring close candidates may lead to better model accuracy. 4. Dual-Semiparametric regression model using WDPM In this section, we explain how a dual-semiparametric regression model with continuous and ordinal response variables can be developed using WDPM. 4.1. Dual-semiparametric regression Our semiparametric regression model with a continuous response variable can then be written as (1.1), where ϵi is the error term whose distribution parameters are assumed to follow nonparametric WDPM prior in our study. When the response is ordinal, yi takes one of the ordered values {0, . . . , J − 1}. We write the model using a latent variable y∗ and ordered category cut points, c0 < c1 < · · · < cJ −2 < cJ −1 , where yi = j if cj−1 < y∗i ≤ cj , for which c−1 is normalized to minus infinity and c0 is zero. The cut points divide the support of the latent variable into a sequence of intervals. The response variable labels these intervals from 0 to J − 1, where J is the number of intervals. Using these settings, the model for the ordinal case is the same as that for the continuous case except that we replace yi with y∗i in (1.1). The ordinal outcome model can be viewed as an extension of the binary outcome case introduced in Basu and Chib (2003). The cut points are often of great interest in the ordinal model because finding the accurate cut points is crucial in producing a better marginal likelihood. Let c = (c1 , . . . , cJ −2 ) be the vector of the free cut points. For the J-outcome case, we need J − 1 cut points. We can always set c0 to 0 and adjust the intercept of the regression so there are J − 2 free cut points. Denoting a1 = log(c1 ) and aj = log(cj − cj−1 ) for 2 ≤ j ≤ J − 2, we assume a ∼ N(a00 , A00 ), for which the mean vector and variance matrix follow the setting specified by Chib and Greenberg (2010).
166
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
4.2. Model the unknown error term DPM can be applied to model the error terms in both continuous and ordinal outcome cases. For example, it can be 1 assumed in the ordinal model that εi |λi ∼ N(0, λ− i ), λi |G ∼ G and G ∼ DP(α G0 ). This is a DPM for the ordinal outcome. If G0 is directly used as the prior of λi instead of as G ∼ DP(α G0 ), we can derive that the marginal distribution of the latent variable y∗ is a Student’s t distribution if G0 is a gamma distribution. Therefore, the Student t model is a parametric Bayesian model, and the models developed based on DPM include the nonparametric Bayesian model. Extending DPM to WDPM in our study, we have: 1 εi |λi ∼ N(0, λ− i ); c λi |[(Zi = q)|θ H ] ∼ Gq ; Pr(Zi = q|θ H ) = biq ; Gcq ∼ DP(α G0 );
v v
G0 = Gamma(λi | , ), 2 2 where θ H is (ψ, γ1 , . . . , γn ) in WDPMD, ψ in EWDPMD, and (ψ1 , . . . , ψn ) in WDPMG, WDPME, and WDPMH, and v is the hyperparameter that specifies G0 . biq is b(xi , xcq ), which is specified by one of (2.3)–(2.5). 4.3. Model the unknown mean function The unknown functions can be modeled using natural cubic splines, Φmζ (w ) and Ψmζ (w ), ζ = 1, . . . , ns, where
Φmζ (w ) =
Ψmζ (w ) =
0, if w < τm−1,ζ ; −(2/h3mζ )(w − τm−1,ζ )2 (w − τmζ − 0.5hmζ ) if τm−1,ζ ≤ w < τmζ ; (2/h3mζ )(w − τm+1,ζ )2 (w − τmζ + 0.5hm+1,ζ ) if τm,ζ ≤ w < τm+1,ζ ; 0, if w ≥ τm+1,ζ , and
⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ ⎧ ⎪ ⎪ ⎨
0, if w < τm−1,ζ ; (1/h2mζ )(w − τm−1,ζ )2 (w − τmζ ) if τm−1,ζ ≤ w < τmζ ; (1/h2m+1,ζ )(w − τm+1,ζ )2 (w − τmζ ) if τm,ζ ≤ w < τm+1,ζ ; 0, if w ≥ τm+1,ζ .
⎪ ⎪ ⎩
{ }Mζ { }Mζ τζ = (τ1ζ , . . . , τMζ ζ ) are the Mζ knot points for the ζ th unknown function gζ (w), Φmζ (w) m= , Ψmζ (w ) m=1 are the two 1 types of collections of cubic splines, and hmζ = τmζ − τm−1,ζ is the width between the (m − 1)th and mth knots. We used a natural cubic spline (Green and Silverman, 1994; Wasserman, 2006). Cubic splines are piecewise cubic polynomials which have continuous second derivatives at and between knots points. Natural splines refers to the assumptions that the second derivatives of the approximating splines are set to zero at the smallest and largest knots. This assumption is necessary to determine a unique approximating function. Other splines such as B-splines did not have these boundary points assumptions. Then, the unknown function gζ (·) for the lth observation is represented as: gζ (wlζ ) =
Mζ ∑ {Φmζ (wlζ )fmζ + Ψmζ (wlζ )smζ },
(4.1)
m=1
where fζ = (f1ζ , . . . , fMζ ζ )′ and sζ = (s1ζ , . . . , sMζ ζ )′ are the coefficients of this cubic spline. These coefficients have the convenient interpretation of being, respectively, the ordinate and slope of gζ (wlζ ) at the mth knot, τmζ . Specifically, gζ (τmζ ) = fmζ and gζ′ (τmζ ) = smζ , ζ = 1, . . . , ns. Since gζ (w ) is a natural cubic spline, we need its second derivative to be 0 at the lower and upper bounds. We also require gζ (w ) to be continuous at the knot points. These all place restrictions on smζ . Using these restrictions, Lancaster
}Mζ
and Šalkauskas (1986) show that the coefficients of one type of basis function ( Ψmζ (w ) m=1 ) can be represented by the coefficients of the other type of basis function. Denoting (Φ1ζ (wlζ ), . . . , ΦMζ ζ (wlζ )) by (Φζ (wlζ )T ), we can rewrite gζ (wlζ ) as:
{
1 gζ (wlζ ) = (Φζ (wlζ )T + Ψζ (wlζ )A− ζ Cζ )fζ ,
(4.2)
where
⎡
2
⎢η2ζ ⎢0 ⎢ Aζ = ⎢ ⎢ .. ⎢ . ⎣0 0
1 2
η3ζ
0
µ2ζ 2
···
..
0 0
0 0
.
0 0
0 0 0
µ3ζ .. .
..
0 0
0 0
.
··· ··· ··· ··· ··· ···
0 0 0
0 0 0
0 0 0
ηMζ −1 ,ζ
2 1
µMζ −1 ,ζ
.. .
0
.. .
.. .
2
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
167
,
⎡
⎤
1
1
⎢− h2ζ ⎢ η ⎢− 2ζ ⎢ ⎢ h2ζ ⎢ ⎢ 0 ⎢ Cζ = 3 ⎢ ⎢ .. ⎢ . ⎢ ⎢ ⎢ 0 ⎢ ⎢ ⎣
η2ζ
h2 ζ
0
− h3 ζ η3ζ − h3ζ
.. .
0
0
µ2ζ
µ2ζ
h2ζ
0
η3 ζ h3ζ
h3ζ
− .. .
0
0
···
0
0
µ3ζ
µ3ζ
h4ζ
h4ζ
···
0
0
0
0
0
···
0
.. .
···
0
···
0
···
0
−
ηMζ −1 ,ζ
ηMζ −1 ,ζ
hMζ −1 ,ζ
hMζ −1 ,ζ
0
−
0
− 1
0
µMζ −1 ,ζ hMζ ,ζ
hMζ ,ζ
⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ ⎥ 0 µMζ −1 ,ζ ⎥ ⎥ ⎥ hMζ ,ζ ⎥ ⎥ 1 ⎦ hMζ ,ζ
,
ηmζ = hmζ /(hmζ + hm+1,ζ ), and µmζ = 1 − ηmζ . 1 Defining zlζ T = Φζ (wlζ )T + Ψζ (wlζ )A− ζ Cζ , we can finally transfer (1.1) into: y = X0 β0 + Znp f + ε, where Znp =
{
zlζ
} T
∑ns
n×
ζ = 1 Mζ
(4.3) . Although it looks the same as a linear model once the (w1 , . . . , wq ) that enter the model
non-parametrically are transformed into the Znp data matrix, there is still a difference between the coefficients of X0 and Znp , in terms of how the priors are given. For the parametric part, we assume that the prior of the coefficients is a multivariate normal whose mean vector and variance matrix are specified. For the nonparametric part, on the other hand, we assume that the prior of the coefficients is a multivariate normal, but this distribution is formulated in the following two ways: (1) f −f the differences in the slopes of the coefficients (i.e., ∆( m h m−1 )) follow a normal distribution with a zero mean and a certain m
variance (σd2 ) for the interior knots; (2) the two slopes themselves on the boundary (i.e.,
f −f f2 −f1 and M h M −1 ) follow another h2 M 2 d control the smoothness of the
normal distribution with a zero mean and a different variance (σe2 ). Note that σe2 and σ coefficients in the nonparametric part since a large variance means a possibly large jump between two adjacent coefficients, while a small variance increases the smoothness among the coefficients. The priors of σe2 and σd2 are typically inverse gamma distributions, which provide an efficient way to construct priors, especially for the nonparametric part. For example, if we consider three independent variables in the parametric part, we may specify the normal prior for these four (including the intercept) coefficients. However, when we model them into the nonparametric part, the number of coefficients will be much larger, even if there are only three unknown functions. If the number of knots for these three functions is 8, 5, and 5, for instance, the number of coefficients will be 8 + 5 + 5 − 3 = 15. It will seem dogmatic to directly specify the 15-dimension multivariate normal prior for these coefficients. If we introduce the smoothness parameters instead, then we only have to specify two inverse gamma distributions for each unknown function. Ruppert (2002) finds that within a certain range, the change of the numbers and locations of the knots does not have a significance effect on the unknown function approximation. Based on this, we assume equal bandwidth to simplify our model: hmζ =
max(wζ ) − min(wζ ) Mζ − 1
∀m = 1, . . . , Mζ ,
(4.4)
where ζ = 1, . . . , ns. 5. Bayesian inference and marginal likelihood 5.1. Prior specification and posterior computation
∑M1 ∑M2 the Znp matrix does not have full rank, we need to apply the constraint that m=1 fm1 = m=1 fm2 = · · · = ∑MSince ns f = 0, which implies that f = − (f + · · · + f ) and ζ = 1 , . . . , ns. This means that each row 1ζ 2ζ Mζ ζ m=1 m,ns vector zTlζ ’s first element is dropped and all of the other elements are subtracted by that dropped element. We denote △
that the Znp matrix is transformed in this way into Xnp . Combining X0 and Xnp , we have the coefficient matrix X = (X0 , Xnp ). This coefficient matrix enables us to write our semiparametric model in a neat form, y = X β + ε, where β is (β0T , (f21 , . . . , fM1 1 )T , . . . , (f2q , . . . , fMns ns )T )T . The parameters of our interests are the cut point vector a, the coefficient vector β, the smoothness parameters σe2 and 2 σd , parameters λi s of the error term, the allocation indicator vector Z , the hyperparameters θ H in the weight functions, and the precision parameter α for WDPM. Let (aζ e0 , bζ e0 ), (aζ d0 , bζ d0 ), (b0 , B0 ), (a0 , d0 ), (µψ , σψ2 ), (aγ , bγ ), (aψ ), and (cψ,0 , cψ,00 ) be values that specify the priors of σe2ζ , σd2ζ , β, α and θ H for WDPMD, WDPMG, WDPME, and WDPMH, respectively. Using
168
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
these notations, the prior distributions and weight functions are specified as follows:
σe2ζ ∼ IG(aζ e0 , bζ e0 ), σd2ζ ∼ IG(aζ d0 , bζ d0 ), β ∼ N(b0 , B0 ), Pr(Zi = q|θ H ) = biq , ⎧ c 2 ⎪ γq e−ψ ∥xi −xq ∥ ⎪ ⎪ ⎪ if WDPMD; ⎪ 2 ∑n ⎪ −ψ ∥xi −xcl ∥ ⎪ γ e ⎪ l l = 1 ⎪ ⎪ c 2 ⎨ e−ψ ∥xi −xq ∥ = ∑n −ψ ∥xi −xc ∥2 if EWDPM; ⎪ l ⎪ ⎪ l=1 e ⎪ 2 ⎪ −ψ xi −xcq ∥ ⎪ ∥ q ⎪ e ⎪ ⎪ ⎪ ⎩ ∑n −ψl ∥xi −xc ∥2 if WDPMG or WDPME or WDPMH, l l=1 e ⎧ n ∏ ⎪ ⎪ 2 ⎪ log-N( ψ|µ , σ ) × Gamma(γq |aγ , bγ ) if WDPMD; ⎪ ψ ψ ⎪ ⎪ ⎪ q=1 ⎪ ⎪ ⎪ ⎪ log-N(ψ|µψ , σψ2 ) if EWDPM; ⎪ ⎪ ⎪ n ⎪ ∏ ⎪ ⎪ ⎨ Gamma(ψq |aγ , bγ ) if WDPMG; θH ∼ q=1 ⎪ n ⎪ ⎪ ∏ ⎪ ⎪ Exp(ψq |aψ ) if WDPME; ⎪ ⎪ ⎪ ⎪ ⎪ q=1 ⎪ ⎪ n ⎪ ∏ ⎪ ⎪ ⎪ C+ (ψq |cψ,0 , cψ,00 ) if WDPMH, ⎪ ⎩ q=1
λi |(Zi = q) ∼ Gcq , Gcq ∼ DP(α G0 ), q = 1, . . . , Q , v v G0 = Gamma(λi | , ), 2 2
α ∼ Gamma(a0 , d0 ), a ∼ N(a00 , A00 ).
Suppose there are d distinct values in Z and (n1 , . . . , nd ) denotes the numbers of observations that come from these d candidate priors. Suppose the unique values among the λi s that share the same candidate prior Gcr (r = 1, . . . , d) are kr . We ∑d also denote the total number of distinct values in (λ1 , . . . , λn ) as k (i.e. k = r =1 kr ) and the distinct values in (λ1 , . . . , λn ) ∗ ∗ as (λ1 , . . . , λk ). The indicator vector that allocates (λ1 , . . . , λn ) to the k distinct values is denoted as S. Let π (θ H ) be the prior for θ H and w (Z|θ H ) be the corresponding weight function. Based on these notations, the complete joint density will be proportional to:
⎡ ⎤⎡ ⎤ ⎤⎡ ( ) aζ2e0 −1 bζ e0 ( ) aζ2d0 −1 bζ d0 )( ) Mζ2−3 ( ns ns ns [ ] − − ∏ ∏ ∏ 1 1 1 ⎢ 1 ⎥ − 12 (β−b0 )T B−0 1 (β−b0 ) 2σ 2 ⎥ ⎢ 2σ 2 ⎥ ⎢ dζ ej e e e ⎦⎣ ⎦ ⎦ ⎣ ⎣ 2 σeζ σd2ζ σe2ζ σd2ζ ζ =1 ζ =1 ζ =1 × [w (Z |θ H )][π (θ H )] × [α a0 −1 e−d0 α ]Pr(k1 , . . . , kd |α, S , Z )Pr(S |k)
k ∏
dG0 (λ∗l )
l=1
×
[ n ∏
1 2
λi e
] − 21 λi (y∗i −xTi β)2
i=1
⎡ × [e
1 − 12 (a−a00 )T A− (a−a00 ) 00
]⎣
J −1 ∏ j′ =0
⎤ 1{
cj′ −1
⎦ {yi =j′ } .
}1
It should be noted that the second line of this joint density is necessary only for WDPM models. For the Student’s t or DPM model, it should be dropped to or fixed at 1. The third line of this density can be simplified as [α a0 −1 e−d0 α ]Pr(k|α, S)Pr(S |k) ∏k ∗ dG in the DPM model. As for the Student’s t model, we know that α will go to ∞, and this line can be further 0 (λl ) ∏ l=1 n simplified as l=1 dG0 (λ∗l ). The fourth line is the likelihood of the response (or latent response), given all of the parameters, and this part appears in all the models’ full likelihoods. The last line is necessary only for the ordinal outcome case and, in the continuous outcome model, it should be dropped to or fixed at 1. In addition, for the continuous response case,
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
169
we do not have to assume zero means for the error terms. By defining φi = (µi , σi2 ), we can have εi |φi ∼ N(µi , σi2 ) and G0 = N(µi |0, σi2 ) × inv gamma(σi2 | 2a , 2b ). We also define φi to be λi in the ordinal response case. Therefore, φi will be a parameter(s) whose prior can be constructed by DPM or WDPM in both continuous and ordinal cases. Our MCMC can be viewed as the extension of the framework of the DPM MCMC sampling. We use the ordinal case for illustration purposes because it can be straightforwardly generalized to the continuous case. Once we have sampled a using Metropolis–Hastings (M–H) and β, σe2 , and σd2 using Gibbs samplings, steps 1–3 are applied and briefly described (see the detailed MCMC sampling procedures in Appendix A); Step 1: Sample the error term parameters λi s and the allocation indicator vector Z . It seems convenient to directly sample Z based on the weights (i.e., the prior of Z ). Unfortunately, this is not applicable if Pr(k1 , . . . , kd |α, S , Z ) is not a constant of Z ; hence, we conduct the following steps (a)–(c). (a) Sample S based on the generalized Pólya urn result introduced in Section 3; see the detailed MCMC sampling procedures for this step in Appendix A. (b) Update the k distinct values in the λ vector based on its posterior; these are simply several Gamma distributions. (c) Update Z . We denote C = (C1 , . . . , Ck ) as the allocation of the k distinct values sampled in (b) to the n candidate priors. For example, Ch = q means all of the observations that are assigned to the hth unique value of the number, with the qth candidate prior. We can see that updating C is equivalent to updating Z because the observations that share the same unique value of λ must share the same candidate prior (this is the very important fact that was introduced in Section 3). The following posterior of C is true for any WDPM (not only the ones we propose in our study) and can be viewed as the global rule for sampling Z in WDPM: n(h)
Pr(Ch = q|C−h , S , y, α, θH ) ∝
∏ Γ [α + nq(−h) ] biq . Γ [α + nq(−h) + n(h) ]
(5.1)
i=1
In (5.1), nq(−h) is the number of observations that select the qth candidate and are not assigned to the hth unique value in λ. This number is fixed once C−h is given. n(h) is the number of observations that are assigned to the hth unique value. Step 2: Sample the hyperparameters that are used to construct the weight functions. For (2.4) and (2.5), we can apply M–H in sampling ψ . For (2.3), several Gibbs sampling steps will be added to sample γj s according to the method proposed by Dunson and Stanford (2005) and Holmes and Held (2006) and introduced in Dunson et al. (2007). Step 3: Sample precision parameter α , which can be viewed as the extension of the result in Escobar and West (1995).
5.2. Marginal likelihood computation for WDPM The marginal likelihood computation enables us to make a model comparison through which we select the best model among the feasible choices. In this section, we explain how to derive marginal likelihood for WDPM, which is the generalized form of DPM and the extension of the work by Basu and Chib (2003). Let f (y |X ) be the marginal likelihood. Define Θ = (σe2 , σd2 , a, α, β). Then log of the marginal likelihood can be written as: log{f (y |X )} = log{f (y |X , Θ )} + log{π (Θ )} − log{f (Θ |y , X )},
(5.2)
where π (Θ ) is calculated using the prior distribution of Θ , which is simply plugging the posterior mean of Θ into the priors. The calculation of the likelihood f (Θ |y , X ) is straightforward as well. However, the calculation of f (y |Θ , X ) needs extra MCMC sampling by applying the method introduced in Chib (1995). This computation is the most challenging part of the marginal likelihood computation and is essentially realized by sequential importance sampling. We provide the algorithm for computing the marginal likelihood for the ordinal outcome case. Let (c1∗ , . . . , cJ∗−2 ; β∗ ; ∗ b1 , . . . , b∗n ) be the estimation of the cut points, the coefficients in the semiparametric model, and the weight vectors for the n observations (i.e., b∗i = (bi1 , . . . , biQ )T ), and G0 =gamma( v2 , v2 ). The marginal likelihood can be calculated using the following steps M1-M5: Step M1: Set u1 = Tv (cy∗1 − xT1 β∗ , 1) − Tv (cy∗ −1 − xT1 β∗ , 1). Draw y∗1 ∼ tv (xT1 , 1)1(cy∗ −1 , cy∗1 ), Z1 ∼multinom(b∗1 ), and S1 = 1. 1
1
Tv (·) is the cumulative distribution function of the Student’s t distribution with the degree of freedom v . Step M2: For each i observation, Zi ∼multinom(b∗i ); we denote the observations {1, . . . i − 1} that share the same value Z with observation i, as in (i − 1) ∼ Zi , and check the S values for these observations. Suppose that there are k(i−1)∼Zi distinct values of S among these observations. For the rth value of these k(i−1)∼Zi distinct values, we denote it as S(i−1)∼Zi ,r and stand for n(i−1)∼Zi ,r as the number of observations that fall into this cluster. Set
170
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
a(i−1)∼Zi ,r = v + n(i−1)∼Zi ,r and b(i−1)∼Zi ,r = v + ui =
α∗ α∗ ×
+
∑
j
α∗ +
j
l∈(i−1)∼Zi Sl =S(i−1)∼Z ,r i
(y∗l − xTl β∗ )2 . Then, we have:
[Tv (cy∗i − xTi β∗ , 1) − Tv (cy∗i −1 − xTi β∗ , 1)] k(i−1)∼Z
1
∑
∑
i
∑ 1(Zj = Zi )
1 n(i−1)∼Zi ,r [Ta(i−1)∼Z ,r (cy∗i − xTi β∗ , a− (i−1)∼Z ,r b(i−1)∼Zi ,r ) i
i
r =1 ∗ −1 a(i−1)∼Z ,r b(i−1)∼Zi ,r ) i
− Ta(i−1)∼Zi ,r (cy∗i −1 − xTi β ,
].
Step M3: Sample the S value for the ith observation. Draw y∗i ∼
α∗ α∗ + +
∑
j
1(Zj = Zi )
tv (xTi β∗ , 1) k(i−1)∼Z
i
1
α∗ +
∑
j
∑ 1(Zj = Zi )
∑
1
Si =
⎪ ⎩
r
with p ∝
j
= Zi ) ̸= 0, obtain
n(i−1)∼Zi ,r
α∗
+
i
i
r =1
which is truncated to (cy∗ −1 , cy∗1 ), and, if
⎧ ⎪ ⎨
1 n(i−1)∼Zi ,r ta(i−1)∼Z ,r (xTi β∗ , a− (i−1)∼Z ,r b(i−1)∼Zi ,r ),
∑
ki−1 + 1 with p ∝
j
1(Zj = Zi )
α∗ +
1 ta(i−1)∼Z ,r (y∗i |xTi β∗ , a− (i−1)∼Z ,r b(i−1)∼Zi ,r )
∑
j
i
i
α∗
1(Zj = Zi )
∗
xTi
tv (y |
β , 1), ∗
∑
if j
i = 1, . . . , n,
(6.1)
where n is the sample size and the underlying forms of the unknown functions are given by: g1 (w1 ) = 8w1 + sin(4πw1 ), g2 (w2 ) = −1.5 − w2 + exp[−30(w2 − 0.5)2 ], and g3 (w3 ) = 6w33 (1 − w3 ). To generate error terms, we use G0 = N(µi |0, σi2 ) × inv-Gamma(σi2 |2.25, 0.75) as G0 in the continuous response case. We use Gamma(2.5,2.5) as G0 to generate the λi s in the ordinal response case. For the Student’s t model, G0 is directly used as the prior of error term parameters. For other models, which include sampling one or multiple distributions from this base distribution, the sampling is conducted by using the ‘‘stick-breaking’’ algorithm (Sethuraman, 1994). In the DPM model, only one distribution is sampled, and φi s are independently generated from this sampled G(·). In the weighted models, multiple candidate prior distributions are sampled independently from G0 for each model. The entire process of generating the error terms for WDPM can be summarized as: Step S1: Sample the candidate priors of φi independently from G0 ; Step S2: Sample the hyperparameters in the weight functions based on the assumed hyper priors, and calculate the weights using the specified weight function; Step S3: For each observation, randomly choose the prior of φi from the candidates based on the weights; Step S4: Sample φi based on its prior; Step S5: Sample εi based on εi |φi ∼ N(µi , σi2 ).
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
171
The hyperparameters for the weight function are sampled via:
θH ∼
⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨
log-N(ψ|0.262, 1) ×
n ∏
Gamma(γq |0.1, 100) if WDPMD;
q=1
log-N(ψ|0.262, 1) if EWDPM; n ∏
Gamma(ψq |2, 1) if WDPMG;
q=1
⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩
n ∏
Exp(ψq |2.42) if WDPME;
q=1 n ∏
C+ (ψq |0, 0.3) if WDPMH.
q=1
The precision parameter α is chosen to be 5 in the ‘‘stick-breaking’’ algorithm, and the numbers of knots for the three unknown functions are (8, 5, 5). We let the knots be equally spaced and set the smallest and largest knots, respectively, to the minimum and maximum of each covariate. Ruppert (2002) finds that function approximations are not very sensitive to the number of knots beyond some minimum and that an excessive number of knots can worsen the mean squared error. Our work with simulated and real data supports this conclusion. A simple strategy that we find useful is to start with five knots for each function and then to incrementally adjust the number of knots on the basis of the model marginal likelihoods. This exploration can be done quickly with a Gaussian error model because, in our studies, we find that the choice of the number of knots is not very sensitive to the distribution of the error term. Our prior distributions for WDPMD and EWDPM are log-N(ψ|0.262, 1) because we want to have the mean of ψ to be around 2 and a relatively large prior variance (greater than 7). Choosing a large scale parameter for Gamma, we want γq to have a mean approximately 1 while maintaining a large variance as well. For WDPMG, since there will be multiple exponential parameter ψq s, we choose a prior with a mean of 2 and moderate variability for ψq . It is found in practice that using a moderate variance, rather than a large variance, can produce more stable results in terms of parameter estimates and marginal likelihood. As similar idea (moderate variability of ψq ) is applied to WDPME as well. For WDPMH, the scale parameter is small because Cauchy distribution is already a heavy-tail distribution. We can still control the variability by using a small scale. We choose these hyperparameters values from our application. We apply MCMC sampling and the marginal likelihood calculation introduced in Sections 5.1 and 5.2 to complete a model comparison. The first 2500 draws of the MCMC are discarded, and the following 20,000 draws are used to derive a posterior mean. This setting of burn-in and the maximum number of iterations are applied to the MCMC of each model fitting in both the simulation and empirical study. We compile C++ functions in R and utilize parallel computation on multiple central processing unit (CPU) cores to improve computation efficiency. It takes 49 min to complete the MCMC sampling and marginal likelihood calculation of a simulated continuous-response data set with n = 1000 fitted by EWDPM on a single core of an Intel Xeon E5-2687 3.10 GHz CPU. Our code is available upon request from the authors and in online supplementary materials of the journal site. In both the continuous and ordinal case, we generate 200 data sets from each candidate model and fit each data set by the seven candidate models. In the continuous response case, we set n = 1000; in the ordinal response case, we set n = 500 because the sampling cut points requires likelihood maximization in each MCMC iteration. For each combination of a true and fit model, we record the log of marginal likelihood (log(ML)) of each fitting. For each true model, we can compare the fit models through log(ML) based on the 200 repetitions. We find that the paired-t test is not proper because the differences in log(ML) violate the normality assumption. Therefore, we adopt the Wilcoxon signed rank test. Table 1 contains the results of the continuous case. The purpose of this table is to determine whether the data sets generated from a true model can be best explained by the same model. Here, ‘‘best explained’’ means producing the maximum marginal likelihood. In each row of the table, the fit model that produces the highest average log(ML) is labeled as the ‘‘best model’’. In each cell of this table, we display the average log(ML), along with the p-value of the signed rank test, which compares the corresponding fit model to the best model of the same row. In Table 1, we can see that all of the true models are best explained by themselves, except for WDPMD, where WDPMG is the best fit model. We can also observe that the best models in most rows are significantly better than other models of the same row. The exception is in the WDPMD row, where DPM does not significantly deviate from WDPMG, and in the WDPME row, where WDPME is not significantly better than EWDPM. In the ordinal case, we set cut points (6.31, 9.04, 11.95) for the seven types of true models. Similar to Table 1, the results are displayed in Table 2. We can see that four out of the seven true models can be best explained by themselves. For the other three cases, the best explanatory model is WDPMH. We can also observe that the overall performance of WDPMH is better than other models because, even in the four cases where WDPMH is not the best, this model is either closest to the best models (the first and third rows) or produces no significant deviation from the best models (the second and fourth rows). Regarding the conclusion of the simulation study shows that when the error terms are simulated by WDPM, we do need weighted models to guarantee an accurate fitting.
172
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
Table 1 Simulation results for continuous case. We generate 200 data sets from each candidate model and fit each data set by the seven candidate models. n = 1000. In each row of the table, the fit model which produces highest average log(ML) is labeled as ‘‘best model’’ and two other best models are marked as bold letter. In each cell of this table, we display the average log(ML) along with the p-value of signed rank test which compares the corresponding fit model to the best model of the same row. t = Student t model, DPM = Dirichlet Process Mixture, WDPMD = Dunson’s WDPM, EWDPM = our Efficient WDPM, WDPMG = our WDPM with gamma distribution as hyper prior for ψ , WDPME = our WDPM with exponential distribution as hyper prior for ψ , WDPHM = our WDPM with half Cauchy distribution as hyper prior for ψ . True model
Fit model
Student-t
Student-t
DPM
WDPMD
EWDPM
WDPMG
WDPME
WDPMH
−685.14
−690.86 (2.651e−06) −693.67
−695.53 (7.693e−11) −697.11 (5.233e−05) −700.75 (1.055e−04) −697.51 (7.166e−05) −694.24 (6.740e−10) −694.29 (3.824e−06) −693.62 (5.091e−10)
−689.70 (1.992e−05) −697.43 (4.415e−05) −702.34 (7.188e−05) −694.08
−691.44 (3.759e−06) −709.22 (2.989e−16) −700.01
−690.69 (6.017e−06) −698.30 (1.790e−09) −701.57 (1.522e−05) −695.25
−689.81
−690.44 (4.581e−06) −697.21 (1.656e−07) −708.82 (3.954e−15) −696.17 (1.795e−06) −692.79 (9.953e−08) −693.75 (2.694e−04) −687.24
(0.0018)
best model
best model
−716.06 (3.039e−15) −722.93
DPM WDPMD EWDPM
(0)
(0.3556)
−718.12
−696.72 (1.937e−06) −691.36 (3.587e−07) −694.60 (8.005e−06) −691.21 (3.249e−08)
(0)
−714.30
WDPMG
(0)
−714.03
WDPME
(0)
−717.09
WDPMH
best model
−700.30
(0)
best model
−695.73
best model
(0.1914)
(0.2651)
−693.10 (2.601e−08) −692.96
−689.68
−691.62 (9.267e−07) −692.27
(0.4117)
−689.38 (7.100e−06)
best model
−694.63 (7.580e−06) −689.09 (4.293e−05)
best model
Table 2 Simulation results for ordinal response case. We generate 200 data sets from each candidate model and fit each data set by the seven candidate models. n = 500. In each row of the table, the fit model which produces highest average log(ML) is labeled as ‘‘best model’’ and two other best models are marked as bold letter. In each cell of this table, we display the average log(ML) along with the p-value of signed rank test which compares the corresponding fit model to the best model of the same row. t = Student t model, DPM = Dirichlet Process Mixture, WDPMD = Dunson’s WDPM, EWDPM = our Efficient WDPM, WDPMG = our WDPM with gamma distribution as hyper prior for ψ , WDPME = our WDPM with exponential distribution as hyper prior for ψ , WDPHM = our WDPM with half Cauchy distribution as hyper prior for ψ . True model
Fit model
Student-t
Student-t
DPM
WDPMD
EWDPM
WDPMG
WDPME
WDPMH
−263.57
−267.82 (4.518e−09) −257.81
−269.10 (5.097e−13) −271.32 (6.988e−07) −267.59
−265.54 (3.383e−06) −270.53 (1.136e−06) −269.41 (9.613e−05) −265.53
−266.21 (7.732e−07) −269.94 (7.094e−06) −269.27 (2.618e−04) −267.94 (9.738e−06) −263.47
−265.05 (1.937e−05) −268.09
−264.31 (6.895e−04) −267.13 (1.790e−05) −265.91
−267.49 (2.108e−08) −269.73 (5.628e−06) −270.01 (2.035e−05) −268.63 (3.944e−06) −265.12 (7.252e−05) −266.36 (8.611e−05) −266.08
(0.1625)
(0.0537)
best model
−274.02 (3.701e−11) −280.91
DPM WDPMD
(0)
−274.68 (2.793e−14) −272.22 (2.781e−12) −274.60
EWDPM WDPMG WDPME
(0)
−278.19
WDPMH
(0)
best model
−272.67 (8.379e−07) −266.04 (0.3551)
−264.53 (4.750e−05) −269.81 (9.487e−07) −469.89 (5.842e−06)
best model
−268.86 (2.627e−06) −264.74 (2.338e−04) −267.61 (4.324e−05) −266.73 (1.168e−04)
best model
(0.2323)
−268.75 (4.023e−04) −266.15 (0.2954)
−262.89
(0.3817)
best model
−265.93
−265.22
(0.4174)
best model
−266.20 (6.734e−04)
−265.74 best model
It is also worth mentioning that the concentration parameter α is severely underestimated in the ordinal case. For example, the posterior mean of α in the continuous case is 5.13 by the DPM model, but it is 1.15 when it comes to the ordinal case by the same model. The true value of α is 5, so the model comparison in the ordinal case is less convincing than that in the continuous case since the parameter estimation is less accurate. We calculated the marginal likelihood described in Section 5.2 and then maximized it. In ordinal case, the vector a needs to be sampled using Metropolis–Hastings. We propose anew ∼ t15 (a|m, V ), where m = argmax log Pr(y|λ, β, a) a
{ V =
−∂ log Pr(y|λ, β, a) ∂ a∂ aT
}−1
.
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
173
Fig. 1. Plots of posterior density curves of λt (t = 1, . . . , n). In each of the panels, there are n curves, where n is the sample size. Each curve represents the posterior density of an observation’s λ value based on 20,000 MCMC draws. The y-axis is kernel density estimated value, and x-axis is the range of λt ’s. In this graph: Panel A is the result of data generated by Student-t fitted by Student-t (continuous response and n = 1000); Panel B is the result of data generated by Student-t fitted by DPM (ordinal response and n = 1000); Panel C is the result of data generated by WDPMG fitted by WDPMG (continuous response and n = 1000); Panel D is the result of data generated by DPM fitted by DPM (ordinal response and n = 500).
We then update a by anew with probability αMH , where
αMH
} π (anew )Pr(y|λ, β, anew )t15 (a|m, V ) = min ,1 , π (a)Pr(y|λ, β, a)t15 (anew |m, V ) {
and derive the vector of free cut-points c based on what has been specified in the paper: Denote a1 = log(c1 ) and aj = log(cj − cj−1 ) for 2 ≤ j ≤ J − 2. The optimization approach for maximizing log Pr(y|λ, β, a) is the Newton method. Specifically, we used the ‘‘nlm’’ function in R to implement. Note that this step of sampling is an extra computation burden caused by the ordinal case because in the continuous response y case we do not need cut-points. Since DPM and WDPM are essentially clustering observations by unique values (in λ for this study), we are interested in determining whether the clusters do exist in the posterior densities of (λ1 , . . . , λn ). Fig. 1 displays such results in four cases. Panel A is the result of data generated by the Student’s t model fitted by the Student’s t model (continuous response and n = 1000); Panel B is the result of data generated by the Student’s t model fitted by DPM (ordinal response and n = 1000); Panel C is the result of data generated by WDPMG fitted by WDPMG (continuous response and n = 1000); and Panel D is the result of data generated by DPM fitted by DPM (ordinal response and n = 500). We can see that data generated from DPM or WDPMP (Panel C or Panel D) display a clearer pattern of multiple clusters compared to data generated from the Student’s t model (Panel A or Panel B). There are at least two major modes in Panel C or Panel D and a clear trough between the two modes. Even if we use DPM to fit the Student’s t data, we cannot see such significant visual evidence of multiple clusters (in Panel B). We can conclude that the existence of multiple clusters depends more on the true model than on the fitting approach. 6.2. Correlated covariates The simulation study for this case is similar to the independence covariates case except for the following setting. The true model is: yi = 0.4 + 0.2xi + g1 (wi1 ) + g2 (wi2 ) + g3 (wi3 ) + εi
i = 1, . . . , 1000.
Instead of assuming that w1 w2 and w3 are independent and simply sampled from Uniform(0,1), we set as w1 = 0.1 cos(u1 ) + 0.2z, w2 = 0.2 sin(z) exp(−u2 ) + 0.4u3 , w3 = −0.4z sin(u3 ) + 0.3u4 exp(−u5 ), where u1 , u2 , u3 , u4 , u5 are independently sampled from Uniform(0,1) and z ∼ N(0, 1). Then the three unknown functions are specified as: g1 (w1 ) = 8w1 + sin(4πw1 ), g2 (w2 ) = 2 cos(w2 ) − 3w22 + 4w2 sin(w2 ), g3 (w3 ) = 1.8 exp(−[w3 − 0.5]2 ) cos(w 3) + 4w33 sin(w3 − 1).
174
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181 Table 3 Simulation results for continuous case where independent covariates are correlated. We set n = 500 and generate 200 data sets from each of the 7 true models and fit each data set by these 7 candidate models. For each data set fitted by each model, we let the fitting model provide the 95% credible interval for E(y∗ |y) = E(x′ βp |y) + E(g1 |y) + E(g2 |y) + E(g3 |y) for each observation. Based on the 500 observations’ credible intervals, we calculate the average length of credible interval and cover rate (the percentage of the 500 intervals that cover the corresponding true values). Therefore we can return 200 average lengths and cover rates for each of the 49 cases. Each cell’s top value is the average cover rate based on the 200 repetitions, and the bottom value is the mean of the 200 average credible interval lengths. Each true model’s (each row’s) shortest credible interval length is labeled by blue; each true model’s highest cover rate is labeled by red.
Hence, w1 , w2 , w3 are correlated and g1 , g2 , g3 take more complex forms compared to the previous section. Based on 50,000 sampled observations, the correlation matrix of w1 , w2 , w3 is: 1.000 0.526 −0.843
[
0.526 1.000 −0.444
] −0.843 −0.444 . 1.000
The correlation matrix of g1 , g2 , g3 is: 1.000 −0.508 −0.306
[
−0.508 1.000 0.474
] −0.306 0.474 . 1.000
The simulation results are summarized in Tables 3 and 4 for continuous and ordinal responses, respectively. By following a referee’s suggestion, we provide the credible intervals and coverage probability instead of the marginal likelihood in this section. Since we found that the marginal comparison results are similar to those of the previous section, it would better to provide other important information instead of reporting similar results. For a continuous case, we set n = 500 and generate 200 data sets from each of the 7 true models and fit each data set by these 7 candidate models. For each data set fitted by each model, we let the fitting model provide the 95% credible interval for E(y∗ |y) = E(x′ βp |y) + E(g1 |y) + E(g2 |y) + E(g3 |y) for each observation. Based on the 500 observations’ credible intervals, we calculate the average length of the credible interval and coverage rate (the percentage of the 500 intervals that cover the corresponding true values). Such average length and coverage rates can be viewed as important metrics on how the model can capture the complexity of actual error terms. Therefore, we can return 200 average lengths and coverage rates for each of the cases where one type of true model is fitted by one type of candidate model. In Table 3, we can see that the Student-t always provides the shortest credible interval for when except the true model is the DPM type. This means that the Student’s t model captures the error terms with the least variability when the covariates are correlated in the continuous response case. However, the Student’s t model (as the fitted model) fails to provide more accurate coverage of the true values of E(y∗ |y) when compared to other types of fit models. When it comes to accuracy of the credible interval coverage, in three among the 7 true model cases, the WDPMH fit model produces the highest coverage rate. Therefore, in general, the credible intervals generated by the WDPMH fit model can cover the true values more frequently, while the Student’s t model’s credible intervals are usually shorter than the DPM and WDPM type models. For the ordinal response case, we set n = 500 again and generate 200 data sets from each of the 7 true models and fit each data set by these 7 candidate models. For each data set fitted by each model, we calculate the marginal likelihood. The comparison procedures are the same as in Section 6.1. In Table 4 we can see that the Student’s t, DPM, and WDPMD models fail, regardless of what the true model is. They are always significantly dominated by other types of WDPM fit models (one can see that the p-values of these three types of fit models are always equal to 0). When the true models are the Student’s t, DPM, WDPME and WDPMH, the best fit model, in terms of marginal likelihood, is WDPME; when the true models are from WDPMD, EWDPM, and WDPMG, the best fit model is WDPMG. However, we can also observe that EWDPM and WDPMH, as
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
175
Table 4 Simulation results for ordinal response case where independent covariates are correlated. We generate 200 data sets from each candidate model and fit each data set by the seven candidate models. n = 500. In each row of the table, the fit model which produces highest average log(ML) is labeled as ‘‘best model’’. In each cell of this table, we display the average log(ML) along with the p-value of signed rank test which compares the corresponding fit model to the best model of the same row. t = Student t model, DPM = Dirichlet Process Mixture, WDPMD = Dunson’s WDPM, EWDPM = our Efficient WDPM, WDPMG = our WDPM with gamma distribution as hyper prior for ψ , WDPME = our WDPM with exponential distribution as hyper prior for ψ , WDPHM = our WDPM with half Cauchy distribution as hyper prior for ψ . True model
Student-t DPM WDPMD EWDPM WDPMG WDPME WDPMH
Fit model Student-t
DPM
WDPMD
EWDPM
WDPMG
WDPME
WDPMH
−257.08
−258.80
−256.95
−254.69
−254.76
−254.61
−254.83
(0)
(0)
(0)
(0.6266)
(0.4381)
best model
(0.4820)
−258.15
−259.65
−257.86
−255.85
−255.91
−255.74
−255.91
(0)
(0)
(0)
(0.8433)
(0.1827)
best model
(0.4201)
−257.73
−259.37
−257.51
−255.51
−255.38
−255.50
−255.45
(0)
(0)
(0)
(0.1602)
best model
(0.5531)
(0.6536)
−257.72
−259.38
−257.53
−255.48
−255.17
−255.37
−255.47
(0)
(0)
(0)
(0.0427)
best model
(0.8947)
(0.0413)
−257.11
−258.62
−256.81
−254.78
−254.51
−254.87
−254.78
(0)
(0)
(0)
(0.9465)
best model
(0.0274)
(0.1361)
−257.95
−259.58
−257.70
−255.64
−255.69
−255.60
−255.69
(0)
(0)
(0)
(0.9001)
(0.3577)
best model
(0.5508)
−255.84
−257.46
−255.64
−253.59
−253.55
−253.47
−253.51
(0)
(0)
(0)
(0.9192)
(0.9739)
best model
(0.3064)
fit models, generally produce no significant difference other than best models based on p-values. Therefore, the conclusion for the ordinal case in which the predictive covariates are correlated, EWDPM, WDPMG, WDPME, and WDPMH are the four competitive fit models, while among these four models WDPMG and WDMPE are the best based on marginal likelihood in this simulation case. 7. Application: credit rating data We apply the WDPM methods to the credit rating data illustrated in Verbeek (2008). We have the observed values of the 921 farms’’ Standard & Poor’s credit rating (the response), book value of debt divided by assets (w1 ), earnings before interest, taxes divided by total assets (w2 ), log of sales (w3 ), retained earnings divided by total assets (w4 ) and working capital divided by total assets (w5 ). The response values are ordinal (from 0 to 4). The five (ζ = 1, . . . , 5) covariates (w1 , . . . , w5 ) will enter the model nonparametrically and there is only the intercept for the parametric part. For EWDPM and WDPMD, the prior for ψ is specified as log N(0.262, 1). In WDPMD, the prior for γq ’s is given as gamma(1, n), where n is the number of observations of this empirical data set. The three types of priors for ψq s in WDPMG, WDPME, and WDPMH are, correspondingly, Gamma(2,1), Exp(2.42), and C + (0, 0.03). For the prior of cut-point vector a, which is N(a00 , A00 ), we set a00 = (0.916, 0.693,{1.386) }to make the cut-point prior mean after the exponential transformation as (2.5, 4.5, 8.5). We choose A00 = diag 21 , 14 , 61 because controlling the scale of variance at this level and allowing the variances to be arranged in a decreasing pattern can help to increase the efficiency of the sampling for cut-points. Further, (aζ e0 , bζ e0 , aζ d0 , bζ d0 ) is specified as (2.04, 1.04, 2.04, 1.04) because the mean of the inverse-Gamma distribution for σe2ζ and σd2ζ is controlled around 1 while the variance can be large. We use a flat prior for β0 because we do not have any prior information regarding to the value of the intercept. The prior for the concentration parameter α is Gamma(1,0.2). Although the prior expectation is 5, which means that we expect the error terms to display a certain degree of a cluster property, the prior will have little impact on the posterior mean because the latter one will be basically determined by how many clusters the observations actually display. We apply our MCMC algorithm and the marginal likelihood computation described in Section 5.2 to perform a model comparison. We consider using different numbers and locations of candidate priors to seek model performance. Randomized location means that we do not use the observations’ locations as the candidate locations. Instead, we derive the range of each covariate based on the data and uniformly sample the five components of the locations from their corresponding range. Each candidate’s location is independently sampled in this way. We consider such a randomized location prior for Q = 10, 200, 1000, and 1200. We also consider using subsets of the observations rather than all to specify location. In Table 5, WDPMD200ob means that we use WDPMD and randomly select 200 observations’ locations as the candidate locations. WDPMD1000r means that we randomly sample 1000 locations as the candidate locations. Using the number of knots equal to 4, 5, and 6 for each unknown function and applying WDPMD and WDPMG, we consider different combinations of Q , location, number of knots, and the WDPM prior. In Table 5, we display the comparison results of these combinations by using the Bayes factor. We set the Student’s t model by using 4 knots for each unknown function as the benchmark model (i.e., its Bayes factor is 1) and calculate other models’ Bayes factors with respect to it. The Bayes factor is essentially the ratio of marginal likelihood.
176
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181 Table 5 Bayes factors of WDPMD and WDPMG with different Q and candidate locations with respect to Student-t model using 4 knots for each unknown function. ‘‘r’’ means randomized locations, and ‘‘ob’’ means observations’ locations. -Q means Q candidate priors. Candidate models
Number of knots 4
5
6
t WDPMD-10r WDPMD-200r WDPMD-1000r WDPMD-1200r WDPMD-10ob WDPMD-200ob WDPMD-921ob
1.000 0.009 0.017 0.047 0.051 0.011 0.017 0.069
120.226 4.365 8.318 12.882 7.413 2.818 3.311 30.903
114.815 0.224 0.468 0.603 0.603 0.550 0.759 1.660
WDPMG-10r WDPMG-200r WDPMG-1000r WDPMG-1200r WDPMG-10ob WDPMG-200ob WDPMG-921ob
0.013 0.018 0.102 0.040 0.004 0.019 0.229
5.888 13.183 14.791 15.488 7.413 20.417 34.674
0.513 1.096 1.698 1.514 0.562 0.813 1.175
Table 6 Bayes factors of candidate models with respect to Student-t model using 4 knots for each unknown function. t = Student t model, DPM = Dirichlet Process Mixture, WDPMD = Dunson’s WDPM, EWDPM = our Efficient WDPM, WDPMG = our WDPM with gamma distribution as hyper prior for ψ , WDPME = our WDPM with exponential distribution as hyper prior for ψ , WDPHM = our WDPM with half Cauchy distribution as hyper prior for ψ . Candidate models
t DPM WDPMD EWDPM WDPMG WDPME WDPMH
Number of knots 4
5
6
1.000 0.010 0.069 0.372 0.229 0.246 0.676
120.226 5.623 30.903 47.863 34.674 13.490 56.234
114.815 0.457 1.670 20.893 1.175 0.575 5.129
In Table 5, we can see that Q = 10 leads to the worst model performance in each column. Q = 200 results in model improvement, but the marginal likelihoods are still significantly worse than that of the Student’s t model. If we use Q = 1000 for randomized locations or Q = 921 for observations’ locations, the model performances will be better than those with Q = 200, but such improvement is not adequate because the Bayes factors of the Student’ t model with respect to the weighted models using the same number of knots that are all higher than 3. We also observe that the ability to improve model performance reaches a limit as Q goes beyond 1000. For Q = 1200 with randomized locations, we actually do not see any model improvement compared to Q = 1000. The marginal likelihoods of the models using observations to specify the candidate locations are generally better than those using randomized locations, and such a result is not sensitive to the number of knots. Based on these results, we use Q = n and observations’ locations as proposed by Dunson et al. (2007) and explore different weight functions to seek further model performance. In Table 6, seven types of models in three knot-number cases are compared. We still set the Student’s t model by using 4 knots for each unknown function as the benchmark model and the Bayes factors are displayed. For the names of models using the WDPM, we drop the extensions because they are all followed by ‘‘-921ob’’. Similar to Table 5, the models using 5 knots for each unknown function are better than those using 4 or 6. In all of the three cases, the best three models are always the Student’s t, EWDPM, and WDPMH. For the 4-knot case, WDPMH is favored over EWDPM, and for the 6-knot case, EWDPM performs better than WDPMH. All of the weighted DPM models produce a better marginal likelihood than DPM in all of the three cases. In the 4 and 5-knot cases, EWDPM and WDPMH are the two weighted approaches with respect to which the Bayes factor of the Student’s t is less than 3. The result that the Student’s t models perform better than DPM or WDPM models in this empirical data can be explained by the visual evidence in Fig. 2, where Panel A is the result of empirical data (Verbeek, 2008) fitted by the Student-t, Panel B is the result of the same data fitted by DPM, and Panel C is the result of WDPME. As we did in the simulation study, we check whether multiple clusters exist in the posterior densities of (λ1 , . . . , λn ). We can see that Panel A and Panel C are similar, while Panel B is somewhat different from them. This reflects the results that the marginal likelihood of WDPM is closer than DPM to that of the Student’s t model. What is more important is that one of the panels gives us evidence that
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
177
Fig. 2. Plots of posterior density curves of λt (t = 1, . . . , n). In each of the panels, there are n curves, where n is the sample size. Each curve represents the posterior density of an observation’s λ value based on 20,000 MCMC draws. The y-axis is kernel density estimated value, and x-axis is the range of λt ’s. In this graph: Panel A (Left) is the result of Empirical data (Verbeek, 2008) fitted by Student-t (ordinal response and n = 921); Panel B (Middle) is the result of same data fitted by DPM. Panel C (Right) is the result of same data fitted by WDPME.
Table 7 The proportion of the observations that are correctly rated based on the observed credit rating; we can calculate the probability distribution of a certain 1/2 1 /2 observation’s credit rating based on its covariate information and the model estimation: Pr(yi = j|λ, β, a) = Φ ([cj − xTi β]λi ) − Φ ([cj−1 − xTi β]λi ) j = 0, 1, 2, 3, 4, where cj is the cut point that are calculated based on a as mentioned in Section 4; Based on the fitted probability distribution of credit rating, we can assign a certain observation to the category of rating that it is most likely to fall into and check the proportion of the observations that are correctly rated based on the observed credit rating. This proportion is obtained from the model with 5 knots.
Fitting accuracy
t
DPM
WDPMD
EWDPM
WDPMG
WDPME
WDPMH
54.72
54.07
53.31
54.40
54.29
53.20
55.05
Table 8 Estimated three free cut points using the best two models for credit rating data, student t-model and WDPMH. Posterior mean, standard deviation (SD), median, lower and upper quantiles of MCMC samples are displayed as well. Posterior mean
SD
Median
2.5% quantile
97.5% quantile
1.751 3.225 5.206
1.578 2.988 4.830
1.944 3.485 5.599
1.732 3.196 5.128
1.564 2.959 4.771
1.906 3.449 5.546
Student-t with 5 knots for each unknown function c1 c2 c2
1.753 3.224 5.197
0.094 0.127 0.195
WDPMH with 5 knots for each unknown function c1 c2 c3
1.731 3.190 5.133
0.089 0.121 0.199
multiple clusters exist among (λ1 , . . . , λn ). It suggests that the real data may favor the assumption that posterior densities of λt s are based on a parametric prior. Furthermore, we conduct a model comparison based on fitting accuracy. Once we have the estimated values of β, the cut-point vector a, and error term parameter λi s, we can calculate the probability distribution of a certain observation’s credit rating based on its covariate information and the model estimation: 1/2
Pr(yi = j|λ, β, a) = Φ ([cj − xTi β]λi
1/2
) − Φ ([cj−1 − xTi β]λi
),
j = 0, 1, 2, 3, 4,
(7.1)
where cj is the cut-point calculated based on a, as mentioned in Section 4. Based on the fitted probability distribution of the credit rating, we can assign a certain observation to the category of rating of which it is most likely to fall into and check the proportion of the observations that are correctly rated based on the observed credit rating. Based on Tables 5 and 6, we can see that the model performances in the 5-knot case are generally better than those in the other cases, so we display the result of the fitting accuracy comparison for only the 5-knot case in Table 7. Since the fitted values are derived based on all of the 921 observations, this accuracy is related to which of the seven candidate models is most powerful for these particular credit rating data. It turns out that WDPMH produces the best fitting result (55.05%) based on Table 7, the second best is the Student’s t model (54.72%), and the third best is EWDPM. The best two models with the 5 knots for each unknown function also produce the best marginal likelihood. Hence, we view these two models as the most competitive models for these credit data. Table 8 displays these two models’ parameter inferences based on the MCMC for the three free cut-points (c1 , c2 , c3 ). The posterior mean, standard deviation, median, lower, and upper quantiles of MCMC samples are described. The credit rating for observation i will be 0 if y∗i < 0, 1 if 0 ≤ y∗i < c1 , 2 if c1 ≤ y∗i < c2 , 3 if c2 ≤ y∗i < c3 , and 4 if y∗i ≥ c3 . We can see that the results of the two models are similar to each other. Figs. 3 and 4 display the posterior mean E(gζ |y) as a function of wζ . A positive curve slope of the curve means that the corresponding predictor contributes positively to gj , and, therefore, positively to the credit rating based on posterior
178
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
Fig. 3. The estimated expected value of gj |y using Student-t model of Credit Rating Data; It illustrates how the expected value changes as the value of the jth covariate changes. These figures help analyze how the five covariates contribute to the latent response y∗ .
Fig. 4. The estimated expected value of gj |y for WDPMH of Credit Rating Data; It illustrates how the expected value changes as the value of the jth covariate changes. These figures help analyze how the five covariates contribute to the latent response y∗ .
information. That being said, we can see that w1 (book value of debt) and w5 (working capital divided by total assets) always have a negative impact on a firm’s credit rating. Working capital is the proportion of total assets that is at risk in the market. Therefore, it is aligned with common knowledge that debt and risk do harm credit rating. w4 (retained earnings divided by total assets) is the retained earning divided by total assets. It is related to a firm’s profitability, which definitely has a positive impact on credit rating. However, the impact of predictors are not always that simple. w2 (earning before interest, taxes divided by total assets) and w3 (log of sales) clearly display a change point where the pattern switches from decreasing to increasing. Taking the log of sale w3 as an example, we can interpret that the sale size has a negative impact on rating when it is relatively small, but its impact becomes a positive when the sale size is large enough. It is aligned with observations in the real business development. When a business has just started, the risk becomes bigger as the firm grows. However, after certain stage, when the firm is large enough and has built a reputation in a market, the increase in sale size will be the sign of steady prosperity. These results indicate that although the estimated functions using splines with the Student’s t is similar to those with WDPMH, as displayed in Figs. 3 and 4, WDPHM produces the better fitting results than the Student’s t model, as shown in Table 5. 8. Conclusion In this article, we propose an efficient and flexible Bayesian approach for dual-semiparametric regression models that nonparametrically demonstrate both mean function and error term. WDPM is applied in the construction of the error term and cubic splines are used in the modeling mean function. The important property of the WDPM method is that it uses the predictive information provided by the covariates to construct the prior distribution. As we have seen in the empirical study, WDPM priors contain a potential advantage compared to the idea of proposing a prior ignoring the covariates’ information. We can also observe that, in the cubic spline regression, both the posterior distribution and the marginal likelihood of the weighted DPM model can be conveniently derived by the straightforward extension of the algorithms designed for the DPM model. When the hyperparameters in the weight functions are invariant to different observations, the prior distribution of the hyperparameters do not have a significant effect on the marginal likelihood compared to when the hyperparameters are variable among the observations. Therefore, the function form of the weights and some possible priors of the hyperparameters in it should be taken into consideration for model comparison when applying our approach to other empirical data sets. Acknowledgments Ki-Ahm Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1A2A2A05001376). Ki-Ahm Lee also hold a joint appointment with the Research Institute of Mathematics of Seoul National University.
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
179
Appendix A. MCMC sampling procedures for the semiparametric model Step 1 Sample cut-point vector a using Metropolis–Hastings: to do this, derive m = argmax log Pr(y|λ, β, a) a
{ V =
−∂ log Pr(y|λ, β, a) ∂ a∂ aT
}−1
and propose anew ∼ t15 (a|m, V ). We then update a by anew with probability αMH , where
αMH = min
{
} π (anew )Pr(y|λ, β, anew )t15 (a|m, V ) ,1 , π (a)Pr(y|λ, β, a)t15 (anew |m, V )
and derive the vector of free cut-points {c. } 1 Step 2 Sample the continuous latent response y∗i from N(xTi β, λ− i ) truncated to the interval (cj−1 , cj ) for yi = j. Step 3 Sample β from the following normal distribution, 1 Nk [β|B(B− 0 b0 +
∑
λi xi y∗i ), B],
i
{ } 1 −1 T T −1 −1 −1 T where B = (B0 + , B0 = diag B00 , ∆− , b0 = (bT00 , 0k∗ ×1 )T , k∗ = 1 T1 (∆1 ) , . . . , ∆np Tnp (∆np ) i λi xi xi ) ∑ np ζ =1 Mζ − np,{b00 is the mean vector } of the normal prior for β0 , B00 is the covariance matrix of the normal prior for β0 , Tζ = diag σe2ζ , σd2ζ IMζ −3 , σe2ζ , ζ = 1, . . . , ns. Step 4 Update σe2 and σd2 using the following inverse gamma distributions, −1
∑
σe2ζ ∼ inv gamma( σd2ζ
αζ e0 + 2 δζ e0 + βζT ∆Tζ D0ζ ∆ζ βζ , ), 2
2
αζ d0 + Mj − 3 δζ d0 + βζT ∆Tζ D1ζ ∆ζ βζ ∼ inv gamma( , ), 2
2
where αζ e0 , αζ d0 , bζ e0 , bζ d0 are the hyperparameters of the gamma priors for σe2ζ and σd2ζ , ζ = 1, . . . , ns, and
⎡
D0ζ
1 = ⎣0 0
⎤
0
⎡
0 0 0⎦ , D0ζ = ⎣0 1 0
0Mζ −3 0
⎤
0
0 0⎦ . 0
IMζ − 3 0
Step 5 Update S:
Pr(St = s|Zt = q) ∝
⎧ ⎪ ⎨
1 N(y∗i |xTi β, λ− h )·
⎪ ⎩
tv (y∗i |xTi β, 1) ·
ns(−t)
s ∈ q1 , . . . , qkl
{
nq + α − 1
α nq + α − 1
}
s = max(S −t ) + 1,
where nq is the number of observations that select the qth candidate prior, SZ =q denotes the subset of S including these observations, and ns(− of observations except the tth one that belong to the sth unique value } {t) is the number in λ. S −t means S /{St } and q1 , . . . , qkl is the set of unique values in the intersection of S −t and SZ =q . Step 6 Sample the k distinct values (λ∗1 , . . . , λ∗k ) in λ,
λh ∼ gamma( ∗
v+
∑
1(Si = h) v +
,
∑
Si =h (yi
2 Step 7 Sample the C vector directly from the weights
− xTi β )2
2
),
h = 1, . . . , k.
n(h)
∏ Γ [α + nq(−h) ] Pr(Ch = q|C−h , S , y, α, θH ) ∝ biq . Γ [α + nq(−h) + n(h) ] i=1
Step 8 Sample the hyperparameters ψ for weight function described in (a) and update γj explained in (b): (a) For the unique ψ case, use
⎛ argmax log{w (Z |ψ )} = argmax log ⎝ ψ
ψ
n ∏ i=1
e
2 −ψ xi −xcZ
⎞
i
c
−ψ ∥xi −xq ∥ q=1 e
∑Q
2
⎠
and the negative inverse of the second order derivative as the location (m) and scale (σ 2 ) parameters of the Student’s t with a degree of freedom 15, which is the proposal distribution of the Metropolis–Hastings
180
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
algorithm for sampling ψ . The accepting probability is
{ min
} π (ψnew )w(Z |ψnew ) t15 (ψ|m, σ 2 ) · , 1 . π (ψ )w(Z |ψ ) t15 (ψnew |m, σ 2 )
And for the ψ = (ψ1 , . . . , ψQ ) case, we apply a sub-optimal proposal. The target density is proportional to −1 −(logψnew,q −logψq )2 /2 π (ψq )w (Z |ψq ) and our proposal q(ψnew,q |ψq ) ∝ ψne 1[0,10] . w,q e (b) This step is only necessary for the WDPMD model, where we have to update γj s as Dunson et al. (2007) 2 γj exp(−ψ xi −xcj )
propose. For a given ψ , let Kij = ∑ ∗
q̸ =i γq exp(−ψ
∥xi −xcq ∥2 )
. Then γj can be sampled by the following Gibbs
sampling: i. Sij∗ ∼ Poisson (γj ξij Kij∗ ) if Zi = j, otherwise Sij = 0, ii. ξij ∼ gamma (1 + Sij∗ , 1 + γj Kij∗ ), ∑n ∑n iii. γj ∼ gamma (aγ + i=1 Sij∗ , bγ + i=1 ξij Kij∗ ), where Sij∗ and ξij are auxiliary variables that are introduced to realize the Gibbs sampling and (aγ , bγ ) specify the gamma hyper priors that are used for γj s. Step 9 Sample α : let d be the distinct values in Z and (n1 , . . . , nd ) denote the numbers of observations that come from these d candidate priors. For among λi s that come from the rth one of these d candidate priors (r = 1, . . . , d), suppose that there are kr distinct values of λ. Then we have: Pr(k1 , . . . , kd |α ) ∝ α
∑d
r =1 k r
d ∏ r =1
Γ (α ) . Γ ( α + nr )
We use argmax log{Pr(k1 , . . ., kd |α )}. For sampling α using MH, we use the proposal distribution as the multivariate α
Student’s t with degree of freedom 15 with the location m and scale V , which are setted as the negative inverse of the corresponding Hessian matrix of argmax log{Pr(k1 , . . ., kd |α )}. Hence, the accepting probability is α
{ min
} t15 (α|m, V ) π (αnew )Pr(k1 , . . . , kd |αnew ) · ,1 . π (α )Pr(k1 , . . . , kd |α ) t15 (αnew |m, V )
Note that since our procedures pertain to a general case, they can be applied to other models. For example, as in (i) for a continuous case, Step 1 will be dropped and we directly use y instead of the latent response y∗ ; (ii) for DPM, Steps 7 and 8 should be dropped and Steps 5–6 will be simplified into one step:
( ) α tv (y∗i |xTi β, 1) v + 1 v + (yi − xTi β)2 gamma , [λi |λ−i , y , β, G0 ] ∼ ∑ 1 2 2 α tv (y∗i |xTi β, 1) + j̸=i N(y∗i |xTi β, λ− j ) ∗
+
1 N(y∗i |xTi β, λ− j )
∑ j̸ =i
α tv (y∗i |xTi β, 1) +
∑
j ̸ =i
1 N(y∗i |xTi β, λ− j )
δλ j .
As a final example, as in (iii) for t model, Steps 7–9 will be dropped and Steps 5–6 can be further simplified:
[λi |λ−i , y∗ , β, G0 ] ∼ gamma
(
v + 1 v + (yi − xTi β)2 , 2
2
)
.
Appendix B. Supplementary data Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.csda.2017.08.005.
References Basu, S., Chib, S., 2003. Marginal likelihood and bayes factors for dirichlet process mixture models. J. Amer. Statist. Assoc. 98, 224–235. Blackwell, D., MacQueen, J.B., 1973. Ferguson distributions via pólya urn schemes. Ann. Statist. 1, 353–355.
P. Sun et al. / Computational Statistics and Data Analysis 117 (2018) 162–181
181
Carvalho, C., Polson, N.G., 2010. The horseshoe estimator for sparse signals. Biometrika 97, 465–480. Chae, M., Lin, L., Dunson, D.B., 2016. Bayesian sparse linear Regression with Unknown Symmetric Error, arXiv:1608.02143. Chib, S., 1995. Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90, 1313–1321. Chib, S., Greenberg, E., 2010. Additive cubic spline regression with Dirichlet process mixture errors. J. Econometrics 156, 322–336. Chu, W., Ghahramani, Z., 2005. Gaussian processes for ordinal regression. J. Mach. Learn. Res. 6, 1019–1041. Dunson, D.B., Pillai, N., Park, J., 2007. Bayesian density regression. J. R. Stat. Soc. Ser. B 69, 163–183. Dunson, D.B., Stanford, J.B., 2005. Bayesian inferences on predictors of conception probabilities. Biometrics 61, 126–133. Durrleman, S., Simon, R., 1989. Flexible regression models with cubic splines. Stat. Med. 8, 551–561. Escobar, M.D., West, M., 1995. Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90, 577–588. Ferguson, T.S., 1973. A bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–230. Ghosal, S., 2009. In: Hjort, N.L., Holmes, C., Muller, P., Walker, S.G. (Eds.), The Dirichlet Process, Related Priors, and Posterior Asymptotic. In: Bayesian Nonparametrics, Cambridge University Press, New York, (Chapter 2). Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models: a Roughness Penalty Approach. In: Chapman & Hall/CRC Monographs on Statistics & Applied Probability, vol. 58. Hannah, L.A., Blei, D.M., Powell, W.B., 2011. Dirichlet process mixtures of generalized linear models. J. Mach. Learn. Res. 12, 1923–1953. Holmes, C.C., Held, L., 2006. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 1, 145–168. Irwin, M., Cox, N., Kong, A., 1994. Sequential imputation for multilocus linkage analysis. Proc. Natl. Acad. Sci. USA 91, 11684–11688. Jensen, M.J., Maheu, J.M., 2013. Bayesian semiparametric multivariate GARCH modeling. J. Econometrics 176, 3–17. Kim, I., Cheong, H., Kim, H., 2011. Semiparametric regression models for detecting effect modification in matched case-control studies. Stat. Med. 30, 1837–1851. Kim, I., Cohen, N., 2004. Semiparametric and nonparametric modelling for effect modification in matched case-control studies. Comput. Statist. Data Anal. 46, 631–643. Kim, I., Cohen, N., Carroll, R., 2003. Semiparametric regression splines in matched case-control studies. Biometrics 59, 1158–1169. Lancaster, P.,Šalkauskas, K., 1986. Curve and Surface Fitting: An Introduction. Academic Press, San Diego. Li, Q., Racine, J.S., 2006. Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton. MacEachern, S., 2000. Dependent Dirichlet Processes. Department of Statistics, The Ohio State University, (unpublished manuscript). MacEachern, S.N., Möller, P., 1998. Estimating mixtures of Dirichlet process models. J. Comput. Graph. Stat. 7, 223–238. Mahmoud, H.F.F., Kim, I., Kim, H., 2016. Semiparametric single index multi change points model with an application of environmental health study on mortality and temperature. Environmetrics 27, 494–506. Marsh, L.C., Cormier, D.R., 2002. Spline Regression Models. In: Series: Quantitative Applications in the Social Sciences, vol. 137. Pagan, A., Ullah, A., 1999. Nonparametric Econometrics: Themes in Modern Econometrics. Cambridge University Press, Cambridge. Pang, H. Kim.I., Zhao, H., 2012. Bayesian semiparametric regression models for evaluating pathway effects on continuouse and binary clinical outcomes. Stat. Med. 31, 1633–1651. Pati, D., Dunson, D.B., 2014. Bayesian nonparametric regression with varying residual density. Ann. Inst. Statist. Math. 66, 1–31. Ruppert, D., 2002. Selecting the number of knots for penalized splines. J. Comput. Graph. Stat. 11, 735–757. Ruppert, D., Wand, M.P., Carroll, J., 2003. Semiparametric Regression. Cambridge University Press, New York, NY, USA. Sethuraman, J., 1994. A constructive definition of Dirichlet priors. Statist. Sinica 4, 639–650. Verbeek, M., 2008. A Guide To Modern Econometrics, third ed.. John Wiley & Sons, Ltd, Chichester. Ortega Villa, A., I, Kim., Kim, H., 2017. Semiparametric time varying coefficient model for matched case-crossover studies. Stat. Med. 36, 998–1013. Wasserman, L., 2006. All of Nonparametric Statistics. In: Springer Texts in Statistics, Springer, New York. Wood, S.N., 2006. Generalized Additive Models: An Introduction with R. In: Texts in Statistical Science, Chapman & Hall/CRC, Boca Raton, FL. Zellner, A., 1986. On Assessing Prior Distributions and Bayesian Regression Analysis with G-Prior Distributions. In: Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, North-Holland, Amsterdam, pp. 233–243. Zhang, H., Kim, I., Park, I., 2014. Semiparametric Bayesian hierarchical models for heterogeneous population in nonlinear mixed effect model: application to gastric emptying studies. J. Appl. Stat. 41, 2743–2760.
Further reading Ferguson, T.S., 1983. Bayesian density estimation by mixtures of normal distributions. In: Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, pp. 287–302.