Bayesian variable selection with strong heredity constraints

Joungyoun Kim a, Johan Lim b, Yongdai Kim b, Woncheol Jang b,*

a Department of Information & Statistics, Chungbuk National University, Cheongju, Republic of Korea
b Department of Statistics, Seoul National University, Seoul, Republic of Korea
Article history: Received 2 April 2017; Accepted 25 March 2018.

Abstract
In this paper, we propose a Bayesian variable selection method for linear regression models with high-order interactions. Our method automatically enforces the heredity constraint, that is, a higher order interaction term can exist in the model only if both of its parent terms are in the model. Based on the stochastic search variable selection of George and McCulloch (1993), we propose a novel hierarchical prior that fully considers the heredity constraint and controls the degree of sparsity simultaneously. We develop a Markov chain Monte Carlo (MCMC) algorithm to explore the model space efficiently while accounting for the heredity constraint by modifying the shotgun stochastic search algorithm of Hans et al. (2007). The performance of the new model is demonstrated through comparisons with other methods. Numerical studies on both real data and simulations show that our new method tends to find relevant variables more effectively when higher order interaction terms are considered.
AMS 2000 subject classifications: 62J05; 62C10
Keywords: Heredity principle; Interaction; Shotgun stochastic search; Strong heredity; Variable selection
1. Introduction

Suppose that we observe (x_1, y_1), ..., (x_n, y_n) from n subjects, where x_i = (x_{i1}, ..., x_{ip}) are the predictors and y_i is the response. To capture the relationship between the response and the predictors, one may consider the following linear model:

y_i = β_0 + ∑_{j=1}^{p} β_j x_{ij} + ϵ_i,   i = 1, ..., n,   (1)
where ϵ_i ∼ N(0, σ²). One of the main interests of regression analysis is to select a subset of predictors that are relevant to the response. When the main terms x_1, ..., x_p are not sufficient to capture the relationship between the response and the predictors, it can be helpful to add higher order interactions to the model. An interaction, the product of a pair of predictors, can be considered when one predictor has different effects on the response depending on the values of the other predictor. For example, interactions between genetic markers and environmental factors, denoted as G×E, are emphasized as one potential source of missing genetic variation for human disease risk (Khoury & Wacholder, 2009). There are numerous examples of the effects of G×Es on disease risk, such as for bladder cancer and skin cancer (Green & Trichopoulos, 2002; Hung et al., 2004). In this paper, we consider a regression model with main terms and all possible second-order interaction terms:

y_i = β_0 + ∑_{j=1}^{p} β_j x_{ij} + ∑_{j<k} α_{jk} x_{ij} x_{ik} + ϵ_i.   (2)
* Corresponding author. E-mail addresses: [email protected] (J. Kim), [email protected] (J. Lim), [email protected] (Y. Kim), [email protected] (W. Jang).
The main goal is to identify which terms, especially which interaction terms, have an important effect on the response. By adding interaction terms to the model, we face two challenges: (1) the increased number of terms and (2) the complexity of the relations among predictors. First, adding second-order interaction terms increases the number of terms from p to p + p(p − 1)/2, which can easily be larger than the sample size n. Hence, we would like to select a small subset of predictors if possible. Second, when interaction terms exist, there is a hierarchical relation among the predictors, that is, main terms are more important than interaction terms. In other words, we may include an interaction term only if one (called weak heredity) or both (called strong heredity) of its parental main terms are also included in the model (Bien, Taylor, & Tibshirani, 2013; Wu & Hamada, 2000). In this paper, we focus on variable selection in applications where such a hierarchical relation is desired.

Traditional variable selection methods include stepwise selection and criterion-based approaches such as Mallows' Cp (Mallows, 1973). However, criterion-based methods are unstable when the number of predictors is large (Miller, 2002). Alternatively, penalized approaches, such as the LASSO, have been studied extensively (Breiman, 1995; Fan & Li, 2001; Tibshirani, 1996). However, penalized methods do not necessarily obey the heredity constraint. There have been efforts to extend the LASSO for hierarchical variable selection (Bien et al., 2013; Choi, Li, & Zhu, 2010); however, these methods are not easy to implement, and they may not be model-selection consistent in certain situations (Meinshausen & Buhlmann, 2006; Zou, 2006).

In this paper, we propose a Bayesian variable selection method enforcing heredity constraints, herein focusing on strong heredity, based on the stochastic search variable selection (SSVS) of George and McCulloch (1993). We propose a novel hierarchical prior for SSVS, which fully considers the strong heredity and controls the degree of sparsity simultaneously. In addition, we develop a computational algorithm to explore the model space efficiently under the strong heredity by modifying the shotgun stochastic search (SSS) algorithm (Hans, Dobra, & West, 2007). The SSS algorithm has an advantage over other SSVS algorithms (Madigan & York, 1995; Raftery, Madigan, & Hoeting, 1997) in the sense that it can be implemented through parallel processing. The proposed algorithm also has this advantage.

The remainder of this paper is organized as follows. In Section 2, we review SSVS and present our method and prior distributions. Section 3 extends the SSS algorithm to variable selection under the strong heredity principle. In Sections 4 and 5, we present numerical studies, including a simulation and a real data example, to show that our method outperforms existing methods. Concluding remarks are given in Section 6.

2. Bayesian models

2.1. Stochastic search variable selection for main terms

In this section, we present a short review of stochastic search variable selection (SSVS). Here, we consider the regression model with only main terms: as in (1), we have p predictors, X_1, X_2, ..., X_p, and a response y from n subjects, where X_j = (x_{1,j}, ..., x_{n,j}) and y = (y_1, ..., y_n).
In SSVS, George and McCulloch (1993) introduced a binary latent variable γ_j to identify whether the corresponding jth predictor X_j should be included in the model, and used the prior

π(γ_j | w_j) = w_j^{γ_j} (1 − w_j)^{1−γ_j},

where w_j is the inclusion probability. Often w_j = 1/2 is assumed. In Bayesian inference, the posterior mean of γ_j, denoted by γ̄_j, represents the posterior inclusion probability of predictor X_j in the regression models. A higher γ̄_j indicates that the covariate X_j is more important in predicting the response. Conditioning on γ_j, the prior distribution of β_j is given as a normal mixture:
β_j | γ_j ∼ γ_j N(0, σ² τ²) + (1 − γ_j) N(0, σ² ν²),   for j = 1, ..., p,

where 0 < ν² ≪ τ². Note that σ² is the noise variance in (1). If γ_j = 1, β_j has a normal prior with a variance large enough to keep it away from zero. In contrast, if γ_j = 0, β_j has a normal prior with a narrow peak at zero, which forces β_j to be close to zero. For the remaining parameters, we use the following improper priors,
π(σ²) = 1/σ²,
π(ν², τ² | σ²) = (1/σ²)(1 + ν²/σ²)^{−2} × (1/σ²)(1 + τ²/σ²)^{−2} × I{ν² < τ²},
which are the priors used by Scott and Berger (2006).

2.2. Stochastic search variable selection for interaction terms

Now, we consider the extended model (2), which includes main terms and interaction terms. Including interaction terms requires additional binary variables γ_{j,k} (1 ≤ j < k ≤ p), which indicate the inclusion state of an interaction term X_j X_k in a regression model. The role of γ_{j,k} is the same as that of γ_j, indicating the importance of the covariate X_j X_k. Each model can be identified by the binary latent vector γ = (γ_1, ..., γ_p, γ_{1,2}, ..., γ_{p−1,p}). For a regression coefficient α_{j,k}, we assume the following normal mixture prior distribution:
α_{j,k} | γ_{j,k} ∼ γ_{j,k} N(0, σ² τ²) + (1 − γ_{j,k}) N(0, σ² ν²),

with the same hyper-parameters σ², ν² and τ² as in the prior distribution of β_j.
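To make the prior specification concrete, the following sketch (Python with NumPy/SciPy; the function and variable names are ours and purely illustrative, not part of the paper) evaluates the log density of the normal-mixture (spike-and-slab) prior shared by the main-effect coefficients β_j and the interaction coefficients α_{j,k}.

```python
import numpy as np
from scipy.stats import norm

def log_mixture_prior(coef, gamma, sigma2, tau2, nu2):
    """Log density of the spike-and-slab prior
    coef | gamma ~ gamma * N(0, sigma2 * tau2) + (1 - gamma) * N(0, sigma2 * nu2),
    with 0 < nu2 << tau2, so gamma = 0 shrinks the coefficient towards zero."""
    var = sigma2 * (tau2 if gamma == 1 else nu2)
    return norm.logpdf(coef, loc=0.0, scale=np.sqrt(var))

# Illustration: the same coefficient value is far more plausible under the
# "slab" (gamma = 1) than under the "spike" (gamma = 0).
print(log_mixture_prior(1.5, gamma=1, sigma2=1.0, tau2=10.0, nu2=1e-4))
print(log_mixture_prior(1.5, gamma=0, sigma2=1.0, tau2=10.0, nu2=1e-4))
```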
Chipman (1996) used the following prior for (γ_j, γ_k, γ_{j,k}) to capture the dependence between higher order terms and their parent terms:

π(γ_j, γ_k, γ_{j,k}) = π(γ_j) π(γ_k) π(γ_{j,k} | γ_j, γ_k),

where

π(γ_{j,k} = 1 | γ_j, γ_k) =
  p_00  if (γ_j, γ_k) = (0, 0),
  p_01  if (γ_j, γ_k) = (0, 1),
  p_10  if (γ_j, γ_k) = (1, 0),
  p_11  if (γ_j, γ_k) = (1, 1).
The strong heredity is satisfied when p_00 = p_01 = p_10 = 0 and 0 < p_11 ≤ 1. Note that the prior inclusion probability for an interaction effect is common to all interaction effects. Farcomeni (2010) simplified and reparameterized the prior distribution for hierarchical constraints as well as for grouping and anti-hierarchical constraints. In this paper, we propose the use of (w_j w_k)^{1/2} as w_{j,k}, the prior inclusion probability for an interaction X_j X_k. That is,
π(γ_{j,k} = 1 | γ_j, γ_k, w_j, w_k) = γ_j γ_k (w_j w_k)^{1/2},   (3)
for all 1 ≤ j < k ≤ p. Recall that π(γ_j = 1) = w_j and π(γ_k = 1) = w_k. This hierarchical prior enforces the strong heredity automatically. The prior (3) reflects the belief that the selection probability of an interaction term is high when the selection probabilities of the corresponding main terms are high. It also reduces the number of hyper-parameters from p(p − 1)/2 for {w_{j,k}} to p for {w_j}. In addition, for the hyper-parameter w_j, we let w_j ∼ Beta(2, 2) for j = 1, ..., p. With a hierarchical prior on w_j, the posterior inference on γ_j becomes less sensitive to the initial choice of w_j, and the sparsity of the model can be controlled flexibly.

2.3. Bayesian inference

The state space of our model can be represented as (w, γ, β, σ², τ², ν²), where w = (w_1, ..., w_p) is the vector of prior inclusion probabilities, γ = (γ_1, ..., γ_p, γ_{1,2}, ..., γ_{p−1,p}) is the vector of variable inclusion states, β = (β_1, ..., β_p, α_{1,2}, ..., α_{p−1,p}) is the vector of regression coefficients, and σ², τ² and ν² are hyper-parameters. Given data, Bayesian inference over the state space (w, γ, β, σ², τ², ν²) is based on the posterior distribution π(w, γ, β, σ², τ², ν² | y) given as
π(w, γ, β, σ², τ², ν² | y) = Q⁰(w, γ, β, σ², τ², ν²) / ∫_{β,w,γ,σ²,τ²,ν²} Q⁰(w, γ, β, σ², τ², ν²),   (4)
where Q⁰(w, γ, β, σ², τ², ν²) = π(w) π(γ | w) π(β | γ, σ², τ², ν²) π(σ²) π(τ², ν² | σ²) × likelihood. Note that the likelihood is the product of normal densities given the parameters. To focus on estimating γ and to compensate for the time spent on updating w, we drop β by integrating (4) over β. The new target distribution π(w, γ, σ², τ², ν² | y) is obtained as follows:
π(w, γ, σ², τ², ν² | y) = ∫_β π(β, w, γ, σ², τ², ν² | y) dβ
  = ∫_β Q⁰(w, γ, β, σ², τ², ν²) dβ / ∫_{β,w,γ,σ²,τ²,ν²} Q⁰(w, γ, β, σ², τ², ν²)
  ∝ Q(w, γ, σ², τ², ν²).   (5)
Here Q(w, γ, σ², τ², ν²) is defined in the following theorem.

Theorem 1. The posterior distribution Q(w, γ, σ², τ², ν²) is given by

Q(w, γ, σ², τ², ν²) = σ^{−n} |XᵗX D̄_γ + I|^{−1/2} exp{ −(1/(2σ²)) yᵗy + (1/(2σ²)) μ_γᵗ (XᵗX + D̄_γ^{−1}) μ_γ } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²).   (6)
Note that D̄_γ = diag(γ_j τ² + (1 − γ_j) ν²) and μ_γ = (XᵗX + D̄_γ^{−1})^{−1} Xᵗ y. Here, X denotes the n × q design matrix from the n subjects, and q = p + p(p − 1)/2. Furthermore, the above posterior distribution is proper.

Proof. See Appendix.
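For readers who want to implement the method, the following sketch evaluates log Q(w, γ, σ², τ², ν²) of (6) up to an additive constant, including the strong-heredity prior π(γ | w) of (3) and the priors of Section 2. It is a minimal illustration under our own conventions (NumPy, zero-based indices, illustrative names); it is not the authors' code.

```python
import numpy as np

def log_Q(y, X, gamma, w, sigma2, tau2, nu2, pairs):
    """log Q(w, gamma, sigma2, tau2, nu2) in (6), up to an additive constant.
    `gamma` stacks the p main-effect indicators followed by the p(p-1)/2
    interaction indicators; `pairs[m]` gives the parent indices (j, k) of the
    (p + m)-th column of the n x q design matrix X."""
    n, q = X.shape
    Dbar = np.where(gamma == 1, tau2, nu2)           # diagonal of D_bar_gamma
    A = X.T @ X + np.diag(1.0 / Dbar)                # X'X + D_bar_gamma^{-1}
    mu = np.linalg.solve(A, X.T @ y)                 # mu_gamma
    _, logdet = np.linalg.slogdet(X.T @ X * Dbar[None, :] + np.eye(q))
    loglik = (-0.5 * n * np.log(sigma2) - 0.5 * logdet
              - y @ y / (2.0 * sigma2) + mu @ A @ mu / (2.0 * sigma2))

    # pi(gamma | w): Bernoulli(w_j) for main terms; gamma_j gamma_k sqrt(w_j w_k)
    # for interactions, so any state violating strong heredity gets -inf.
    p = len(w)
    logprior = np.sum(np.where(gamma[:p] == 1, np.log(w), np.log1p(-w)))
    with np.errstate(divide="ignore"):
        for m, (j, k) in enumerate(pairs):
            pr = gamma[j] * gamma[k] * np.sqrt(w[j] * w[k])
            logprior += np.log(pr) if gamma[p + m] == 1 else np.log1p(-pr)

    # pi(w) = Beta(2, 2), pi(sigma2) = 1/sigma2, and pi(tau2, nu2 | sigma2)
    # of Section 2.1, each up to an additive constant.
    logprior += np.sum(np.log(w) + np.log1p(-w))
    logprior += -np.log(sigma2)
    if nu2 < tau2:
        logprior += 2 * np.log(sigma2) - 2 * np.log(sigma2 + nu2) - 2 * np.log(sigma2 + tau2)
    else:
        logprior = -np.inf
    return loglik + logprior
```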
Because the exact calculation of the integral in (5) is infeasible, we use a Markov chain Monte Carlo (MCMC) algorithm (Hastings, 1970), which only requires evaluating Q in (6). For Bayesian inference, an MCMC algorithm constructs a Markov chain whose stationary distribution is π(w, γ, σ², τ², ν² | y) by proposing a new state (w*, γ*, σ²*, τ²*, ν²*) from a proposal density q(· | w, γ, σ², τ², ν²). The proposal density that we use is a mixture of many components, each of which is obtained by changing a small number of covariates in the current subset γ. In the following section, we describe in detail the complete set of MCMC proposals, which is a modified version of the SSS (Hans et al., 2007).

3. MCMC

3.1. Shotgun stochastic search algorithm

In this section, we briefly review the shotgun stochastic search (SSS) algorithm (Hans et al., 2007). SSS is designed to identify probable values of γ rapidly and to move around swiftly in the space of models as the dimension grows. Let the neighborhood N(γ) in SSVS be a set of γ*'s that are identical to the current γ except for a few elements. Suppose that N(γ) is composed of three disjoint subsets, N(γ) = γ⁺ ∪ γ⁻ ∪ γ⁰, in which (i) γ⁺ is for ''addition'', where a new variable is added to the current model; (ii) γ⁻ is for ''deletion'', where an existing variable is dropped from the current model; and (iii) γ⁰ is for ''replacement'', where two variables exchange their inclusion/exclusion states. Once N(γ) is identified, the algorithm randomly selects an element γ* from N(γ) with probability T(γ → γ*),
T(γ → γ*) = π(w, γ*, σ², τ², ν² | y) I(γ* ∈ N(γ)) / ∑_{s∈N(γ)} π(w, s, σ², τ², ν² | y)
           = Q(w, γ*, σ², τ², ν²) I(γ* ∈ N(γ)) / ∑_{s∈N(γ)} Q(w, s, σ², τ², ν²),   (7)
where Q (w, γ, σ 2 , τ 2 , ν 2 ) is defined in (6) and I denotes an indicator function. Once γ ∗ is selected, we can identify N(γ ∗ ) as well. Then, the acceptance probability α is given by
α = min{ 1, ∑_{s∈N(γ)} Q(w, s, σ², τ², ν²) / ∑_{s∈N(γ*)} Q(w, s, σ², τ², ν²) }.
This is the ratio of posterior probabilities between the two neighborhoods, and it is typically large enough to allow rapid mixing. Through the use of this acceptance probability, the MCMC chain can converge quickly. In the following section, we propose three methods to find N(γ), which modify the original shotgun algorithm to account for the strong heredity.

3.2. Our proposal: MCMC with SSS

To sample from our model posterior π(w, γ, σ², τ², ν² | y) with MCMC, we need to update the states consisting of the inclusion probabilities w, the inclusion indicators γ and the three hyper-parameters σ², τ² and ν². A new value for w, σ², τ² or ν² is proposed as the product of the current value and a random number from Gamma(2, 2). To update γ, our MCMC algorithm follows the SSS algorithm, which accelerates the mixing of the MCMC chains even when the dimension of γ is large. Given γ, w, σ², τ² and ν², we propose a new inclusion state γ* from the neighborhood N(γ) of γ. For efficient updating of γ under the strong heredity, we propose three procedures to find N(γ), which we call P1, P2 and P3. Among the three procedures, one is randomly selected with probabilities 0.1, 0.4 and 0.5, respectively. These probabilities were tuned to guarantee good mixing. The three procedures are summarized as follows.

P1. Finding a neighborhood focusing on the main terms. We consider the main terms that are not involved in any interaction term existing in the current model. Specifically, among the main terms included in the current model, we consider only those without any interaction terms in the model; main terms that are not included in the current model are all considered as possible new main terms. The idea is to add a new main term, delete an existing main term, or swap a new main term with an existing main term.

P2. Finding a neighborhood focusing on interaction terms. We consider the interaction terms formed by all possible pairs of main terms existing in the current model. These interaction terms are divided into two groups, the existing interaction terms and the new interaction terms, depending on the current inclusion states. The neighborhood is then obtained by adding a new interaction term, deleting an existing interaction term, or swapping a new interaction term with an existing interaction term.

P3. Finding a neighborhood simultaneously for main terms and interaction terms. First, we find the main terms that exist in the current model and are involved in at least one existing interaction term in the model; these main terms and interaction terms are regarded as the existing main terms and interaction terms. Next, we find the main terms that are not included in the current model as new main terms. Given a new main term, we select an existing main term along with its corresponding interaction terms. Then, we swap the new main term with the existing main term, both for the main term itself and for the interaction terms. For example, suppose that the main term X1 is included in the model and that X4 is not included in the model.
Regarding X1, only two interactions, X1X2 and X1X3, are included in the current model; that is, γ_1 = 1, γ_4 = 0, γ_{1,2} = 1, γ_{1,3} = 1 and γ_{1,k} = 0 for all k ≥ 4. As an element of the neighborhood, consider swapping X4 with X1, which also swaps the interactions X2X4 and X3X4 with the interactions X1X2 and X1X3. After the swap, we have γ_1 = 0, γ_4 = 1, γ_{2,4} = 1, γ_{3,4} = 1, γ_{1,k} = 0 for all k, and γ_{4,k} = 0 for k ∉ {2, 3}.

In summary, an MCMC chain is constructed as follows:

1. Initialization. Start with initial states γ^(0), w^(0), (σ²)^(0), (τ²)^(0) and (ν²)^(0). Set i = 0.
2. Set i = i + 1, γ^(i) = γ^(i−1), w^(i) = w^(i−1), (σ²)^(i) = (σ²)^(i−1), (τ²)^(i) = (τ²)^(i−1) and (ν²)^(i) = (ν²)^(i−1).
3. Update γ^(i).
   (a) Find N(γ^(i)), a neighborhood of γ^(i).
   (b) Evaluate Q(w^(i), s, (σ²)^(i), (τ²)^(i), (ν²)^(i)) for all s ∈ N(γ^(i)).
   (c) Randomly choose γ* from N(γ^(i)) with probability
       Q(w^(i), γ*, (σ²)^(i), (τ²)^(i), (ν²)^(i)) / ∑_{s∈N(γ^(i))} Q(w^(i), s, (σ²)^(i), (τ²)^(i), (ν²)^(i)).
   (d) Find N(γ*), a neighborhood of γ*.
   (e) Evaluate ∑_{s∈N(γ*)} Q(w^(i), s, (σ²)^(i), (τ²)^(i), (ν²)^(i)).
   (f) Calculate the acceptance probability
       α = min{ 1, ∑_{s∈N(γ^(i))} Q(w^(i), s, (σ²)^(i), (τ²)^(i), (ν²)^(i)) / ∑_{s∈N(γ*)} Q(w^(i), s, (σ²)^(i), (τ²)^(i), (ν²)^(i)) }.
   (g) Set γ^(i) = γ* with probability α.
4. Update w^(i), (σ²)^(i), (τ²)^(i) and (ν²)^(i).
   (a) Among w^(i), (σ²)^(i), (τ²)^(i) and (ν²)^(i), randomly select one parameter for updating with probabilities 0.7, 0.1, 0.1 and 0.1, respectively. If w^(i) is selected, we randomly and uniformly determine which component of w^(i) is to be updated.
   (b) A new state is proposed as the product of the current value and a random multiplier from a Gamma(2, 2) distribution.
   (c) The new state is accepted with an acceptance probability α. For example, suppose that we propose a new value for σ² as (σ²)^* = (σ²)^(i) × u, where u is a random number from Gamma(2, 2). Then, the acceptance probability is
       α = min{ 1, Q(w^(i), γ^(i), (σ²)^*, (τ²)^(i), (ν²)^(i)) / Q(w^(i), γ^(i), (σ²)^(i), (τ²)^(i), (ν²)^(i)) × f(1/u | 2, 2) / f(u | 2, 2) × J },
       where f(· | 2, 2) is the pdf of Gamma(2, 2) and J = (σ²)^(i)/(σ²)^* is the Jacobian.
5. Repeat steps 2–4 until i reaches the desired number of iterations.
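The following is a minimal sketch of how a strong-heredity neighborhood and one SSS-type update of γ might look in code. It implements simplified addition/deletion moves in the spirit of P1 and P2 only (the swap moves and the joint move P3 are omitted), and all names are ours; it is illustrative, not the authors' implementation.

```python
import itertools
import numpy as np

def heredity_neighborhood(main, inter, p):
    """Simplified neighborhood N(gamma) respecting strong heredity.
    `main` is the set of included main-term indices, `inter` the set of
    included interaction pairs (j, k) with j < k."""
    nbh = []
    # P1-style: add a main term, or delete one that has no interaction child.
    for j in range(p):
        if j not in main:
            nbh.append((main | {j}, inter))
        elif not any(j in pair for pair in inter):
            nbh.append((main - {j}, inter))
    # P2-style: add an interaction whose two parents are both in the model,
    # or delete an existing interaction.
    for pair in itertools.combinations(sorted(main), 2):
        if pair not in inter:
            nbh.append((main, inter | {pair}))
    for pair in inter:
        nbh.append((main, inter - {pair}))
    return nbh

def sss_update(main, inter, p, log_q, rng):
    """One shotgun-stochastic-search update of the inclusion states, where
    `log_q(main, inter)` evaluates log Q of (6) with the other parameters fixed."""
    cur_nbh = heredity_neighborhood(main, inter, p)
    cur_logq = np.array([log_q(m, i) for m, i in cur_nbh])
    probs = np.exp(cur_logq - cur_logq.max())
    probs /= probs.sum()
    cand_main, cand_inter = cur_nbh[rng.choice(len(cur_nbh), p=probs)]
    cand_logq = np.array([log_q(m, i)
                          for m, i in heredity_neighborhood(cand_main, cand_inter, p)])
    # Acceptance probability: ratio of the two neighborhood sums, as in Section 3.1.
    log_alpha = np.logaddexp.reduce(cur_logq) - np.logaddexp.reduce(cand_logq)
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return cand_main, cand_inter
    return main, inter
```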
4. Numerical study

In this section, we compare the proposed method with existing methods on simulated data generated from the following model with p = 10 predictors:

y = β_0 + ∑_{j=1}^{10} β_j x_j + ∑_{j=1}^{9} ∑_{k=j+1}^{10} α_{j,k} x_j x_k + ϵ.   (8)
Here, ϵ ∼ N(0, σ²). In the study, the covariates x_j are generated from the standard normal distribution. We consider four sets of coefficient vectors, displayed in Table 1: Case 1 is a model with no interaction terms, Case 2 is a model with interaction terms of moderate size, Case 3 is a model with interaction terms of large size, and Case 4 is a model in which the interaction terms are larger than the main terms. In each case, the noise variance σ² is set so that the signal-to-noise ratio (SNR) equals 4.0, and the coefficients not listed in Table 1 are equal to 0.

For each case, we generate 100 data sets with sample size n = 100 and apply three methods: our proposal, forward selection, and the Gibbs sampler of Chipman (1996), denoted by HB-ShotG, Forward and C-Gibbs, respectively. For HB-ShotG, for each data set, we generate 100K MCMC samples.
6
J. Kim et al. / Journal of the Korean Statistical Society (
)
–
Table 1
True values of βj and αj,k in the numerical study.

         β1    β2    β3    β4    α1,2   α1,3   α1,4   α2,3   α2,4   α3,4    σ
Case 1   7.00  2.00  1.00  1.00   0.00   0.00   0.00   0.00   0.00   0.00    3.71
Case 2   7.00  2.00  1.00  1.00   1.00   0.00   0.00   0.50   0.40   0.10    3.85
Case 3   7.00  2.00  1.00  1.00   7.00   7.00   7.00   2.00   2.00   1.00   11.43
Case 4   7.00  2.00  1.00  1.00  14.00  14.00  14.00   4.00   4.00   2.00   13.03
Table 2
The averages of γ̄j and γ̄j,k. This table provides the averages of the posterior means γ̄j and γ̄j,k corresponding to the 10 variables in Table 1. Bold indicates values exceeding 0.5. C-Gibbs(w) denotes C-Gibbs with prior inclusion probability w.

Case 1    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      1.00          1.00          1.00          1.00
β2        2.00   0.91      0.91          0.97          0.99          1.00
β3        1.00   0.41      0.34          0.60          0.82          0.88
β4        1.00   0.36      0.33          0.58          0.80          0.87
α1,2      0.00   0.14      0.08          0.25          0.57          0.19
α1,3      0.00   0.08      0.03          0.15          0.47          0.15
α1,4      0.00   0.06      0.02          0.13          0.44          0.14
α2,3      0.00   0.07      0.02          0.14          0.45          0.17
α2,4      0.00   0.06      0.02          0.13          0.43          0.11
α3,4      0.00   0.03      0.01          0.08          0.35          0.04

Case 2    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      1.00          1.00          1.00          1.00
β2        2.00   0.89      0.89          0.96          0.98          0.99
β3        1.00   0.38      0.33          0.59          0.81          0.85
β4        1.00   0.36      0.32          0.56          0.79          0.87
α1,2      1.00   0.53      0.31          0.55          0.79          0.86
α1,3      0.00   0.09      0.03          0.15          0.47          0.12
α1,4      0.00   0.08      0.02          0.14          0.44          0.17
α2,3      0.50   0.11      0.03          0.17          0.50          0.31
α2,4      0.40   0.12      0.03          0.15          0.45          0.24
α3,4      0.10   0.03      0.01          0.07          0.35          0.05

Case 3    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      0.84          0.95          0.99          1.00
β2        2.00   0.97      0.28          0.56          0.82          0.50
β3        1.00   0.95      0.22          0.48          0.77          0.29
β4        1.00   0.94      0.23          0.49          0.77          0.34
α1,2      7.00   0.96      0.10          0.37          0.74          0.47
α1,3      7.00   0.94      0.06          0.28          0.67          0.27
α1,4      7.00   0.94      0.08          0.29          0.68          0.32
α2,3      2.00   0.50      0.01          0.10          0.44          0.02
α2,4      2.00   0.55      0.01          0.10          0.44          0.06
α3,4      1.00   0.41      0.01          0.07          0.39          0.02

Case 4    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      0.46          0.73          0.93          0.85
β2        2.00   1.00      0.23          0.53          0.81          0.37
β3        1.00   1.00      0.22          0.51          0.79          0.31
β4        1.00   1.00      0.23          0.52          0.80          0.31
α1,2     14.00   1.00      0.03          0.23          0.67          0.25
α1,3     14.00   1.00      0.03          0.21          0.65          0.23
α1,4     14.00   1.00      0.04          0.23          0.66          0.21
α2,3      4.00   0.72      0.01          0.10          0.47          0.05
α2,4      4.00   0.81      0.01          0.11          0.48          0.07
α3,4      2.00   0.48      0.01          0.09          0.44          0.05
The first 40K samples are discarded as a burn-in, and every 100th sample after the burn-in is saved to estimate the posterior probabilities. Thus, we estimate the model using 600 MCMC samples from the posterior distribution. For C-Gibbs, we generate an MCMC chain of length 20K, where the first 10K samples are discarded as a burn-in and the remaining 10K samples are recorded. Here, we remark that a single iteration of C-Gibbs corresponds to approximately p²/2 iterations of HB-ShotG: HB-ShotG, in each MCMC iteration, randomly selects and updates a single component of γ, w, σ², τ² and ν², whereas an iteration of C-Gibbs updates all variables element-wise. Thus, in our numerical study, the actual number of updates of C-Gibbs is much larger than that of HB-ShotG. In applying C-Gibbs, we consider three values for the inclusion probability w_j, namely 0.2, 0.5, and 0.8, to investigate its effect on the selection of interaction terms, and assume that w_{j,k} = w_j for all j and k.

For each data set, we record the (empirical) posterior mean of γ, denoted by γ̄. Table 2 provides the averages of γ̄ for the coefficients in Table 1. We find that C-Gibbs is quite sensitive to the choice of w_j and w_{j,k}. If we select w_j = w_{j,k} = 0.2, it often mistakenly sets non-zero interaction coefficients to zero. On the other hand, the choice w_j = w_{j,k} = 0.8 tends to include interactions in the model even when their true coefficients are zero. Forward selection, as reported in the literature, tends to find a larger model than the true model, and it often includes interactions whose true coefficients are zero. Finally, we find that our hierarchical Bayes model performs well in terms of finding both the true non-zero coefficients and the zero coefficients in Cases 3 and 4, in which the interaction coefficients are as large as the main-term coefficients. For Cases 1 and 2, the proposed hierarchical model sometimes misses main terms or interactions whose coefficients are non-zero. This is because the proposed model inherently assumes that if interactions have a high probability of being in the model, then so do their main terms; this assumption is not well satisfied in Cases 1 and 2.

We obtain exactly the same findings with a different model selection criterion. Table 3 provides the ''inclusion'' proportions based on the ''median model'' criterion, which includes variables whose posterior inclusion probabilities exceed 0.5 (Barbieri & Berger, 2004). Fig. 1 presents ROC curves of HB-ShotG and C-Gibbs based on the median model criterion. Since forward selection usually shows the worst performance, it is not included in this comparison. While all ROC curves show reasonably good performance, the performance of C-Gibbs depends on the choice of w_{j,k}, and its computing time is slow. In summary, the results show that C-Gibbs is sensitive to the choice of w_{j,k}; forward selection, as usual, selects a larger model than the true model; and our proposed model performs well when the strong heredity is well satisfied, in the sense that if the interactions are in the model, then their main terms are likely to be in the model.

In addition to the model flexibility, in this paper we propose a fast MCMC procedure based on SSS. To illustrate its fast convergence, we compare HB-ShotG to C-Gibbs. To measure the convergence speed of each procedure, we monitor the Cesàro means of the γ values sampled by the two MCMC procedures, HB-ShotG and C-Gibbs.
Fig. 1. ROC curves of HB-ShotG and C-Gibbs based on the median model criteria.
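Two of the posterior summaries used above are easy to state in code: median-model selection (keep terms whose estimated inclusion probability exceeds 0.5) and the Cesàro (running) means used to monitor convergence. The sketch below assumes the recorded inclusion states are stacked in a 0/1 array with one row per saved iteration; the names are ours.

```python
import numpy as np

def median_model(gamma_samples):
    """Median-model selection: indices whose posterior inclusion probability
    (column mean of the sampled indicators) exceeds 0.5."""
    incl_prob = gamma_samples.mean(axis=0)
    return np.where(incl_prob > 0.5)[0], incl_prob

def cesaro_means(gamma_samples):
    """Running (Cesaro) means of the sampled indicators; flat traces indicate
    that the chain has stabilised."""
    k = np.arange(1, gamma_samples.shape[0] + 1)[:, None]
    return np.cumsum(gamma_samples, axis=0) / k
```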
For the comparison, in each method we generate an MCMC chain of length 20,000 and record samples every 100th iteration without a burn-in period; that is, if γ^(k) is the sampled vector in the kth iteration, then we record γ^(1), γ^(101), γ^(201), ..., γ^(19901). Fig. 2 shows the Cesàro means of the recorded γ values for Cases 3 and 4. In the panel ''CASE 3: HB-ShotG'', we can only observe five lines because the lines for the interactions X1X2 and X1X4 overlap. As expected, HB-ShotG converges much faster than C-Gibbs does.

For the sensitivity analysis, we focus on Case 3, which is of primary interest. We use Beta(2,2) and Beta(1,1) priors for the inclusion probabilities w_j (j = 1, 2, 5, 6). We generate five independent MCMC chains under each prior. Each chain has length 100K with a 40K burn-in, and every 100th sample is saved. After checking the convergence of the MCMC, we combine the samples from the five chains, resulting in 3000 posterior samples for computing posterior distributions. Fig. 3 presents the posterior distributions of the inclusion probabilities w_j (j = 1, 2, 5, 6); panels (a), (c), (e) and (g) are from the Beta(1,1) prior and panels (b), (d), (f) and (h) are from the Beta(2,2) prior. The solid line in each histogram represents the corresponding prior density. The choice of prior does not make a big difference in the posterior distributions; hence the posterior inference is robust to the choice of prior.

To check the scalability of HB-ShotG, we redo the Bayesian MCMC analysis with different numbers of predictors, p = 5 and p = 15, under Cases 3 and 4. We use the same simulation model described in (8) and Table 1 and compute the time to generate an MCMC chain of length 20K for each p. For p = 5, we simply drop the covariates X6, ..., X10, whose coefficients are 0. For p = 15, we include additional covariates X11, ..., X15 with zero coefficients. We repeat the simulation 30 times for each p. The analysis is carried out on an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10 GHz with 2 GB RAM, and we record the CPU time. In Fig. 4, each boxplot displays the distribution of the log computing time over the 30 replicates for each p. The computing time increases almost exponentially in p; hence HB-ShotG is not recommended for large p.

5. Case study

In this section, we apply our method to the prostate cancer data (Stamey et al., 1989), previously analyzed in Tibshirani (1996) and Yuan, Joseph, and Zou (2009). The data set consists of the medical records of 97 male patients who were about to receive a radical prostatectomy.
Table 3
The inclusion proportions based on the median model criterion. This table provides the inclusion proportions corresponding to the 10 variables in Table 1 based on the ''median model'' criterion. Bold indicates values exceeding 0.5. C-Gibbs(w) denotes C-Gibbs with prior inclusion probability w.

Case 1    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      1.00          1.00          1.00          1.00
β2        2.00   0.94      0.96          1.00          1.00          1.00
β3        1.00   0.39      0.20          0.65          1.00          0.88
β4        1.00   0.29      0.26          0.58          0.98          0.87
α1,2      0.00   0.01      0.00          0.05          0.73          0.19
α1,3      0.00   0.01      0.00          0.01          0.32          0.15
α1,4      0.00   0.00      0.00          0.00          0.30          0.14
α2,3      0.00   0.02      0.00          0.00          0.29          0.17
α2,4      0.00   0.00      0.00          0.00          0.27          0.11
α3,4      0.00   0.00      0.00          0.00          0.06          0.04

Case 2    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      1.00          1.00          1.00          1.00
β2        2.00   0.92      0.94          0.98          1.00          0.99
β3        1.00   0.35      0.22          0.62          1.00          0.85
β4        1.00   0.26      0.20          0.53          0.97          0.87
α1,2      1.00   0.51      0.25          0.50          0.97          0.86
α1,3      0.00   0.01      0.00          0.01          0.33          0.12
α1,4      0.00   0.01      0.00          0.00          0.33          0.17
α2,3      0.50   0.04      0.00          0.02          0.45          0.31
α2,4      0.40   0.07      0.00          0.01          0.32          0.24
α3,4      0.10   0.00      0.00          0.00          0.03          0.05

Case 3    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      0.89          1.00          1.00          1.00
β2        2.00   0.99      0.10          0.53          1.00          0.50
β3        1.00   0.98      0.07          0.27          1.00          0.29
β4        1.00   0.98      0.07          0.31          1.00          0.34
α1,2      7.00   0.98      0.02          0.22          1.00          0.47
α1,3      7.00   0.96      0.00          0.10          1.00          0.27
α1,4      7.00   0.96      0.03          0.11          0.99          0.32
α2,3      2.00   0.43      0.00          0.00          0.17          0.02
α2,4      2.00   0.55      0.00          0.01          0.21          0.06
α3,4      1.00   0.22      0.00          0.00          0.06          0.02

Case 4    True   HB-ShotG  C-Gibbs(0.2)  C-Gibbs(0.5)  C-Gibbs(0.8)  Forward
β1        7.00   1.00      0.39          0.87          1.00          0.85
β2        2.00   1.00      0.03          0.47          1.00          0.37
β3        1.00   1.00      0.03          0.30          1.00          0.31
β4        1.00   1.00      0.06          0.38          1.00          0.31
α1,2     14.00   1.00      0.01          0.06          0.97          0.25
α1,3     14.00   1.00      0.00          0.07          0.95          0.23
α1,4     14.00   1.00      0.01          0.07          0.96          0.21
α2,3      4.00   0.74      0.00          0.01          0.25          0.05
α2,4      4.00   0.89      0.00          0.02          0.32          0.07
α3,4      2.00   0.44      0.00          0.00          0.16          0.05
The explanatory variables are eight clinical measures (p = 8): log of cancer volume (lcavol; X1), log of prostate weight (lweight; X2), age (age; X3), log of benign prostatic hyperplasia amount (lbph; X4), seminal vesicle invasion (svi; X5), log of capsular penetration (lcp; X6), Gleason score (gleason; X7) and percentage of Gleason scores 4 or 5 (pgg45; X8). The response variable is the log of the level of prostate-specific antigen (lpsa; Y). Based on model (2), we consider the model with the 8 main terms and their two-way interactions:

y = β_0 + ∑_{j=1}^{8} β_j X_j + ∑_{j=1}^{7} ∑_{k=j+1}^{8} α_{j,k} X_j X_k + ϵ,   (9)
where ϵ ∼ N(0, σ²). Each explanatory variable is standardized, and the response variable is centered to have mean zero so that the intercept term no longer appears in the model.

For the prostate cancer data, we apply two methods, HB-ShotG and C-Gibbs, and generate MCMC chains of length 100K, constructed exactly as in the simulation study. In applying C-Gibbs, we consider two values of w_{j,k}, 0.5 and 0.8, to observe their effects on the variable selection. Hence, we consider three cases, denoted by HB-ShotG, C-Gibbs(0.5) and C-Gibbs(0.8), respectively. For each case, we generate four MCMC chains with different initial states, and the posterior sample for each case is the combination of the MCMC samples from the four chains.

Let γ̄ denote the posterior mean of γ over the MCMC samples. Table 4 lists the variables sorted according to the posterior mean γ̄ from each method. Under our method, the inclusion probabilities γ̄ of the three variables lcavol (X1), svi (X5) and lweight (X2) exceed 0.5. Tibshirani (1996) also selected the same main terms, but he did not consider interaction terms in his model. Concerning interaction terms, the two interactions lcavol:svi (X1X5) and lweight:svi (X2X5) have higher inclusion probabilities than the other four main terms. The nonnegative garrote with the strong heredity (Yuan et al., 2009) selected seven variables (lcavol, lweight, lbph, svi, gleason, gleason², lbph:svi) based on 10-fold cross validation. C-Gibbs estimated smaller inclusion probabilities than our method, even at w_j = 0.8 for 1 ≤ j ≤ p; further, under C-Gibbs the inclusion probabilities of the interactions are smaller than those of any main terms.

Fig. 5 presents the log posterior probabilities calculated over the MCMC chains. Let γ̂^MAP denote the maximum a posteriori (MAP) estimate, which maximizes π(γ | y) among the posterior samples, where
π(γ | y) = ∫ π(w, γ, σ², τ², ν² | y) dw dσ² dτ² dν²
         ≈ ∑_{w,σ²,τ²,ν²} π(w, γ, σ², τ², ν² | y)
         ∝ ∑_{w,σ²,τ²,ν²} Q(w, γ, σ², τ², ν²),

and the sums run over the recorded posterior samples of (w, σ², τ², ν²).
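In practice, the MAP model is the visited γ with the largest accumulated value of this sum. A minimal sketch (the names are ours; `log_q` is the kernel of (6) with the data held fixed):

```python
import numpy as np
from collections import defaultdict

def map_model(samples, log_q):
    """Approximate MAP inclusion vector: for each distinct gamma visited by the
    chain, accumulate Q(w, gamma, sigma2, tau2, nu2) over the recorded draws of
    the remaining parameters and return the gamma with the largest total.
    `samples` is an iterable of (gamma, w, sigma2, tau2, nu2) tuples."""
    totals = defaultdict(lambda: -np.inf)
    for gamma, w, sigma2, tau2, nu2 in samples:
        key = tuple(gamma)
        totals[key] = np.logaddexp(totals[key], log_q(gamma, w, sigma2, tau2, nu2))
    return max(totals, key=totals.get)
```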
Fig. 2. The Cesàro means of the sampled γ's in Cases 3 and 4. The four panels show the traces of the cumulative means of γj,k sampled from MCMC chains run under combinations of two methods (HB-ShotG and C-Gibbs) and two simulation settings (Cases 3 and 4). In each panel, there are six traces of γj,k, corresponding to the regression coefficients αj,k in Table 1; each trace is identified by the pair (j, k). In the panel ''CASE 3: HB-ShotG'', the two lines for γ1,2 and γ1,3 overlap. In the panel ''CASE 4: HB-ShotG'', the three lines for γ1,2, γ1,3 and γ1,4 overlap.
The MAP estimate γ̂^MAP from our model corresponds to the model with lcavol, svi and lweight. In Table 4, the superscripts R1 and R2 for each variable indicate that the corresponding variable belongs to the model with the highest or the second-highest posterior probability, respectively. Under our method, the variables corresponding to γ̂^MAP also have the highest inclusion probabilities.

Fig. 6 compares the prediction errors (PE) under each method. We evaluate the prediction errors by leave-one-out cross validation for growing models, adding variables to the regression model in the order in which they appear in Table 4 for each method. Our method outperforms the other methods, keeping the lowest prediction error overall. In particular, when the number of variables in the regression model is four, the prediction error of our method is smaller than those of all the other models.

6. Conclusion

In this paper, we have proposed a Bayesian model for variable selection in regression models including interaction terms. The proposed method automatically satisfies the strong heredity. In variable selection with higher order interactions, finding the neighborhood is crucial to implementing MCMC. For efficient MCMC convergence, we implemented a modified SSS algorithm to account for the strong heredity.
Fig. 3. The posterior distributions of the inclusion probabilities wj (j = 1, 2, 5, 6) under the Beta(1,1) and Beta(2,2) priors for Case 3. Panels (a)-(b), (c)-(d), (e)-(f) and (g)-(h) correspond to w1, w2, w5 and w6, respectively; the left panel of each pair uses the Beta(1,1) prior and the right panel the Beta(2,2) prior. The histograms show the posterior distributions, and the solid lines represent the pdfs of the Beta(1,1) and Beta(2,2) distributions, respectively.
Although Hans et al. (2007) briefly mentioned variable selection with interactions, they did not consider the strong heredity. We designed an algorithm to find the neighborhood in SSVS while incorporating the strong heredity. We demonstrated that our new method tends to find relevant variables more effectively when higher order interaction terms are considered.
Fig. 4. The log of running times (min) when p = 5, p = 10 and p = 15 under Cases 3 and 4.
Fig. 5. The plot of the log of the posterior probabilities from our method. Every 100th sample is plotted from an MCMC chain with a length of 100K after discarding the first 40K samples as burn-in.
The inclusion probability w_j of a covariate X_j was assumed fixed in other studies; in contrast, we provide a Bayesian estimate of w_j by placing a prior distribution on it. This makes the inference on γ_j robust to the initial choice of w_j. To compensate for the time spent on the inference of w_j, we drop β and work with the target distribution obtained by integrating the full posterior distribution over β. We suggest estimating β after variable selection.

Acknowledgments

Woncheol Jang's research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895, 2017R1A2B2012816). Yongdai Kim's research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2014R1A4A1007895).
Table 4
The comparison of variable selection under HB-ShotG and C-Gibbs. This table provides the posterior means γ̄ from the MCMC samples; γ̄j and γ̄j,k are sorted in descending order in each case. The superscripts R1 and R2 indicate that the corresponding variable belongs to the model with the highest and with the second-highest posterior probability, respectively.

Order  HB-ShotG               C-Gibbs (wj = 0.5)      C-Gibbs (wj = 0.8)
1      γ̄1 (1.000) R1,R2       γ̄1 (0.600) R2          γ̄1 (0.865) R2
2      γ̄5 (0.825) R1,R2       γ̄5 (0.202) R1          γ̄5 (0.504) R1,R2
3      γ̄2 (0.558) R1,R2       γ̄6 (0.171)             γ̄6 (0.478) R1,R2
4      γ̄4 (0.361)             γ̄4 (0.151) R2          γ̄8 (0.421) R2
5      γ̄1,5 (0.333) R2        γ̄8 (0.151) R1          γ̄4 (0.418)
6      γ̄2,5 (0.226) R2        γ̄2 (0.144) R2          γ̄2 (0.399) R1
7      γ̄8 (0.175)             γ̄7 (0.137)             γ̄7 (0.388) R1,R2
8      γ̄1,2 (0.174) R2        γ̄3 (0.130) R1,R2       γ̄3 (0.373)
9      γ̄7 (0.169)             γ̄1,5 (0.018)           γ̄1,5 (0.177)
10     γ̄4,5 (0.159)           γ̄1,6 (0.016)           γ̄1,6 (0.172)
11     γ̄6 (0.118)             γ̄1,8 (0.015)           γ̄1,8 (0.157)
12     γ̄1,4 (0.099)           γ̄1,2 (0.012) R2        γ̄1,4 (0.136)
13     γ̄2,4 (0.093)           γ̄1,4 (0.012)           γ̄1,7 (0.136)
14     γ̄3 (0.092)             γ̄1,7 (0.012)           γ̄1,2 (0.133)
15     γ̄5,7 (0.071)           γ̄1,3 (0.010) R2        γ̄1,3 (0.123)
16     γ̄5,8 (0.062)           γ̄5,6 (0.005)           γ̄5,6 (0.095) R2
17     γ̄1,7 (0.053)           γ̄6,8 (0.005)           γ̄2,5 (0.086)
18     γ̄1,8 (0.053)           γ̄4,5 (0.004)           γ̄6,8 (0.085) R2
19     γ̄2,7 (0.042)           γ̄5,7 (0.004)           γ̄5,7 (0.085)
20     γ̄1,6 (0.031)           γ̄2,5 (0.004)           γ̄4,5 (0.082)
Fig. 6. The prediction errors under HB-ShotG and C-Gibbs.
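The prediction errors in Fig. 6 are computed by leave-one-out cross-validation for the nested models obtained by adding variables in the order of Table 4. Below is a minimal sketch of such a computation, assuming an ordinary least-squares refit for every left-out observation (our assumption; the paper does not spell out these computational details).

```python
import numpy as np

def loo_prediction_errors(X, y, order):
    """Leave-one-out prediction error for the nested models obtained by adding
    the columns of X in the given order; returns one value per model size."""
    n = len(y)
    errors = []
    for m in range(1, len(order) + 1):
        Xm = np.column_stack([np.ones(n), X[:, list(order[:m])]])
        sq = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            coef, *_ = np.linalg.lstsq(Xm[keep], y[keep], rcond=None)
            sq += (y[i] - Xm[i] @ coef) ** 2
        errors.append(sq / n)
    return np.array(errors)
```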
Johan Lim's work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030810). Joungyoun Kim's work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1C1B5015192).
Appendix

A.1. Proof of properness in Theorem 1

We first show how to obtain the posterior distribution in (6). According to (4), the full posterior distribution π(β, w, γ, σ², τ², ν² | y) is proportional to Q⁰(w, γ, β, σ², τ², ν²), which can be written as

Q⁰(w, γ, β, σ², τ², ν²)
  = π(w) π(γ | w) π(β | γ, σ², τ², ν²) π(σ²) π(τ², ν² | σ²) × likelihood
  = (1/σⁿ) exp{ −(1/(2σ²)) (y − Xβ)ᵗ(y − Xβ) } (1/|D_γ|^{1/2}) exp{ −(1/2) βᵗ D_γ^{−1} β } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)
  = (1/σⁿ) (1/|D_γ|^{1/2}) exp{ −(1/(2σ²)) (y − Xβ)ᵗ(y − Xβ) − (1/2) βᵗ D_γ^{−1} β } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)
  = (1/σⁿ) (1/|D_γ|^{1/2}) exp{ −(1/2) (β − μ_γ)ᵗ (XᵗX/σ² + D_γ^{−1}) (β − μ_γ) }
    × exp{ −(1/(2σ²)) yᵗy + (1/2) μ_γᵗ (XᵗX/σ² + D_γ^{−1}) μ_γ }
    × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²).   (10)

Here

D_γ = diag(γ_j σ² τ² + (1 − γ_j) σ² ν²) = σ² diag(γ_j τ² + (1 − γ_j) ν²) = σ² D̄_γ,
μ_γ = (XᵗX/σ² + D_γ^{−1})^{−1} Xᵗy/σ² = (XᵗX + D̄_γ^{−1})^{−1} Xᵗy.

Next, we integrate π(β, w, γ, σ², τ², ν² | y) with respect to β and obtain

∫ π(β, w, γ, σ², τ², ν² | y) dβ
  ∝ (1/σⁿ) (1/|D_γ|^{1/2}) [ ∫ exp{ −(1/2) (β − μ_γ)ᵗ (XᵗX/σ² + D_γ^{−1}) (β − μ_γ) } dβ ]
    × exp{ −(1/(2σ²)) yᵗy + (1/2) μ_γᵗ (XᵗX/σ² + D_γ^{−1}) μ_γ } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)
  ∝ (1/σⁿ) (1/|D_γ|^{1/2}) (1/|XᵗX/σ² + D_γ^{−1}|^{1/2}) exp{ −(1/(2σ²)) yᵗy + (1/2) μ_γᵗ (XᵗX/σ² + D_γ^{−1}) μ_γ }
    × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)
  = (1/σⁿ) |XᵗX D̄_γ + I_q|^{−1/2} exp{ −(1/(2σ²)) yᵗy + (1/(2σ²)) μ_γᵗ (XᵗX + D̄_γ^{−1}) μ_γ }
    × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²).   (11)

We now show that the posterior distribution in (5) is proper. Let the design matrix X of the regression model have size n × q, with q = p + p(p − 1)/2. Let the singular value decomposition of X be X = UΛVᵗ with Λ = diag(λ_1, ..., λ_q), and let λ_min = min{λ_i, i = 1, ..., q}, which is assumed to be larger than a positive constant λ_0. For any two symmetric real matrices A and B, A ⪰ B means that A − B is non-negative definite.
Lemma 1 (Sherman–Morrison–Woodbury; Golub & Van Loan, 1996). Suppose that A is n × n and invertible, and that U, V and W are n × m, n × m and m × m matrices, respectively. Then

|A + UVᵗ| = |I_m + VᵗA^{−1}U| |A|   and   |A + UWVᵗ| = |W^{−1} + VᵗA^{−1}U| |W| |A|.

Lemma 2. Suppose that Z_1 and Z_2 are positive-definite symmetric real matrices and Z_1 ⪰ Z_2. Then |Z_1| ≥ |Z_2|.

Proof. Z_1 ⪰ Z_2 is equivalent to Z_2^{−1}Z_1 ⪰ I. Thus, all eigenvalues of Z_2^{−1}Z_1 are greater than or equal to 1, so |Z_2^{−1}Z_1| ≥ 1 and |Z_1| ≥ |Z_2|. □
Lemma 3. For any X of size n × q and y of size n × 1,

yᵗ{ I_n − X(XᵗX + D̄_γ^{−1})^{−1}Xᵗ } y ≥ yᵗ{ I_n − X(XᵗX)^{−1}Xᵗ } y.

Proof. It suffices to show that

X(XᵗX + D̄_γ^{−1})^{−1}Xᵗ ⪯ X(XᵗX)^{−1}Xᵗ,

which is equivalent to (XᵗX + D̄_γ^{−1})^{−1} ⪯ (XᵗX)^{−1}, and hence to

(XᵗX + D̄_γ^{−1}) − XᵗX ⪰ 0.

The above holds because D̄_γ ⪰ 0. □

From the lemma above, we have
(11) = (1/σⁿ) |XᵗX D̄_γ + I_q|^{−1/2} × exp{ A(γ, σ², τ², ν²) } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)
     ≤ (1/σⁿ) |XᵗX D̄_γ + I_q|^{−1/2} × exp{ −(1/(2σ²)) yᵗ{ I_n − X(XᵗX)^{−1}Xᵗ } y } × π(γ | w) π(w) π(σ²) π(τ², ν² | σ²)   (12)
     ≤ (1/σⁿ) |XᵗX D̄_γ + I_q|^{−1/2} × exp{ −(1/(2σ²)) yᵗ{ I_n − X(XᵗX)^{−1}Xᵗ } y } × π(w) π(σ²) π(τ², ν² | σ²),   (13)

where the last inequality follows from the definition of π(γ | w) and

A(γ, σ², τ², ν²) = −(1/(2σ²)) yᵗy + (1/(2σ²)) μ_γᵗ (XᵗX + D̄_γ^{−1}) μ_γ
                 = −(1/(2σ²)) yᵗ{ I_n − X(XᵗX + D̄_γ^{−1})^{−1}Xᵗ } y.

In (13), we can simply integrate out π(w). Finally, to show the properness of the posterior distribution, it suffices to show that

∑_{γ∈Γ} ∫_{σ²} (1/σ^{n−2}) B(γ, σ²) exp{ −C/(2σ²) } dσ² < ∞,

where Γ = { γ ∈ {0, 1}^{p+p(p−1)/2} : γ = (γ_1, ..., γ_p, γ_{1,2}, ..., γ_{p−1,p}) }, C = yᵗ{ I_n − X(XᵗX)^{−1}Xᵗ } y, and

B(γ, σ²) = ∫_{ν²<τ²} |XᵗX D̄_γ + I_q|^{−1/2} (1/(τ² + σ²)²) (1/(ν² + σ²)²) dτ² dν².

Thus, to show the properness, it suffices to show that, for each γ ∈ Γ, there exists a function h(σ²) such that B(γ, σ²) ≤ h(σ²). To find the upper envelope function h(σ²) of B(γ, σ²), we divide the domain of γ into three cases, (i) γ ∈ Γ_0, (ii) γ ∈ Γ_1, and (iii) γ ∈ Γ_2, where

Γ_0 = { γ : γ_i = 0 for all i and γ_{i,j} = 0 for all (i, j) },
Γ_1 = { γ : γ_i = 1 for all i and γ_{i,j} = 1 for all (i, j) },
Γ_2 = Γ − Γ_0 − Γ_1.
Lemma 4. Under the assumption of Theorem 1, there exists a function h(σ²) such that B(γ, σ²) ≤ h(σ²) for every γ ∈ Γ and

∫_{σ²} h(σ²) exp{ −C/(2σ²) } dσ² < ∞.

Proof. First, we show this for the case γ ∈ Γ_0. Using Lemma 1, we have

|XᵗX D̄_γ + I_q| = |XᵗX ν² I_q + I_q| = |ν² XᵗX + I_q|.

Thus,

∫_{ν²<τ²} |XᵗX D̄_γ + I_q|^{−1/2} (1/(τ²+σ²)²) (1/(ν²+σ²)²) dτ² dν²
  = ∫_0^∞ { ∫_{ν²}^∞ (1/(τ²+σ²)²) dτ² } |ν² XᵗX + I_q|^{−1/2} (1/(ν²+σ²)²) dν²
  = ∫_0^∞ |ν² XᵗX + I_q|^{−1/2} (1/(ν²+σ²)³) dν²
  ≤ ∫_0^∞ (1/(1 + ν² λ_min)^{q/2}) (1/(ν²+σ²)³) dν²
  ≤ (1/λ_min^{q/2}) ∫_0^∞ 1/(ν² + min(λ_min^{−1}, σ²))^{q/2+3} dν²
  = (1/λ_min^{q/2}) (2/(q+4)) (1/min(λ_min^{−1}, σ²)^{q/2+2}) := h_1(σ²).   (14)
Second, we consider the case γ ∈ Γ_1. For γ ∈ Γ_1, we have

|XᵗX D̄_γ + I_q| = |XᵗX τ² I_q + I_q| = |τ² XᵗX + I_q|.

Therefore,

∫_{ν²<τ²} |XᵗX D̄_γ + I_q|^{−1/2} (1/(τ²+σ²)²) (1/(ν²+σ²)²) dτ² dν²
  = ∫_0^∞ { ∫_0^{τ²} (1/(ν²+σ²)²) dν² } |τ² XᵗX + I_q|^{−1/2} (1/(τ²+σ²)²) dτ²
  ≤ (1/σ⁶) ∫_0^∞ |τ² XᵗX + I_q|^{−1/2} dτ²   (15)
  ≤ (1/σ⁶) ∫_0^∞ 1/(1 + τ² λ_min)^{q/2} dτ²
  = (2/((q − 2) λ_min)) (1/σ⁶) := h_2(σ²).

Here, the inequality in (15) follows from

∫_0^{τ²} (1/(ν²+σ²)²) dν² = 1/σ² − 1/(σ²+τ²) ≤ 1/σ²   and   1/(τ²+σ²)² ≤ 1/σ⁴.
.
Finally, we consider the case γ ∈ Γ2 . For γ ∈ Γ2 , we find that, from Lemmas 1 and 2,
⏐ ⏐ ⏐ ⏐ ⏐Iq + X t X D¯ γ ⏐ = ⏐Iq + X D¯ γ X t ⏐ ⏐ ⏐ = ⏐Iq + D¯ 1γ/2 X t X D¯ 1γ/2 ⏐ ⏐ ⏐ = ⏐Iq + D¯ 1γ/2 V Λ2 V t D¯ 1γ/2 ⏐ ⏐ ⏐ ( ) ≥ ⏐Iq + D¯ 1γ/2 V λ2min Iq V t D¯ 1γ/2 ⏐ Please cite this article in press as: Kim, J., et al., Bayesian variable selection with strong heredity constraints. Journal of the Korean Statistical Society (2018), https://doi.org/10.1016/j.jkss.2018.03.003.
16
J. Kim et al. / Journal of the Korean Statistical Society (
)
–
⏐ ) 1/2 ⏐ ( = ⏐Iq + D¯ 1γ/2 λ2min Iq D¯ γ ⏐ ) )q−1 ( ( 1 + λ2min τ 2 . ≥ 1 + λ2min ν 2 Thus, we have
∫ ν 2∫<τ 2
|X t X D¯γ + Iq |
≤ ν 2 <τ 2
∫ ≤ ν 2 <τ 2
1
−1/2
1
(τ 2 + σ 2 ) 2 (ν 2 + σ 2 ) 2
dτ 2 dν 2 1
(1 + λ2min ν 2 )−(q−1)/2 (1 + λ2min τ 2 )−1/2
×
1
(τ 2 + σ 2 ) 2 (ν 2 + σ 2 )2 ( ) 1 1 (1 + λ2min ν 2 )−(q−1)/2 (1 + λ2min τ 2 )−1/2 4 4 dτ 2 dν 2 := h3 σ 2 ,
dτ 2 dν 2
τ ν
which is a constant ( ) function { of(σ .) ( ) ( )} By letting h σ 2 = max h1 σ 2 , h2 σ 2 , h3 σ 2 , we complete the proof. □ 2
References

Barbieri, M. M., & Berger, J. O. (2004). Optimal predictive model selection. The Annals of Statistics, 32, 870–897.
Bien, J., Taylor, J., & Tibshirani, R. (2013). A Lasso for hierarchical interactions. The Annals of Statistics, 41, 1111–1141.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373–384.
Chipman, H. (1996). Bayesian variable selection with related predictors. The Canadian Journal of Statistics, 24, 17–36.
Choi, N. H., Li, W., & Zhu, J. (2010). Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105, 354–364.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Farcomeni, A. (2010). Bayesian constrained variable selection. Statistica Sinica, 20, 1043–1062.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881–889.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computation (3rd ed.). Baltimore: Johns Hopkins University Press.
Green, A., & Trichopoulos, D. (2002). Skin cancer. In H. O. Adami, D. Hunter, & D. Trichopoulos (Eds.), Textbook of cancer epidemiology (pp. 281–300). New York: Oxford University Press.
Hans, C., Dobra, A., & West, M. (2007). Shotgun stochastic search for ''large p'' regression. Journal of the American Statistical Association, 102, 507–516.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hung, R. J., Brennan, P., Malaveille, C., Porru, S., Donato, F., Boffetta, P., et al. (2004). Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiology, Biomarkers & Prevention, 13, 1013–1021.
Khoury, M. J., & Wacholder, S. (2009). Invited commentary: From genome-wide association studies to gene-environment-wide interaction studies - challenges and opportunities. American Journal of Epidemiology, 169, 227–230.
Madigan, D., & York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63, 215–232.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675.
Meinshausen, N., & Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436–1462.
Miller, A. (2002). Subset selection in regression (2nd ed.). London: Chapman and Hall/CRC.
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92, 179–191.
Scott, J. G., & Berger, J. O. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136, 2144–2162.
Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., et al. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate: II. Radical prostatectomy treated patients. Journal of Urology, 141, 1076–1083.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wu, C. F. J., & Hamada, M. (2000). Experiments: Planning, analysis and parameter design optimization. New York: Wiley.
Yuan, M., Joseph, V. R., & Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3, 1738–1757.
Zhao, P., Rocha, G., & Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37, 358–363.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.