
PII: S0304-4076(17)30144-6
DOI: http://dx.doi.org/10.1016/j.jeconom.2017.07.002
Reference: ECONOM 4403

To appear in: Journal of Econometrics

Received date: 24 May 2016
Revised date: 23 February 2017
Accepted date: 11 July 2017

Robust linear static panel data models using ε-contaminationI

Badi H. Baltagia,∗, Georges Bressonb, Anoop Chaturvedic, Guy Lacroixd

a Department of Economics and Center for Policy Research, Syracuse University, Syracuse, New York, U.S.A., and Department of Economics, Leicester University, Leicester, UK
b Université Paris II / Sorbonne Universités, France
c University of Allahabad, India
d Département d'économique, Université Laval, Québec, Canada

Abstract

The paper develops a general Bayesian framework for robust linear static panel data models using ε-contamination. A two-step approach is employed to derive the conditional type-II maximum likelihood (ML-II) posterior distribution of the coefficients and individual effects. The ML-II posterior means are weighted averages of the Bayes estimator under a base prior and the data-dependent empirical Bayes estimator. Two-stage and three-stage hierarchy estimators are developed and their finite sample performance is investigated through a series of Monte Carlo experiments. These include standard random effects as well as Mundlak-type, Chamberlain-type and Hausman-Taylor-type models. The simulation results underscore the relatively good performance of the three-stage hierarchy estimator. Within a single theoretical framework, our Bayesian approach encompasses a variety of specifications, while conventional methods require separate estimators for each case.

Keywords: ε-contamination, hyper-g priors, type-II maximum likelihood posterior density, panel data, robust Bayesian estimator, three-stage hierarchy.

JEL classification: C11, C23, C26.

I We are grateful to Richard Blundell, Matthew Harding, Franz Palm, participants at the 34th annual convention of the Indian Society for Probability and Statistics (ISPS), Tirupati, India, Dec. 02, 2014, participants at the 21st International Panel Data Conference, Budapest, Hungary, June 29-30, 2015, and at the research seminars of Maastricht University (The Netherlands), Université Laval (Canada) and the University of Munich (Germany) for helpful comments and suggestions. We also thank the editor Oliver Linton, the associate editor and two anonymous referees of the Journal of Econometrics for valuable comments and suggestions. The usual disclaimers apply.
∗ Corresponding author
Email addresses: [email protected] (Badi H. Baltagi), [email protected] (Georges Bresson), [email protected] (Anoop Chaturvedi), [email protected] (Guy Lacroix)

1. Introduction

The choice of which classic panel data estimator to use in a linear static regression framework depends upon the hypothesized correlation between the individual effects and the regressors. One usually assumes either of two polar situations. The random effects model assumes that the regressors and the individual effects are uncorrelated. On the other hand, the fixed effects model assumes that all the regressors are correlated with the individual effects (see e.g. Mundlak (1978); Chamberlain (1982)). An intermediate situation arises when both a subset of time-varying and time-invariant regressors are assumed correlated with the individual effects, in which case the instrumental variables estimator of Hausman and Taylor (1981) is an appropriate alternative.

To a Bayesian analyst, the distinction between fixed, random and mixed models boils down to the specification of the number of stages in a given hierarchical model. While intuitively more attractive, the Bayesian approach nevertheless rests upon hypothesized prior distributions (and possibly on their hyperparameters). The choice of specific distributions is often made out of convenience rather than being based upon relevant subjective information.1 Yet, it is well known that the estimators can be sensitive to misspecification of the latter. Fortunately, this difficulty can be partly circumvented by use of the robust Bayesian approach, which relies upon a class of prior distributions and selects an appropriate one in a data-dependent fashion. This paper studies the robustness of Bayesian panel data models to possible misspecification of the prior distribution in the spirit of the works of Good (1965), Dempster (1977), Rubin (1977), Hill (1980), Berger (1985), Berger and Berliner (1984) and Berger and Berliner (1986), to mention a few. In particular, it is concerned with what they call posterior robustness, which is different from robustness à la White (1980).
The idea is to acknowledge the prior uncertainty by specifying a class Γ of possible prior distributions and then investigating the robustness of the posterior distribution as the prior varies over Γ. Several classes of priors have been proposed in the literature, but the most commonly used one is the ε-contamination class. As mentioned by Berger (1985), the ε-contamination class of priors is particularly attractive to work with when investigating posterior robustness. The ε-contamination class combines the elicited prior, termed the base prior, with a contamination class of priors. This approach implements the Type-II maximum likelihood (ML-II) procedure for selecting the appropriate prior distribution. The primary benefit of using such a contamination class of prior distributions is that the resulting estimator performs well even if the elicited base prior distribution differs from the prior distribution that is consistent with the data.

The objective of our paper is to propose a robust Bayesian approach for linear static panel data models which departs from the standard Bayesian one in two ways. First, we consider the ε-contamination class of prior distributions for the model parameters (and for the individual effects). Second, both the base elicited priors and the ε-contamination priors use Zellner (1986)'s g-priors rather than the standard Wishart distributions for the variance-covariance matrices. The paper contributes to the panel data literature by proposing a general robust Bayesian framework which encompasses all the above mentioned conventional frequentist specifications and their associated estimation methods as special cases.

Section 2 gives the general framework, while Section 3 derives the Type-II maximum likelihood posterior mean and the variance-covariance matrix of the coefficients in a two-stage hierarchy model. Section 4 introduces a three-stage hierarchy with generalized hyper-g priors on the variance-covariance matrix of the individual effects. The predictive densities corresponding to the base priors and the ε-contaminated priors turn out to be Gaussian and Appell hypergeometric functions, respectively. The main difference between the two-stage and the three-stage hierarchy models pertains to the definition of the Bayes estimators, the empirical Bayes estimators and the weights of the ML-II posterior means. Section 5 investigates the finite sample performance of our robust Bayesian estimator through extensive Monte Carlo experiments. The simulation results underscore the relatively good performance of the three-stage hierarchy estimator as compared to the standard frequentist estimation methods. Two applications on an earnings equation and a crime model, available in the supplementary appendix, illustrate and confirm the simulation results. Section 6 gives our conclusion.

1 For instance, conventional proper priors in the normal linear model have been based on the conjugate Normal-Gamma family essentially because all the marginal likelihoods have closed-form solutions. Likewise, statisticians customarily assume that the variance-covariance matrix of the slope parameters follows a Wishart distribution because it is convenient from an analytical point of view.

2. The general setup

Let the Gaussian linear mixed model be written as:

y_it = X′_it β + W′_it b_i + u_it,  i = 1, ..., N,  t = 1, ..., T,   (1)

where X′_it is a (1 × K1) vector of explanatory variables including the intercept, and β is a (K1 × 1) vector of parameters. Furthermore, let W′_it denote a (1 × K2) vector of covariates and b_i a (K2 × 1) vector of parameters. The subscript i of b_i indicates that the model allows for heterogeneity on the W variables. Finally, u_it is a remainder term assumed to be normally distributed, i.e. u_it ∼ N(0, τ^{-1}). The distribution of u_it is parametrized in terms of its precision τ rather than its variance σ²_u (= 1/τ). In the statistics literature, the elements of β do not differ across i and are referred to as fixed effects whereas the b_i's are referred to as random effects.2 This terminology differs from the one used in econometrics. In the latter, the b_i's are treated either as random variables, and hence referred to as random effects, or as constant but unknown parameters and hence referred to as fixed effects. In line with the econometrics terminology, whenever the b_i's are assumed to be correlated (uncorrelated) with all the X_it's, they will be termed fixed (random) effects.3 In the Bayesian context, following the seminal papers of Lindley and Smith (1972) and Smith (1973), several authors have proposed a very general three-stage hierarchy framework to handle such models (see, e.g., Chib and Carlin (1999); Koop (2003); Chib (2008); Greenberg (2008); Zheng et al. (2008); Rendon (2013)):

First stage:  y = Xβ + Wb + u,  u ∼ N(0, Σ),  Σ = τ^{-1} I_{NT}
Second stage: β ∼ N(β0, Λ_β) and b ∼ N(b0, Λ_b)   (2)
Third stage:  Λ_b^{-1} ∼ Wish(ν_b, R_b) and τ ∼ G(·),

where y is (NT × 1), X is (NT × K1), W is (NT × K2), u is (NT × 1) and I_{NT} is a (NT × NT) identity matrix. The parameters depend upon hyperparameters which themselves follow random distributions. The second stage (also called fixed effects model in the Bayesian literature) updates the distribution of the parameters. The third stage (also called random effects model in the Bayesian literature) updates the distribution of the hyperparameters. As stated by Smith (1973) (page 67), "for the Bayesian model the distinction between fixed, random and mixed models, reduces to the distinction between different prior assignments in the second and third stages of the hierarchy". In other words, the fixed effects model is a model that does not have a third stage. The random effects model simply updates the distribution of the hyperparameters. The precision τ is assumed to follow a Gamma distribution and Λ_b^{-1} is assumed to follow a Wishart distribution with ν_b degrees of freedom and a hyperparameter matrix R_b which is generally chosen close to an identity matrix. In that case, the hyperparameters only concern the variance-covariance matrix of the b coefficients4 and the precision τ.

2 See Lindley and Smith (1972), Smith (1973), Laird and Ware (1982), Chib and Carlin (1999), Koop (2003), Chib (2008) and Greenberg (2008) to mention a few.
3 When we write fixed effects in italics, we refer to the terminology of the statistical or Bayesian literature. Conversely, when we write fixed effects (in normal characters), we refer to the terminology of panel data econometrics.

As is well known, Bayesian models may be sensitive to misspecification of the distributions of the priors. Conventional proper priors in the normal linear model have been based on the conjugate Normal-Gamma family because they allow closed-form calculations of all marginal likelihoods. Likewise, rather than specifying a Wishart distribution for the variance-covariance matrices as is customary, Zellner's g-prior (Λ_β = (τ g X′X)^{-1} for β or Λ_b = (τ h W′W)^{-1} for b) has been widely adopted because of its computational efficiency in evaluating marginal likelihoods and because of its simple interpretation as arising from the design matrix of observables in the sample. Since the calculation of marginal likelihoods using a mixture of g-priors involves only a one-dimensional integral, this approach provides an attractive computational solution that made the original g-priors popular while insuring robustness to misspecification of g (see Zellner (1986) and Fernández et al. (2001) to mention a few). To guard against misspecifying the distributions of the priors, many suggest considering classes of priors (see Berger (1985)).

3. The robust linear static model in the two-stage hierarchy

Following Berger (1985), Berger and Berliner (1984, 1986), Zellner (1986), Moreno and Pericchi (1993), Chaturvedi (1996), and Chaturvedi and Singh (2012) among others, we consider the ε-contamination class of prior distributions for (β, b, τ):

Γ = {π(β, b, τ | g0, h0) = (1 − ε) π0(β, b, τ | g0, h0) + ε q(β, b, τ | g0, h0)}.   (3)

π0(·) is the base elicited prior, q(·) is the contamination belonging to some suitable class Q of prior distributions, and 0 ≤ ε ≤ 1 is given and reflects the amount of error in π0(·). The precision τ is assumed to have a vague prior, p(τ) ∝ τ^{-1}, 0 < τ < ∞, and π0(β, b, τ | g0, h0) is the base prior assumed to be a specific g-prior with

β ∼ N(β0 ι_{K1}, (τ g0 Λ_X)^{-1}) with Λ_X = X′X,
b ∼ N(b0 ι_{K2}, (τ h0 Λ_W)^{-1}) with Λ_W = W′W,   (4)

where ι_{K1} is a (K1 × 1) vector of ones. Furthermore, β0, b0, g0 and h0 are known scalar hyperparameters of the base prior π0(β, b, τ | g0, h0). The probability density function (henceforth pdf) of the base prior π0(·) is given by:

π0(β, b, τ | g0, h0) = p(β | b, τ, β0, b0, g0, h0) × p(b | τ, b0, h0) × p(τ).   (5)

The possible class of contamination Q is defined as:

Q = { q(β, b, τ | g0, h0) = p(β | b, τ, β_q, b_q, g_q, h_q) × p(b | τ, b_q, h_q) × p(τ), with 0 < g_q ≤ g0, 0 < h_q ≤ h0 },   (6)

with

β ∼ N(β_q ι_{K1}, (τ g_q Λ_X)^{-1}),
b ∼ N(b_q ι_{K2}, (τ h_q Λ_W)^{-1}),   (7)

4 Note that in (2), the prior distributions of β and b are assumed to be independent, so Var[θ] is block-diagonal with θ = (β′, b′)′. The third stage can be extended by adding hyperparameters on the prior mean coefficients β0 and b0 and on the variance-covariance matrix of the β coefficients: β0 ∼ N(β00, Λ_{β0}), b0 ∼ N(b00, Λ_{b0}) and Λ_β^{-1} ∼ Wish(ν_β, R_β) (see for instance, Koop (2003); Greenberg (2008); Hsiao and Pesaran (2008); Bresson and Hsiao (2011)).
where β_q, b_q, g_q and h_q are unknown. The restrictions g_q ≤ g0 and h_q ≤ h0 imply that the base prior is the best possible, so that the precision of the base prior is greater than that of any prior belonging to the contamination class. The ε-contamination class of prior distributions for (β, b, τ) is then conditional on known g0 and h0, and two estimation strategies are possible:

1. a one-step estimation of the ML-II posterior distribution5 of β, b and τ;
2. or a two-step approach as follows6:
   (a) Let y* = (y − Wb). Derive the conditional ML-II posterior distribution of β given the specific effects b.
   (b) Let ỹ = (y − Xβ). Derive the conditional ML-II posterior distribution of b given the coefficients β.
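To make the class in (3)-(7) concrete, the following sketch (our own illustration, not the authors' code; the design matrix, τ, ε and all hyperparameter values are arbitrary) evaluates one ε-contaminated prior density for β, mixing the base g-prior with a single contaminating g-prior:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Arbitrary design: 100 observations, K1 = 3 regressors (with intercept)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
Lambda_X = X.T @ X                      # Λ_X = X'X, as in (4)

tau, eps = 1.0, 0.5                     # precision and contamination weight
beta0, g0 = 0.0, 10.0                   # base-prior hyperparameters
betaq, gq = 0.5, 2.0                    # contamination hyperparameters (gq ≤ g0)

K1 = X.shape[1]
iota = np.ones(K1)

def g_prior_pdf(beta, location, g):
    """g-prior density N(location · ι_K1, (τ g Λ_X)^{-1}) evaluated at beta."""
    cov = np.linalg.inv(tau * g * Lambda_X)
    return multivariate_normal.pdf(beta, mean=location * iota, cov=cov)

def contaminated_prior(beta):
    """One member of the ε-contamination class (3): (1 − ε) π0 + ε q."""
    return (1 - eps) * g_prior_pdf(beta, beta0, g0) + eps * g_prior_pdf(beta, betaq, gq)

print(contaminated_prior(np.zeros(K1)))
```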

We use the two-step approach because it simplifies the derivation of the predictive densities (or marginal likelihoods). In the one-step approach, the pdf of y and the pdf of the base prior π0(β, b, τ | g0, h0) need to be combined to get the predictive density. It thus leads to a complicated expression whose integration with respect to (β, b, τ) may be involved. Using a two-step approach, we can integrate first with respect to (β, τ) given b and then, conditional on β, we can next integrate with respect to (b, τ). Thus, the marginal likelihoods (or predictive densities) corresponding to the base priors are:

m(y* | π0, b, g0) = ∫_0^∞ ∫_{R^{K1}} π0(β, τ | g0) × p(y* | X, b, τ) dβ dτ

and

m(ỹ | π0, β, h0) = ∫_0^∞ ∫_{R^{K2}} π0(b, τ | h0) × p(ỹ | W, β, τ) db dτ,

5 “We consider the most commonly used method of selecting a hopefully robust prior in Γ, namely choice of that prior π which maximizes the marginal likelihood m(y | π) over Γ. This process is called Type II maximum likelihood by Good (1965)” (Berger and Berliner, 1986, page 463).
6 We will see that the mean of the ML-II posterior of β (resp. of b) depends on draws from multivariate t-distributions. The two steps can thus be viewed as a Gibbs sampler.


with

π0(β, τ | g0) = (τ g0 / 2π)^{K1/2} τ^{-1} |Λ_X|^{1/2} exp(−(τ g0/2)(β − β0 ι_{K1})′ Λ_X (β − β0 ι_{K1})),
π0(b, τ | h0) = (τ h0 / 2π)^{K2/2} τ^{-1} |Λ_W|^{1/2} exp(−(τ h0/2)(b − b0 ι_{K2})′ Λ_W (b − b0 ι_{K2})).

Solving these equations is considerably easier than solving the equivalent expression in the one-step approach.

3.1. The first step of the robust Bayesian estimator

Let y* = y − Wb. Combining the pdf of y* and the pdf of the base prior, we get the predictive density corresponding to the base prior7:

m(y* | π0, b, g0) = ∫_0^∞ ∫_{R^{K1}} π0(β, τ | g0) × p(y* | X, b, τ) dβ dτ
                 = H̃ (g0/(g0 + 1))^{K1/2} [1 + (g0/(g0 + 1)) (R²_{β0}/(1 − R²_{β0}))]^{−NT/2},   (8)

with H̃ = Γ(NT/2)/(π^{NT/2} v(b)^{NT/2}),

R²_{β0} = (β̂(b) − β0 ι_{K1})′ Λ_X (β̂(b) − β0 ι_{K1}) / [(β̂(b) − β0 ι_{K1})′ Λ_X (β̂(b) − β0 ι_{K1}) + v(b)],

β̂(b) = Λ_X^{-1} X′ y* and v(b) = (y* − X β̂(b))′ (y* − X β̂(b)), and where Γ(·) is the Gamma function.
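The ingredients of (8) are ordinary least-squares objects, so the predictive density is directly computable. A minimal numerical sketch (ours, on simulated data; g0 and β0 are arbitrary) computes β̂(b), v(b), R²_{β0} and log m(y* | π0, b, g0) up to the constant log H̃, working in logs to avoid overflow for large NT:

```python
import numpy as np

rng = np.random.default_rng(1)
NT, K1 = 200, 3
X = np.column_stack([np.ones(NT), rng.normal(size=(NT, 2))])
y_star = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=NT)  # stands in for y* = y − Wb

g0, beta0 = 10.0, 0.0                 # arbitrary base-prior hyperparameters
Lambda_X = X.T @ X
iota = np.ones(K1)

beta_hat = np.linalg.solve(Lambda_X, X.T @ y_star)   # β̂(b) = Λ_X^{-1} X'y*
resid = y_star - X @ beta_hat
v_b = resid @ resid                                  # v(b)

d = beta_hat - beta0 * iota
quad = d @ Lambda_X @ d
R2_beta0 = quad / (quad + v_b)                       # R²_{β0} in (8)

# log m(y*|π0, b, g0) up to the constant log H̃
log_m = (K1 / 2) * np.log(g0 / (g0 + 1)) \
        - (NT / 2) * np.log1p((g0 / (g0 + 1)) * R2_beta0 / (1 - R2_beta0))
print(R2_beta0, log_m)
```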

Likewise, we can obtain the predictive density corresponding to the contaminated prior for the distribution q(β, τ | g0, h0) ∈ Q from the class Q of possible contamination distributions:

m(y* | q, b, g0) = H̃ (g_q/(g_q + 1))^{K1/2} [1 + (g_q/(g_q + 1)) (R²_{βq}/(1 − R²_{βq}))]^{−NT/2},   (9)

where

R²_{βq} = (β̂(b) − β_q ι_{K1})′ Λ_X (β̂(b) − β_q ι_{K1}) / [(β̂(b) − β_q ι_{K1})′ Λ_X (β̂(b) − β_q ι_{K1}) + v(b)].

As the ε-contamination of the prior distributions for (β, τ) is defined by π(β, τ | g0) = (1 − ε) π0(β, τ | g0) + ε q(β, τ | g0), the corresponding predictive density is given by:

m(y* | π, b, g0) = (1 − ε) m(y* | π0, b, g0) + ε m(y* | q, b, g0)

and

sup_{π∈Γ} m(y* | π, b, g0) = (1 − ε) m(y* | π0, b, g0) + ε sup_{q∈Q} m(y* | q, b, g0).

7 Derivation of all the following expressions can be found in the supplementary appendix.

The maximization of m(y* | π, b, g0) requires the maximization of m(y* | q, b, g0) with respect to β_q and g_q. The first-order conditions lead to

β̂_q = (ι′_{K1} Λ_X ι_{K1})^{-1} ι′_{K1} Λ_X β̂(b)   (10)

and

ĝ_q = min(g0, g*),   (11)

with

g* = max{ [((NT − K1)/K1) ((β̂(b) − β̂_q ι_{K1})′ Λ_X (β̂(b) − β̂_q ι_{K1}) / v(b)) − 1]^{-1}, 0 }
   = max{ [((NT − K1)/K1) (R²_{β̂q}/(1 − R²_{β̂q})) − 1]^{-1}, 0 }.

Denote sup_{q∈Q} m(y* | q, b, g0) = m(y* | q̂, b, g0). Then

m(y* | q̂, b, g0) = H̃ (ĝ_q/(ĝ_q + 1))^{K1/2} [1 + (ĝ_q/(ĝ_q + 1)) (R²_{β̂q}/(1 − R²_{β̂q}))]^{−NT/2}.
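Since (10) and (11) are closed forms, no numerical optimizer is needed for the empirical Bayes quantities. A sketch (ours, on simulated data; g0 arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
NT, K1, g0 = 200, 3, 10.0
X = np.column_stack([np.ones(NT), rng.normal(size=(NT, 2))])
y_star = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=NT)

Lambda_X = X.T @ X
iota = np.ones(K1)
beta_hat = np.linalg.solve(Lambda_X, X.T @ y_star)   # β̂(b)
v_b = np.sum((y_star - X @ beta_hat) ** 2)           # v(b)

# (10): β̂_q = (ι'Λ_X ι)^{-1} ι'Λ_X β̂(b), a scalar location for the contamination
beta_q = (iota @ Lambda_X @ beta_hat) / (iota @ Lambda_X @ iota)

# (11): ĝ_q = min(g0, g*) with g* = max{[((NT−K1)/K1)·R²/(1−R²) − 1]^{-1}, 0}
d = beta_hat - beta_q * iota
quad = d @ Lambda_X @ d
ratio = ((NT - K1) / K1) * quad / v_b   # R²_{β̂q}/(1 − R²_{β̂q}) = quad/v(b)
g_star = 1.0 / (ratio - 1.0) if ratio > 1.0 else 0.0
g_q = min(g0, g_star)
print(beta_q, g_q)
```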

Let π0*(β, τ | g0) denote the posterior density of (β, τ) based upon the prior π0(β, τ | g0). Also, let q*(β, τ | g0) denote the posterior density of (β, τ) based upon the prior q(β, τ | g0). The ML-II posterior density of β is thus given by:

π̂*(β | g0) = ∫_0^∞ π̂*(β, τ | g0) dτ
           = λ̂_{β,g0} ∫_0^∞ π0*(β, τ | g0) dτ + (1 − λ̂_{β,g0}) ∫_0^∞ q*(β, τ | g0) dτ
           = λ̂_{β,g0} π0*(β | g0) + (1 − λ̂_{β,g0}) q̂*(β | g0)   (12)

with

λ̂_{β,g0} = { 1 + (ε/(1 − ε)) [ (ĝ_q/(ĝ_q + 1))^{K1/2} / (g0/(g0 + 1))^{K1/2} ] [ (1 + (g0/(g0 + 1))(R²_{β0}/(1 − R²_{β0}))) / (1 + (ĝ_q/(ĝ_q + 1))(R²_{β̂q}/(1 − R²_{β̂q}))) ]^{NT/2} }^{-1}.

Note that λ̂_{β,g0} depends upon the ratio of R²_{β0} and R²_{β̂q}, but primarily on the sample size NT. Indeed, λ̂_{β,g0} tends to 0 when R²_{β0} > R²_{β̂q} and tends to 1 when R²_{β0} < R²_{β̂q}, irrespective of the model fit (i.e., the absolute values of R²_{β0} or R²_{β̂q}). Only the relative values of R²_{β̂q} and R²_{β0} matter.

It can be shown that π0*(β | g0) is the pdf (see the supplementary appendix) of a multivariate t-distribution with mean vector β*(b | g0), variance-covariance matrix ξ_{0,β} M^{-1}_{0,β}/(NT − 2) and degrees of freedom (NT) with

M_{0,β} = (g0 + 1) Λ_X / v(b) and ξ_{0,β} = 1 + (g0/(g0 + 1)) (R²_{β0}/(1 − R²_{β0})).   (13)

β*(b | g0) is the Bayes estimate of β for the prior distribution π0(β, τ):

β*(b | g0) = (β̂(b) + g0 β0 ι_{K1}) / (g0 + 1).   (14)

Likewise, q̂*(β) is the pdf of a multivariate t-distribution with mean vector β̂_EB(b | g0), variance-covariance matrix ξ_{q,β} M^{-1}_{q,β}/(NT − 2) and degrees of freedom (NT) with

ξ_{q,β} = 1 + (ĝ_q/(ĝ_q + 1)) (R²_{β̂q}/(1 − R²_{β̂q})) and M_{q,β} = ((ĝ_q + 1)/v(b)) Λ_X,   (15)

where β̂_EB(b | g0) is the empirical Bayes estimator of β for the contaminated prior distribution q(β, τ) given by:

β̂_EB(b | g0) = (β̂(b) + ĝ_q β̂_q ι_{K1}) / (ĝ_q + 1).   (16)

The mean of the ML-II posterior density of β is then:

β̂_{ML-II} = E[π̂*(β | g0)]
          = λ̂_{β,g0} E[π0*(β | g0)] + (1 − λ̂_{β,g0}) E[q̂*(β | g0)]   (17)
          = λ̂_{β,g0} β*(b | g0) + (1 − λ̂_{β,g0}) β̂_EB(b | g0).
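Putting the two-stage pieces together, the weighted average in (17) takes only a few lines to compute. The sketch below (ours; simulated data, arbitrary ε, g0 and β0) evaluates λ̂_{β,g0} on the log scale, using R²/(1 − R²) = quadratic form / v(b), and forms the shrinkage mean:

```python
import numpy as np

rng = np.random.default_rng(3)
NT, K1 = 200, 3
eps, g0, beta0 = 0.5, 10.0, 0.0      # contamination weight and base-prior values

X = np.column_stack([np.ones(NT), rng.normal(size=(NT, 2))])
y_star = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=NT)  # stands in for y* = y − Wb

Lambda_X = X.T @ X
iota = np.ones(K1)
beta_hat = np.linalg.solve(Lambda_X, X.T @ y_star)   # β̂(b)
v_b = np.sum((y_star - X @ beta_hat) ** 2)           # v(b)

def F_of(center):
    """R²/(1 − R²) for prior location `center`: quadratic form over v(b)."""
    dvec = beta_hat - center * iota
    return (dvec @ Lambda_X @ dvec) / v_b

# empirical-Bayes location (10) and ĝ_q from (11)
beta_q = (iota @ Lambda_X @ beta_hat) / (iota @ Lambda_X @ iota)
ratio = ((NT - K1) / K1) * F_of(beta_q)
g_q = min(g0, 1.0 / (ratio - 1.0)) if ratio > 1.0 else 0.0

# λ̂_{β,g0} from (12), computed in logs for numerical stability
log_ratio = (K1 / 2) * (np.log(g_q / (g_q + 1)) - np.log(g0 / (g0 + 1))) \
            + (NT / 2) * (np.log1p(g0 / (g0 + 1) * F_of(beta0))
                          - np.log1p(g_q / (g_q + 1) * F_of(beta_q)))
lam = 1.0 / (1.0 + eps / (1 - eps) * np.exp(log_ratio))

# (14), (16) and the shrinkage mean (17)
beta_star = (beta_hat + g0 * beta0 * iota) / (g0 + 1)    # Bayes estimate
beta_EB = (beta_hat + g_q * beta_q * iota) / (g_q + 1)   # empirical Bayes estimate
beta_ML2 = lam * beta_star + (1 - lam) * beta_EB
print(lam, beta_ML2)
```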

The ML-II posterior density of β, given b and g0, is a shrinkage estimator. It is a weighted average of the Bayes estimator β*(b | g0) under the base prior g0 and the data-dependent empirical Bayes estimator β̂_EB(b | g0). If the base prior is consistent with the data, the weight λ̂_{β,g0} → 1 and the ML-II posterior density of β gives more weight to the posterior π0*(β | g0) derived from the elicited prior. In this case β̂_{ML-II} is close to the Bayes estimator β*(b | g0). Conversely, if the base prior is not consistent with the data, the weight λ̂_{β,g0} → 0 and the ML-II posterior density of β is then close to the posterior q̂*(β | g0) and to the empirical Bayes estimator β̂_EB(b | g0). The ability of the ε-contamination model to extract more information from the data is what makes it superior to the classical Bayes estimator based on a single base prior. Following Berger (1985, page 207), we derive the analytical ML-II posterior variance-covariance matrix of β in the supplementary appendix.

3.2. The second step of the robust Bayesian estimator

Let ỹ = y − Xβ. Moving along the lines of the first step, the ML-II posterior density of b is given by:

π̂*(b | h0) = λ̂_{b,h0} π0*(b | h0) + (1 − λ̂_{b,h0}) q̂*(b | h0)

with

λ̂_{b,h0} = { 1 + (ε/(1 − ε)) [ (ĥ_q/(ĥ_q + 1))^{K2/2} / (h0/(h0 + 1))^{K2/2} ] [ (1 + (h0/(h0 + 1))(R²_{b0}/(1 − R²_{b0}))) / (1 + (ĥ_q/(ĥ_q + 1))(R²_{b̂q}/(1 − R²_{b̂q}))) ]^{NT/2} }^{-1},

where

R²_{b0} = (b̂(β) − b0 ι_{K2})′ Λ_W (b̂(β) − b0 ι_{K2}) / [(b̂(β) − b0 ι_{K2})′ Λ_W (b̂(β) − b0 ι_{K2}) + v(β)],

R²_{b̂q} = (b̂(β) − b̂_q ι_{K2})′ Λ_W (b̂(β) − b̂_q ι_{K2}) / [(b̂(β) − b̂_q ι_{K2})′ Λ_W (b̂(β) − b̂_q ι_{K2}) + v(β)],

with b̂(β) = Λ_W^{-1} W′ ỹ and v(β) = (ỹ − W b̂(β))′ (ỹ − W b̂(β)),

b̂_q = (ι′_{K2} Λ_W ι_{K2})^{-1} ι′_{K2} Λ_W b̂(β)

and

ĥ_q = min(h0, h*),

with

h* = max{ [((NT − K2)/K2) ((b̂(β) − b̂_q ι_{K2})′ Λ_W (b̂(β) − b̂_q ι_{K2}) / v(β)) − 1]^{-1}, 0 }
   = max{ [((NT − K2)/K2) (R²_{b̂q}/(1 − R²_{b̂q})) − 1]^{-1}, 0 }.

π0*(b | h0) is the pdf of a multivariate t-distribution with mean vector b*(β | h0), variance-covariance matrix ξ_{0,b} M^{-1}_{0,b}/(NT − 2) and degrees of freedom (NT) with

M_{0,b} = (h0 + 1) Λ_W / v(β) and ξ_{0,b} = 1 + (h0/(h0 + 1)) (b̂(β) − b0 ι_{K2})′ Λ_W (b̂(β) − b0 ι_{K2}) / v(β).

b*(β | h0) is the Bayes estimate of b for the prior distribution π0(b, τ | h0):

b*(β | h0) = (b̂(β) + h0 b0 ι_{K2}) / (h0 + 1).

q̂*(b | h0) is the pdf of a multivariate t-distribution with mean vector b̂_EB(β | h0), variance-covariance matrix ξ_{1,b} M^{-1}_{1,b}/(NT − 2) and degrees of freedom (NT) with

ξ_{1,b} = 1 + (ĥ_q/(ĥ_q + 1)) (b̂(β) − b̂_q ι_{K2})′ Λ_W (b̂(β) − b̂_q ι_{K2}) / v(β) and M_{1,b} = ((ĥ_q + 1)/v(β)) Λ_W,

and where b̂_EB(β | h0) is the empirical Bayes estimator of b for the contaminated prior distribution q(b, τ | h0):

b̂_EB(β | h0) = (b̂(β) + ĥ_q b̂_q ι_{K2}) / (ĥ_q + 1).

The mean of the ML-II posterior density of b is hence given by:

b̂_{ML-II} = λ̂_{b,h0} b*(β | h0) + (1 − λ̂_{b,h0}) b̂_EB(β | h0).

The ML-II posterior variance-covariance matrix of b can be derived in a similar fashion8 to that of β̂_{ML-II}.

3.3. Estimating the ML-II posterior variance-covariance matrix

Many have raised concerns about the unbiasedness of the posterior variance-covariance matrices of β̂_{ML-II} and b̂_{ML-II}. Indeed, they will both be biased towards zero as λ̂_{β,g0} and λ̂_{b,h0} → 0 and converge to the empirical variance, which is known to underestimate the true variance (see e.g. Berger and Berliner (1986); Gilks et al. (1997); Robert (2007)). Consequently, the assessment of the performance of either β̂_{ML-II} or b̂_{ML-II} using standard quadratic loss functions cannot be conducted using the analytical expressions. What is needed is an unbiased estimator of the true ML-II variances. In what follows, we propose two different strategies to approximate these, each with different desirable properties.

3.3.1. MCMC with multivariate t-distributions

Recall that the ML-II posterior densities of β and b, π̂*(β | g0) and π̂*(b | h0), are both multivariate t-distributions. We thus propose to use the following algorithm to approximate the variance matrices once the model has been estimated:

1. Loop over D draws.
2. In the first step of the model, draw a set of K1 values from the multivariate t-distributions π0*(β | g0) and q̂*(β) to get a (K1 × 1) vector of β_d as

π̂*(β | g0) = λ̂_{β,g0} π0*(β | g0) + (1 − λ̂_{β,g0}) q̂*(β | g0).

3. In the second step of the model, draw a set of N values from the multivariate t-distributions π0*(b | h0) and from q*(b | h0) to get a (N × 1) vector of b_d as

π̂*(b | h0) = λ̂_{b,h0} π0*(b | h0) + (1 − λ̂_{b,h0}) q̂*(b | h0).

4. Once the D draws are completed, use D* (= D − D_burn) draws to compute

β̂_{ML-II} = E[β^{(D*)}],  σ̂_{β,ML-II} = sqrt(diag(Var[β^{(D*)}])),
b̂_{ML-II} = E[b^{(D*)}],  σ̂_{b,ML-II} = sqrt(diag(Var[b^{(D*)}])),

where D_burn are the burn-in draws.
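Step 2 of this algorithm only requires draws from two multivariate t-distributions mixed with weights λ̂ and 1 − λ̂. A generic sketch (ours; the means, scale matrices and weight below are placeholders for β*(b | g0), β̂_EB(b | g0), ξ M^{-1}/(NT − 2) and λ̂_{β,g0} derived above):

```python
import numpy as np

rng = np.random.default_rng(4)
K1, NT = 3, 200
D, D_burn = 5000, 1000

# Placeholder ML-II ingredients (illustrative values only)
lam = 0.7                                # stand-in for λ̂_{β,g0}
mean0 = np.zeros(K1)                     # stand-in for β*(b|g0)
meanq = 0.5 * np.ones(K1)                # stand-in for β̂_EB(b|g0)
scale0 = scaleq = np.eye(K1) / (NT - 2)  # stand-ins for ξ M^{-1}/(NT − 2)

def mvt_draw(mean, scale, df):
    """One multivariate t draw via the normal/chi-square mixture representation."""
    z = rng.multivariate_normal(np.zeros(len(mean)), scale)
    w = rng.chisquare(df) / df
    return mean + z / np.sqrt(w)

draws = np.empty((D, K1))
for d in range(D):
    # mixture λ̂ π0* + (1 − λ̂) q̂*, both multivariate t with NT degrees of freedom
    if rng.uniform() < lam:
        draws[d] = mvt_draw(mean0, scale0, NT)
    else:
        draws[d] = mvt_draw(meanq, scaleq, NT)

kept = draws[D_burn:]                    # keep D* = D − D_burn draws
beta_ML2 = kept.mean(axis=0)
sd_ML2 = kept.std(axis=0, ddof=1)        # approximation of the ML-II posterior sd
print(beta_ML2, sd_ML2)
```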

8 See the supplementary appendix.


3.3.2. Block resampling bootstrap

As an alternative to MCMC simulations, we follow Laird and Louis (1987) and directly bootstrap the data. In the context of panel data, various strategies may be used, and chief among them is the so-called individual block resampling bootstrap (see, e.g. Bellman et al. (1989); Andersson and Karlsson (2001); Kapetanios (2008)). Thus for an (N × T) matrix Y, individual block resampling consists in drawing an (N × T) matrix Y^BR whose rows are obtained by resampling those of Y with replacement. Conditionally on Y, the rows of Y^BR are independent and identically distributed.9 We use such a procedure for all the dependent Y and explanatory X variables. Fortunately, we need as few as BR = 20 bootstrap samples to achieve acceptable results.10 The following algorithm is used to approximate the variance matrices:

1. Loop over BR samples.
2. In the first step, compute the mean of the ML-II posterior density of β using our initial shrinkage procedure:

β̂_{ML-II,br} = E[π̂*(β | g0)] = λ̂_{β,g0} β*(b | g0) + (1 − λ̂_{β,g0}) β̂_EB(b | g0).

3. In the second step, compute the mean of the ML-II posterior density of b:

b̂_{ML-II,br} = λ̂_{b,h0} b*(β | h0) + (1 − λ̂_{b,h0}) b̂_EB(β | h0).

4. Once the BR bootstraps are completed, use the (K1 × BR) matrix of coefficients β^{(BR)} and the (N × BR) matrix of coefficients b^{(BR)} to compute:

β̂_{ML-II} = E[β^{(BR)}],  σ̂_{β,ML-II} = sqrt(diag(Var[β^{(BR)}])),
b̂_{ML-II} = E[b^{(BR)}],  σ̂_{b,ML-II} = sqrt(diag(Var[b^{(BR)}])).
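The resampling step itself is a one-liner on the individual index. A minimal sketch (ours; pooled OLS stands in for the ML-II shrinkage estimator inside the loop, and all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, K1, BR = 50, 5, 3, 20

# Panel arrays: Y is (N, T); X is (N, T, K1)
X = rng.normal(size=(N, T, K1))
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=(N, T))

def pooled_ols(Yb, Xb):
    """Placeholder estimator; in the paper this is the ML-II shrinkage step."""
    Xf = Xb.reshape(N * T, K1)
    yf = Yb.reshape(N * T)
    return np.linalg.solve(Xf.T @ Xf, Xf.T @ yf)

betas = np.empty((BR, K1))
for br in range(BR):
    idx = rng.integers(0, N, size=N)        # draw N individuals with replacement
    betas[br] = pooled_ols(Y[idx], X[idx])  # rows resampled as whole blocks

beta_ML2 = betas.mean(axis=0)
sd_ML2 = betas.std(axis=0, ddof=1)          # bootstrap standard errors
print(beta_ML2, sd_ML2)
```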

4. The robust linear static model in the three-stage hierarchy

As stressed earlier, the Bayesian literature introduces a third stage in the hierarchical model in order to discriminate between fixed effects and random effects. Hyperparameters can be defined for the mean and the variance-covariance of b (and sometimes β). Our goal in this paper is to consider a contamination class of priors to account for uncertainty pertaining to the base prior π0(β, b, τ), i.e., uncertainty about the prior means of the base prior. Consequently, assuming hyper priors for the means β0 and b0 of the base prior is tantamount to assuming the mean of the base prior to be unknown, which is contrary to our initial assumption. Following Chib and Carlin (1999), Chib (2008), Greenberg (2008), Zheng et al. (2008) among others, hyperparameters only concern the variance-covariance matrix of the b coefficients. Because we use g-priors in the second stage for β and b, g0 is kept fixed and assumed known. We need only define mixtures of g-priors on the precision matrix of b, or equivalently on h0. Maruyama and George (2011, 2014) have proposed the following generalized hyper-g prior:

p(g) = g^{c−1} (1 + g)^{−(c+d)} / B(c, d),  c > 0, d > 0,   (18)

9 Note that this assumes homoskedasticity as defined in (2). We do not account for heteroskedasticity in the block bootstrap.
10 For convenience, the number of bootstrap samples BR is relatively small compared to the sample size N. Increasing the number of bootstrap samples does not change the results but increases the computation time considerably. Note also that this approach is many times less computationally intensive than both the 2S with MCMC on the multivariate t-distributions and the full Bayesian estimator.
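The density in (18) is a Beta-prime, so it can be evaluated directly and sampled through a Beta draw for φ = g/(1 + g). A sketch (ours; the values of c and d are arbitrary):

```python
import numpy as np
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(6)
c, d = 0.5, 1.0        # arbitrary shape hyperparameters, c > 0, d > 0

def hyper_g_pdf(g):
    """Generalized hyper-g (Beta-prime) density of (18)."""
    return g ** (c - 1) * (1 + g) ** (-(c + d)) / beta_fn(c, d)

# If φ = g/(1 + g) ~ Beta(c, d), then g = φ/(1 − φ) follows (18)
phi = rng.beta(c, d, size=100_000)
g = phi / (1 - phi)

# sanity check: P(g ≤ 1) = P(φ ≤ 1/2), i.e. the Beta(c, d) cdf at 1/2
print(np.mean(g <= 1.0), beta_dist.cdf(0.5, c, d))
```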

where B(·) is the Beta function. This Beta-prime (or Pearson Type-VI) hyper prior for g is a generalization of the so-called Pareto type-II hyper-g. Using the generalized hyper-g prior specification, the three-stage hierarchy of the model can be defined as:

First stage:  y ∼ N(Xβ + Wb, Σ),  Σ = τ^{-1} I_{NT}
Second stage: β ∼ N(β0 ι_{K1}, (τ g0 Λ_X)^{-1}),  b ∼ N(b0 ι_{K2}, (τ h0 Λ_W)^{-1})   (19)
Third stage:  h0 ∼ β′(c, d) → p(h0) = h0^{c−1} (1 + h0)^{−(c+d)} / B(c, d),  c > 0, d > 0.

We thus use hyperparameters only on the variance (τ h0 Λ_W)^{-1}, i.e. only on h0. Furthermore, we do not introduce an ε-contamination class of prior distributions for this hyperparameter of the third stage of the hierarchy, for example, p(h0) = (1 − ε_a) π0(h0) + ε_a q(h0), as our objective is to account for the uncertainty about the prior means of the base prior π0(β, b, τ). The third stage does depend on the priors, so the specification uses Gaussian distributions for the parameters at the second stage and a Beta-prime distribution for the priors at the third stage. Moreover, Berger (1985, page 232) has stressed that the choice of a specific functional form for the third stage matters little.11 Therefore we restrict the ε-contamination class of prior distributions to the first stage prior only (the second stage of the hierarchy, i.e., for (β, b, τ)).

The first step of the robust Bayesian estimator in the three-stage hierarchy is strictly similar to the one in the two-stage hierarchy. But the three-stage hierarchy differs from the two-stage hierarchy in that it introduces a generalized hyper-g prior on h0. The unconditional predictive density corresponding to the base prior is then given by

m(ỹ | π0, β)

=

Z∞ 0

m (e y | π0 , β, h0 ) p(h0 )dh0

Z e K2 H +c−1 d−1 (1 − ϕ) (ϕ) 2 B (c, d) 1

=

0

1+ϕ

Rb20 1 − Rb20

!!− N2T

which can be written as: B(d, K22 + c) e H ×2 F1 m (e y | π0 , β) = B (c, d)

N T K2 K2 ; + c; + c + d; − 2 2 2

Rb20 1 − Rb20



!!

,

(20)
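The integral representation behind (20) can be checked numerically; the following sketch (with arbitrary illustrative values for $K_2$, $NT$, $c$, $d$ and $R^2$, not the paper's data) verifies that the Beta-kernel integral equals the Beta-times-Gauss-hypergeometric closed form:

```python
import numpy as np
from scipy import integrate, special

# Illustrative values (assumptions, not taken from the paper's experiments).
K2, NT, c, d = 4, 50, 0.1, 1.0
R2 = 0.6                      # plays the role of R_{b0}^2
x = R2 / (1 - R2)
A = K2 / 2 + c

# Left-hand side: the integral over phi in (0, 1).
integrand = lambda phi: phi ** (A - 1) * (1 - phi) ** (d - 1) * (1 + x * phi) ** (-NT / 2)
lhs, _ = integrate.quad(integrand, 0, 1)

# Right-hand side: B(A, d) * 2F1(NT/2, A; A + d; -x), the closed form of (20)
# up to the common factor H~ / B(c, d).
rhs = special.beta(A, d) * special.hyp2f1(NT / 2, A, A + d, -x)
```

This is Euler's integral representation of the Gaussian hypergeometric function; adaptive quadrature (as the authors use) and the closed form agree to numerical precision.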

^{11} Sinha and Jayaraman (2010a,b) studied an ML-II contaminated class of priors at the third stage of hierarchical priors using normal, lognormal and inverse Gaussian distributions to investigate the robustness of Bayes estimates with respect to possible misspecification at the third stage. Their results confirmed Berger (1985)'s assertion that the form of the second-stage prior (the third stage of the hierarchy) does not affect the Bayes decision.


where ${}_2F_1(\cdot)$ is the Gaussian hypergeometric function (see Abramovitz and Stegun (1970) and the supplementary appendix). As shown by Liang et al. (2008), numerical overflow is problematic for moderate to large $NT$ and large $R_{b_0}^2$. As the Laplace approximation involves an integral with respect to a normal kernel, we can follow the suggestion of Liang et al. (2008) and develop an expansion^{12} after a change of variable given by $\log\!\left(\frac{h_0}{h_0+1}\right)$.

Similar to the conditional predictive density corresponding to the contaminated prior on $\beta$ (see equation (9)), the unconditional predictive density corresponding to the contaminated prior on $b$ can be shown to give:

$$m(\widetilde{y} \mid q_b, \beta) = \frac{\widetilde{H}}{B(c,d)} \left\{
\begin{aligned}
&\frac{\left(\frac{h^*}{h^*+1}\right)^{\frac{K_2}{2}+c}}{\frac{K_2}{2}+c}\; F_1\!\left(\frac{K_2}{2}+c;\ \frac{NT}{2},\ 1-d;\ \frac{K_2}{2}+c+1;\ \frac{h^*}{h^*+1},\ -\frac{h^*}{h^*+1}\!\left(\frac{R_{\widehat{b}_q}^2}{1-R_{\widehat{b}_q}^2}\right)\right) \\
&+ \left(1 + \frac{h^*}{h^*+1}\!\left(\frac{R_{\widehat{b}_q}^2}{1-R_{\widehat{b}_q}^2}\right)\right)^{-\frac{NT}{2}} \left[\, B(c,d) - \frac{\left(\frac{h^*}{h^*+1}\right)^{c}}{c}\; {}_2F_1\!\left(c;\ d-1;\ c+1;\ \frac{h^*}{h^*+1}\right) \right]
\end{aligned}
\right\}, \qquad (21)$$

where $F_1(\cdot)$ is the Appell hypergeometric function (see Appell (1882); Slater (1966); Abramovitz and Stegun (1970) and the supplementary appendix). $m(\widetilde{y} \mid q_b, \beta)$ can also be approximated using the same clever transformation as in Liang et al. (2008) (see the supplementary appendix).

We have shown earlier that the posterior density of $(b, \tau)$ for the base prior $\pi_0(b, \tau \mid h_0)$ in the two-stage hierarchy model is given by:

$$\widehat{\pi}^*(b, \tau \mid h_0) = \widehat{\lambda}_{b,h_0}\, \pi_0^*(b, \tau \mid h_0) + \left(1 - \widehat{\lambda}_{b,h_0}\right) q^*(b, \tau \mid h_0),$$

with

$$\widehat{\lambda}_{b,h_0} = \frac{(1-\varepsilon)\, m(\widetilde{y} \mid \pi_0, \beta, h_0)}{(1-\varepsilon)\, m(\widetilde{y} \mid \pi_0, \beta, h_0) + \varepsilon\, m(\widetilde{y} \mid q_b, \beta, h_0)}.$$

Hence, we can write

$$\widehat{\lambda}_b = \int_0^{\infty} \widehat{\lambda}_{b,h_0}\, p(h_0)\, dh_0 = \left[\, 1 + \frac{\varepsilon}{1-\varepsilon} \cdot \frac{m(\widetilde{y} \mid q_b, \beta)}{m(\widetilde{y} \mid \pi_0, \beta)} \right]^{-1}. \qquad (22)$$
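Equation (22) can be evaluated directly once the two predictive densities are available; a minimal sketch with placeholder density values:

```python
# Placeholder predictive density values (illustrative, not from the paper).
eps = 0.5          # epsilon-contamination share
m_pi0 = 2.0e-3     # m(y~ | pi0, beta), base-prior predictive density
m_q = 5.0e-4       # m(y~ | q_b, beta), contaminated-prior predictive density

# Equation (22): lambda_b = [1 + (eps / (1 - eps)) * m_q / m_pi0]^{-1}
lam_b = 1.0 / (1.0 + (eps / (1.0 - eps)) * m_q / m_pi0)

# Equivalent "posterior probability of the base prior" form:
lam_b_alt = (1 - eps) * m_pi0 / ((1 - eps) * m_pi0 + eps * m_q)
```

The weight is a data-driven posterior probability that the base prior, rather than the contamination, generated the data.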

Therefore, under the base prior, the Bayes estimator of $b$ in the three-stage hierarchy model is given by:

$$b_*(\beta) = \int_0^{\infty} b_*(\beta \mid h_0)\, p(h_0)\, dh_0 = \frac{1}{c+d}\left[\, d \cdot \widehat{b}(\beta) + c \cdot b_0\, \iota_{K_2} \right].$$

^{12} See the supplementary appendix. For the Monte Carlo simulation study and the empirical applications, instead of using Gaussian hypergeometric integrals ${}_2F_1$ and Appell integrals $F_1$ with Laplace approximations, we prefer to solve the integrals numerically with adaptive quadrature methods.


Likewise, under the contamination class of priors, the empirical Bayes estimator of $b$ for the three-stage hierarchy model is given by

$$\widehat{b}_{EB}(\beta) = \int_0^{\infty} \widehat{b}_{EB}(\beta \mid h_0)\, p(h_0)\, dh_0 \qquad (23)$$

$$= \frac{1}{B(c,d)} \left\{
\begin{aligned}
&\widehat{b}(\beta)\, \frac{\left(\frac{h^*}{h^*+1}\right)^{c}}{c}\; {}_2F_1\!\left(c;\ -d;\ c+1;\ \frac{h^*}{h^*+1}\right) \\
&+ \widehat{b}_q\, \iota_{K_2}\, \frac{\left(\frac{h^*}{h^*+1}\right)^{c+1}}{c+1}\; {}_2F_1\!\left(c+1;\ 1-d;\ c+2;\ \frac{h^*}{h^*+1}\right) \\
&+ \left[\, \widehat{b}(\beta)\, \frac{1}{h^*+1} + \widehat{b}_q\, \iota_{K_2}\, \frac{h^*}{h^*+1} \right] \left[\, B(c,d) - \frac{\left(\frac{h^*}{h^*+1}\right)^{c}}{c}\; {}_2F_1\!\left(c;\ d-1;\ c+1;\ \frac{h^*}{h^*+1}\right) \right]
\end{aligned}
\right\}$$

and the ML-II posterior density of $b$ is given by:

$$\widehat{\pi}^*(b) = \int_0^{\infty} \widehat{\pi}^*(b, \tau)\, d\tau = \widehat{\lambda}_b \int_0^{\infty} \pi_0^*(b, \tau)\, d\tau + \left(1 - \widehat{\lambda}_b\right) \int_0^{\infty} q^*(b, \tau)\, d\tau = \widehat{\lambda}_b\, \pi_0^*(b) + \left(1 - \widehat{\lambda}_b\right) \widehat{q}^*(b).$$

$\pi_0^*(b)$ is the pdf of a multivariate $t$-distribution with mean vector $b_*(\beta)$, variance-covariance matrix $\left(\frac{\xi_{0,b}}{NT-2}\right) M_{0,b}^{-1}$ and degrees of freedom $NT$, with

$$M_{0,b} = \left(\frac{h_0+1}{v(\beta)}\right) \Lambda_W \quad \text{and} \quad \xi_{0,b} = 1 + \left(\frac{h_0}{h_0+1}\right)\!\left(\frac{R_{b_0}^2}{1-R_{b_0}^2}\right).$$

$\widehat{q}^*(b)$ is the pdf of a multivariate $t$-distribution with mean vector $\widehat{b}_{EB}(\beta)$, variance-covariance matrix $\left(\frac{\xi_{q,b}}{NT-2}\right) M_{q,b}^{-1}$ and degrees of freedom $NT$, with

$$\xi_{q,b} = 1 + \left(\frac{\widehat{h}_q}{\widehat{h}_q+1}\right)\!\left(\frac{R_{\widehat{b}_q}^2}{1-R_{\widehat{b}_q}^2}\right) \quad \text{and} \quad M_{q,b} = \left(\frac{\widehat{h}_q+1}{v(\beta)}\right) \Lambda_W.$$

The mean of the ML-II posterior density of $b$ is thus given by

$$\widehat{b}_{ML\text{-}II} = E\left[\widehat{\pi}^*(b)\right] = \widehat{\lambda}_b\, E\left[\pi_0^*(b)\right] + \left(1 - \widehat{\lambda}_b\right) E\left[\widehat{q}^*(b)\right] = \widehat{\lambda}_b\, b_*(\beta) + \left(1 - \widehat{\lambda}_b\right) \widehat{b}_{EB}(\beta). \qquad (24)$$
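The two weighted averages above — the Bayes estimator $b_*(\beta)$ and the ML-II posterior mean in (24) — can be sketched as follows, with placeholder values for the estimators and the weight (illustrative numbers, not the paper's estimates):

```python
import numpy as np

# Illustrative quantities (assumptions, not the paper's estimates).
c, d = 0.1, 1.0
K2 = 3
b_hat = np.array([0.8, 1.1, 0.9])   # data-based estimator b^(beta)
b0 = 0.0                            # prior mean of b

# Bayes estimator under the base prior: shrinkage of b^ toward b0 * iota.
b_star = (d * b_hat + c * b0 * np.ones(K2)) / (c + d)

# ML-II posterior mean (24): weighted average of the Bayes and empirical
# Bayes estimators with data-driven weight lambda_b.
b_eb = np.array([0.85, 1.05, 0.95])  # empirical Bayes estimator (placeholder)
lam_b = 0.36                         # e.g., the 2S weight reported in section 5
b_ml2 = lam_b * b_star + (1 - lam_b) * b_eb
```

Because $\widehat{\lambda}_b \in (0,1)$, each component of $\widehat{b}_{ML\text{-}II}$ lies between the corresponding components of the two estimators.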

The ML-II posterior variance-covariance matrix of $b$ can be derived as in the two-stage hierarchy model. The main difference with the latter relates to the definition of the Bayes estimator $b_*(\beta)$, the empirical Bayes estimator $\widehat{b}_{EB}(\beta)$ and the weight $\widehat{\lambda}_b$ (as compared to $b_*(\beta \mid h_0)$, $\widehat{b}_{EB}(\beta \mid h_0)$ and $\widehat{\lambda}_{b,h_0}$). Since the variance-covariance matrices of both $\beta$ and $b$ are likely underestimated (see the

supplementary appendix) for the same reasons as those raised previously, they are computed using both MCMC with multivariate $t$-distributions and block resampling bootstrap.

As stressed earlier, the two-stage and three-stage hierarchical models correspond to fixed and random effects models, respectively. In the classical framework, these models rest on entirely different assumptions about the data generating process. Bayesian analysis with non-hierarchical priors is analogous to the fixed effects model in panel data econometrics, while introducing hierarchical priors leads to the random effects model. These specifications can be tested using the Chib method of marginal likelihood (see Chib (1995) and Koop (2003)). In particular, the maximum value of this marginal likelihood determines the choice between the two-stage (FE) and the three-stage (RE) estimators. Adaptation of the Chib method to our ML-II estimator is outside the scope of our study. Our estimator needs to be compared to classical frequentist estimators as well as to the standard full hierarchical model in order to assess its relative efficiency. In the next section, we perform a series of simulations to unearth the relative merit of each estimator.

5. A Monte Carlo simulation study

5.1. The DGP of the Monte Carlo study

Following Baltagi et al. (2003, 2009) and Baltagi and Bresson (2012), consider the static linear model:

$$y_{it} = x_{1,1,it}\beta_{1,1} + x_{1,2,it}\beta_{1,2} + x_{2,it}\beta_2 + Z_{1,i}\eta_1 + Z_{2,i}\eta_2 + \mu_i + u_{it}, \quad i = 1, \ldots, N,\ t = 1, \ldots, T,$$

with

$$x_{1,1,it} = 0.7\, x_{1,1,it-1} + \delta_i + \zeta_{it}, \qquad x_{1,2,it} = 0.7\, x_{1,2,it-1} + \theta_i + \varsigma_{it},$$
$$u_{it} \sim N(0, \tau^{-1}), \qquad (\delta_i, \theta_i, \zeta_{it}, \varsigma_{it}) \sim U(-2, 2),$$

and $\beta_{1,1} = \beta_{1,2} = \beta_2 = 1$.
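The base DGP above can be simulated along the following lines (a sketch with $\tau = 1$; the $T_0 = 50$ burn-in periods mentioned later in this section are dropped to reduce dependence on initial values):

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, T0 = 100, 5, 50   # T0 burn-in periods are generated and then dropped

delta = rng.uniform(-2, 2, size=N)   # individual effects in the x-processes
theta = rng.uniform(-2, 2, size=N)

def simulate_ar1(effect, rho=0.7):
    """Simulate x_it = rho * x_it-1 + effect_i + noise_it over T + T0 periods,
    starting from zero, then drop the first T0 observations."""
    x = np.zeros((N, T + T0))
    for t in range(1, T + T0):
        x[:, t] = rho * x[:, t - 1] + effect + rng.uniform(-2, 2, size=N)
    return x[:, T0:]

x11 = simulate_ar1(delta)
x12 = simulate_ar1(theta)
u = rng.normal(0.0, 1.0, size=(N, T))   # tau = 1 so Var(u) = tau^{-1} = 1
```

Each of the four "worlds" below then adds its own specification of $x_{2,it}$, $\mu_i$, $Z_{1,i}$ and $Z_{2,i}$ on top of these common components.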

1. For a random effects (RE) world, we assume that:

$$\eta_1 = \eta_2 = 0, \qquad x_{2,it} = 0.7\, x_{2,it-1} + \kappa_i + \vartheta_{it}, \quad (\kappa_i, \vartheta_{it}) \sim U(-2, 2),$$
$$\mu_i \sim N(0, \sigma_\mu^2), \qquad \rho = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \tau^{-1}} = 0.3,\ 0.8.$$

Furthermore, $x_{1,1,it}$, $x_{1,2,it}$ and $x_{2,it}$ are assumed to be exogenous in that they are not correlated with $\mu_i$ and $u_{it}$.

2. For a Mundlak-type fixed effects (FE) world, we assume that:

$$\eta_1 = \eta_2 = 0; \qquad x_{2,it} = \delta_{2,i} + \omega_{2,it}, \quad \delta_{2,i} \sim N(m_{\delta_2}, \sigma_{\delta_2}^2), \quad \omega_{2,it} \sim N(m_{\omega_2}, \sigma_{\omega_2}^2);$$
$$m_{\delta_2} = m_{\omega_2} = 1, \quad \sigma_{\delta_2}^2 = 8, \quad \sigma_{\omega_2}^2 = 2;$$
$$\mu_i = \overline{x}_{2,i\cdot}\, \pi + \nu_i, \quad \nu_i \sim N(0, \sigma_\nu^2), \quad \sigma_\nu^2 = 1, \quad \pi = 0.8, \quad \text{with } \overline{x}_{2,i\cdot} = \frac{1}{T}\sum_{t=1}^{T} x_{2,it}.$$

$x_{1,1,it}$ and $x_{1,2,it}$ are assumed to be exogenous but $x_{2,it}$ is correlated with the $\mu_i$ and we assume a constant correlation coefficient, $\pi = 0.8$.

3. For a Chamberlain-type fixed effects (FE) world, we assume that:

$$\eta_1 = \eta_2 = 0; \qquad x_{2,it} = \delta_{2,i} + \omega_{2,it}, \quad \delta_{2,i} \sim N(m_{\delta_2}, \sigma_{\delta_2}^2), \quad \omega_{2,it} \sim N(m_{\omega_2}, \sigma_{\omega_2}^2);$$
$$m_{\delta_2} = m_{\omega_2} = 1, \quad \sigma_{\delta_2}^2 = 8, \quad \sigma_{\omega_2}^2 = 2;$$
$$\mu_i = x_{2,i1}\pi_1 + x_{2,i2}\pi_2 + \cdots + x_{2,iT}\pi_T + \nu_i, \quad \nu_i \sim N(0, \sigma_\nu^2);$$
$$\sigma_\nu^2 = 1, \qquad \pi_t = (0.8)^{T-t} \text{ for } t = 1, \ldots, T.$$

$x_{1,1,it}$ and $x_{1,2,it}$ are assumed to be exogenous but $x_{2,it}$ is correlated with the $\mu_i$ and we assume an exponential growth for the correlation coefficients $\pi_t$.

4. For a Hausman-Taylor (HT) world, we assume that:

$$\eta_1 = \eta_2 = 1; \qquad x_{2,it} = 0.7\, x_{2,it-1} + \mu_i + \vartheta_{it}, \quad \vartheta_{it} \sim U(-2, 2); \qquad Z_{1,i} = 1,\ \forall i;$$
$$Z_{2,i} = \mu_i + \delta_i + \theta_i + \xi_i, \quad \xi_i \sim U(-2, 2); \qquad \mu_i \sim N(0, \sigma_\mu^2), \quad \rho = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \tau^{-1}} = 0.3,\ 0.8.$$
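As an illustration of the Chamberlain-type individual effects in item 3 above, a sketch (illustrative draws for $x_{2,it}$; $\sigma_\nu^2 = 1$ and $\pi_t = 0.8^{T-t}$ as in the text):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 100, 5
pi = 0.8 ** (T - np.arange(1, T + 1))   # pi_t = 0.8^(T-t), t = 1..T

x2 = rng.normal(size=(N, T))            # placeholder draws for x_{2,it}
nu = rng.normal(size=N)                 # sigma_nu^2 = 1
mu = x2 @ pi + nu                       # mu_i = sum_t x_{2,it} pi_t + nu_i
```

The weight on the most recent period ($t = T$) is one and the weights decay geometrically as one moves back in time.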

$x_{1,1,it}$, $x_{1,2,it}$ and $Z_{1,i}$ are assumed to be exogenous while $x_{2,it}$ and $Z_{2,i}$ are endogenous because they are correlated with the $\mu_i$ but not with the $u_{it}$. For each set-up, we vary the size of the sample and the duration of the panel. We choose several $(N, T)$ pairs with $N = 100, 500$ and $T = 5, 10$. We set the initial values of $x_{1,1,it}$ and $x_{1,2,it}$ to zero. We next generate $x_{1,1,it}$, $x_{1,2,it}$, $u_{it}$, $\zeta_{it}$, $\varsigma_{it}$, $\omega_{2,it}$ over $T + T_0$ time periods and we drop the first $T_0\ (= 50)$ observations to reduce the dependence on the initial values. The robust Bayesian estimators for the two-stage hierarchy (2S) and for the three-stage hierarchy (3S) are estimated with $\varepsilon = 0.5$, though we also investigate their robustness to various values of $\varepsilon$.^{13} We must set the hyperparameter values $\beta_0$, $b_0$, $g_0$, $h_0$, $\tau$ for the initial distributions $\beta \sim N\!\left(\beta_0 \iota_{K_1}, (\tau g_0 \Lambda_X)^{-1}\right)$ and $b \sim N\!\left(b_0 \iota_{K_2}, (\tau h_0 \Lambda_W)^{-1}\right)$. While we can choose arbitrary values for

$\beta_0$, $b_0$ and $\tau$, the literature generally recommends using the unit information prior (UIP) to set the $g$-priors.^{14} In the normal regression case, and following Kass and Wasserman (1995), the UIP corresponds to $g_0 = h_0 = 1/NT$, leading to Bayes factors that behave like the Bayesian Information Criterion (BIC). For the three-stage hierarchy (3S), we need to choose the coefficients $(c, d)$ of the generalized hyper-$g$ priors. Following Liang et al. (2008) we set $c = d = 1$ for the Beta-prime distribution. In that case, the density is shaped as a hyperbola. In order to have the same shape under the UIP

^{13} $\varepsilon = 0.5$ is an arbitrary value. We implicitly assume that the amount of error in the base elicited prior is 50%. In other words, $\varepsilon = 0.5$ means that we elicit the $\pi_0$ prior but feel we could be as much as 50% off (in terms of implied probability sets). We could have chosen any value for $\varepsilon$.
^{14} We chose $\beta_0 = 0$, $b_0 = 0$ and $\tau = 1$.


principle (i.e., $h_0$ close to $1/NT$), we chose $c = 0.1$ and $d = 1$. For the three-stage hierarchy (3S), we solve the integrals numerically with adaptive quadrature methods (Davis and Rabinowitz, 1984; Press et al., 2007). Most simulations are based upon $N = 100$, $T = 5$, $\rho = 0.8$ and $\varepsilon = 0.5$, although we investigate the robustness of our estimators to different parameter sets. For the 2S and 3S robust estimators, we use $BR = 20$ samples in the block resampling bootstrap and $D = 1{,}000$ draws with $D_{burn} = 500$ when drawing from the multivariate $t$-distributions. Finally, the Full Bayesian (FB) estimators are not derived in the paper for the sake of brevity but are presented in the supplementary appendix. When estimating the FB models, we also use $D = 1{,}000$ draws with $D_{burn} = 500$. For each experiment, we run $R = 1{,}000$ replications and we compute the means, the standard errors and the root mean squared errors (RMSEs) of the coefficients.

5.2. The results of the Monte Carlo study

In its most general form, the Gaussian linear mixed model can be written as (see equation (1)):

$$y = X\beta + Wb + u. \qquad (25)$$

In what follows, we show how the usual classical estimators can be rewritten as a transformation of this model. We then estimate the classical and Bayesian counterparts and compare their properties.

5.2.1. The random effects world

Rewrite the general model (25) as follows: $y = X\beta + Z_\mu \mu + u$, where $u \sim N(0, \Sigma)$, $\Sigma = \tau^{-1} I_{NT}$, $Z_\mu = I_N \otimes \iota_T$ is $(NT \times N)$, $\otimes$ is the Kronecker product, $\iota_T$ is a $(T \times 1)$ vector of ones and $\mu$ is a $(N \times 1)$ vector of idiosyncratic parameters. When $W \equiv Z_\mu$, the random effects, $\mu \sim N(0, \sigma_\mu^2 I_N)$, are associated with the error term $\nu = Z_\mu \mu + u$ with $\mathrm{Var}(\nu) = \sigma_\mu^2 (I_N \otimes J_T) + \sigma_u^2 I_{NT}$, where $J_T = \iota_T \iota_T'$. This model is customarily estimated using Feasible Generalized Least Squares (FGLS) (see Hsiao (2003) or Baltagi (2013)).

Table 1 reports the results of fitting the classical FGLS model along with those from the 2S, 3S and Full Bayesian (FB) models, each in a separate panel. The true parameter values appear on the first line of the table. The last column reports the computation time in seconds.^{15} Note that the computation time increases significantly as we move from the FGLS to the 2S-3S (bootstrap), to the 2S-3S (t-dist) and finally to the FB models. The performance of the robust estimators with block resampling is nevertheless within an acceptable range. The first noteworthy feature of the table is that all the estimators yield essentially the same parameter estimates, standard errors^{16} and RMSEs. The main difference between the 2S and the 3S lies in the estimated value of the weight $\widehat{\lambda}_\mu$, which tends towards zero in the 3S. This implies that the empirical Bayes estimator $\widehat{b}_{EB}(\beta)$ (or equivalently $\widehat{\mu}_{EB}(\beta)$) accounts for almost 100% of the weight in estimating the individual specific effects $\mu_i$ while the empirical Bayes estimator $\widehat{\beta}_{EB}(b)$

^{15} The simulations were conducted using R version 3.3.2 on a MacBook Pro, 2.8 GHz Core i7 with 16GB 1600 MHz DDR3 RAM.
^{16} Strictly speaking, we should mention "posterior means" and "posterior standard errors" whenever we refer to Bayesian estimates and "coefficients" and "standard errors" when discussing frequentist ones. For the sake of brevity, we will use "coefficients" and "standard errors" in both cases.
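The two-eigenvalue structure of $\mathrm{Var}(\nu) = \sigma_\mu^2 (I_N \otimes J_T) + \sigma_u^2 I_{NT}$ underlying FGLS can be verified directly (a small numerical sketch, not tied to the paper's sample sizes):

```python
import numpy as np

# Small illustrative panel (assumed dimensions, not the simulation sizes).
N, T = 4, 3
sigma_mu2, sigma_u2 = 0.5, 1.0

J_T = np.ones((T, T))
V = sigma_mu2 * np.kron(np.eye(N), J_T) + sigma_u2 * np.eye(N * T)

# The RE covariance matrix has exactly two distinct eigenvalues:
#   T*sigma_mu2 + sigma_u2  (multiplicity N)  and  sigma_u2  (multiplicity N*(T-1)).
eigs = np.sort(np.linalg.eigvalsh(V))
```

This spectral decomposition is what makes the GLS transform (quasi-demeaning) of the RE model cheap to compute.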


(or equivalently $\widehat{\beta}_{EB}(\mu)$) accounts for about 95% of the weight in estimating the $\beta$ coefficients. For the two-stage hierarchical robust Bayesian estimator, the base prior for the specific effects $b$ (or equivalently $\mu$) is "more consistent" with the data as compared to the base prior for $\beta$, since the weight $\widehat{\lambda}_b$ (or equivalently $\widehat{\lambda}_\mu$) is around 36% while the weight $\widehat{\lambda}_\beta$ is around 6%. For the 3S robust Bayesian estimator, to the contrary, the base prior for the specific effects $b$ (or equivalently $\mu$) is inconsistent with the data as the ML-II posterior density of $b$ is close to the posterior $\widehat{q}^*(b \mid h_0)$ and to the empirical Bayes estimator $\widehat{b}_{EB}(\beta \mid h_0)$. Using either the individual block resampling bootstrap or the multivariate $t$-distributions yields essentially the same standard errors for both the 2S and 3S models.^{17}

The table also reports the numerical standard error (nse), often referred to as the Monte Carlo error, for the Full Bayesian estimator and the 3S with multivariate $t$-distributions. The nse is equal to the difference between the mean of the sampled values and the true posterior mean. As a rule of thumb, as many simulations as necessary should be conducted to ensure that the Monte Carlo error of each parameter of interest is less than approximately 10% of the sample standard error (see Brooks and Gelman (1998); Gilks et al. (1997); Koop (2003)). As shown in the table, the estimated nse for the 2S and 3S with multivariate $t$-distributions and the Full Bayesian estimators easily satisfy this criterion. Moreover, Table A1 in the supplementary appendix underlines the very good behavior of the 3S bootstrap compared to FGLS for several values of $N$ and $T$, and for $\rho = 0.3$ and $\rho = 0.8$.

5.2.2. The Mundlak-type fixed effects world

In the fixed effects world, we allow the individual effects $\mu$ and the covariates $X$ to be correlated. This is usually accounted for through a Mundlak-type (see Mundlak (1978)) or a Chamberlain-type specification (see Chamberlain (1982)). For the Mundlak-type specification, the individual effects are defined as: $\mu = (Z_\mu' X / T)\pi + \varpi$, $\varpi \sim N(0, \sigma_\varpi^2 I_N)$, where $\pi$ is a $(K_1 \times 1)$ vector of parameters to be estimated. The model can be rewritten as $y = X\beta + PX\pi + Z_\mu \varpi + u$, where $P = I_N \otimes \frac{J_T}{T}$ is the between-transformation (see Baltagi (2013)). We can concatenate $[X, PX]$ into a single matrix of observables and let $Wb \equiv Z_\mu \varpi$.

For the Mundlak world, we compare the standard FGLS estimator on the transformed model to our robust 2S, 3S and FB estimators of that same specification. In our case, as $\mu_i = \overline{x}_{2,i\cdot}\pi + \nu_i$, the transformed model is given by: $y = x_{1,1}\beta_{1,1} + x_{1,2}\beta_{1,2} + x_2\beta_2 + Px_2\pi + Z_\mu\nu + u$. In this specification, $X = [x_{1,1}, x_{1,2}, x_2, Px_2]$, $W = Z_\mu$ and $b = \nu$.

The simulation results are reported in Table 2. As in the random effects world, all the estimators yield essentially the same parameter estimates, standard errors and RMSEs, except for the 2S t-dist estimator. Indeed, relative to the 3S t-dist estimator, the former overestimates the standard errors of the slope parameters as well as the variance of the individual effects, $\sigma_\mu^2$. This result is also found when investigating the Chamberlain-type fixed effects world and the Hausman-Taylor world (see Tables 3 and 4). Finally, note that the nse statistics underline the accuracy of the posterior estimates of the 3S t-dist and FB estimators. Furthermore, increasing the size of $N$ and/or $T$ ($N = 500$, $T = 5$) or ($N = 100$, $T = 10$) leads to very stable results (see Table A2 in the supplementary appendix).


5.2.3. The Chamberlain-type fixed effects world

For the Chamberlain-type specification, the individual effects are given by $\mu = \overline{X}\Pi + \varpi$, where $\overline{X}$ is a $(N \times TK_1)$ matrix with $\overline{X}_i = (X_{i1}', \ldots, X_{iT}')$ and $\Pi = (\pi_1', \ldots, \pi_T')'$ is a $(TK_1 \times 1)$ vector. Here $\pi_t$ is a $(K_1 \times 1)$ vector of parameters to be estimated. The model can be rewritten as: $y = X\beta + Z_\mu \overline{X}\Pi + Z_\mu \varpi + u$. We can concatenate $[X, Z_\mu \overline{X}]$ into a single matrix of observables and let $Wb \equiv Z_\mu \varpi$.

For the Chamberlain world, we compare the Minimum Chi-Square (MCS) estimator (see Chamberlain (1982); Hsiao (2003); Baltagi et al. (2009)) to our estimators. These are based on the transformed model: $y_{it} = x_{1,1,it}\beta_{1,1} + x_{1,2,it}\beta_{1,2} + x_{2,it}\beta_2 + \sum_{t=1}^{T} x_{2,it}\pi_t + \nu_i + u_{it}$ or $y = x_{1,1}\beta_{1,1} + x_{1,2}\beta_{1,2} + x_2\beta_2 + \overline{x}_2\Pi + Z_\mu\nu + u$. In that specification, $X = [x_{1,1}, x_{1,2}, x_2, \overline{x}_2]$, $W = Z_\mu$ and $b = \nu$.

Table 3 once again shows that the results of the 3S bootstrap, the 3S t-dist and the FB are very close to those of the classical MCS estimator. This is true of the slope coefficients $\beta_{11}$, $\beta_{12}$, $\beta_2$, the $\pi_t$ coefficients for $t = 1, \ldots, 5$ (not reported), their standard errors and their RMSEs, as well as of the two variances $\sigma_u^2$ and $\sigma_\mu^2$. Increasing the size of $N$ and/or $T$ ($N = 500$, $T = 5$) or ($N = 100$, $T = 10$) does not change the results qualitatively; if anything, it improves marginally the 3S estimator with individual block resampling bootstrap (see Table A2 in the supplementary appendix).^{18}

5.2.4. The Hausman-Taylor world

The Hausman-Taylor model (henceforth HT, see Hausman and Taylor (1981)) posits that $y = X\beta + Z\eta + Z_\mu\mu + u$, where $Z$ is a vector of time-invariant variables, and that subsets of $X$ (e.g., $X_{2,i}'$) and $Z$ (e.g., $Z_{2i}'$) may be correlated with the individual effects $\mu$, but leaves the correlations unspecified. Hausman and Taylor (1981) proposed a two-step IV estimator. For our general model (2): $y = X\beta + Wb + u$, we assume that $\overline{X}_{2,i\cdot}'$, $Z_{2i}'$ and $\mu_i$ are jointly normally distributed:

$$\begin{pmatrix} \mu_i \\ \overline{X}_{2,i\cdot}' \\ Z_{2i}' \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ E_{\overline{X}_2'} \\ E_{Z_2'} \end{pmatrix},\ \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),$$

where $\overline{X}_{2,i\cdot}'$ is the individual mean of $X_{2,it}'$. The conditional distribution of $\mu_i \mid \overline{X}_{2,i\cdot}', Z_{2i}'$ is given by:

$$\mu_i \mid \overline{X}_{2,i\cdot}', Z_{2i}' \sim N\!\left( \Sigma_{12}\Sigma_{22}^{-1} \begin{pmatrix} \overline{X}_{2,i\cdot}' - E_{\overline{X}_2'} \\ Z_{2i}' - E_{Z_2'} \end{pmatrix},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).$$

Since we do not know the elements of the variance-covariance matrix $\Sigma_{jk}$, we can write:

$$\mu_i = \left(\overline{X}_{2,i\cdot}' - E_{\overline{X}_2'}\right)\theta_X + \left(Z_{2i}' - E_{Z_2'}\right)\theta_Z + \varpi_i,$$

where $\varpi_i \sim N\!\left(0,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$ is uncorrelated with $u_{it}$, and where $\theta_X$ and $\theta_Z$ are vectors of parameters to be estimated. In order to identify the coefficient vector of $Z_{2i}'$ and to avoid possible collinearity problems, we assume that the individual effects are given by:

$$\mu_i = \left(\overline{X}_{2,i\cdot}' - E_{\overline{X}_2'}\right)\theta_X + f\!\left[\left(\overline{X}_{2,i\cdot}' - E_{\overline{X}_2'}\right) \odot \left(Z_{2i}' - E_{Z_2'}\right)\right]\theta_Z + \varpi_i, \qquad (26)$$
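The conditional moments used in this derivation follow from the usual multivariate-normal conditioning formula; a small numerical sketch with an assumed $3 \times 3$ joint covariance matrix (placeholder values):

```python
import numpy as np

# Illustrative joint covariance of (mu_i, X2bar_i, Z2_i); an assumption.
Sigma = np.array([[2.0, 0.8, 0.6],
                  [0.8, 1.5, 0.3],
                  [0.6, 0.3, 1.2]])
S11 = Sigma[0, 0]
S12 = Sigma[0:1, 1:]
S21 = Sigma[1:, 0:1]
S22 = Sigma[1:, 1:]

# Conditional moments of mu | (X2bar, Z2):
theta = S12 @ np.linalg.inv(S22)   # regression coefficients (theta_X, theta_Z)
cond_var = S11 - (S12 @ np.linalg.inv(S22) @ S21)[0, 0]   # Schur complement
```

Conditioning on correlated covariates always reduces the variance of $\mu_i$, which is why $\varpi_i$'s variance is the Schur complement $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.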

^{18} We only report the results for MCS and the 3S with individual block resampling bootstrap for the sake of brevity.

where $\odot$ is the Hadamard product and $f\!\left[\left(\overline{X}_{2,i\cdot}' - E_{\overline{X}_2'}\right) \odot \left(Z_{2i}' - E_{Z_2'}\right)\right]$ can be a nonlinear function of $\left(\overline{X}_{2,i\cdot}' - E_{\overline{X}_2'}\right)\left(Z_{2i}' - E_{Z_2'}\right)$. The first term on the right-hand side of equation (26) corresponds to the Mundlak transformation while the middle term captures the correlation between $Z_{2i}'$ and $\mu_i$. The individual effects, $\mu$, are a function of $PX$ and $f[PX \odot Z]$, i.e., a function of the column-by-column Hadamard product of $PX$ and $Z$. We can once again concatenate $[X, PX, f[PX \odot Z]]$ into a single matrix of observables and let $Wb \equiv Z_\mu \varpi$. For our model, $y_{it} = x_{1,1,it}\beta_{1,1} + x_{1,2,it}\beta_{1,2} + x_{2,it}\beta_2 + Z_{1,i}\eta_1 + Z_{2,i}\eta_2 + \mu_i + u_{it}$ or $y = X_1\beta_1 + x_2\beta_2 + Z_1\eta_1 + Z_2\eta_2 + Z_\mu\mu + u$. Then, we assume that

$$\mu_i = \left(\overline{x}_{2,i\cdot} - E_{\overline{x}_2}\right)\theta_X + f\!\left[\left(\overline{x}_{2,i\cdot} - E_{\overline{x}_2}\right) \odot \left(Z_{2i} - E_{Z_2}\right)\right]\theta_Z + \nu_i. \qquad (27)$$

We propose adopting the following strategy: if the correlation between $\mu_i$ and $Z_{2i}$ is quite large $(> 0.2)$, use $f[\cdot] = \left(\left(\overline{x}_{2,i\cdot} - E_{\overline{x}_2}\right) \odot \left(Z_{2i} - E_{Z_2}\right)\right)^s$ with $s = 1$. If the correlation is weak, set $s = 2$. In real-world applications, we do not know the correlation between $\mu_i$ and $Z_{2i}$ a priori. We can use a proxy of $\mu_i$ defined by the OLS estimation of $\mu$: $\widehat{\mu} = (Z_\mu' Z_\mu)^{-1} Z_\mu' \widehat{y}$, where $\widehat{y}$ are the fitted values of the pooled regression $y = X_1\beta_1 + x_2\beta_2 + Z_1\eta_1 + Z_2\eta_2 + \zeta$. Then, we compute the correlation between $\widehat{\mu}$ and $Z_2$. In our simulation study, it turns out the correlations between $\mu$ and $Z_2$ are large: 0.97 and 0.70 when $\rho = 0.8$ and $\rho = 0.3$, respectively. Hence, we choose $s = 1$. In this specification, $X = [x_{1,1}, x_{1,2}, x_2, Z_1, Z_2, Px_2, f[Px_2 \odot Z_2]]$, $W = Z_\mu$ and $b = \nu$.

Table 4 compares the results of the classical IV estimator to those of the robust hierarchical (2S and 3S) and Full Bayesian estimators. Once again, the results are very close to one another. This is true for the slope coefficients $\beta_{11}$, $\beta_{12}$ of the exogenous time-varying variables $x_{1,1,it}$, $x_{1,2,it}$, for the slope coefficient $\beta_2$ of the endogenous time-varying variable $x_{2,it}$, for the coefficient $\eta_1$ of the time-invariant variable $Z_{1,i}$ (uncorrelated with $\mu_i$) and for the remainder variance $\sigma_u^2$. On the other hand, the 3S bootstrap, the 3S t-dist and the FB estimators yield slightly upward-biased estimates of $\eta_2$, the coefficient associated with the time-invariant variable $Z_{2,i}$, which is itself correlated with $\mu_i$. While the biases are relatively small (4.53%, 4.54% and 5.99%, respectively), they are insensitive to the sample size, $N$, but tend to taper off with $T$ (see Table A3 in the supplementary appendix).^{19} Interestingly, the standard errors of that same coefficient are considerably smaller when using the Bayesian estimators, irrespective of $N$ and $T$.
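The OLS proxy $\widehat{\mu} = (Z_\mu' Z_\mu)^{-1} Z_\mu' \widehat{y}$ described above reduces to individual time-means of the fitted values; a sketch with placeholder fitted values:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 50, 5
Z_mu = np.kron(np.eye(N), np.ones((T, 1)))   # (NT x N) matrix of individual dummies

y_fit = rng.normal(size=N * T)   # placeholder for the fitted values y^ of the pooled regression
mu_hat = np.linalg.solve(Z_mu.T @ Z_mu, Z_mu.T @ y_fit)

# With Z_mu = I_N kron iota_T, mu_hat_i is simply the time-mean of y^ for individual i.
group_means = y_fit.reshape(N, T).mean(axis=1)
```

One would then correlate `mu_hat` with $Z_2$ to choose the exponent $s$ in $f[\cdot]$.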
Consequently, the RMSEs of the Bayesian estimators are smaller and the 95% confidence intervals are narrower and entirely nested within those obtained with the IV procedure of Hausman-Taylor. The slight bias is thus entirely offset by increased precision. Finally, note that both the IV and the Bayesian estimators yield biased estimates of $\sigma_\mu^2$. While the former overestimates the true value and the latter does the opposite, only the IV estimator behaves better when $N$ and/or $T$ increase.

It must be noted that the full Bayesian (FB) estimator sometimes yields smaller RMSEs for $\beta_{11}$, $\beta_{12}$ and $\beta_2$ than the classical (FGLS, Mundlak, MCS, IV-HT) estimators and the 3S estimators in the Chamberlain world and more specifically in the Hausman-Taylor world (see Tables 3 and 4). Since $N\ (= 100)$ is relatively small, the estimated standard errors are likely underestimated. As $N$ gets larger (e.g., $N = 500$, see Table A4 in the supplementary appendix), the coefficients, SEs and RMSEs of the FB become similar to those of the MCS, IV and 3S bootstrap estimators. Note

^{19} Table A3 only reports the results for the IV procedure and the 3S with individual block resampling bootstrap to save space.


that the computational burden of the full Bayesian method is measured in hours and even days compared to the $\varepsilon$-contamination approach.^{20}

5.2.5. Sensitivity to $\varepsilon$-contamination values

Table 5 investigates the sensitivity of the Hausman-Taylor estimator^{21} with respect to $\varepsilon$, the contamination part of the prior distributions, which varies between 0 and 90%. As shown, most parameter estimates are relatively insensitive to $\varepsilon$. The only noteworthy change concerns the estimated value of $\lambda_\mu$ in the 2S bootstrap estimator. In particular, it more or less corresponds to $(1 - \varepsilon)$. This relation may occur whenever $\widehat{h}/(\widehat{h}+1) = h_0/(h_0+1)$ and $R_{b_0}^2/(1-R_{b_0}^2) = R_{\widehat{b}_q}^2/(1-R_{\widehat{b}_q}^2)$ (see the definition of $\widehat{\lambda}_{b,h_0}$ in section 3.2). For the 3S bootstrap, the estimated standard errors of $\beta_2$ are much larger than those obtained from the usual HT method. But for larger $N$ or larger $T$, this does not occur (see Tables A3 and A4 in the supplementary appendix). The observed stability of the coefficient estimates stems from the fact that the base prior is not consistent with the data, as the weight $\widehat{\lambda}_\beta \to 0$. The ML-II posterior mean of $\beta$ is thus close to the posterior $\widehat{q}^*(\beta \mid g_0)$ and to the empirical Bayes estimator $\widehat{\beta}_{EB}(\mu \mid g_0)$. Hence, the numerical value of the $\varepsilon$-contamination, for $\varepsilon \neq 0$, does not seem to play an important role in our simulated worlds.

As suggested by a referee, we conducted further simulations with $\varepsilon = 0$. Table 5 reports the results for the Hausman-Taylor model.^{22} Under the null, $H_0: \varepsilon = 0$, it follows that the weights $\widehat{\lambda}_{\beta,g_0} = 1$ and $\widehat{\lambda}_{b,h_0} = 1$, so that the restricted ML-II estimator of $\beta$ is given by $\widehat{\beta}_{rest} = \beta_*(b \mid g_0)$. Under $H_1: \varepsilon \neq 0$ the unrestricted estimator is $\widehat{\beta}_{un} \equiv \widehat{\beta}_{ML\text{-}II} = \widehat{\lambda}_{\beta,g_0}\, \beta_*(b \mid g_0) + \left(1 - \widehat{\lambda}_{\beta,g_0}\right)\widehat{\beta}_{EB}(b \mid g_0)$. The restricted ML-II estimator $\beta_*(b \mid g_0)$ is the Bayes estimator under

the base prior $g_0$. It differs from the full Bayesian estimator (FB) (described in the supplementary appendix) in that the latter is based on other priors. The use of Zellner's $g$-priors leads to more diffuse priors than those of the FB. In other words, the FB is likely more sensitive to the choice of priors than our ML-II estimator, even when $\varepsilon = 0$. The full Bayesian estimator (FB) does not guarantee the posterior robustness of our ML-II estimator. Additionally, the restricted ML-II estimator ($\varepsilon = 0$) constrains the model to rely exclusively on a base elicited prior which is implicitly assumed error-free. This is a strong assumption. Applying Theorem 1 of Magnus and Durbin (1999), we can derive the conditions under which $MSE\!\left(\widehat{\beta}_{rest}\right) - MSE\!\left(\widehat{\beta}_{un}\right)$ is positive

or negative semidefinite (see the supplementary appendix).^{23} As shown in Table 5 for $\varepsilon = 0$,

^{20} Recall that in our simulation study, we only use 1,000 draws and 500 burn-in draws for the full Bayesian estimator (with $N = 500$, $T = 5$ and 1,000 replications). Had we used more draws (as is typically done in MCMC analyses), for instance 10,000 draws and 5,000 burn-in draws, then the computing time would have been approximately 4 days. The computation of our 3S bootstrap estimators would have taken only 3 hours.
^{21} This exercise could be conducted for the other worlds (RE, FE) but we only report the results for the Hausman-Taylor world for the sake of brevity.
^{22} Note that when $\varepsilon = 0$, the unconditional predictive density corresponding to the base prior, $m(\widetilde{y} \mid \pi_0, \beta)$, depends on a Gaussian hypergeometric function which cannot be numerically evaluated with the Laplace approximation nor with adaptive quadrature methods if Zellner's $g$-priors are too large (e.g., $g_0 > 1$).
^{23} The difference of MSE is given by:
$$MSE\!\left(\widehat{\beta}_{rest}\right) - MSE\!\left(\widehat{\beta}_{un}\right) = QQ' - D.$$
Hence
$$MSE\!\left(\widehat{\beta}_{rest}\right) \le (\text{resp.} \ge)\ MSE\!\left(\widehat{\beta}_{un}\right) \quad \text{iff} \quad Q'D^{-}Q \le (\text{resp.} \ge)\ 1,$$


the β’s (of the 2S and 3S bootstrap) are slightly downward-biased, and even more so for the 2S bootstrap η’s and the 3S bootstrap η1 . Moreover, the estimated variance of the specific effects, σµ2 , are downward biased in both the 2S bootstrap (σµ2 = 3.53) and the 3S bootstrap (σµ2 = 3.38) estimators. Irrespective of the value of ε, the 3S bootstrap standard errors of β11 and β12 are close to those of the Hausman-Taylor IV estimator. However, when ε = 0 the 3S bootstrap standard errors of η2 is clearly downward-biased compared to those of the 2S and the 3S bootstrap estimators when ε 6= 0. This also holds true for the RMSE of this coefficient. The restricted ML-II estimator (ε = 0) depends on the value of Zellner’s g-priors, g0 (and h0 for b). Whatever the value of ε, we have chosen g0 and h0 so that the prior is as diffuse as possible. A good strategy is to follow the unit information prior (UIP) principle suggested by Kass and Wasserman (1995) which corresponds to g0 = h0 = 1/N T . This leads to Bayes factors that behave like the Bayesian Information Criterion (BIC). In that case, the difference in the MSEs of the parameter estimates computed with ε ∈ ]0, 1[ and those with ε = 0 could be small since in both cases the elicited prior is diffuse. From Table 5, and Tables A4 and A5 in the supplementary appendix, the results differ from the full Bayesian method. In Table A5, we investigate the sensitivity of the 3S bootstrap estimator when both ε and Zellner’s g-priors change. When ε = 0.5, increasing Zellner’s g-priors from g0 = h0 = 1/N T to g0 = h0 = 0.1 has little impact on the results. On the other hand, when ε = 0 (with g0 = h0 = 1/N T ), the β’s and η1 are biased and the estimated variance of the specific effects, σµ2 , is downward-biased (σµ2 = 3.38) as indicated previously. Increasing Zellner’s g-priors (g0 = h0 = 0.1) when ε = 0, worsens the biases on all the parameter estimates, and especially for the two η’s and σµ2 (3.30 instead of 4). 
Moreover, the RMSEs of all the coefficients increase significantly, thus emphasizing the differences in the MSEs when we move from ε = 0 to ε 6= 0. Table 5 reports the results for ε = 0.1, 0.5, 0.9. The RMSEs are almost the same across the 3S bootstrap specifications. Recall that this estimator is data driven and implicitly adjusts the weights to the different values of ε-contamination. This may by why, even though the choice of ε = 0.5 is somewhat arbitrary, the adjustment compensates for it not being optimal (see Berger (1985)). A possible guide for the choice of parameters of the 3S estimator is as follows: • First, choose g0 and h0 so that the prior is as diffuse as possible. A good approach is to follow the unit information prior (UIP) principle suggested by Kass and Wasserman (1995) which corresponds to g0 = h0 = 1/N T . • Second, choose any ε value lying in ε ∈ ]0, 1[. with  1 − λβ,g0 (β∗ (b | g0 ) − βEB (b | g0 ))  and D = 1 − λβ,g0 [V ar [q ∗ (β | g0 )] − V ar [π0∗ (β | g0 )]]  +λβ,g0 1 − λβ,g0 (β∗ (b | g0 ) − βEB (b | g0 )) (β∗ (b | g0 ) − βEB (b | g0 ))0 ,     where D− denotes the generalized inverse of D. If M SE βbrest and M SE βbun are two positive semidefinite         matrices, the notation M SE βbrest ≤ M SE βbun means that M SE βbun −M SE βbrest is positive semidefinite. In contrast, the difference of MSE between our ML-II estimator and the FB cannot be defined analytically but only be numerically evaluated with our simulation studies. with Q

=

21

5.2.6. Departure from normality

Table 6 investigates the robustness of the estimators to a non-normal framework. The remainder disturbances, $u_{it}$, are now assumed to follow a right-skewed $t$-distribution instead of the usual normal distribution. The parameters of the right-skewed $t$-distribution are mean $= 0$, degrees of freedom $= 3$, and shape $= 2$ (see Fernández and Steel (1998)). Our 3S bootstrap estimator yields slope parameter estimates similar to those of the classical estimators, i.e., the Mundlak FGLS, the MCS or the IV. The main difference concerns the estimates of the remainder variance $\sigma_u^2$ and the individual effects variance $\sigma_\mu^2$. Indeed, in the Mundlak-type fixed effects world, the estimate of $\sigma_u^2$ computed from the 3S bootstrap is closer to the theoretical value than that of the FGLS estimate. In the Chamberlain-type fixed effects world, the estimate of $\sigma_\mu^2$ (resp. $\sigma_u^2$) computed from the 3S bootstrap estimator is closer (resp. further) to the theoretical value than that of the standard MCS estimator. Finally, the 3S bootstrap estimates of $\sigma_\mu^2$ and $\sigma_u^2$ in the Hausman-Taylor world are both closer to their true values than those of the HT estimator. Yet, the more interesting result concerns the precision and the bias of the estimate of $\eta_2$ in the Hausman-Taylor world. Notice first that the standard error of the 3S bootstrap estimate is once again much smaller than what obtains using the standard HT estimator (0.2039 versus 0.3478). Consequently, the 95% confidence interval of the 3S bootstrap estimator is much narrower and entirely nested within the one obtained with the IV procedure of Hausman-Taylor ([0.6189; 1.418] versus [0.3180; 1.6814]). Second, it is worth emphasizing that in this non-Gaussian framework, the bias of $\eta_2$ has decreased significantly (1.88% versus 4.53%) and that the estimate of $\lambda_\beta$, while still relatively small, is now more important.
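A right-skewed $t$ can be generated with a Fernández–Steel-type construction; the sketch below illustrates the idea (with shape parameter $\gamma = 2$, it is not necessarily the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def skew_t(n, df=3, gamma=2.0):
    """Fernandez-Steel-type right-skewed t draws (a sketch of the idea):
    scale |t| by gamma with probability gamma^2 / (1 + gamma^2),
    else by -1/gamma, which skews the mass to the right for gamma > 1."""
    t = np.abs(rng.standard_t(df, size=n))
    right = rng.random(n) < gamma ** 2 / (1 + gamma ** 2)
    return np.where(right, gamma * t, -t / gamma)

u = skew_t(100_000)
share_positive = (u > 0).mean()   # about gamma^2 / (1 + gamma^2) = 0.8
```

With $\gamma = 2$, roughly 80% of the mass lies to the right of zero, producing the right-skewness studied in Table 6.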
Following the suggestions of the referees, we further investigated the consequences of relaxing the normality of the remainder terms. We first simulated the model using a t-distribution that exhibited more right-skewness, i.e., with mean zero, five degrees of freedom and a shape parameter of 3. The simulation results show that increasing the skewness amplifies the gains of our ε-contamination methodology over the standard Hausman and Taylor (1981) estimator (see Table A6 of the supplementary appendix). Second, we assumed instead that the remainder terms followed a χ² distribution with 2 degrees of freedom. The results in Table A7 of the supplementary appendix clearly indicate that our approach is preferable to the Hausman-Taylor estimator.

5.2.7. Sensitivity to endogenous covariates

As shown above, our robust 3S bootstrap estimator exhibits interesting properties in Gaussian and non-Gaussian frameworks. As a final check on its properties, it is worth investigating its relative behavior in the context of endogenous covariates.24 We focus on the Hausman-Taylor and the Chamberlain worlds with N = 100 and T = 5. For the Hausman-Taylor world, we assume that x2,it = 0.7 x2,it−1 + µi + ϑit with ϑit ≡ uit and uit ∼ N(0, 8), as opposed to assuming ϑit ∼ U(−2, 2) above. In addition to being correlated with µi and Z2,i, x2,it is now an endogenous covariate since it is correlated with yit through uit. We have chosen a large variance for uit so as to amplify the potential impact of the endogeneity problem on the results. For the Chamberlain world, we assume that x2,it = ρx2 x2,it−1 + uit, with uit ∼ N(0, σu²), σu² = 8 and ρx2 = 0.7.

24 We are grateful to Richard Blundell for this suggestion and for helpful comments.
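The Hausman-Taylor endogenous design just described is easy to reproduce. The sketch below is our own illustration, with an arbitrary seed and an illustrative variance for µi (which the text does not pin down here): it simulates x2,it = 0.7 x2,it−1 + µi + uit with uit ∼ N(0, 8) and confirms that the covariate is correlated with the remainder disturbance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho, sig2_u = 100, 5, 0.7, 8.0

mu = rng.normal(0.0, 2.0, size=N)                  # individual effects (illustrative variance)
u = rng.normal(0.0, np.sqrt(sig2_u), size=(N, T))  # remainder disturbances

# x2 is endogenous: its innovation IS the remainder disturbance u_it
x2 = np.zeros((N, T))
x2[:, 0] = mu + u[:, 0]
for t in range(1, T):
    x2[:, t] = rho * x2[:, t - 1] + mu + u[:, t]

# pooled correlation between x2_it and u_it is positive by construction
endog_corr = np.corrcoef(x2.ravel(), u.ravel())[0, 1]
```

Any estimator that treats x2,it as exogenous will pick up this correlation as bias, which is precisely the stress test applied to the IV and robust Bayesian estimators below.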

The covariance between x2,it and uit is given by
$$E\left[x_{2,it}u_{it}\right] \simeq \sigma_u^2 \sum_{j=0}^{t-1}\rho_{x_2}^{2j} = \sigma_u^2\left(1-\rho_{x_2}^{2t}\right)/\left(1-\rho_{x_2}^{2}\right).$$
It ranges from $\sigma_u^2$ to approximately $\sigma_u^2/\left(1-\rho_{x_2}^{2}\right) \simeq 1.96\,\sigma_u^2$ as $t$ goes from 1 to $T$. The specific effect
$$\mu_i = \sum_{t=1}^{T} x_{2,i,t}\,\pi_t + \nu_t \simeq \sum_{t=1}^{T}\left(\sum_{j=0}^{t-1}\rho_{x_2}^{j}u_{i,t-j}\right)\pi_t + \nu_t$$
is also correlated with $u_{it}$ and the covariance is given by
$$E\left[\mu_i u_{it}\right] = \sigma_u^2 \sum_{t=1}^{T}\pi_t = \sigma_u^2 \sum_{t=1}^{T}(0.8)^{T-t} = 5\sigma_u^2\left(1-0.8^{T}\right).$$
Table 7 presents the results for the Chamberlain world in the upper panel and those of the Hausman-Taylor world in the bottom one. As for the Chamberlain world, the endogeneity has little if any impact on both the MCS and the robust Bayesian estimators. The parameter estimates are very close to those found in Table 3. The correlation between x2,it and uit increases the standard error (and the RMSE) of x2,it but less so for the other parameter estimates. Note that the Bayesian estimates of σµ² are closer to the true value than what obtains from the MCS estimator. In the Hausman-Taylor world, as opposed to the Chamberlain world, the classical IV estimator is strongly affected by the endogeneity problem. Oddly, the estimate of the coefficient associated with x2,it and its standard error are little impacted. Rather, the estimates of the intercept and of Z2i lose all statistical significance. Likewise, the estimated variance of the specific effects σµ² is strongly upward biased (+34.67%) while the estimated variance of the remainder terms σu² is unbiased. Lastly, the standard errors of the exogenous variables x1,1,it and x1,2,it are now twice as large as those found in Table 4 but nevertheless lead to coefficients which are still statistically significant. The robust estimators (3S bootstrap and 3S t-dist) and the full Bayesian estimator (FB) behave better, especially in terms of precision.
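The first moment expression above can be checked by simulation for the Chamberlain design x2,it = ρx2 x2,it−1 + uit. The sketch below (ours; the replication count and seed are arbitrary) verifies that the accumulated term σu²(1 − ρx2^{2t})/(1 − ρx2²) matches the simulated second moment of x2,it, rising from σu² at t = 1 towards σu²/(1 − ρx2²) ≈ 1.96 σu².

```python
import numpy as np

rng = np.random.default_rng(1)
R, T, rho, sig2_u = 200_000, 5, 0.7, 8.0   # R Monte Carlo replications

u = rng.normal(0.0, np.sqrt(sig2_u), size=(R, T))
x2 = np.zeros((R, T))
x2[:, 0] = u[:, 0]                          # x2_{i1} = u_{i1}
for t in range(1, T):
    x2[:, t] = rho * x2[:, t - 1] + u[:, t]

# sigma_u^2 * (1 - rho^(2t)) / (1 - rho^2) for t = 1, ..., T
theo = sig2_u * (1 - rho ** (2 * np.arange(1, T + 1))) / (1 - rho ** 2)
emp = (x2 ** 2).mean(axis=0)                # simulated E[x2_it^2]
```

With ρx2 = 0.7 and σu² = 8, the theoretical sequence starts at 8 for t = 1 and approaches 8/0.51 ≈ 15.7 as t grows, matching the range quoted in the text.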
The standard errors of the intercept and the coefficient of Z2i are smaller and lead to coefficients that are statistically significant.25 The slope parameters are all slightly biased (except β2), in particular the one associated with Z2i. Nevertheless, the null hypothesis that β1,1 = β1,2 = β2 = η1 = η2 = 1 cannot be rejected. Furthermore, as the standard errors of these coefficients are relatively small, the RMSEs are quite acceptable and often smaller than those of the usual IV estimator. In fact, the IV estimator only outperforms the Bayesian estimators in that it provides a slightly less biased estimate of the variance of the specific effects, σµ². The relative (negative) biases of the 3S bootstrap, 3S t-dist and FB amount to 46.34%, 48.11% and 47.04%, respectively. This is somewhat more than the (positive) bias of the Hausman-Taylor IV estimator, which amounts to 35%. More interestingly, the intercept η1 and the slope parameter η2 of the time-invariant variable Z2i are strongly significant, in contrast with the classical IV estimator. Our 3S robust Bayesian estimator is thus more robust to endogeneity problems than the Hausman-Taylor estimator. One may wonder why Chamberlain's MCS estimator is more robust to an endogeneity problem than the Hausman-Taylor IV estimator. Recall that the MCS, sometimes called the optimal minimum distance estimator, is qualitatively similar to the GMM estimator, whose properties lead to more efficient estimates than the 2SLS instrumental variable approach of the Hausman-Taylor estimator. Notwithstanding this, our robust 3S estimators are well-behaved in the face of endogeneity problems, as evidenced above.

25 The values of the coefficients c and d of the generalized hyper-g priors appear to play an important role in the precision of the intercept. Recall that we have chosen c = d = 1 for the Beta-prime distribution to increase the precision. With c = 0.1 and d = 1, the intercept is not significantly different from zero.

6. Conclusion

To our knowledge, our paper is the first to analyze the static linear panel data model using an ε-contamination approach within a two-stage and three-stage hierarchical approach. The main benefit of this approach is its ability to extract more information from the data than the classical Bayes estimator with a single base prior. In addition, we show that our approach encompasses a variety of classical or frequentist specifications such as random effects, Mundlak-type and Chamberlain-type fixed effects and Hausman-Taylor models. The frequentist approach, on the other hand, requires separate estimators for each model, respectively FGLS, MCS and IV. Following Chaturvedi and Singh (2012), we estimate the type-II maximum likelihood (ML-II) posterior distribution of the slope coefficients and the individual effects using a two-step procedure. If the base prior is consistent with the data, the ML-II posterior density gives more weight to the conditional posterior density derived from the elicited prior. Conversely, if the base prior is not consistent with the data, the ML-II posterior density is closer to the conditional posterior density derived from the ε-contaminated prior. The finite sample performance of the two-stage and three-stage hierarchical models is investigated using extensive Monte Carlo experiments. The experimental design includes a random effects world, a Mundlak-type fixed effects world, a Chamberlain-type fixed effects world and a Hausman-Taylor-type world. The simulation results underscore the relatively superior performance of the three-stage hierarchy estimator, irrespective of the data generating process considered. The biases and the RMSEs are close to, and often smaller than, those of the conventional (classical) estimators. In the two-stage hierarchy, there exists a trade-off between the Bayes estimators and the empirical Bayes estimators.
In the three-stage hierarchy, this trade-off vanishes and only the empirical Bayes estimator matters in the estimation of the coefficients and the individual effects. We also investigate the sensitivity of the estimators to the contamination part of the prior distribution. It turns out that most parameter estimates are relatively stable. Next, we investigate the robustness of the estimators when the remainder disturbances are assumed to follow either a right-skewed t-distribution or a χ² distribution. Our robust estimators globally outperform the classical estimators both in terms of precision and bias. Finally, we investigate the consequences of introducing an endogenous covariate. Our results show that both the MCS and our robust 3S estimators perform well. On the other hand, our 3S estimator clearly outperforms the Hausman-Taylor IV estimator both in terms of precision and bias. The robust Bayesian approach we propose is arguably a relevant all-in-one panel data framework. In future work we intend to broaden its scope by addressing issues such as heteroskedasticity, autocorrelation of residuals, general IV, dynamic and spatial models.

References

Abramovitz, M., Stegun, I.A., 1970. Handbook of Mathematical Functions. Dover Publications, New York.

Andersson, M., Karlsson, S., 2001. Bootstrapping error component models. Computational Statistics 16, 221–231.

Appell, P., 1882. Sur les fonctions hypergéométriques de deux variables. Journal de Mathématiques Pures et Appliquées 3, 173–216.

Baltagi, B.H., 2013. Econometric Analysis of Panel Data. Fifth ed., Wiley, Chichester, UK.

Baltagi, B.H., Bresson, G., 2012. A robust Hausman-Taylor estimator, in: Baltagi, B.H., Hill, C., Newey, W., White, H. (Eds.), Advances in Econometrics: Essays in Honor of Jerry Hausman. Emerald Group Publishing Limited, volume 29, pp. 175–214.

Baltagi, B.H., Bresson, G., Pirotte, A., 2003. Fixed effects, random effects or Hausman-Taylor? A pretest estimator. Economics Letters 79, 361–369.

Baltagi, B.H., Bresson, G., Pirotte, A., 2009. Testing the fixed effects restrictions? A Monte Carlo study of Chamberlain's minimum chi-square test. Statistics and Probability Letters 79, 1358–1362.

Bellman, L., Breitung, J., Wagner, J., 1989. Bias correction and bootstrapping of error component models for panel data: Theory and applications. Empirical Economics 14, 329–342.

Berger, J., 1985. Statistical Decision Theory and Bayesian Analysis. Springer, New York.

Berger, J., Berliner, M., 1984. Bayesian input in Stein estimation and a new minimax empirical Bayes estimator. Journal of Econometrics 25, 87–108.

Berger, J., Berliner, M., 1986. Robust Bayes and empirical Bayes analysis with ε-contaminated priors. Annals of Statistics 14, 461–486.

Bresson, G., Hsiao, C., 2011. A functional connectivity approach for modeling cross-sectional dependence with an application to the estimation of hedonic housing prices in Paris. Advances in Statistical Analysis 95, 501–529.

Brooks, S.P., Gelman, A., 1998. Alternative methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7, 434–455.

Chamberlain, G., 1982. Multivariate regression models for panel data. Journal of Econometrics 18, 5–46.

Chaturvedi, A., 1996. Robust Bayesian analysis of the linear regression. Journal of Statistical Planning and Inference 50, 175–186.

Chaturvedi, A., Singh, A., 2012. Robust Bayesian analysis of autoregressive fixed effects panel data model. Working Paper, Department of Statistics, University of Allahabad.

Chib, S., 1995. Marginal likelihood from the Gibbs sampler. Journal of the American Statistical Association 90, 1313–1321.

Chib, S., 2008. Panel data modeling and inference: a Bayesian primer, in: Mátyás, L., Sevestre, P. (Eds.), The Handbook of Panel Data, Fundamentals and Recent Developments in Theory and Practice, pp. 479–516.

Chib, S., Carlin, B.P., 1999. On MCMC sampling in hierarchical longitudinal models. Statistics and Computing 9, 17–26.

Davis, P.J., Rabinowitz, P., 1984. Methods of Numerical Integration. 2nd ed., Academic Press, New York.

Dempster, A.P., 1977. Examples relevant to the robustness of applied inference, in: Gupta, S.S., Moore, D.S. (Eds.), Statistical Decision Theory and Related Topics II. Academic Press, New York, pp. 121–138.

Fernández, C., Ley, E., Steel, M.F.J., 2001. Benchmark priors for Bayesian model averaging. Journal of Econometrics 100, 381–427.

Fernández, C., Steel, M.F.J., 1998. On Bayesian modeling of fat tails and skewness. Journal of the American Statistical Association 93, 359–371.

Gilks, W.R., Richardson, S., Spiegelhalter, D.J., 1997. Markov Chain Monte Carlo in Practice. 2nd ed., Chapman & Hall, London, UK.

Good, I.J., 1965. The Estimation of Probabilities. MIT Press, Cambridge, MA.

Greenberg, E., 2008. Introduction to Bayesian Econometrics. Cambridge University Press, Cambridge, UK.

Hausman, J.A., Taylor, W.E., 1981. Panel data and unobservable individual effects. Econometrica 49, 1377–1398.

Hill, B., 1980. Invariance and robustness of the posterior distribution of characteristics of a finite population with reference to contingency tables and the sampling of species, in: Zellner, A. (Ed.), Bayesian Analysis in Econometrics and Statistics, Essays in Honor of Harold Jeffreys. North-Holland, Amsterdam, pp. 383–391.

Hsiao, C., 2003. Analysis of Panel Data. Second ed., Cambridge University Press, Cambridge, MA.

Hsiao, C., Pesaran, M.H., 2008. Random coefficient models, in: Mátyás, L., Sevestre, P. (Eds.), The Handbook of Panel Data, Fundamentals and Recent Developments in Theory and Practice, pp. 185–214.

Kapetanios, G., 2008. A bootstrap procedure for panel datasets with many cross-sectional units. Econometrics Journal 11, 377–395.

Kass, R.E., Wasserman, L., 1995. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association 90, 928–934.

Koop, G., 2003. Bayesian Econometrics. Wiley, New York.

Laird, N.M., Louis, T.A., 1987. Empirical Bayes confidence intervals based on bootstrap samples. Journal of the American Statistical Association 82, 739–750.

Laird, N.M., Ware, J.H., 1982. Random-effects models for longitudinal data. Biometrics 38, 963–974.

Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O., 2008. Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103, 410–423.

Lindley, D., Smith, A., 1972. Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B (Methodological) 34, 1–41.

Magnus, J.R., Durbin, J., 1999. Estimation of regression coefficients of interest when other regression coefficients are of no interest. Econometrica 67, 639–643.

Maruyama, Y., George, E.I., 2011. Fully Bayes factors with a generalized g-prior. Annals of Statistics 39, 2740–2765.

Maruyama, Y., George, E.I., 2014. Posterior odds with a generalized hyper-g prior. Econometric Reviews 33, 251–269.

Moreno, E., Pericchi, L.R., 1993. Bayesian robustness for hierarchical ε-contamination models. Journal of Statistical Planning and Inference 37, 159–167.

Mundlak, Y., 1978. On the pooling of time series and cross-section data. Econometrica 46, 69–85.

Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P., 2007. Numerical Recipes: The Art of Scientific Computing. 3rd ed., Cambridge University Press, New York.

Rendon, S.R., 2013. Fixed and random effects in classical and Bayesian regression. Oxford Bulletin of Economics and Statistics 75, 460–476.

Robert, C.P., 2007. The Bayesian Choice. From Decision-Theoretic Foundations to Computational Implementation. 2nd ed., Springer, New York, USA.

Rubin, H., 1977. Robust Bayesian estimation, in: Gupta, S.S., Moore, D.S. (Eds.), Statistical Decision Theory and Related Topics II. Academic Press, New York, pp. 351–356.

Sinha, P., Jayaraman, P., 2010a. Bayes reliability measures of lognormal and inverse Gaussian distributions under ML-II ε-contaminated class of prior distributions. Defence Science Journal 60, 442–450.

Sinha, P., Jayaraman, P., 2010b. Robustness of Bayes decisions for normal and lognormal distributions under hierarchical priors. MPRA Working Paper No. 22416, Faculty of Management Studies, University of Delhi.

Slater, L.J., 1966. Generalized Hypergeometric Functions. Cambridge University Press, Cambridge, UK.

Smith, A., 1973. A general Bayesian linear model. Journal of the Royal Statistical Society, Series B (Methodological) 35, 67–75.

White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–838.

Zellner, A., 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distribution, in: Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, Studies in Bayesian Econometrics. North-Holland, Amsterdam, volume 6, pp. 389–399.

Zheng, Y., Zhu, J., Li, D., 2008. Analyzing spatial panel data of cigarette demand: a Bayesian hierarchical modeling approach. Journal of Data Science 6, 467–489.


Table 1: Random Effects World (FGLS, Robust 2S, 3S and Full Bayesian)
N = 100, T = 5, ρ = 0.8, ε = 0.5, replications = 1,000.
[Table body unrecoverable from the extraction. Columns: β11, β12, β2, σu², σµ², λβ, λµ and computation time (secs.). Rows: true values, then coef, se and rmse (plus nse for the Bayesian estimators) for FGLS, 2S bootstrap, 3S bootstrap, 2S t-dist, 3S t-dist and FB.]
2S bootstrap (3S bootstrap): 2S (or 3S) with individual block resampling bootstrap. 2S (3S) t-dist: 2S (3S) with multivariate t-distributions. FB: Full Bayesian.

Table 2: Mundlak-type Fixed Effects World (FGLS, Robust 2S, 3S and Full Bayesian)
N = 100, T = 5, ρ = 0.8, ε = 0.5, replications = 1,000.
[Table body unrecoverable from the extraction. Columns: β11, β12, β2, π, σu², σµ², λβ, λµ and computation time (secs.). Rows: true values, then coef, se and rmse (plus nse for the Bayesian estimators) for Mundlak FGLS, 2S bootstrap, 3S bootstrap, 2S t-dist, 3S t-dist and FB.]
2S bootstrap (3S bootstrap): 2S (or 3S) with individual block resampling bootstrap. 2S (3S) t-dist: 2S (3S) with multivariate t-distributions. FB: Full Bayesian.

Table 3: Chamberlain-type Fixed Effects World (MCS, Robust 2S, 3S and Full Bayesian)
N = 100, T = 5, ρ = 0.8, ε = 0.5, replications = 1,000.
[Table body unrecoverable from the extraction. Columns: β11, β12, β2, σu², σµ², λβ, λµ and computation time (secs.). Rows: true values, then coef, se and rmse (plus nse for the Bayesian estimators) for MCS, 2S bootstrap, 3S bootstrap, 2S t-dist, 3S t-dist and FB.]
2S bootstrap (3S bootstrap): 2S (or 3S) with individual block resampling bootstrap. 2S (3S) t-dist: 2S (3S) with multivariate t-distributions. FB: Full Bayesian. The parameters πt are omitted from the table.

Table 4: Hausman-Taylor World (IV, Robust 2S, 3S and Full Bayesian)
N = 100, T = 5, ρ = 0.8, ε = 0.5, replications = 1,000.
[Table body unrecoverable from the extraction. Columns: β11, β12, β2, η1, η2, σu², σµ², λβ, λµ and computation time (secs.). Rows: true values, then coef, se and rmse (plus nse for the Bayesian estimators) for HT, 2S bootstrap, 3S bootstrap, 2S t-dist, 3S t-dist and FB.]
2S bootstrap (3S bootstrap): 2S (or 3S) with individual block resampling bootstrap. 2S (3S) t-dist: 2S (3S) with multivariate t-distributions. FB: Full Bayesian.

Table 5: Hausman-Taylor World, Robustness to ε-contamination
N = 100, T = 5, ρ = 0.8, replications = 1,000.
[Table body unrecoverable from the extraction. Columns: β11, β12, β2, η1, η2, σu², σµ², λβ, λµ. Rows: true values, then coef, se and rmse for HT, followed by the 2S bootstrap and 3S bootstrap estimators at ε = 0, 0.1, 0.5 and 0.9.]

Table 6: Departure from Normality: The Skewed t-distribution
N = 100, T = 5, ε = 0.5, ρ = 0.8, replications = 1,000.
[Table body unrecoverable from the extraction. Three panels: Mundlak-type Fixed Effects World†, Chamberlain-type Fixed Effects World‡ and Hausman-Taylor World. Each panel reports coef, se and rmse for the classical estimator (Mundlak, MCS, HT) and the 3S bootstrap over β11, β12, β2 (and η1, η2 for the Hausman-Taylor panel), σu², σµ², λβ and λµ.]
† The parameter π is omitted. ‡ The parameters π1, . . . , π5 are omitted.

Table 7: Sensitivity to Endogenous Covariates
N = 100, T = 5, ε = 0.5, replications = 1,000.
[Table body unrecoverable from the extraction. Two panels: Chamberlain-type Fixed Effects World† (MCS, 3S bootstrap, 3S t-dist, FB) and Hausman-Taylor World (HT, 3S bootstrap, 3S t-dist, FB). Each panel reports coef, se and rmse (plus nse for the Bayesian estimators) for β11, β12, β2 (and η1, η2 for the Hausman-Taylor panel), σu², σµ², λβ and λµ.]
† The parameters π1, . . . , π5 are omitted from the table.