ARTICLE IN PRESS
JID: ECOSTA
[m3Gsc;April 14, 2017;12:1]
Econometrics and Statistics 0 0 0 (2017) 1–13
On limiting distribution of quasi-posteriors under partial identification

Wenxin Jiang 1,2

Northwestern University, 2006 Sheridan Rd, Evanston, IL 60208, United States
Article info

Article history: Received 6 October 2016; Revised 19 March 2017; Accepted 21 March 2017; Available online xxx
Abstract

The limiting distribution (in total variation) is established for the quasi-posteriors based on moment conditions, which only partially identify the parameters of interest. Some examples are discussed. © 2017 EcoSta Econometrics and Statistics. Published by Elsevier B.V. All rights reserved.
Keywords: Generalized method of moments; Interval data; Moment inequalities; Partial identification; Quasi-posterior; Total variation
1. Introduction

Our paper studies theoretically the large sample behavior of certain Bayesian procedures that are of mutual interest to econometricians and statisticians. In Bayesian procedures, it is well known that the posterior distributions can often be approximated in total variation by normal distributions centered at the frequentist maximum likelihood estimates (see, e.g., the popular textbook account of the Bernstein-von Mises theorem in Chapter 10 of van der Vaart, 2000). Econometricians have studied quasi-Bayesian approaches which require fewer assumptions, by only assuming some moment conditions rather than a likelihood function (see, e.g., Kim, 2002 and Chernozhukov and Hong, 2003). In this context, similar and more general limiting results in, e.g., Chernozhukov and Hong (2003) suggest that the limiting posterior distribution is typically normal and centered at a corresponding frequentist extremum estimator, such as one from the Generalized Method of Moments.

In models with partial identification, where parameters are not point identified, so that the frequentist extremum estimator is not unique, the asymptotic normal limiting results mentioned before may fail. Such situations of partial identification have generated much interest recently, both in statistics (e.g., Gustafson, 2005; 2007; 2015) and in econometrics (e.g., Poirier, 1998; Chernozhukov et al., 2007; Moon and Schorfheide, 2012). The literature focuses either on inference about the location of the partially identified parameter (e.g., Moon and Schorfheide, 2012; Gustafson, 2015), or on the fully identified set of all possible locations of the parameter (e.g., Chernozhukov et al., 2007). An incomplete sample of applications includes missing data and interval censoring (e.g., Manski and Tamer, 2002; Manski, 2003), game-theoretic models
E-mail address: [email protected]
1 Taishan Scholar Overseas Distinguished Specialist Adjunct Professor at Shandong University in China.
2 Professor of Statistics at Northwestern University in U.S.A.
http://dx.doi.org/10.1016/j.ecosta.2017.03.006 2452-3062/© 2017 EcoSta Econometrics and Statistics. Published by Elsevier B.V. All rights reserved.
Please cite this article as: W. Jiang, On limiting distribution of quasi-posteriors under partial identification, Econometrics and Statistics (2017), http://dx.doi.org/10.1016/j.ecosta.2017.03.006
with multiple equilibria (e.g., Bajari et al., 2007; Ciliberto and Tamer, 2009), auctions (Haile and Tamer, 2003), noncompliance in randomized clinical trials (Gustafson, 2015), and gene-environment interactions (Gustafson, 2015).

Our current paper derives and rigorously proves some results on the limiting posterior distribution in the presence of partial identification. In addition, we allow quasi-Bayes procedures based on moment conditions. The limit is in the total variation sense, which is a Bernstein-von Mises type result, but not generally asymptotically normal. When the data are informative enough only to determine an identification region, instead of a point parameter, our result says that the limiting posterior is related to the prior distribution truncated in (a frequentist estimate of) this identification region. Our result connects the literature on inference about the identified parameter set to inference about the unidentified parameter point, in the sense that one can easily convert a set estimate and combine it with a prior distribution to obtain a large sample approximation of the posterior distribution of the point parameter. This connection may have several meaningful applications.

1. (Simplifying computation.) It can be used to avoid the lengthy Markov chain Monte Carlo simulation that is typically involved in posterior computations, similar to using the normal approximation in the point-identified situation.

2. (Incorporating prior information.) This is also useful for incorporating prior information to improve the inference results from the more conservative set-based approach. The prior information can be from the same or a different study with additional data that either identify or partially identify the parameters of interest. For example, when a small part of the data are exact and all the rest are interval censored, it is obviously not advisable to use the interval data only to estimate an identification region.
The exact part of the data can be used to derive a posterior, which can serve as a prior when further incorporating the interval data.

3. (Combining studies.) In the above discussion, we have applied the principle that the prior can be derived from a posterior based on independent data. Multiple applications of this principle can allow meta-analysis (combining results from different studies), sequential computation (for a dynamic data flow), or parallel computation (subsetting big data when they are hard to handle altogether). Our limiting posterior distribution suggests that combining inferences from subsets of data is equivalent to intersecting their resulting identification regions.

1.1. Related works

A fundamental paper about two decades ago by Poirier (1998) has shown many applications of the Bayesian method in handling problems with partial identification, where data D are informative for only a subset of the parameters, say λ, out of all the parameters (λ, θ) of the model. This decomposition into (λ, θ) does not have to be the most natural parameterization, but can be achieved by a clever re-parameterization. This situation is further illustrated by a sequence of works by Gustafson (e.g., Gustafson, 2005; 2007; 2015), with many interesting examples. The works of these authors have described that the limiting posterior distribution of p(λ, θ|D) = p(λ|D) p(θ|λ) is the product of a usual asymptotic normal distribution on λ and a conditional prior p(θ|λ), where λ takes either the true value or its maximum likelihood estimate. Recently this limiting result was rigorously proved in total variation distance by Moon and Schorfheide (2012, Theorem 1). The key for this line of existing work is the following assumption:

Assumption (∗) (Conditional uninformativeness): There exists a parameterization decomposable into (λ, θ), such that the likelihood p(D|θ, λ) = p(D|λ) depends only on λ, i.e., D and θ are independent given λ.
Moon and Schorfheide (2012) call λ the “reduced form” parameter, and θ the “structural” parameter of interest. The current work aims to generalize the works of these previous authors and to study the limiting posteriors under partial identification. The generalization is in two important ways:

Generalization (i) (Posterior): We generalize the likelihood-based posterior to a quasi-likelihood-based quasi-posterior, according to a general framework described in Chernozhukov and Hong (2003).

Generalization (ii) (Partial identification): We also allow more general scenarios of partial identification, where no obvious decomposition (λ, θ) can satisfy Assumption (∗), so that given λ, the data D are not “conditionally” uninformative about / independent of the parameter of interest θ.

To be more specific: we allow a quasi-likelihood of the form e^{-n R_n(λ, θ)}, where R_n is a general empirical risk function that depends on the data D, and which was (-1/n) times the log-likelihood function in the special case of the usual likelihood. We allow this quasi-likelihood to depend also on θ (unlike in Assumption (∗) before), and only assume the following:

Assumption (†) (Marginal uninformativeness): There is a parameterization that can decompose into (λ, θ), such that the marginal likelihood ∫ e^{-n R_n(λ, θ)} dλ is constant in θ (and is therefore “marginally” uninformative).

It is obvious that this Assumption (†) (marginal uninformativeness) contains Assumption (∗) (conditional uninformativeness) as a special case, where R_n = R_n(λ) has no dependence on θ; but we will show later that there indeed exist interesting examples of Assumption (†) which do not satisfy Assumption (∗).

The current paper is closely related to the papers by Liao and Jiang (2010) and Liao (2010), who also consider a quasi-posterior with partial identification.
These works can be viewed as a special case of the current paper (see the example of Bayesian moment inequalities in Section 2.4), and they have not established the result of convergence in distribution in the total variation distance. Their approach is also fundamentally different: they use a specific (exponential) prior on λ to facilitate the integration over λ, and obtain an expression of the marginal quasi-posterior p(θ|data) for the parameter of interest θ. Their proof technique then starts with analyzing this specific expression of the marginal quasi-posterior
p(θ|data). In contrast, the proof technique here starts with a general form of the quasi-posterior in (λ, θ) and does not first integrate over an exponential prior in λ.

1.2. Organization of the paper

Section 2 provides some partial identification examples with quasi-posteriors from the Bayesian Generalized Method of Moments (BGMM). Section 3 formulates the main theorem and its conditions. Section 4 explains the main theorem for the BGMM examples and how the theorem can be applied. Section 5 studies the regularity conditions in the context of the BGMM quasi-posteriors. Section 6 describes some limitations of the current paper. Section 7 contains some discussions. Technical proofs are included in an Appendix. Some details of applications are included in the Supplementary materials, where one application demonstrates how to apply the main theorem to simplify the simulation and description of quasi-posteriors under partial identification, and another demonstrates how to apply the theorem to combine analyses from published information.

2. Examples

We first describe a class of quasi-posteriors obtained from the BGMM (Bayesian Generalized Method of Moments). Then we give several examples originating from more specific applications, and apply a quasi-posterior from the BGMM. We use these examples to demonstrate Generalizations (i) and (ii), and how sometimes Assumption (†) in the Introduction can still be satisfied when Assumption (∗) fails. Consider a quasi-posterior density of the form
q(λ, θ|data) ∝ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ),

where R_n is a GMM (Generalized Method of Moments) criterion function

R_n = 0.5 (m̄(θ) - λ)^T v^{-1} (m̄(θ) - λ) + 0.5 n^{-1} log|2π v/n|,

where m̄ is the sample version of Em, and the variance matrix v can be chosen as I for simplicity (or alternatively as an estimate of n var(m̄), similar to the one used in Chapter 2 of Liao, 2010), which turns out to be irrelevant to the asymptotic inference about θ in the current partial identification scenario. When we need a more explicit form, we will consider a sample average m̄(θ) = n^{-1} Σ_{i=1}^n m(W_i, θ), and Em(θ) = Em(W, θ), where W, W_1, ..., W_n, ... are iid (independent and identically distributed).

The above quasi-posterior corresponds to a Bayesian treatment of the moment condition Em(θ) - λ = 0, (λ, θ) ∈ Λ×Θ. The difference m̄(θ) - λ used in R_n is a sample version of Em(θ) - λ. Here, λ is a “bias parameter” which will appear naturally in all the examples later in this section, due to partial identification. This framework is more general than the “Bayesian moment inequalities” of Liao (2010). The latter corresponds to the constraint {(λ, θ): λ ≥ 0, θ ∈ Θ}, where Λ = [0, ∞)^{dim(m)} is the nonnegative orthant. The current framework allows other constraint sets, such as Λ = [0, 1] for the bias parameter λ, which can naturally arise in the “rounded data” example below.

2.1. Rounded data

This is based on a simple example of Section 2 of Moon and Schorfheide (2012). We here identify that it can be regarded as a special case of our framework. In this example, there is no Generalization (ii) in the structure of partial identification: the parameter decomposition still satisfies Assumption (∗) in the Introduction. We use this example only to motivate Generalization (i) in using a quasi-likelihood (instead of a true likelihood).

Suppose we are interested in a structural parameter θ, but it is only known to lie in [φ, φ + 1], while we observe iid copies of W ∼ N(φ, 1). Then the likelihood function is p(data|φ, θ) ∝ e^{-n R_n} ∝ e^{-0.5 Σ_{i=1}^n (W_i - φ)^2}, which is independent of the structural parameter θ. Here we can take the reduced form parameter (λ in the Introduction) to be φ. Therefore Assumption (∗) holds. Our more relaxed Assumption (†) also holds: the marginalized likelihood ∫ e^{-n R_n} dφ is totally uninformative of θ, which fits our framework. (A referee kindly points out here that the data are only informative about φ. So, any inference on θ must come from the prior. However, given φ, there is a unique identified set [φ, φ + 1] for θ. So, inference on the identified set [φ, φ + 1] is / can be prior free.)

It is noted that we can use a quasi-likelihood based on moment conditions and be more flexible about the modeling of W. For example, suppose θ = EY, where the observed W is the integer part of the hidden Y. Then Y - W ∈ [0, 1] and indeed θ ∈ [φ, φ + 1], where φ = EW. However, it is unsatisfactory to assume a normal model W ∼ N(φ, 1), since W is an integer. In this case, our approach would be to use only the moment inequalities EW ≤ EY ≤ EW + 1. We rewrite this with a moment equation Em - λ = E(θ - W) - λ = 0, where λ (= θ - φ) is a bias parameter associated with the rounding error Y - W, which is constrained as λ ∈ Λ = [0, 1]. The corresponding sample moment is m̄ - λ = θ - W̄ - λ. Then we use a quasi-likelihood function e^{-n R_n} ∝ e^{-0.5 n (m̄ - λ)^2/v - 0.5 log|2π v/n|} with v = n^{-1} Σ_{i=1}^n (W_i - W̄)^2. The corresponding quasi-posterior will be q(λ, θ|data) ∝ e^{-n R_n} p(λ, θ) I_{Λ×Θ}(λ, θ), where p(λ, θ) is a prior and we have explicitly written out a region Λ×Θ for any constraints on λ and θ. The later development on the limiting distribution will show that this quasi-posterior
still makes sense in estimating the structural parameter θ, even though we cannot easily use a genuine likelihood-based posterior in this case due to the rounded values of the observed data W. On the other hand, this simple example cannot be used to motivate Generalization (ii), since the structure of partial identification still satisfies the previous framework of Assumption (∗): we can reparameterize and treat θ - λ (= φ) as the new reduced form parameter λ. Given this new λ, the data are conditionally uninformative about θ. The next example shows that sometimes no obvious reparameterization like this exists, yet due to Generalization (ii), the example still satisfies the more general Assumption (†) made in our proposed framework.
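As a numerical illustration (the simulation setup, sample size, and all variable names below are our own assumptions, not taken from the paper), the rounded-data quasi-posterior can be evaluated on a grid; with flat priors, its marginal in θ concentrates on the estimated identification region [W̄, W̄ + 1]:

```python
# Sketch: grid evaluation of the rounded-data BGMM quasi-posterior (flat priors assumed).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
theta0 = 2.3                       # assumed true structural parameter EY
Y = rng.normal(theta0, 1.0, n)     # hidden variable
W = np.floor(Y)                    # observed rounded data
W_bar, v = W.mean(), W.var()

# Grids for theta and the bias parameter lambda, constrained to Lambda = [0, 1]
theta = np.linspace(W_bar - 0.5, W_bar + 1.5, 400)
lam = np.linspace(0.0, 1.0, 200)
T, L = np.meshgrid(theta, lam, indexing="ij")

m_bar = T - W_bar                        # sample moment m_bar(theta) = theta - W_bar
logq = -0.5 * n * (m_bar - L) ** 2 / v   # log quasi-likelihood
q = np.exp(logq - logq.max())

post_theta = q.sum(axis=1)               # unnormalized marginal quasi-posterior of theta
inside = (theta >= W_bar) & (theta <= W_bar + 1.0)
frac = post_theta[inside].sum() / post_theta.sum()
print(round(frac, 3))                    # nearly all mass falls in [W_bar, W_bar + 1]
```

This previews the truncated-prior limit established later: outside the estimated identification region the quasi-posterior mass vanishes at the exponential rate.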
2.2. Interval regression

Interval regression has become an important example in the partial identification literature since Moon and Schorfheide (2012). We here consider an extension of Example 2 of Chernozhukov et al. (2007). Assume that E(Y - g(Xθ)|Z) = 0 for some positive instrumental variable Z and a known parametric transform g. The structural parameter of interest is θ ∈ Θ. However, Y is only observed to fall in an interval [L, U]. This model cannot be easily treated in a reduced form. It is unclear how to form a reduced form parameter so that the data are independent of θ given the reduced form parameter (except when the data are discrete and the frequency distribution of (Y, X, Z) can be used as the reduced form parameter, as a referee points out). In addition, it may not be desirable here to use a likelihood approach, which would involve a joint probability model of [L, U]. Instead, we will use the moment conditions alone: E(L|Z) ≤ E(g(Xθ)|Z) ≤ E(U|Z), which implies

Em - λ ≡ (E[(g(Xθ) - L)Z]^T - λ_1^T, E[(U - L)Z]^T - λ_2^T)^T = 0,

constrained in λ = (λ_1^T, λ_2^T)^T ∈ {0 ≤ λ_1 ≤ λ_2} ≡ Λ. Consider a quasi-posterior density of the form
q(λ, θ|data) ∝ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ),

where p(λ, θ) is a prior density on Ω = Λ×Θ with respect to a product base measure dλ dθ, θ is the parameter of interest, and λ is the nuisance parameter. Here R_n is a BGMM criterion function R_n = 0.5 (m̄(θ) - λ)^T v^{-1} (m̄(θ) - λ) + 0.5 n^{-1} log|2π v/n|, as described at the beginning of this section. Then this example is still covered in our framework, since it is obvious that the marginalized likelihood

∫_{λ∈ℝ²} e^{-n R_n(λ, θ)} dλ = 1,

which is completely uninformative about θ. In other words, the quasi-likelihood e^{-n R_n(λ, θ)} depends on θ and therefore violates Assumption (∗) in Section 1, but Assumption (†) is still satisfied. This example therefore involves both Generalizations (i) and (ii).
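The Gaussian-integral fact behind this display can be checked numerically. The following sketch (the matrix v, sample size n, and grid settings are arbitrary assumptions of ours) confirms that the unconstrained integral of e^{-n R_n(λ, θ)} over λ equals 1 no matter what value m̄(θ) takes, which is exactly the marginal uninformativeness of Assumption (†):

```python
# Sketch: numerically integrate exp(-n R_n) over lambda for two values of m_bar(theta).
import numpy as np

n = 50
v = np.array([[1.0, 0.3], [0.3, 2.0]])         # assumed 2x2 weighting matrix v
v_inv = np.linalg.inv(v)
log_det = np.log(np.linalg.det(2 * np.pi * v / n))

grid = np.linspace(-4.0, 4.0, 401)
dl = grid[1] - grid[0]
A, B = np.meshgrid(grid, grid, indexing="ij")  # grid over lambda = (lam1, lam2)

totals = []
for m_bar in (np.array([0.0, 0.0]), np.array([0.7, -1.2])):
    d1, d2 = m_bar[0] - A, m_bar[1] - B
    quad = v_inv[0, 0] * d1**2 + 2 * v_inv[0, 1] * d1 * d2 + v_inv[1, 1] * d2**2
    quasi_lik = np.exp(-0.5 * n * quad - 0.5 * log_det)   # exp(-n R_n)
    totals.append(quasi_lik.sum() * dl**2)
print([round(t, 3) for t in totals])           # both close to 1, independent of theta
```

The quasi-likelihood is, by construction, the N(m̄(θ), v/n) density in λ, so the integral is 1 for every θ.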
2.3. Interval quantile regression

This example is similar to the previous one, except that we consider quantile regression here. This may be useful when we would like to model a quantile (e.g., the median instead of the mean), which is more resistant against outliers, e.g., in income. The interquartile range can also be derived, which can measure the “spread” (in addition to the “location”) without needing to model the second moments.

Assume that E(q - I(Y ≤ g(Xθ))|Z) = 0 for some fixed q ∈ (0, 1), for some positive instrumental variable Z and a known parametric transform g. The structural parameter of interest is θ ∈ Θ. However, Y is only observed to fall in an interval [L, U]. This model cannot be easily treated in a reduced form. It is unclear how to form a reduced form parameter so that the data are independent of θ given the reduced form parameter. In addition, it is not desirable here to use a likelihood approach, which would involve a joint probability model of [L, U]. Instead, we will use the moment conditions alone: E(q - I(L ≤ g(Xθ))|Z) ≤ 0 and E(q - I(U ≤ g(Xθ))|Z) ≥ 0, or alternatively, 0 ≤ E(q - I(U ≤ g(Xθ))|Z) ≤ E(I(L ≤ g(Xθ)) - I(U ≤ g(Xθ))|Z). This implies

Em - λ ≡ (E[(q - I(U ≤ g(Xθ)))Z]^T - λ_1^T, E[(I(L ≤ g(Xθ)) - I(U ≤ g(Xθ)))Z]^T - λ_2^T)^T = 0,

constrained in λ = (λ_1^T, λ_2^T)^T ∈ {0 ≤ λ_1 ≤ λ_2} ≡ Λ. Consider a quasi-posterior density of the form
q(λ, θ|data) ∝ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ),

where p(λ, θ) is a prior density on Ω = Λ×Θ with respect to a product base measure dλ dθ, θ is the parameter of interest, and λ is the nuisance parameter. Here R_n is a GMM criterion function R_n = 0.5 (m̄(θ) - λ)^T v^{-1} (m̄(θ) - λ) + 0.5 n^{-1} log|2π v/n|, similar to the one used in the previous example, with the new sample moment m̄ corresponding to the quantile-related moment Em defined above for the current example. Then this example is still covered in our framework, since it is obvious that the marginalized likelihood

∫_{λ∈ℝ²} e^{-n R_n(λ, θ)} dλ = 1,

which is completely uninformative about θ. Therefore Assumption (†) in Section 1 is satisfied (but not Assumption (∗)). This example therefore involves both Generalizations (i) and (ii).
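For this example, the estimated identification region {θ: 0 ≤ m̄_1(θ) ≤ m̄_2(θ)} is easy to compute on a grid. The following sketch (the data generating process, the scalar instrument, and the linear choice g(Xθ) = Xθ are illustrative assumptions of ours, not from the paper) recovers an interval of θ values containing the true quantile regression coefficient:

```python
# Sketch: grid search for the estimated identification region in interval quantile regression.
import numpy as np

rng = np.random.default_rng(1)
n, q, theta0 = 5000, 0.5, 1.5
X = rng.uniform(0.0, 2.0, n)
Z = 1.0 + X                                    # a positive instrument
Y = theta0 * X + rng.normal(0.0, 1.0, n)       # latent outcome with median theta0 * X
Lo, Up = np.floor(Y), np.floor(Y) + 1.0        # only the interval [Lo, Up] is observed

def sample_moments(theta):
    g = theta * X                              # g(X theta) = X * theta in this sketch
    m1 = np.mean((q - (Up <= g)) * Z)          # estimates lambda_1(theta)
    m2 = np.mean(((Lo <= g).astype(float) - (Up <= g).astype(float)) * Z)  # lambda_2(theta)
    return m1, m2

# Estimated identification region: theta with 0 <= m1(theta) <= m2(theta)
region = []
for t in np.linspace(0.0, 3.0, 301):
    m1, m2 = sample_moments(t)
    if 0.0 <= m1 <= m2:
        region.append(t)
print(min(region), max(region))                # an interval around theta0
```

The interval-censored quantile condition only brackets θ, so the region has positive width even as n grows.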
2.4. Bayesian moment inequalities

Consider a special case of the BGMM quasi-posterior density described at the beginning of this section:

q(λ, θ|data) ∝ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ),

where R_n = 0.5 (m̄(θ) - λ)^T v^{-1} (m̄(θ) - λ) + 0.5 n^{-1} log|2π v/n|. When we constrain λ ∈ Λ = [0, ∞)^{dim(m)}, this corresponds to a Bayesian treatment of the moment inequality models which assume Em(θ) ≥ 0 (componentwise) (Chernozhukov et al., 2007). This is because the quasi-likelihood may be derived heuristically from assuming m̄(θ) ∼ N(λ, v/n), where λ is the expectation Em(θ) and is constrained here to have nonnegative components. Chernozhukov et al. (2007) provided several interesting examples of such moment inequality models. The constraint λ ∈ Λ may be regarded as part of the specification of the prior distribution p(λ, θ) I_{Λ×Θ}(λ, θ). Before this prior constraint is imposed, the unconstrained integration over λ of the quasi-likelihood e^{-n R_n(λ, θ)} is uninformative about θ, since

∫_{λ∈ℝ^{dim(m)}} e^{-n R_n(λ, θ)} dλ = 1.
Therefore, Assumption (†) (but not Assumption (∗)) in Section 1 is satisfied, and we have found another example involving both Generalizations (i) and (ii).

3. A general framework

Consider a quasi-posterior density of the form

q(λ, θ|data) ∝ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ),

where p(λ, θ) (when restricted to Ω = Λ×Θ) is a prior density with respect to a product base measure dλ dθ, θ is a parameter of interest, and λ is a nuisance parameter. Here R_n is an empirical risk function. The factor e^{-n R_n(λ, θ)} is called the quasi-likelihood, since it plays the role of a likelihood function that summarizes the information from the data, as the empirical risk R_n depends on the data. We consider a partially identified case, where the quasi-likelihood is uninformative about θ but informative about λ for each fixed θ. The following two major conditions describe this situation formally. The first is a marginal uninformativeness condition that generalizes Assumption (†) in the Introduction.

Condition 1 (Marginal uninformativeness in θ). There exist C_n > 0 independent of λ and θ, and a nonstochastic function τ(θ) > 0, such that τ_n(θ) ≡ ∫ C_n e^{-n R_n(λ, θ)} dλ satisfies

∫ |τ_n(θ)/τ(θ) - 1|^2 p(θ) dθ = o_p(1).
This condition states that the quasi-likelihood function is uninformative for the posterior inference on θ, in the sense that the marginalized likelihood ∫ e^{-n R_n(λ, θ)} dλ, before incorporating prior information, is “almost flat” in θ, i.e., proportional to a τ(θ) of order 1 (with an irrelevant proportionality constant that can depend on n). This describes uninformativeness in θ, since the marginal likelihood changes with θ “slowly,” by a factor of order 1 determined by the ratio of the τ(θ)'s at different θ locations. (In contrast, in identified cases, when data are informative about a parameter θ, the likelihood typically changes with θ by a factor that is exponentially large in n.)

In all our examples in Section 2, we use the BGMM quasi-likelihood, where R_n is quadratic in λ and the integration over λ in the marginal likelihood can be computed exactly to give τ(θ) = 1. Therefore Condition 1 is satisfied for all our examples. An example of τ(θ) ≠ 1 can be obtained if we ignore the determinant in the BGMM and define R_n = 0.5 (m̄(θ) - λ)^T v(θ)^{-1} (m̄(θ) - λ), where v(θ) is an estimate of n var(m̄(θ)) that depends on θ. Then the marginal likelihood ∫ e^{-n R_n(λ, θ)} dλ = det(2π v(θ)/n)^{1/2}, which is proportional to det(v(θ))^{1/2}. We can take τ(θ) to be the large sample limit of this quantity det(v(θ))^{1/2}, which may vary with θ. We could redefine the empirical risk R_n to absorb the factor related to τ(θ) (making τ = 1), as we did in all our examples. However, to allow flexibility in the definition of the empirical risk, we will keep the factor τ in the formulation and proof of our theorem later.

The second condition assumes that, fixing θ, the parameter λ is asymptotically normal according to the quasi-likelihood. (Roughly speaking, this implies that λ is identified and regular given the unidentified θ, so that we have a partially identified case here, instead of a fully unidentified case.)

Condition 2 (Asymptotic normality of λ given θ). There exists a reparameterization λ = λ̃(θ) + t/√n, such that the quasi-likelihood of the local parameter t conditional on θ converges to a normal distribution in total variation, as follows:

∫ |e^{-n R_n(λ̃(θ)+t/√n, θ)} / ∫ e^{-n R_n(λ̃(θ)+s/√n, θ)} ds - φ_V(t)| dt = o_p(1),

for almost all θ according to the prior p(θ), where φ_V(t) = e^{-t^T V^{-1} t/2} / det(2π V)^{1/2} is the normal density of N(0, V), and V is a conditional asymptotic variance scaling as order 1 that can depend on θ.
This condition assumes that the quasi-likelihood function is informative for the posterior inference on λ conditional on any given θ, in the sense that the usual Bayesian asymptotic normality holds for λ = λ̃(θ) + t/√n. The center λ̃(θ) will be called an extremum estimator, since it is usually the minimizer of the empirical risk R_n(λ, θ) over λ given θ, or its first order approximation; see, e.g., Belloni and Chernozhukov (2009, Theorem 1). When the likelihood function is used in the posterior, the Bayesian asymptotic normality (sometimes called the Bernstein-von Mises theorem) is a standard result in Bayesian asymptotics (see, e.g., van der Vaart, 2000, Chapter 10). In the context of partial identification, Moon and Schorfheide (2012) also assumed asymptotic normality for the reduced form parameter in their Assumption 1 (ii) and have discussed how the assumption can be satisfied. When a quasi-likelihood is used instead, general conditions for asymptotic normality are given by, e.g., Chernozhukov and Hong (2003, Theorem 1) and Belloni and Chernozhukov (2009, Theorem 1). The most important condition involves a quadratic approximation of R_n around its minimizer in λ. Our Condition 2 could be proved by checking, e.g., Belloni and Chernozhukov's conditions for the posterior density, under a flat prior, of λ conditional on θ, for almost all θ according to the prior p(θ). However, this is unnecessary for our examples: for all our examples in Section 2, R_n is quadratic in λ and the quasi-likelihood conforms exactly to a normal density. So Condition 2 is satisfied for all our examples.

Under these two basic assumptions, and with some additional mild regularity conditions, we have the following theorem.

Theorem 1.
Under Conditions 1 and 2, and additional mild regularity Conditions 4–8 (to be stated later), we have the following results:

(i) p(λ, θ|data) converges in total variation and in probability to a density proportional to N(λ̃(θ), V/n) × p(λ̃(θ), θ) I_{Λ×Θ}(λ̃(θ), θ) × τ(θ).

After integrating away the nuisance parameter λ, we conjecture that

(ii) p(θ|data) converges in total variation and in probability to a density proportional to p(λ̃(θ), θ) I_{Λ×Θ}(λ̃(θ), θ) × τ(θ).

(iii) In the case when the extremum estimator λ̃(θ) converges to a nonstochastic limit λ(θ) under Condition 3, the data dependent λ̃(θ) in p(λ̃(θ), θ) I_{Λ×Θ}(λ̃(θ), θ) that appears in both results (i) and (ii) can be replaced by λ(θ). More formally:

∫ |e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ)/A - φ_V(t) τ(θ) p(λ(θ), θ) I_{Λ×Θ}(λ(θ), θ)/B| dt dθ = o_p(1),

where A = ∫ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ) dt dθ and B = ∫ φ_V(t) τ(θ) p(λ(θ), θ) I_{Λ×Θ}(λ(θ), θ) dt dθ, and λ = λ̃(θ) + t/√n. The corresponding marginalized result is separately listed as:

(iv) ∫ |∫_λ e^{-n R_n(λ, θ)} p(λ, θ) I_{Λ×Θ}(λ, θ) dλ / A - τ(θ) p(λ(θ), θ) I_{Λ×Θ}(λ(θ), θ)/B| dθ = o_p(1).
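As an illustrative numerical check of results (i) and (ii) (with a simulation setup, prior, and grid entirely of our own choosing, not from the paper), the finite-sample marginal quasi-posterior of θ in the rounded-data example can be compared in total variation with the limiting density, namely the prior truncated to the estimated identification region [W̄, W̄ + 1]:

```python
# Sketch: total variation distance between the marginal quasi-posterior and its limit.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
Y = rng.normal(0.7, 1.0, n)                    # hidden variable (assumed EY = 0.7)
W = np.floor(Y)                                # observed rounded data
W_bar, v = W.mean(), W.var()

theta = np.linspace(W_bar - 1.0, W_bar + 2.0, 1200)
lam = np.linspace(0.0, 1.0, 400)
dth, dla = theta[1] - theta[0], lam[1] - lam[0]
T, L = np.meshgrid(theta, lam, indexing="ij")

p_theta = np.exp(-0.5 * (theta / 10.0) ** 2)   # assumed N(0, 10^2) prior on theta; flat prior on lambda
logq = -0.5 * n * ((T - W_bar) - L) ** 2 / v   # log quasi-likelihood
q = np.exp(logq - logq.max()) * p_theta[:, None]

post = q.sum(axis=1) * dla                     # finite-n marginal quasi-posterior of theta
post /= post.sum() * dth

limit = p_theta * ((theta >= W_bar) & (theta <= W_bar + 1.0))  # prior truncated to [W_bar, W_bar + 1]
limit /= limit.sum() * dth

tv = 0.5 * np.abs(post - limit).sum() * dth    # total variation distance
print(round(tv, 3))                            # small for large n
```

The distance shrinks as n grows, since the only discrepancy is the O(n^{-1/2}) Gaussian smoothing of the region's edges.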
Results (iii) and (iv) involve a situation where the extremum estimator λ̃(θ) converges in some sense to a nonstochastic limit λ(θ). More formally, they assume:

Condition 3 (On the limit of the extremum estimator).

∫ p(θ) ||λ̃(θ) - λ(θ)||^2 dθ = o_p(1).

(Here || · || is the Euclidean norm.)

Other mild regularity conditions include the following.

Condition 4 (On a nondegenerate identification region).

∫ p(θ) I(λ(θ) ∈ Λ) dθ > 0.
This means that the prior probability of the “identification region” {θ ∈ Θ: λ(θ) ∈ Λ} is positive.

Condition 5 (On regularity of the conditional asymptotic variance).

∫ p(θ) |tr V| dθ < ∞.

Condition 6 (On regularity of the conditional prior).

∫ p(θ) τ(θ)^2 [sup_λ p(λ|θ)^2] dθ + ∫ p(θ) τ(θ)^2 [sup_λ ||∂_λ p(λ|θ)||^2] dθ < ∞.

Condition 7 (On the prior probability of a boundary). Let λ(θ) be the large sample limit in Condition 3, and define a δ-boundary ∂Λ_δ = {λ: max{d(λ, Λ), d(λ, Λ^c)} ≤ δ} for the region Λ (the parameter region of λ), where the distance d between a point and a set is the minimal Euclidean distance. We assume that the prior distribution of λ(θ) is nonsingular on the boundary of Λ, i.e.,

lim_{δ↓0} ∫ p(θ) I[λ(θ) ∈ ∂Λ_δ] dθ = 0.
Remark 1. Due to the Cauchy–Schwarz inequality, Conditions 4 and 6 imply:

Condition 8 (On a positive normalizing constant of the limiting density).

∫ p(λ(θ), θ) I_{Λ×Θ}(λ(θ), θ) τ(θ) dθ > 0.

Therefore, in results (iii) and (iv), the denominator B > 0. (Also, A > 0 with probability tending to 1; see the proof of result (iii). Similarly, in results (i) and (ii), the denominators in the normalizing constants are also positive with probability tending to 1.)

Remark 2. Result (iii) can help explain why the limiting quasi-posterior makes sense relative to the true parameter relation λ(θ). The factor I_{Λ×Θ}(λ(θ), θ) can be recognized as the indicator function of the identification region of θ. For example, in the moment inequality Example 2.4, one can easily verify that λ̃(θ) = m̄(θ), λ(θ) = Em(θ), and Λ = [0, ∞)^{dim(m)}. Therefore I_{Λ×Θ}(λ(θ), θ) = 1 only on the identification region {θ ∈ Θ: Em(θ) ≥ 0 componentwise}. Result (iii) reasonably implies that the limiting posterior gives 0 mass outside of the identification region and is prior dependent inside the identification region (and is proportional to p(θ) for flat p(λ|θ)).

Remark 3. Results (i) and (ii) use the data dependent λ̃(θ), which can be estimated from data (e.g., by m̄(θ) in the moment inequality Example 2.4). This has the advantage of obtaining a data-driven asymptotic distribution of the posterior. E.g., for flat p(λ|θ), the posterior density is asymptotically the same as constant × p(θ) I{θ: m̄(θ) ≥ 0 componentwise}, which is the prior density truncated in an estimated identification region from a frequentist's approach. This can be used to compute the posterior distribution approximately, without resorting to MCMC (Markov chain Monte Carlo, as performed in Chapter 2 of Liao, 2010).

Remark 4. Both τ(θ) and V (which can also depend on θ) may be shown to be related to the second order derivatives of the large sample limit of R_n in more general situations. However, there is no need to study these relations in detail in the current paper, due to the following two reasons: (a).
In all our examples in this paper, one can easily verify that τ (θ ) can be simply taken to be 1. (b). The asymptotic variance V (for λ conditional on θ ) does not affect the limiting posterior distribution marginally for θ , which is often the only parameter of interest. 4. Explanation and application of the main theorem We now provide some explanation of the main Theorem in the context of some examples in Section 2. According to ˜ ( θ ) , θ ) I ( θ ∈ , λ ˜ (θ ) ∈ )τ (θ ), where λ ˜ (θ ) is the the Theorem, the limiting posterior in θ can be approximated by h ∝ p(λ maximizer of the quasi-likelihood over λ when θ is fixed, and τ (θ ) can be derived from the integral of the quasi-likelihood T −1 over λ, when θ is fixed. In all the examples in Section 2, the quasi-likelihood e−0.5n(m¯ (θ )−λ ) v (m¯ (θ )−λ )|2π v/n|−1/2 is max˜ ¯ (θ ), and τ (θ ) = 1 since the quasi-likelihood is defined to integrate to 1 over λ. imized at the sample moment λ(θ ) = m ¯ ( θ ) , θ ) I ( θ ∈ , m ¯ (θ ) ∈ ), where the indicator function is over Then we can approximate the limiting posterior by h ∝ p(m ¯ , = , = [0, 1], the ¯ (θ ) = θ − W the estimated identification region of θ . Therefore, for the example of rounded data, m ¯ ,W ¯ + 1] ), where pλ, θ is the joint prior density of λ, θ . If this prior ¯ , θ )I (θ ∈ [W limiting posterior becomes h ∝ pλ,θ (θ − W ¯ ,W ¯ + 1] ), which is the truncated prior in the estimated identification region. is flat for λ ∈ [0, 1], then h ∝ p(θ )I (θ ∈ [W If the prior is not flat for λ ∈ [0, 1], e.g., pλ,θ ∝ e−0.3λ I (λ ∈ [0, 1] ) p(θ ), then this describes that the rounding-off error has a bias λ that is more likely to be closer to 0 than to 1. This non-flat prior in λ will affect the limiting posterior of θ as ¯ ¯ ,W ¯ + 1] ), so that θ values closer to the lower bound W ¯ than to the upperbound W ¯ + 1 will be ∝ e−0.3(θ −W ) p(θ )I (θ ∈ [W preferred, even when p(θ ) is flat. 
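The rounded-data calculation above can be sketched numerically. The following is a minimal grid approximation (not the paper's code; the value of W̄ and the grid are made-up placeholders), showing that the limiting posterior is the λ-prior-tilted p(θ) truncated to [W̄, W̄ + 1]:

```python
import numpy as np

def limiting_posterior(theta_grid, w_bar, log_prior_theta, log_prior_lam):
    """Grid approximation of h(theta) ∝ p_lam(theta - w_bar) p(theta),
    truncated to the estimated identification region [w_bar, w_bar + 1]."""
    lam = theta_grid - w_bar                     # implied rounding bias lam = theta - W_bar
    inside = (lam >= 0.0) & (lam <= 1.0)         # estimated identification region
    log_h = np.where(inside, log_prior_lam(lam) + log_prior_theta(theta_grid), -np.inf)
    h = np.exp(log_h)
    step = theta_grid[1] - theta_grid[0]
    return h / (h.sum() * step)                  # normalize to a density on the grid

theta = np.linspace(-2.0, 4.0, 6001)
w_bar = 0.7                                      # placeholder value for the sample mean W_bar
flat = lambda x: np.zeros_like(x)                # flat prior p(theta) (log scale)
tilted = lambda lam: -0.3 * lam                  # non-flat prior p_lam ∝ exp(-0.3 lam) on [0, 1]

h = limiting_posterior(theta, w_bar, flat, tilted)
assert np.all(h[theta < w_bar - 1e-6] == 0.0)        # zero mass below W_bar
assert np.all(h[theta > w_bar + 1.0 + 1e-6] == 0.0)  # zero mass above W_bar + 1
# the exp(-0.3 lam) tilt prefers theta near the lower bound W_bar:
assert h[np.searchsorted(theta, w_bar + 0.1)] > h[np.searchsorted(theta, w_bar + 0.9)]
```

The same sketch with `flat` in place of `tilted` reproduces the uniform truncated prior of the flat-prior case.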
For a more sophisticated application to the Bayesian moment example in Section 2, please see the Supplementary materials. As one possible application of the main theorem, when one can easily compute and describe the frequentist estimate of the identification region, it is possible to compute and describe the posterior approximately without resorting to an MCMC sample (Markov chain Monte Carlo, which may need a long burn-in period and does not produce iid draws). We now consider the general problem of how to simulate from the limiting posterior density according to the main theorem. We denote the limiting posterior density as proportional to h(θ) and rewrite it as
h(θ) = p(λ̃(θ), θ)I(λ̃(θ), θ)τ(θ) = p(λ̃(θ)|θ)p(θ)I(θ ∈ Θ̂)τ(θ),

where Θ̂ ≡ {θ ∈ Θ: λ̃(θ) ∈ Λ} is a frequentist estimate of the identification region {θ ∈ Θ: λ(θ) ∈ Λ}. We can use acceptance/rejection sampling to sample from h(θ): consider a proposal density g(θ) such that h(θ)/g(θ) ≤ M for a finite constant M. Then simulate θ* ∼ g(θ) and u ∼ Unif[0, 1] independently. Accept θ = θ* if u ≤ ū(θ*) ≡ h(θ*)/[g(θ*)M]. Repeat this to get a sample of accepted θ's, which will be iid with a density proportional to h(θ). See, e.g., Tanner (1996, Section 3.3.3). The algorithm gives the correct target distribution h whether or not the factor p(λ̃(θ)|θ) is flat in θ, since this factor is accounted for when computing the acceptance probability ū(θ) = h(θ)/[g(θ)M] = p(λ̃(θ)|θ)p(θ)I(θ ∈ Θ̂)τ(θ)/[g(θ)M].

Please cite this article as: W. Jiang, On limiting distribution of quasi-posteriors under partial identification, Econometrics and Statistics (2017), http://dx.doi.org/10.1016/j.ecosta.2017.03.006
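The acceptance/rejection scheme can be sketched as follows, here for the rounded-data example with the tilted prior e^{−0.3λ} and a uniform proposal on Θ̂ (a minimal illustration, not the paper's code; `w_bar` and the sample size are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
w_bar = 0.7                                   # placeholder sample mean for the rounded-data example

def h(theta):
    """Limiting posterior, up to a constant: tilted prior on lam = theta - w_bar,
    flat p(theta), truncated to the estimated region Theta_hat = [w_bar, w_bar + 1]."""
    lam = theta - w_bar
    return np.exp(-0.3 * lam) if 0.0 <= lam <= 1.0 else 0.0

# Proposal g: uniform on Theta_hat (the first natural choice discussed below).
lo, hi = w_bar, w_bar + 1.0
g = 1.0 / (hi - lo)                           # proposal density on [lo, hi]
M = 1.0 / g                                   # sup h = 1 (at lam = 0), so h/g <= M

def accept_reject(n_keep):
    draws = []
    while len(draws) < n_keep:
        theta_star = rng.uniform(lo, hi)      # theta* ~ g
        u = rng.uniform()                     # u ~ Unif[0, 1]
        if u <= h(theta_star) / (g * M):      # accept if u <= u_bar(theta*)
            draws.append(theta_star)
    return np.array(draws)

draws = accept_reject(5000)                   # iid draws with density proportional to h
assert draws.min() >= lo and draws.max() <= hi
assert draws.mean() < (lo + hi) / 2           # the tilt prefers the lower bound w_bar
```

The accepted draws are iid from the truncated, tilted limiting posterior, with no burn-in needed.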
The success of this acceptance/rejection algorithm depends on a good proposal density g. There are two natural choices: one is to choose g to be a uniform distribution on Θ̂ (or on a set that contains Θ̂). A second choice is to choose g to be the prior p(θ). We note that the first choice may have a higher acceptance rate than the second if the prior p(θ) is very diffuse and the identification region Θ̂ is very small.

The above algorithm is applied to an example of Bayesian moment inequalities in the Supplementary materials, where we describe the fast computation speed in obtaining iid samples from the limiting posterior. We also describe there a second application regarding how to apply the theorem to combine information from published analyses involving partial identification. These applications are still quite primitive, and we believe more substantial applications can be interesting future work.

5. Regularity conditions for BGMM (Bayesian Generalized Method of Moments)

We now study the seven regularity conditions from Section 3 for quasi-posteriors obtained from BGMM, which was described for the examples in Section 2:
q(λ, θ|data) ∝ e^{−nRn(λ,θ)}p(λ, θ)I(λ, θ), where

Rn = 0.5(m̄(θ) − λ)ᵀv⁻¹(m̄(θ) − λ) + 0.5n⁻¹ log|2πv/n|.

Condition 1: It is obvious that the marginalized likelihood satisfies

∫ e^{−nRn(λ,θ)}dλ = 1.
So Condition 1 is satisfied with Cn = 1 and τ(θ) = 1.

Condition 2: The extremum estimator of λ given θ is λ̃(θ) = m̄(θ). With the BGMM choice of Rn, the quasi-likelihood e^{−nRn} is already proportional to a normal density φv with variance v. We only need φv to converge to φV in total variation. This is achievable if v is a consistent estimator of V, which happens when, e.g., v is a sample version of V = var_{W|θ} m(W, θ).

Condition 3: We can take λ(θ) = Em(θ). Then the condition is satisfied when p(θ) is supported on a compact set Θ, and when m̄(θ) converges to Em(θ) uniformly on Θ, in probability.

Condition 4: This condition means that the prior probability of the identification region [θ ∈ Θ: λ(θ) ∈ Λ] is positive. This is reasonable in many partial identification problems where the identification region is nondegenerate.

Condition 5: When tr(V) is bounded in θ on the support of p(θ), it is obvious that the integral is finite.

Condition 6: We do not need to worry about the τ(θ) factor, since we have τ(θ) = 1 for BGMM. Suppose there exists an extension of p(λ|θ) from Λ × Θ to ℝ^dim(λ) × Θ such that its function values and ∂λ derivatives are all bounded functions; then the condition is obviously satisfied.

Condition 7: Note that ∂Λδ typically has Lebesgue measure O(δ) in the direction of one λj component. For Example 2.4 (with the Bayesian moment inequalities), Λ = [0, ∞)^dim(m). The event λ(θ) ∈ ∂Λδ implies that |λj(θ)| ≤ δ for some j ∈ {1, ..., dim(m)}. As long as the prior density of λj(θ) is finite at λj(θ) = 0 for all j, its integral on [−δ, +δ] will be O(δ), guaranteeing that the condition holds. By basic calculus, the prior density of λj(θ) can be computed by reparameterization, in the form ∫ dθ(−k) p(θk(λj, θ(−k)), θ(−k))/|∂θk λj(θ)|, for some decomposition of θ into a component θk and all other components θ(−k).
Suppose that the prior density p(θ) is bounded and supported on a bounded set Θ, and that |∂θk λj(θ)| is bounded away from 0 on θ ∈ Θ. Then the prior density of λj(θ) is finite.
This last condition on the derivative can be verified by noting that λ(θ) = Em(θ). For example, suppose Em = [EZ(e^{Xθ} − L), EZ(U − e^{Xθ})] for some positive instrumental variable Z, as is useful for a Bayesian moment inequality approach to the interval regression model E(Y − e^{Xθ}|Z) = 0, where Y is only known to fall in [L, U]. Then the absolute value of the derivative of any component of Em with respect to one chosen θ component is of the form |EZkXk e^{Xθ}|. Suppose ZkXk > 0 (e.g., suppose we can translate Xk to make Xk > 0 and we take Zk = Xk); then |EZkXk e^{Xθ}| ≥ E|ZkXk|e^{−sup|Xθ|} > 0, if we assume bounded Xθ. Then the derivative condition (and therefore Condition 7) is satisfied.

6. Limitations of the current framework

It is noted that although we have relaxed the previous setup of partial identification under the reduced form parameterization, there are still situations of partial identification that are not covered by this paper. For one example, let the empirical risk be the classification error of a step function: Rn(θ) = n⁻¹ Σ_{i=1}^n |Yi − I(Xi > θ)|, where (Yi, Xi) = (0, 0) for about half of the n observations and (Yi, Xi) = (1, 1) for the remaining observations. Then Rn(θ) is minimized to 0 by any θ ∈ (0, 1). This is obviously a case where θ is partially identified, since there is a gap in the possible Xi values and the best location θ can be anywhere in the gap. However, Condition 1 is not satisfied, since there is no λ, and the quasi-likelihood e^{−nRn(θ)} can change by a big factor of e^{0.5n} when θ moves from outside to inside the region (0, 1). For a second example (we thank a referee for reminding us about this example), consider Bayesian analysis using a likelihood function from a mixture model λN(θ, 1) + (1 − λ)N(0, 1), λ ∈ Λ = [0, 1], θ ∈ Θ, when the true model is N(0, 1).
Then at any θ ≠ 0, the best λ (corresponding to the true model) is λ(θ) = 0, which lies at the boundary of the parameter
space [0, 1]. Therefore, the boundary condition (Condition 7) fails (unless the prior on θ focuses on {0}). On the other hand, at θ = 0, λ is unidentified, so we suspect that the asymptotic normality of Condition 2 may fail for λ given θ, because applying a typical asymptotic normality result here would assume identifiability of λ (see, e.g., Chernozhukov et al., 2007, Assumption 3). It will be interesting future work to consider more general situations of partial identification.

7. Discussions

In this paper, we have derived the limiting distribution (in total variation) of the posterior distribution under partial identification. Our framework is general enough to include quasi-Bayes methods based on moment conditions. In addition, we allow more general partial identification, where the model may not be easily reparameterized into an identifiable model with some reduced form parameters. The resulting limit of the posterior distribution combines information from the data and from the prior reasonably: it uses the data information only to locate an identification region, and then leaves the within-region knowledge to be determined by the prior distribution.

As a referee points out, due to the complexity of the model (complexity of the empirical risk function), a researcher may not know which parameter is unidentified among many parameters. (This is an issue in macroeconomic applications such as dynamic stochastic general equilibrium model estimation; e.g., Muller, 2012.) It would be nice to find a simple way to check which parameter is unidentified. Our theorem on the limiting distribution may be informative for this task. For example, our results suggest that for a posterior sample of (λ, θ), the identified component λ concentrates at rate 1/√n, while the unidentified component θ does not concentrate as n increases.
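This rate contrast can be illustrated with a small numerical sketch of the rounded-data quasi-posterior (made-up values of W̄ and v, flat priors; not the paper's code): the conditional spread of λ shrinks roughly like 1/√n, while the marginal spread of θ stays near the width of the identification region.

```python
import numpy as np

def theta_sd(n, w_bar=0.3, v=1.0):
    """Posterior sd of theta under the rounded-data quasi-posterior, flat priors."""
    theta = np.linspace(-5.0, 5.0, 4001)
    lam = np.linspace(0.0, 1.0, 801)[:, None]       # lam ranges over Lambda = [0, 1]
    # q(lam, theta) ∝ exp(-0.5 n (m_bar(theta) - lam)^2 / v), with m_bar(theta) = theta - w_bar
    q = np.exp(-0.5 * n * (theta - w_bar - lam) ** 2 / v)
    marg = q.sum(axis=0)                            # marginal over lam (grid sum)
    marg /= marg.sum()
    mean = (theta * marg).sum()
    return np.sqrt(((theta - mean) ** 2 * marg).sum())

def lam_sd(n, theta=0.9, w_bar=0.3, v=1.0):
    """Posterior sd of lam given a fixed theta inside the identification region."""
    lam = np.linspace(0.0, 1.0, 8001)
    q = np.exp(-0.5 * n * (theta - w_bar - lam) ** 2 / v)
    q /= q.sum()
    mean = (lam * q).sum()
    return np.sqrt(((lam - mean) ** 2 * q).sum())

# Identified lam concentrates (sd roughly halves when n quadruples);
# unidentified theta keeps roughly the spread of Uniform[w_bar, w_bar + 1].
assert lam_sd(400) < 0.6 * lam_sd(100)
assert theta_sd(400) > 0.8 * theta_sd(100)
```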
One could therefore draw posterior samples of (λ, θ) based on two or more subsets of data with different sample sizes, and discern which parameter is unidentified from its non-shrinking spread of distribution.

In the context of the BGMM examples, a referee comments that one way to interpret λ is that it captures the potential deviation from the original moment condition model. The prior on λ then reflects the researcher's belief about the potential misspecification. It may be interesting to assign a prior on λ with a variance describing the prior belief about the size of the deviation of the moment conditions, and obtain inference that is robust against such deviations. We leave this as a potential future topic.

In the Bayesian literature on partial identification, there is a new direction of work where the Bayesian inference is targeted at the identification set, rather than at a point parameter. See, for example, Kline and Tamer (2016) and Chen et al. (2016). This direction is different and interesting, and has the advantage of stating conclusions robustly without being influenced by additional assumptions on the prior distributions. Another earlier work, Kitagawa (2012), explicitly addresses this robustness aspect associated with targeting the identification set, using bounds on the posterior probabilities due to a class of priors. Our current paper, on the other hand, uses the traditional framework of Bayesian inference, in the sense that the unknown true parameter is regarded as a random point in a parameter space. This follows the line of work by Poirier (1998), Gustafson (2005, 2007, 2015) and Moon and Schorfheide (2012), and has the advantage of being able to improve the parametric inference by incorporating useful prior information.
Both approaches are content with accepting partial identification, and are robust regarding the mechanism of missing data, as compared to other approaches that strive for point identification by introducing additional assumptions on the missing data mechanism.

Acknowledgments

I thank Professor Hyungsik Roger Moon for kindly reading a draft of this paper and providing useful references. I am very grateful to the referees and the Associate Editor for many valuable comments that have improved this paper.

Appendix A. Proofs

Proof of result (iii). We first prove that for two nonnegative functions a, b such that A = ∫a > 0 and B = ∫b > 0 (with any common dominating measure suppressed in notation), we have

∫|a/A − b/B| ≤ 2∫|a − b|/B.  (1)

Proof of (1):

∫|a/A − b/B| ≤ ∫(|b − a|/B + |a/B − a/A|)
= ∫|a − b|/B + ∫|a||A − B|/(AB)
= ∫|a − b|/B + |B − A|/B
≤ 2∫|a − b|/B.
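Inequality (1) can be sanity-checked numerically on a discrete dominating measure (a quick illustration with random nonnegative functions; not part of the proof):

```python
import numpy as np

# Check inequality (1):  sum |a/A - b/B|  <=  2 sum |a - b| / B,
# with A = sum(a) > 0, B = sum(b) > 0, under counting measure on a grid.
rng = np.random.default_rng(1)
for _ in range(1000):
    a = rng.exponential(size=50)          # arbitrary nonnegative "functions"
    b = rng.exponential(size=50)
    A, B = a.sum(), b.sum()
    lhs = np.abs(a / A - b / B).sum()
    rhs = 2.0 * (np.abs(a - b) / B).sum()
    assert lhs <= rhs + 1e-12
```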
Then applying inequality (1) above, we can ignore the normalizing factor of p(λ, θ|data) and only need to prove that for some constant Cn > 0 not depending on λ and θ,

T ≡ ∫dtdθ |Cn e^{−nRn(λ,θ)}p(λ, θ)I(λ, θ) − φV(t)τ(θ)p(λ(θ), θ)I(λ(θ), θ)| = op(1).

Here φV(t) = |2πV|^{−1/2}e^{−0.5tᵀV⁻¹t} is the density of N(0, V), and λ = λ̃(θ) + t/√n. Note that Condition 8 and T = op(1) together imply that the two terms of the difference in T both have positive integrals with probability tending to 1, and therefore we can apply (1) to show that the normalized versions have difference op(1). We need to pay attention to the two indicator functions in this task, since the indicator function is not continuous in the usual sense. For this purpose, we now introduce another inequality: if I1, I2 ∈ {0, 1}, then

∫|aI1 − bI2| ≤ ∫|(a − b)I1 + b(I1 − I2)| ≤ ∫|a − b| + ∫|b||I1 − I2|.  (2)

Setting I1, I2 to be the two indicator functions in the result we want to prove above, we find that T ≤ T1 + T2, and we only need to prove

T1 ≡ ∫dtdθ |Cn e^{−nRn(λ̃(θ)+t/√n, θ)}p(λ, θ) − τ(θ)φV(t)p(λ(θ), θ)| = op(1)

and

T2 ≡ ∫dtdθ φV(t)τ(θ)p(λ(θ), θ)|I(λ, θ) − I(λ(θ), θ)| = op(1).
Proof of T2 = op(1). For T2, with two indicator functions, we can rewrite the claim as ∫dξ f(ξ)|IΛ(λ) − IΛ(λ(θ))| = op(1), where ξ = (t, θ), λ = λ̃(θ) + t/√n, and f(ξ) = φV(t)τ(θ)p(λ(θ), θ)I(θ ∈ Θ). We split the integration domain into three parts A ∪ B ∪ C, where

A = [ξ: max{d(λ(θ), Λ), d(λ(θ), Λ^c)} ≤ δ], B = [ξ: d(λ(θ), Λ) > δ], C = [ξ: d(λ(θ), Λ^c) > δ],

for a minimal set distance under any metric d, and any δ > 0. Notice that IΛ(λ(θ)) = 0 for ξ ∈ B and 1 for ξ ∈ C. Then we bound the left hand side as follows:

∫dξ f(ξ)|IΛ(λ) − IΛ(λ(θ))|(IA + IB + IC) ≤ ∫dξ f(ξ)IA + ∫dξ f(ξ)IΛ(λ)IB + ∫dξ f(ξ)(1 − IΛ(λ))IC.

Now note that IΛ(λ)IB and (1 − IΛ(λ))IC are indicator functions which can be one only when d(λ, λ(θ)) > δ. Then the integral is bounded by

∫dξ f(ξ)|IΛ(λ) − IΛ(λ(θ))| ≤ ∫dξ f(ξ)IA + 2∫dξ f(ξ)I[d(λ, λ(θ)) > δ] ≤ ∫dξ f(ξ)IA + 2∫dξ f(ξ)d(λ, λ(θ))/δ.

Therefore we have another inequality:

∫dξ f(ξ)|IΛ(λ) − IΛ(λ(θ))| ≤ ∫dξ f(ξ)I[max{d(λ(θ), Λ), d(λ(θ), Λ^c)} ≤ δ] + 2∫dξ f(ξ)d(λ, λ(θ))/δ.  (3)

The first term in the upper bound is related to the prior chance of λ(θ) falling within distance δ of the boundary of Λ, which does not depend on n and typically converges to 0 as δ goes to 0. The second term is typically Op(1/√n)/δ (if λ̃(θ) is √n-consistent for λ(θ)).

The first term in (3), after integrating away t, equals ∫dθ τ(θ)p(λ(θ), θ)I[λ(θ) ∈ ∂Λδ], where ∂Λδ corresponds to the set A. This integral satisfies lim_{δ↓0} ∫dθ τ(θ)p(λ(θ), θ)I[λ(θ) ∈ ∂Λδ] = 0. This is implied (after applying the Cauchy–Schwarz inequality) by Condition 6 on boundedness of the conditional prior, together with Condition 7, which states that

lim_{δ↓0} ∫dθ p(θ)I[λ(θ) ∈ ∂Λδ] = 0.  (4)

For the second term in (3), choose d to be the Euclidean metric ||·||. Then d(λ, λ(θ)) = ||λ̃(θ) + t/√n − λ(θ)|| ≤ ||λ̃(θ) − λ(θ)|| + ||t/√n||, and the second term in (3) is bounded by

2∫dθ τ(θ)p(λ(θ), θ)[||λ̃(θ) − λ(θ)|| + ∫||t||φV(t)dt/√n]/δ
≤ 2{∫dθ p(θ)τ(θ)²p(λ(θ)|θ)²}^{1/2}{2∫dθ p(θ)||λ̃(θ) − λ(θ)||² + 2∫dθ p(θ)|trV|/n}^{1/2}/δ = op(1)/δ.

(This is implied by Conditions 6 and 3; we also apply Condition 5 regarding the V matrix.) Then we have proved that ∫dtdθ φV(t)τ(θ)p(λ(θ), θ)|I(λ, θ) − I(λ(θ), θ)| = oδ(1) + op(1)/δ for any small positive δ, where lim_{δ↓0} oδ(1) = 0. Therefore

T2 ≡ ∫dtdθ φV(t)τ(θ)p(λ(θ), θ)|I(λ, θ) − I(λ(θ), θ)| = op(1).
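Inequality (3) holds pointwise in the integrand (since f ≥ 0, integrating the pointwise bound gives (3)); a quick numerical illustration for the scalar case Λ = [0, ∞), where max{d(λ(θ), Λ), d(λ(θ), Λ^c)} = |λ(θ)| (not part of the proof):

```python
import numpy as np

# Pointwise check of the integrand bound in (3) with Lambda = [0, inf), scalar lam:
#   |I(lam >= 0) - I(lam0 >= 0)|  <=  I[|lam0| <= delta] + 2 |lam - lam0| / delta.
rng = np.random.default_rng(2)
for _ in range(10000):
    lam, lam0 = rng.normal(size=2)            # lam plays lambda, lam0 plays lambda(theta)
    delta = rng.uniform(0.01, 1.0)
    lhs = abs(float(lam >= 0.0) - float(lam0 >= 0.0))
    rhs = float(abs(lam0) <= delta) + 2.0 * abs(lam - lam0) / delta
    assert lhs <= rhs + 1e-12
```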
Proof of T1 = op(1). Now we return to the other statement, involving T1 without the indicator functions. We want to prove that there exists Cn > 0, independent of λ and θ, such that

T1 ≡ ∫dtdθ |Cn e^{−nRn(λ̃(θ)+t/√n, θ)}p(λ, θ) − φV(t)τ(θ)p(λ(θ), θ)| = op(1).

(Unless otherwise noted, λ is related to t by the reparameterization λ = λ̃(θ) + t/√n.)

To rewrite the left hand side, we can let Cn = C′n/n^{dim(λ)/2} for the constant C′n in Condition 1; then ∫ds Cn e^{−nRn(λ̃(θ)+s/√n, θ)} = τn(θ). Let f(t|θ) = e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)}. Then the left hand side can be rewritten as

T1 = ∫dtdθ |[{τn(θ) − τ(θ)} + τ(θ)]f(t|θ)p(λ, θ) − τ(θ)φV(t)p(λ(θ), θ)|
≤ ∫|τn(θ) − τ(θ)|p(θ) sup_λ p(λ|θ)dθ + ∫dtdθ τ(θ)|f(t|θ)p(λ, θ) − φV(t)p(λ(θ), θ)|
≡ T11 + T12.

The first term T11 = op(1), due to boundedness of the conditional prior density from Condition 6 and the relation in Condition 1 that

∫|τn(θ)/τ(θ) − 1|²p(θ)dθ = op(1).

Now we rewrite

T12 ≡ ∫dtdθ τ(θ)|{e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)}}p(λ, θ) − φV(t)p(λ(θ), θ)|
≤ ∫dtdθ |τ(θ){e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)}p(λ, θ)|
+ ∫dtdθ τ(θ)φV(t)|p(λ, θ) − p(λ(θ), θ)|
≡ T121 + T122,

where λ = λ̃(θ) + t/√n.

Now assume the following (from Condition 6):

∫dθ p(θ)τ(θ)²[sup_λ p(λ|θ)²] + ∫dθ p(θ)τ(θ)²[sup_λ ||∂λ p(λ|θ)||²] < ∞.

Then the second term satisfies

T122 = ∫dtdθ τ(θ)φV(t)|p(λ, θ) − p(λ(θ), θ)|
≤ ∫dtdθ φV(t)τ(θ)p(θ)[sup_λ ||∂λ p(λ|θ)||]||λ − λ(θ)||
≤ ∫dtdθ φV(t)τ(θ)[sup_λ ||∂λ p(λ|θ)||]p(θ)(||λ̃(θ) − λ(θ)|| + ||t||/√n)
≤ {2∫dθ p(θ)(||λ̃(θ) − λ(θ)||² + |trV|/n)}^{1/2}{∫dθ p(θ)τ(θ)²[sup_λ ||∂λ p(λ|θ)||]²}^{1/2}
= op(1),

assuming (from Condition 3)

∫dθ p(θ)||λ̃(θ) − λ(θ)||² = op(1),

and noting (from Condition 5) that

∫dθ p(θ)|trV|/n = op(1).

For the first term,

T121 = ∫dtdθ |{e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)}τ(θ)p(λ, θ)|
≤ ∫dtdθ |e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)|p(θ)τ(θ) sup_λ p(λ|θ) ≡ B1.
Now we assume Condition 2, which states that

∫dt |e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)| = op(1)

for almost all θ according to the prior p(θ). Then for all these θ,

E∫dt |e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)| = o(1),

since the L1 distance of two densities is bounded by 2, and convergence in probability of a bounded sequence implies convergence in mean. Then, by the dominated convergence theorem, and noting that E∫dt |e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)| ≤ 2, which is integrable under ∫(···)p(θ)τ(θ) sup_λ p(λ|θ)dθ due to Condition 6, we arrive at

∫E(∫dt |e^{−nRn(λ̃(θ)+t/√n, θ)}/∫ds e^{−nRn(λ̃(θ)+s/√n, θ)} − φV(t)|)p(θ)τ(θ) sup_λ p(λ|θ)dθ = o(1).

Exchanging ∫dθ and E by Fubini's theorem, we obtain EB1 = o(1). Then 0 ≤ T121 ≤ B1 = op(1).

The arguments above, collected together, prove that T1 = op(1). This, together with the results from the earlier sections, proves the theorem in the formulation of result (iii), using the deterministic relation λ(θ) in the result.

Proof of result (i). The original result (i) is formulated with the data-dependent λ̃(θ) in p(λ̃(θ), θ)I(λ̃(θ), θ). We now prove that using λ̃(θ) instead of λ(θ) is also valid, in the sense that the limiting densities differ only by op(1) in the L1 distance. Due to (1), it suffices to prove that
S ≡ ∫dtdθ |φV(t)p(λ̃(θ), θ)I(λ̃(θ), θ)τ(θ) − φV(t)p(λ(θ), θ)I(λ(θ), θ)τ(θ)| = op(1).

The variable t can be integrated away. By Condition 8, ∫p(λ(θ), θ)I(λ(θ), θ)τ(θ)dθ > 0, and together with S = op(1), both terms in the difference of S have positive integrals with probability tending to 1, which enables us to use (1) and show that the normalized versions also have difference op(1). Now we use again the aforementioned inequality (2): if I1, I2 ∈ {0, 1}, then ∫|aI1 − bI2| ≤ ∫|a − b| + ∫|b||I1 − I2|. Setting I1, I2 to be the two indicator functions in the result we want to prove above, we find that S ≤ S1 + S2, and we only need to prove

S1 ≡ ∫dθ τ(θ)|p(λ̃(θ), θ) − p(λ(θ), θ)| = op(1)

and

S2 ≡ ∫dθ τ(θ)p(λ(θ), θ)|I(λ̃(θ), θ) − I(λ(θ), θ)| = op(1).

For the first term,

S1 = ∫dθ τ(θ)|p(λ̃(θ), θ) − p(λ(θ), θ)| ≤ ∫dθ p(θ)τ(θ)[sup_λ ||∂λ p(λ|θ)||]||λ̃(θ) − λ(θ)||,

which is op(1) due to the assumptions made before from Conditions 6 and 3. For the second term S2, we apply again the aforementioned inequality (3):

∫dξ f(ξ)|IΛ(λ) − IΛ(λ(θ))| ≤ ∫dξ f(ξ)I[max{d(λ(θ), Λ), d(λ(θ), Λ^c)} ≤ δ] + 2∫dξ f(ξ)d(λ, λ(θ))/δ.

Now we take ξ = (θ, t), λ = λ̃(θ), and f(ξ) = p(λ(θ), θ)τ(θ)I(θ ∈ Θ)φV(t). Then the left hand side can be recognized as S2, and its upper bound is oδ(1) + op(1)/δ due to Conditions 3, 7 and 6, where oδ(1) converges to 0 as δ↓0 and is independent of the data. This shows that the second term S2 is also op(1).

Collecting the arguments above, we have shown that result (i) also holds with the data-dependent relation λ̃(θ) (instead of the deterministic λ(θ)) in the result.
Proof of results (ii) and (iv). The total variation (or L1) distance of the joint densities is stronger than that of the corresponding marginal densities, i.e.,

∫|q(λ, θ) − q′(λ, θ)|dλdθ ≥ ∫|∫q(λ, θ)dλ − ∫q′(λ, θ)dλ|dθ.  (5)
Therefore, the convergence of the joint distributions (results (i) and (iii)) implies the convergence of the marginal distributions (results (ii) and (iv), respectively).

Supplementary material

Supplementary material associated with this article can be found in the online version, at 10.1016/j.ecosta.2017.03.006.

References

Bajari, P., Benkard, L., Levin, J., 2007. Estimating dynamic models of imperfect competition. Econometrica 75, 1331–1370.
Belloni, A., Chernozhukov, V., 2009. On the computational complexity of MCMC-based estimators in large samples. Ann. Stat. 37, 2011–2055.
Chen, X., Christensen, T., Tamer, E., 2016. MCMC confidence sets for identified sets. Cowles Foundation Discussion Paper No. 2037.
Chernozhukov, V., Hong, H., 2003. An MCMC approach to classical estimation. J. Econom. 115, 293–346.
Chernozhukov, V., Hong, H., Tamer, E., 2007. Estimation and confidence regions for parameter sets in econometric models. Econometrica 75, 1243–1284.
Ciliberto, F., Tamer, E., 2009. Market structure and multiple equilibria in airline markets. Econometrica 77, 1791–1828.
Gustafson, P., 2005. On model expansion, model contraction, identifiability, and prior information: two illustrative scenarios involving mismeasured variables (with discussion). Stat. Sci. 20, 111–140.
Gustafson, P., 2007. Measurement error modelling with an approximate instrumental variable. J. R. Stat. Soc. B 69, 797–815.
Gustafson, P., 2015. Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. CRC Press, New York.
Haile, P., Tamer, E., 2003. Inference with an incomplete model of English auctions. J. Polit. Econ. 111, 1–51.
Kim, J.-Y., 2002. Limited information likelihood and Bayesian analysis. J. Econom. 107, 175–193.
Kitagawa, T., 2012. Estimation and Inference for Set-Identified Parameters Using Posterior Lower Probability. Working paper, University College London.
Kline, B., Tamer, E., 2016. Bayesian inference in a class of partially identified models. Quant. Econ. 7, 329–366.
Liao, Y., 2010. Bayesian Analysis in Partially Identified Parametric and Nonparametric Models. Ph.D. thesis, Northwestern University.
Liao, Y., Jiang, W., 2010. Bayesian analysis in moment inequality models. Ann. Stat. 38, 275–316.
Manski, C., 2003. Partial Identification of Probability Distributions. Springer-Verlag, New York.
Manski, C., Tamer, E., 2002. Inference on regressions with interval data on a regressor or outcome. Econometrica 70, 519–547.
Moon, H.R., Schorfheide, F., 2012. Bayesian and frequentist inference in partially identified models. Econometrica 80, 755–782.
Muller, U.K., 2012. Measuring prior sensitivity and prior informativeness in large Bayesian models. J. Monet. Econ. 59, 581–597.
Poirier, D.J., 1998. Revising beliefs in nonidentified models. Econom. Theory 14, 483–509.
Tanner, M.A., 1996. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, third ed. Springer-Verlag, New York.
van der Vaart, A.W., 2000. Asymptotic Statistics. Cambridge University Press.