Journal of Econometrics 177 (2013) 116–129
Efficient semiparametric estimation for endogenously stratified regression via smoothed likelihood

Stephen R. Cosslett∗
Department of Economics, Ohio State University, Columbus, OH 43210-1172, USA
Article history: Received 27 July 2011; received in revised form 9 July 2013; accepted 16 July 2013; available online 24 July 2013.
JEL classification: C14; C24; C25.

Abstract: This paper presents efficient semiparametric estimators for endogenously stratified regression with two strata, in the case where the error distribution is unknown and the regressors are independent of the error term. The method is based on the use of a kernel-smoothed likelihood function, which provides an explicit solution to the maximization problem for the unknown density function without losing information in the asymptotic limit. We consider both standard stratified sampling and variable probability sampling, and allow for the population shares of the strata to be either unknown or known a priori. © 2013 Elsevier B.V. All rights reserved.

Keywords: Semiparametric estimation; Asymptotic efficiency; Endogenously stratified regression
1. Introduction

This paper presents asymptotically efficient semiparametric estimators of a regression function when the sample is endogenously stratified, comprising two strata defined by whether the dependent variable y is above or below some threshold value c. Typically, one of these strata will be oversampled in order to provide more information about a subpopulation of particular interest.1 This paper deals with the case where the functional form of the error distribution is unknown and the error term is independent of the regressors. Several methods are available for consistent estimation from endogenously stratified samples (see, for example, the review given by Cosslett (1993)). A standard method is weighted least squares, with weights inversely proportional to the stratum sampling probabilities.2 However, the resulting parameter estimates are not asymptotically efficient, either in the present case where x and ε are independent or in the less restrictive case where
∗ Tel.: +1 614 292 4106; fax: +1 614 292 3906. E-mail address: [email protected].
1 A leading early example in the econometrics literature is Hausman and Wise (1981). Those authors used a maximum likelihood estimator, valid when the errors are normally distributed, that is more efficient than weighted least squares.
2 Weighted least squares estimators were investigated by Holt et al. (1980) and Nathan and Holt (1980); improved weighting schemes were presented by Jewell (1985).
http://dx.doi.org/10.1016/j.jeconom.2013.07.003
E[ε|x] = 0 (where x denotes the vector of regressors, ε is the regression error term, and the expectation refers to the underlying population). The estimation method proposed here is smoothed maximum likelihood.3 Let f be the density function of the errors, and let β be the vector of parameters of the regression function. The first step is to apply kernel smoothing to the log likelihood function. Next, maximize the smoothed log likelihood with respect to the error density function f at fixed β, and finally maximize the resulting concentrated (profile) objective function with respect to β. The basic reason for the initial smoothing operation is to make functional maximization of the log likelihood more tractable. For several semiparametric estimation problems, including the present one, maximization of the smoothed log likelihood with respect to f leads to an integral equation that can be solved explicitly.4 The estimator is then implemented by solving a set of trimmed score equations for β, similar to the semiparametric maximum likelihood estimator of Ai (1997).5 A key step in making
3 See Cosslett (2007); a similar approach, but using smoothed self-consistency equations rather than smoothed likelihood, was used in Cosslett (2004).
4 The reason for considering the case of two strata is the existence of an explicit solution for f. The specific data-dependent trimming scheme used here relies on having an explicit expression for the score function, and is not applicable in more general cases where f is defined implicitly as the solution of an integral equation.
5 This is because, except in a few special cases, currently available results on asymptotic properties rely on convergence of a trimmed score function rather than on convergence of the log likelihood itself.
this all work is to trim the kernel density estimates in the score function in a way that preserves the orthogonality conditions needed for asymptotic efficiency. A more direct approach would be to omit the initial smoothing and start with the nonparametric maximum likelihood estimator (NPMLE) of the distribution function F, followed by some suitable smoothing of the concentrated likelihood before maximizing over β.6 It is not known if that can be done in a way that achieves asymptotic efficiency. In the related case of censored regression, existing implementations that use the NPMLE (which in that case is the Kaplan–Meier estimator) do not achieve the semiparametric efficiency bound. An alternative approach that would work is semi-nonparametric (or sieve) estimation, for example via spline estimation of the function f′/f, as was done for censored regression by Kim and Lai (2000). One potential advantage of the smoothed maximum likelihood approach is that, given only the form of the kernel function and a value for the smoothing parameter, it leads to an explicit functional form for the estimated density function.

Returning to the present estimation method, there are two sample designs to consider. In standard stratified sampling, two independent random samples of given sizes n1 and n2 are drawn from the strata with y ≥ c and with y < c respectively. In variable probability sampling, a sample of given size n is drawn with sampling rates π1 and π2 from each of the two strata (and so the number of observations in each stratum is random). In addition, information may or may not be available on the population shares of the two strata, Q1 = Pr{y ≥ c} and Q2 = Pr{y < c}. When the population shares are known, the likelihoods for standard stratified sampling and variable probability sampling are equivalent, so altogether there are three cases to be dealt with.

The next section sets up the notation and the specification of the stratified regression model.
Subsequent sections derive the estimators in the three cases of known population shares, standard stratified sampling with unknown population shares, and variable probability sampling with unknown population shares. Asymptotic properties of the estimators and other technical results are derived in Appendix A, while Appendix B presents the results of a small simulation study.

2. Endogenously stratified regression

Suppose the population density of (x, y) is f(y − xβ) h(x). The two strata are defined by a threshold value of y, which can be taken as zero without loss, i.e.,

S1 = {yi ≥ 0},  S2 = {yi < 0}.

(If the threshold is an observed exogenous variable ci, redefine the dependent variable as y∗i = yi − ci, and include ci as a regressor with fixed coefficient −1.) The population proportions of these strata are

Q1 = Q1(β, f, h) = ∫ dx h(x) [1 − F(−xβ)],
Q2 = Q2(β, f, h) = ∫ dx h(x) F(−xβ)    (2.1)
where F is the cumulative distribution function corresponding to the density function f(u) of the error terms.

Standard stratified sampling. In this design, a specified number of observations nj is drawn randomly from each stratum Sj. The sample proportions are denoted pj = nj/n, where n is the total number of observations. Let E denote expectation with respect to the underlying population, while ES denotes the weighted expectation corresponding to the densities of the stratified sample, i.e.,

ES[·] = p1 E[·|y ≥ 0] + p2 E[·|y < 0].    (2.2)

6 The NPMLE for endogenously stratified regression is given by Vardi (1985).
The log likelihood is

log L(β, f, h) = Σ_{i=1}^n {log f(yi − xiβ) + log h(xi)} − n1 log Q1 − n2 log Q2.    (2.3)
Variable probability sampling. In this design, individuals are initially sampled randomly from the entire population, but depending on the stratum j they are then randomly selected for inclusion in the final sample with known probability πj; this continues until the sample reaches a specified size n. (This would be suitable, for example, in a survey where it is relatively easy to determine an individual's stratum but costly to collect all the data.) Thus the main difference is that the sample proportions nj/n are random. In this case, let p denote the expected stratum proportions,

pj = πj Qj / π̄,  π̄ = π1 Q1 + π2 Q2    (2.4)
(where π̄ is the overall sampling rate). The expectation ES over the density of the stratified sample is again given by (2.2), while the log likelihood is now

log L(β, f, h) = Σ_{i=1}^n {log f(yi − xiβ) + log h(xi)} + n1 log(p1/Q1) + n2 log(p2/Q2).    (2.5)
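As a concrete illustration of the variable probability design, the following sketch (not from the paper; all parameter values and the Gaussian error distribution are hypothetical) simulates retention with stratum-specific rates π1, π2 and checks that the realized sample share of stratum S1 matches p1 = π1 Q1 / π̄ from Eq. (2.4).

```python
import numpy as np

# Hypothetical setup: y = x*beta + eps, threshold c = 0, oversample S1 = {y >= 0}
rng = np.random.default_rng(0)
beta, pi1, pi2 = 1.0, 0.8, 0.2
n_pop = 400_000

x = rng.normal(size=n_pop)
y = x * beta + rng.normal(size=n_pop)      # eps ~ N(0,1), independent of x

# Retain each population draw with probability pi1 if y >= 0, else pi2
keep = rng.uniform(size=n_pop) < np.where(y >= 0, pi1, pi2)
ys = y[keep]

Q1 = (y >= 0).mean()                       # population share of S1
Q2 = 1.0 - Q1
pibar = pi1 * Q1 + pi2 * Q2                # overall sampling rate, Eq. (2.4)
p1 = pi1 * Q1 / pibar                      # expected sample share of S1

assert abs((ys >= 0).mean() - p1) < 0.01   # empirical share matches (2.4)
```

With π1 = 0.8 and Q1 ≈ 1/2, the oversampled stratum makes up roughly 80% of the retained sample even though it is half of the population.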
The problem is to construct an asymptotically efficient semiparametric estimator of β , under suitable regularity conditions on the unknown densities f and h. A key piece of information is the fact that each stratum in the sample was drawn from the same underlying population, or equivalently that the density h appearing in (2.1) is the same as in the log likelihood (2.3) or (2.5). The way in which this enters depends on whether the population shares Q are known or unknown a priori. (There are also intermediate cases – not considered here – where the shares Q are unknown but can be consistently estimated from some external data.) Known population shares. If the population shares are known, then the log likelihood functions (2.3) and (2.5) are the same to within a constant, and therefore the estimators for standard stratified sampling and variable probability sampling are the same. Eq. (2.1) now represents a constraint to be taken into account when maximizing the likelihood. For variable probability sampling with known shares, the subsample sizes n1 and n2 are ancillary statistics: they have a known probability distribution and therefore convey no information about the unknown parameters of the model. The usual approach in such cases is to carry out ex-post inference conditional on the ancillary statistics, and therefore inference as well as estimation is the same for both types of sampling when Q is known. On the other hand, for a priori analysis of a variable probability sample (i.e., for sample design) one would use the conditional variances of the parameter estimators. However, the score function for asymptotically efficient estimation conditional on n1 and n2 has the special property that its conditional expected value is zero in each stratum separately (see Eq. (A.3.15)). It follows that the conditional and unconditional asymptotic variances are equal, and so for a priori inference there is no loss in treating n1 and n2 as fixed at their expected values. 
Unknown population shares. If the population shares are unknown, then Q1 and Q2 in the log likelihood functions (2.3) and (2.5) denote functions of β , f and h, as given by Eq. (2.1). In this case, the two sampling schemes have to be considered separately.
3. Estimation with known population shares

When the population shares Q are known a priori, the log likelihoods (2.3) for standard stratified sampling and (2.5) for variable probability sampling both simplify to

log L(β, f, h) = Σ_{i=1}^n {log f(yi − xiβ) + log h(xi)}    (3.1)

(after dropping constant terms), subject to the restriction (2.1).

Concentrating the likelihood with respect to h

Since x may be a high-dimensional vector, kernel smoothing would not be an appropriate method for concentrating the likelihood with respect to h(x). Instead, we use a method based on empirical likelihood, as was used in the case where f is a known parametric function (Cosslett, 1981, 1993). Fix (β, f) in a feasible set such that F(−xiβ) < Q2 < F(−xjβ) for some i and j. Assign a point mass wi to each data point. Then maximizing (3.1) with respect to h subject to (2.1) reduces to the problem of maximizing

log Lh(w) = Σ_{i=1}^n log wi

subject to

Σ_{i=1}^n wi [1 − F(−xiβ)] = Q1,  Σ_{i=1}^n wi F(−xiβ) = Q2.

(These constraints include the normalization Σi wi = 1, but are written in a way that allows the Lagrange multipliers to appear symmetrically.) The solution is

wi = 1 / (n F̄(−xiβ|λ))

where

F̄(u|λ) = λ1 [1 − F(u)] + λ2 F(u)    (3.2)

with Lagrange multipliers λ1 = λ1(β, f) and λ2 = λ2(β, f) defined implicitly by

(1/n) Σ_{i=1}^n [1 − F(−xiβ)] / F̄(−xiβ|λ) = Q1,  (1/n) Σ_{i=1}^n F(−xiβ) / F̄(−xiβ|λ) = Q2.    (3.3)

It follows from (3.2) and (3.3) that λ1 and λ2 are related by

λ1 Q1 + λ2 Q2 = 1.    (3.4)

(The explicit dependence of λ1 and λ2 on β and f will be dropped to simplify the notation.) The resulting concentrated likelihood is, apart from constant terms,

log L(β, f) = Σ_{i=1}^n {log f(yi − xiβ) − log F̄(−xiβ|λ)}.    (3.5)

Smoothed likelihood

Suppose that an individual term in the log likelihood has the form ℓi = ℓ(ei(β), f), where ei(β) is an m-dimensional vector of index functions. The smoothing operation replaces ℓi by

ℓ̃i = ℓ̃(ei(β), f) = ∫ du (1/hn^m) K((ei(β) − u)/hn) ℓ(u, f)

where K denotes a symmetric m-dimensional kernel function with marginals K, and hn is a bandwidth that shrinks at a suitable rate as n → ∞. (In general, integration is over the real line, but if one uses a kernel with bounded support, then the range of integration is determined by the largest and smallest values of ei(β).) In the present case, m = 2 and

ei(β) = (yi − xiβ, xiβ).

The smoothed log likelihood corresponding to (3.5) can then be written as

(1/n) log L̃(β, f) = ∫ du g̃1(u|β) log f(u) − ∫ du g̃2(u|β) log F̄(−u|λ)    (3.6)

where

g̃1(u|β) = (1/(n hn)) Σ_{i=1}^n K((yi − xiβ − u)/hn),
g̃2(u|β) = (1/(n hn)) Σ_{i=1}^n K((xiβ − u)/hn).    (3.7)

(These terms are not stratum dependent.) Thus each integrand in (3.6) factorizes into two terms, one of which contains the data dependence and the other contains the unknown function f. The subsidiary conditions (3.3), which implicitly determine λ1 and λ2, are likewise replaced by their smoothed versions

∫ dv g̃2(v|β) [1 − F(−v)] / F̄(−v|λ) = Q1,  ∫ dv g̃2(v|β) F(−v) / F̄(−v|λ) = Q2    (3.8)

with F̄ given by (3.2).

Concentrating the likelihood with respect to f

The next step is to maximize (3.6) with respect to f, subject to ∫ du f(u) = 1. Note that log L̃(β, f) depends on f not only through the terms f(u) and F̄(u|λ), but also indirectly through λ1 = λ1(β, f) and λ2 = λ2(β, f), which are defined implicitly by (3.8) and are therefore functionals of f. Let δλ1 and δλ2 denote the variations in λ induced by a variation δf. The variation of (3.6) with respect to λ is

(∂ log L̃/∂λ1) δλ1 + (∂ log L̃/∂λ2) δλ2 = −∫ du g̃2(u|β) {[1 − F(−u)] δλ1 + F(−u) δλ2} / F̄(−u|λ) = −Q1 δλ1 − Q2 δλ2

where the second equality comes from (3.8). However, (3.8) still implies λ1 Q1 + λ2 Q2 = 1 for any f, from which it follows that Q1 δλ1 + Q2 δλ2 = 0. Therefore the variation of (3.6) with respect to λ is zero, and we do not need to take into account the variation of λ with respect to f. The resulting first-order condition for maximization of (3.6) is

g̃1(u|β)/f(u) − λ1 ∫_{−∞}^u dv g̃2(−v|β)/F̄(v|λ) − λ2 ∫_u^∞ dv g̃2(−v|β)/F̄(v|λ) = 0.    (3.9)

Although the integral equation (3.9) is nonlinear in f, it can be transformed into a linear differential equation with an explicit solution (see Appendix A.1.2). Introducing some additional notation, let

G̃(u|β) = ∫_{−∞}^u dv {g̃1(v|β) − g̃2(−v|β)} = (1/n) Σ_{i=1}^n {K̄((u − (yi − xiβ))/hn) − K̄((u + xiβ)/hn)}    (3.10)
where K̄(u) is the integral of the kernel function K(u), and let

D̃(v|β, λ) = λ1 λ2 + (λ2 − λ1) G̃(v|β),
M̃(v|β, λ) = g̃1(v|β) / D̃(v|β, λ).    (3.11)
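The building blocks in Eqs. (3.7) and (3.10)–(3.11), and the explicit density solution f̃ of (3.12), can be evaluated on a grid. The sketch below uses a Gaussian kernel and hypothetical data; as a sanity check, at λ1 = λ2 = 1 we have D̃ = 1, so f̃ collapses to g̃1, the ordinary (random-sampling) kernel density estimate of the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n, hn = 500, 0.3
xb = rng.normal(size=n)                  # x_i * beta (hypothetical)
resid = rng.normal(size=n)               # residuals y_i - x_i * beta

u = np.linspace(-8.0, 8.0, 3201)
du = u[1] - u[0]
K = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

# Eq. (3.7): kernel estimates g~1 (residual density) and g~2 (density of x*beta)
g1 = K((resid[None, :] - u[:, None]) / hn).sum(axis=1) / (n * hn)
g2 = K((xb[None, :] - u[:, None]) / hn).sum(axis=1) / (n * hn)

# Eq. (3.10): G~(u) = int_{-inf}^{u} {g~1(v) - g~2(-v)} dv, via grid cumsum
# (the symmetric grid makes g2 reversed equal to g~2(-v))
G = np.cumsum(g1 - g2[::-1]) * du

def ftilde(lam1, lam2):
    D = lam1 * lam2 + (lam2 - lam1) * G          # Eq. (3.11)
    M = g1 / D
    cumM = np.cumsum(M) * du                     # running integral of M
    cumM -= np.interp(0.0, u, cumM)              # so the integral starts at 0
    return lam1 * M * np.exp((lam2 - lam1) * cumM)   # Eq. (3.12)

# Sanity check: at lam1 = lam2 = 1, f~ reduces to g~1, which integrates to ~1
f_rs = ftilde(1.0, 1.0)
assert np.allclose(f_rs, g1)
assert abs(f_rs.sum() * du - 1.0) < 1e-3
```

In practice (λ1, λ2) would be pinned down by the subsidiary conditions (3.4) and (3.13); the values passed to `ftilde` here are placeholders.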
Then the solution of the optimization problem has the form

f̃(u|β, λ) = λ1 M̃(u|β, λ) exp[(λ2 − λ1) ∫_0^u dv M̃(v|β, λ)].    (3.12)

The requirement that this should be a proper density function, i.e., ∫ du f̃(u|β, λ) = 1, can be expressed as follows:

R̃(β, λ) ≡ (λ2 − λ1) ∫_{−∞}^∞ dv M̃(v|β, λ) − log(λ2/λ1) = 0.    (3.13)

This restriction, together with (3.4), implicitly determines λ1 and λ2. As shown in Appendix A.1.2, the resulting solution f̃(u|β, λ) then satisfies not only the first-order condition (3.9), but also the previous restrictions (3.8) on the Lagrange multipliers. A minor complication arises from the fact that (3.4) and (3.13) also allow for a spurious solution at λ1 = λ2 = 1. One way of avoiding this is to change the parameterization to

λ1 = 1 − µ/Q1,  λ2 = 1 + µ/Q2,    (3.14)

which satisfies (3.4). Then R̃(β, λ) = µ² R̃1(β, µ) (with R̃1(β, µ) nonsingular at µ = 0), and (3.13) can be replaced by the condition R̃1(β, µ) = 0. Details are given in Appendix A.1.1.

While we could proceed with the concentrated smoothed likelihood L̃(β, f̃), there is no loss in asymptotic efficiency if we simplify the problem by substituting f̃ in the original version L(β, f) in (3.5). This leads to the final version of the concentrated log likelihood, log L̃c(β) = log L̃c(β, λ̃(β)), where

log L̃c(β, λ) = Σ_{i=1}^n {log M̃(yi − xiβ|β, λ) + (λ2 − λ1) ∫_0^{yi} dv M̃(v − xiβ|β, λ)},    (3.15)

with the convention that ∫_a^b is to be interpreted as −∫_b^a if a > b, and where λ = λ̃(β) is the solution of the subsidiary conditions (3.4) and (3.13).

Score function

Taking into account the dependence of the Lagrange multipliers on β, the estimated score function has the form

S̃(β, λ̃(β)) = d log L̃c(β, λ̃(β))/dβ = S̃β − S̃µ R̃β/R̃µ    (3.16)

where S̃β = ∂ log L̃/∂β, R̃β = ∂R̃/∂β, and (for this sampling scheme)

S̃µ = −(1/Q1) ∂ log L̃/∂λ1 + (1/Q2) ∂ log L̃/∂λ2,
R̃µ = −(1/Q1) ∂R̃/∂λ1 + (1/Q2) ∂R̃/∂λ2.    (3.17)

These derivatives can be written as

S̃β = Σ_{i=1}^n s̃β(yi, xi|β, λ),  s̃β(yi, xi|β, λ) = m̃1β(yi, xi|β, λ) + ∫_0^{yi} dv m̃2β(v, xi|β, λ),    (3.18)

S̃µ = Σ_{i=1}^n s̃µ(yi − xiβ|β, λ),  s̃µ(yi − xiβ|β, λ) = m̃1µ(yi − xiβ|β, λ) + ∫_0^{yi} dv m̃2µ(v − xiβ|β, λ),    (3.19)

R̃β = ∫_{−∞}^∞ dv m̃2β(v, 0|β, λ),  R̃µ = ∫_{−∞}^∞ dv m̃2µ(v|β, λ) − 1/(Q1 Q2 λ1 λ2)    (3.20)

(again with the convention that ∫_a^b becomes −∫_b^a if a > b); explicit expressions for the moment functions m̃1β(y, x|β, λ), m̃2β(v, x|β, λ), m̃1µ(u|β, λ) and m̃2µ(u|β, λ) are given in Appendix A.2, Eqs. (A.2.1)–(A.2.4).

Asymptotically equivalent parametric model

The kernel sums g̃1(u|β) and g̃2(u|β) are consistent estimators of g1(u|β) and g2(u|β), which are weighted averages (using the sampling probabilities as weights) of the stratum densities of the residuals e(β) = y − xβ and of xβ respectively (see Eqs. (A.3.1)–(A.3.2) for specific expressions for g1 and g2). By substituting g1 and g2 for g̃1 and g̃2, define G(u|β), f(u|β, λ), S(β, λ), R(β, λ) and λ(β) as the analogs of G̃(u|β), f̃(u|β, λ), S̃(β, λ), R̃(β, λ), and λ̃(β). In particular, λ(β) is the implicit function defined by R(β, λ) = 0, where λ depends on an underlying scalar parameter µ. Consider an experiment where the density of (x, y) in the underlying population is

ℓ0(x, y|β) = [f(y − xβ|β, λ(β)) / F̄(−xβ|β, λ(β))] × F̄(−xβ0|β0, λ(β0)) h(x)    (3.21)

and where the likelihood for an observation is either

ℓ0(x, y|β) {Q1−1 1(y ≥ 0) + Q2−1 1(y < 0)}    (3.22)

for standard stratified sampling or

ℓ0(x, y|β) {(π1/π̄) 1(y ≥ 0) + (π2/π̄) 1(y < 0)}    (3.23)

for variable probability sampling. The function R(β, λ) must, of course, be defined such that (3.21) is in fact a likelihood function; this is verified by Eq. (A.3.5). In the hypothetical case where f(u|β, λ(β)) is a known function of u and β, this defines a parametric likelihood function, and the "asymptotic" score function S(β, λ(β)) can now be interpreted as the corresponding score function. We shall suppose that this parametric model satisfies classical conditions such that there is a consistent root β = β̄ of the score equation S(β) ≡ S(β, λ(β)) = 0, and that β̄ is asymptotically normal with asymptotic variance given by (varS[n−1/2 S(β0, λ0)])−1. Appendix A.3 gives expressions for g1(u|β0), g2(u|β0) and their derivatives, which are used to evaluate S(β0, λ0) and also to verify that R(β0, λ0) = 0. It can then be verified that varS[n−1/2 S(β0, λ0)] is equal to the inverse of the semiparametric efficiency bound (A.8.1). The task is therefore to construct an estimator β̂ asymptotically equivalent to β̄.

Trimming factor

As usual with kernel-based semiparametric estimators, uniform convergence of g̃1, G̃ and their derivatives does not automatically imply uniform convergence of the score function and its derivatives, because some of the denominator terms are not bounded below and so must be trimmed. Define the trimming factor τ by

τ(a) = t([a − bn]/bn)
where t is a function such that t(u) = 1 for u ≥ 1 and t(u) = 0 for u ≤ 0, with a smooth polynomial interpolation for 0 ≤ u ≤ 1 such that the second derivative is continuous (see for example Ai (1997)), and where bn is a shrinking lower bound. (The dependence of τ on bn will not be shown explicitly.) To simplify the notation, terms like a−1 τ(a) are to be interpreted as zero if a = 0. Then a−1 τ(a) < bn−1.

Care is needed in trimming the score function so as not to interfere with the derivation of asymptotic efficiency. In particular, we have to maintain the condition ES[S∗(β0, λ0)] = 0, where S∗(β, λ) denotes the trimmed version of S(β, λ). To achieve this, define S̃β∗(β, λ) and R̃β∗(β, λ) by replacing the functions m̃1β and m̃2β in (3.18) and (3.20) by their trimmed versions

m̃jβ∗(v, x|β, λ) = m̃jβ(v, x|β, λ) τ[M̃(v − xβ|β, λ)],  j = 1, 2.    (3.24)

The term D̃(u|β, λ) is bounded below by a positive value when λ is restricted to a known region that includes the asymptotic limit λ0 = λ(β0) = (p1/Q1, p2/Q2) (see Appendix A.1.3), and thus terms with D̃ as the denominator do not require additional trimming; in particular, the condition R̃(β, λ) = 0 is unchanged. The trimmed score function is then given by the analog of (3.16), S̃∗ = S̃β∗ − S̃µ R̃β∗/R̃µ.

Asymptotic efficiency

Let β̂ be the estimator obtained by solving S̃∗(β, λ) = 0 and R̃(β, λ) = 0 for (β, λ) with suitable values of the smoothing parameter hn and the trimming parameter bn. Asymptotic efficiency of β̂ is verified by showing that it is asymptotically equivalent to the classical estimator β̄ defined above, i.e., that plim n1/2 (β̂ − β̄) = 0. By a standard classical argument, this follows if n−1 S̃∗(β) and n−1 ∂S̃∗(β)/∂β converge in probability to n−1 S(β) and n−1 ∂S(β)/∂β uniformly in β, and n−1/2 S̃∗(β0) converges in probability to n−1/2 S(β0), where S̃∗(β) = S̃∗(β, λ̃(β)). As shown in Appendix A.6, these results hold under suitable rates of convergence of hn, the smoothing parameter, and bn, the trimming parameter.7 In practice, there may be multiple solutions of S̃∗(β, λ) = 0 and R̃(β, λ) = 0, in which case an initial consistent estimator is required. A specific initial estimator, using inverse probability weighted moments, is introduced in Appendix B in connection with the simulation study.

It should also be emphasized that asymptotic efficiency of β̂ does depend on estimating the specific function f(·|β, λ(β)), defined by the asymptotic limit of (3.12). For example, by analogy with the kernel-based adaptive estimator of the regression model from a random sample, a plausible alternative strategy might be to estimate the population density of the residuals y − xβ, which can be done using inverse probability weighting:

f̃IPW(u|β) = (1/(n hn)) Σ_{i=1}^n {(Q1/p1) 1(yi ≥ 0) + (Q2/p2) 1(yi < 0)} K((u − (yi − xiβ))/hn).

However, substituting this estimated density into the likelihood function does not lead to an asymptotically efficient estimator of β.
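The trimming factor τ above can be sketched as follows. The paper only requires t to be a polynomial interpolation with a continuous second derivative; the quintic "smoothstep" 6u⁵ − 15u⁴ + 10u³ used here is one convenient choice satisfying that requirement, not the specific function used by the author.

```python
import numpy as np

def t(u):
    """C^2 interpolation: 0 for u <= 0, 1 for u >= 1, quintic in between."""
    u = np.clip(u, 0.0, 1.0)
    return ((6.0 * u - 15.0) * u + 10.0) * u**3

def tau(a, bn):
    """Trimming factor tau(a) = t((a - b_n)/b_n): zero below b_n, one above 2*b_n."""
    return t((np.asarray(a, dtype=float) - bn) / bn)

bn = 0.05
assert tau(0.0, bn) == 0.0 and tau(bn, bn) == 0.0   # fully trimmed at or below b_n
assert tau(2 * bn, bn) == 1.0                        # untrimmed at or above 2*b_n
assert 0.0 < tau(1.5 * bn, bn) < 1.0                 # smooth transition region
```

Because t′ and t″ vanish at both u = 0 and u = 1 for this quintic, τ is twice continuously differentiable, as required for the convergence arguments.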
4. Standard stratified sampling with unknown population shares

In this case the log likelihood is

log L(β, f, h) = Σ_{i=1}^n {log f(yi − xiβ) + log h(xi)} − n1 log Q1(β, f, h) − n2 log Q2(β, f, h)    (4.1)
with Q1(β, f, h) and Q2(β, f, h) given by (2.1). Proposition 7 in the Appendix shows that Q is nonparametrically identified.

Concentrated likelihood

As before, the first step is nonparametric maximum likelihood estimation of h for given (β, f). Assigning a point mass wi to each data point as before, this step reduces to maximization of

log Lh(w) = Σ_{i=1}^n log wi − n1 log {Σ_{i=1}^n wi [1 − F(−xiβ)]} − n2 log {Σ_{i=1}^n wi F(−xiβ)}    (4.2)

over {wi} subject to Σi wi = 1. Let ĥ be the resulting empirical density, and let q1 = Q1(β, f, ĥ) and q2 = 1 − q1 be the estimated population shares. (As before, the dependence on β and f is not explicitly shown.) For notational convenience, define parameters λ1 = λ1(q) and λ2 = λ2(q) by

λ1 = p1/q1,  λ2 = p2/q2.    (4.3)

Then the solution for wi can again be written as 1/wi = n F̄(−xiβ|λ), with F̄ given by (3.2). This leads to

(1/n) Σ_{i=1}^n [1 − F(−xiβ)] / F̄(−xiβ|λ) = q1    (4.4)

which implicitly determines q. The resulting concentrated likelihood is

log L(β, f) = Σ_{i=1}^n {log f(yi − xiβ) − log F̄(−xiβ|λ)} − n1 log q1 − n2 log q2    (4.5)
(apart from constant terms). The corresponding smoothed log likelihood is given by

(1/n) log L̃(β, f) = ∫ du g̃1(u|β) log f(u) − ∫ du g̃2(u|β) log F̄(−u|λ) − p1 log q1 − p2 log q2    (4.6)

with q determined by the smoothed version of (4.4),

∫ dv g̃2(v|β) [1 − F(−v)] / F̄(−v|λ) = q1.    (4.7)
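The implicit equation (4.4) for the share q1 is a scalar root-finding problem once (β, f) is fixed: with λ1 = p1/q1 and λ2 = p2/q2 from (4.3) and F̄ from (3.2), the left side minus q1 changes sign on (0, 1). The sketch below solves it by bisection; the logistic F and all numerical values are hypothetical stand-ins for the unknown error distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1 = 1000, 0.7                      # p1 = sample share of stratum S1
p2 = 1.0 - p1
xb = rng.normal(size=n)                # x_i * beta (hypothetical)
F = 1.0 / (1.0 + np.exp(xb))           # F(-x_i beta) with logistic errors

def h(q1):
    """Left side of Eq. (4.4) minus q1, using (4.3) and (3.2)."""
    lam1, lam2 = p1 / q1, p2 / (1.0 - q1)
    Fbar = lam1 * (1.0 - F) + lam2 * F
    return np.mean((1.0 - F) / Fbar) - q1

lo, hi = 1e-6, 1.0 - 1e-6              # h(lo) > 0 and h(hi) < 0, so a root exists
for _ in range(200):                   # bisection on h(q1) = 0
    mid = 0.5 * (lo + hi)
    if h(lo) * h(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
q1_hat = 0.5 * (lo + hi)

assert abs(h(q1_hat)) < 1e-8           # Eq. (4.4) holds at the solution
```

Bisection is used only for transparency; any scalar root finder would do, since the equation is one-dimensional in q1.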
The problem of maximizing (4.6) with respect to f is the same as maximizing (3.6), because the variation with respect to λ is again zero. This leads to the concentrated log likelihood log L̃c(β) = log L̃c(β, λ̃(β)), where

log L̃c(β, λ) = Σ_{i=1}^n {log M̃(yi − xiβ|β, λ) + (λ2 − λ1) ∫_0^{yi} dv M̃(v − xiβ|β, λ)} − n1 log q1 − n2 log q2,    (4.8)

with M̃ defined as before.

7 As with other semiparametric estimators of this kind, the convergence rates of the smoothing and trimming parameters may seem unrealistically slow. It is not known whether these rates are really necessary or are just artifacts of the method of proving uniform convergence.
In this representation, q = q̃(β) is the solution of the subsidiary condition (3.13), with λ = λ(q) given by (4.3). As in the case of known Q, this condition still allows for a spurious solution at λ1 = λ2 = 1, corresponding to q1 = p1. To avoid this, a factor (q1 − p1)² can be extracted before solving R̃(β, λ) = 0 (see Appendix A.1.1).

Score function and asymptotic properties

The score function again has the form (3.16), except that the underlying auxiliary parameter is now q1 instead of µ, and therefore the terms S̃µ and R̃µ are replaced by

S̃q = −(p1/q1²) ∂ log L̃/∂λ1 + (p2/q2²) ∂ log L̃/∂λ2,
R̃q = −(p1/q1²) ∂R̃/∂λ1 + (p2/q2²) ∂R̃/∂λ2.    (4.9)

The expression for S̃q is analogous to (3.19), but with m̃1µ(u|β, λ) and m̃2µ(u|β, λ) replaced by the moment functions m̃1q(u, s|β, λ) and m̃2q(u|β, λ) given in Appendix A.2, Eqs. (A.2.5)–(A.2.6) (where s ∈ {1, 2} is the stratum index). The expression for R̃q is now

R̃q = ∫_{−∞}^∞ dv m̃2µ(v|β, λ) − λ1²λ2²/(p1 p2).    (4.10)
The trimmed score S̃∗(β, λ) is defined as before, using the trimming factor in (3.24). As shown in Appendix A.1.3, the denominator term D̃(u|β, λ) is bounded away from zero in the present case also. Solving S̃∗(β, λ) = 0 and R̃(β, λ) = 0 for (β, λ) subject to (4.3), with suitable values of hn and bn, gives the estimator β̂. From the expression for S(β0, λ0) given in Appendix A.3, we again verify that varS[n−1/2 S(β0, λ0)] is equal to the inverse of the relevant semiparametric efficiency bound in (A.8.2). Asymptotic efficiency of β̂ follows by the same argument as in the previous case.

Estimation of population shares

An estimator of the population share Q1 is then q̃1 = q1(β̂), the solution of R̃(β̂, λ(q)) = 0. A standard expansion of the kernel estimators leads to R̃(β0, λ0) = O(hn²) + Op(n−1/2), which implies asymptotic bias in q̃1 unless hn = op(n−1/4). However, this is incompatible with the range of convergence rates needed to prove asymptotic efficiency of β̂: it does not appear to be feasible to formulate an objective function that simultaneously yields efficient estimators of both β and q1. We therefore need a two-step method: after computing β̂, re-estimate q1 by solving R̃(β̂, λ(q1)) = 0 with a smaller bandwidth hn = op(n−1/4) in order to reduce the bias in R̃(β0, λ0).8 In fact, there is no loss in dropping the smoothing from the numerator of the integrand in (3.13), so that the equation to be solved is

R̃∗(β̂, λ) ≡ (λ2 − λ1) (1/n) Σ_{i=1}^n 1/D̃(yi − xiβ̂|β̂, λ) − log(λ2/λ1) = 0    (4.11)

with λ given by (4.3). The resulting estimator q̂1 has an asymptotic representation consisting of two independent terms, one arising from the dependence of Q1 on β and the other from its dependence on the functions f and h. Details are given by Proposition 6 in Appendix A.6. An expression for the asymptotic variance of q̂1 is given by Eq. (A.6.4); a straightforward evaluation then shows that it is equal to the efficiency bound given in Appendix A.8, Eq. (A.8.2).

8 An alternative approach would be to use a higher-order (bias reducing) kernel.

5. Variable probability sampling with unknown population shares

Starting with the log likelihood (2.5) and substituting for p1 and p2 from (2.4) gives

log L(β, f, h) = Σ_{i=1}^n {log f(yi − xiβ) + log h(xi)} − n log[π1 Q1(β, f, h) + π2 Q2(β, f, h)]    (5.1)

(apart from constant terms), with Q1(β, f, h) and Q2(β, f, h) given by (2.1). Although similar to the case of standard stratified sampling, this likelihood is more informative, as can be seen from the fact that there is a preliminary consistent estimator Q̃1 = (n1/π1)/(n1/π1 + n2/π2) which does not depend on β or f. We can suppose without loss that π1 ≠ π2, because otherwise existing adaptive estimators for random sampling can be used.

Concentrated likelihood

Concentrating the likelihood (5.1) with respect to h involves maximization of

log Lh(w) = Σ_{i=1}^n log wi − n log {Σ_{i=1}^n wi F̄(−xiβ|π)}

over {wi} subject to Σi wi = 1, with F̄ again given by (3.2). As before, let q1 = Q1(β, f, ĥ) and q2 = 1 − q1 be the estimated population shares, and let π̄ = π1 q1 + π2 q2 be the estimated overall sampling rate. In this case, the optimization problem has an explicit solution (conditional on β and f), given by

π̄ = [(1/n) Σ_{i=1}^n 1/F̄(−xiβ|π)]−1    (5.2)

and

wi = n−1 π̄ / F̄(−xiβ|π).

Note that q1 can be recovered from π̄ by q1 = (π̄ − π2)/(π1 − π2), and similarly for q2. The resulting concentrated log likelihood is

log L(β, f) = Σ_{i=1}^n {log f(yi − xiβ) − log F̄(−xiβ|π)}    (5.3)

(apart from constant terms). This is the same as the conditional likelihood of y given x, which was originally used (with normal errors) by Hausman and Wise (1981); the derivation via nonparametric maximum likelihood estimation of h(x) was given by Jewell (1985). The corresponding smoothed log likelihood is given by

(1/n) log L̃(β, f) = ∫ du g̃1(u|β) log f(u) − ∫ du g̃2(u|β) log F̄(−u|π)    (5.4)

while the smoothed version of (5.2) is

π̄−1 = ∫ dv g̃2(v|β) / F̄(−v|π).    (5.5)

Note that (5.4) differs from (3.6) only in that λ1 and λ2 have been replaced by the known constants π1 and π2. Nevertheless, to allow for a uniform treatment of all three sample designs, we re-introduce λ1 and λ2 as auxiliary parameters defined by

λj = πj/π̄ = πj/(π1 q1 + π2 q2).    (5.6)
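Unlike the implicit equation of the previous section, (5.2) gives π̄ in closed form (a harmonic mean), after which q1 and the auxiliary parameters (5.6) follow by simple algebra. The sketch below uses a hypothetical logistic F and illustrative values of π1, π2, and verifies the identities π̄ = π1 q1 + π2 q2 and λ1 q1 + λ2 q2 = 1 (the analog of (3.4)).

```python
import numpy as np

rng = np.random.default_rng(3)
n, pi1, pi2 = 1000, 0.9, 0.3
xb = rng.normal(size=n)                       # x_i * beta (hypothetical)
F = 1.0 / (1.0 + np.exp(xb))                  # F(-x_i beta), logistic errors
Fbar_pi = pi1 * (1.0 - F) + pi2 * F           # Fbar(-x_i beta | pi), Eq. (3.2)

pibar = 1.0 / np.mean(1.0 / Fbar_pi)          # Eq. (5.2): harmonic-mean form
q1 = (pibar - pi2) / (pi1 - pi2)              # recover the share estimate
q2 = 1.0 - q1
lam1, lam2 = pi1 / pibar, pi2 / pibar         # Eq. (5.6)

assert pi2 < pibar < pi1                              # rate lies between pi_2 and pi_1
assert abs(pi1 * q1 + pi2 * q2 - pibar) < 1e-12       # pibar = pi1 q1 + pi2 q2
assert abs(lam1 * q1 + lam2 * q2 - 1.0) < 1e-12       # consistent with Eq. (3.4)
```

The two identities hold by construction of q1, which is why no root finding is needed in this design.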
Optimization with respect to f proceeds as before. The solution f̃ is again given by (3.12) and the concentrated log likelihood by

log Lc(β, λ) = Σ_{i=1}^n [ log M̃(yi − xiβ|β, λ) + (λ2 − λ1) ∫_0^{yi} dv M̃(v − xiβ|β, λ) ] − n log π̄   (5.7)

subject to (3.13), which is equivalent to substituting f̃ for f in (5.5). As before, (3.13) can be solved by first substituting for λ1 and λ2 in terms of π̄ (or q1). Because we assumed π1 ≠ π2, no special treatment is needed to avoid a spurious solution at λ1 = λ2.

Score function and asymptotic properties

Because λ1 and λ2 are now defined in terms of q by (5.6) instead of (4.3), we have

S̃q = ((π2 − π1)/π̄^2) [ π1 ∂log L̃/∂λ1 + π2 ∂log L̃/∂λ2 ],
R̃q = ((π2 − π1)/π̄^2) [ π1 ∂R̃/∂λ1 + π2 ∂R̃/∂λ2 ]

instead of (4.9). Redefine the moment functions m̃1q(u|β, λ) and m̃2q(u|β, λ), using the expressions given by Eqs. (A.2.7)–(A.2.8) in Appendix A.2. Then S̃q is given by the analog of (3.19), as before, while the expression for R̃q in (4.10) is replaced by

R̃q = ∫_{-∞}^{∞} dv m̃2q(v|β, λ).   (5.8)

The trimmed score S̃*(β, λ) is defined as in the previous section. Solving S̃*(β, λ) = 0 and R̃(β, λ) = 0 for (β, λ), subject to (5.6), gives the estimator β̂. An expression for S(β0, λ0) is given in Appendix A.3. Again, varS[n^{-1/2} S(β0, λ0)] is equal to the inverse of the relevant semiparametric efficiency bound in (A.8.3), and asymptotic efficiency of β̂ follows.

Finally, we can obtain an efficient estimator q̂1 of the population share Q1 by the same two-step method as described in the previous section. The only difference is that λ1 and λ2 in (4.11) are expressed in terms of q1 using (5.6) instead of (4.3).

6. Conclusion

Smoothing the log likelihood with respect to an index function or a residual makes maximization with respect to unknown functions tractable in a number of cases, without compromising the asymptotic efficiency of parameter estimation from the concentrated likelihood. This paper shows how the technique can be applied to the case of endogenously stratified regression with two strata, with errors independent of the regressors. In this application, the unknown function is the density function of the error terms, while the parameters are conventional regression coefficients. Using a specific trimming method, we are able to show that the solution of the trimmed score equations provides a parameter estimator that meets the semiparametric asymptotic efficiency bound for this problem. In implementing this estimator, each evaluation of the score function and its derivatives involves one-dimensional numerical integration of a ratio of kernel estimators, but this is not a major computational cost by today's standards. In addition, a smoothing parameter and a trimming parameter have to be specified. On the other hand, a possibly attractive feature is that the estimating equations are then explicit expressions in terms of kernel functions of the residuals and of the regression functions.

Appendix A

A.1. Constructing the estimators

Denote the three sampling schemes as SQ (known Q), Sstd (unknown Q, standard stratified sampling), and Svp (unknown Q, variable probability sampling).

A.1.1. Solving for λ

The auxiliary parameters (λ1, λ2) are in principle determined by the boundary condition ∫ du f̃(u|β, λ) = 1 together with the appropriate relationship (3.4), (4.3) or (5.6). However, for SQ and Sstd, there can be a spurious solution at λ1 = λ2 = 1. Since this could also be a feasible outcome, it cannot be excluded a priori from the parameter space, and instead we modify the constraint Eq. (3.13). For SQ, use the parameterization (3.14): λ1 = 1 − µ/Q1 and λ2 = 1 + µ/Q2. Then R̃(β, λ) = µ^2 R̃1(β, µ), where R̃1(β, µ) is nonsingular at µ = 0, and (λ1, λ2) is determined by the solution of R̃1(β, µ) ≡
∫ dv ([∆Q − µ + G̃(v|β)] g̃1(v|β)) / (Q1Q2 + µ∆Q − µ^2 + µG̃(v|β)) + (Q1/Q2) ℓ2(µ/Q2) − (Q2/Q1) ℓ2(−µ/Q1) = 0   (A.1.1)
where ∆Q = Q1 − Q2 and ℓ2(z) = z^{-2}[log(1 + z) − z]. Note that this does not exclude the outcome µ = 0: if (3.8) has the solution µ = 0, then f̃(u|β, λ) = g̃1(u|β) and it follows that R̃1(β, 0) = 0 (see the remark following the proof of Proposition 1 below). Similarly, for Sstd, use the parameterization (4.3), i.e., λ1 = p1/q1 and λ2 = p2/(1 − q1). Then R̃(β, λ) = (q1 − p1)^2 R̃2(β, q1), with R̃2(β, q1) nonsingular at q1 = p1, and (λ1, λ2) is determined by the solution of R̃2(β, q1) ≡
∫ dv (g̃1(v|β) G̃(v|β)) / (p1p2 + (q1 − p1) G̃(v|β)) − (p1/p2) ℓ2((p1 − q1)/p2) + (p2/p1) ℓ2((q1 − p1)/p1) = 0.   (A.1.2)
Again, this does not exclude the outcome q1 = p1: if (4.4) has the solution q1 = p1, then R̃2(β, p1) = 0.

A.1.2. Solution of the variational equation

Since uniqueness is not an issue here, we just verify that f̃(u|β, λ) is the required solution. The following proposition is stated for SQ, but the proofs are essentially the same for the other two cases. (For Svp, the case λ1 = λ2 does not occur.)

Proposition 1. Let f̃(u|β, λ) be the density function defined by (3.12), subject to the boundary conditions (3.14) and (A.1.1). Then f = f̃ solves the first-order variational equation (3.9) and the subsidiary conditions (3.8).

Proof. Let F̃(u|β, λ) be the corresponding distribution function, and define F̄̃(u|β, λ) as in (3.2). First, suppose that µ ≠ 0, in which case the condition (A.1.1) is equivalent to (3.13), and

f̃(u|β, λ) = (1/(λ2 − λ1)) dF̄̃(u|β, λ)/du = g̃1(u|β) F̄̃(u|β, λ)/D̃(u|β, λ)   (A.1.3)
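The ratio entering the smoothed likelihood can be sketched numerically. This is a minimal illustration, assuming the form D̃ = λ1λ2 + (λ2 − λ1)G̃ implicit in the bound (A.1.7) below; the kernel choice, the stand-in for G̃, and all function names are illustrative rather than the paper's implementation:

```python
# Sketch of the kernel objects in the smoothed likelihood: g~1 is a kernel
# density estimate of the residual density, and M~(u) = g~1(u)/D~(u) with
# D~(u) = lam1*lam2 + (lam2 - lam1)*G~(u).  G~ is supplied by the caller.
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def g1_tilde(u, resid, h):
    """Kernel density estimate of the residual density at points u."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    return gaussian_kernel((u[:, None] - resid[None, :]) / h).mean(axis=1) / h

def m_tilde(u, resid, h, G_tilde, lam1, lam2):
    """Ratio M~(u) = g~1(u) / D~(u)."""
    D = lam1 * lam2 + (lam2 - lam1) * G_tilde(u)
    return g1_tilde(u, resid, h) / D

# With lam1 = lam2 = 1 (sampling shares equal to population shares),
# D~ = 1 and M~ reduces to the plain kernel density estimate.
rng = np.random.default_rng(0)
eps = rng.standard_normal(500)
vals = m_tilde(np.array([0.0]), eps, h=0.3,
               G_tilde=lambda u: 0.0 * np.atleast_1d(u), lam1=1.0, lam2=1.0)
```

Evaluating M̃ on a grid and integrating the result is the one-dimensional quadrature step mentioned in the Conclusion.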
with D̃ defined as in Section 3. It follows that

(1/(λ2 − λ1)) d/dv [D̃(v|β, λ)/F̄̃(v|β, λ)]
  = (g̃1(v|β) − g̃2(−v|β))/F̄̃(v|β, λ) + (D̃(v|β, λ)/(λ2 − λ1)) d/dv [1/F̄̃(v|β, λ)]
  = −g̃2(−v|β)/F̄̃(v|β, λ).   (A.1.4)

Substituting (A.1.3) and (A.1.4) into the left-hand side of (3.9) then leads to

D̃(u|β, λ)/F̄̃(u|β, λ) + (1/(λ2 − λ1)) { λ1 [D̃(v|β, λ)/F̄̃(v|β, λ)]_{v=−∞}^{v=u} + λ2 [D̃(v|β, λ)/F̄̃(v|β, λ)]_{v=u}^{v=∞} }.   (A.1.5)

From the definitions (3.10)–(3.11), G̃(∞|β) = G̃(−∞|β) = 0, D̃(∞|β, λ) = D̃(−∞|β, λ) = λ1λ2, and F̄̃(−∞|β, λ) = λ1, while condition (3.13) implies F̄̃(∞|β, λ) = λ2. It follows that the expression (A.1.5) is zero, i.e., Eq. (3.9) is satisfied. To verify (3.8), substitute for F̃(u|β, λ) in terms of F̄̃(u|β, λ), which gives

∫ dv g̃2(−v|β) F̃(v|β, λ)/F̄̃(v|β, λ)
  = (1/(λ2 − λ1)) ∫ dv g̃2(−v|β) [1 − λ1/F̄̃(v|β, λ)]
  = (1/(λ2 − λ1)) { 1 + (λ1/(λ2 − λ1)) [D̃(v|β, λ)/F̄̃(v|β, λ)]_{v=−∞}^{v=∞} }
  = (1 − λ1)/(λ2 − λ1) = Q2

as required, where the second equality comes from (A.1.4) and the last equality comes from (3.14). The other equation in (3.8) follows in the same way. Finally, suppose µ = 0, i.e., λ1 = λ2 = 1, in which case f̃(u|β, λ) = g̃1(u|β), F̃(u|β, λ) = G̃1(u|β), and F̄̃(u|β, λ) = 1. Eq. (3.9) is clearly satisfied, while conditions (3.8) reduce to ∫ dv g̃2(v|β) G̃1(v|β) = Q2. From (A.1.1),

R̃1(β, 0) = (1/(Q1Q2)) ∫ dv [∆Q + G̃(v|β)] g̃1(v|β) + (1/2)(Q2/Q1 − Q1/Q2) = 0.

Integrating by parts, using ∂G̃(v|β)/∂v = g̃1(v|β) − g̃2(−v|β), leads to

∫ dv g̃2(v|β) G̃1(v|β) = Q2

as required.

Note that the last part of the proof also shows that when µ = 0, then (3.8) implies (A.1.1), and thus the cases where µ = 0 and R̃1(β, µ) ≠ 0, or where q1 = p1 and R̃2(β, q1) ≠ 0, are indeed "spurious" solutions of the boundary conditions.

A.1.3. Bounded denominator terms

The following result justifies the absence of trimming for the denominator term D̃(u|β, λ) defined in (3.11).

Proposition 2. Assume that Q1 ∈ (δ1, 1 − δ2) for some δ1 > 0, δ2 > 0. Then a closed set ΛD can be constructed such that (i) D̃(u|β, λ) > 0 for all λ ∈ ΛD and (ii) λ0 ∈ ΛD with probability approaching 1, where λ0 = (p1/Q1, p2/Q2)′.

Proof. First, consider standard stratified sampling. In the expression for G̃(u|β) in (3.10), each term in the sum is negative for stratum 1 and positive for stratum 2, and therefore p2 > G̃(u|β) > −p1, which implies

D̃(u|β, λ) ≥ λ1λ2 − max{(λ2 − λ1) p1, (λ1 − λ2) p2}.   (A.1.7)

Evaluating this at λ0 gives D̃(u|β, λ0) ≥ min{λ0,1, λ0,2} > 0, so there do exist sets ΛD satisfying conditions (i) and (ii). If Q is known, express λ in terms of µ according to (A.1.1). Then D̃(u|β, λ) > 0 for µ in the interval

( Q1 − (1/2){p1 + [p1^2 + 4p2Q1]^{1/2}},  (1/2){p2 + [p2^2 + 4p1Q2]^{1/2}} − Q2 )   (A.1.8)

which contains the point µ = µ0 = Q1 − p1 corresponding to λ0. Then a suitable ΛD is any closed subinterval of (A.1.8) containing µ0. For variable probability sampling with known Q, p1 and p2 are replaced here by the corresponding sample values p̂1 = n1/n and p̂2 = n2/n, and the same result holds with probability approaching 1.

Next, consider Sstd. Since p1/λ1 is the estimator of Q1, there is no loss in restricting λ to the set

ΛD = {λ | δ1 ≤ p1/λ1 ≤ 1 − δ2, p1/λ1 + p2/λ2 = 1}.   (A.1.6)

Then, using the restriction p1/λ1 + p2/λ2 = 1, (A.1.7) reduces to

D̃(u|β, λ) ≥ min{λ1, λ2} ≥ min{p1/(1 − δ2), p2/(1 − δ1)} > 0

as required. Finally, consider Svp, where the restriction on λ is λ1/π1 = λ2/π2. Without loss, consider the case where π2 > π1, and let

ΛD = {λ | λ1 ≥ p̂1(π2 − π1)/π2, λ1/π1 = λ2/π2}.

It follows from (A.1.7) (with p̂ substituted for p) that λ ∈ ΛD implies D̃(u|β, λ) > 0. Also, p̂ → p in probability, and λ0,1 = p1/Q1 > p1(π2 − π1)/π2, so λ0 ∈ ΛD with probability approaching 1, as required.
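The function ℓ2 in (A.1.1)–(A.1.2) has a removable singularity at z = 0, with ℓ2(0) = −1/2. A numerically careful evaluation (illustrative sketch, names hypothetical) switches to a short Taylor series near zero, so the boundary conditions can be evaluated continuously through the spurious root µ = 0:

```python
# Evaluation of l2(z) = z**-2 * (log(1+z) - z).  Direct evaluation loses
# precision near z = 0 by cancellation; the Taylor expansion
# l2(z) = -1/2 + z/3 - z**2/4 + z**3/5 - ... is used there instead.
import math

def ell2(z):
    if abs(z) < 1e-4:
        return -0.5 + z / 3.0 - z ** 2 / 4.0 + z ** 3 / 5.0
    return (math.log1p(z) - z) / z ** 2
```

Because ℓ2 is smooth through 0, the reparameterized conditions R̃1(β, µ) = 0 and R̃2(β, q1) = 0 need no special-casing at µ = 0 or q1 = p1.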
A.2. Moment functions in the score equations

Population shares known. The moment functions m̃1β, m̃2β, m̃1µ, and m̃2µ, which appear in the components of the score function (3.18)–(3.20), are given by the following expressions:

m̃1β(y, x|β, λ) = (1/g̃1(y − xβ|β)) dg̃1(y − xβ|β)/dβ − ((λ2 − λ1)/D̃(y − xβ|β, λ)) dG̃(y − xβ|β)/dβ   (A.2.1)

m̃2β(v, x|β, λ) = ((λ2 − λ1) g̃1(v − xβ|β)/D̃(v − xβ|β, λ)) m̃1β(v, x|β, λ)   (A.2.2)

m̃1µ(u|β, λ) = ((λ1^2 Q1 + λ2^2 Q2)/(Q1Q2 (λ2 − λ1))) (1/D̃(u|β, λ) − 1)   (A.2.3)

m̃2µ(u|β, λ) = ((λ1^2 Q1 + λ2^2 Q2)/(Q1Q2)) g̃1(u|β)/D̃(u|β, λ)^2.   (A.2.4)

As usual, the notation d/dβ represents the total derivative with respect to β,

dg(u(β)|β)/dβ = (∂g(u|β)/∂u) ∂u(β)/∂β + ∂g(u|β)/∂β.

Standard stratified sampling, population shares unknown. The moment functions m̃1β and m̃2β appearing in the score function are the same as above, while m̃1µ and m̃2µ are replaced by the functions m̃1q and m̃2q defined as follows:

m̃1q(u, s|β, λ) = (λ1^2 λ2^2/(p1p2 (λ2 − λ1))) (1/D̃(u|β, λ)) − (1/λ1) 1(s = 1) − (1/λ2) 1(s = 2)   (A.2.5)

m̃2q(u|β, λ) = (λ1^2 λ2^2/(p1p2)) g̃1(u|β)/D̃(u|β, λ)^2,   (A.2.6)

where s ∈ {1, 2} is the stratum index.

Variable probability sampling, population shares unknown. The moment functions m̃1β and m̃2β are again the same as above, while m̃1q and m̃2q are redefined by the following expressions:

m̃1q(u|β, λ) = (π1π2(π1 − π2)/π̄^3) (1/D̃(u|β, λ)),   (A.2.7)

m̃2q(u|β, λ) = −(π1π2(π1 − π2)^2/π̄^4) g̃1(u|β)/D̃(u|β, λ)^2.   (A.2.8)

A.3. Asymptotic score function

The probability limits of the kernel estimators g̃1(u|β) and g̃2(u|β) are the corresponding densities for the stratified population,

g1(u|β) = λ0,1 E[f(u + x[β − β0])|xβ > −u] [1 − H(−u|β)] + λ0,2 E[f(u + x[β − β0])|xβ < −u] H(−u|β)   (A.3.1)

g2(u|β) = E[F̄0(−Xβ0)|Xβ = u] h(u|β)   (A.3.2)

where λ0,1 = p1/Q1, λ0,2 = p2/Q2, h(u|β) is the population density of xβ, H(u|β) is the corresponding distribution function, and F̄0(u) = λ0,1[1 − F(u)] + λ0,2 F(u). The probability limit of G̃(u|β) is then

G(u|β) = −λ0,1 E[1 − F0(u + x[β − β0])|xβ > −u] [1 − H(−u|β)] + λ0,2 E[F0(u + x[β − β0])|xβ < −u] H(−u|β).   (A.3.3)

Let m1β(y, x|β, λ), m2β(u, x|β, λ), m1µ(u|β, λ) and m2µ(u|β, λ) be the moment functions defined by substituting g1 and G for g̃1 and G̃ in (A.2.1)–(A.2.2) and in (A.2.3)–(A.2.4), (A.2.5)–(A.2.6), or (A.2.7)–(A.2.8) (depending on the sampling scheme, with µ replaced by q in the cases where Q is unknown), and similarly for the score function components sβ(y, x|β, λ) and sµ(u|β, λ). These are the functions used in constructing the parametric likelihood (3.21), the corresponding asymptotic score function S(β, λ(β)), and the function

R(β, λ) = (λ2 − λ1) ∫_{-∞}^{∞} dv g1(v|β)/(λ1λ2 + (λ2 − λ1) G(v|β)) − log(λ2/λ1).   (A.3.4)

Substituting (A.3.2) for g̃2(u|β) in (3.8), and summing the two equations, leads to the result

E[F̄0(−Xβ0)/F̄(−Xβ|β, λ)] = 1,   (A.3.5)

which shows that (3.21) is indeed a likelihood function. Next, we need to evaluate the score function S(β, λ) at (β0, λ0) in order to find the information matrix for the parametric model. Evaluating (A.3.1)–(A.3.3) and their derivatives at (β0, λ0) leads to the following expressions:

g1(u|β0) = f(u) H̄0(−u)   (A.3.6)

D(u|β0) = F̄0(u) H̄0(−u)   (A.3.7)

dg1(u|β0)/dβ = f′(u) (Ē[x|u] − x) H̄0(−u) − (λ0,2 − λ0,1) f(u) (E[x|xβ0 = −u] − x) h0(−u)   (A.3.8)

dD(u|β0)/dβ = (λ0,2 − λ0,1) f(u) (Ē[x|u] − x) H̄0(−u) − (λ0,2 − λ0,1) F̄0(u) (E[x|xβ0 = −u] − x) h0(−u)   (A.3.9)

where h0(u) = h(u|β0), H0(u) = H(u|β0), H̄0(u) = λ0,1[1 − H0(u)] + λ0,2 H0(u), and

Ē[x|u] = H̄0(−u)^{-1} { λ0,1 [1 − H0(−u)] E[x|xβ0 > −u] + λ0,2 H0(−u) E[x|xβ0 < −u] }.

In the derivatives, u stands for u(β0) where u(β) = y − xβ, so that (for example) dg1(u|β)/dβ = ∂g1(u|β)/∂β − x ∂g1(u|β)/∂u. From these results, we can immediately verify that

R(β0, λ0) = (λ0,2 − λ0,1) ∫ du f(u)/F̄0(u) − log(λ0,2/λ0,1) = 0.   (A.3.10)

Known Q. Substituting (A.3.6)–(A.3.9) in the expressions for the moment functions m1β, m2β, m1µ and m2µ gives

m1β(y, x|β0, λ0) = (Ē[X|ε] − x) (d/dε) log[f(ε)/F̄0(ε)]   (A.3.11)

m2β(u, x|β0, λ0) = (λ0,2 − λ0,1) (f(u − xiβ0)/F̄0(u − xiβ0)) (Ē[X|u − xiβ0] − xi) (d/du) log[f(u − xiβ0)/F̄0(u − xiβ0)]   (A.3.12)

m1µ(u|β0, λ0) = (1/(Q1Q2)) (p2/Q2 − p1/Q1)^{-1} (p1^2/Q1 + p2^2/Q2) (1/(F̄0(u)H̄0(−u)) − 1)   (A.3.13)

m2µ(u|β0, λ0) = (1/(Q1Q2)) (p1^2/Q1 + p2^2/Q2) f(u)/(F̄0(u)^2 H̄0(−u))   (A.3.14)

where εi = yi − xiβ0. (The apparent singularity at p1/Q1 = p2/Q2 is due to common factors which could be explicitly factorized out, but at the expense of more complicated expressions.) Next, Sβ(β0, λ0), Sµ(β0, λ0), Rβ(β0, λ0) and Rµ(β0, λ0) are evaluated by substituting (A.3.11)–(A.3.14) for the moment functions m̃ in Eqs. (3.18)–(3.20). Finally, S(β0, λ0) is given in terms of these components by the asymptotic version of (3.16), S = Sβ − (Rβ/Rµ) Sµ.
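The identity behind (A.3.10) can be checked numerically for a concrete error density. This sketch uses a standard normal f (an illustrative choice, not the paper's setup) and simple midpoint quadrature over a wide grid in place of the integral over the whole line:

```python
# Check of (A.3.10): with Fbar0(u) = lam1*(1 - F(u)) + lam2*F(u),
#   (lam2 - lam1) * integral f(u)/Fbar0(u) du = log(lam2/lam1),
# since (d/du) log Fbar0(u) = (lam2 - lam1) f(u)/Fbar0(u).
import math

def phi(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def r_at_truth(lam1, lam2, lo=-10.0, hi=10.0, n=40000):
    du = (hi - lo) / n
    total = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * du
        fbar = lam1 * (1.0 - Phi(u)) + lam2 * Phi(u)
        total += phi(u) / fbar * du
    return (lam2 - lam1) * total - math.log(lam2 / lam1)
```

The result is zero (up to quadrature error) for any λ1, λ2 > 0, which is the statement R(β0, λ0) = 0.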
It is then straightforward to verify that

E[sβ(y, x|β0, λ0)|y ≥ 0] = [Rβ(β0, λ0)/Rq(β0, λ0)] E[sq(y, x|β0, λ0)|y ≥ 0]   (A.3.15)

i.e., the score function has expected value zero in each stratum separately. For variable probability sampling, therefore, E[s(y, x|β0, λ0)|n1, n2] = 0, and it follows that the conditional and unconditional variances are equal, var[s(y, x|β0, λ0)|n1, n2] = var[s(y, x|β0, λ0)].

Unknown Q, standard stratified sampling. This differs from the previous case only in that m1µ and m2µ are replaced by

m1q(u, s|β0, λ0) = (p1p2/(Q1^2 Q2^2)) (p2/Q2 − p1/Q1)^{-1} (1/(F̄0(u)H̄0(−u))) − (Q1/p1) 1(s = 1) − (Q2/p2) 1(s = 2)

m2q(u|β0, λ0) = (p1p2/(Q1^2 Q2^2)) f(u)/(F̄0(u)^2 H̄0(−u)),

and that Rq(β0, λ0) is given by the analog of (4.10).

Unknown Q, variable probability sampling. In this case,

m1q(u|β0, λ0) = (π1π2(π1 − π2)/π̄^3) (1/(F̄0(u)H̄0(−u))),
m2q(u|β0, λ0) = −(π1π2(π1 − π2)^2/π̄^4) f(u)/(F̄0(u)^2 H̄0(−u)),

while Rq(β0, λ0) is now given by the analog of (5.8). In all three cases, S(β0, λ0) is found to be the same as the
efficient (semiparametric) score. It is straightforward to verify that in each case, varS[n^{-1/2} S(β0, λ0)] is equal to the inverse of the corresponding semiparametric efficiency bound, as given in Eqs. (A.8.1)–(A.8.3).

A.4. Assumptions and conditions

The regularity conditions given in Assumptions 1–6, together with Conditions 1 and 2 on the smoothing and trimming procedures, are used to show that β̂ (the solution of the trimmed score equation S̃*(β) = 0) is asymptotically equivalent to β̄, which is in turn consistent and asymptotically normal with asymptotic variance given by the inverse of var[n^{-1/2} S(β0, λ0)].

Assumption 1. (i) β ∈ B, an open subset of R^k; (ii) x ∈ R^k and ε ∈ R are independent random variables with bounded density functions h(x) and f(ε); and (iii) the regression function xβ is a continuous random variable for all β ∈ B (except for β = 0, if 0 ∈ B).

Assumption 2. An initial √n-consistent estimator of β is available.
This is not restrictive, because GMM estimation is available, using inverse-probability weighted sample analogs of population moments of the form E[z(x)(y − xβ)] = 0. As a result, the parameter space B can be replaced by a neighborhood of β0. Next, to simplify matters, we can subsume some standard regularity conditions under the following:

Assumption 3. Suppose the likelihood for the equivalent parametric model is given by (3.22) or (3.23). Then the corresponding score equation has a consistent asymptotically normal root, with asymptotic variance (varS[S(β0, λ0)])^{-1}.

The likelihoods (3.22) and (3.23) include the function λ(β), which is defined by the subsidiary condition R(β, λ) = 0. Assumption 3 includes the requirement that R(β, λ) is twice continuously differentiable in its arguments in a neighborhood of (β0, λ0), so that the implicit function λ(β) inherits these properties, and also
Rλ(β0, λ0) ≠ 0. (Expressions for R(β, λ) and R(β0, λ0) are given in Eqs. (A.3.5) and (A.3.10).) Additional bounds are needed to ensure that the trimmed version of the equivalent parametric model has the same asymptotic properties as the untrimmed version. Denote θ′ = (β′, µ), m1θ = (m′1β, m1µ)′, m1θθ = ∂m1θ/∂θ′, and similarly for m2θ and m2θθ.

Assumption 4. In some neighborhood of (β0, λ0), (i) |m1θθ|, |m1θ| and |m1θ m′2θ| are bounded by a function ϕ1(y, x) with finite expected value, and (ii) |m2θθ| and |m2θ m′2θ| are bounded by a function ϕ2(y, x) such that ∫ dv ϕ2(v, x) 1(0 < v < y) and ∫ dv ϕ2(v, x) 1(0 > v > y) have finite expected values.

The next two assumptions are used to control the rates of convergence of the kernel estimators of g, G and their derivatives.

Assumption 5. (i) The functions g1(u|β), ∂g1(u|β)/∂β, ∂^2 g1(u|β)/∂β ∂β′ and their first two derivatives with respect to u are bounded uniformly in β; and similarly for G(u|β) and its derivatives. (ii) The functions ∂g1(u|β)/∂β and ∂^2 g1(u|β)/∂β ∂β′ are bounded by an integrable function ϕ(u).

The second derivatives with respect to β are needed when estimating the Hessian, and the second derivatives with respect to u are then used to control the bias terms in the kernel estimators. Part (ii) of the assumption is used to ensure convergence of R̃*β(β, λ) and its derivatives, which involve integration over an infinite range.

Assumption 6. (i) E[|x|^p] < ∞ for some p > 4, and (ii) E[|ε|^q] < ∞ for some q > 0.

The first part bounds the variances of the kernel estimators. Assumption 6 implies that we can find a set Wn ⊂ R^{k+1}, with volume increasing no faster than a power of n, such that all sample observations are in Wn with probability approaching 1 as n → ∞; this allows for uniform convergence on an expanding set (see Lemma B.1 in Ai (1997)).

Assumption 7. Let h0(·) be the marginal density of xβ0. Then f(ε)h0(−ε) > 0 on a set with positive measure.
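The preliminary estimator of Assumption 2 can be based on the weighted moment condition noted above. A minimal sketch with z(x) = x, i.e. inverse-probability weighted least squares (all names illustrative, and the weights assumed known or consistently estimated):

```python
# Preliminary root-n-consistent estimator: solve the sample analog of
# E[x (y - x'beta)] = 0 with inverse-probability weights w_i = 1/lam_{s_i},
# which undo the endogenous stratification.
import numpy as np

def weighted_ls(y, X, stratum, lam1, lam2):
    """Solve sum_i w_i x_i (y_i - x_i' b) = 0 with w_i = 1/lam_{s_i}."""
    w = np.where(stratum == 1, 1.0 / lam1, 1.0 / lam2)
    XtWX = (X * w[:, None]).T @ X
    XtWy = (X * w[:, None]).T @ y
    return np.linalg.solve(XtWX, XtWy)

# Noiseless illustration: any positive weighting recovers beta exactly.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.standard_normal(50)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true
stratum = np.where(y > 0, 1, 2)
b_hat = weighted_ls(y, X, stratum, lam1=0.8, lam2=1.3)
```

As the text notes, this estimator is not efficient; it only supplies the consistent starting point needed because the score equations may have multiple roots.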
Condition 1. The kernel function K is bounded, differentiable, and symmetric, such that K(u) ≥ 0, ∫ K(u) du = 1, ∫ u^2 K(u) du < ∞, ∫ [dK(u)/du]^2 du < ∞, and ∫ [d^2 K(u)/du^2]^2 du < ∞.

Condition 2. Let the convergence rates of the kernel bandwidth hn and the trimming parameter bn be hn ∼ n^{-α}, bn ∼ n^{-β} (with α > 0, β > 0). Then β < 1/2 − (5/2)α, β < α/4, 3β < 4α − 1/2, and α < (p − 4)/(p + 4) (where p comes from Assumption 6).

This implies that 1/8 < α < 1/5, and also p > 36/7 (although less restrictive rates could be achieved by using a higher-order bias-reducing kernel).

A.5. Convergence rates

Uniform convergence of kernel estimators. For a kernel-based estimator ã with probability limit a, let ∆ã = ã − a. Under Assumptions 5–6 and Conditions 1–2, the following bounds hold uniformly in β ∈ B and (x, ε) ∈ Wn (see, for example, Lemma B.1 of Ai (1997))9:
∆g̃1(u(β)|β) = Op(hn^2),
∆[dg̃1(u(β)|β)/dβ] = (c + |x|)[op(hn^{-3/2} n^{-1/2+}) + O(hn^2)],
∆G̃(u(β)|β) = Op(hn^2),
∆[dG̃(u(β)|β)/dβ] = (c + |x|) Op(hn^2),
∆[d^2 g̃1(u(β)|β)/dβ dβ′] = (c + |x|^2) op(bn^{-1} hn^{-5/2} n^{-1/2+}),
∆[d^2 G̃(u(β)|β)/dβ dβ′] = (c + |x|^2) op(bn^{-1} hn^{-3/2} n^{-1/2+}),

9 See also Sherman (1994, Theorem 3) (and the paragraph following the proof), which provides a more general result on uniform convergence.

where u(β) = y − xβ. The rates of convergence in Condition 2 have been used to drop non-leading terms, and logarithmic factors have been dropped for simplicity; the notation n^{p+} means any power of n greater than p. As usual, products of terms are dealt with by repeated use of identities of the form ∆(ÃB̃) = A∆B̃ + B∆Ã + ∆Ã·∆B̃. If we write the trimming factor from (3.24) as τ̃*(u) = τ(M̃(u|β, λ)), then to leading order

∆τ̃* ≡ τ(M̃) − τ(M) = τ′(M̄)∆M̃ = O(bn^{-1}) Op(∆g̃1) = Op(bn^{-1} hn^2)   (A.5.1)

∆(∂τ̃*(u)/∂u) = τ′′(M̄)∆M̃·∂M/∂u + τ′(M)∆(∂M̃/∂u) = O(bn^{-2} hn^2) + op(n^{-1/2+} bn^{-1} hn^{-3/2})   (A.5.2)

where τ′(a) = dτ(a)/da and M̄ is between M and M̃. Similarly, second-order expansions yield

∆τ̃* = τ′(M)∆M̃ + O(bn^{-2})(∆M̃)^2   (A.5.3)

∆(M̃^{-1}τ̃*) = [M^{-1}τ′(M) − M^{-2}τ(M)]∆M̃ + O(bn^{-3})(∆M̃)^2.   (A.5.4)

Uniform convergence of integrals. An additional step is needed to verify uniform convergence of the integrals appearing in S̃*(β), because integrability of terms like ∆g̃(u|β) with respect to u does not imply integrability of the corresponding uniform bounds. There is evidently no problem with integrals over a finite interval, as long as the length of the interval (in the present case, |yi|) has finite expected value, but the expressions for R̃, R̃*β and their derivatives involve integration over an unbounded range. The following two results deal with trimmed and untrimmed integrands respectively.

Proposition 3. Suppose that ∫ du |a(u)| < ∞, let ∆ã(u) = ã(u) − a(u), and let τ^(m)(u) = d^m τ(u)/du^m (m = 0, 1). As before, let M̃(u) = g̃1(u)/D̃(u). (The dependence on the parameters θ has been suppressed.) If ∆ã(u|θ) has a uniform bound Op(Bn), then

∫ du ∆[ã(u) τ^(m)(M̃(u))] = Op(bn^{-m-1} Bn).

Proof. First, rewrite the integral as

∫ du ∆ã(u)·τ^(m)(M̃(u)) + ∫ du a(u) ∆τ^(m)(M̃(u)).   (A.5.5)

Substituting M̃(u) = g̃1(u)/D̃(u) and using the result that D̃(u) is bounded below, we can bound the first term in (A.5.5) by

|∫ du ∆ã(u)·τ^(m)(M̃(u))| ≤ sup|∆ã(u)| · sup|D̃(u)^{-1}| ∫ du g̃1(u) M̃(u)^{-1} |τ^(m)(M̃(u))| ≤ sup|∆ã(u)| · sup|D̃(u)^{-1}| · bn^{-m-1}

while the second term in (A.5.5) is bounded by

|∫ du a(u) ∆τ^(m)(M̃(u))| ≤ sup|∆τ^(m)(M̃(u))| ∫ du |a(u)|.

Now ∆τ^(m)(M̃(u)) = τ^(m+1)(M̄(u))·∆M̃(u), where M̄(u) is between M̃(u) and M(u), and ∆M̃(u) = O(∆g̃1(u)) = Op(hn^2), which implies that |∆τ^(m)(M̃(u))| = Op(hn^2 bn^{-m-1}). But ∆ã(u) will generally include a bias term of order hn^2, and therefore the first term dominates (or is of the same order as) the second term. The result then follows.

Proposition 4. (a) Suppose that a(u|θ), ∂a(u|θ)/∂θ and ∂^2 a(u|θ)/∂u^2 are bounded uniformly in θ. If ∆ã(u|θ) has a uniform bound Op(Bn), then

∫ du ∆[g̃1(u|θ) ã(u|θ)] = Op(Bn).

(b) Suppose also that ∂a(u|θ)/∂u, ∂^2 a(u|θ)/∂u ∂θ and ∂^3 a(u|θ)/∂θ^3 are bounded uniformly in θ, and that |ã(u|θ)| ≤ cn(θ). If ∆[∂ã(u|θ)/∂u] has a uniform bound Op(Cn), then ∫ du ∆[g̃1β(u) ã(u)] = Op(Cn).

Proof. Rewrite the integral in part (a) as

∫ du ∆g̃1(u|θ)·a(u|θ) + ∫ du g̃1(u|θ) ∆ã(u|θ).   (A.5.6)

Substituting for ∆g̃1, the first term becomes

(1/(n hn)) Σ_{j=1}^n ∫ du K((u − uj(θ))/hn) a(u|θ) − ∫ du g1(u|θ) a(u|θ)
  = (1/n) Σ_{j=1}^n (a(uj(θ)|θ) − E[a(uj(θ)|θ)]) + (hn^2/2) (1/n) Σ_{j=1}^n ∫ dη η^2 K(η) · ∂^2 a(uj + η h̄n|θ)/∂u^2

where 0 ≤ h̄n ≤ hn. Under the assumptions, the first average converges uniformly at rate n^{-1/2+} to zero, while the second average is uniformly bounded. The first term in (A.5.6) therefore has a uniform bound O(hn^2) + op(n^{-1/2+}), which is in general dominated by Op(Bn). The second term in (A.5.6) is bounded by sup|∆ã(u)| ∫ du g̃1(u) = sup|∆ã(u)|, and the result (a) then follows. Part (b) of the proposition can be proved along the same lines, using integration by parts (which is why we needed ã to be bounded).

A.6. Asymptotic properties of the estimator

The main step is to show the asymptotic equivalence of β̂ and β̄. This follows by a classical argument from the following result
and from Assumption 3. For brevity, it is stated specifically for the case of known Q; when Q is unknown, the Lagrange parameter µ is replaced by q1, and λ is parameterized by (4.4) or (5.6) according to the sampling scheme.

Proposition 5. Let λ = λ(µ) as in (3.14), and let θ′ = (β′, µ). Under Assumptions 1–6 and Conditions 1–2,
(a) n^{-1}{S̃*(β, λ) − S(β, λ)}, R̃(β, λ) − R(β, λ), n^{-1}{∂S̃*(β, λ)/∂θ − ∂S(β, λ)/∂θ}, and ∂R̃(β, λ)/∂θ − ∂R(β, λ)/∂θ converge to zero in probability uniformly in β and λ in a neighborhood of (β0, λ0).10
(b) The term n^{-1/2}(S̃* − S) − (n^{-1} ∂S̃*/∂µ)·(∂R̃/∂µ)^{-1}·n^{1/2}R̃, evaluated at (β0, λ0), converges to zero in probability.

10 Note that ∂S̃*(β, λ)/∂θ is the derivative of the trimmed score function (as opposed to the trimmed derivative of the score function).

While there is some additional complexity due to the auxiliary parameters λ and the subsidiary condition R̃(β, λ) = 0, the derivation is essentially the same as that of Proposition 2 in Cosslett (2004) (which is in turn based on methods developed by Ai (1997) and Ichimura and Lee (1991)), and therefore most details are omitted.

Proof of Proposition 5(a). The term with the slowest rate of convergence is n^{-1}(S̃*ββ − Sββ), where S̃*ββ = ∂S̃*β/∂β′ and similarly for S*ββ and Sββ. Step (i) is to obtain a uniform bound on n^{-1}∆S̃*ββ = n^{-1}(S̃*ββ − S*ββ), using Assumption 5 and the bounds given in Appendix A.5 above; the result is

n^{-1}∆S̃*ββ = Op(hn^{-3} bn^{-1} n^{-1/2+}) + Op(hn^2 bn^{-5/2})   (A.6.1)

(taking into account the condition hn/bn → 0), and this converges to zero under Condition 2. In step (ii), Assumption 4 is used to verify that a uniform WLLN holds for n^{-1}S*ββ and n^{-1}Sββ. Finally, in step (iii), Assumption 4 is again used, together with the bounded convergence theorem for integrals, to show uniform convergence of n^{-1}ES[S*ββ − Sββ] to zero. A similar argument applies to all the other terms appearing in S̃*, R̃, and their derivatives, except that step (ii) is omitted for the non-stochastic terms R*β and R*βθ, while steps (ii) and (iii) are of course omitted for the untrimmed terms.

Proof of Proposition 5(b). Rewrite the term in question as n^{-1/2}(∆(1)S̃* + ∆S*), where n^{-1/2}∆(1)S̃* = n^{-1/2}(S̃* − S*) − (n^{-1}∂S̃*/∂µ)·(∂R̃/∂µ)^{-1}·n^{1/2}R̃ and ∆S* = S* − S. As before, all terms are evaluated at (β0, λ0); note that R(β0, λ0) = 0. Let ∆(1)S̃* = ∆(1,L)S̃* + ∆(1,R)S̃*, where ∆(1,L)S̃* contains terms linear in the kernel approximation errors ∆g̃1, ∆G̃ and their derivatives, while ∆(1,R)S̃* contains terms that are second order or higher; the trimming factors are handled using (A.5.3)–(A.5.4). (L stands for "linear", and R for "remainder".)

Step (i) consists of bounding ∆(1,R)S̃*, again using the results of Appendix A.5 above. Although the second-order expansion is lengthy, it is not difficult to determine the leading-order terms. The result is

n^{-1/2}∆(1,R)S̃* = op(bn^{-2} hn^{1/2} n^{0+}) + Op(bn^{-3} hn^4 n^{1/2})   (A.6.2)

which again converges to zero under Condition 2. In step (ii), we use the standard approach of expressing the linear term as a double sum of the form

n^{-1/2}∆(1,L)S̃* = n^{-3/2} Σ_i Σ_{j≠i} ψn(εi, xi, εj, xj)

where the sum over j comes from the kernel estimates. A straightforward calculation gives ES[ψn^2] = O(hn^{-3} bn^{-2} n^{-1}) + O(hn^4 bn^{-2} n), which is o(n) according to Condition 2, and therefore we can apply the U-statistic projection theorem of Powell et al. (1989) to get

n^{-1/2}∆(1,L)S̃* = n^{1/2}ES[ψn] + 2n^{-1/2} Σ_i {rn(εi, xi) − ES[ψn]} + op(1)   (A.6.3)

where rn(ε, x) = (1/2){ES[ψn(ε, x, εj, xj)|ε, x] + ES[ψn(εi, xi, ε, x)|ε, x]}. The first of these conditional expectations has the bound O(bn^{-2} hn^2), coming from the bias in the kernel estimators. The key result, however, is that ES[ψn(εi, xi, εj, xj)|εj, xj] = 0, and therefore also ES[ψn] = 0; this result also holds in other applications of the smoothed likelihood approach (Cosslett, 2004, 2007), and is the payoff for adopting the present trimming scheme.11 The right hand side of (A.6.3) now reduces to n^{-1/2} times a sum of independent terms with mean zero and bounded by O(bn^{-2} hn^2). Application of a central limit theorem then gives n^{-1/2}∆(1,L)S̃* = Op(bn^{-2} hn^2), which tends to zero by Condition 2.

11 An equivalent result is that the present trimming scheme preserves the orthogonality condition (3.12) in Newey (1994).

Finally, step (iii) shows asymptotic equivalence of n^{-1/2}S* and n^{-1/2}S. Let s*i and si denote the individual terms in the trimmed and untrimmed score functions. An essential feature of the trimming scheme is that ES[s*i(β0, λ0)] = 0, which can be verified by explicit calculation, and therefore ∆S* is a sum of i.i.d. terms with mean zero. It follows that a sufficient condition is mean square convergence, ES[(s*i − si)^2] → 0. This can be shown from the bounds in Assumption 4, using pointwise convergence 1 − τ(M(u)) → 0 and the bounded convergence theorem. Therefore, ∆S* also converges to zero in probability.

The final result in this section concerns the asymptotic variance of the second-stage estimator q̂1.

Proposition 6. Let q̂1 = q1(β̂) be the solution of R̃*(β̂, λ(q1)) = 0, where R̃* is given by (4.11) and λ(q1) comes from (4.4) or (5.6), depending on the sample design. Increase the convergence rate of the kernel bandwidth in R̃* (but not elsewhere) to hn ∼ n^{-γ} with 1/4 < γ < 1/2. Then

n^{1/2}(q̂1 − Q1) = Rq^{-1} [−R′β n^{1/2}(β̂ − β0) + C n^{-1/2}(Sq − E[Sq])] + O(hn^2 n^{1/2}) + op(1)

(where all terms are evaluated at (β0, λ0)), with C = p1p2(Q1/p1 − Q2/p2)^2 for standard stratified sampling and C = π̄^2/(π1π2) for variable probability sampling.

Proof. The relevant uniform convergence bounds are ∆D̃ = O(hn^2) + op(n^{-1/2+}) and ∆(dD̃/dβ) = (c + |x|)[O(hn^2) + op(n^{-1/2+} hn^{-1/2})], from which it follows that R̃*β and R̃*q converge uniformly in probability to Rβ and Rq. Then a linear expansion of R̃*(β̂, λ(q1)) yields √n(q̂1 − Q1) = −(R′β/Rq)√n(β̂ − β0) + √n R̃*/Rq + op(1). Next, a linear expansion of R̃*(β0, λ0) in ∆D̃ leads to

R̃*(β0, λ0) = (λ2 − λ1) (1/n) Σ_{i=1}^n [1/D(εi) − ∆D̃(εi)/D(εi)^2] − log(λ2/λ1) + O(hn^4) + op(n^{-1+}).

Write the first term as a double sum, and apply the projection theorem for U-statistics as before. Comparing the result with the relevant expression for Sq, we find

R̃*(β0, λ0) = C(Sq − ES[Sq]) + O(hn^2) + op(n^{-1/2})

and the result follows.

Corollary 1. Direct calculation shows that cov(S, Sq) = 0, so n^{1/2}(β̂ − β0) and n^{-1/2}Sq are asymptotically independent. It follows that

avar(q̂1) = (1/Rq^2) { R′β (varS[n^{-1/2}S])^{-1} Rβ + C^2 varS[n^{-1/2}Sq] }   (A.6.4)
with C defined as in the statement of Proposition 6.

A.7. Identification

Nonsingularity of the asymptotic information matrix, which was part of Assumption 3, implies that β is locally identified. However, since the estimating equations may have multiple roots, a preliminary consistent estimator is needed. If Q is unknown, this involves a consistent estimator of Q, so the following result can be useful in the case of standard stratified sampling. (In the case of variable probability sampling, Q can be estimated directly from n1 and n2, as noted in Section 5.)
Proposition 7. Let h0 ( · ) be the marginal density of xβ0 , and suppose f and h0 are continuous. Then Q is nonparametrically identified if f (ε)h0 (−ε) > 0 on a set with positive measure. This means that Q is identified without any parametric assumptions for f or h. A sufficient condition is that the boundary between the strata (in Rk+1 ) passes through an open region in which the population density is strictly positive.
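As a concrete illustration of this identification result, the kernel construction used in the proof (estimate the density of y at the stratum boundary separately from each stratum, then combine the two estimates) can be sketched in code. This is a hypothetical example, not the paper's implementation: the population design (normal errors, α chosen so that Q1 = 0.9), the bandwidth, and the sample sizes are our assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: y = alpha + x + eps with x, eps ~ N(0, 1), so y ~ N(alpha, 2).
# alpha is set so that Q1 = P(y > 0) = 0.9 (z_{0.90} ~= 1.2816).
alpha = np.sqrt(2.0) * 1.2816

def draw_stratum(n, upper, rng):
    """Draw n observations of y from the stratum {y > 0} (upper=True) or {y < 0}."""
    out = np.empty(0)
    while out.size < n:
        y = alpha + rng.standard_normal(4 * n) + rng.standard_normal(4 * n)
        out = np.concatenate([out, y[y > 0] if upper else y[y < 0]])
    return out[:n]

def K(u):
    # conventional symmetric kernel (standard normal)
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

n1 = n2 = 20000
h = 0.05

y1 = draw_stratum(n1, True, rng)
y2 = draw_stratum(n2, False, rng)

# Boundary kernel density estimates of y at the threshold from each stratum;
# plim gamma_j = f_y(0) / (2 Q_j), where f_y is the population density of y.
gamma1 = K(y1 / h).mean() / h
gamma2 = K(y2 / h).mean() / h

# The Q_j factors cancel in the ratio, so Q1_hat is consistent for Q1 = 0.9
# (up to the usual O(h) boundary bias).
Q1_hat = gamma2 / (gamma1 + gamma2)
```

The oversampled minority stratum contributes the larger boundary density estimate (γ̂2 here), which is what makes the ratio informative about the population shares.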
Proof. The two strata have the observable probability densities
$$p_1(x, y) = Q_1^{-1} f(y - x\beta_0)\, h(x)\, 1(y \geq 0),$$
$$p_2(x, y) = Q_2^{-1} f(y - x\beta_0)\, h(x)\, 1(y < 0).$$
Because f is continuous, the marginal densities $p_{1,y}(0)$ and $p_{2,y}(0)$ can be estimated consistently from each stratum, for example using the kernel estimators
$$\hat{\gamma}_1 = \frac{1}{n_1 h_n} \sum_i 1(y_i > 0)\, K\!\left(\frac{y_i}{h_n}\right), \qquad \hat{\gamma}_2 = \frac{1}{n_2 h_n} \sum_i 1(y_i < 0)\, K\!\left(\frac{y_i}{h_n}\right),$$
where K is a conventional symmetric kernel function, $h_n \to 0$, and $n h_n \to \infty$. Allowing for the fact that the density is estimated on the boundary, $\operatorname{plim} \hat{\gamma}_j = (2 Q_j)^{-1} E[f(-x\beta_0)]$, and therefore $\hat{Q}_1 = \hat{\gamma}_2 / (\hat{\gamma}_1 + \hat{\gamma}_2)$ is a consistent estimator of $Q_1$.

A.8. Asymptotic efficiency bounds

Semiparametric efficiency bounds can be evaluated explicitly for the cases considered here, using standard methods;12 by comparing these with the corresponding asymptotic variances, one can explicitly verify that the smoothed maximum likelihood estimators are asymptotically efficient. (Also, comparison with the relevant bounds for the case where f is a known parametric function shows that adaptive estimation is not possible here, i.e., there is necessarily a loss of efficiency when the error distribution is unknown, in contrast with the case of a regression model under random sampling.)

12 See Begun et al. (1983), or an alternative method proposed by Severini and Tripathi (2001). Some of the bounds for endogenously stratified regression were presented previously in Cosslett (1985).

In the following, set $\lambda_j$ (j = 1, 2) equal to its limiting value, $\lambda_j = p_j/Q_j$ for standard stratified sampling or $\lambda_j = \pi_j/\bar{\pi}$ for variable probability sampling. Let $H_0$ be the marginal distribution function of $x\beta_0$. Let $\bar{F}_0(u)$, $\bar{H}_0(u)$, and $\bar{E}[\,\cdot\,|u]$ be defined as in Appendix A.3. Define the following integrals:
$$A = \int du\, f(u) \left[ \frac{d}{du} \log f(u) \right]^2 \bar{F}_0(u) \left\{ \bar{E}[xx'|u] - \bar{E}[x|u]\,\bar{E}[x'|u] \right\},$$
$$B = \int du\, f(u) \left[ \frac{d}{du} \log f(u) \right] \frac{f(u)\,\bar{F}_0(u)}{\bar{H}_0(-u)}\, \bar{E}[x|u],$$
$$D_V = \int du\, \frac{f(u)^2\, \bar{F}_0(u)^2}{\bar{H}_0(-u)},$$
$$D_S = (\lambda_1 \lambda_2)^{-2} (\lambda_1^2 Q_1 + \lambda_2^2 Q_2)^{-1} - D_V, \qquad D_Q = D_V - (\lambda_1 \lambda_2)^{-1} (\lambda_1^2 Q_1 + \lambda_2^2 Q_2)^{-1}$$
(where $D_S > 0$ and $D_Q > 0$). Then the asymptotic variance bounds for $(\hat{\beta}, \hat{q}_1)$, in a self-explanatory notation, are:

(a) Known Q:
$$V_{\beta\beta} = (A + BB'/D_Q)^{-1}. \quad (A.8.1)$$

(b) Standard stratified sampling, unknown Q:
$$V_{\beta\beta} = (A - BB'/D_S)^{-1},$$
$$V_{\beta Q} = -\frac{Q_1 Q_2}{\lambda_1 \lambda_2 (\lambda_2 - \lambda_1)}\, (A - BB'/D_S)^{-1} B / D_S,$$
$$V_{QQ} = \frac{Q_1^2 Q_2^2\, (\lambda_1^2 Q_1 + \lambda_2^2 Q_2)}{(\lambda_1 \lambda_2)^2 (\lambda_2 - \lambda_1)^2} \cdot \frac{1}{D_S - B' A^{-1} B}. \quad (A.8.2)$$

(c) Variable probability sampling, unknown Q:
$$V_{\beta\beta} = (A + BB'/D_V)^{-1},$$
$$V_{\beta Q} = -\frac{Q_1 Q_2}{\lambda_1 \lambda_2 (\lambda_2 - \lambda_1)}\, (A + BB'/D_V)^{-1} B / D_V,$$
$$V_{QQ} = \frac{Q_1^2 Q_2^2\, (\lambda_1^2 Q_1 + \lambda_2^2 Q_2)}{(\lambda_1 \lambda_2)^2 (\lambda_2 - \lambda_1)^2} \cdot \frac{1}{D_V - B' A^{-1} B}. \quad (A.8.3)$$

In the special case $\lambda_j = 1$ (i.e., the sampling proportions are the same as the population proportions), the terms B, $D_S$ and $D_Q$ are zero, but the correct expressions are obtained by taking the limit $p_1 - Q_1 \to 0$ in the usual way.

Appendix B. Simulation results
Two sets of results are presented. First, the asymptotic efficiency bounds in the previous section are evaluated in a simple case, and compared with the asymptotic variance of the conventional inverse probability weighted (IPW) least-squares estimator. This comparison shows the theoretical gain from efficient estimation, and also the potential loss when Q is unknown. Second, a small-scale simulation study, again in a particular case, illustrates how finite-sample properties of the three estimators compare with the asymptotic results.

The population model is y = α + βx + u with x ∼ N(0, 1) and β0 = 1. Three error distributions are considered, which have been frequently used in simulation studies: (i) standard normal, (ii) t-distribution with 3 degrees of freedom, scaled to unit variance, and (iii) a ''contaminated'' normal mixture, N(0, 1/9) with probability 0.9 and N(0, 9) with probability 0.1. For each error distribution, α was adjusted to give population shares Q1 = 0.9, Q2 = 0.1, while the stratified samples had p1 = p2 = 0.5. Data were generated by both standard stratified sampling and variable probability sampling.

Table 1 compares the asymptotic variances of β̂ for the different estimators. At least for this specific sample design, the main qualitative features are that (i) there are substantial efficiency gains from semiparametric maximum likelihood estimation (MLE), with a nontrivial gain even for normal errors, and (ii) somewhat surprisingly, the efficiency loss when Q is unknown is relatively small.

Simulations were carried out for sample sizes 100, 250 and 500, using a standard normal kernel. Values of the smoothing parameter were generated by the ad hoc rule hn = r n^{−α}, where r is a preliminary estimate of the interquartile range of the error distribution and α = 0.2.13 There seem to be currently no good guidelines for the choice of trimming parameter, but as found in other studies, results are generally insensitive to the value of bn.
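The two sampling schemes in this design can be sketched as follows. This is an illustrative data-generation sketch for the standard normal error case, not the authors' code; the retention probabilities π1 = 0.1 and π2 = 0.9 (chosen so that the two strata are roughly balanced, since the text does not report the πj used) and the function names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# alpha chosen so that Q1 = P(y > 0) ~= 0.9 under standard normal errors
alpha = np.sqrt(2.0) * 1.2816

def draw_population(n, rng):
    """Random sample from the population model y = alpha + x + u, beta0 = 1."""
    x = rng.standard_normal(n)
    y = alpha + x + rng.standard_normal(n)
    return x, y

def standard_stratified(n, p1, rng):
    """Fix stratum sizes in advance: n1 = p1*n from {y > 0}, the rest from {y < 0}."""
    n1 = int(p1 * n)
    xs, ys = [], []
    need = {True: n1, False: n - n1}
    while need[True] > 0 or need[False] > 0:
        x, y = draw_population(10 * n, rng)
        for upper in (True, False):
            m = (y > 0) if upper else (y < 0)
            k = min(need[upper], int(m.sum()))
            xs.append(x[m][:k])
            ys.append(y[m][:k])
            need[upper] -= k
    return np.concatenate(xs), np.concatenate(ys)

def variable_probability(n, pi1, pi2, rng):
    """Draw from the population; retain y > 0 with prob pi1 and y < 0 with prob pi2."""
    x, y = draw_population(n, rng)
    u = rng.random(n)
    keep = np.where(y > 0, u < pi1, u < pi2)
    return x[keep], y[keep]

x_s, y_s = standard_stratified(500, 0.5, rng)
x_v, y_v = variable_probability(5000, 0.1, 0.9, rng)  # oversample the small stratum
```

Under standard stratified sampling the stratum sizes are fixed by design; under variable probability sampling they are random, which is why Q can be estimated directly from n1 and n2 in that case.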
Results are reported here for bn = 0.001. In the small number
13 This is not strictly compatible with the condition α < 0.2, but for these sample sizes the difference between n−0.2 and, say, n−0.18 is negligible.
Table 1
Asymptotic variances.

Error type       Standard sampling,  Standard sampling,  Variable probability,  IPW,
                 Q known             Q unknown           Q unknown              Q known
N(0, 1)          0.980               1.037               0.994                  1.352
t(3)             0.472               0.510               0.480                  1.226
Normal mixture   0.113               0.118               0.114                  1.114
Table 2
Simulation results.

                     Standard sampling,        Standard sampling,        Variable probability,
                     known Q                   unknown Q                 unknown Q
Error    Sample      Efficiency  90% interval  Efficiency  90% interval  Efficiency  90% interval
type(a)  size        ratio       ratio         ratio       ratio         ratio       ratio
1        100         0.89        1.12          0.90        1.13          0.91        1.06
1        250         0.92        1.07          0.93        1.08          0.89        1.15
1        500         0.93        1.05          0.93        1.07          0.90        1.11
2        100         0.91        1.08          0.91        1.13          0.81        1.24
2        250         0.97        1.01          0.97        1.02          0.94        1.06
2        500         0.99        0.98          0.98        1.02          0.95        1.01
3        100         0.85        1.03          0.85        1.03          0.85        1.04
3        250         0.87        1.07          0.86        1.07          0.88        1.05
3        500         0.86        1.06          0.87        1.06          0.89        1.03

(a) Error types: 1: standard normal; 2: t(3); 3: normal mixture (see text).
of simulations where no solution was found for the trimmed score equation, βˆ was computed by minimizing the square of the trimmed score function. Initial consistent estimates of β were obtained from the IPW estimator. (Simulation results for the IPW estimator with known Q are not shown—they are very close to the asymptotic approximation.) With unknown population shares we need a consistent initial estimator of Q . For variable probability sampling, this is Q˜ 1 = (n1 /π1 )/ (n1 /π1 + n2 /π2 ). For standard stratified sampling, the nonparametric estimator used in Proposition 7 to show identifiability is available. A simpler method, however, is to add one or more additional weighted moment equations corresponding to moments nonlinear in x, such as E [x2 ε] = 0, and treat Q1 as an additional parameter. (In principle, some error distributions may result in a singular covariance matrix, but in practice no problems seem to arise.) Table 2 compares the finite-sample performance of the estimators relative to their asymptotic properties. The ‘‘efficiency ratio’’ is defined as the ratio of the asymptotic standard deviation of βˆ to the root mean square error (scaled by n−1/2 ). The contribution of the bias to the root mean square error is generally small, and is not reported separately. While variances or root mean square errors are typically reported in simulation studies, they do not necessarily reflect the reliability of asymptotic statistics (for example, the finite-sample variance might not even exist). To provide some information about the shape of the distribution, the ‘‘90% interval ratio’’ is the central 90% range of the estimates, scaled by the corresponding interval for the asymptotic normal approximation. The estimators perform reasonably well for these sample sizes; the slowest approach to the asymptotic approximation appears to be for standard stratified sampling with errors generated by the normal mixture distribution. 
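The IPW initial estimator and the direct estimator of Q under variable probability sampling can be sketched as follows. This is a minimal illustration under assumed design values (Q1 = 0.9, p1 = 0.5, normal errors); the function names are ours, not the paper's.

```python
import numpy as np

def ipw_ols(x, y, Q1, p1):
    """Weighted least squares of y on (1, x), with stratum weights Q_j / p_j
    (inversely proportional to the stratum sampling rates lambda_j = p_j / Q_j)."""
    w = np.where(y > 0, Q1 / p1, (1.0 - Q1) / (1.0 - p1))
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w                      # scale each observation's column by its weight
    return np.linalg.solve(XtW @ X, XtW @ y)   # (alpha_hat, beta_hat)

def q1_tilde(n1, n2, pi1, pi2):
    """Initial consistent estimator of Q1 under variable probability sampling."""
    return (n1 / pi1) / (n1 / pi1 + n2 / pi2)

# Illustration on a synthetic standard stratified sample (assumed design):
rng = np.random.default_rng(3)
x = rng.standard_normal(200000)
y = 1.8124 + x + rng.standard_normal(200000)   # alpha set so P(y > 0) ~= 0.9
up = np.where(y > 0)[0][:1500]
dn = np.where(y < 0)[0][:1500]
idx = np.concatenate([up, dn])
ab = ipw_ols(x[idx], y[idx], Q1=0.9, p1=0.5)   # ab[1] should be close to beta0 = 1
```

With λj = 1 (sampling proportions equal to population shares) the weights are constant and the IPW estimator reduces to ordinary least squares.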
Even with normal errors, the semiparametric MLE with known Q outperforms the IPW estimator: the ratio of their asymptotic standard errors is about 0.85, which falls below the finite-sample efficiency ratios observed for the MLE.
References

Ai, C., 1997. A semiparametric maximum likelihood estimator. Econometrica 65, 933–963.
Begun, J.M., Hall, W.J., Huang, W.-M., Wellner, J.A., 1983. Information and asymptotic efficiency in parametric-nonparametric models. Annals of Statistics 11, 432–452.
Cosslett, S.R., 1981. Maximum likelihood estimator for choice-based samples. Econometrica 49, 1289–1316.
Cosslett, S.R., 1985. Efficiency bounds for distribution-free estimators from endogenously stratified samples. Paper presented at the 1985 World Congress of the Econometric Society.
Cosslett, S.R., 1993. Estimation from endogenously stratified samples. In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics, Vol. 11: Econometrics. North-Holland, Amsterdam, pp. 1–43.
Cosslett, S.R., 2004. Efficient semiparametric estimation of censored and truncated regressions via a smoothed self-consistency equation. Econometrica 72, 1277–1293.
Cosslett, S.R., 2007. Efficient estimation of semiparametric models by smoothed maximum likelihood. International Economic Review 48, 1245–1272.
Hausman, J.A., Wise, D.A., 1981. Stratification on endogenous variables and estimation: the Gary income maintenance experiment. In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA, pp. 365–391.
Holt, D., Smith, T.M.F., Winter, P.D., 1980. Regression analysis of data from complex surveys. Journal of the Royal Statistical Society Series A 143, 474–487.
Ichimura, H., Lee, L.-F., 1991. Semiparametric least squares estimation of multiple index models: single equation estimation. In: Barnett, W.A., Powell, J., Tauchen, G.E. (Eds.), Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press, Cambridge, pp. 3–49.
Jewell, N.P., 1985. Least squares regression with data arising from stratified samples of the dependent variables. Biometrika 72, 11–21.
Kim, C.-K., Lai, T.L., 2000. Efficient score estimation and adaptive M-estimators in censored and truncated regression models. Statistica Sinica 10, 731–749.
Nathan, G., Holt, D., 1980. The effect of survey design on regression analysis. Journal of the Royal Statistical Society Series B 42, 377–386.
Newey, W.K., 1994. The asymptotic variance of semiparametric estimators. Econometrica 62, 1349–1382.
Powell, J.L., Stock, J.H., Stoker, T.M., 1989. Semiparametric estimation of index coefficients. Econometrica 57, 1403–1430.
Severini, T.A., Tripathi, G., 2001. A simplified approach to computing efficiency bounds in semiparametric models. Journal of Econometrics 102, 23–66.
Sherman, R.P., 1994. U-processes in the analysis of a generalized semiparametric regression estimator. Econometric Theory 10, 372–395.
Vardi, Y., 1985. Empirical distributions in selection bias models. Annals of Statistics 13, 178–203.