Robust estimation with many instruments

Robust estimation with many instruments

Journal of Econometrics xxx (xxxx) xxx Contents lists available at ScienceDirect Journal of Econometrics journal homepage: www.elsevier.com/locate/j...

608KB Sizes 0 Downloads 54 Views

Journal of Econometrics xxx (xxxx) xxx

Contents lists available at ScienceDirect

Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom

Robust estimation with many instruments Mikkel Sølvsten Department of Economics, University of Wisconsin, 1180 Observatory Drive, Madison, WI 53706, United States

article

info

Article history: Received 14 November 2017 Received in revised form 6 October 2018 Accepted 6 April 2019 Available online xxxx JEL classification: C26 C36 Keywords: Instrumental variables Generalized method of moments Minimax estimation Stein’s method

a b s t r a c t Linear instrumental variables models are widely used in empirical work, but often associated with low estimator precision. This paper proposes an estimator that is robust to outliers and shows that the estimator is minimax optimal in a class of estimators that includes the limited maximum likelihood estimator (LIML). Intuitively, this optimal robust estimator combines LIML with Winsorization of the structural residuals and the Winsorization leads to improved precision under thick-tailed error distributions. Consistency and asymptotic normality of the estimator are established under many instruments asymptotics and a consistent variance estimator which allows for asymptotically valid inference is provided. © 2019 Elsevier B.V. All rights reserved.

1. Introduction The focus of this paper is robust estimation of the structural coefficient in a linear instrumental variables (IV) model with many instruments. Such models have received considerable attention, both because the use of many instruments can lead to efficiency gains (see, e.g., Angrist and Krueger, 1991) and because a flurry of empirical applications are naturally characterized by having a large number of instruments (see e.g., Kling, 2006; Doyle, 2007, 2008; Chang and Schoar, 2008; Autor and Houseman, 2010; Maestas et al., 2013; French and Song, 2014; Aizer and Doyle, 2015; Dobbie and Song, 2015; Jacobsen and Van Benthem, 2015). In models with multiple instruments, it is well understood that inference based on the commonly employed two-stage least squares estimator (2SLS) and standard asymptotics may give misleading confidence intervals. Inference based on LIML and many instruments asymptotics – where the number of instruments and the sample size are allowed to be proportional – tends to correct this problem (Kunitomo, 1980; Morimune, 1983; Bekker, 1994). A further motivation of LIML is that it is an efficient estimator when the errors of the model are jointly normal (Chioda and Jansson, 2009). However, presence of gross outliers among the errors is common in empirical applications,1 and such thick-tailed error distributions can render LIML inefficient or even inconsistent in extreme cases. As an illustration, this paper takes the model and data from Angrist and Krueger (1991) and documents that the structural errors are distributed roughly like a normal distribution contaminated with gross outliers. This example highlights the need to understand if alternatives to LIML can be more efficient under thick-tailed deviations from normal errors. The main contribution of this paper is to propose an optimal robust estimator of the structural coefficient in a linear IV model. Intuitively, this estimator combines LIML with Winsorization of the structural residuals, and the Winsorization E-mail address: [email protected]. 1 In 2015–2016 the American Economic Review published 10 articles where the authors found outliers to be enough of a concern that they reported on this and Winsorized the outcome variable. One of these, Dobbie and Song (2015), is specifically a setting with many instruments. Winsorization of the outcome variable, as opposed to the residuals, leads to inconsistency, but is nevertheless common in applied research. https://doi.org/10.1016/j.jeconom.2019.04.040 0304-4076/© 2019 Elsevier B.V. All rights reserved.

Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

2

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

bounds the influence of outliers among the errors. This robustness property makes the estimator more efficient than LIML when the structural errors have thick tails, and the efficiency gain is approximately 80 percent in the empirical example, in the sense that LIML would need an 80 percent larger sample to achieve the same level of precision. The proposed robust estimator is shown to be minimax optimal. This optimality result is derived in three steps. First, the paper proposes a new class of generalized method of moments estimators (GMM) for the linear IV model. Two particular members of this class correspond to LIML and the optimal robust estimator. The paper then shows that each estimator of the structural coefficient is consistent and asymptotically normal at the usual parametric rate under many instruments asymptotics. Second, the article characterizes an optimal estimator within the class which minimizes asymptotic variance when the shape of the joint error distribution is known. For example, if the joint error distribution is normal, then the optimal estimator is LIML. Finally, the paper uses this optimality result to show that the optimal robust estimator is the member of the class which minimizes the maximal asymptotic variance over a neighborhood of contaminated normal distributions, i.e., over mixtures between the standard normal distribution (with high probability) and some unknown contaminating distribution (with low probability). This approach mirrors the one taken by Huber (1964, 1973, 1981) in the context of the classical robust regression model, and the optimal robust estimator derived here treats the structural residuals of the IV model the same way as Huber’s minimax estimator treats the residuals of the regression model. These contributions add to a growing literature on many instruments asymptotics that started with Kunitomo (1980) and Morimune (1983), who derived asymptotic variances for LIML that are larger than the usual IV formulas and depend on the number of instruments. Bekker (1994) provided consistent estimators of these larger variances under normal errors, and Hansen et al. (2008) extended the variance formulas and estimators to allow for nonnormal errors.2 This paper expands the class of asymptotically normal estimators to include robust alternatives to LIML and provides formulas for their asymptotic variances that are natural extensions of the existing formulas. In addition, this paper provides consistent variance estimators inspired by the GMM setup. However, these variance estimators differ from the classical ‘‘sandwich’’ type estimators used with GMM (Newey and McFadden, 1994, section 4) and are also new in the well-studied special case of LIML. This paper also adds to the literature on efficiency and robust estimation in the linear IV model. Anderson et al. (2010) showed optimality of LIML among estimators that are functions of the sufficient statistics from the normal model. Under normality of the errors, Chioda and Jansson (2009) showed optimality of LIML among estimators that are invariant to rotations of those sufficient statistics. The optimality results of this paper are complementary to the existing literature, as they imply optimality of LIML under normal errors, but for a different class of estimators than previously considered. More importantly, they also bring a new perspective to these results by presenting estimators that are robust and more efficient than LIML under nonnormal errors and many instruments. 
In models with a fixed number of instruments there exist multiple examples of such robust estimators or estimators that can be more efficient than LIML under nonnormal errors. Examples are the two-stage least absolute deviations estimator (Amemiya, 1982; Powell, 1983), the resistant estimator of Krasker and Welsch (1985), the two-stage quantiles and two-stage trimmed least squares estimators (Chen and Portnoy, 1996), the IV quantile regression estimator (Chernozhukov and Hansen, 2006), the robust estimators of Honoré and Hu (2004), the nonlinear IV estimators of Hansen et al. (2010), and the adaptive estimator of Cattaneo et al. (2012). However, I am unaware of papers that establish consistency or asymptotic normality of these estimators under many instruments asymptotics. Finally, this paper makes an additional contribution of potentially independent interest. The contribution is to give high-level conditions for asymptotic normality of a single element of a GMM estimator with dimension proportional to the sample size. Following Huber (1967), there have been numerous papers giving high-level conditions in GMM setups. See, e.g., Hansen (1982), Pakes and Pollard (1989), Andrews (1994), Newey (1994), Newey and McFadden (1994), Ai and Chen (2003), Chen et al. (2003), Chen (2007) and Newey and Windmeijer (2009). These papers cover cases of smooth and non-smooth objective functions and parametric and semi-parametric estimators, but all of them rely on an intermediate result of consistency of the estimator for some pseudo-true, non-random value. In contrast, this paper allows for the estimator to have a random, sample-dependent ‘‘limit’’. This is a necessary extension, as the reduced form parameters in the linear IV model do not settle down around some non-random value under many instruments asymptotics. This paper presents results for smooth and non-smooth objective functions, and verifies the high-level conditions for examples that are differentiable or Lipschitz continuous. The next section defines the model, describes the class of estimators, and presents the associated variance estimators. Section 3 gives high-level conditions for consistency and asymptotic normality, and Section 4.1 verifies these for each estimator in the class. Section 4.2 shows consistency of the variance estimators, and 4.3 derives the minimax property of the optimal robust estimator. Sections 5 and 6 present simulation results and the empirical example provided by Angrist and Krueger (1991). Proofs are in a supplemental appendix (SA). 2 Some additional papers that consider estimation and inference with many or many weak instruments are Hahn (2002), Hahn and Hausman (2002), Chamberlain and Imbens (2004), Chao and Swanson (2005), Chao et al. (2012), Hausman et al. (2012), Hansen and Kozbur (2014), Kolesár (2018) and Wang and Kaffo (2016). See also Newey (1990), Belloni et al. (2012) and Kolesár (2013) for estimation in related models. Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

3

Notation



For a vector v , ∥v∥ = v ′ v denotes the Euclidean norm. λmin (A) and λmax (A) are the smallest and largest eigenvalues for a symmetric matrix A. For an arbitrary matrix A, ∥A∥ = λmax (A′ A)1/2 gives the largest singular value of A, and σmin (A) = λmin (A′ A)1/2 returns the smallest singular value of A. For any absolutely continuous function f : R → R, let f ′ be the derivative of f where it exists and zero otherwise. Let {ani }i,n be shorthand for a triangular array {ani : i ∈ {1, . . . , n}, n ∈ N}, and let {ani }i be shorthand for a row of that array {ani : i ∈ {1, . . . , n}}. The distribution function of the standard normal distribution is denoted Φ . Limits are considered as n → ∞ unless otherwise noted. 2. Model and estimators Consider a linear IV model with two endogenous and kn instrumental variables. The model consists of a structural and a reduced form equation given by yin = xin β0 + wi′ δ0 + εi

(1a)

′ xin = zin π0n + wi′ η0 + ui

(i = 1, . . . , n)

(1b)

where the unobserved stochastic errors εi , ui ∈ R are potentially dependent so that both yin and xin are endogenous. wi ∈ RG denotes a vector of exogenous variables that includes an intercept and zin ∈ Rkn is a vector of instruments. ′ , δ0′ , η0′ )′ ∈ Rkn +2G serves as a nuisance parameter. The model involves The parameter of interest is β0 ∈ R while (π0n potentially many instruments where kn

→ α ∈ [0, 1) n as the sample size n grows large. In particular, this includes standard asymptotics where kn does not depend on n. To simplify notation, drop the subscript n on all variables and let the matrices Z and W denote the stacked observations of zi′ and wi′ , respectively. The error vectors of the model (εi , ui )′ are assumed i.i.d. and independent of the instruments and exogenous variables. This paper explores this independence assumption by considering moment conditions that are nonlinear in the structural errors. Such nonlinearity leads to potential efficiency gains when compared to LIML, but these gains come at the cost of relying upon estimators that are potentially inconsistent when the errors are instead mean independent of (Z , W ). Under mean independence, even LIML may be inconsistent whereas alternatives based on leave-one-out estimation can be consistent (see, e.g., Hausman et al., 2012). Additionally, assume that the structural error εi has a bounded and absolutely continuous density f with nonzero and finite Fisher information, If = E[(f ′ /f )2 (εi )], the reduced form error ui has finite variance, and the intercept in (1b) is normalized so that E[ui ] = 0. This paper focuses on robustness with respect to the structural error εi , which is why the reduced form error ui is assumed to have a finite second moment. An alternative focus on robustness with respect to ui might also be of interest, but is potentially less relevant. First, xi has bounded support in many applications so that outliers potentially are less of a concern. Second, the impact of ui on the asymptotic variance tends to be smaller than that of εi as detailed in the asymptotic variance formula of Section 4. Finally, with high dimension of the instruments, robust estimation of the reduced form Eq. (1b) may not yield substantial efficiency gains as uncovered in Bean et al. (2013) for the robust regression model. The matrix of instruments, (Z , W ), has full rank and satisfies that n −1 Z ′ Z = Ik and n −1 Z ′ W = 0. This can always be achieved by removing redundant columns in (Z , W ) and replacing Z with the rescaled residual Z˜ = Z¯ (n −1 Z¯ ′ Z¯ )−1/2 where Z¯ = (IG − W (W ′ W ) −1 W ′ )Z . Finally, the instruments are assumed to be jointly strong in the sense that the conditional covariance between xi and zi′ π0 is bounded away from zero, i.e., σxz =

1 n

∑ [

E xi zi′ π0 | Z = 1n π0′ Z ′ Z π0 = ∥π0 ∥2 > c + op (1) for some c > 0.

]

i

This accommodates that each individual instrument is weak in the sense of Staiger and Stock (1997). To simplify the exposition, this paper presents the class of estimators when the coefficients on the exogenous regressors (δ0 , η0 ) are known and normalized to equal zero. The online appendix B1.2 treats the extension where (δ0 , η0 ) is unknown. When the dimension of the exogenous regressors G is fixed, this extension involves no substantial complications. While it may be of interest to let G grow with sample size, such a framework is beyond the scope of the current paper as discussed further in Section 4.2. 2.1. Class of estimators The class of estimators is indexed by two Lipschitz continuous functions φ and ψ , and the estimators take values in a parameter space Θn ⊂ Rk+2 whose dimension can be proportional to the sample size. Each estimand is a vector θ0 = (β0 , γ0 , π0′ )′ where the additional nuisance parameter γ0 is defined as

γ0 =

E [φ (εi )ui ] E [φ (εi )ψ (εi )]

,

Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

4

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

i.e., γ0 makes φ (εi ) uncorrelated with ui − ψ (εi )γ0 . The functions φ and ψ , the parameter space Θn , and the nuisance ˆ γˆ , πˆ ′ )′ is an parameter γ0 will be discussed further after the definition of the estimators. Each estimator θˆ = (β, approximate minimizer of an objective function, i.e.,

( ) ∥mn (θˆ )∥ ≤ inf ∥mn (θ )∥ + op n−1/2 , θ ∈Θn

where mn (θ ) =

1 n





i

mni (θ ),

zi π φi (β )





mni (θ ) = ⎝ φi (β ) (xi − ψi (β )γ )



zi xi − ψi (β )γ − zi′ π

(

⎟ ⎠, )

φi (β ) = φ (εi (β )), ψi (β ) = ψ (εi (β )), and εi (β ) = yi − xi β . Each estimator θˆ , nuisance parameter γ0 , objective function mn , and asymptotic variance estimator (defined below) is indexed by φ and ψ . For simplicity, this is kept implicit in the notation. To motivate mn as a moment function, it is useful to relate it to LIML. A standard way to define LIML is as an extremum estimator that minimizes a variance ratio

βˆ LIML = arg min β∈R

ε (β )′ P ε (β ) ε (β ) ′ ε (β )

where ε (β ) denotes the stacked observations of εi (β ) and P = n −1 ZZ ′ is the projection onto Z . From this definition ′ it follows that θˆLIML = (βˆ LIML , γˆLIML , πˆ LIML )′ solves the nonlinear first order conditions (see SA A4 for a derivation and definition of the limited information maximum likelihood estimators γˆLIML and πˆ LIML )

⎛ ′ ⎞ z π ε (β ) ∑⎜ i i ⎟ 1 ⎝ εi (β ) (xi − εi (β )γ ) ⎠ = 0. n ( ) i zi′ xi − εi (β )γ − zi′ π

(2)

The left hand side is mn (θ ) when φ (ε ) = ψ (ε ) = ε , from which it follows that the class of estimators includes LIML. A natural interpretation of mn is that its first entry is the first order condition for a robust IV-regression of yi on xi using zi′ π as an instrument for xi and that the remaining entries are the first order conditions for an IV-regression of xi on ψi (β ) and zi using φi (β ) as an instrument for ψi (β ). For this interpretation one needs to subtract the first entry of mn from the second, but doing so does not affect the asymptotic distribution of βˆ . The effect of including ψi (β ) in the regression of xi on zi is a rotation of the errors which makes φ (εi ) uncorrelated with ui − ψ (εi )γ0 . To understand the importance of this rotation, one can consider (2) for γ = 0 and without the second equation. Solving that for β yields 2SLS which is not consistent under many instruments asymptotics. Section 4 shows, under regularity conditions, that an optimal choice of φ is proportional to the score function for εi , ′ φ (ε ) = ff ((εε)) , and that an optimal choice of ψ is proportional to the conditional mean of ui given εi , ψ (ε ) = E[ui | εi = ε]. These optimal functions are usually unknown, but a feasible approach would be to let φ and ψ be the optimal functions for some fixed distribution, e.g., to let φ and ψ be one of the Huber score min{1, max{ε, −1}}, the Cauchy score ε/(ε 2 + 1), or the Gauss score ε . See Hansen et al. (2010) for a version of this approach under standard asymptotics and for a different class of estimators. An alternative approach, introduced by Huber (1964) in the regression model, is to fix some error distribution (e.g. normal) and use an estimator that minimizes the maximal asymptotic variance over a neighborhood of the fixed error distribution. Section 4 applies this approach to the current model and derives that the ensuing optimal robust (minimax) estimator is indexed by φ and ψ that are equal to φν0 (ε ) = min{ν0 , max{ε, −ν0 }} for some value of ν0 , i.e., φν0 is linear around zero and censors extreme values of ε at ±ν0 . The online appendix B1.1 implements a data driven censoring level νˆ which also serves to make the estimator scale invariant. For this censoring level the optimal robust estimator is approximately 5% less efficient than LIML when (εi , ui ) has a joint normal distribution. On the other hand, the optimal robust estimator is approximately 80% more efficient than LIML for certain thick-tailed error distributions and in the empirical example. The main conditions on the functions φ and ψ are (i) that for each φ there exists a unique δ ∗ ∈ R such that E[φ (εi + δ ∗ )] = 0, and (ii) that

E[φ ′ (εi + δ ∗ )] ̸ = 0 and E φ (εi + δ ∗ )ψ (εi + δ ∗ ) ̸ = 0.

[

]

(3)

These conditions will, in general, be satisfied when φ and ψ equal one of the Huber, Cauchy, or Gauss scores. For example, if φ and ψ equal the Gauss score, then δ ∗ = −E[εi ] and (3) is satisfied when the variance of εi is nonzero and finite. Similarly, if φ and ψ equal the Huber score, then there exists a unique δ ∗ ∈ R such that E[φ (εi + δ ∗ )] = 0 and (3) is satisfied when εi + δ ∗ puts positive mass on [−1, 1]. To keep the notation simple, normalize the intercept in (1a) so that δ ∗ = 0 and demean the function ψ such that E[ψ (εi )] = 0. These two normalizations lead the asymptotic results to depend on φ (εi ) and ψ (εi ) instead of φ (εi + δ ∗ ) and ψ (εi + δ ∗ ) − E[ψ (εi + δ ∗ )]. Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

5

A necessary condition for consistency of βˆ is that mn uniquely identifies β0 over the parameter space Θn , but this is not the case if Θn = Rk+2 . To see this, consider the special case where φ (ε ) = ψ (ε ) = ε . In this case, it follows that the first order conditions in (2) have a second solution where β˜ = arg maxβ∈R ε (β )′ P ε (β )/ε (β )′ ε (β ), and this solution is not [consistent] for β0 in general (see SA A4 for a derivation). To ensure consistency of βˆ , let the parameter space

Θn = βˆ init ± bn × Rk+1 where βˆ init is an initial consistent estimator of β0 and bn is a bandwidth that shrinks to zero

slowly. This can be thought of as using an initial estimator to select the consistent root of mn , and Section 4 shows that the asymptotic distribution of βˆ neither depends on the initial estimator nor on the bandwidth. A natural choice of initial estimator is LIML which is consistent when εi has finite variance.



−1/2

d

ˆ n (βˆ − β0 ) − → N (0, 1) for some sequence of asymptotic Section 4 shows, under regularity conditions, that nΣ ˆ n . This implies that confidence intervals or hypothesis tests for β0 can be constructed using standard variance estimators Σ ˆ n , let Jn be a Jacobian of mn and suppose that Jn (θ ) is invertible at θˆ . Then, Σ ˆ n is the upper methods. In order to describe Σ left entry of a sandwich type estimator, i.e.,

)

( ˆ n = Jn Σ

−1

(θˆ ) 1n



msni (

θˆ )msni (θˆ )′ Jn

−1

(θˆ )′

i

(4) 11

where



zi′ π φi (β )



msni (θ ) = ⎝ φi (β ) (xi − ψi (β )γ )⎠ . 0k

ˆ n differs from a straightforward application of GMM formulas, as it uses the short set of moment functions The estimator Σ msni rather than mni to form the inner matrix. Section 4 provides some further discussion of how this difference stems from the many parameters and sample moments in θ and mn . ˆ n to the variance estimator adopted by Hansen et al. (2008). That paper When βˆ is LIML, it is instructive to compare Σ proposes an asymptotic variance estimator that separately estimates four different terms, and these terms are the standard variance estimator, the correction term proposed by Bekker (1994), and two additional terms that are present under some forms of nonnormality. The variance estimator considered here is numerically different from the Hansen et al. (2008) ˆ n automatically accounts for all four terms. estimator, but asymptotically equivalent, i.e., Σ 3. High-level conditions for asymptotic normality This section gives high-level conditions for asymptotic normality of a single element of a GMM estimator with dimension proportional to the sample size, and Section 4 verifies these high-level conditions for the estimators of the previous section. The results apply to just-identified GMM estimators θˆ of θ0 ∈ Rp , where p/n → α ∈ [0, 1),

∥mn (θˆ )∥ ≤

inf

θ ∈Θn

⊂Rp

( ) ∥mn (θ )∥ + op n−1/2 ,

and the first entry of θ0 , say β0 , is the object of interest. Thus, the results can be seen as extensions of Pakes and Pollard (1989, theorem 3.1) and Newey and McFadden (1994, theorems 3.2 and 7.2) to a setup where the dimension of the parameter space can be proportional to sample size. The main idea behind this section is to use a Taylor expansion of mn (θˆ ) around a sample dependent approximation to θˆ denoted θ¯ . Expansions of sample moments are standard in the literature, but the approach taken here differs from the classical theorems since θ¯ is different from θ0 and random, i.e., it depends on the sample. The same idea appears in the heuristic analysis of robust regression with many regressors (El Karoui et al., 2013), which was made rigorous in El Karoui (2013). Furthermore, the approach has some similarities with the analysis of two-step estimators in Newey and McFadden (1994, section 6), although in their setup the first step estimator has a non-random limit. Throughout the presentation, the high-level conditions will be discussed in the context of the linear IV model with many instruments. For illustrative purposes a brief discussion of a nonlinear endogenous model with many instruments is also provided. 3.1. The approximation to θˆ : θ¯

¯ should be small The definition of the approximation to θˆ is constructed to satisfy two requirements. First, ∥θˆ − θ∥ enough that a Taylor expansion of mn yields good approximations to the asymptotic behavior of the object of interest, βˆ − β0 . Second, θ¯ should be simple enough that one can characterize the limiting behavior of mn (θ¯ ). To accommodate this, assume that the first entry of θ¯ is β0 , that θ¯ sets all but a fixed number of entries in mn (θ¯ ) equal to zero, and that θ¯ is unique. Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

6

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

In the context of the IV model, a definition that satisfies these requirements is θ¯ = (β0 , γ0 , π¯ ′ )′ where π¯ is the unique solution to 1 n

∑ (

zi xi − ψ (εi )γ0 − zi′ π¯ = 0.

)

(5)

i

This definition of θ¯ sets the last k entries of mn (θ¯ ) equal to zero and one can think of π¯ as an infeasible estimator of ¯ π0 , since it depends on the unknown parameters ∑ (β′ 0 , γ0 ). Furthermore, the two nonzero entries of mn evaluated at θ z πφ ¯ ( ε ), this observation is a consequence of (5) which implies that have mean zero. For the first entry of mn (θ¯ ), 1n i i i the infeasible estimator π¯ is uncorrelated with φ (εi ). The following lemma gives conditions that ensure convergence in ¯ to zero, and the mean zero condition on mn (θ¯ ) is almost necessary for the first of these conditions. probability of ∥θˆ − θ∥ Lemma 3.1.

Suppose that P θ¯ ∈ Θn → 1. If

(

)

(i) ∥mn (θ¯ )∥ = op (1); (ii) for every δ > 0, there exists a c > 0, such that inf

θ∈Θn , ¯ ∥θ−θ∥>δ

∥mn (θ )∥ > c + op (1);

p

¯ − then ∥θˆ − θ∥ → 0. This lemma extends Pakes and Pollard (1989, theorem 3.1, hereafter PP3.1) to allow for a random ‘‘limit’’ of θˆ and for proportionality between the dimensions of θ¯ and the sample size. The proof of Lemma 3.1 is along the same lines as the proof of PP3.1, but the high-level conditions (i) and (ii) can be satisfied in settings where the corresponding conditions of PP3.1 fails. For example, in the linear IV model with many instruments we have that ∥mn (θ0 )∥ is bounded away from zero, whereas the Efron–Stein inequality (see SA A2 for a discussion and references) can be used to show that ∥mn (θ¯ )∥ = op (1). Remark 1. When the dimension of mn is fixed, the condition E[mn (θ0 )] = 0 is often sufficient for condition (i): ∥mn (θ¯ )∥ = op (1). However, in the current setting that need not be the case. A simple example is 2SLS in the linear IV model. This estimator can be represented as a GMM estimator based on the moment function mn (θ ) =

1 n

∑ (z ′ π (y − x β )) i i i zi (xi − zi′ π )

for θ = (β, π ′ )′ .

i

p

Here, E[mn (θ0 )] = 0, but ∥mn (θ¯ )∥ − → α|E[ui εi ]| which leads to potential inconsistency of 2SLS. Remark 2. The high-level theory introduced here may also apply to nonlinear endogenous models with many instruments. A concrete example is the rational expectations model from Hansen and Singleton (1982) augmented with a reduced form. Here, the structural parameters are solutions to the moment equations

E (1, zi′ π0 )′ εi (β0 ) = 0 where εi (β ) = β1 eyi +β2 xi − 1

[

]

and the reduced form is posited to be xi = zi′ π0 + ui and yi = g(zi ) +vi for i.i.d. data where zi ∈ Rk is independent of (ui , vi ) with mean zero, and g is an unknown function such that g(zi ) + β20 zi′ π0 does not depend on zi (see Newey and McFadden, 1994, p. 2128 for these conditions). An analog estimator based on these moment conditions is nonlinear two-stage least squares (N2SLS) which is based on the moment function m n (θ ) =

1 n

∑ ((1, z ′ π )′ ε (β )) i i zi (xi − zi′ π )

for θ = (β ′ , π ′ )′ .

i

p

As in the linear model, we have that ∥mn (θ¯ )∥ = ∥mn (β0′ , πˆ ′ )∥ − → α|E[ui εi (β0 )]| so N2SLS is potentially inconsistent with many instruments. This issue can be solved in the same way as in the linear IV model by using εi (β ) as a control variable in the first stage:

( mn (θ ) =

1 n

∑ i

When γ0 =

E[ui εi (β0 )] E[εi (β0 )2 ]

(1, zi′ π )′ εi (β ) εi (β )(xi − εi (β )γ ) zi (xi − εi (β )γ − zi′ π )

and π¯ solves

)

1 n



i zi

for θ = (β ′ , γ , π ′ )′ .

(

xi − εi (β0 )γ0 − zi′ π¯

)

= 0, then it follows (under regularity conditions) that ′ ′ ′ ¯ ¯ ∥mn (θ )∥ = op (1) for θ = (β0 , γ0 , π¯ ) , which is condition (i) of Lemma 3.1.3 3 For this example, a discussion of the remaining high-level conditions is beyond the scope of this paper. Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

7

3.2. Rate of convergence of θˆ and asymptotic normality of βˆ

¯ = Op (n−1/2 ) and asymptotic normality of βˆ . Theorem 3.2 treats This subsection gives conditions that lead to ∥θˆ − θ∥ the case where mn is continuously differentiable, and Proposition 3.3 relaxes this smoothness condition. The first result uses the following notation. Let the vector θ ∗ = (β ∗ , γ ∗ , π ∗ ) solve 0 = mn (θ¯ ) + Jn (θ¯ ) θ ∗ − θ¯

(

)

where Jn is a Jacobian of mn . When Jn (θ¯ ) is invertible it follows that

β ∗ − β0 = −Jn1 (θ¯ )′ mn (θ¯ ), where Jn1 (θ¯ )′ is the first row of Jn −1 (θ¯ ). The conditions of Theorem 3.2 imply that β ∗ is asymptotically normal and that the same conclusion can be transferred to βˆ . Theorem 3.2. Θn . If

p

¯ − Suppose that ∥θˆ − θ∥ → 0, P θ¯ ∈ Θn → 1 and

(

)



p

¯ − n infθ ∈∂ Θn ∥θ − θ∥ → ∞ where ∂ Θn is the boundary of

(i) mn is continuously differentiable with Jn ; ( Jacobian ) (ii) there exists a c > 0 such that σmin Jn (θ¯ ) > c + op (1); (iii) for any sequence {δn } of positive numbers converging to zero, sup ∥θ −θ¯ ∥<δn

∥Jn (θ ) − Jn (θ¯ )∥ = op (1);

(iv) ∥mn (θ¯ )∥ = Op (n−1/2 ) and

¯ = Op n then ∥θˆ − θ∥

(



) −1/2



−1/2 1

nΣn

d

Jn (θ¯ )′ mn (θ¯ ) − → N (0, 1) for some Σn with Σn

−1

= Op (1);

and

( ) d nΣn−1/2 βˆ − β0 − → N (0, 1). p

If, additionally, Σn − → Σ then



d

n(βˆ − β0 ) − → N (0, Σ ).

Condition (i) states that mn is continuously differentiable, so it is natural to compare the conditions of Theorem 3.2 to those of Newey and McFadden (1994, theorem 3.2, hereafter NM3.2). The main differences are that in Theorem 3.2 the dimensions of θ , mn , and Jn can grow as quickly as n, that the ‘‘limiting’’ object, θ¯ is a random vector, and that the conclusion only gives a limiting distribution for a single element of θˆ . In NM3.2 these dimensions are fixed, the limiting object, θ0 , is nonrandom, and the conclusion is a limiting distribution of θˆ . A further difference is that NM3.2 allows for overidentification in the sense that the dimension of mn can be larger than that of θ . NM3.2 requires that the Jacobian converges in probability to a matrix of full rank, which in turn implies the bound on the singular values in (ii). Here, the dimension of Jn (θ¯ ) grows quickly with n so it will, generally, not converge in probability. However, (ii) is sufficient for the desired result. Similarly, NM3.2 imposes continuity on the limit of the Jacobian, but Jn may not have a limit. Instead, (iii) imposes a stochastic equicontinuity condition on Jn . NM3.2 assumes that mn (θ0 ) satisfies a central limit theorem (CLT), whereas the essence of (iv) is that the (fixed number of) nonzero entries of mn (θ¯ ) satisfy a CLT. The presence of the random θ¯ makes mn (θ¯ ) a sample average of dependent observations, and there are multiple specialized tools to deal with such averages. SA A2 gives a CLT inspired by Chatterjee (2008) which can be used to establish (iv), and Section 4 applies the CLT to the IV model. The final condition of NM3.2 is that the limiting object, θ0 , is an interior point of the parameter space, and Theorem 3.2 places a similar condition on θ¯ , but accommodates that θ¯ and Θn are random. It is possible to get rid of the assumption that mn is continuously differentiable, provided that mn is well-approximated by a continuously differentiable random function Mn . The following proposition outlines what is meant by wellapproximated. p

¯ − Proposition 3.3. Suppose that ∥θˆ − θ∥ → 0, P θ¯ ∈ Θn → 1 and of Θn . If

(

)



p

¯ − n infθ∈∂ Θn ∥θ − θ∥ → ∞ where ∂ Θn is the boundary

(i) for any sequence {δn } of positive numbers converging to zero,



sup θ∈Θn , ∥θ −θ¯ ∥<δn

n∥mn (θ ) − mn (θ¯ ) − (Mn (θ ) − Mn (θ¯ ))∥ 1+



¯ n∥θ − θ∥

= op (1) ;

(ii) Mn is continuously differentiable with Jacobian Jn ; (iii) Jn satisfies conditions (ii) and (iii) of Theorem 3.2 and mn satisfies condition (iv); then



−1/2

nΣn

(

) d βˆ − β0 − → N (0, 1).

Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

8

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

The conditions of this proposition are similar to the conditions of Newey and McFadden (1994, theorem 7.2), and the discussion following Theorem 3.2 about differences and similarities applies here as well. Section 4 verifies the conditions of Proposition 3.3 for mn with discontinuous derivatives (e.g., when φ or ψ is the Huber score), but not when mn is discontinuous (e.g., when φ or ψ are the sign function). The conditions of Newey and McFadden (1994, theorem 7.2) have been verified in cases with a discontinuous mn (see, e.g., Andrews, 1994), but I leave it to future work to establish whether the conditions of Proposition 3.3 can be verified for a discontinuous mn . 4. Asymptotic normality, inference, and optimality This section presents three results. First, it gives primitive conditions on the model and estimators of Section 2 that are sufficient for the high-level conditions of Section 3 and therefore sufficient for asymptotic normality. Second, it presents a consistency result for the asymptotic variance estimators. Third, it characterizes the functions φ and ψ that lead to an optimal estimator or to the optimal robust estimator. 4.1. Asymptotic normality of βˆ In order to show asymptotic normality of βˆ , this section imposes the following regularity conditions in addition to the assumptions stated together with the model. Assumption 1. (i) {(zi′ π0 )2 }i,n is uniformly integrable and one of the following is satisfied: (a) φ and ψ are bounded and E[u4i ] < ∞; or (b) E[φ 6 (εi ) + ψ 6 (εi ) + u6i ] < ∞.

[

]

(ii) Θn = βˆ init ± bn × Rk+1 , where bn = op (1) and n−1/2 + |βˆ init − β0 | = op (bn ). (iii) There exists a finite set A ⊆ R such that φ ′ and ψ ′ are Lipschitz continuous on each connected set in R \ A. The uniform integrability assumption on zi′ π0 and existence of various moments are sufficient for the law of large numbers and central limit theorem in SA A2. The conditions on the bandwidth, bn , satisfy two requirements. First, bn approaches zero, which ensures consistency of βˆ . Second, bn approaches zero slowly, which implies that the asymptotic distribution of βˆ neither depends on the bandwidth nor on the initial estimator, βˆ init . When√βˆ init is consistent for β0

ˆ n,LIML /n1/4 , then (ii) is Σ is as defined in (4) for φ (ε ) = ψ (ε ) = ε . The smoothness

there will always exist a bn that satisfies (ii), and if the initial estimator is LIML and bn =

ˆ n,LIML satisfied when (i) is satisfied for φ (ε ) = ψ (ε ) = ε . Here, Σ condition in (ii) implies that mn can be approximated by a continuously differentiable Mn as in Proposition 3.3, and that the stochastic equicontinuity condition on the Jacobian of Mn , Theorem 3.2(iii), is satisfied. The smoothness condition is satisfied by the Huber, Cauchy and Gauss scores, where the Cauchy and Gauss scores satisfy that the derivatives are continuous everywhere. The following result verifies that Assumption 1 is sufficient for the high-level conditions of Lemma 3.1. p

¯ − Lemma 4.1 (Consistency). If θˆ is indexed by φ and ψ that satisfy Assumption 1, then ∥θˆ − θ∥ → 0. I now turn to the main result of the paper, which shows that Assumption 1 is sufficient for the conditions of Proposition 3.3 and therefore for asymptotic normality of βˆ . The asymptotic variance Σn of βˆ takes on a sandwich form Σn = Dn −1 Ωn Dn −1 where

[ ] [ ] [ ] Ωn = (1 − α )2 σxz E φ (εi )2 + α (1 − α )E φ (εi )2 E (ui − ψ (εi )γ0 )2 ∑ [ ] (Pii − α )zi′ π0 E φ (εi )2 (ui − ψ (εi )γ0 ) + 2(1 − α ) 1n i

+

1 n



(Pii − α )2 cov φ (εi )2 , (ui − ψ (εi )γ0 )2 .

)

(

i

Furthermore, Dn = (1 − α )σxz E φ ′ (εi )

[

+

1 n



]

(Pii − α )zi′ π0 E φ ′ (εi )(ui − ψ (εi )γ0 ) − γ0 φ (εi )ψ ′ (εi ) .

[

]

i

Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

9

Theorem 4.2 (Asymptotic Normality). If θˆ is indexed by φ and ψ that satisfy Assumption 1 and there exists a c > 0 such that |Dn | > c + op (1) and Ωn > c + op (1), then



(

nΣn−1/2 βˆ − β0

)

d

− → N (0, 1).

In the case where βˆ is LIML, Theorem 4.2 is a special case of Hansen et al. (2008, theorem 3). In this case, the first term of Σn is the variance of LIML under standard asymptotics, the second term of Σn is a ‘‘many instruments penalty’’ characterized by Bekker (1994, eq. (4.7)) under joint normality of (εi , ui ), and the third and fourth terms of Ωn are the terms named A + A′ and B in Hansen et al. (2008) which appears under some forms of nonnormality. In the cases where the number of instruments is fixed or grows slowly (α = 0), the asymptotic variances in Theorem 4.2 are the same as the asymptotic variances for the class of nonlinear IV (NLIV) estimators studied in Hansen et al. (2010). For the remaining cases, i.e., when φ or ψ are nonlinear and α > 0, Theorem 4.2 has no antecedents (to the best of my knowledge). Remark 3. To compare the robust estimators of this paper with the NLIV estimators of Hansen et al. (2010), it is useful to represent the NLIV estimators as GMM estimators for the moment function m n (θ ) =

1 n

∑( i

zi′ πφi (β ) zi (xi φi′ (β ) − zi′ π )

)

for θ = (β, π )′ .

Here the first moment coincides with that used for the robust estimators of this paper, but the remaining k moments differ ∑ ′ ′ in that they correspond to a regression of xi φ ′ (εi ) onto zi . Furthermore, if π¯ is the solution to 1n ¯ ) = 0, i zi (xi φ (εi ) − zi π p

then it follows, under mild regularity conditions, that ∥mn (θ¯ )∥ − → α|E[φ (εi )φ ′ (εi )ui ]| where θ¯ = (β0 , π¯ ). As the limit of ¯ ∥mn (θ )∥ need not be zero, this suggests that the NLIV estimators are generally inconsistent with many instruments. The simulations in Section 5 provide further suggestions that this conjecture is true. 4.2. Consistent variance estimation The assumptions that are sufficient for asymptotic normality are also sufficient for consistency of the asymptotic variance estimator. Lemma 4.3 (Variance Estimation). If θˆ is indexed by φ and ψ that satisfy Assumption 1 and there exists a c > 0 such that |Dn | > c + op (1) and Ωn > c + op (1), then



(

ˆ n−1/2 βˆ − β0 nΣ

)

d

− → N (0, 1).

The underlying observations leading to Lemma 4.3 are that the last k elements of mn (θ¯ ) are zero, and that the first two elements of mni (θ¯ ) are uncorrelated across observations. This implies that

V

[√

nmn (θ¯ ) | Z =

]

1 n

∑ [

V msni (θ¯ ) | Z ,

]

(6)

i

where



zi′ π φi (β )



msni (θ ) = ⎝ φi (β ) (xi − ψi (β )γ )⎠ . 0k

ˆ = Jn1 (θˆ )′ 1 i msni (θˆ )msni (θˆ )′ Jn1 (θˆ ) where Jn1 (θˆ )′ is the first row of Jn−1 (θˆ ), and in the light of (6) The variance estimator is Σ n √ n √ ˆ n is an analog estimator. and the representation n(βˆ − β0 ) = Jn1 (θ¯ )′ nmn (θ¯ ) + op (1), Σ √ √ Under standard asymptotics, we instead have the standard asymptotic representation n(βˆ −β0 ) = Jn1 (θ0 )′ nmn (θ0 ) + [√ ] 1∑ op (1) and V nmn (θ0 ) | Z = n i V [mni (θ0 ) | Z ]. An analog variance estimator that emerges from these conclusions (see, e.g., Newey and McFadden, 1994, section 4) is therefore



˜ n = Jn1 (θˆ )′ 1 Σ n



mni (θˆ )mni (θˆ )′ Jn1 (θˆ )

i

where mni (θ ) is the long moment function:



zi′ π φi (β )



mni (θ ) = ⎝ φi (β ) (xi − ψi (β )γ )



zi xi − ψi (β )γ − zi π

(



⎟ ⎠. )

˜ n overestimates Σn when there are many instruments. An interpretaThe simulations in Section 5 demonstrate that Σ ˜ n does not accurately account for the estimation error in the large tion of this discrepancy with standard theory is that Σ Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

10

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

number of first stage coefficients. The sources of this discrepancy are the failure of the standard expansion for βˆ , which ˜ n , is not consistent for leads to a different asymptotic variance, and that the point estimator θˆ , which is plugged in to Σ θ0 . Remark 4. The preceding theory describes a feasible inference procedure under the simplification that (δ0 , η0 ), the coefficients on the exogenous regressors, are known. The online appendix B1.2 demonstrates how the moment function and variance estimator can be extended to allow for inference when these coefficients are unknown. The argument relies on consistency of δˆ which in turn is a consequence of the assumption that the number of exogenous regressors is fixed. It goes beyond the scope of this paper to consider a framework where the number of exogenous regressors is proportional to sample size, but a limited set of simulations (reported in the online appendix B1.2) suggests that the inference procedure 1 developed herein is slightly conservative when the ratio of exogenous regressors to sample size is as large as 10 . However, the interested reader is referred to contemporary work for papers that touch upon the issues that arise when a parameter of high dimension enters the statistical problem through non-linear functions (El Karoui et al., 2013; El Karoui and Purdom, 2018; Kline et al., 2018), as would be the case for δˆ in a framework where the number of exogenous regressors grows with sample size. 4.3. The optimal robust estimator Corollary 4.4 below presents two conditions, either of which is sufficient for us to derive a simplified expression for the asymptotic variances of Theorem 4.2. Either condition is also sufficient for the efficiency results in Lemma 4.5 and Proposition 4.6 which provide lower bounds for these simplified asymptotic variances. Corollary 4.4 (Simplified Variance). Suppose θˆ is indexed by φ and ψ that satisfy Assumption 1. If one of the following conditions are satisfied: 2 (i) 1n i (Pii − α ) = op (1); or (ii) ui = ψ (εi )γ0 + ηi where ηi is independent of εi , and E[φ (εi )ψ ′ (εi )] = 0;



then



(

nΣn∗ (φ, ψ )−1/2 βˆ − β0

)

E φ ( εi ) 2

[

Σn (φ, ψ ) = ∗

d

− → N (0, 1) where

]

E [φ (εi )s(εi )]2 σxz

[ ] [ ] E φ (εi )2 E (ui − ψ (εi )γ0 )2 α + , 1 − α E [φ (εi )s(εi )]2 σxz σxz

s = f ′ /f is the score function for εi and the notation makes it explicit that Σn∗ depends on φ and ψ . There are a variety of primitive conditions on zi that imply the first condition of this lemma (see Anatolyev and Yaskov, 2016), and one of the simplest is that zi indicate group membership and that all groups have equal sizes (Bekker and van der Ploeg, 2005). Similarly, there are a variety of primitive conditions that lead to the second condition, a simple one being that (εi , ui ) are jointly normal and ψ is linear. In practical applications both of these conditions may fail, but if one wants to investigate the relevance of the following efficiency results (which are derived for Σn∗ ), then it is possible to compare estimates of Σn and Σn∗ .4 In the empirical example of Section 6, the estimated difference between Σn and Σn∗ ˆ n , which suggests that the efficiency results can be seen as focusing on the relevant terms is less than one percent of Σ of the asymptotic variance. For a fixed joint distribution of the errors (εi , ui ) the following lemma characterizes the efficiency bound for estimators indexed by φ and ψ satisfying that

E[φ (εi )2 ] < ∞, E[ψ (εi )2 ] < ∞, and E[φ (εi )ψ (εi )] ̸ = 0. Furthermore, the result gives conditions under which the bound is attained by a specific estimator in the class. The result is a consequence of the following well-known inequalities,

E φ ( εi ) 2

[

]

E [φ (εi )s(εi )]

2



1 If

and E (ui − ψ (εi )γ0 )2 ≥ E (ui − E[ui | εi ])2 .

[

]

[

]

Lemma 4.5 (Efficiency Bound). The greatest lower bound on Σn∗ (φ, ψ ) is inf Σn∗ (φ, ψ ) =

φ,ψ

1 If σxz

+

[ ] α 1 E (ui − E[ui | εi ])2 . 1 − α If σxz σxz

If E[s(εi )ui ] ̸ = 0, then the bound is attained by (φ (ε ), ψ (ε )) = (s(ε ), E[ui | εi = ε]). 4 However, for inference on β we recommend using Σ ˆ n , as it is consistent with or without the conditions of Corollary 4.4. 0 Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

with

many

instruments.

Journal

of

Econometrics

(2019),

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

11

If E[s(εi )ui ] ̸ = 0, then this result bounds any asymptotic variance, Σn∗ (φ, ψ ), from below by an asymptotic variance that lets φ (ε ) = s(ε ) and ψ (ε ) = E[ui | εi = ε]. As Σn∗ (φ, ψ ) is homogeneous of degree zero, it follows that any φ proportional to s = f ′ /f combined with any ψ proportional to E[ui | εi = ·], leads to a minimal Σn∗ (φ, ψ ). If E[s(εi )ui ] = 0, then the bound may not be attainable, but the bound is attainable in the special case where E[ui | εi ] = 0. This case can be thought of as no endogeneity and implies that any feasible choice (s, ψ ) reaches the efficiency bound. An interpretation of this is that the choice of ψ is irrelevant, when there is no endogeneity, and a continuity argument would imply that the choice of ψ is less important than the choice of φ , when there is weak endogeneity. The simulations in Section 5 suggest that the choice of ψ tends to affect the sampling variance less than the choice of φ both under weak and strong endogeneity. If the joint distribution of (εi , ui ) is such that the optimal φ or ψ is unbounded, then it follows that there exists a small perturbation to the distribution of εi such that the previous optimal choice of φ and ψ leads to an infinite asymptotic variance. In the context of the regression model which has a univariate error term, Huber (1964) characterized such an issue as nonrobustness, and introduced the idea of a contamination model in order to characterize robust estimators. The following extends Huber’s contamination model to the setup of the linear IV model which has a bivariate error term. Given the importance of the joint normal distribution and LIML, I focus on the case of a contaminated normal distribution, but more generally one could consider any contaminated distribution (see, Huber, 1964, theorem 1). Assume that the absolutely continuous density of εi is f = (1 − δ )Φ ′ + δ h where δ ∈ [0, 1) is a fixed, small level of contamination and h is an unknown absolutely continuous contamination density. Restrict the possible contamination densities such that the Fisher information, If , is bounded by one, i.e., If = Ef [s(εi )2 ] ≤ 1, where Ef denotes expectation when εi has density f . Furthermore, let ui be generated from the model ui = s(εi ) + ηi



2 − If

where ηi has a standard normal distribution and is independent of εi . The contamination model for εi is the same as Huber’s. The choice of contamination model for ui is guided by two principles. First, the joint distribution of (εi , ui ) should be normal under no contamination (δ = 0), and this is achieved since δ = 0 implies that s(εi ) = −εi , If = 1, and ui | εi = ε ∼ N (−ε, 1). Second, the variance of ui should stay bounded under contamination (δ > 0), as the estimators are not robust with respect to outliers in ui . The model for ui achieves this, since var(ui ) = 2 for any contamination density h. For this contamination model, Proposition 4.6 characterizes the minimax efficiency bound for estimators indexed by φ and ψ satisfying that sup Ef [φ (εi )2 ] < ∞, sup Ef [ψ (εi )2 ] < ∞, and inf Ef [φ (εi )ψ (εi )]2 > 0. f

f

f

The optimal robust estimator is the estimator that achieves the bound. Proposition 4.6 (Minimax Efficiency Bound). Index the asymptotic variance by f , i.e, write Σn∗ (φ, ψ, f ) instead of Σn∗ (φ, ψ ). The greatest lower bound on supf Σn∗ (φ, ψ, f ) is min sup Σn∗ (φ, ψ, f ) = Σn∗ (φν0 , φν0 , f0 ) ψ,φ

f

where φν0 (ε ) = min{ν0 , max{ε, −ν0 }}, ν0 solves

δ 1−δ

=

2Φ ′ (ν0 )

ν0

− 2Φ (−ν0 ), and

⎧ ⎪ ⎪ 1√− δ e−ε2 /2 , if |ε| ≤ ν0 , ⎨ 2π f 0 (ε ) = ⎪ 1−δ ⎪ ⎩ √ eν02 /2−|ε|ν0 , if |ε| ≥ ν0 . 2π This result is a consequence of the following saddle point property min φ

Ef0 [φ (εi )2 ] Ef0 [φ (εi )s0 (εi )]

2



Ef0 [φν0 (εi )2 ] Ef0 [φν0 (εi )s0 (εi )]

2

≥ sup f

Ef [φν0 (εi )2 ] Ef [φν0 (εi )s(εi )]2

where s0 = f0′ /f0 , which Huber (1964, 1973) used to characterize an optimal robust estimator in the regression model. As in the regression model, Proposition 4.6 shows that the bound is achieved by the estimator that is indexed by φ and ψ equal to φν0 (ε ). The function φν0 (ε ) censors ε at ±ν0 , where the level of censoring depends on the amount of contamination δ . A common way to choose the level of censoring in robust estimation of the regression model is to pick ν0 as the solution to

E φν′ 0 (εi )

[

0.95 = σε

2

]2

E φν0 (εi )2

[

] when εi ∼ N (0, σε2 ).

Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.

Robust

estimation

(7) with

many

instruments.

Journal

of

Econometrics

(2019),

12

M. Sølvsten / Journal of Econometrics xxx (xxxx) xxx

This can be thought of as losing 5% efficiency due to the choice of φ when there is no contamination, it leads to ν0 = 1.345σε , and it is optimal for a contamination level of δ ≈ 0.058. The online appendix B1.1 adapts a method proposed by Huber (1964) which uses (7) to estimate ν0 and the remaining parameters simultaneously. Remark 5. An alternative to the robust approach developed here is so-called adaptive estimation and inference, where the optimal choices of φ and ψ from Lemma 4.5 are estimated non-parametrically and plugged in to the moment function mn . A variation of this approach, which generalizes 2SLS, was developed in Cattaneo et al. (2012) and analyzed in the context of a fixed number of instruments. Existing Monte Carlo results on adaptive estimation (e.g., Steigerwald (1992) and Cattaneo et al. (2012)) show that such procedures have disappointing size properties in finite samples and they are not pursued further in this paper. However, the estimator proposed in Cattaneo et al. (2012) is also considered in the simulations of Section 5, where it has biases that are comparable to 2SLS. 5. Simulations This section presents the results of a simulation study which shows that the asymptotic results give good approximations to the finite sample behavior of the estimators considered in this paper. The simulations consider twelve estimators which are the optimal robust estimator, LIML, 2SLS, five other combinations of φ and ψ as one of the Huber, Cauchy, or Gauss scores, the NLIV estimator of Hansen et al. (2010), the adaptive estimator of Cattaneo et al. (2012), the LASSO adaptation of 2SLS by Belloni et al. (2012), and Fuller (1977)’s modification of LIML. The estimators incorporate the extensions to estimate the coefficients on included exogenous variables and the scale parameter ν , both described in the online appendix B1, and are implemented using a quasi-Newton method using LIML as the initial value.5 2SLS and the modifications thereof are mainly considered here in order to give examples of estimators that generally lead to incorrect inference in the context of many instruments. The simulations generate i.i.d. data from the model yi = xi β0 + δ0 + εi xi = zi′ π0 + η0 + ui

(i = 1, . . . , 500)

where δ0 = η0 = 0, β0 = 1, zi ∼ N (0, Ik ), π0 = zero and covariance matrix

[

√1

ρ 10

√ ] ρ 10 10

√ σxz k

(1, . . . , 1) ∈ Rk , k = 50, and (εi , ui ) is independent of zi with mean ′

where ρ ∈ [−1, 1] .

(8)

The simulations consider three levels for the strength of the instruments, σxz . The levels are σxz = 1 in Tables 1 and 2, σxz = .5 in Table 3 (left), and σxz = .2 in Table 3 (right). Additionally, the simulations consider two levels for the strength of endogeneity which is measured in terms of ρ (the correlation between εi and ui ). The levels are ρ = −.7 in Table 1 (right) and ρ = −.3 elsewhere. Finally, the errors are generated such that aεi has density f , ui = b(f ′ /f )(εi ) + c ηi , and ηi is standard normal and independent of εi . The density f is the density of a standard normal in Table 1 (in which case (εi , ui ) is joint normal), the density of a Huber distribution with parameter 1.345 in Table 2 (left), and the density of a t(3) distribution in Table 2 (right) and Table 3. This implies that LIML is optimal in the class considered for Table 1, the optimal robust estimator is optimal for Table 2 (left), and the estimator that lets both φ and ψ be the Cauchy score is nearly optimal for Table 2 (right) and Table 3. In each case the constants (a, b, c) are chosen such that (8) holds. The strength of the instruments (together with the number of observations and the variance of ui ) implies that the value of the concentration parameter (Rothenberg, 1984) equals 50 in Tables 1 and 2, 25 in Table 3 (left), and 10 in Table 3 (right). It is well-known that the value of the concentration parameter tends to influence the quality of the asymptotic approximations based on many instruments asymptotics, so the simulations should reflect the values of the concentration parameter that tend to occur in empirical research. Hansen et al. (2008) conducted a survey (n = 28) of microeconomic studies published in AER, JPE, and QJE and found that 80% of the papers had a value of the concentration parameter between 8.95 and 588 with a median of 23.6. Thus, the range of the concentration parameter considered here is relevant for empirical research. The survey also found a median value for |ρ| of .279, so the value ρ = −.3 used here is relevant as well. Finally, the probability limit of the first stage F -statistic is 1 + σxz , i.e., 2 in Tables 1 and 2, 1.5 in Table 3 (left), and 1.2 in Table 3 (right). Thus, the designs considered here involves instruments that are quite weak when measured by the F -statistic. Tables 1–3 report four summary statistics from the study. The first statistic is the median bias of βˆ standardized by 1.48 times the median absolute deviation (mad) of βˆ . 1.48 × mad(βˆ ) is a robust estimate of the standard deviation of βˆ , so the bias is reported at the appropriate scale, and according to Theorem 4.2, βˆ has no asymptotic bias. The √ second statistic

ˆ n /n > 1.96, is a rejection percentage of the testing procedure that rejects the hypothesis that β0 = 1 when |βˆ − 1|/ Σ and according to Lemma 4.3, this test has asymptotic size of 5%. The third statistic is the same as the second except that ˜ n , instead of Σ ˆ n , and the discussion following Lemma 4.3 suggest that it uses the classical GMM variance estimator, Σ 5 The code, implemented in the statistical software R, is available on request. Please cite this article as: M. Sølvsten, https://doi.org/10.1016/j.jeconom.2019.04.040.


Table 1
Simulation results, normal errors.

[Columns, reported separately for ρ = −.3 and ρ = −.7: Bias, Size Σ̂n, Size Σ̃n, RE. Rows: LIML; the (φ, ψ) combinations Gauss, Huber; Huber, Gauss; Optimal robust; Gauss, Cauchy; Cauchy, Gauss; Cauchy, Cauchy; and 2SLS; NLIV; ADAP; LASS; FULL. Numerical entries omitted.]

NOTE: 10,000 replications, 500 observations, 50 instruments. Bias is med(β̂ − β0)/(1.48 · mad(β̂)), Size Σ̂n uses Σ̂n to estimate the asymptotic variance, Size Σ̃n uses the classical GMM variance estimator Σ̃n, and RE is mad(β̂LIML)²/mad(β̂)².

Table 2
Simulation results, Huber and t(3) errors.

[Same columns and rows as Table 1, reported separately for Huber errors and t(3) errors. Numerical entries omitted.]

NOTE: See Table 1.

Table 3
Simulation results, weak signal.

[Same columns and rows as Table 1, reported separately for σxz = .5 and σxz = .2. Numerical entries omitted.]

NOTE: See Table 1.



The ‘‘Bias’’ columns of Tables 1–3 show that all the estimators considered, except for the 2SLS inspired ones, are essentially median unbiased, and a comparison across Tables 2 and 3 indicates that a small bias emerges when the strength of the instruments is very small. The ‘‘Size, Σ̂n’’ columns show that the testing procedure based on Σ̂n has good size properties for all the parameter values considered, so the small bias that emerges in Table 3 does not seem to affect its size. A comparison within Table 1 reveals that the size properties are somewhat affected by the strength of the endogeneity, something which is not captured by the asymptotic analysis. The ‘‘Size, Σ̃n’’ columns illustrate that the testing procedure based on the classical GMM variance estimator Σ̃n is conservative (rejection percentages are too low) for the sample sizes considered here. Furthermore, a comparison across Tables 2 and 3 suggests that this degree of conservativeness is decreasing in the strength of the instruments, which is not too surprising as the effect of many instruments is mitigated by a strong signal in the instruments (Chao and Swanson, 2005).

The ‘‘RE’’ columns of Table 1 show that when the errors are jointly normal, there is about a 5% efficiency loss (relative to LIML) from letting φ be the Huber or Cauchy scores. On the other hand, there is a 0–11% efficiency loss from letting ψ be the Huber or Cauchy scores. This latter finding conforms with the observation made after Lemma 4.5, that the efficiency loss from a suboptimal ψ depends on the strength of the endogeneity. The ‘‘RE’’ columns of Tables 2 and 3 show that the robust estimators are substantially more efficient than LIML under the thick-tailed distributions; e.g., the optimal robust estimator is roughly 75% more efficient than LIML in the case of t(3) errors (Table 2 (right)). Furthermore, a comparison of the rows in Table 3 shows that the efficiency gains relative to LIML from letting φ be the Huber or Cauchy scores are generally larger than the efficiency gains from letting ψ be the Huber or Cauchy scores.

This section ends with a simulation that illustrates the importance of the assumption that the structural errors, (εi, ui), are independent of the instruments. It does so by adjusting the simulations to a setting where the (conditional) covariance matrix of (εi, ui) is

    [ 1       ρ√10 ]
    [ ρ√10    10   ]  × σ²(zi),        where σ²(zi) = (1 + d × (zi′π0)²) / (1 + d × σxz).

Furthermore, the instruments are generated as zi = √λi wi, where wi ∼ N(0, Ik) independently of λi ∼ exp(1). All remaining parameters and distributions are the same as in Table 1 (left). The heteroskedastic design considered here ensures consistency of LIML, but not of the proposed variance estimator for LIML nor of the existing variance estimators proposed by Bekker (1994) and Hansen et al. (2008). On the other hand, the robust alternatives of this paper are expected to exhibit biases as their consistency relies on full independence between the error terms and instruments. Furthermore, consistency of the proposed variance estimator, Σ̂n, also relies on the full independence assumption, so the current design leads to size distortions stemming from both a bias in the point estimator and a bias in the standard error (except for LIML). Finally, the error terms have an unconditional distribution with thick tails (a mixture of normals with differing variances), so we may expect to see some efficiency gain from the robust estimators when compared to LIML. The degree of heteroskedasticity is indexed by d, and Table 4 reports results for two values, d = 1 and d = 3.
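For concreteness, the homoskedastic design sketched earlier could be modified along these lines; this is a sketch only, with object names continuing from the earlier snippet rather than taken from the paper's code.

    # Heteroskedastic variant: elliptical instruments and error variance scaled by
    # sigma2(z_i) = (1 + d * (z_i' pi0)^2) / (1 + d * sigma_xz).
    d <- 1
    lambda <- rexp(n, rate = 1)
    Z  <- sqrt(lambda) * matrix(rnorm(n * k), n, k)   # scales row i by sqrt(lambda_i)
    s2 <- (1 + d * drop(Z %*% pi0)^2) / (1 + d * sigma_xz)

    E   <- matrix(rnorm(n * 2), n, 2) %*% chol(Sigma) # Sigma as in the homoskedastic design
    eps <- sqrt(s2) * E[, 1]                          # conditional covariance is Sigma * sigma2(z_i)
    u   <- sqrt(s2) * E[, 2]
    x <- drop(Z %*% pi0) + u
    y <- x + eps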



The simulation results largely conform with expectations. The ‘‘Bias’’ columns show that a modest negative bias emerges for the robust estimators under heteroskedasticity. The ‘‘Size, Σ̂n’’ columns suggest that Σ̂n is slightly downward biased for the case of LIML, and this apparent bias compounds with the bias in the robust point estimators to create size distortions of 2–5 percent. Finally, the ‘‘RE’’ column suggests that there are substantial efficiency gains from the robust estimators, thus highlighting a potential trade-off between bias and variance in the presence of heteroskedasticity.

Finally, to clarify the source of the biases in the above simulations, it is useful to recall that consistency relies on the convergence result (1/n) Σi zi′π̄ φ(εi) = op(1), which itself follows from the conditional mean restriction E[φ(εi) | zi] = 0 and the covariance restriction

    (1/n) Σi (Pii − α) E[φ(εi)(ui − γ0 ψ(εi)) | zi] = op(1),                              (9)

where γ0 = E[ui ψ(εi)]/E[φ(εi)ψ(εi)]. In the above simulations, E[φ(εi) | zi] = 0 is satisfied for all estimators considered since the heteroskedastic error terms have symmetric distributions. However, (9) fails for the robust estimators since the heteroskedasticity induces a non-zero sample correlation between Pii and E[φ(εi)(ui − γ0ψ(εi)) | zi]. The importance of (9) in the special case of LIML has previously been highlighted in Hausman et al. (2012), who also note that (9) is satisfied under arbitrary heteroskedasticity if (1/n) Σi (Pii − α)² = op(1), as was imposed and discussed in Corollary 4.4(i).

6. Quarter of birth and returns to schooling

This section considers the empirical example provided by the Angrist and Krueger (1991) study of the returns to schooling using quarter of birth as an instrument. The data come from the 1980 U.S. Census and include 329,509 males born 1930–1939. The structural equation includes a constant, year, and state dummies, and the reduced form equation includes 180 instruments which are quarter of birth interacted with year or state of birth. This model corresponds to Table 7 of Angrist and Krueger (1991). In this example, the estimated concentration parameter is 257 and the correlation between εi and ui is estimated at −0.2. These observations and the simulations suggest that the asymptotic approximations should work well for this example.
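A sketch of how such an instrument matrix can be constructed in R is given below; the data frame and factor names (census, qob, yob, sob) are hypothetical placeholders, and the exact base categories dropped to arrive at 180 instruments follow Angrist and Krueger (1991) rather than this sketch.

    # Quarter-of-birth dummies interacted with year of birth and with state of birth,
    # assuming a data frame `census` with factors qob, yob, and sob (hypothetical names).
    Zqy <- model.matrix(~ interaction(qob, yob) - 1, data = census)  # quarter x year cells
    Zqs <- model.matrix(~ interaction(qob, sob) - 1, data = census)  # quarter x state cells
    Z   <- cbind(Zqy, Zqs)
    # Redundant columns (base categories) must be dropped so that Z is not collinear
    # with the included year and state dummies; Angrist and Krueger (1991) arrive at 180.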


Table 4
Simulation results, heteroskedasticity.

[Columns, reported separately for d = 1 and d = 3: Bias, Size Σ̂n, Size Σ̃n, RE. Rows: LIML; the (φ, ψ) combinations Gauss, Huber; Huber, Gauss; Optimal robust; Gauss, Cauchy; Cauchy, Gauss; Cauchy, Cauchy; and 2SLS; NLIV; ADAP; LASS; FULL. Numerical entries omitted.]

NOTE: See Table 1.

Table 5
Returns to schooling.

Estimator           Estimate    Standard error    Variance ratio
LIML                0.1064      0.01488           1.00
φ, ψ
  Gauss, Huber      0.1051      0.01441           1.07
  Huber, Gauss      0.0891      0.01085           1.88
  Optimal robust    0.0894      0.01099           1.83
  Gauss, Cauchy     0.1043      0.01401           1.13
  Cauchy, Gauss     0.0869      0.01040           2.05
  Cauchy, Cauchy    0.0874      0.01063           1.96
2SLS                0.0928      0.00930
OLS                 0.0673      0.00035

NOTE: Males born 1930–1939, 1980 IPUMS, n = 329,509. Variance ratio is Σ̂n(β̂LIML)/Σ̂n(β̂).

Table 5 presents the OLS estimate and the estimates from the eight estimators considered in the simulation study. Additionally, Table 5 reports standard error estimates based on Σ̂n for the estimators analyzed in this paper and classical standard error estimates for OLS and 2SLS. The latter two are only included for easy reference as they lead to confidence intervals with incorrect coverage. Finally, Table 5 includes the variance ratio Σ̂n(β̂LIML)/Σ̂n(β̂) for each estimator, which provides an estimate of the efficiency gains relative to LIML.

Table 5 shows that the robust estimators deliver point estimates that are similar to either LIML or 2SLS. Furthermore, the table indicates that the robust estimators can be substantially more efficient than LIML; e.g., the optimal robust estimator is estimated to be 83% more efficient than LIML. These efficiency gains are similar in size to the gains in the simulation study with t(3) errors in Table 2 (right). Furthermore, these gains are similar in size to the gains achieved by some of the nonlinear IV estimators proposed by Hansen et al. (2010) in a similar model using three instruments.

To further illustrate why the robust estimators are more efficient than LIML in this example, this section presents two figures that describe the distribution of the errors. Fig. 1 presents a kernel estimate of the density of εi along with a normal and a t(3) density where the location and scale parameters are based on the median and mad of the LIML residuals. From this figure it is evident that the t(3) density provides a better fit than the normal, although the normal provides a reasonable fit in the center of the distribution. Fig. 2 depicts kernel estimates of the optimal φ and ψ (f′/f and E[ui | εi = ·]) together with the appropriately scaled estimates of these functions implied by the estimators that set both φ and ψ equal to one of the Gauss (LIML), Huber (optimal robust), or Cauchy scores. From this figure it is clear that the Huber and Cauchy scores provide a better fit to the unknown optimal φ and ψ than the Gauss score does.

7. Summary

This paper proposed an optimal robust estimator in a linear IV model with many instruments and showed that it is consistent and asymptotically normal under many instruments asymptotics. The optimality of the estimator was shown to be in terms of minimax asymptotic variance over a neighborhood of contaminated normal distributions, and the optimal robust estimator can be substantially more efficient than LIML under thick-tailed error distributions. Furthermore, the paper provided a simple-to-use standard error estimator which, when coupled with the normal approximation, leads to asymptotically valid inference.


Fig. 1. Estimates of f(ε) using LIML residuals.

Fig. 2. Estimates of f′(ε)/f(ε) and E[ui | εi = ε] using LIML residuals. Overlay of kernel estimate of f(ε) at bottom.
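A comparison along the lines of Fig. 1 could be produced as follows; eps_hat stands in for the LIML residuals, and the particular standardization of the t(3) density by the median and 1.48 times the raw mad is my reading of the description in Section 6 rather than the paper's exact choice.

    # Compare a kernel density estimate of the LIML residuals with normal and t(3)
    # densities located and scaled by the median and 1.48 * mad of the residuals.
    m <- median(eps_hat)
    s <- 1.48 * mad(eps_hat, constant = 1)
    grid <- seq(m - 6 * s, m + 6 * s, length.out = 400)

    plot(density(eps_hat), main = "Estimates of f(eps) using LIML residuals")
    lines(grid, dnorm(grid, mean = m, sd = s), lty = 2)       # fitted normal density
    lines(grid, dt((grid - m) / s, df = 3) / s, lty = 3)      # fitted t(3) density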

Acknowledgments

I am grateful to Michael Jansson, Jim Powell, Demian Pouzo, and Noureddine El Karoui for valuable advice, and thank the editor, two anonymous referees, and seminar participants at UC Berkeley, DAEiNA, University of Michigan, University of Virginia, UNC Chapel Hill, Cornell, Chicago Booth, UW Madison, UC Davis, Aarhus University, University of Copenhagen, and University of Connecticut for helpful comments.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jeconom.2019.04.040. The supplemental appendices contain proofs of all theorems plus additional methodological and technical results.


References

Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71 (6), 1795–1843.
Aizer, A., Doyle, Jr., J., 2015. Juvenile incarceration, human capital, and future crime: Evidence from randomly-assigned judges. Q. J. Econ.
Amemiya, T., 1982. Two stage least absolute deviations estimators. Econometrica 50 (3), 689–711.
Anatolyev, S., Yaskov, P., 2016. Asymptotics of diagonal elements of projection matrices under many instruments/regressors. Econometric Theory 1–22.
Anderson, T., Kunitomo, N., Matsushita, Y., 2010. On the asymptotic optimality of the LIML estimator with possibly many instruments. J. Econometrics 157 (2), 191–204.
Andrews, D.W., 1994. Empirical process methods in econometrics. Handb. Econom. 4, 2247–2294.
Angrist, J.D., Krueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? Q. J. Econ. 106 (4), 979–1014.
Autor, D.H., Houseman, S.N., 2010. Do temporary-help jobs improve labor market outcomes for low-skilled workers? Evidence from ‘‘Work First’’. Am. Econ. J. Appl. Econ. 96–128.
Bean, D., Bickel, P.J., El Karoui, N., Yu, B., 2013. Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. 110 (36), 14563–14568.
Bekker, P.A., 1994. Alternative approximations to the distributions of instrumental variable estimators. Econometrica 62 (3), 657–681.
Bekker, P.A., van der Ploeg, J., 2005. Instrumental variable estimation based on grouped data. Stat. Neerl. 59 (3), 239–267.
Belloni, A., Chen, D., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 (6), 2369–2429.
Cattaneo, M.D., Crump, R.K., Jansson, M., 2012. Optimal inference for instrumental variables regression with non-Gaussian errors. J. Econometrics 167 (1), 1–15.
Chamberlain, G., Imbens, G., 2004. Random effects estimators with many instrumental variables. Econometrica 72 (1), 295–306.
Chang, T., Schoar, A., 2008. Judge specific differences in Chapter 11 and firm outcomes. In: American Law & Economics Association Annual Meetings. bepress, p. 86.
Chao, J.C., Swanson, N.R., 2005. Consistent estimation with a large number of weak instruments. Econometrica 73 (5), 1673–1692.
Chao, J.C., Swanson, N.R., Hausman, J.A., Newey, W.K., Woutersen, T., 2012. Asymptotic distribution of JIVE in a heteroskedastic IV regression with many instruments. Econometric Theory 28 (01), 42–86.
Chatterjee, S., 2008. A new method of normal approximation. Ann. Probab. 36 (4), 1584–1610.
Chen, X., 2007. Large sample sieve estimation of semi-nonparametric models. Handb. Econom. 6, 5549–5632.
Chen, X., Linton, O., Van Keilegom, I., 2003. Estimation of semiparametric models when the criterion function is not smooth. Econometrica 71 (5), 1591–1608.
Chen, L.-A., Portnoy, S., 1996. Two-stage regression quantiles and two-stage trimmed least squares estimators for structural equation models. Comm. Statist. Theory Methods 25 (5), 1005–1032.
Chernozhukov, V., Hansen, C., 2006. Instrumental quantile regression inference for structural and treatment effect models. J. Econometrics 132 (2), 491–525.
Chioda, L., Jansson, M., 2009. Optimal invariant inference when the number of instruments is large. Econometric Theory 25 (03), 793–805.
Dobbie, W., Song, J., 2015. Debt relief and debtor outcomes: Measuring the effects of consumer bankruptcy protection. Am. Econ. Rev. 105 (3), 1272–1311.
Doyle, Jr., J.J., 2007. Child protection and child outcomes: Measuring the effects of foster care. Am. Econ. Rev. 97 (5), 1583–1610.
Doyle, Jr., J.J., 2008. Child protection and adult crime: Using investigator assignment to estimate causal effects of foster care. J. Political Econ. 116 (4), 746–770.
El Karoui, N., 2013. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. arXiv:1311.2445.
El Karoui, N., Bean, D., Bickel, P.J., Lim, C., Yu, B., 2013. On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. 110 (36), 14557–14562.
El Karoui, N., Purdom, E., 2018. Can we trust the bootstrap in high-dimensions? The case of linear models. J. Mach. Learn. Res. 19 (1), 170–235.
French, E., Song, J., 2014. The effect of disability insurance receipt on labor supply. Am. Econ. J. Econ. Policy 6 (2), 291–337.
Fuller, W.A., 1977. Some properties of a modification of the limited information estimator. Econometrica 45 (4), 939–953.
Hahn, J., 2002. Optimal inference with many instruments. Econometric Theory 18 (01), 140–168.
Hahn, J., Hausman, J., 2002. A new specification test for the validity of instrumental variables. Econometrica 70 (1), 163–189.
Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50 (4), 1029–1054.
Hansen, C., Hausman, J.A., Newey, W.K., 2008. Estimation with many instrumental variables. J. Bus. Econom. Statist. 26 (4), 398–422.
Hansen, C., Kozbur, D., 2014. Instrumental variables estimation with many weak instruments using regularized JIVE. J. Econometrics 182 (2), 290–308.
Hansen, C., McDonald, J.B., Newey, W.K., 2010. Instrumental variables estimation with flexible distributions. J. Bus. Econom. Statist. 28 (1), 13–25.
Hansen, L.P., Singleton, K.J., 1982. Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 1269–1286.
Hausman, J.A., Newey, W.K., Woutersen, T., Chao, J.C., Swanson, N.R., 2012. Instrumental variable estimation with heteroskedasticity and many instruments. Quant. Econ. 3 (2), 211–255.
Honoré, B.E., Hu, L., 2004. On the performance of some robust instrumental variables estimators. J. Bus. Econom. Statist. 22 (1), 30–39.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Stat. 35 (1), 73–101.
Huber, P.J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. University of California Press, Berkeley, Calif., pp. 221–233.
Huber, P.J., 1973. Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist. 799–821.
Huber, P.J., 1981. Robust Statistics. Wiley.
Jacobsen, M.R., Van Benthem, A.A., 2015. Vehicle scrappage and gasoline policy. Am. Econ. Rev. 105 (3), 1312–1338.
Kline, P., Saggio, R., Sølvsten, M., 2018. Leave-out estimation of variance components. arXiv preprint arXiv:1806.01494.
Kling, J.R., 2006. Incarceration length, employment, and earnings. Am. Econ. Rev. 96 (3), 863–876.
Kolesár, M., 2013. Estimation in an instrumental variables model with treatment effect heterogeneity. Unpublished Working Paper.
Kolesár, M., 2018. Minimum distance approach to inference with many instruments. J. Econometrics 204 (1), 86–100.
Krasker, W.S., Welsch, R.E., 1985. Resistant estimation for simultaneous-equations models using weighted instrumental variables. Econometrica 53 (6), 1475–1488.
Kunitomo, N., 1980. Asymptotic expansions of the distributions of estimators in a linear functional relationship and simultaneous equations. J. Amer. Statist. Assoc. 75 (371), 693–700.
Maestas, N., Mullen, K.J., Strand, A., 2013. Does disability insurance receipt discourage work? Using examiner assignment to estimate causal effects of SSDI receipt. Am. Econ. Rev. 103 (5), 1797–1829.
Morimune, K., 1983. Approximate distributions of k-class estimators when the degree of overidentifiability is large compared with the sample size. Econometrica 51 (3), 821–841.
Newey, W.K., 1990. Efficient instrumental variables estimation of nonlinear models. Econometrica 58 (4), 809–837.
Newey, W.K., 1994. The asymptotic variance of semiparametric estimators. Econometrica 62 (6), 1349–1382.
Newey, W.K., McFadden, D., 1994. Large sample estimation and hypothesis testing. Handb. Econom. 4, 2111–2245.
Newey, W.K., Windmeijer, F., 2009. Generalized method of moments with many weak moment conditions. Econometrica 77 (3), 687–719.
Pakes, A., Pollard, D., 1989. Simulation and the asymptotics of optimization estimators. Econometrica 57 (5), 1027–1057.
Powell, J.L., 1983. The asymptotic normality of two-stage least absolute deviations estimators. Econometrica 51 (5), 1569–1575.
Rothenberg, T.J., 1984. Approximating the distributions of econometric estimators and test statistics. Handb. Econom. 2, 881–935.
Staiger, D., Stock, J., 1997. Instrumental variables regression with weak instruments. Econometrica 65 (3), 557–586.
Steigerwald, D.G., 1992. On the finite sample behavior of adaptive estimators. J. Econometrics 54 (1–3), 371–400.
Wang, W., Kaffo, M., 2016. Bootstrap inference for instrumental variable models with many weak instruments. J. Econometrics 192 (1), 231–268.
