Adaptive nonparametric instrumental variables estimation: Empirical choice of the regularization parameter

Adaptive nonparametric instrumental variables estimation: Empirical choice of the regularization parameter

Journal of Econometrics 180 (2014) 158–173 Contents lists available at ScienceDirect Journal of Econometrics journal homepage: www.elsevier.com/loca...

499KB Sizes 0 Downloads 113 Views

Journal of Econometrics 180 (2014) 158–173

Contents lists available at ScienceDirect

Journal of Econometrics journal homepage: www.elsevier.com/locate/jeconom

Adaptive nonparametric instrumental variables estimation: Empirical choice of the regularization parameter Joel L. Horowitz ∗ Department of Economics, Northwestern University, 2001 Sheridan Road, Evanston, IL 60208, USA

article

info

Article history: Received 19 September 2012 Received in revised form 3 October 2013 Accepted 24 March 2014 Available online 29 March 2014 JEL classification: C13 C14 C21

abstract In nonparametric instrumental variables estimation, the mapping that identifies the function of interest, g, is discontinuous and must be regularized to permit consistent estimation. The optimal regularization parameter depends on population characteristics that are unknown in applications. This paper presents a theoretically justified empirical method for choosing the regularization parameter in series estimation. The method adapts to the unknown smoothness of g and other unknown functions. The resulting estimator of g converges at least as fast as the optimal rate multiplied by (log n)1/2 . The asymptotic integrated mean-square error (AIMSE) of the estimator is within a specified factor of the optimal AIMSE. © 2014 Elsevier B.V. All rights reserved.

Keywords: Ill-posed inverse problem Regularization Series estimation Nonparametric estimation

1. Introduction This paper is about estimating the unknown function g in the model Y = g (X ) + U ;

E (U |W = w) = 0

(1.1)

for almost every w or, equivalently, E [Y − g (X )|W = w] = 0

(1.2)

for almost every w . In this model, g is a function that satisfies regularity conditions but is otherwise unknown, Y is a scalar dependent variable, X is a continuously distributed explanatory variable that may be correlated with U (that is, X may be endogenous), W is a continuously distributed instrument for X , and U is an unobserved random variable. The data are an independent random sample of (Y , X , W ). The paper presents a theoretically justified, empirical method for choosing the regularization parameter that is needed for estimation of g. Existing nonparametric estimators of g in (1.1)–(1.2) can be divided into two main classes: sieve (or series) estimators and kernel estimators. Sieve estimators have been developed by Ai and Chen (2003), Newey and Powell (2003), Blundell et al. (2007), and



Tel.: +847 491 8253; fax: +847 491 7001. E-mail address: [email protected].

http://dx.doi.org/10.1016/j.jeconom.2014.03.006 0304-4076/© 2014 Elsevier B.V. All rights reserved.

Horowitz (2012). Kernel estimators have been developed by Hall and Horowitz (2005) and Darolles et al. (2011). Florens and Simoni (2010) describe a quasi-Bayesian estimator based on kernels. Hall and Horowitz (2005) and Chen and Reiss (2011) found the optimal rate of convergence of an estimator of g. Horowitz (2007) gave conditions for asymptotic normality of the estimator of Hall and Horowitz (2005). Horowitz and Lee (2012) showed how to use the sieve estimator of Horowitz (2012) to construct uniform confidence bands for g. Newey et al. (1999) present a control function approach to estimating g in a model that is different from (1.1)–(1.2) but allows endogeneity of X and achieves identification through an instrument. The control function model is non-nested with (1.1)–(1.2) and is not discussed further in this paper. Chernozhukov et al. (2007), Horowitz and Lee (2007), and Gagliardini and Scaillet (2012) have developed methods for estimating a quantile-regression version of model (1.1)–(1.2). Chen and Pouzo (2009, 2012) developed a method for estimating a large class of nonparametric and semiparametric conditional moment models with possibly non-smooth moments. This class includes the quantile-regression version of (1.1)–(1.2). As is explained further in Section 2 of this paper, the relation that identifies g in (1.1)–(1.2) creates an ill-posed inverse problem. That is, the mapping from the population distribution of (Y , X , W ) to g is discontinuous. Consequently, g cannot be estimated consistently by replacing unknown population quantities in the identifying relation with consistent estimators. To achieve a consistent

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

estimator it is necessary to regularize (or modify) the mapping that identifies g. The amount of modification is controlled by a parameter called the regularization parameter. The optimal value of the regularization parameter depends on unknown population characteristics and, therefore, cannot be calculated in applications. Although there have been proposals of informal rules-of-thumb for choosing the regularization parameter in applications, theoretically justified empirical methods are not yet available. This paper presents an empirical method for choosing the regularization parameter in sieve or series estimation, where the regularization parameter is the number of terms in the series approximation to g. The method consists of optimizing a sample analog of a weighted version of the integrated mean-square error of a series estimator of g. The method does not require a priori knowledge of the smoothness of g or of other unknown functions. It adapts to their unknown smoothness. The estimator of g based on the empirically selected regularization parameter also adapts to unknown smoothness. It converges in probability at a rate that is at least as fast as the asymptotically optimal rate multiplied by (log n)1/2 , where n is the sample size. Moreover, its asymptotic integrated mean-square error (AIMSE) is within a specified factor of the optimal AIMSE. The paper does not address the question of whether the factor of (log n)1/2 can be removed or is an unavoidable price that must be paid for adaptation. This question is left for future research. Section 2 provides background on the estimation problem and the series estimator that is used with the adaptive estimation procedure. This section also reviews the relevant mathematics and statistics literature. The problems treated in that literature are simpler than (1.1)–(1.2). Section 3 describes the proposed method for selecting the regularization parameter. Section 4 presents the results of Monte Carlo experiments that explore the finite-sample performance of the method. Section 5 presents an empirical example, and Section 6 presents concluding comments. All proofs are in the Appendix.

replacing m in g = A−1 m with a consistent estimator, even if A were known. To estimate g consistently, it is necessary to regularize (or modify) A so as to remove the discontinuity of A−1 . A variety of regularization methods have been developed. See, for example, Engl et al. (1996), Kress (1999), and Carrasco et al. (2007), among many others. The regularization method used in this paper is series truncation, which is a modification of the Petrov–Galerkin method that is well-known in the theory of integral equations. See, for example, Kress (1999, pp. 240–245). It amounts to approximating A with finite-dimensional matrix. The singular values of this matrix are bounded away from zero, so the inverse of the approximating matrix is a continuous operator. The details of the method are described further in Section 2.2. 2.2. Sieve estimation and regularization by series truncation The adaptive estimation procedure uses a two-stage estimator that is a modified version of Horowitz’s (2012) sieve estimator of g. The estimator is defined in terms of series expansions of g, m, and A. Let {ψj : j = 1, 2, . . .} be a complete, orthonormal basis for L2 [0, 1]. The expansions are ∞ 

g (x) =

m(w) =

∞ 

2.1. The estimation problem and the need for regularization Let X and W be continuously distributed random variables. Assume that the supports of X and W are [0, 1]. This assumption does not entail a loss of generality, because it can be satisfied by, if necessary, carrying out monotone increasing transformations of X and W . Let fXW and fW , respectively, denote the probability density functions of (X , W ) and W . Define m(w) = E (Y |W = w)fW (w). Let L2 [0, 1] be the space of real-valued, square-integrable functions on [0, 1]. Define the operator A from L2 [0, 1] → L2 [0, 1] by

(Ah)(w) =

 [0,1]

h(x)fXW (x, w)dx,

where h is any function in L2 [0, 1]. Then g in (1.1)–(1.2) satisfies Ag = m. Assume that A is one-to-one, which is necessary for identifica2 tion of g. Then g = A−1 m. If fXW is integrable on [0, 1]2 , then zero is a limit point (and the only limit point) of the singular values of A. Consequently, the singular values of A−1 are unbounded, and A−1 is a discontinuous operator. This is the ill-posed inverse problem. Because of this problem, g could not be estimated consistently by

mk ψk (w),

k=1

and fXW (x, w) =

∞  ∞ 

cjk ψj (x)ψk (w),

j=1 k=1

where

 bj =

[0,1]

g (x)ψj (x)dx,



This section explains the estimation problem and the need for regularization, outlines the sieve estimator that is used with the adaptive estimation procedure, and reviews the statistics literature on selecting the regularization parameter.

bj ψj (x),

j=1

mk =

2. Background

159

[0,1]

m(w)ψk (w)dw,

and

 cjk =

[0,1]2

fXW (x, w)ψj (x)ψk (w)dxdw.

To estimate g, we need estimators of mk , cjk , m, and fXW . Denote the data by {Yi , Xi , Wi : i = 1, . . . , n}, where n is the sample n size. −1 ˆ The estimators of mk and cjk , respectively, are m k = n i=1 Yi ψk  (Wi ) and cˆjk = n−1 ni=1 ψj (Xi )ψk (Wi ). The estimators of m and

ˆ (w) = fXW , respectively, are m

Jn

Jn Jn

k=1

ˆ k ψk (w) and fˆXW (x, w) = m

ˆjk ψj (x)ψk (w), where Jn is a pilot series truncation j=1 k=1 c point that, for now, is assumed to be non-stochastic. It is assumed that as n → ∞, Jn → ∞ at a rate that is specified in Section 3.1. Section 3.3 describes an empirical method for choosing Jn . Define the operator Aˆ that estimates A by ˆ )(w) = (Ah

 [0,1]

h(x)fˆXW (x, w)dx.

The first-stage estimator of g is defined as1

ˆ. g˜ = Aˆ −1 m

(2.2)

1 Aˆ and Aˆ −1 are defined on the subspace spanned by {ψ : j = 1, . . . , J }. j n Under the assumptions of this paper, Aˆ can be represented by a square, non-singular ˆ Eq. (2.2) and this equivalence do matrix, and (2.2) is equivalent to gˆ = (Aˆ ∗ Aˆ )−1 Aˆ ∗ m. not hold if Aˆ is non-square, as can happen if X and W have different dimensions. The row and column dimensions of a non-square Aˆ can be chosen separately, thereby requiring the choice of two regularization parameters. The treatment of this case is beyond the scope of this paper.

160

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

To obtain the second-stage estimator that is used with this paper’s adaptive estimation procedure, let ⟨·, ·⟩ denote the  inner product  in L2 [0, 1]. For j = 1, . . . , Jn , define b˜ j = g˜ , ψj . The b˜ j ’s are the generalized Fourier coefficients of g˜ with the basis functions {ψj } and the truncation point Jn . Let J ≤ Jn be a positive integer. The second-stage estimator of g is gˆJ =

J 

b˜ j ψj .

(2.3)

j =1





If J is chosen to minimize the AIMSE of gˆJ , then gˆJ − g  converges in probability to 0 at the optimal rate of Chen and Reiss (2011). See Proposition A.1 of the Appendix. However, this choice of J is not available in applications because it depends on unknown population parameters. The adaptive estimation procedure consists of choosing J in (2.3) to minimize a sample analog of a weighted version of the AIMSE of gˆJ . This procedure is explained in Section 3.1. Let Jˆ denote the resulting value of J. The adaptive estimator   of g

is gˆJˆ . Under the regularity conditions of Section 3.2, gˆJˆ − g  con-





verges in probability to 0 at a rate that is at least as fast as the optimal rate times (log n)1/2 . Moreover, the AIMSE of gˆJˆ is within a factor of 2 + (4/3) log n of the AIMSE of the infeasible estimator that minimizes the AIMSE of gˆJ . Achieving these results does not require knowledge of the smoothness of g or the rate of convergence of the singular values of A. An alternative to the estimator (2.3) consists of using the estimator (2.2) with Jˆ in place of Jn . However, replacing Jn with Jˆ ˆ to be variables of the causes the lengths of the series in Aˆ and m optimization problem. The methods of proof of this paper do not apply in this case. The asymptotic properties of g˜ with Jˆ in place of Jn are unknown.

but are not necessarily trigonometric functions and the eigenvalues must be estimated from data. Among the settings in the mathematics and statistics literature, this is the closest to the one considered here. Loubes and Marteau obtain non-asymptotic oracle inequalities for the risk of their estimator and show that, for a suitable p > 1, their estimator’s rate of convergence is within a factor of (log n)p of the asymptotically optimal rate. However, the eigenfunctions of A∗ A are not known in econometric applications. Section 3 describes a method for selecting J empirically when neither the eigenvalues nor eigenfunctions of A∗ A are known. In contrast to Loubes and Marteau (2009), the results of this paper are asymptotic. However, Monte Carlo experiments that are reported in Section 4 indicate that the adaptive procedure works well with samples of practical size. Parts of the proofs in this paper are similar to parts of the proofs of Loubes and Marteau (2009). 3. Main results This section begins with an informal description of the method for choosing J. Section 3.2 presents the formal results. 3.1. Description of the method for choosing J Define EA (·) as the mean of the leading term of the asymptotic expansion of the random variable (·). Specifically, if Zn = Z˜n + rn , where Zn , Z˜n , and rn are random variables, E (Z˜n ) exists, and rn = op (Z˜n ) as n → ∞, then EA (Zn ) = E (Z˜n ). Define the asymptotically optimal J as the value that minimizes the asymptotic integrated mean-square error (AIMSE) of gˆJ as an estimator of g. The AIMSE

2



is EA gˆJ − g  . Denote the asymptotically optimal value of J by Jopt . It is shown in Proposition A.1 of the appendix that under the

2



2.3. Review of related mathematics and statistics literature

regularity conditions of Section 3.2, EA gˆJopt − g  converges to zero at the fastest possible rate (Chen and Reiss, 2011). However,

Ill-posed inverse problems in models that are similar to but simpler than (1.1)–(1.2) have long been studied in mathematics and statistics. Two important characteristics of (1.1)–(1.2) are that (1) the operator A is unknown and must be estimated from the data and (2) the distribution of V ≡ Y − E [g (X )|W ], is unknown. The mathematics and statistics literatures contain no methods for choosing the regularization parameter in (1.1)–(1.2) under these conditions. A variety of ways to choose regularization parameters are known in mathematics and numerical analysis. Engl et al. (1996), Mathé and Pereverzev (2003), Bauer and Hohage (2005), Wahba (1977), and Lukas (1993, 1998) describe many. Most of these methods assume that A is known and that the ‘‘data’’ are deterministic or that Var (Y |X = x) is known and independent of x. Such methods are not suitable for econometric applications. Spokoiny and Vial (2009) describe a method for choosing the regularization parameter in an estimation problem in which A is known and V is normally distributed. The resulting estimator of g converges at a rate that is within a factor of (log n)p of the optimal rate for a suitable p > 0. Cohen et al. (2004) and Loubes and Ludeña (2008) also consider a setting in which A is known. Efromovich and Koltchinskii (2001), Cavalier and Hengartner (2005), Hoffmann and Reiss (2008), and Marteau (2006, 2009) consider settings in which A is known up to a random perturbation and, possibly, a truncation error but is not estimated from the data. Johannes and Schwarz (2010) treat estimation of g when the eigenfunctions of A∗ A are known to be trigonometric functions, where A∗ is the adjoint of A. Breunig and Johannes (2013) consider estimation of a linear functional of g. Loubes and Marteau (2009) treat estimation of g when the eigenfunctions of A∗ A are known

EA gˆJ − g  depends on unknown population parameters, so it cannot be minimized in applications. We replace the unknown parameters with sample analogs, thereby obtaining a feasible



2



2

estimator of a weighted version of EA gˆJ − g  . Let Jˆ denote the value of J that minimizes the feasible estimator. Note that Jˆ is a random variable. The adaptive estimator of g is gˆJˆ . The AIMSE of the

 

2 

adaptive estimator is EA gˆJˆ − g  . Under the regularity conditions of Section 3.2, which specify properties of the basis functions and place smoothness restrictions on g,

 

2 

2

EA gˆJˆ − g  ≤ [2 + (4/3) log(n)]EA gˆJopt − g  .



 

(3.1)

 

Thus, the rate of convergence in probability of gˆJˆ − g  is within

a factor of (log n)1/2 of the optimal rate. Moreover, the AIMSE of gˆJˆ is within a factor of 2 + (4/3) log n of the AIMSE of the infeasible optimal estimator gˆJopt . The following notation is used in addition to that already defined. For any positive integer J, let AJ be the operator on L2 [0, 1] that is defined by

(AJ h)(w) =

 [0,1]

h(x)aJ (x, w)dx,

where aJ (x, w) =

J  J  j=1 k=1

cjk ψj (x)ψk (w).

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

Let Jn be the series truncation point defined in Section 2.2. For any x ∈ [0, 1], define

δn (x, Y , X , W ) = [Y − gJn (X )]

Jn 

1 ψk (W )(A− Jn ψk )(x),

and



 1 ∗ δn (·, Y , X , W ), ψj = [Y − gJn (X )][(A− Jn ) ψj ](W ).

Therefore,

k =1

Sn (x) = n

−1

n 

Tn ( J ) = EA

δn (x, Yi , Xi , Wi ),

gJ =

J 

J 

 n

−1

j=1

i=1

and

161

2 n   2 −1 ∗ [Yi − gJn (Xi )][(AJn ) ψj ](Wi ) − gJ  . i=1

It follows from Lemma 3 of the appendix and the assumptions of Section 3.2 that bj ψj .

j =1

EA

The following proposition is proved in the Appendix.

J 

 n

−1

j =1

2 n  −1 ∗ [Yi − gJn (Xi )][(AJn ) ψj ](Wi ) i=1 J

Proposition 1. Let Assumptions 1–6 of Section 3.2 hold. Then, as n → ∞, gˆJ − gJ =

J  

= n− 1 E A

 1 ∗ 2 [Y − gJn (X )]2 {[(A− Jn ) ψj ](W )} . j =1

Therefore,

Sn , ψj ψj + rn ,



j =1

T n ( J ) = n− 1 E A

where

j =1

  J       ∥rn ∥ = op  S ,ψ ψ   j =1 n j j  uniformly over J ≤ Jn .

This is the desired form of Tn (J ). Tn (J ) depends on the unknown parameters gJn and AJn and on the operator EA . Therefore, Tn (J ) must be replaced by an estimator for use in applications. One possibility is to replace gJn , AJn , gJ , and



Now for any J ≤ Jn , 2  2 2   EA gˆJ − g  = EA gˆJ − gJ  + gJ − g  2  2  = EA gˆJ − gJ  + ∥g ∥2 − gJ  .

ˆ gˆJ , and the empirical expectation, respectively. This EA with g˜ , A, gives the estimator T˘n (J ) ≡ n−2



2

J  

Sn , ψj

2

 2 + ∥g ∥2 − gJ  .

j=1

J  

Sn , ψj

2

 2 − gJ  .

 J   2 −1 ∗ 2 ˆ [Yi − g˜ (Xi )] {[(A ) ψj ](Wi )} − gˆJ  . 2

j =1

However, T˘n is unsatisfactory for two reasons. First, it does not

 

2 

ˆ This account for the effect on EA gˆJˆ − g  of the randomness of J.

  2  2    gˆJ  − gJ  = gˆJ − gJ 2 + 2 gJ , gˆJ − gJ .

j =1

Assume that Jn ≥ Jopt . Then because ∥g ∥2 does not depend on J, Jopt = arg min J >0

Tn (J ).

We now put Tn (J ) into an equivalent form that is more convenient for the analysis that follows. Observe that 1 (A− Jn ψk )(x) =

Jn 

Tˆn (J ) ≡ (2/3)(log n)n

where c jk is the (j, k) element of the inverse of the Jn ×Jn matrix [cjk ]. This inverse exists under the assumptions of Section 3.2. Therefore,

k=1

Jn  Jn 

c jk ψk (W )ψj (x)

j=1 k=1

=

Jn 

−2

n 

 [Yi − g˜ (Xi )]2

i =1

c jk ψj (x),

1 ψk (W )(A− Jn ψk )(x) =

(3.2)

The right-hand side of (3.2) is asymptotically non-negligible, so the estimator of Tn must compensate for its effect. It is shown in the Appendix that these problems can be overcome by using the estimator

 J   2 −1 ∗ 2 ˆ × {[(A ) ψj ](Wi )} − gˆJ  .

j =1

Jn 



randomness is the source of the factor of log n on the right-hand side of (3.1). Second, some algebra shows that

Define Tn (J ) = EA

n  i =1

Therefore, it follows from Proposition 1 that EA gˆJ − g  = EA

J   2 1 ∗ 2   . [Y − gJn (X )]2 {[(A− Jn ) ψj ](W )} − gJ

1 ∗ ψj (x)[(A− Jn ) ψj ](W ),

j =1

where ∗ denotes the adjoint operator. It follows that

δn (x, Y , X , W ) = [Y − gJn (X )]

Jn  j=1

1 ∗ ψj (x)[(A− Jn ) ψj ](W )

(3.3)

j =1

Tˆn multiplies the first term of T˘n by a logarithmic factor, which ac-

 2 commodates the randomness of Jˆ and non-negligibility of gˆJ  −  2 gJ  . We obtain Jˆ by solving the problem minimize : Tˆn (J ),

(3.4)

J : 1≤J ≤Jn

where Jn is the truncation parameter used to obtain g˜ , and Jn satisfies Assumption 6 of Section 3.2. Section 3.2 gives conditions under which gˆJˆ satisfies inequality (3.1) The problem of choosing Jn in applications is discussed in Section 3.3.

162

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

3.2. Formal results This section begins with the assumptions under which gˆJˆ is shown to satisfy (3.1). A theorem that states the result formally follows the presentation of the assumptions. Let A∗ denote the adjoint operator of A. Define U = Y − g (X ). For each positive integer J and any positive, increasing sequence {νj : j = 1, 2, . . .}, define the set of functions

   J        −1 HJ ν = h ∈ L2 [0, 1] : h − h, ψj ψj  ≤ νJ .   j =1 

For each positive integer J, define the set of functions

 KJ =

h=

J  j =1

hj ψj :

J 

 h2j

=1

j =1

and the scalar parameter

−1  . ρJ = sup (A∗ A)1/2 ν  ν ∈KJ

The parameter ρJ2 is the inverse of the smallest eigenvalue of the

J × J matrix whose (j, k) element is ℓ=1 cjℓ ckℓ . In addition, ρJ is a generalization of the sieve measure of ill-posedness defined by Blundell et al. (2007).2 The assumptions are as follows.

∞

Assumption 1. (i) The supports of X and W are [0, 1]. (ii) (X , W ) has a bounded probability density function fXW with respect to Lebesgue measure. The probability density function of W , fW , is non-zero almost everywhere on [0, 1]. Assumption 2. (i) There is a finite constant CY such that E (Y 2 |W = w) ≤ CY for each w ∈ [0, 1]. (ii) There are finite constants CU > 0, j −2 cU1 > 0, and cU2 > 0 such that E (|U |j |W = w) ≤ CU j!E (U 2 |W = 2 w) and cU1 ≤ E (U |W = w) ≤ cU2 for every integer j ≥ 2 and w ∈ [0, 1]. Assumption 3. (i) (1.1) has a solution g ∈ L2 [0, 1]. (ii) The estimators g˜ and gˆJ are as defined in (2.2)–(2.3). Assumption 4. The operator A is one-to-one. Assumption 5. (i) The basis functions {ψj } are orthonormal and complete on L2 [0, 1]. (ii) There is a non-decreasing, positive sequence {νj : j = 1, 2, . . .} such that j−s νj is bounded away from 0 for all j and some s > 3, and g ∈ HJ ν for any integer J > 0. If νj increases exponentially fast as j increases, then j−s νj → ∞ for any finite s. (iii) There are constants Cψ and τ with 0 < Cψ < ∞ and 0 ≤ τ < (s − 3)/2 such that sup0≤x≤1 |ψj (x)| ≤ Cψ jτ for each j = 1, 2 , . . . . (iv) There are constants α > 1/2, ε > 0, C < ∞ and ∞ D with j=1 j2α b2j < D < ∞ such that for all δ ∈ (1/2, α),

Assumptions 1 and 2 are smoothness and boundedness conditions. Assumption 3 defines the model and estimators of g. Assumption 4 is required for identification of g. Assumption 5 specifies properties of the basis functions {ψj }. Assumption 5(ii) specifies the accuracy of a truncated series approximation to g and is similar to assumption 2.1 of Chen and Reiss (2011). Assumption 5(ii) is satisfied with νj ∝ js if g has s square-integrable derivatives and the basis functions belong to a class that includes trigonometric functions, Legendre polynomials that are shifted and scaled to be orthonormal on [0, 1], and B-splines that have been orthonormalized by, say, the Gram–Schmidt procedure. Assumption 5(iii) is satisfied by trigonometric functions (τ = 0), shifted and scaled Legendre polynomials (τ = 1/2), and orthonormalized cubic B-splines (τ = 3/2) if s is large enough. Legendre polynomials require s > 4, and cubic B-splines require s > 6. Assumption 5(iv) requires the basis functions to be such that a truncated series approximation to A is sufficiently accurate. The assumption restricts the magnitudes of the off-diagonal elements of the infinite dimensional matrix [cjk ] as j, k → ∞. It is analogous to the diagonality restrictions of Hall and Horowitz (2005) and assumption 6 of Blundell et al. (2007).3 Assumption 6 restricts the rate of increase of Jn as n → ∞ and further restricts the size of τ . Assumption 5 implies that the function m satisfies

  J      mj ψj  ≤ C1 J −1−ε ρJ−1 m −   j=1 for some constants C1 < ∞ and ε > 0. See Lemma 4 in the appendix for a proof. Assumptions 5(ii) and 6 imply that Jn > Jopt for all sufficiently large n. For sequences of positive numbers {an } and {bn }, define an ≍ bn if an /bn is bounded away from 0 and ∞ as n → ∞. The problem of estimating g is said to be mildly ill-posed if ρJ ≍ J r for some finite r > 0 and severely ill posed if ρJ ≍ eβ J for some finite β > 0. Suppose that νj ≍ js ands < ∞. Then  in the mildly ill-posed case, Jopt ∝ n1/(2r +2s+1) and gˆJopt − g  = Op [n−s/(2r +2s+1) ] (Blundell et al., 2007; Chen and  Reiss, 2011). In the severely ill-posed case, Jopt = O(log n) and gˆJopt − g  = Op [(log n)−s ]. Rates approaching the parametric rate Op (n−1/2 ) are possible if s = ∞ but depend on the details of ρJ and the νj ’s. The results of this section hold in the mildly and severely ill-posed cases and for finite and infinite values of s. We now have the following theorem. Theorem 3.1. Let Assumptions 1–6 hold. Then gˆJˆ satisfies inequality (3.1) for all sufficiently large n.  3.3. Choosing Jn in applications

  (AJ − A)h ≤ CJ −1−ε ρJ−1 and sup ∥h∥ h∈HJ δ D   (AJ − A)(gJ − g )   ≤ C ρJ−1 , gJ − g    J J 2δ 2 where HJ δ D = h = j=1 hj ψj : j =1 j hj ≤ D .

Jn in problem (3.4) depends on ρJ and, therefore, is not known in applications. This section describes a way to choose Jn empirically. It is shown that inequality (3.1) holds with the empirically selected Jn . The method for choosing Jn has two parts. The first part consists of specifying a value for Jn that satisfies Assumption 6. This value

2 Blundell et al. (2007) define the sieve measure of ill-posedness as supk ∈HJ :∥h∥=1 [∥(A∗ A)1/2 h∥]−1 , where HJ ⊂ KJ is a Sobolev space. The sieve measure of ill-posedness and ρJ are the same if the eigenvectors of (j, k = 1, . . . , J) are in HJ .

Assumption 6. As n → ∞, (i) ρJn (Jn3 /n)1/2 → 0, (ii) ρJn (Jn4 /n)1/2 → ∞, and (iii) Jn1+4τ /ρJ2n → 0.

∞

ℓ=1 cjℓ ckℓ

3 The following is a simple example of a model that satisfies Assumption 5(iv). 1/2 Let cjj = λj , cjk = 0 if j ̸= k, λj ≤ c1 j−2a , and bj ∝ j−d for constants a, c1 > 0 1/2

and d > 1/2. Then ρJ−1 ≤ c1 J −a , (AJ − A)h = 0 for h ∈ Hjδ D , and ∥(AJ − A)

(gJ − g )∥ ≤ c2 J

−a

≤ c3 ρ J

−1

for constants c2 , c3 > 0.

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

163

depends on the unknown quantity ρJ . The second step consists of replacing ρJ with an estimator.4 To take the first step, define Jn0 by Jn0 = arg min {ρJ2 J 3.5 /n : ρJ2 J 3.5 /n − 1 ≥ 0}. J =1,2,...

(3.5)

Because ρJ2 J 3.5 /n is an increasing function of J, Jn0 is the smallest

integer for which ρJ2 J 3.5 /n ≥ 1. For example, if ρJ = J β for some β > 0, then Jn0 is the integer that is closest to and at least as large as n1/(3.5+2β) . If ρJ = eβ J for some β > 0, then Jn0 = O(log n). Jn0 satisfies Assumption 6(i) and (ii). It satisfies Assumption 6(iii) if τ is not too large, but it is not feasible because it depends on ρJ . We obtain a feasible estimator of Jn0 by replacing ρJ2 in (3.5) with an estimator. To specify the estimator, let Aˆ J (J = 1, 2, . . .) be the operator on L2 [0, 1] whose kernel is aˆ J (x, w) =

J  J 

cˆjk ψj (x)ψk (w);

x, w ∈ [0, 1].

Fig. 1. Graph of g (x).

j=1 k=1

The estimator of ρJ2 is denoted by ρˆ J2 and is defined by

ρˆ J−2

    = inf Aˆ ∗J Aˆ J h . h∈KJ

Thus, ρˆ J−2 is the smallest eigenvalue of Aˆ ∗J Aˆ J . The estimator of Jn0 is Jˆn0 = arg min {ρˆ J2 J 3.5 /n : ρˆ J2 J 3.5 /n − 1 ≥ 0}. J =1,2,...

The main result of this paper is given by the following theorem. Theorem 3.2. Let Assumptions 1–6 hold. Assume that either ρJ ≍ J β (mildly ill-posed case) or ρJ ≍ eβ J (severely ill-posed case) for some finite β > 0. Then (i) P (Jˆn0 = Jn0 ) → 1 as n → ∞. (ii) Let gˆˆ Jˆ be the estimator of g that is obtained by replacing Jn with Jˆn0 in (3.4). Then for all sufficiently large n

2  2    EA gˆˆ Jˆ − g  ≤ [2 + (4/3) log(n)]EA gˆJopt − g  .  Thus the estimator gˆˆ Jˆ that is based on the estimator Jˆn0 satisfies the same inequality as the estimator gˆJˆ that is based on a nonstochastic but infeasible Jn .5 4. Monte Carlo experiments This section describes the results of a Monte Carlo study of the finite-sample performance of gˆJˆ . There are 1000 Monte Carlo replications in each experiment. The basis functions {ψj } are Legendre polynomials that are centered and scaled to be orthonormal on [0, 1]. Jn is chosen using the empirical method that is described in Section 3.3. The Monte Carlo experiments use two different designs. There are 5 experiments with Design 1 and 2 experiments with Design 2.

4 Blundell et al. (2007) proposed estimating the rate of increase of ρ as J increases J by regressing an estimator of log(ρJ ) on log J or J for the mildly and severely illposed cases, respectively. Blundell et al. (2007) did not explain how to use this result to select a specific value of J for use in estimation of g. 5 It is possible that the efficiency of gˆ can be improved by re-estimating its Fourier

In Design 1, the sample size is 1000. Realizations of (X , W ) were generated from the model fXW (x, w) = 1 + 2

cj cos(jπ x) cos(jπ w),

(4.1)

j =1

where cj = 0.7j−1 in experiment 1, cj = 0.6j−2 in experiment 2, cj = 0.52j−4 in experiment 3, cj = 1.3 exp(−0.5j) in experiment 4, and cj = 2 exp(−1.5j) in experiment 5. In all experiments, the marginal distributions of X and W are U [0, 1], and the conditional distributions are unimodal with an arch-like shape. The estimation problem is mildly ill-posed in experiments 1–3 and severely illposed in experiments 4–5. The function g is g ( x) = b0 +

∞ √ 

2

bj cos(jπ x),

(4.2)

j =1

where b0 = 0.5 and bj = j−4 for j ≥ 1. This function is plotted in Fig. 1. The series in (4.1) and (4.2) were truncated at j = 100 for computational purposes. Realizations of Y were generated from Y = E [g (x)|W ] + V , where V ∼ N (0, 0.01). Monte Carlo Design 2 mimics estimation of an Engel curve from the data used in the empirical example of Section 5. The data consist of household-level observations from the British Family Expenditure Survey (FES), which is a popular data source for studying consumer behavior. See Blundell et al. (1993), for example. We use a subsample of 1516 married couples with one or two children and an employed head of household. In these data, X denotes the logarithm of total income, and W denotes the logarithm of annual income from wages and salaries of the head of household. Blundell et al. (2007) and Blundell and Horowitz (2007) discuss the validity of W as an instrument for X . In the Monte Carlo experiment, the dependent variable Y is generated as described below. The experiment mimics repeated sampling from the population that generated the FES data. The data {Xi , Wi : i = 1, . . . , 1516} were transformed to be in [0, 1]2 by using the transformations Xi → (Xi −

min Xj )/( max Xj −

1≤j≤1516

coefficients using a series length of Jˆ instead of truncating the series of length Jˆn0

and

that is g˜ . We do not investigate this possibility here. The resulting estimator, like gˆˆ Jˆ ,

Wi → (Wi −

would have an AIMSE that is within a factor of log n of the optimal AIMSE.

∞ 

1≤j≤1516

min Xj )

1≤j≤1516

min Wj )/( max Wj −

1≤j≤1516

1≤j≤1516

min Wj ).

1≤j≤1516

164

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

The transformed data were kernel smoothed using the kernel K (v) = (15/16)(1 − v 2 )2 I (|v| ≤ 1) to produce a density fFES (x, w; σ ), where σ is the bandwidth parameter. The bandwidths are σ = 0.05 and 0.10, respectively, in experiments 6 and 7. This range contains the cross-validation estimate, σ = 0.07, of the optimal bandwidth for estimating the density of the FES data. Numerical evaluation of ρJ showed that ρJ ∝ eβ J with β = 3.1 when σ = 0.05 and β = 3.4 when σ = 0.10. Thus, the estimation problem is severely ill-posed in the Design 2 experiments. The experiments use g (x) = Φ [(x − 0.5)/0.24], which mimics the share of household expenditures on some good or service. The dependent variable Y is Y = E [g (X )|W ] + V , where (X , W ) is randomly distributed with the density fFES (x, w; σ ) and V ∼ N (0, 0.01) independently of (X , W ). Each experiment consisted repeating the following procedure 1000 times. A sample of n = 1516 observations of (X , Z ) was generated by independent random sampling from the distribution whose density is fFES (x, w; σ ). Then 1516 corresponding observations of Y were generated from (4.2). Finally, g was estimated using the methods described in this paper. The results of the experiments are summarized in Table 1,



2

Fig. 2. Empirical mean of gˆJ − g  as a function of J. The solid line is from experiment 1. The dashed line is from experiment 4.

2  2  which shows the empirical means of gˆJopt − g  and gˆˆ Jˆ − g  , the ratio 

 

R≡



2 

Empirical mean of gˆˆ Jˆ − g 

 ,

2 Empirical mean of gˆJopt − g 



and B ≡ 2 + (4/3) log n, which is the theoretical asymptotic upper bound on R from inequality (3.1). The results show that the differ-

2

 

2 

ences between the empirical means of gˆJopt − g  and gˆˆ Jˆ − g 



are small and that the ratio R is well within the theoretical bound B in all of the experiments. In experiments 3 and 7, R = 1 because Jˆ = Jopt in all Monte Carlo replications.



2

Fig. 2 shows the empirical mean of gˆJ − g  as a function of J for experiments 1 and 4. Denote this quantity by IMSEJ . IMSE1 is large because gˆ1 is a highly biased estimator of g. IMSE5 is large because the variances of the estimated fifth-order Fourier coefficients, b˜ 5 , are large. IMSEJ varies little among J values and experiments when 2 ≤ J ≤ 4. When J is in this range, the variances of the estimated Fourier coefficients are similar among experiments. Increases in the variance of gˆJ as J increases are compensated by decreases in the bias. 5. An empirical example This section presents the estimate of an Engel curve for food that is obtained by applying the methods of Section 3 to the FES data described in Section 4. As in Section 4, the basis functions are Legendre polynomials that are shifted and scaled to be orthonormal on [0, 1]. Jn is chosen using the empirical method of Section 3.3, and Jˆ is chosen by solving (3.4) after replacing Jn with Jˆn0 . Computation of ρˆ J showed that ρˆ J ∝ e3.3J , so the estimation problem is severely ill posed. The estimated Engel curve is shown in Fig. 3. It is nearly linear. This may be surprising, but a test of the hypothesis that the true Engel curve is linear against a nonparametric alternative (Horowitz, 2006) does not reject the hypothesis of linearity (p > 0.1). Similarly, in parametric instrumental variables estimation under the assumption that g is a polynomial function of X , it is not possible to reject the hypothesis that the coefficients of terms of degree higher that one are zero. The result of the specification test of Horowitz

Fig. 3. Estimated Engel curve for food.

(2006) implies that a 90% confidence region centered on the true Engel curve would contain a linear function.6 6. Conclusions This paper has presented a theoretically justified, empirical method for choosing the regularization parameter in nonparametric instrumental variables estimation. The method does not require a priori knowledge of smoothness or other unknown population parameters. The method and the resulting estimator of the unknown function g adapt to the unknown smoothness of g and the density of (X , W ). The results of Monte Carlo experiments indicate that the method performs well with samples of practical size. It is likely that the ideas in this paper can be applied to the multivariate model Y = g (X , Z ) + U, E (U |W , Z ) = 0, where Z is a continuously distributed, exogenous explanatory variable or vector and W is an instrument for the endogenous variable X . This model is more difficult than (1.1)–(1.2), because it requires selecting at least two regularization parameters, one for X and one or

6 The hypotheses that Engel curves for services and other goods are linear also cannot be rejected. The inability to reject linearity of several Engel curves is likely due to the severe ill-posedness of the estimation problem, which prevents estimation of the curves with sufficient accuracy to discriminate between linearity and nonlinearity.

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

165

Table 1 Results of Monte Carlo experiments.

2

 

2 

Expt No.

Design

Empirical mean of gˆJopt − g 

Empirical mean of gˆˆ Jˆ − g 

Ratio of empirical means, R

Theoretical asymptotic upper bound on R

1 2 3 4 5 6 7

1 1 1 1 1 2 2

0.0957 0.0983 0.100 0.0940 0.103 0.0020 0.0029

0.100 0.138 0.100 0.0978 0.203 0.0039 0.0029

1.045 1.400 1.000 1.040 1.977 1.999 1.000

11.2 11.2 11.2 11.2 11.2 11.8 11.8



more for the components of Z . The multivariate model will be addressed in future research. Acknowledgments

Lemma 2. The following hold as n → ∞: sup

   ˆ  (A − AJn )h = Jn1/2 Op (n−1/2 )

(A.4)

sup

   ˆ  (A − AJn )∗ h = Jn1/2 Op (n−1/2 ).

(A.5)

h∈HJ δ D , ∥h∥=1

I thank Xiaohong Chen, Joachim Freyberger, Sokbae Lee, and Vladimir Spokoiny for helpful discussions. This research was supported in part by NSF grant SES-0817552.

and h∈HJ δ D , ∥h∥=1

Appendix This appendix presents proofs of Propositions 1 and A.1, Theorems 3.1 and 3.2, and several lemmas that are used in the proofs of the propositions and theorems. Assumptions 1–6 hold throughout. Define Jn = {J : 1 ≤ J ≤ Jn }. For J ∈ Jn , define S¯n (J ) ≡ n−1 E [Y − gJn (X )]2

Proof.   Only (A.4) is proved. The proof of (A.5) is similar. Let hj = ψj , h . For any h ∈ HJ δD ,

 2 Jn Jn 2      ˆ hj (ˆcjk − cjk ) (A − AJn )h = k=1

J  1 ∗ 2 {[(A− Jn ) ψj ](W )} ,

=

j=1 n

S˜n (J ) ≡ n−2





[Yi − gJn (Xi )]2

i=1



k=1



J

j =1

 Jn   Jn

1 ∗ 2 {[(A− , Jn ) ψj ](Wi )}



j =1

and Sˆn (J ) = n

n 



 J  − 1 ∗ 2 [Yi − g˜ (Xi )] {[(Aˆ ) ψj ](Wi )} .

i=1

j=1

Define Ui = Yi − g (Xi ) (i = 1, . . . , n). We begin with five lemmas that are used in the proof of Proposition 1. Then Propositions 1 and A.1 are proved. Four additional lemmas that are used in the proof of Theorem 3.1 are presented after the proofs of the propositions. Finally, Theorems 3.1 and 3.2 are proved. Lemma 1. Let J ≤ Jn . Then





(A.1)

h∈KJ

 Jn Jn  

and







Jn  (ˆcjk − cjk )2

j =1

2 

 

sup (Aˆ − AJn )h ≤ Op (n−1 )

Jn  Jn 

Jn  1

1 −1 But A− Jn (A − AJn )AJn h = 0 for h ∈ KJ . Therefore,

J  

h∈KJ

Sn , ψj

2

j =1

=

J 

 n

j =1

−1

n 

2 Ui [(AJn ) ψj ](Wi ) −1 ∗

+ rn ,

i=1

where |rn | = bn ρJ2 /n, bn = op (1), and bn does not depend on J. Moreover, there are finite constants M1 and M2 such that

ρ /n ≤ E

J 

so it follows from (A.3) that

M1 J2

 1   ρJ = sup A− Jn h . 

for every J ∈ Jn .

h∈KJ

j2δ

= Jn Op (n−1 ).  Lemma 3. As n → ∞,

(A.3)

j−2δ

k=1 j=1

= Jn Op (n−1 )

(A.2)

1 −1 1 −1 −1 (A−1 − A− AJn (A − AJn )A− Jn )h = −[I + AJn (A − AJn )] Jn h.

  ρJ = sup A−1 h ,

(A.6)

for any h ∈ HJ δ D . In addition, E (ˆcjk − cjk )2 ≤ Cn−1 for some constant C < ∞, every (j, k) and every sequence of Fourier coefficients {cjk }. Therefore cˆjk − cjk = Op (n−1/2 ) for every (j, k) and sequence {cjk }, where. Op (n−1/2 ) does not depend on the sequence. It follows that,

Proof. Only (A.1) is proved. The proof of (A.2) is similar. Let I denote the identity operator in KJn . Note that KJ ⊂ KJn . For h ∈ KJ ,

Jn

,

j

k =1 j =1

j =1

Now,

j2δ

j

Jn  Jn 2   (ˆcjk − cjk )2   ˆ (A − AJn )h ≤ D 2δ

h∈KJ

 −1   −1  A h = A h .

2



j2δ h2j

h∈HJ δ D

1 ∗  sup (A− Jn ) h = ρJ .

cˆjk − cjk

where the last line follows from the Cauchy–Schwarz inequality. ∞ But j=1 j2δ h2j < D. Therefore,

2

1  sup A− Jn h = ρJ

(j hj )



j =1

k=1

−2

δ

j =1

 n

−1

n  i=1

2 Ui [(AJn ) ψj ](Wi ) −1 ∗

≤ M2 (ρJ2 J /n)

166

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

ρJ2 /n. It follows from Markov’s inequality and Assumption 6 that for some positive sequence {bn } with bn → 0 as n → ∞,

Proof. We have n 

S n ( x ) = n− 1

δn (x, Yi , Xi , Wi )

J 

i−1



Jn

=





n

ψj (x) n−1



j =1

1 ∗ [Yi − gJn (Xi )][(A− Jn ) ψj ](Wi ) .

for every j ≤ J with probability arbitrarily close to 1. The lemma follows by combining (A.7) and (A.8). 

i=1

Sn , ψj

2

=

J 

j=1

 n

−1

Lemma 4. There are constants C1 < ∞ and ε > 0 such that

2 n  −1 ∗ [Yi − gJn (Xi )][(AJn ) ψj ](Wi ) .

j =1

  J      mj ψj  ≤ C1 J −1−ε ρJ−1 . m −   j=1

i=1

But n− 1

n 

Proof. Define mJ =

1 ∗ [Yi − gJn (Xi )][(A− Jn ) ψj ](Wi ) = Rnj1 + Rnj2 ,

i=1

m − mJ =

where

∞  ∞ 

n 

(A − AJ )g =

and

bj cjk ψk .

∞  ∞ 

bj cjk ψk +

k=J +1 j=1

Rnj2 = −n−1

n  1 ∗ [gJn (Xi ) − g (Xi )][(A− Jn ) ψj ](Wi ).

J ∞  

bj cjk ψk

k=1 j=J +1

= (m − mJ ) +

i =1

J ∞  

bj cjk ψk .

k=1 j=J +1

E (Rnj1 ) = 0, and

Therefore, it follows from the triangle inequality that

Var (Rnj1 ) = n−1

J 

j =1

  J ∞           m − mJ  ≤ ( A − A J ) g  +  bj cjk ψk  .  k=1 j=J +1 

1 ∗ 2 E σU2 (W ){[(A− J ) ψj ](W )} ,

j =1

where σU2 (w) = E (U 2 |W = w). But σU2 and fW are bounded from above Therefore, use of Lemma 1 yields Var (Rnj1 ) ≤ M2 n−1 J ρJ2

j =1

for some finite constant M2 . Similarly, σU2 and fW are bounded away from 0, so J 

mk ψk . Then

k =1

But

1 ∗ Ui [(A− Jn ) ψj ](Wi )

i =1

J 

J

k=J +1 j=1

Rnj1 = n−1

J 

(A.8)

j =1

Therefore, J  

R2nj2 ≤ bn ρJ2 /n

But

2   2 J J ∞ ∞        bj cjk bc ψ  =   k=1 j=J +1 j jk k  k=1 j =J +1  2 ∞ ∞    2 ≤ bj cjk = A(gJ − g ) . k=1

j =J +1

Therefore, Var (Rnj1 ) ≥ M1 n

ρJ

−1 2

      m − mJ  ≤ (A − AJ )g  + A(gJ − g )     = (A − AJ )gJ  + 2 (A − AJ )(gJ − g ) .

j =1

for finite constant M1 > 0. Therefore, M1 n−1 ρJ2 ≤

J 

Var (Rnj1 ) ≤ M2 n−1 J ρJ2

The lemma now follows from Assumption 5. (A.7)

j =1

for every j ≤ J. In addition, E (Rnj2 ) = −

 [0,1]2

1 ∗ fXW (x, w)[gJn (x) − g (x)][(A− Jn ) ψj ](w)dxdw

  1 ∗ = − A(gJn − g ), (A− Jn ) ψj   1 ∗ = − (A − AJn )(gJn − g ), (A− Jn ) ψj .

 1+ε/2

Lemma 5. Let ε > 0 be as in Assumption 5(iii). Then Jn = op (1).

  g˜ − g 

Proof. Let δ be such that 1/2 < δ < 1/2 + ε/2 and Assumption 5(iii) holds. Define

 

 

ˆ −m ˆ. hˆ = arg min Ah h∈HJn δ D

1+ε/2

Therefore, the Cauchy–Schwarz inequality gives

We show that Jn ∥hˆ − g ∥ = op (1). We further show that this implies that with probability approaching 1 as n → ∞, the constraint h ∈ HJn δ D is not binding. Therefore, hˆ = g˜ with probability

     E (Rnj2 ) ≤ (A − AJ )(gJ − g ) (A−1 )∗ ψj  . n n Jn

approaching 1. It follows that Jn ∥˜g − g ∥ = op (1). By Assumption 5 and the triangle inequality

1+ε/2

But ∥(A − AJn )(gJn − g )∥ = O(ρJ−n 1 Jn−s ) for some s > 3 by Assump-

1 ∗ tion 5, and ∥(A− Jn ) ψj ∥ ≤ ρJ by Lemma 1. Therefore, |E (Rnj2 )| =

O(ρJ ρJ−n 1 Jn−s ) = o(ρJ /n1/2 ) for every J ≤ Jn . Also Var (Rnj2 ) ≤ Jn−2s

      ˆ    h − g  ≤ hˆ − gJn  + gJn − g      = hˆ − gJn  + O(νJ−1 ).

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

167

We show that as n → ∞,

Now

    ˆ    h − gJn  ≤ ρJn A(hˆ − gJn )   ˆ ) − (Aˆ − A)hˆ = ρJn (Aˆ hˆ − m

P

 Jn 

 ˆ < D → 1.

j2δ h2j

(A.10)

j =1

 ˆ − m) − A(gJn − g ) − (Ag − m) + (m     ˆ ) − (Aˆ − A)hˆ + (m ˆ − m) − A(gJn − g ) = ρJn (Aˆ hˆ − m         ˆ ) + ρJn (Aˆ − A)hˆ  ≤ ρJn (Aˆ hˆ − m     ˆ − m + ρJn A(gJn − g ) . + ρJn m

To do this, observe that Jn 

j2δ hˆ 2j =

Jn 

j =1

j2δ [bj + (hˆ j − bj )]2 .

j =1

It follows from Minkowski’s inequality and Assumption 5 that

 Jn 

Now

1/2 2δ

j [bj + (hˆ j − bj )]2

 ≤

Jn  j =1

j=1

       ˆ      (A − A)hˆ  ≤ (Aˆ − AJn )hˆ  + (AJn − A)hˆ  1/2

= Op (Jn /n)

+ O(Jn

1/2 j2δ b2j

 Jn 

+

ρJn )

1/2 2δ

j (hˆ j − bj )

2

.

j=1

−1−ε −1

by Lemma 2 and Assumption 5. In addition, standard arguments  ˆ − m = Op [(Jn /n)1/2 ] + combined with Lemma 4 show that m

But



ρJn ). Therefore,        ˆ   ˆ  + ρJn A(gJn − g ) h − gJn  ≤ ρJn Aˆ hˆ − m

O(Jn

−1−ε −1

Jn 

1/2 j2δ b2j

 +

j =1

Jn 

1/2 2δ

j (hˆ j − bj )

2

j =1

    < D + Jnδ hˆ − g  = D + Jn−1−ε/2+δ op (1) = D + op (1).

+ Op [ρJn (Jn /n)1/2 ] + O(Jn−1−ε ).

It follows that as n → ∞,

Now Assumption 5 implies that P (˜g = hˆ ) → 1. 

    A(gJ − g ) = (A − AJ )(gJ − g ) = O(ν −1 ρ −1 ). n n n Jn Jn

(A.11)

Therefore,

Proof of Proposition 1. We have

     ˆ   ˆ  + Op [ρJn (Jn /n)1/2 ] + O(Jn−1−ε ). h − gJn  ≤ ρJn Aˆ hˆ − m

ˆ. AJn g˜ + (Aˆ − AJn )˜g = m Therefore

Now

1 1 ˆ ˆ − A− g˜ = A− g Jn m Jn (A − AJn )˜

     ˆ  ˆ ˆ ˆ ˆ  ≤ Ag −m Ah − m     ˆ − m) = (Aˆ − A)g + (Ag − m) − (m       ˆ − m ≤ (Aˆ − A)g  + m

1 1 ˆ −1 ˆ ˆ − A− = A− g − gJn ). Jn m Jn (A − AJn )gJn − AJn (A − AJn )(˜

It follows that 1 1ˆ ˆ − A− g˜ − gJn = A− Jn m Jn AgJn − Rn , 1 ˆ 1 ˆ− where Rn = A− g − gJn ). Some algebra shows that A− Jn (A − AJn )(˜ Jn m

= Op [(Jn /n)1/2 ] + O(Jn−1−ε ρJ−n 1 ).

1ˆ A− Jn AgJn = Sn . Therefore,

Therefore,

  ˆ  h − gJn  = Op [ρJn (Jn /n)1/2 ] + O(Jn−1−ε )

gˆJ − gJ =

J  

Sn , ψj ψj −



j =1

and

It follows by combining this result with Assumptions 5 and 6 that (A.9)

rn = −

We now that the constraint hˆ ∈ HJn δ D does not bind. Let  show 

hˆ =

Jn  j =1



j =1

J  

Rn , ψj ψj .



j =1

Now

 hˆ j = hˆ , ψj (j = 1, . . . , Jn ). Then

Rn , ψj ψj ,

and

  ˆ  h − g  = Op [ρJn (Jn /n)1/2 ] + O(Jn−1−ε ) + O(νJ−n 1 ).

    Jn1+ε/2 hˆ − g  = op (1).

J  

   1 ˆ ψj , Rn = ψj , A− g − gJn ) Jn (A − AJn )(˜   1 ∗ ˆ = (A− ) ψ , ( A − A )(˜ g − g ) . j J J n n Jn

The Cauchy–Schwarz inequality gives hˆ j ψj .

       ψj , Rn  ≤ (A−1 )∗ ψj   (Aˆ − AJn )(˜g − gJn ) . J n

168

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

Consider, first, convergence of S˜n1 (J )/Sn (J ). By Lemma 1,

Therefore,

     ψj , Rn  = ρJ Op [(Jn /n)1/2 ] g˜ − gJ 

 −1 ∗ 2 (A ) ψj  ≤ ρ 2 J Jn

by Lemmas 1 and 2 and (A.11). It follows that

for every j ≤ J, J ∈ Jn . This result together with Assumption 5(iii) 1 ∗ 2 implies that {[(A− ≤ c1 Jn1+2τ ρJ2 for some constant Jn ) ψj ](w)} c1 < ∞ and every j ≤ J and J ∈ Jn . Define

n

 ∥r n ∥ =

J  

1/2 Rn , ψj

2

1 ∗ 2 Kj (w) = {[(A− Jn ) ψj ](w)} .

j =1

=J

1/2

ρJ Op (n

−1/2

)

Jn1/2

  g˜ − gJ 

Then

n

  1/2 for every J ∈ Jn . But J 1/2 Jn g˜ − gJn  = op (1) for every J ∈ Jn by Lemma 5. Therefore,

n 

S˜n1 (J ) = n−2

Ui2

i=1

∥rn ∥ = ρJ Op (n−1/2 )op (1).

J 

Kj (Wi )

j =1

for every J ∈ Jn . Moreover,

The proposition follows by combining this result with Lemma 3. 

 K¯ nJ (w) ≡

ρJ2

 −1 n− 1

n

 2 Proposition A.1. For any J ≤ Jn , gˆJ − g  = Op (ρJ2 J /n + νJ−2 ).

J 

Kj (w) ≤ c1 Jn2+2τ .

j =1

Now let an = n for some constant d > 0 such n1−2d /Jn4+4τ → ∞ as n → ∞. Such a d exists under Assumption 6. Let Bn denote the event max1≤i≤n Ui2 ≤ an . Let B¯ n denote the complement of Bn . It follows from Markov’s inequality that P (B¯ n ) → 0 as n → ∞. We have d

Proof. It follows from Proposition 1 that for or any J ≤ Jn ,

2



EA gˆJ − gJ  = E

J  

Sn , ψj

2

.

(A.12)

j=1

Applying Lemma 3 to the right-hand side of (A.12) yields

2



EA gˆJ − gJ  = E

J  



ρJ2

 −1

n Sn , ψj

2

S˜n1 (J ) = n−1

n 

Ui2 K¯ nJ (Wi )I (Bn )

i=1

≤ M2 ρJ2 J /n.

+ n− 1

j=1

n 

Ui2 K¯ nJ (Wi )I (B¯ n ),

i=1

In addition,

where I is the indicator function. Define

2

2

2

EA gˆJ − g  = EA gˆJ − gJ  + gJ − g  .







 2 By Assumption 4, gJ − g  ≤ νJ−2 . Therefore, 2

Choosing J to optimize this rate gives Chen’s and Reiss’s (2011) minimax optimal rate for functions in HJ ν . The conclusion of the lemma follows from Markov’s inequality. Lemma 6. Given any ε > 0,

S˜n1b (J ) = n−1





 −1

n

|S˜n1 (J ) − E S˜n1 (J )| > 2ε 

J ∈Jn

n 

Ui2

i =1

J  1 ∗ 2 {[(A− Jn ) ψj ](Wi )} , j =1

n 

Now



 ˜ ˜ P max |Sn1b (J ) − E Sn1b (J )| > ε → 0

(A.13)

J ∈Jn

Ui [gJn (Xi ) − g (Xi )]

i=1

J  1 ∗ 2 {[(A− Jn ) ψj ](Wi )} , j=1

as n → ∞ because P (B¯ n ) → 0. Now consider S˜n1a . By Hoeffding’s inequality P [|S˜n1a (J ) − E (S˜n1a (J )|Bn )| > ε|Bn ] ≤ 2 exp[−2ε 2 n/(c12 Jn4+4τ a2n )]

and S˜n3 (J ) = n

ρJ2

  ≤ P max |S˜n1a (J ) − E S˜n1a (J )| > ε J ∈Jn   + P max |S˜n1b (J ) − E S˜n1b (J )| > ε .

Proof. Define

S˜n2 (J ) = −2n−2

Ui2 K¯ nJ (Wi )I (B¯ n ).

i =1

J ∈Jn

for each J ∈ Jn with probability approaching 1 as n → ∞.

n 

For any ε > 0, P max

   S˜ (J ) − S¯ (J )  n  n   ≤ε   S¯n (J )

S˜n1 (J ) = n

Ui2 K¯ nJ (Wi )I (Bn )

i =1



−2

n 

and

EA gˆJ − g  = O(ρJ2 J /n + νJ−2 ).



S˜n1a (J ) = n−1

−2

J n   1 ∗ 2 [gJn (Xi ) − g (Xi )]2 {[(A− Jn ) ψj ](Wi )} . i =1

≤ 2 exp[−2ε 2 n1−2d /(c12 Jn4+4τ )] for every J ∈ Jn . Therefore,

j =1



Then S˜n (J ) = S˜n1 (J ) + S˜n2 (J ) + S˜n3 (J ).

P max |S˜n1a (J ) − E (S˜n1a (J ))| > ε|Bn J ∈Jn

≤ 2Jn exp[−2ε 2 n1−2d /(c12 Jn4+4τ )].



J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173



 1/(2r +2)

In the mildly ill-posed case, Jn = o n with r ≥ 2. In the severely ill-posed case, Jn = o(log n). Therefore,



P max |S˜n1a (J ) − E (S˜n1a (J ))| > ε|Bn

 →0

(A.14)

J ∈Jn

as n → ∞. Because P (Bn ) → 1, it follows from (A.13) and (A.14) that





P max J ∈Jn

ρJ2



 −1

Proof. Define 1 ∗ 2 Kj (w) = {[(A− Jn ) ψj ](w)} ,

Kˆ j (w) = {[(Aˆ −1 )∗ ψj ](w)}2 ,

∆g (x) = g˜ (x) − gJn (x), ∆Kj (w) = Kˆ j (w) − Kj (w), ∆Sn1 (J ) = −2n−2

J n   [Yi − gJn (Xi )]∆g (Xi ) Kj (Wi ), i =1

|S˜n1 (J ) − E (S˜n1 )| > ε → 0.

n

169

j =1

J n   ∆Sn2 (J ) = n−2 [∆g (Xi )]2 Kj (Wi ),

But E S˜n1 (J ) = S¯n (J )[1 + o(1)] and S¯n (J ) ≥ C ρJ2 /n for some finite constant C uniformly over J ∈ Jn . Therefore,

i =1

j =1

J   [Yi − gJn (Xi )]2 ∆Kj (Wi ), n

∆Sn3 (J ) = n−2

i =1

S˜n1 (J ) − S¯n (J ) S¯n (J )

≤ε

(A.15)

with probability approaching 1 as n → ∞ for any ε > 0 and every J ∈ Jn . Now consider S˜n2 (J )/S¯n (J ). Assumption 5 implies that |gJn (x) − g (x)| = o(Jn−2−2τ ) as n → ∞ uniformly over x ∈ [0, 1] for some constant c2 < ∞. Therefore,

|S˜n2 (J )| ≤ o(Jn−2−2τ )n−2

n 

J  1 ∗ 2 |Ui | {[(A− Jn ) ψj ](Wi )} .

i=1

|S˜n2 (J )| ≤ o(1)ρJ2 n−2

[Yi − gJn (Xi )]∆g (Xi )

J 

i=1

∆Kj (Wi ),

j =1

and

∆Sn5 (J ) = n−2

J n   [∆g (Xi )]2 ∆Kj (Wi ). i =1

j =1

Then Sˆn (J ) − S˜n (J ) =

j =1

5 

∆Snk (J ).

k=1

1 ∗ 2 Now {[(A− ≤ c3 Jn1+2τ ρJ2 for every j ≤ J and some Jn ) ψj ](w)} constant c3 < ∞. Therefore, n 

∆Sn4 (J ) = −2n−2

j =1 n 

Because S¯n (J ) ≥ C ρJ2 /n for some constant C < ∞, it suffices to prove that max(ρJ2 /n)−1 |Sˆn (J ) − S˜n (J )| ≤ ε J ∈J

|Ui |

with probability approaching 1 as n → ∞. Consider ∆Sn1 . We have ∥˜g − gJn ∥ = Op [ρJn (Jn /n)1/2 + νJ−n 1 ] =

i=1

uniformly over J ∈ J . It follows from the strong law of large numbers that

|S˜n2 (J )| ≤ o(1)ρJ2 n−1 [E (|U |) + op (1)] for every J ∈ Jn . Therefore,

|S˜n2 (J )| = c4 o(1)[E (|U |) + op (1)] = op (1) S¯n (J )

1 ∗ Op [ρJn (Jn /n)1/2 ] under Assumption 6. In addition, by {[(A− Jn ) ψj ]

(w)}2 ≤ c2 Jn1+2τ ρJ2 for some constant c2 < ∞ and each J ∈ Jn and w ∈ [0, 1]. Moreover, E [Kj (W )] ≤ c3 ρJ2 for some constant c3 < ∞. Therefore, |∆Sn1 (J )| = Op [ρJn (Jn /n)1/2 ]n−2

|Yi − gJn (Xi )|

i=1

(A.16)

× |S˜n3 (J )| = o(1)ρJ2 n−1

n 

|Ui − [gJn (Xi ) − g (Xi )]|

J 

Kj (Wi )

j =1

 = Op [ρJn (Jn /n)1/2 ] n−2

n  i=1

(A.17)

for every J ∈ Jn . The lemma follows by combining (A.15)–(A.17). 

+ Op (νJn )n −1

−2

J n  

|Ui |

J 

Kj (Wi )

j =1

 Kj (Wi ) .

i =1 j =1

Using E [Kj (W )] ≤ c3 ρJ2 , it follows from Markov’s inequality that

Lemma 7. Given any ε > 0,

|∆Sn1 (J )| = Op [ρJn (Jn /n)1/2 ](ρJ2 J /n)[1 + op (1)]

   Sˆ (J ) − S˜ (J )  n  n   ≤ε   S¯n (J )

(ρJ2 /n)−1 |∆Sn1 (J )| = Op [ρJn (Jn /n)1/2 ]J [1 + op (1)]

for each J ∈ Jn with probability approaching 1 as n → ∞.

Kj (Wi )

j =1

i =1

for some constant C < ∞ for every J ∈ Jn , so

J 

= Op [ρJn (Jn /n)1/2 ]n−2

for every J ∈ Jn . A similar argument gives

|S˜n3 (J )| = o(1) S¯n (J )

n 

for every J ∈ Jn , where op (1) does not depend on J. But ρJn (Jn /n)1/2 = o(1) by Assumption 6. Therefore,

= Op [ρJn (Jn3 /n)1/2 ]

170

J.L. Horowitz / Journal of Econometrics 180 (2014) 158–173

for every $J \in \mathcal{J}_n$. It follows that for any $\varepsilon > 0$,

$$ (\rho_J^2/n)^{-1}\,|\Delta S_{n1}(J)| < \varepsilon \tag{A.18} $$

with probability approaching 1 as $n \to \infty$. A similar argument shows that

$$ (\rho_J^2/n)^{-1}\,|\Delta S_{n2}(J)| < \varepsilon \tag{A.19} $$

with probability approaching 1 for every $J \in \mathcal{J}_n$. Now consider $\Delta S_{n3}(J)$. We have

$$ \begin{aligned} \Delta S_{n3}(J) &= n^{-2}\sum_{i=1}^{n}\{U_i - [g_{J_n}(X_i) - g(X_i)]\}^2 \sum_{j=1}^{J}\Delta K_j(W_i) \\ &= n^{-2}\sum_{i=1}^{n} U_i^2 \sum_{j=1}^{J}\Delta K_j(W_i) - 2n^{-2}\sum_{i=1}^{n} U_i[g_{J_n}(X_i) - g(X_i)]\sum_{j=1}^{J}\Delta K_j(W_i) \\ &\quad + n^{-2}\sum_{i=1}^{n}[g_{J_n}(X_i) - g(X_i)]^2 \sum_{j=1}^{J}\Delta K_j(W_i) \\ &\equiv \Delta S_{n3a}(J) + \Delta S_{n3b}(J) + \Delta S_{n3c}(J). \end{aligned} $$

Some algebra shows that

$$ \Delta K_j = 2[(A_{J_n}^{-1})^{*}\psi_j]\{[(\hat{A}^{-1})^{*} - (A_{J_n}^{-1})^{*}]\psi_j\} + \{[(\hat{A}^{-1})^{*} - (A_{J_n}^{-1})^{*}]\psi_j\}^2. $$

Moreover,

$$ (\hat{A}^{-1})^{*} - (A_{J_n}^{-1})^{*} = -[I + (A_{J_n}^{-1})^{*}\Delta A^{*}]^{-1}(A_{J_n}^{-1})^{*}(\Delta A^{*})(A_{J_n}^{-1})^{*}, $$

where $I$ is the identity operator on $\mathcal{K}_{J_n}$ and $\Delta A^{*} = \hat{A}^{*} - A_{J_n}^{*}$. Now

$$ \|\Delta A^{*}\| = \|\hat{A} - A_{J_n}\| = O_p[(J_n/n)^{1/2}] $$

for every $J \in \mathcal{J}_n$ by Lemma 2. Therefore, it follows from Lemma 1 and Markov's inequality that

$$ \|[(\hat{A}^{-1})^{*} - (A_{J_n}^{-1})^{*}]\psi_j\| = \rho_{J_n}^2\, O_p[(J_n/n)^{1/2}], \qquad \|\Delta K_j\| = \rho_{J_n}^3\, O_p[(J_n/n)^{1/2}], $$

and

$$ \Delta S_{n3a}(J) = \rho_J^3 J\, O_p(J_n^{1/2} n^{-3/2}) $$

for every $J \in \mathcal{J}_n$. It follows that $(\rho_J^2/n)^{-1}\Delta S_{n3a}(J) = O_p[\rho_J J(J_n/n)^{1/2}] = o_p(1)$, where the $o_p(1)$ term does not depend on $J$ and the last equality follows from Assumption 6. Similar arguments apply to $\Delta S_{n3b}$, $\Delta S_{n3c}$, $\Delta S_{n4}$, and $\Delta S_{n5}$. The lemma follows by combining these results with (A.18) and (A.19). □

Lemma 8. The following inequality holds for every $J \in \mathcal{J}_n$ as $n \to \infty$:

$$ \|\hat{g}_J - g_J\|^2 \le (4/3)(\log n)\,\bar{S}_n(J)[1 + o_p(1)], $$

where the $o_p(1)$ term does not depend on $J$.

Proof. Let $s_{nJ}$ denote the leading term of the asymptotic expansion of $\|\hat{g}_J - g_J\|^2$. By Proposition 1 and Lemma 3,

$$ s_{nJ} = \sum_{j=1}^{J}\left\{ n^{-1}\sum_{i=1}^{n} U_i[(A_{J_n}^{-1})^{*}\psi_j](W_i)\right\}^2. \tag{A.20} $$

Define

$$ R_{nj} = n^{-1}\sum_{i=1}^{n} V_{ij}, \qquad V_{ij} = U_i[(A_{J_n}^{-1})^{*}\psi_j](W_i), \qquad \sigma_j^2 = E\{U_i[(A_{J_n}^{-1})^{*}\psi_j](W_i)\}^2, $$

and

$$ \xi_{nj} = [(4/3)\sigma_j^2 n^{-1}\log n]^{1/2}. $$

Then $s_{nJ} = \sum_{j=1}^{J} R_{nj}^2$. By Bernstein's inequality (van de Geer, 2000, Lemma 5.7),

$$ P(|R_{nj}| > \xi_{nj}) \le 2\exp\left(-\frac{n\xi_{nj}^2}{4\sigma_j^2 + 2cJ^{(1+2\tau)/2}\xi_{nj}}\right) $$

for some finite constant $c > 0$ that does not depend on $j$. But $\sigma_j^2$ is bounded away from 0 for all $j$. Therefore,

$$ P(|R_{nj}| > \xi_{nj}) \le 2\exp\left(-\frac{(4/3)\log n}{4 + \varepsilon/3}\right) = 2n^{-4/(12+\varepsilon)} $$

for any $\varepsilon > 0$, all sufficiently large $n$, and all $j \le J$. Moreover,

$$ P\left(\bigcup_{j=1}^{J_n}\{|R_{nj}| > \xi_{nj}\}\right) \le \sum_{j=1}^{J_n} P(|R_{nj}| > \xi_{nj}) \le 2J_n n^{-4/(12+\varepsilon)} \to 0 $$

as $n \to \infty$ if $\varepsilon$ is sufficiently small, where the last relation follows from Assumption 6 and the observation that $\rho_{J_n}$ increases at least as fast as $J_n^{\beta}$ for some $\beta > 0$. Combining this result with (A.20) gives

$$ s_{nJ} \le \sum_{j=1}^{J}\xi_{nj}^2 = (4/3)n^{-1}(\log n)\sum_{j=1}^{J}\sigma_j^2 = (4/3)(\log n)\,\bar{S}_n(J) $$

with probability approaching 1, and the lemma follows because $\|\hat{g}_J - g_J\|^2 = s_{nJ}[1 + o_p(1)]$ for every $J \in \mathcal{J}_n$, where the $o_p(1)$ term does not depend on $J$. □

Lemma 9. The following inequality holds:

$$ \|\hat{g}_{\hat{J}}\|^2 - \|g_{\hat{J}}\|^2 \le 3\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 0.5\|g_{\hat{J}} - g\|^2 + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

Proof. The proof of this lemma is similar to the proof of Lemma 3.4(ii) of Loubes and Marteau (2009). We have

$$ \|\hat{g}_{\hat{J}}\|^2 = \|(\hat{g}_{\hat{J}} - g_{\hat{J}}) + g_{\hat{J}}\|^2 = \|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 2\langle g_{\hat{J}}, \hat{g}_{\hat{J}} - g_{\hat{J}}\rangle + \|g_{\hat{J}}\|^2. $$

Therefore,

$$ \|\hat{g}_{\hat{J}}\|^2 - \|g_{\hat{J}}\|^2 = \|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 2\langle g_{\hat{J}}, \hat{g}_{\hat{J}} - g_{\hat{J}}\rangle. $$

Define $\Sigma_{J_{\mathrm{opt}}:\hat{J}} = \sum_{j=J_{\mathrm{opt}}+1}^{\hat{J}}$ if $\hat{J} > J_{\mathrm{opt}}$, $-\sum_{j=\hat{J}+1}^{J_{\mathrm{opt}}}$ if $\hat{J} < J_{\mathrm{opt}}$, and 0 if $\hat{J} = J_{\mathrm{opt}}$. Then

$$ 2\langle g_{\hat{J}}, \hat{g}_{\hat{J}} - g_{\hat{J}}\rangle = 2\sum_{j=1}^{\hat{J}} b_j(\tilde{b}_j - b_j) = 2\sum_{j=1}^{J_{\mathrm{opt}}} b_j(\tilde{b}_j - b_j) + 2\,\Sigma_{J_{\mathrm{opt}}:\hat{J}}\, b_j(\tilde{b}_j - b_j) = 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle + 2\,\Sigma_{J_{\mathrm{opt}}:\hat{J}}\, b_j(\tilde{b}_j - b_j) $$

and

$$ \|\hat{g}_{\hat{J}}\|^2 - \|g_{\hat{J}}\|^2 = \|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 2\,\Sigma_{J_{\mathrm{opt}}:\hat{J}}\, b_j(\tilde{b}_j - b_j) + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. \tag{A.21} $$

Define $R_n = 2\,\Sigma_{J_{\mathrm{opt}}:\hat{J}}\, b_j(\tilde{b}_j - b_j)$. Then

$$ |R_n| \le 2\sum_{j=1}^{\infty} |I(j \le \hat{J}) - I(j \le J_{\mathrm{opt}})|\,|b_j(\tilde{b}_j - b_j)|, $$

where $I(\cdot)$ is the indicator function. But

$$ |I(j \le \hat{J}) - I(j \le J_{\mathrm{opt}})| = [I(j \le \hat{J}) + I(j \le J_{\mathrm{opt}})]\,|I(j \le \hat{J}) - I(j \le J_{\mathrm{opt}})| \le I(j \le \hat{J})I(j > J_{\mathrm{opt}}) + I(j \le J_{\mathrm{opt}})I(j > \hat{J}). $$

Therefore,

$$ |R_n| \le 2\sum_{j=1}^{\infty} I(j \le \hat{J})I(j > J_{\mathrm{opt}})|b_j(\tilde{b}_j - b_j)| + 2\sum_{j=1}^{\infty} I(j \le J_{\mathrm{opt}})I(j > \hat{J})|b_j(\tilde{b}_j - b_j)|. $$

By the Cauchy–Schwarz inequality,

$$ |R_n| \le 2\left(\sum_{j=J_{\mathrm{opt}}}^{\infty} b_j^2\right)^{1/2}\left(\sum_{j=1}^{\hat{J}}(\tilde{b}_j - b_j)^2\right)^{1/2} + 2\left(\sum_{j=\hat{J}}^{\infty} b_j^2\right)^{1/2}\left(\sum_{j=1}^{J_{\mathrm{opt}}}(\tilde{b}_j - b_j)^2\right)^{1/2}. $$

In addition, $2ab \le a^2/2 + 2b^2$ for any real numbers $a$ and $b$. Therefore,

$$ \begin{aligned} |R_n| &\le 0.5\sum_{j=J_{\mathrm{opt}}}^{\infty} b_j^2 + 0.5\sum_{j=\hat{J}}^{\infty} b_j^2 + 2\sum_{j=1}^{\hat{J}}(\tilde{b}_j - b_j)^2 + 2\sum_{j=1}^{J_{\mathrm{opt}}}(\tilde{b}_j - b_j)^2 \\ &= 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 0.5\|g_{\hat{J}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2. \end{aligned} \tag{A.22} $$

The lemma follows by substituting (A.22) into (A.21). □


Proof of Theorem 3.1. Define $a_n = (2/3)\log n$ and

$$ \hat{Q}_n(J) = \hat{T}_n(J) + \|g\|^2 = a_n \hat{S}_n(J) + \|g\|^2 - \|\hat{g}_J\|^2. $$

Then $\hat{J}$ minimizes $\hat{Q}_n(J)$ over $J \in \mathcal{J}_n$. By Lemmas 6 and 7,

$$ \hat{Q}_n(J) = a_n \bar{S}_n(J)[1 + o_p(1)] + \|g\|^2 - \|\hat{g}_J\|^2 $$

for all $J \in \mathcal{J}_n$, where the $o_p(1)$ term does not depend on $J$, so

$$ \hat{Q}_n(\hat{J}) = a_n \bar{S}_n(\hat{J})[1 + o_p(1)] + \|g\|^2 - \|\hat{g}_{\hat{J}}\|^2. $$

It follows that

$$ a_n \bar{S}_n(\hat{J})[1 + o_p(1)] + \|g\|^2 - \|g_{\hat{J}}\|^2 = \hat{Q}_n(\hat{J}) + \|\hat{g}_{\hat{J}}\|^2 - \|g_{\hat{J}}\|^2. $$

An application of Lemma 9 and the observation that $\|g_{\hat{J}} - g\|^2 = \|g\|^2 - \|g_{\hat{J}}\|^2$ give

$$ a_n \bar{S}_n(\hat{J})[1 + o_p(1)] + \|g\|^2 - \|g_{\hat{J}}\|^2 \le \hat{Q}_n(\hat{J}) + 3\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 0.5(\|g\|^2 - \|g_{\hat{J}}\|^2) + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

It follows from Proposition 1 that $a_n \bar{S}_n(\hat{J})$ dominates $\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2$ as $n \to \infty$. Moreover, $a_n \to \infty$ as $n \to \infty$. Therefore,

$$ a_n \bar{S}_n(\hat{J})[1 + o_p(1)] + 0.5(\|g\|^2 - \|g_{\hat{J}}\|^2) \le \hat{Q}_n(\hat{J}) + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

By Lemma 8 and the definition of $a_n$,

$$ a_n[(4/3)\log n]^{-1}\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 = 0.5\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 \le a_n \bar{S}_n(\hat{J})[1 + o_p(1)]. $$

Therefore,

$$ 0.5\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2[1 + o_p(1)] + 0.5(\|g\|^2 - \|g_{\hat{J}}\|^2) \le \hat{Q}_n(\hat{J}) + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

Now

$$ 0.5\|\hat{g}_{\hat{J}} - g_{\hat{J}}\|^2 + 0.5(\|g\|^2 - \|g_{\hat{J}}\|^2) = 0.5\|\hat{g}_{\hat{J}}\|^2 + 0.5\|g\|^2 - \langle\hat{g}_{\hat{J}}, g_{\hat{J}}\rangle = 0.5\|\hat{g}_{\hat{J}} - g\|^2 + \langle\hat{g}_{\hat{J}}, g - g_{\hat{J}}\rangle = 0.5\|\hat{g}_{\hat{J}} - g\|^2, $$

where the last equality follows from the observation that $\hat{g}_{\hat{J}}$ and $g - g_{\hat{J}}$ are linear functions of different and, therefore, orthogonal $\psi_j$'s. It follows that

$$ 0.5\|\hat{g}_{\hat{J}} - g\|^2[1 + o_p(1)] \le \hat{Q}_n(\hat{J}) + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

But $\hat{Q}_n(\hat{J}) \le \hat{Q}_n(J_{\mathrm{opt}})$, so

$$ 0.5\|\hat{g}_{\hat{J}} - g\|^2[1 + o_p(1)] \le \hat{Q}_n(J_{\mathrm{opt}}) + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2 + 2\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle. $$

In addition,

$$ \hat{Q}_n(J_{\mathrm{opt}}) = a_n \hat{S}_n(J_{\mathrm{opt}}) + \|g\|^2 - \|\hat{g}_{J_{\mathrm{opt}}}\|^2. $$

Therefore, by Lemmas 6 and 7,

$$ \hat{Q}_n(J_{\mathrm{opt}}) = a_n \bar{S}_n(J_{\mathrm{opt}})[1 + o_p(1)] + \|g\|^2 - \|\hat{g}_{J_{\mathrm{opt}}}\|^2. $$

But

$$ \|\hat{g}_{J_{\mathrm{opt}}}\|^2 = \|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + \|g_{J_{\mathrm{opt}}}\|^2 + 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle, $$

so

$$ \hat{Q}_n(J_{\mathrm{opt}}) = a_n \bar{S}_n(J_{\mathrm{opt}})[1 + o_p(1)] + \|g\|^2 - \|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 - \|g_{J_{\mathrm{opt}}}\|^2 - 2\langle g_{J_{\mathrm{opt}}}, \hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\rangle $$

and

$$ 0.5\|\hat{g}_{\hat{J}} - g\|^2[1 + o_p(1)] \le a_n \bar{S}_n(J_{\mathrm{opt}})[1 + o_p(1)] + \|g\|^2 - \|g_{J_{\mathrm{opt}}}\|^2 + \|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 0.5\|g_{J_{\mathrm{opt}}} - g\|^2. $$

But $\|g\|^2 - \|g_{J_{\mathrm{opt}}}\|^2 = \|g_{J_{\mathrm{opt}}} - g\|^2$. Therefore,

$$ 0.5\|\hat{g}_{\hat{J}} - g\|^2[1 + o_p(1)] \le a_n \bar{S}_n(J_{\mathrm{opt}})[1 + o_p(1)] + \|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 + 1.5\|g_{J_{\mathrm{opt}}} - g\|^2. $$

Now, $E_A\|\hat{g}_{J_{\mathrm{opt}}} - g_{J_{\mathrm{opt}}}\|^2 = \bar{S}_n(J_{\mathrm{opt}})$. Therefore, for any $n > 2$,

$$ 0.5 E_A\|\hat{g}_{\hat{J}} - g\|^2 \le (a_n + 1)\bar{S}_n(J_{\mathrm{opt}}) + 1.5\|g_{J_{\mathrm{opt}}} - g\|^2 \le (a_n + 1)E_A\|\hat{g}_{J_{\mathrm{opt}}} - g\|^2. $$

It follows that

$$ E_A\|\hat{g}_{\hat{J}} - g\|^2 \le 2(a_n + 1)E_A\|\hat{g}_{J_{\mathrm{opt}}} - g\|^2. \; \square $$



Proof of Theorem 3.2. We first prove part (ii) of the theorem. $J_{n0}$ satisfies the assumptions of Theorem 3.1, so the conclusion of Theorem 3.1 holds with $J_{n0}$ in place of $J_n$. Therefore, part (ii) of the theorem follows from the fact that, by part (i), $\hat{J}_{n0} = J_{n0}$ with probability approaching 1 as $n \to \infty$.

We now prove part (i) of the theorem. For $J = 1, 2, \ldots$, define $L_n(J) = \rho_J^2 J^{3.5}/n$ and $\hat{L}_n(J) = \hat{\rho}_J^2 J^{3.5}/n$. Let $C > 2$ be a finite constant. Define $\Theta_n = \{J = 1, 2, \ldots : |J - J_{n0}| \le C\}$. We first show that

$$ \max_{J \in \Theta_n}\left|\frac{\hat{L}_n(J) - L_n(J)}{L_n(J)}\right| = \rho_{J_{n0}} J_{n0}\, O_p(n^{-1/2}) = o_p(1) \tag{A.23} $$

as $n \to \infty$. To do this, observe that $\hat{L}_n(J)/L_n(J) = \hat{\rho}_J^2/\rho_J^2$,

$$ \rho_J^{-1} = \inf_{\nu \in \mathcal{K}_J}\frac{\|A\nu\|}{\|\nu\|}, \qquad \text{and} \qquad \hat{\rho}_J^{-1} = \inf_{\nu \in \mathcal{K}_J}\frac{\|\hat{A}\nu\|}{\|\nu\|}. $$

For any $\nu \in \mathcal{K}_J$,

$$ \frac{\|\hat{A}\nu\|}{\|\nu\|} = \frac{\|A\nu + (\hat{A} - A)\nu\|}{\|\nu\|} \le \frac{\|A\nu\|}{\|\nu\|} + \sup_{\nu \in \mathcal{K}_J}\frac{\|(\hat{A} - A)\nu\|}{\|\nu\|}. $$

Similarly,

$$ \frac{\|\hat{A}\nu\|}{\|\nu\|} \ge \frac{\|A\nu\|}{\|\nu\|} - \sup_{\nu \in \mathcal{K}_J}\frac{\|(\hat{A} - A)\nu\|}{\|\nu\|}. $$

A slight modification of the proof of Lemma 2 shows that

$$ \sup_{\nu \in \mathcal{K}_J}\frac{\|(\hat{A} - A)\nu\|}{\|\nu\|} = J\,O_p(n^{-1/2}). $$

Therefore, for $r_n = J\,O_p(n^{-1/2})$,

$$ \inf_{\nu \in \mathcal{K}_J}\frac{\|\hat{A}\nu\|}{\|\nu\|} \le \inf_{\nu \in \mathcal{K}_J}\frac{\|A\nu\|}{\|\nu\|} + r_n \qquad \text{and} \qquad \inf_{\nu \in \mathcal{K}_J}\frac{\|\hat{A}\nu\|}{\|\nu\|} \ge \inf_{\nu \in \mathcal{K}_J}\frac{\|A\nu\|}{\|\nu\|} - r_n. $$

It follows that

$$ \hat{\rho}_J^{-1} \le \rho_J^{-1} + r_n \qquad \text{and} \qquad \hat{\rho}_J^{-1} \ge \rho_J^{-1} - r_n. $$

Therefore, $|\hat{\rho}_J^{-1}/\rho_J^{-1} - 1| \le \rho_J r_n$ and

$$ \left|\frac{\hat{L}_n(J) - L_n(J)}{L_n(J)}\right| = \rho_J J\, O_p(n^{-1/2}). \tag{A.24} $$

(A.23) follows from (A.24) and the observation that $\rho_J \asymp \rho_{J_{n0}}$ for $J \in \Theta_n$.
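When $\hat{A}$ is represented by a matrix in the orthonormal bases $\{\psi_j\}$, the quantity $\hat{\rho}_J^{-1} = \inf_{\nu\in\mathcal{K}_J}\|\hat{A}\nu\|/\|\nu\|$ is the smallest singular value of that matrix, so $\hat{\rho}_J$ and $\hat{L}_n(J)$ are directly computable. A minimal sketch under two labeled assumptions: the estimated operator restricted to $\mathcal{K}_J$ is taken, for illustration, to be the leading $J \times J$ block of a matrix `A_hat_full`, and $\hat{J}_{n0}$ is taken to be the smallest $J$ with $\hat{L}_n(J) - 1 \ge 0$, which agrees with the constrained minimization in the proof when $\hat{L}_n$ is increasing in $J$.

```python
import numpy as np

def rho_hat(A_hat):
    """1 / inf ||A_hat v|| / ||v||, i.e., the reciprocal of the smallest
    singular value of the matrix representing A_hat on K_J."""
    s_min = np.linalg.svd(A_hat, compute_uv=False)[-1]
    return 1.0 / s_min

def J_hat_n0(A_hat_full, n):
    """Smallest J with L_hat_n(J) = rho_hat_J**2 * J**3.5 / n >= 1,
    searching over leading J x J blocks of the estimated operator
    (an illustrative representation, not the paper's code)."""
    J_max = A_hat_full.shape[0]
    for J in range(1, J_max + 1):
        L_hat = rho_hat(A_hat_full[:J, :J]) ** 2 * J ** 3.5 / n
        if L_hat - 1 >= 0:
            return J
    return J_max
```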

Now define

$$ \tilde{J}_{n0} = \arg\min_{J = 1, 2, \ldots;\; J \in \Theta_n}\{\hat{\rho}_J^2 J^{3.5}/n : \rho_J^2 J^{3.5}/n - 1 \ge 0\}. $$

We show that $\lim_{n\to\infty} P(\tilde{J}_{n0} = J_{n0}) = 1$. It follows from (A.23) that

$$ \hat{L}_n(J_{n0}) < L_n(J_{n0})\left[1 + \rho_{J_{n0}} J_{n0}\, O_p(n^{-1/2})\right] $$

and

$$ L_n(\tilde{J}_{n0}) < \hat{L}_n(\tilde{J}_{n0})\left[1 + \rho_{J_{n0}} J_{n0}\, O_p(n^{-1/2})\right]. $$

In addition, $\hat{L}_n(\tilde{J}_{n0}) \le \hat{L}_n(J_{n0})$. Therefore,

$$ L_n(\tilde{J}_{n0}) < L_n(J_{n0})\left[1 + \rho_{J_{n0}} J_{n0}\, O_p(n^{-1/2})\right]. \tag{A.25} $$

Now let $N = \Theta_n \setminus \{J_{n0}\}$. Then

$$ J_{n0} + 1 = \arg\min_{J \in N}\{L_n(J) : L_n(J) - 1 > 0\}. $$

But $L_n(J_{n0} + 1) > L_n(J_{n0})$. Therefore, it follows from (A.25) that $\tilde{J}_{n0} \notin N$, so $\tilde{J}_{n0} = J_{n0}$. If $\hat{J}_{n0} < J_{n0} - C$, then $\hat{L}_n(\hat{J}_{n0}) - 1 < \hat{L}_n(J_{n0} - C) - 1$. Therefore, by (A.23),

$$ 0 < \hat{L}_n(J_{n0} - C) - 1 < L_n(J_{n0} - C)\left[1 + \rho_{J_{n0}} J_{n0}\, O_p(n^{-1/2})\right] - 1, $$

which is impossible because $J_{n0}$ minimizes $L_n(J)$ subject to $L_n(J) - 1 \ge 0$. Therefore, $\hat{J}_{n0} < J_{n0} - C$ cannot happen when $n$ is large. In addition, it follows from (A.23) that $\hat{L}_n(J_{n0} + 1) - 1 > 0$ with probability approaching 1 as $n \to \infty$. Therefore, with probability approaching 1 as $n \to \infty$, $\hat{J}_{n0} \le J_{n0} + 1$ and $\hat{J}_{n0} > J_{n0} + C$ cannot happen. □

References

Ai, C., Chen, X., 2003. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1843.
Bauer, F., Hohage, T., 2005. A Lepskij-type stopping rule for regularized Newton methods. Inverse Problems 21, 1975–1991.
Blundell, R., Chen, X., Kristensen, D., 2007. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica 75, 1613–1669.
Blundell, R., Horowitz, J.L., 2007. A non-parametric test of exogeneity. Rev. Econom. Stud. 74, 1035–1058.
Blundell, R., Pashardes, P., Weber, G., 1993. What do we learn about consumer demand patterns from micro data? Amer. Econom. Rev. 83, 570–597.
Breunig, C., Johannes, J., 2013. Adaptive estimation of functionals in nonparametric instrumental regression. Working Paper, Center for Doctoral Studies in Economics, University of Mannheim, Germany.
Carrasco, M., Florens, J.-P., Renault, E., 2007. Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization. In: Leamer, E.E., Heckman, J.J. (Eds.), Handbook of Econometrics, vol. 6. North-Holland, Amsterdam, pp. 5634–5751.
Cavalier, L., Hengartner, N.W., 2005. Adaptive estimation for inverse problems with noisy operators. Inverse Problems 21, 1345–1361.
Chen, X., Pouzo, D., 2009. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. J. Econometrics 152, 46–60.
Chen, X., Pouzo, D., 2012. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica 80, 277–321.
Chen, X., Reiss, M., 2011. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory 27, 472–521.
Chernozhukov, V., Imbens, G.W., Newey, W.K., 2007. Instrumental variable identification and estimation of nonseparable models via quantile conditions. J. Econometrics 139, 4–14.
Cohen, A., Hoffmann, M., Reiss, M., 2004. Adaptive wavelet Galerkin methods for linear inverse problems. SIAM J. Numer. Anal. 42, 1479–1501.
Darolles, S., Fan, Y., Florens, J.-P., Renault, E., 2011. Nonparametric instrumental regression. Econometrica 79, 1541–1565.


Efromovich, S., Koltchinskii, V., 2001. On inverse problems with unknown operators. IEEE Trans. Inform. Theory 47, 2876–2894.
Engl, H.W., Hanke, M., Neubauer, A., 1996. Regularization of Inverse Problems. Kluwer Academic Publishers, Dordrecht.
Florens, J.-P., Simoni, A., 2010. Nonparametric estimation of an instrumental regression: a quasi-Bayesian approach based on regularized posterior. Working Paper, Department of Decision Sciences, Bocconi University, Milan, Italy.
Gagliardini, P., Scaillet, O., 2012. Nonparametric instrumental variable estimation of structural quantile effects. Econometrica 80, 1533–1562.
Hall, P., Horowitz, J.L., 2005. Nonparametric methods for inference in the presence of instrumental variables. Ann. Statist. 33, 2904–2929.
Hoffmann, M., Reiss, M., 2008. Nonlinear estimation for linear inverse problems with error in the operator. Ann. Statist. 36, 310–336.
Horowitz, J.L., 2012. Specification testing in nonparametric instrumental variables estimation. J. Econometrics 167, 383–396.
Horowitz, J.L., 2007. Asymptotic normality of a nonparametric instrumental variables estimator. Internat. Econom. Rev. 48, 1329–1349.
Horowitz, J.L., 2006. Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica 74, 521–538.
Horowitz, J.L., Lee, S., 2012. Uniform confidence bands for functions estimated nonparametrically with instrumental variables. J. Econometrics 168, 175–188.
Horowitz, J.L., Lee, S., 2007. Nonparametric instrumental variables estimation of a quantile regression model. Econometrica 75, 1191–1208.
Johannes, J., Schwarz, M., 2010. Adaptive nonparametric instrumental regression by model selection. Working Paper, Université Catholique de Louvain.
Kress, R., 1999. Linear Integral Equations, second ed. Springer-Verlag, New York.
Loubes, J.-M., Ludeña, C., 2008. Adaptive complexity regularization for inverse problems. Electron. J. Stat. 2, 661–677.
Loubes, J.-M., Marteau, C., 2009. Oracle inequality for instrumental variable regression. Working Paper, Institute of Mathematics, University of Toulouse 3, France.
Lukas, M.A., 1993. Asymptotic optimality of generalized cross-validation for choosing the regularization parameter. Numer. Math. 66, 41–66.
Lukas, M.A., 1998. Comparisons of parameter choice methods for regularization with discrete, noisy data. Inverse Problems 14, 161–184.
Marteau, C., 2006. Regularization of inverse problems with unknown operator. Math. Methods Statist. 15, 415–443.
Marteau, C., 2009. On the stability of the risk hull method. J. Statist. Plann. Inference 139, 1821–1835.
Mathé, P., Pereverzev, S.V., 2003. Geometry of ill posed problems in variable Hilbert scales. Inverse Problems 19, 789–803.
Newey, W.K., Powell, J.L., 2003. Instrumental variables estimation of nonparametric models. Econometrica 71, 1565–1578.
Newey, W.K., Powell, J.L., Vella, F., 1999. Nonparametric estimation of triangular simultaneous equations models. Econometrica 67, 565–603.
Spokoiny, V., Vial, C., 2009. Parameter tuning in pointwise adaptation using a propagation approach. Ann. Statist. 37, 2783–2807.
van de Geer, S., 2000. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge.
Wahba, G., 1977. Practical approximate solutions to linear operator equations when the data are noisy. SIAM J. Numer. Anal. 14, 651–666.