Journal of Econometrics
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/jeconom
Non parametric analysis of panel data models with endogenous variables✩

Frédérique Fève, Jean-Pierre Florens∗
Toulouse School of Economics (University of Toulouse I Capitole), France

Article history: Received 5 October 2012; Received in revised form 8 January 2014; Accepted 18 March 2014; Available online xxxx.

Abstract: This paper considers the estimation of panel data models by first differences in the presence of endogenous variables and under an instrumental variables condition. This framework leads to linear inverse problems, which are solved using a Tikhonov regularization with an L² or a Sobolev penalty. Rates of convergence and data driven selection of the regularization parameters are proposed. The practical implementation of our estimators is presented and some Monte Carlo simulations show the potential of the method. © 2014 Elsevier B.V. All rights reserved.

JEL classification: C36; C14.
Keywords: Panel data; Endogeneity; Instrumental variables; Inverse problems.
1. Introduction

The objective of this paper is to extend previous research on non parametric estimation by the instrumental variables (IV) method to panel data models. Several papers have analyzed the basic cross section model Y = ϕ(Z) + U, where Z is endogenous and ϕ is defined via a mean independence condition E(U|W) = 0, W being a set of relevant instruments. As pointed out by Florens (2003), this problem may be treated as an ill-posed inverse problem, theoretically analyzed by Carrasco et al. (2007), Darolles et al. (2011) and Hall and Horowitz (2005). The practical implementation of this method is developed in Fève and Florens (2010), with the example of the transformation model, and in Horowitz (2011). Deep discussions about minimax convergence rates may be found in Chen and Reiss (2011) and Johannes et al. (2011).

We extend this approach to panel data models with endogenous variables and an individual effect, analyzed by the first differences method. More precisely, we consider a model

Y_t = ϕ(Z_t) + ξ + U_t,  t = 1, …, T,  (1.1)

for which individual i.i.d. observations (y_ti, z_ti) are available for i = 1, …, n and t = 1, …, T. The variable ξ is an individual unobservable heterogeneity effect, and Z_1, …, Z_T, ξ may be endogenous. We assume that W is a set of instrumental variables for which individual data w_i are also observed. The model is treated by first differences in order to eliminate ξ and we consider:

Y_t − Y_{t−1} = ϕ(Z_t) − ϕ(Z_{t−1}) + U_t − U_{t−1},  t = 2, …, T,  (1.2)

under the assumption E(U_t − U_{t−1}|W) = 0. This assumption is weaker than E(U_t|W) = E(U_{t−1}|W) = 0. The model is very similar to the initial IV model treated in the literature, except that the unknown function enters through the specific form ϕ(Z_t) − ϕ(Z_{t−1}). Note that the vector of instruments may contain time dependent variables, but their realizations for the different periods are then stacked in W, which does not depend on t in that case. For simplicity we focus on the case T = 2, which implies that (1.2) reduces to a single equation. The extension to general T is briefly analyzed in Section 2 and in the Appendix. We also simplify our model by considering that all the Z variables are endogenous. A more general specification would be written ϕ(Z_t, W_{1t}), where W_{1t} is a part of the instruments W_t. This extension is straightforward: intuitively, the analysis remains identical if we consider W_{1t} "fixed", see Hall and Horowitz (2005) and Fève and Florens (2010).

To the best of our knowledge, this model has not yet been treated in a purely non parametric way. This paper fills this gap. In particular, we develop two aspects that are original in the econometric literature on inverse problems.

✩ We acknowledge helpful comments from the editor Cheng Hsiao and from two anonymous referees. We also thank François Laisney, who suggested the extension of non parametric IV models to panel data specifications. Jérôme Bolte, Samuele Centorrino, Patrick Fève, Anna Simoni and Ingrid Van Keilegom are gratefully acknowledged for helpful discussions and comments.
∗ Corresponding author. Tel.: +33 561128596. E-mail address: [email protected] (J.-P. Florens).
http://dx.doi.org/10.1016/j.jeconom.2014.03.009 0304-4076/© 2014 Elsevier B.V. All rights reserved.

First, the form of the operator differs
from previous research and has some specific features. Second, we use a Sobolev penalty in the Tikhonov regularization, which has previously been done in other contexts by e.g. Florens et al. (2011) or Gagliardini and Scaillet (2012). Our paper is not only theoretical, since we develop the implementation of our estimators using standard computer programs.¹ Monte Carlo simulations illustrate our method.

Recent works consider different panel data models analyzed non parametrically. Evdokimov (2010) considers a model where the heterogeneity is not additive and then may not be eliminated by taking differences, but the additive residuals are mean independent of the explanatory variables. Wilhelm (2012) considers a model similar to ours. He assumes that the Z variables are exogenous but only observed with error, which creates an endogeneity bias if this error is not explicitly treated.

The paper is organized as follows. Section 2 presents the model and analyzes its identification. Sections 3 and 4 consider its estimation by Tikhonov regularization with an L² or a Sobolev penalization, respectively. Section 5 shows how our estimators may be implemented in practice, including data driven rules for the selection of the regularization parameters, and Section 6 gives some simulation examples. The proofs are given in an Appendix.

2. Notations and identification

Consider the following random variables: Y_t ∈ R, Z_t ∈ R^p, ξ ∈ R and W ∈ R^q. We denote by L²_{Z_t}, L²_W, … the spaces of square integrable functions of Z_t, W, … w.r.t. the true distribution generating the data. In most previous analyses of IV models the function ϕ is assumed to be square integrable w.r.t. the density of Z. In our case the distributions of Z_1 and Z_2 may be different, and for identification reasons ϕ should be normalized. This motivates the following assumption:

Assumption 2.1. We consider a density π of a probability measure and we assume

ϕ ∈ E = {ϕ : R^p → R : ∫ ϕ²(z)π(z)dz < ∞ and ∫ ϕ(z)π(z)dz = 0}.

The density π is such that E ⊂ L²_{Z_t} for t = 1, 2.

Obviously the difference model (1.2) remains identical if a constant is added to ϕ, and the normalization rule eliminates this identification problem. Under Assumption 2.1 the conditional expectation operator is well defined and ϕ is characterized by the equation:

Kϕ = r,  (2.1)

where (Kϕ)(w) = E(ϕ(Z_2) − ϕ(Z_1)|W = w) ∈ L²_W and r(w) = E(Y_2 − Y_1|W = w) ∈ L²_W. We assume that Y_2 − Y_1 is square integrable. Let f denote the density of the data generating process with respect to the Lebesgue measure. Using obvious notations we have:

(Kϕ)(w) = ∫ ϕ(z) [f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)] / f_W(w) dz,  (2.2)

and the operator K is usually compact. This compactness property is in particular verified if

∫∫ [ (f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)) / (f_W(w)π(z)) ]² π(z) f_W(w) dz dw < ∞.  (2.3)

In that case K is a Hilbert–Schmidt operator, which implies compactness, see Carrasco et al. (2007).

The assumption E ⊂ L²_{Z_t} is in particular verified if f_{Z_t}(z)/π(z) is bounded and if the support of Z_t is included in the support of π. However, this condition is not necessary. If Z_1 and Z_2 have the same marginal distribution, π may be taken equal to this distribution (and estimated in that case). A possible choice which satisfies the assumption is to take a convex combination of the two marginal densities of Z_1 and Z_2.

The estimation procedure presented in the next section requires the adjoint operator of K. Remember that this adjoint operator is defined by:

∫ [(Kϕ)(w)] ψ(w) f_W(w) dw = ∫ ϕ(z) [(K*ψ)(z)] π(z) dz,

for any ϕ ∈ E and ψ ∈ L²_W.

Proposition 2.1. The adjoint operator verifies K* : L²_W → E, where

(K*ψ)(z) = ∫ ψ(w) [f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)] / π(z) dw,  (2.4)

and K*ψ satisfies the normalizing constraint ∫ [(K*ψ)(z)] π(z) dz = 0.

Some comments about the computation of K*K are in order. Let us define the operators K_t (t = 1, 2) from L²_Z(π) = {ϕ : ∫ ϕ²(z)π(z)dz < ∞} ⊂ L²_{Z_t} into L²_W by:

(K_t ϕ)(w) = E(ϕ(Z_t)|W = w) = ∫ ϕ(z) f_{Z_t,W}(z, w)/f_W(w) dz,

and we denote by K*_t their adjoint operators from L²_W into L²_Z(π), verifying:

(K*_t ψ)(z) = ∫ ψ(w) f_{Z_t,W}(z, w)/π(z) dw.

An elementary computation shows that:

K*K = K*_2 K_2 + K*_1 K_1 − K*_1 K_2 − K*_2 K_1.  (2.5)

Notice further that K*Kϕ is an element of E, as it satisfies the constraint given in Proposition 2.1.

¹ All the Matlab code is available upon request. See also the R-package "np" developed by J. Racine.
² See Florens et al. (1990), chapter 5, for a systematic study of these two concepts and for a statistical history.

Let us now consider the identification problem. The model is identified if K is one to one on E or, equivalently, if Kϕ = 0 implies ϕ = 0. Note that if (Z_1, Z_2, W) is distributed as (Z_2, Z_1, W), the distribution of (Z_1, W) is the same as the distribution of (Z_2, W) and the model is not identified (Kϕ = 0 ∀ϕ). Non parametric identification in panel data with endogenous explanatory variables has been studied by Altonji and Matzkin (2005), see also Evdokimov (2010) in the case of non separable models using conditional independence conditions. We want to analyze identification under the weaker conditions of mean independence and in the framework of inverse problems, using the tools developed in previous work on this topic, see Newey and Powell (2003) or Darolles et al. (2011). Two concepts have been pointed out as essential in this analysis: the concept of strong identification (or completeness) and the concept of measurable separability.² The concept of strong identification is related to the injectivity of the conditional expectation operator (Z is strongly identified by W in the L² sense if E(ϕ(Z)|W) = 0 a.s. implies ϕ(Z) = 0 a.s.). A different characterization of this definition has recently been considered in particular by D'Haultfoeuille (2011) or Andrews (2011), see also Wilhelm (2012) and Hu and Shiu (2011). The second concept is measurable separability: two random elements X_1 and X_2 are separated if, for any measurable functions ϕ and ψ, ϕ(X_1) = ψ(X_2) a.s. implies ϕ = ψ =
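The decomposition (2.5) can be checked on a discretization. The following is a minimal numpy sketch (not the authors' Matlab code): it builds the two conditional expectation operators K_t from an arbitrary synthetic joint probability table on finite grids for z and w (all numbers here are illustrative assumptions), forms their adjoints with respect to the weighted inner products ⟨·,·⟩_{f_W} and ⟨·,·⟩_π, and verifies (2.5) and the normalizing constraint of Proposition 2.1 as exact matrix identities.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, nw = 15, 12                      # illustrative grid sizes for z and w

# Synthetic joint probability tables for (W, Z_t), t = 1, 2, sharing the
# same W-marginal (rows index w, columns index z).
joint1 = rng.random((nw, nz)); joint1 /= joint1.sum()
joint2 = rng.random((nw, nz))
f_w = joint1.sum(axis=1)
joint2 *= (f_w / joint2.sum(axis=1))[:, None]   # force the same f_W

# pi: convex combination of the two Z-marginals (as allowed by Assumption 2.1)
pi = 0.5 * (joint1.sum(axis=0) + joint2.sum(axis=0))

# Discrete conditional expectation operators (K_t phi)(w) = E(phi(Z_t)|W = w)
K1 = joint1 / f_w[:, None]
K2 = joint2 / f_w[:, None]

# Adjoints w.r.t. the weighted inner products: K_t* = D_pi^{-1} K_t' D_{f_W}
K1s = (K1.T * f_w) / pi[:, None]
K2s = (K2.T * f_w) / pi[:, None]

# K = K2 - K1; check the decomposition (2.5)
K = K2 - K1
KsK = ((K.T * f_w) / pi[:, None]) @ K
rhs = K2s @ K2 + K1s @ K1 - K1s @ K2 - K2s @ K1
assert np.allclose(KsK, rhs)

# K*K phi stays in E: its pi-weighted mean is zero (Proposition 2.1)
phi0 = rng.random(nz)
assert abs((pi * (KsK @ phi0)).sum()) < 1e-12
```

The last assertion holds for any grid function ϕ because summing f_{Z_t,W}(z, w) − f_{Z_{t−1},W}(z, w) over z returns f_W(w) − f_W(w) = 0, which is the discrete counterpart of the normalizing constraint.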
constant a.s. For example, if X_1 and X_2 are real vectors, consider the support of the joint distribution of X_1 and X_2. If the joint support is the product of the support of X_1 and of the support of X_2, this condition is satisfied. Measurable separability is actually weaker, as shown by Florens et al. (2008).

Proposition 2.2. Let ϕ ∈ E. If (Z_1, Z_2) is strongly identified by W and if Z_1 and Z_2 are measurably separated, the function ϕ is identified.

The assumption of Proposition 2.2 is too strong because it assumes that E(g(Z_1, Z_2)|W) = 0 implies g(Z_1, Z_2) = 0 a.s. for any g, while we only use this property for g(Z_1, Z_2) of the form ϕ(Z_2) − ϕ(Z_1). In the normal case, (Z_1, Z_2) is strongly identified by W if and only if the covariance between (Z_1, Z_2) and W has a rank equal to the dimension of (Z_1, Z_2). This requires that the dimension of W is greater than or equal to the dimension of (Z_1, Z_2) (see Newey and Powell (2003)). This condition implies that the dimension of W should not be strictly smaller than 2p. Let us consider the following example and show that a weaker condition may be assumed.

Example 2.1. Take (Z_1, Z_2, W)′ ∈ R³ normal with zero mean vector and covariance matrix:

( 1    ε    ρ_1 )
( ε    1    ρ_2 )  (2.6)
( ρ_1  ρ_2  1   )

In that case the distribution π is assumed to be identical to the marginal distribution of Z_1 and Z_2, both N(0, 1). We denote by p_j the Hermite polynomials³ on R. We have for example the following Fourier decomposition:

ϕ(z) = Σ_{j=0}^∞ ⟨ϕ, p_j⟩ p_j(z),  z ∈ R,  (2.7)

and

E(ϕ(Z_2)|W = w) − E(ϕ(Z_1)|W = w) = Σ_{j=0}^∞ ρ_2^j ⟨ϕ, p_j⟩ p_j(w) − Σ_{j=0}^∞ ρ_1^j ⟨ϕ, p_j⟩ p_j(w),  (2.8)

because the singular values of the operator ϕ → E(ϕ(Z_t)|W = ·) are the ρ_t^j and the singular vectors are the Hermite polynomials (E(p_j(Z_t)|W = w) = ρ_t^j p_j(w)). Then

E(ϕ(Z_2) − ϕ(Z_1)|W) = 0 a.s. ⇔ E( Σ_{j=0}^∞ (ρ_2^j − ρ_1^j) ⟨ϕ, p_j⟩ p_j(W) )² = 0 ⇔ Σ_{j=0}^∞ (ρ_2^j − ρ_1^j)² ⟨ϕ, p_j⟩² = 0.  (2.9)

If moreover ρ_2 ≠ ρ_1, this equality implies ⟨ϕ, p_j⟩² = 0 for any p_j (except for j = 0), which is equivalent to ϕ = constant and then to ϕ = 0 by the normalization rule.

This example may be generalized. First remark that K is one to one if and only if K*K is one to one, see Darolles et al. (2011), and consider equality (2.5). Consider the singular value decomposition of each K_t, i.e. the families (λ_tj, ϕ_tj(z), ψ_tj(w))_{j≥0}, t = 1, 2, such that K_t ϕ_tj = λ_tj ψ_tj and K*_t ψ_tj = λ_tj ϕ_tj (ϕ_tj ∈ L²_Z(π) and ψ_tj ∈ L²_W). Further assuming that ϕ_1j = ϕ_2j = ϕ_j and ψ_1j = ψ_2j = ψ_j ∀j, it is obvious that the ϕ_j and ψ_j are the singular vectors of K*K and KK*, and:

K*K ϕ_j = (λ_2j² + λ_1j² − 2λ_1j λ_2j) ϕ_j = (λ_2j − λ_1j)² ϕ_j.  (2.10)

The model is then identified if λ_2j ≠ λ_1j ∀j ≥ 1 (because λ_10 = λ_20 = 1). In particular we may consider geometric decay of the two λ_tj sequences:

λ_tj = C_t / j^{a_t},  C_t > 0,  a_t > 1/2,  t = 1, 2,  j ≥ 1.  (2.11)

The parameters a_1 and a_2 may be interpreted as a degree of dependence between W and Z_1 or Z_2. If a_1 ≠ a_2, the model is identified. Even if a_1 = a_2, the model remains identified if C_1 ≠ C_2. In this case, the result may be interpreted in the following way: ϕ is identified using the instruments W if the dependence structure between W and Z_1 is different from the dependence structure between W and Z_2.

Let us give another identification theorem.

Proposition 2.3. Assume that W is partitioned into (W_1, W_2) such that Z_1 ⊥ W_2 | W_1 (Z_1 and W_2 are independent given W_1) and Z_2 ⊥ W_1 | W_2. If W_1 and W_2 are measurably separated and if Z_1 is strongly identified by W_1 (or Z_2 by W_2), then ϕ is identified.

Let us remark that measurable separability excludes in particular that W_1 and W_2 have common elements. The separability property is in particular verified if the joint distribution of (W_1, W_2) is equivalent (i.e. has the same null sets) to a probability measure under which W_1 and W_2 are independent. The strong identification condition is more difficult to verify (and is not testable, as shown by Canay et al. (2011)). It means that the singular value decomposition of the conditional expectation operator has no zero element, see Carrasco et al. (2007). A (strong) sufficient condition for the strong identification in Proposition 2.3 is:

Z_1 = m_1(W_1) + ε_1,  (2.12)

where ε_1 is independent of W_1 and where the characteristic function of this noise does not vanish. We give an intuitive argument for this result. It is sufficient to prove that Z_1 is strongly identified by W̃_1 = m_1(W_1). Under (2.12) we have

E(ϕ(Z_1)|W̃_1 = w̃_1) = ∫ ϕ(z_1) f_{ε_1}(z_1 − w̃_1) dz_1.  (2.13)

Take the Fourier transform of this equality:

∫ e^{−itw̃_1} ∫ ϕ(z_1) f_{ε_1}(z_1 − w̃_1) dz_1 dw̃_1 = ∫ e^{−itz_1} ϕ(z_1) dz_1 × ∫ e^{itu} f_{ε_1}(u) du.  (2.14)

Then if (2.13) equals zero, the Fourier transform of ϕ is 0, and then ϕ is 0 by the inversion theorem. (This proof is based on D'Haultfoeuille (2011) and Wilhelm (2012).)

³ Hermite polynomials are the orthonormal polynomials for the N(0, 1) distribution. The properties of these polynomials used in this example are shown in particular in Letac (1995), chapter IV.
⁴ We would like to thank an anonymous referee for suggesting this example.

A dynamic model⁴ gives an example of identification using assumptions not based on the concept of completeness. Let us
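The key identity behind Example 2.1, E(p_j(Z_t)|W = w) = ρ_t^j p_j(w), can be checked numerically. Below is a minimal sketch (an illustration, not part of the authors' code) using probabilists' Hermite polynomials He_j, normalized by √(j!) so that they are orthonormal for N(0, 1), and Gauss–Hermite quadrature for the conditional law Z | W = w ∼ N(ρw, 1 − ρ²); the names and the chosen ρ, w values are arbitrary.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi as PI

def p(j, x):
    # Orthonormal Hermite polynomial for N(0, 1): He_j(x) / sqrt(j!)
    c = np.zeros(j + 1); c[j] = 1.0
    return He.hermeval(x, c) / sqrt(factorial(j))

# Quadrature nodes/weights for the weight exp(-x^2/2); raw weights sum to sqrt(2*pi)
nodes, weights = He.hermegauss(40)
weights = weights / sqrt(2 * PI)        # now a discrete N(0, 1) distribution

def cond_exp(j, rho, w):
    # E[p_j(Z) | W = w] with Z = rho*w + sqrt(1 - rho^2)*X, X ~ N(0, 1)
    z = rho * w + sqrt(1 - rho**2) * nodes
    return np.sum(weights * p(j, z))

rho, w = 0.6, 1.3
for j in range(5):
    assert abs(cond_exp(j, rho, w) - rho**j * p(j, w)) < 1e-8
```

It follows that the operator ϕ → E(ϕ(Z_2) − ϕ(Z_1)|W = ·) has singular values |ρ_2^j − ρ_1^j| in this basis, which vanish identically when ρ_1 = ρ_2, the non-identified case noted in the text.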
consider the model

Y_t = ϕ(Y_{t−1}) + ξ + U_t,  E(U_t|Y_{t−1}, Y_{t−2}, …) = 0.

We observe a sample of (Y_0, Y_1, Y_2), where these three scalar variables belong to the same Hilbert space. This model generates an equation:

E(ϕ(Y_1)|Y_0) − ϕ(Y_0) = E(Y_2 − Y_1|Y_0),

and the function ϕ is identifiable if the operator (T − I)ϕ = E(ϕ(Y_1)|Y_0) − ϕ(Y_0) is one to one on E. This operator is a Fredholm operator of type II, which is not the case for the operator in (2.1). Let us consider the equation (T − I)ϕ = 0, i.e. Tϕ = ϕ. This means that ϕ is an eigenvector of T associated with the eigenvalue 1. A constant non zero function satisfies this property, but this case is eliminated by the constraint ∫ ϕ(z)π(z)dz = 0. The model is then identified up to this normalization rule if the eigenvalue 1 of the conditional expectation operator T has an order of multiplicity equal to 1. This condition is not a completeness condition and will not be considered in this paper. Note, however, that in general ϕ is identified up to any function belonging to the space generated by the eigenvectors associated with the eigenvalue 1. Non parametric estimation of this type of model is given in Carrasco et al. (2007).

To conclude this section, let us sketch the extension of our presentation to panel data models with more than two periods. Let us consider a set of equations:

Y_t = ϕ(Z_t) + ξ + U_t,  t = 1, …, T,  (2.15)

transformed into:

Y_t − Y_{t−1} = ϕ(Z_t) − ϕ(Z_{t−1}) + U_t − U_{t−1},  t = 2, …, T.  (2.16)

We consider a sequence of instruments W_t (t = 2, …, T) such that

E(U_t − U_{t−1}|W_t) = 0.  (2.17)

The W_t may be time dependent or not. Time dependent instruments may contain lagged values of Z and Y. In the following, we illustrate this case. The space E is defined as before and we consider a space F = L²_{W_2} × ⋯ × L²_{W_T} endowed with the canonical Hilbert space product structure (if (ψ_t) and (ψ̃_t) ∈ F, then ⟨ψ, ψ̃⟩ = Σ_{t=2}^T ⟨ψ_t, ψ̃_t⟩, with ψ = (ψ_t)_t, ψ̃ = (ψ̃_t)_t and ⟨ψ_t, ψ̃_t⟩ the L²_{W_t} scalar product). Our problem may still be written Kϕ = r, where

K : E → F,
(Kϕ)(w_2, …, w_T) = (E(ϕ(Z_t) − ϕ(Z_{t−1})|W_t = w_t))_{t=2,…,T},  (2.18)
r(w_2, …, w_T) = (E(Y_t − Y_{t−1}|W_t = w_t))_{t=2,…,T},  (2.19)
Kϕ = (H_t ϕ)_{t=2,…,T},  (2.20)

where (H_t ϕ)(w_t) = E(ϕ(Z_t) − ϕ(Z_{t−1})|W_t = w_t). The adjoint operator obviously verifies:

(K*ψ)(z) = Σ_{t=2}^T ∫ ψ_t(w_t) [f_{Z_t,W_t}(z, w_t) − f_{Z_{t−1},W_t}(z, w_t)] / π(z) dw_t.  (2.21)

The procedure described in the next sections may easily be extended to this case, and an example for T > 2 is detailed in the Appendix. Three main questions arise from this extension. First, the identification condition is weakened in this case. Indeed, the function ϕ is identified if K is one to one, and this property holds if one of the H_t is one to one. Second, it is necessary to cope with the curse of dimensionality problem. If W_t includes for instance the lagged endogenous variables, the non parametric estimation of the density f_{Z_t,W_t}(z, w_t) may be poor if the dimension of W_t becomes too large. However, the number of instruments can be reduced by keeping for example the last lagged variables only. Third, this extension affects the rate of convergence of the estimator with respect to the number T of equations (see the illustration of T > 2 in Appendix A.3).

3. Estimation by the Tikhonov method (L² penalization)

We have seen that the function ϕ is characterized as the solution of Eq. (2.1), Kϕ = r, where both K and r are unknown and should be estimated using the available samples. The unknown function ϕ belongs to E (Assumption 2.1), r ∈ L²_W and K : E → L²_W. Since K is a compact operator, as remarked above, this problem is an ill-posed inverse problem belonging to the class of Fredholm equations of type I. The ill-posedness requires a regularization method to solve the equation and we focus our presentation on the Tikhonov regularization. The usual Tikhonov approach is based on the minimization of

‖Kϕ − r‖² + α‖ϕ‖²,  (3.1)

where the first norm is the Hilbert norm in L²_W and the second is the Hilbert norm in L²_Z(π):

‖ϕ‖² = ∫ ϕ²(z)π(z)dz.  (3.2)

The minimization of (3.1) leads to the solution:

ϕ^α = (αI + K*K)^{−1} K*r,  (3.3)

where α > 0 is a real parameter suitably chosen. In our particular case (3.3) becomes:

αϕ^α(z) + ∫ ϕ^α(t) a(t, z) dt = ∫ y b(y, z) dy,  (3.4)

where

a(t, z) = (1/π(z)) ∫ [f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)] [f_{Z_2,W}(t, w) − f_{Z_1,W}(t, w)] / f_W(w) dw,

and, with Y denoting Y_2 − Y_1,

b(y, z) = (1/π(z)) ∫ [f_{Y,W}(y, w) / f_W(w)] (f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)) dw.

The estimator is then constructed by the plug-in method: we just replace the densities by their kernel estimates. For example:

f̂_{Z_1,W}(z, w) = (1/(n h_{Z_1} h_W)) Σ_{i=1}^n C((z − z_{1i})/h_{Z_1}) C((w − w_i)/h_W),  (3.5)

where C is a zero mean kernel density and h_{Z_1} and h_W are the bandwidths. Our estimator ϕ̂^α is then the solution of Eq. (3.4) where the densities are replaced by their estimators as in (3.5). We will see below how to implement this method practically, and a more explicit expression of the equation will be given in Section 5.

Note that kernel smoothing is just one possible non parametric method. The densities may be estimated by any technique, such as splines, wavelets or any sieve method. Local constant kernel estimation may be replaced by local polynomial estimation. Analogously, the regularization by Tikhonov penalization is one example. Other techniques may be used, such as Landweber–Fridman regularization, see Florens and Racine (2012), or Galerkin regularization, see also e.g. Horowitz (2011). The plug-in method gives an estimation of K, K* and r. Therefore we do not discuss these arguments further. We simply assume that the following conditions hold.
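Before stating those conditions, the plug-in construction (3.4)–(3.5) can be sketched end to end on a grid. The following numpy sketch is purely illustrative and is not the authors' Matlab implementation: the data generating process, bandwidths, grids and the regularization parameter are all arbitrary assumptions, and instead of estimating f_{Y,W} and integrating ∫ y b(y, z) dy, it plugs a Nadaraya–Watson estimate of r(w) = E(Y_2 − Y_1|W = w) directly into (K*r)(z), which is an equivalent shortcut.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Illustrative DGP: endogenous first-difference noise with E(du | W) = 0
w = rng.normal(size=n)
e1, e2, eta = rng.normal(size=(3, n))
z1 = 0.3 * w + e1
z2 = 0.8 * w + e2
phi_true = lambda z: z**2 - 1.0
du = 0.7 * (e2 - e1) + 0.3 * eta
dy = phi_true(z2) - phi_true(z1) + du       # observed Y2 - Y1

# Gaussian kernel density estimates on grids for z and w
zg = np.linspace(-2.5, 2.5, 61); dz = zg[1] - zg[0]
wg = np.linspace(-2.5, 2.5, 61); dw = wg[1] - wg[0]
hz = hw = 0.25
Kh = lambda u, h: np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

Kw = Kh(wg[:, None] - w[None, :], hw)                 # (len(wg), n)
fW = Kw.mean(axis=1)
f1 = Kh(zg[:, None] - z1[None, :], hz) @ Kw.T / n     # f_hat_{Z1,W}(z, w)
f2 = Kh(zg[:, None] - z2[None, :], hz) @ Kw.T / n     # f_hat_{Z2,W}(z, w)

d = f2 - f1
pi = 0.5 * (f1.sum(axis=1) + f2.sum(axis=1)) * dw     # convex combination of marginals
pi = np.maximum(pi, 1e-3)

# Nadaraya-Watson estimate of r(w), then (K* r)(z) as in (2.4)
r_hat = (Kw @ dy) / Kw.sum(axis=1)
Kstar_r = (d @ r_hat) * dw / pi

# a(t, z) of (3.4) and the Tikhonov solve (alpha I + A) phi = K* r
a = ((d / fW) @ d.T) * dw / pi[:, None]               # a[z, t]
alpha = 0.05
phi_hat = np.linalg.solve(alpha * np.eye(len(zg)) + a * dz, Kstar_r)
phi_hat -= (phi_hat * pi).sum() / pi.sum()            # impose the normalization in E
```

The final line enforces the location constraint ∫ ϕ(z)π(z)dz = 0 of Assumption 2.1; in practice α would be chosen by the data driven rules of Section 5 rather than fixed.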
Assumption 3.1. (i) ∃ρ > 0 such that

‖r̂ − r‖² = ∫ [ r̂(w) − E(Y_2 − Y_1|W = w) ]² f_W(w) dw = O_P(n^{−2ρ/(2ρ+q)}),

where r̂ is the kernel estimator of the conditional expectation,

r̂(w) = Σ_{i=1}^n (y_{2i} − y_{1i}) C((w − w_i)/h_W) / Σ_{i=1}^n C((w − w_i)/h_W).

(ii) ∃κ > 0 such that ‖K̂ − K‖² = O_P(n^{−2κ/(2κ+p+q)}) and ‖K̂* − K*‖² = O_P(n^{−2κ/(2κ+p+q)}).

In this assumption, the value ρ should be interpreted as the regularity of the conditional expectation E(Y_2 − Y_1|W) and κ as the regularity of the joint density of Z_1, Z_2 and W. The regularity is defined by the number of continuous derivatives. The first result is standard in the non parametric estimation of a conditional expectation and is based on the optimal choice of the bandwidth h_W in kernel estimation (proportional to n^{−1/(2ρ+q)}). This choice may be realized in practice using for example a cross validation method. Hypothesis (i) just means that the estimation of E(Y_2 − Y_1|W) is performed at the minimax rate. Hypothesis (ii) is based on a result given by Darolles et al. (2011) on the estimation of the conditional expectation operator. Indeed, K and K* are differences of conditional expectation operators, and this result applies. Here also the bandwidths are chosen optimally in the estimation of f_{Z_1,W} and f_{Z_2,W} (proportional to n^{−1/(2κ+p+q)}). This result is different from the usual result on the estimation of a conditional expectation, where only p or q appears. The rate of convergence is due to the fact that we consider:

‖K̂ − K‖² = sup{‖K̂ϕ − Kϕ‖² : ‖ϕ‖ ≤ 1}
  ≤ ∫∫ [ (f̂_{Z_2,W}(z, w) − f̂_{Z_1,W}(z, w)) / (f̂_W(w)π(z)) − (f_{Z_2,W}(z, w) − f_{Z_1,W}(z, w)) / (f_W(w)π(z)) ]² π(z) f_W(w) dz dw.

This sup norm is bounded by the Hilbert–Schmidt norm, which involves the estimation of the joint density, estimated at a rate depending on both p and q. In most previous papers it has been assumed that ρ = κ, but this hypothesis is not necessary. These assumptions are based on technical conditions we do not recall here. One of them is that the variables have a compact support and that the densities (including π) are bounded from below. This assumption is also common in the non parametric literature. Let us also remark that, to treat the bias of the kernel estimation at the boundary, we need to use boundary or generalized kernels, as recalled in Darolles et al. (2011).

Assumption 3.2. There exists β > 0 such that ϕ is in the range of (K*K)^{β/2}. Equivalently, one of the following conditions is satisfied⁵:

• ∃δ ∈ E such that ϕ = (K*K)^{β/2} δ;
• If (ϕ_j)_{j=1,…} and (λ_j²)_{j=1,…} denote the eigenvectors and eigenvalues of K*K, we have Σ_{j=1}^∞ ⟨ϕ, ϕ_j⟩² / λ_j^{2β} < ∞.

⁵ If K*K has a spectral decomposition (λ_j², ϕ_j), we define (K*K)^γ by (K*K)^γ ϕ = Σ_j λ_j^{2γ} ⟨ϕ, ϕ_j⟩ ϕ_j. More generally, the spectral mapping theorem defines g(K*K)ϕ by Σ_j g(λ_j²)⟨ϕ, ϕ_j⟩ϕ_j. The function g should be analytic on a set containing all the eigenvalues. See Dunford and Schwartz (1988, I-VII-3).

Intuitively this condition means that the coefficients of the expansion of ϕ in the basis of the ϕ_j decay sufficiently fast in comparison with the singular values of K. This type of regularity condition has been extensively discussed in other papers, see Carrasco et al. (2007), Darolles et al. (2011). We will discuss in the next section an interpretation in terms of derivatives. The role of this assumption is to control the regularization bias. Estimating ϕ by formula (3.3) leads to a bias ϕ − ϕ^α (see the proof of Proposition 3.1). This bias is equal to

ϕ − (αI + K*K)^{−1} K*K ϕ = α(αI + K*K)^{−1} ϕ = α Σ_j [⟨ϕ, ϕ_j⟩/(α + λ_j²)] ϕ_j,

with a square norm equal to

α² Σ_j ⟨ϕ, ϕ_j⟩²/(α + λ_j²)².  (3.6)

This norm converges to 0 as α → 0 for any ϕ, but extra conditions are required to get a rate of convergence. Under Assumption 3.2 we have

α² Σ_j ⟨ϕ, ϕ_j⟩²/(α + λ_j²)² = α² Σ_j [λ_j^{2β}/(α + λ_j²)²][⟨ϕ, ϕ_j⟩²/λ_j^{2β}] = O(α^β) Σ_j ⟨ϕ, ϕ_j⟩²/λ_j^{2β} = O(α^β).  (3.7)

In these computations, β is assumed to be smaller than or equal to 2. If not, β should be replaced by the minimum of β and 2. This is related to the so-called qualification of the Tikhonov methodology. When β > 2, it is possible to increase the qualification, either by employing an iterated Tikhonov approach (see Section 5), or by applying an iterative procedure (the so-called Landweber–Fridman regularization, see Engl et al. (2000), ch. 3). The use of a different penalty may also increase the qualification (see Section 4).

The asymptotic behavior of our estimator is given by the following proposition.

Proposition 3.1. Let ϕ̂^α be our estimator. Under Assumptions 3.1 and 3.2 we have:

(i) ‖ϕ̂^α − ϕ‖² = O_P(α^{−1} n^{−2ρ/(2ρ+q)} + α^{(β−1)∧0} n^{−2κ/(2κ+p+q)} + α^{β∧2}), where β ∧ 2 = min(β, 2). Then, if α → 0 such that α^{−1} n^{−2ρ/(2ρ+q)} → 0 and α^{(β−1)∧0} n^{−2κ/(2κ+p+q)} → 0, ‖ϕ̂^α − ϕ‖² → 0 in probability.

(ii) Moreover, let us assume that β ≥ 1 and that

2κ/(2κ + p + q) ≥ (β∧2)/(β∧2 + 1).

Then the optimal rate of convergence is realized if α is proportional to n^{−[2ρ/(2ρ+q)]·[1/(β∧2+1)]}, and this choice leads to

‖ϕ̂^α − ϕ‖² = O_P(n^{−[2ρ/(2ρ+q)]·[(β∧2)/(β∧2+1)]}).

This type of result is standard in the literature on inverse problems. The rate is decomposed into two parts: one is due to the non parametric estimation of the right hand side of Eq. (2.1) (n^{−2ρ/(2ρ+q)}) and one is a discount factor β/(β+1) due to the inversion
6
F. Fève, J.-P. Florens / Journal of Econometrics (
(if β ≤ 2). This result is very general and does not link the regularity of the non parametric estimation of r (the value ρ ) and the regularity of ϕ (the value β ). Under some specific assumptions (see remark following Proposition 4.1), these regularity conditions are related and this result may be simplified and shown to be minimax, see Chen and Reiss (2011). This point is discussed in the next section within a more general framework. The link between the number of instruments, the number of variables and the rate of convergence is a complex issue as discussed in previous papers on this topic. We may simplify the argument in the following way. Since ϕ is determined as the solution of the equation K ϕ = r, an estimation error on r implies an estimation error on ϕ . As r is defined as E (Y2 − Y1 |W ), its estimation error relates to the dimension of W (usual ‘‘curse of dimensionality’’). Moreover the rate of convergence also depends on the β parameter. It has been shown that the value of β is affected by the relation between the endogenous elements Z1 and Z2 and the instruments W . Roughly speaking, β increases if the dimension of W increases. This improves the rate of convergence (see Darolles et al. (2011), theorem 2.1). Furthermore, the dimension of Z has an impact on the convergence rate. The rate of decline of the λj increases if more Z are considered. This property has an impact on the regularity β as defined in Assumption 3.2. The rate of the estimation of K and K ∗ also depends on the dimension of Z . If p and q are large, inequality in assumption (ii) may be non satisfied. In this case, the rate of convergence may be driven in this case by the non parametric estimation error on K . Our result improves upon Darolles et al. (2011) upon some auxiliary conditions on q, ρ, κ and β . This analysis is detailed in the next section. 
The conditions given in part (ii) of the Proposition 3.1 are needed to eliminate the impact of the estimation of the operators K and K ∗ on the rate of convergence of ϕ . The obtained result is then identical to a situation where the joint densities fZt ,W (z , w) (t = 1, 2) are known. An interesting question not treated here could be to look at the change of the rate of convergence when T increases. When the instruments are invariant with T the rate of convergence of E (Yt − 2ρ − 2ρ+ q
. Moreover Yt −1 |W ) does not change and remains equal to n the parameter β would change because the operator K is modified and so are the λ2j in Assumption 3.2. An interesting point, which is not explicitly treated here, would be to look at the asymptotic properties as T increases. The Fourier coefficient ⟨ϕ, ϕj ⟩ will also change because ϕj is defined relatively to K . A natural conjecture would be that β will increase with T which improves the rate. Our simulations show that the latter conjecture is indeed plausible. 4. Estimation of the derivatives of ϕ and regularization by the norm of the derivative The first order derivatives of ϕ have a special interest for two reasons. First, they are identified independently of the location constraint ϕ(z )π (z )dz = 0. Second, they are very often the main object of interest for the economist. For example in a logarithm model they represent elasticities. We therefore consider in this section a Tikhonov regularization procedure where the penalty is measured by the norm of the derivative (‘‘Sobolev penalty’’ or ‘‘Hilbert Scale penalty’’). The interest of this approach is to give an immediate estimation of the derivative. Furthermore, it allows to relax the qualification constraint implied by the simple Tikhonov method, see Florens et al. (2011), where the regularity β is limited by 2. An alternative approach to the estimation of derivatives is presented in Florens and Racine (2012) in the standard IV context using a Landweber–Fridman regularization. We simplify the presentation by considering only the case when Zt ∈ R and when π is the uniform measure on [0, 1]. The choice
)
–
of the support does not really matter but the case when π is not constant complicates the formula. Let us consider the following assumption: Assumption 4.1. ϕ is an element of E0 the subset of E of L2 differentiable functions. The function ϕ is L2 differentiable if there exists a function ϕ ′ in L2 such that:
ξ ′ (u)ϕ(u)du = −
ξ (u)ϕ ′ (u)du
for any function ξ differentiable in the usual sense and with ξ (0) = ξ (1) = 0. It is sufficient to assume that ϕ is differentiable in the usual sense such that its derivative is an element of L2Z (π ). We first define our new estimator and its derivative. This construction will be motivated later. Let us introduce two integral operators: M :
L2Z
(π ) → E0
(M ψ)(t ) =
t
ψ(u)du −
1
ψ(u)du,
dt 0
0
t
0
(4.1) and M ∗ : E0 → L2Z (π )
(M ∗ λ)(t ) =
1
λ(u)du.
(4.2)
t
Note that M ∗ is the adjoint operator of M. Then we define a new estimator of ϕ :
α = M (α I + M ∗ Kˆ ∗ Kˆ M )−1 M ∗ Kˆ ∗ rˆ . ϕ
(4.3)
This construction is motivated by the following argument. Let denote by L the differential operator E0 → L2Z (π ) Lϕ = ϕ ′ . We have obviously MLϕ = ϕ for any ϕ ∈ E0 . The original inverse problem K ϕ = r may then be rewritten as T ϕ ′ = r where T = KM and the object of interest becomes ϕ ′ ∈ L2Z (π ). We now treat this integral equation by the usual Tikhonov regularization. The regularized solution takes the form:
ϕ′^α = (αI + T*T)^(-1) T*r.  (4.4)

The adjoint operator of T is M*K*:

∀ϕ ∈ E_0, ∀ψ ∈ L²_W:  ⟨KMϕ′, ψ⟩_{L²_W} = ⟨ϕ′, M*K*ψ⟩_{L²_Z(π)},

and hence

ϕ′^α = (αI + M*K*KM)^(-1) M*K*r.  (4.5)
By integration, and using the constraint ϕ ∈ E_0, we get the new estimator of ϕ given in Eq. (4.3), and Eq. (4.5) gives a direct estimator of ϕ′ once K, K* and r are replaced by their sample counterparts. Note that this estimator of ϕ′ is not the derivative of the estimator of ϕ defined in Section 3.

The estimator (4.3) may also be derived by using a different penalty in the Tikhonov minimization. Consider the problem Kϕ = r. Assuming ϕ ∈ E_0, the L² penalization is imposed on the first derivative of the function rather than on the function itself. That is:

min_{ϕ ∈ E_0} ∥Kϕ − r∥² + α∥Lϕ∥²,  (4.6)
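In a discretized setting the whole construction (4.3)-(4.6) reduces to a few lines of linear algebra. The sketch below is ours: K is a synthetic smoothing matrix, the data are noiseless, and α is fixed by hand, so it only illustrates the mechanics of solving Tϕ′ = r with T = KM and integrating back.

```python
import numpy as np

# Sobolev-penalized Tikhonov (4.4)-(4.5) with a synthetic operator K.
N = 100
h = 1.0 / N
t = (np.arange(N) + 0.5) * h

K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.02) * h  # toy integral operator
phi = np.sin(2 * np.pi * t)             # true (zero mean) function
r = K @ phi                             # noiseless "observation" of K phi

Acum = np.tril(np.ones((N, N))) * h
M = Acum - Acum.mean(axis=0)            # discretized M of (4.1)

T = K @ M                               # T = KM of Section 4
alpha = 1e-4
phi_prime = np.linalg.solve(alpha * np.eye(N) + T.T @ T, T.T @ r)  # (4.4)
phi_hat = M @ phi_prime                 # estimator (4.3), up to the constant

print(np.corrcoef(phi_hat, phi)[0, 1] > 0.9)
```

The same matrices are used to generate r and to invert the problem, so this is only a mechanical illustration, not a Monte Carlo experiment.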
where the first norm is the usual Hilbert norm in L²_W and the second norm is the Hilbert norm in L²_Z(π). The minimization problem in (4.6) is related to the Hilbert scale regularization approach presented in Engl et al. (2000), except that the derivative operator is not self adjoint. As L is not bounded, its adjoint needs to be defined precisely and may be characterized by

∫_0^1 ϕ′(z)ψ(z)dz = −∫_0^1 ϕ(z)ψ′(z)dz,

for any ψ ∈ L²_Z(π) differentiable such that ψ(0) = ψ(1) = 0. This equality shows that −L may be selected as the adjoint of L (for a sub family of functions verifying the boundary conditions). The program (4.6) then leads to the solution:

ϕ^α = (−αL² + K*K)^(-1) K*r.  (4.7)

The proof of this result is given in the Appendix. The boundary constraint implied by the resolution of the integro-differential equation (−αL² + K*K)ϕ = K*r is replaced in our case by the constraint ∫ϕ(z)dz = 0. Using LM = I and L* = −L we get

ϕ^α = M(αI + M*K*KM)^(-1) M*K*r,  (4.8)

which leads to (4.3) when K, K* and r are replaced by their estimators.

Let us now consider the rate of convergence of ϕ̂′^α to ϕ′. This result follows from the same arguments as in Proposition 3.1, except that the operator K is replaced by KM (and K* by (KM)* = M*K*). The three elements of Assumption 3.1 remain true if KM replaces K because ∥K̂M − KM∥ ≤ ∥K̂ − K∥∥M∥ and ∥M∥ is finite. The single modification comes from the regularity condition: Assumption 3.2 is replaced by the following one.

Assumption 4.2. There exists γ > 0 such that ϕ′ ∈ R((M*K*KM)^{γ/2}) (where R denotes the range of the operator, i.e. there exists λ ∈ L²_Z(π) such that ϕ′ = (M*K*KM)^{γ/2}λ).

Proposition 4.1. (1) Under Assumptions 3.1, 4.1 and 4.2 we have:

∥ϕ̂′^α − ϕ′∥² = O_P( α^(-1) n^{−2ρ/(2ρ+q)} + α^{(γ−1)∧0} n^{−2κ/(2κ+p+q)} + α^{γ∧2} ).

Then if α → 0 such that α^(-1) n^{−2ρ/(2ρ+q)} → 0 and α^{(γ−1)∧0} n^{−2κ/(2κ+p+q)} → 0, we get ∥ϕ̂′^α − ϕ′∥² → 0 in probability.

(2) If γ ≥ 1 and 2κ/(2κ+p+q) ≥ (γ∧2)/((γ∧2)+1), the optimal α is proportional to n^{−(2ρ/(2ρ+q)) × (1/((γ∧2)+1))} and the optimal rate of convergence is given by

∥ϕ̂′^α − ϕ′∥² = O_P( n^{−(2ρ/(2ρ+q)) × ((γ∧2)/((γ∧2)+1))} ).  (4.9)

The proof of this proposition is identical to the proof of Proposition 3.1.

Remark. It is interesting to link β and γ, even if more assumptions are needed. First assume that K is equivalent to M^a (i.e. there exist two constants c > 0 and c̄ > 0 such that c∥M^a ϕ∥ ≤ ∥Kϕ∥ ≤ c̄∥M^a ϕ∥) and that ϕ ∈ D(L^b) (the domain of L^b). Intuitively, K has the smoothing property of the a-th power of the integral operator (it generates differentiable functions) and ϕ is b times differentiable (a > 0, b > 0). If ϕ is b times differentiable, its transformation Kϕ is a + b times differentiable. The operators M^a and L^b represent the powers a and b of M and L (see footnote 4). In that case it may be verified (see Engl et al. (2000), chap. 8) that β = b/a (b is the regularity index and a the degree of ill-posedness). With these notations the optimal rate for the estimation of ϕ becomes

∥ϕ̂^α − ϕ∥² = O_P( n^{−(2ρ/(2ρ+q)) × (b/(a+b))} ).  (4.10)

We recover the standard result that the penalization by the derivative does not improve the rate but just changes the qualification constraint (b/a should not be larger than 2). This rate is reached either by the estimator (4.3) or by the estimator studied in Section 3. Moreover, in this case ρ is the number of derivatives of

E(Y|W) = E(ϕ(Z2) − ϕ(Z1)|W)  (4.11)

and is then equal to a + b. Now consider the estimation of ϕ′. This function is an element of D(L^{b−1}) and the degree of ill-posedness becomes a + 1. Then γ = (b−1)/(a+1) and the optimal rate is

∥ϕ̂′^α − ϕ′∥² = O_P( n^{−(2ρ/(2ρ+q)) × ((b−1)/(a+b))} ).  (4.12)

We then get the rates of convergence derived by Chen and Reiss (2011) in the IV context:

∥ϕ̂^α − ϕ∥² = O_P( n^{−2b/(2(a+b)+q)} ),  ∥ϕ̂′^α − ϕ′∥² = O_P( n^{−2(b−1)/(2(a+b)+q)} ).  (4.13)

5. Practical implementation of the estimation methods

The objective of this section is to show how the theoretical approach presented above can easily be implemented. The general argument is the approximation of integrals of the form

∫ ϕ(z) f̂_{Z2,W}(z, w) dz  (5.1)

by

(1/(n h_W^q)) Σ_{i=1}^n ϕ(z_{2i}) C((w − w_i)/h_W),  (5.2)

where C denotes a zero mean kernel density function and h_W a bandwidth. This type of approximation has been discussed in our previous papers, see Darolles et al. (2011) and Fève and Florens (2010). The main interest of the computation we present is that it does not require any approximation of ϕ (by a sieve method), which could introduce a hidden regularization. Moreover, in our framework all the computations reduce to matrix computations.

We briefly present two methods for the computation of the estimator of ϕ: one based on the computation of ϕ at every point of the union of the two samples of Z1 and Z2, and one on a grid of selected points. First we assume that Z1 and Z2 are scalar and we consider an interval which contains all the observed values of Z1 and Z2. We assume π(z) constant on the interval and arbitrarily fixed to 1. Let us define the following matrices:

C_W = [ C((w_i − w_j)/h_W) / Σ_{j=1}^n C((w_i − w_j)/h_W) ]_{i,j=1,…,n},  (5.3)

∀ t, τ = 1, 2:  C_{Ztτ} = [ (1/(n h_{Zτ})) C((z_{ti} − z_{τj})/h_{Zτ}) ]_{i,j=1,…,n}.  (5.4)
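As an illustration, the matrices (5.3) and (5.4) can be filled directly from the sample. The code below is a sketch with our own notation and a Gaussian kernel; the variables are simulated only to give the matrices something to act on.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
w = rng.normal(size=n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)

def kern(u):
    # standard normal density used as the kernel C
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h_w = 1.069 * w.std() * n ** (-1 / 5)   # naive bandwidth rule of Section 5
h_z = 1.069 * z1.std() * n ** (-1 / 5)

# (5.3): row-normalized weights, i.e. a conditional expectation given W
KW = kern((w[:, None] - w[None, :]) / h_w)
C_W = KW / KW.sum(axis=1, keepdims=True)

# (5.4): C_{Z_t tau}[i, j] = (1 / (n h)) * C((z_{t i} - z_{tau j}) / h)
def C_Z(zt, ztau, h):
    return kern((zt[:, None] - ztau[None, :]) / h) / (n * h)

C_Z12 = C_Z(z1, z2, h_z)
print(C_W.shape, bool(np.allclose(C_W.sum(axis=1), 1.0)))  # (200, 200) True
```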
In the case when π(z) differs from 1, we should divide the elements of C_{Ztτ} by π(z_{τi}). If Z1 and Z2 have the same marginal distribution, π may be chosen equal to this marginal and estimated by the usual kernel method. For simplicity, we only consider the case where π is uniform.

We denote by ỹ ∈ R^n the vector (y_{2i} − y_{1i})_{i=1,…,n} and by ϕ̃^α the vector of the estimator evaluated at all the observed data points z_{1i} and z_{2i}; then ϕ̃^α ∈ R^{2n}. Finally we denote by I_n the identity matrix of size n. In the simulations the constraint ∫ϕ(z)π(z)dz = 0 is replaced by ∫ϕ(z)π(z)dz equal to some known constant. As we only approximate the theoretical estimator presented in Section 3, the constraint on the estimator is not exactly satisfied, and we translate the function in order to get ∫ϕ̂^α(z)π(z)dz = ∫ϕ(z)π(z)dz.

From approximation (5.2) and formula (3.4) it follows that our estimator solves the equation

α ϕ̂^α(z) + (1/π(z)) Σ_{i=1}^n [ Σ_{j=1}^n C((w_j − w_i)/h_W) (ϕ̂^α(z_{j2}) − ϕ̂^α(z_{j1})) / Σ_{j=1}^n C((w_j − w_i)/h_W) ] × [ (1/(n h_{Z2})) C((z − z_{i2})/h_{Z2}) − (1/(n h_{Z1})) C((z − z_{i1})/h_{Z1}) ]
= (1/π(z)) Σ_{i=1}^n [ Σ_{j=1}^n C((w_j − w_i)/h_W) (y_{j2} − y_{j1}) / Σ_{j=1}^n C((w_j − w_i)/h_W) ] × [ (1/(n h_{Z2})) C((z − z_{i2})/h_{Z2}) − (1/(n h_{Z1})) C((z − z_{i1})/h_{Z1}) ],

and if we consider for z all the values of both samples, we obtain a linear system of equations whose unknowns are the ϕ̂^α(z_{ti}). Its solution is denoted in matrix notation by

ϕ̃^α = [ α I_{2n} + ( (C_{Z12} − C_{Z11}) ; (C_{Z22} − C_{Z21}) ) C_W [−I_n, I_n] ]^(-1) ( (C_{Z12} − C_{Z11}) ; (C_{Z22} − C_{Z21}) ) C_W ỹ + constant,  (5.5)

where ( · ; · ) denotes the 2n × n matrix obtained by stacking the two n × n blocks.
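Formula (5.5) then assembles these blocks into one 2n × 2n linear system. The following self-contained sketch (toy data, ad hoc bandwidths and α, no recentering of the estimator) shows the assembly; it is our illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 150, 1e-3
w = rng.normal(size=n)
z1 = 0.5 * w + 0.8 * rng.normal(size=n)
z2 = 0.5 * w + 0.8 * rng.normal(size=n)
y_tilde = z2 - z1 + 0.1 * rng.normal(size=n)    # toy model with phi(z) = z

def kern(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h_w = 1.069 * w.std() * n ** (-1 / 5)
KW = kern((w[:, None] - w[None, :]) / h_w)
C_W = KW / KW.sum(axis=1, keepdims=True)         # (5.3)

def C(zt, ztau):
    h = 1.069 * ztau.std() * n ** (-1 / 5)
    return kern((zt[:, None] - ztau[None, :]) / h) / (n * h)   # (5.4)

B = np.vstack([C(z1, z2) - C(z1, z1),            # stacked 2n x n block of (5.5)
               C(z2, z2) - C(z2, z1)])
D = np.hstack([-np.eye(n), np.eye(n)])           # [-I_n, I_n]

phi_tilde = np.linalg.solve(alpha * np.eye(2 * n) + B @ C_W @ D,
                            B @ C_W @ y_tilde)   # (5.5), up to the constant
print(phi_tilde.shape)                           # (300,)
```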
The extension for T larger than 2 is presented in the Appendix.

The second method consists in an estimation of ϕ on a given grid of points. This approach is useful for the estimation of the derivative. Let (z̃_l)_{l=1,…,N} denote a vector of N ordered points in a range including all the observations of Z1 and Z2. We assume moreover that the elements of the grid are equally spaced, which means that π is uniform. Let us define:

t = 1, 2:  C_{Zt} = [ (1/(n h_{Zt})) C((z̃_l − z_{ti})/h_{Zt}) ]_{l=1,…,N; i=1,…,n},

C̃_{Zt} = [ C((z_{ti} − z̃_l)/h_{Z̃}) / Σ_{l=1}^N C((z_{ti} − z̃_l)/h_{Z̃}) ]_{i=1,…,n; l=1,…,N}.
The product by C_{Zt} estimates the conditional expectation given Zt (with a marginal distribution known to be uniform). The matrices C̃_{Zt} transform functions defined on the grid into functions defined on the observations of Zt. The estimator is then computed as follows:

ϕ̂^α = [ α I_N + (C_{Z2} − C_{Z1}) C_W (C̃_{Z2} − C̃_{Z1}) ]^(-1) (C_{Z2} − C_{Z1}) C_W ỹ + constant.  (5.6)

In order to implement the estimation with a penalization by the derivative, we introduce the following two matrices, which are discretizations of the integral operators M and M* (with the convention z̃_0 = 0):

M = ( z̃_1   0   …   0
      z̃_1   z̃_2 − z̃_1   …   0
      …
      z̃_1   z̃_2 − z̃_1   …   z̃_N − z̃_{N−1} ),

M* = ( z̃_1   z̃_2 − z̃_1   …   z̃_N − z̃_{N−1}
       0     z̃_2 − z̃_1   …   z̃_N − z̃_{N−1}
       …
       0     0   …   z̃_N − z̃_{N−1} ).  (5.7)

Then we have

ϕ̃^α = M ϕ̃′^α + constant,  (5.8)

where ϕ̃′^α is the estimator of the derivative:

ϕ̃′^α = [ α I_N + M*(C_{Z2} − C_{Z1}) C_W (C̃_{Z2} − C̃_{Z1}) M ]^(-1) M*(C_{Z2} − C_{Z1}) C_W ỹ.  (5.9)

The last point for the practical implementation is the choice of the bandwidths and of the regularization parameter. For all the bandwidths h_W or h_{Zt} we adopt a naive rule (n^{−1/5} × 1.069 × standard deviation of the variable). For the C̃_{Z1} and C̃_{Z2} matrices we adopt the following rule: n^{−1/5} × 1.069 × standard deviation of the grid × 1/4. Remember that the introduction of C̃_{Z1} and C̃_{Z2} is not based on a statistical argument but only on a numerical computation.

We present in this paper purely data driven methods for the selection of the regularization parameter α. Let us consider a general case for which the inverse problem takes the form Kϕ = r. The selection of α is based on the analysis of the estimated residual of this equation. Let r̂ and K̂ be estimators of r and K, and let ϕ be estimated by ϕ̂^α_(2) defined in the following way. In order to capture all the regularity of ϕ useful for a Tikhonov estimation, we need to estimate ϕ by an iterated Tikhonov method.⁶ If ϕ̂^α_(1) = (αI + K̂*K̂)^(-1) K̂*r̂, we define

ϕ̂^α_(2) = (αI + K̂*K̂)^(-1) (K̂*r̂ + α ϕ̂^α_(1)).

The square norm of the residuals ∥r̂ − K̂ϕ̂^α_(2)∥² goes to zero when α → 0, and we need to introduce a multiplicative penalty. Two methods may be used:

α¹_opt = argmin_α (1/α) ∥r̂ − K̂ϕ̂^α_(2)∥²,  (Criterion 1)
α²_opt = argmin_α ∥ϕ̂^α_(2)∥² ∥r̂ − K̂ϕ̂^α_(2)∥².  (Criterion 2)

It is proved that both criteria are bounded by functions of α such that their minima deliver α_opt at the optimal rate of convergence (n^{−(2ρ/(2ρ+q)) × (1/(β+1))}), even if β and γ are unknown. These results are in Engl et al. (2000) for the case where K is given; the extension to an estimated K is in Fève and Florens (2010) for the first criterion and in the Appendix for the second one. Recent papers analyze this problem in the case of instrumental variables using a different regularization method (see Breunig and Johannes (2013) and Horowitz (2010)). A method for the selection of α based on a cross validation approach with a leave one out method is given in Centorrino (2013) and Centorrino et al. (2013).

We apply these two methods to the estimation of ϕ and to the estimation of ϕ′, where the equation becomes r = Tϕ′ = KMϕ′. In that case we reconstruct ϕ by integration of ϕ′.

6. A simulation example

As an exercise we simulate a sample generated as follows: (U1, Z1, W1) and (U2, Z2, W2) are independently normally distributed in R³ with zero means and variance matrix

( 1  A  0
  A  1  B
  0  B  1 ),  where A² + B² < 1,

and ξ has the form

ξ = a(U1 + U2 + Z1 + Z2 + W1 + W2 + 1) + η,

where η ∼ N(0, 1). In this toy model the marginal distributions of Z1 and Z2 are identical; π is equal to this marginal and estimated non parametrically. The data are generated by

Y1 = ϕ(Z1) + ξ + U1,  Y2 = ϕ(Z2) + ξ + U2.  (6.1)

⁶ The iterated Tikhonov method is studied in Engl et al. (2000), p. 123. In the usual Tikhonov method the regularization bias has a square norm of order α^{β∧2}, but an iterated scheme gives a bias of order α^{β∧4}, and this condition is needed to get an optimal choice of α for the usual Tikhonov regularization (see the proof in the Appendix and in Fève and Florens (2011)).
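The design above is straightforward to simulate. In the sketch below (our code) we use the scenario II parameters of Table 1; reading "Std U_t" as a scale applied to U_t in the outcome equation is our interpretation of the table.

```python
import numpy as np

rng = np.random.default_rng(3)
n, A, B, a = 200, 0.7, 0.5, 1.0
Sigma = np.array([[1.0, A, 0.0],
                  [A, 1.0, B],
                  [0.0, B, 1.0]])                # Var(U_t, Z_t, W_t), A^2 + B^2 < 1

U1, Z1, W1 = rng.multivariate_normal(np.zeros(3), Sigma, size=n).T
U2, Z2, W2 = rng.multivariate_normal(np.zeros(3), Sigma, size=n).T

eta = rng.normal(size=n)
xi = a * (U1 + U2 + Z1 + Z2 + W1 + W2 + 1) + eta  # individual effect

phi = lambda z: z                                # scenario II: phi(z) = z
s_u = 0.1                                        # scenario II: Std U_t = .1
Y1 = phi(Z1) + xi + s_u * U1
Y2 = phi(Z2) + xi + s_u * U2
print(Y1.shape, Y2.shape)                        # (200,) (200,)
```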
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
The values of the parameters are given in Table 1. In the simulated experiments we also estimate ϕ under an exogeneity assumption. In that case E(Y|Z2, Z1) = ϕ(Z2) − ϕ(Z1), and we use a backfitting approach based on the result

ϕ(z) = (1/2) [ E(Y + ϕ(Z1)|Z2 = z) − E(Y − ϕ(Z2)|Z1 = z) ].  (6.2)

The estimator in that case is obtained as the fixed point of this equation. In these models we have two instruments W1 and W2 (W = (W1, W2)); A measures the endogeneity of Z1 or Z2, and B the dependence between the Z vector and the W vector.

Fig. 1 represents the true function, the estimation⁷ under exogeneity, and an estimation for an ad hoc choice of α under scenario I. In the same case, Fig. 2 shows the impact of different choices of α. Figs. 3 and 4 show the same curves but in the case of scenario II. Fig. 5 shows the shape of the curve to minimize for (Criterion 2) (in the case of scenario II).

Fig. 5.

⁷ All the kernels used in the simulation are centered and standardized normal densities.
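Under exogeneity, the fixed point (6.2) can be iterated directly once the conditional expectations are replaced by kernel smoothers. The sketch below is our own discretization on a grid (the paper does not give code); the toy data satisfy exogeneity with ϕ(z) = z.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = z2 - z1 + 0.1 * rng.normal(size=n)          # exogenous toy model, phi(z) = z

g = np.linspace(-2, 2, 81)                      # evaluation grid

def smooth_to(points, z):
    # kernel regression weights from the sample z to the evaluation points
    h = 1.069 * z.std() * n ** (-1 / 5)
    K = np.exp(-0.5 * ((points[:, None] - z[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

S2 = smooth_to(g, z2)                           # E( . | Z2 = g)
S1 = smooth_to(g, z1)                           # E( . | Z1 = g)

phi_g = np.zeros_like(g)
for _ in range(100):                            # fixed point iteration of (6.2)
    p1 = np.interp(z1, g, phi_g)                # phi(Z1) at the data points
    p2 = np.interp(z2, g, phi_g)
    phi_g = 0.5 * (S2 @ (y + p1) - S1 @ (y - p2))
    phi_g -= phi_g.mean()                       # impose the location constraint

print(np.corrcoef(phi_g, g)[0, 1] > 0.9)        # recovers an increasing phi
```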
Fig. 6.
The sets of graphs given in Figs. 6 and 7 correspond to scenarios II and III respectively and give Monte Carlo replications of different estimators with purely data driven choices of α and of the bandwidths. Only 20 replications are represented to keep the graphs readable. In the two top graphs the estimation is performed by Tikhonov regularization with an L² penalty, and they differ by the selection rule of α. The two graphs in the middle show the estimation of ϕ using a Sobolev penalty (formula (4.8)), and the bottom graphs represent the estimation of the function and of its derivative. The left pictures are related to (Criterion 1) for the choice of α and the right pictures to (Criterion 2).

Fig. 7.
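Both selection rules are cheap to evaluate on a grid of α values. The sketch below (our synthetic K and r, our grid of α's) implements the iterated Tikhonov estimator ϕ̂^α_(2) and the two criteria of Section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 80
h = 1.0 / N
t = (np.arange(N) + 0.5) * h
K = np.exp(-((t[:, None] - t[None, :]) ** 2) / 0.02) * h   # toy operator
phi = np.sin(2 * np.pi * t)
r = K @ phi + 0.001 * rng.normal(size=N)        # noisy observation of K phi

def tikhonov2(alpha):
    # iterated (order 2) Tikhonov: phi1 = (aI + K'K)^{-1} K'r,
    # phi2 = (aI + K'K)^{-1} (K'r + a * phi1)
    G = alpha * np.eye(N) + K.T @ K
    phi1 = np.linalg.solve(G, K.T @ r)
    return np.linalg.solve(G, K.T @ r + alpha * phi1)

alphas = 10.0 ** np.arange(-8.0, 0.0)
res = [np.sum((r - K @ tikhonov2(a)) ** 2) for a in alphas]
crit1 = [s / a for s, a in zip(res, alphas)]                          # Criterion 1
crit2 = [np.sum(tikhonov2(a) ** 2) * s for s, a in zip(res, alphas)]  # Criterion 2
a1 = alphas[int(np.argmin(crit1))]
a2 = alphas[int(np.argmin(crit2))]
print(a1 > 0 and a2 > 0)                        # both rules select an alpha
```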
The first graphs show the role of the parameter α. If α is very low the estimation fluctuates around the true curve: the bias is reduced but, as usual, the variance is larger. As we know that most economic functions are "simple" (increasing, convex or concave), controlling the shape of the curve may be a good way to select the value of α. Figs. 6 and 7 show the role of the data driven selection rule and the variance of the estimators. For the two functions and for an L² penalty, the second column gives low bias and strongly oscillating curves. From these simulations, it seems that the L² penalty is not the best method of estimation in the case of panel data models.

Table 1
Parameters' values.

Scenario      I        II     III
Function ϕ    z²/2     z      e^{z/2}/(e^{z/2}+1)
n             200      200    1000
A             .7       .7     .7
B             .5       .5     .5
a             1        1      .25
Std U1        .5       .1     .1
Std U2        .5       .1     .1
Std η         1        1      1

7. Conclusion

This paper proposes an estimation method for panel data models analyzed non parametrically in the presence of endogenous variables. The model is a fixed effects model treated by differences. The practical aspects of the estimation are developed and the usual asymptotic properties of the estimator are derived. Special attention is given to the estimation of the derivative through a suitable penalization. Several aspects should be developed in the future, including the analysis of the asymptotic distribution of the estimated function (see Carrasco et al. (2014)) and its approximation by bootstrap methods. Usual test procedures (for testing exogeneity, a parametric form, or specific properties of the function) need to be adapted to our framework (see, in the usual IV framework, Horowitz (2006, 2007)). In the case of large dimensions (for Z in particular) semi parametric specifications may be used (see Florens et al. (2011)).

Appendix

A.1. Proofs

Proposition A.1. Let ϕ ∈ L²_Z and ψ ∈ L²_W. Then

⟨Kϕ, ψ⟩ = ∫ ψ(w) [ ∫ ϕ(z) (f_{Z2,W}(z, w) − f_{Z1,W}(z, w)) / f_W(w) dz ] f_W(w) dw
= ∫ ϕ(z) [ ∫ ψ(w) (f_{Z2,W}(z, w) − f_{Z1,W}(z, w)) / π(z) dw ] π(z) dz
= ⟨ϕ, K*ψ⟩.

The second equality follows from Fubini's theorem.

Proposition A.2. Let ϕ be such that Kϕ = 0. We have E(ϕ(Z2) − ϕ(Z1)|W) = 0, then ϕ(Z2) − ϕ(Z1) = 0 almost surely from the strong identification condition. As Z1 and Z2 are measurably separated, ϕ is constant and hence equal to 0 under the normalization rule.

Proposition A.3. We start from E(ϕ(Z1)|W) − E(ϕ(Z2)|W) = 0. Then E(ϕ(Z1)|W1) = E(ϕ(Z2)|W2), which implies under separability that

E(ϕ(Z1)|W1) = E(ϕ(Z2)|W2) = C  a.s.

Consider now one of the strong identification properties, Z1 strongly identified by W1 for example. E(ϕ(Z1) − C|W1) = 0 implies ϕ(Z1) − C = 0, i.e. ϕ is constant, and thus ϕ = 0 under the normalization rule.

Proof of Proposition 3.1. Let us first remark that, as E(Y2 − Y1|W) = E(ϕ(Z2) − ϕ(Z1)|W) = Kϕ, we have (using Assumption 3.1.i)

∥K̂ϕ − Kϕ∥² = O_P( n^{−2ρ/(2ρ+q)} ).

Then

∥r̂ − K̂ϕ∥ ≤ ∥r̂ − Kϕ∥ + ∥K̂ϕ − Kϕ∥,  and hence  ∥r̂ − K̂ϕ∥² = O_P( n^{−2ρ/(2ρ+q)} ).

Let us consider the decomposition

ϕ̂^α − ϕ = (αI + K̂*K̂)^(-1) K̂*r̂ − ϕ
= (αI + K̂*K̂)^(-1) K̂*(r̂ − K̂ϕ) + [ (αI + K̂*K̂)^(-1) K̂*K̂ϕ − (αI + K*K)^(-1) K*Kϕ ] + [ (αI + K*K)^(-1) K*Kϕ − ϕ ]
= A + B + C.

Observe first that

∥A∥² ≤ ∥(αI + K̂*K̂)^(-1) K̂*∥² ∥r̂ − K̂ϕ∥².

As ∥(αI + K̂*K̂)^(-1) K̂*∥ = O_P(1/√α), we get

∥A∥² = O_P( α^(-1) n^{−2ρ/(2ρ+q)} ).

Using results given in Darolles et al. (2011) and recalled after Assumption 3.2, we know that C = −α(αI + K*K)^(-1)ϕ and that ∥C∥² = O(α^{β∧2}). Finally, we have:

B = [I − (αI + K*K)^(-1) K*K]ϕ − [I − (αI + K̂*K̂)^(-1) K̂*K̂]ϕ
= [α(αI + K*K)^(-1) − α(αI + K̂*K̂)^(-1)]ϕ
= α(αI + K̂*K̂)^(-1) (K̂*K̂ − K*K)(αI + K*K)^(-1)ϕ
= α(αI + K̂*K̂)^(-1) (K̂* − K*)K(αI + K*K)^(-1)ϕ + α(αI + K̂*K̂)^(-1) K̂*(K̂ − K)(αI + K*K)^(-1)ϕ
= B1 + B2.

Using ∥(αI + K̂*K̂)^(-1)∥ = O_P(1/α), we have:

∥B1∥² ≤ ∥(αI + K̂*K̂)^(-1)∥² ∥K̂* − K*∥² ∥αK(αI + K*K)^(-1)ϕ∥².

Similar arguments to those of formulae (3.6) and (3.7) imply that the last term is O_P(α^{(β+1)∧2}). Thus

∥B1∥² = O_P( α^(-2) n^{−2κ/(2κ+p+q)} α^{(β+1)∧2} ),

and finally,

∥B2∥² ≤ ∥(αI + K̂*K̂)^(-1) K̂*∥² ∥K̂ − K∥² ∥α(αI + K*K)^(-1)ϕ∥² = O_P( α^(-1) n^{−2κ/(2κ+p+q)} α^{β∧2} ).
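The bound ∥(αI + K̂*K̂)^(-1)K̂*∥ = O_P(1/√α) used for the term A is in fact an exact spectral inequality: the singular values of (αI + K′K)^(-1)K′ are σ/(α + σ²) ≤ 1/(2√α). A quick numerical check (our code, a random matrix standing in for K̂):

```python
import numpy as np

rng = np.random.default_rng(6)
K = rng.normal(size=(40, 30))                   # arbitrary operator
alpha = 0.05
G = np.linalg.inv(alpha * np.eye(30) + K.T @ K) @ K.T
norm = np.linalg.norm(G, 2)                     # spectral norm
bound = 1 / (2 * np.sqrt(alpha))                # sup_s s / (alpha + s^2)
print(bool(norm <= bound + 1e-12))              # True
```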
Regrouping all these terms gives the elements of Proposition 3.1. Let us now select the optimal α in order to balance α^(-1) n^{−2ρ/(2ρ+q)} and α^{β∧2}: the optimal α is proportional to n^{−(2ρ/(2ρ+q)) × (1/((β∧2)+1))}, and the rate of convergence is n^{−(2ρ/(2ρ+q)) × ((β∧2)/((β∧2)+1))}. Finally we need to verify that ∥B∥² = Cα^{β∧2} where C = O_P(1), and this condition is satisfied under our assumptions. Indeed ∥B2∥² converges faster than ∥B1∥² if β ≥ 1, and we just have to verify that

n^{−2κ/(2κ+p+q)} × n^{(2ρ/(2ρ+q)) × ((β∧2)/((β∧2)+1))} = O_P(1),

which is implied by our assumptions.

Proof of formula (4.7). We want to minimize

⟨Kϕ − r, Kϕ − r⟩ + α⟨Lϕ, Lϕ⟩.

The Fréchet derivative of this expression in the direction ϕ̃ is

2⟨Kϕ̃, Kϕ − r⟩ + 2α⟨Lϕ̃, Lϕ⟩,

which must equal 0 for any ϕ̃ ∈ L²_π. Equivalently,

⟨ϕ̃, K*(Kϕ − r)⟩ + α⟨ϕ̃, L*Lϕ⟩ = 0.

Thus K*(Kϕ − r) + αL*Lϕ = 0. As L* = −L, we obtain formula (4.7).

A.2. Data driven method for choosing α

Let us analyze the product ∥ϕ̂′^α∥² ∥K̂Mϕ̂′^α − r̂∥². First,

ϕ̂′^α = (αI + M*K̂*K̂M)^(-1) M*K̂*r̂,

and

∥ϕ̂′^α∥² ≤ ∥(αI + M*K̂*K̂M)^(-1) M*K̂*∥² ∥r̂∥² = O_P(1/α).

The second term has been studied in Fève and Florens (2010).⁸ The criterion is then bounded above by a term of order O_P(α^(-1)(n^{−2ρ/(2ρ+q)} + α^{β+1})). The argument of the minimum of this upper bound has the rate α ∝ n^{−(2ρ/(2ρ+q)) × (1/(β+1))}, which is the optimal rate. This technique is an adaptation, to a linear inverse problem with an unknown operator, of the L curve method of Hansen (presented in Engl et al. (2000)); a more detailed analysis is given in this book.

Fig. 8. MC simulation for scenario II (T periods).
Fig. 9. Optimal choice of α.

A.3. Extension of the method to T periods. A simulated example

We extend to any value of T the estimation method given in (5.5), based on the Tikhonov regularization with L² penalty. In practice our method may be implemented in the following way. Let z̃ = (z̃_1′, …, z̃_T′)′ and ỹ = (ỹ_1′, …, ỹ_T′)′ be the nT vectors of the n observations over the T periods. We define as in Section 5 the matrices C_{Ztτ}; for each τ = 2, …, T, as in formula (5.4), we define

C_{Z,τ} = (C_{Z1τ}′, …, C_{ZTτ}′)′,

which is an nT × n matrix for any τ. Let us also define C_{Wτ} as in (5.3) for each W_τ, τ = 2, …, T. Then formula (5.5) is extended by considering

A = Σ_{τ=2}^T (C_{Z,τ} − C_{Z,τ−1}) C_{Wτ} (a_τ′ ⊗ I_n),

where I_n is the identity matrix of size n × n and a_τ is a T × 1 vector whose elements are all 0 except the (τ−1)th and the τth, which are respectively equal to −1 and 1. Then

ϕ̂^α = [αI_{nT} + A]^(-1) A ỹ + constant.

We apply this formula to an extension of the simulation presented in Section 6, using the same design for 3 periods in the case of scenario II. We represent in Fig. 8 the true function and 20 Monte Carlo simulations for an empirical data driven rule. An example of this minimization is given in Fig. 9, where the choice of α is obtained by

min_α ∥ϕ̂^α_(2)(Z)∥² Σ_{τ=2}^T ∥C_{Wτ}(a_τ′ ⊗ I_n)(ỹ − ϕ̂^α_(2)(Z))∥²,

where ϕ̂^α_(2)(Z) is obtained by an iterated Tikhonov estimation of order 2.

⁸ Using a second order Tikhonov estimation, and if β ≤ 2, it is proved that this term is O_P(n^{−2ρ/(2ρ+q)} + α^{β+1}). This second order regularization is necessary to get an α^{β+1} term when 1 < β ≤ 2.
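The matrix a_τ′ ⊗ I_n is just a selection matrix that extracts the difference of two consecutive periods from the stacked nT vector; for instance (our toy dimensions):

```python
import numpy as np

n, T = 4, 3
tau = 2                                   # take the difference y_2 - y_1
a = np.zeros(T)
a[tau - 2], a[tau - 1] = -1.0, 1.0        # entries (tau-1) and tau of a_tau
S = np.kron(a, np.eye(n))                 # a_tau' kron I_n, an n x nT matrix
y = np.arange(n * T, dtype=float)         # stacked (y_1', ..., y_T')'
print(S @ y)                              # [4. 4. 4. 4.] = y_2 - y_1
```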
References

Altonji, J.G., Matzkin, R.L., 2005. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73 (4), 1053–1102.
Andrews, D.W., 2011. Examples of L2-complete and boundedly complete distributions. Cowles Foundation Discussion Paper 1801, Yale University.
Breunig, C., Johannes, J., 2013. Adaptive estimation of functionals in nonparametric instrumental regression. Department of Statistics Working Paper, Université Catholique de Louvain.
Canay, I.A., Santos, A., Shaikh, A.M., 2011. The testability of identification in some nonparametric models with endogeneity. Discussion Paper, Northwestern University.
Carrasco, M., Florens, J.P., Renault, E., 2007. Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization. In: Heckman, J.J., Leamer, E.E. (Eds.), Handbook of Econometrics, vol. 6B, chapter 77, pp. 5633–5751.
Carrasco, M., Florens, J.P., Renault, E., 2014. Asymptotic normal inference in linear inverse problems. In: Racine, J.S., Su, L., Ullah, A. (Eds.), The Oxford Handbook of Applied Nonparametric and Semiparametric Econometrics and Statistics. Oxford University Press.
Centorrino, S., 2013. On the choice of the regularization parameter in nonparametric instrumental regressions. Mimeo, Toulouse School of Economics.
Centorrino, S., Fève, F., Florens, J.P., 2013. Implementation, simulation and bootstrap in nonparametric instrumental variable regression. Discussion Paper, Toulouse School of Economics.
Chen, X., Reiss, M., 2011. On rate optimality for ill-posed inverse problems in econometrics. Econometric Theory 27 (3), 497–521.
Darolles, S., Fan, Y., Florens, J.P., Renault, E., 2011. Nonparametric instrumental regression. Econometrica 79 (5), 1541–1565.
D'Haultfoeuille, X., 2011. On the completeness condition in nonparametric instrumental problems. Econometric Theory 27 (3), 466–471.
Dunford, N., Schwartz, J.T., 1988. Linear Operators. Wiley, New York.
Engl, H.W., Hanke, M., Neubauer, A., 2000. Regularization of Inverse Problems. Kluwer, Dordrecht.
Evdokimov, K., 2010. Essays on Nonparametric and Semiparametric Econometric Models (Ph.D. thesis). Yale University.
Fève, F., Florens, J.P., 2010. The practice of nonparametric estimation by solving inverse problems: the example of transformation models. Econom. J. 13 (3), 1–27.
Florens, J.P., 2003. Inverse problems in structural econometrics: the example of instrumental variables. In: Dewatripont, M., Hansen, L.P., Turnovsky, S.J. (Eds.), Advances in Economics and Econometrics, vol. 2. Cambridge University Press, Cambridge.
Florens, J.P., Heckman, J.J., Meghir, C., Vytlacil, E., 2008. Identification of treatment effects using control functions in models with continuous treatment and heterogeneous effects. Econometrica 76, 1191–1206.
Florens, J.P., Johannes, J., Van Bellegem, S., 2011. Identification and estimation by penalization in nonparametric instrumental regression. Econometric Theory 27 (3), 472–496.
Florens, J.P., Mouchart, M., Rolin, J.M., 1990. Elements of Bayesian Statistics. M. Dekker, New York.
Florens, J.P., Racine, J., 2012. Nonparametric instrumental derivatives. Discussion Paper, Toulouse School of Economics.
Gagliardini, P., Scaillet, O., 2012. Nonparametric instrumental variable estimators of structural quantile effects. Econometrica 80, 1533–1562.
Hall, P., Horowitz, J., 2005. Nonparametric methods for inference in the presence of instrumental variables. Ann. Statist. 33 (6), 2904–2929.
Horowitz, J.L., 2006. Testing a parametric model against a nonparametric alternative with identification through instrumental variables. Econometrica 74, 521–538.
Horowitz, J.L., 2007. Asymptotic normality of a nonparametric instrumental variables estimator. Internat. Econom. Rev. 48, 1329–1349.
Horowitz, J.L., 2010. Adaptive nonparametric instrumental variables estimation: empirical choice of the regularization parameter. Economics Department Working Paper, Northwestern University.
Horowitz, J.L., 2011. Applied nonparametric instrumental variables regression. Econometrica 79, 347–394.
Hu, Y., Shiu, J.-L., 2011. Nonparametric identification using instrumental variables: sufficient conditions for completeness. Cemmap Working Paper.
Johannes, J., Van Bellegem, S., Vanhems, A., 2011. Convergence rates for ill-posed inverse problems with an unknown operator. Econometric Theory 27, 522–545.
Letac, G., 1995. Integration and Probability by Paul Malliavin, Exercises and Solutions. Springer, New York.
Newey, W.K., Powell, J.L., 2003. Instrumental variable estimation of nonparametric models. Econometrica 71 (5), 1565–1578.
Wilhelm, D., 2012. Identification and estimation of nonparametric panel data regressions with measurement error. Discussion Paper, University of Chicago.