Journal Pre-proof

Smooth backfitting for errors-in-variables varying coefficient regression models

Kyunghee Han, Young K. Lee, Byeong U. Park

PII: S0167-9473(19)30264-6
DOI: https://doi.org/10.1016/j.csda.2019.106909
Reference: COMSTA 106909

To appear in: Computational Statistics and Data Analysis

Received date: 30 April 2019
Revised date: 26 December 2019
Accepted date: 26 December 2019

Please cite this article as: K. Han, Y.K. Lee and B.U. Park, Smooth backfitting for errors-in-variables varying coefficient regression models. Computational Statistics and Data Analysis (2020), doi: https://doi.org/10.1016/j.csda.2019.106909.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier B.V. All rights reserved.
Smooth Backfitting for Errors-in-Variables Varying Coefficient Regression Models
Kyunghee Han, Young K. Lee1 and Byeong U. Park
University of Pennsylvania, Kangwon National University, Seoul National University
Revised for CSDA: December 26, 2019
Abstract

Varying coefficient models inherit the simplicity and easy interpretation of classical linear models while enjoying the flexibility of nonparametric models. They are very useful in analyzing the relation between a response and a set of predictors. There has been no study, however, on the estimation of varying coefficients when the predictors, on which the varying coefficients depend, are contaminated by measurement errors. A new kernel smoothing technique is developed here that is tailored to the structure of an underlying varying coefficient model and corrects for the bias due to the measurement errors. The estimators of the varying coefficients are given implicitly as the solution of a system of integral equations, whose implementation requires an iterative backfitting procedure. The existence of a unique solution and the convergence of the associated backfitting algorithm are established theoretically. Numerical evidence that supports the theory and demonstrates the success of the proposed methodology is presented.

Key Words: Kernel smoothing, smooth backfitting, varying coefficient models, local polynomial regression, errors-in-variables.
Short Title: SBF for EV Varying Coefficient Models

1 Corresponding author. Postal address: Department of Statistics, Kangwon National University, 313 Global Business Hall, Chuncheon, Gangwon-do, 24341 South Korea. Email: [email protected], [email protected]. Phone: +82 33 250 8438. Fax: +82 33 259 5663.
1 Introduction
Let Y ∈ R be the response variable, and let X^o = (X_1^o, …, X_d^o)^⊤ and Z^o = (Z_1^o, …, Z_d^o)^⊤ be the true predictors. We study the estimation of the varying coefficient model

    Y = f_1(X_1^o) Z_1^o + ⋯ + f_d(X_d^o) Z_d^o + e    (1.1)

with E(e | X^o, Z^o) = 0. We consider the case where we do not observe the true predictors, but only versions contaminated by measurement errors. Suppose that we observe X_i = X_i^o + U_i and Z_i = Z_i^o + V_i, where (X_i^o, Z_i^o) are unobserved i.i.d. copies of (X^o, Z^o) and (U_i, V_i) are i.i.d. unobserved measurement error vectors. We assume that (U_i, V_i) are independent of (X_i^o, Z_i^o, e_i) and that U_i and V_i are independent of each other. In this paper, we propose a powerful technique for estimating the coefficient functions f_j based on observations of the response and the contaminated predictors.
The varying coefficient model (1.1), introduced by Hastie and Tibshirani (1993), is an example of structural nonparametric regression with which one may avoid the curse of dimensionality. The model has the simple structure and easy interpretation of the classical linear model, yet gains flexibility by allowing the regression coefficients to depend on some other predictors in a nonparametric way. The model has been well studied in the case of no measurement error in the predictors; see Yang et al. (2006) and Lee et al. (2012a), among others.

There is another type of varying coefficient model that has been more extensively studied than (1.1) in the case of no measurement error in the predictors. It lets all regression coefficients depend on a common predictor, say X^o. The model takes the form

    Y = f_1(X^o) Z_1^o + ⋯ + f_d(X^o) Z_d^o + e.    (1.2)

The estimation of the latter model is much easier than that of (1.1) since

    (f_1(x), …, f_d(x))^⊤ = [E(Z^o Z^{o⊤} | X^o = x)]^{-1} E(Z^o Y | X^o = x),

so that the estimation of the f_j may be accomplished by applying a standard kernel smoothing technique to each entry of Z^o Z^{o⊤} and Z^o Y as a response, with X^o as a single predictor. We refer to the review paper of Park et al. (2015) for a detailed account of the issues in the estimation of the models (1.1) and (1.2) in the case of no measurement error.
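As an illustration of this closed form, a local constant (kernel-weighted least squares) estimator of (1.2) takes only a few lines when there is no measurement error. The following sketch is ours, not code from the paper; the coefficient functions, bandwidth, kernel and sample size are all illustrative choices:

```python
import numpy as np

def epanechnikov(v):
    # Baseline kernel supported on [-1, 1].
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v**2), 0.0)

def estimate_vc_at(x, X, Z, Y, h):
    """Kernel estimator of (f_1(x), ..., f_d(x)) for model (1.2):
    solve [sum_i K_h(x - X_i) Z_i Z_i^T] beta = sum_i K_h(x - X_i) Z_i Y_i."""
    w = epanechnikov((x - X) / h) / h              # kernel weights K_h(x - X_i)
    A = (Z * w[:, None]).T @ Z                     # locally smoothed Z Z^T
    b = (Z * w[:, None]).T @ Y                     # locally smoothed Z Y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n = 20000
X = rng.uniform(0.0, 1.0, n)                          # common predictor X^o
Z = np.column_stack([np.ones(n), rng.normal(size=n)]) # Z_1 = 1, Z_2 ~ N(0, 1)
f1, f2 = lambda x: 1.0 + x, lambda x: 2.0 * x         # illustrative coefficients
Y = f1(X) * Z[:, 0] + f2(X) * Z[:, 1] + 0.1 * rng.normal(size=n)

beta = estimate_vc_at(0.5, X, Z, Y, h=0.1)
print(beta)   # close to (f1(0.5), f2(0.5)) = (1.5, 1.0)
```

This works only because a single predictor X^o drives all coefficients; as discussed below, no such pointwise closed form is available for model (1.1).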
There have been a few works treating measurement errors in varying coefficient models, including You et al. (2006), Li and Greene (2008) and Liu et al. (2019). All of these considered the simpler model (1.2) and the case where only Z^o is contaminated, by V. In the latter case, the effect of the measurement errors in Z_i may be removed by correcting for the so-called 'attenuation effect'. The measurement errors in X_i are more difficult to handle. There has been no attempt to estimate even the simpler model (1.2) when both X_i^o and Z_i^o are contaminated. We stress that the estimation of the model (1.1) is much more difficult than that of (1.2) when the observed X_i also contain measurement errors. The very core of the matter with the measurement errors in X_i is that the responses Y_i do not contain relevant information about f_j around the observed values of the predictors X_i, due to the measurement errors U_i. Thus, local smoothing with a conventional kernel scheme breaks down. In the case of the simpler model (1.2), this difficulty may be resolved by employing the so-called deconvolution kernel (Fan and Truong, 1993). In the case of our model (1.1), however, the deconvolution kernel scheme does not work.

In this paper we develop a powerful technique for fitting the model (1.1) based on (X_i, Z_i, Y_i), 1 ≤ i ≤ n. To remove the effect of the measurement errors in X_i, we make use of the 'normalized-and-deconvoluted' kernel scheme designed for additive regression by Han and Park (2018); see Section 2.2. We find that these non-standard kernel weights successfully remove the effect of the measurement errors U_i in the estimation of the varying coefficient model (1.1). The resulting method, combined with a correction for the attenuation effect of V_i, has a projection interpretation in terms of relevant function spaces. We detail in this paper the relevant theory for the associated projection operators, which has been lacking in the existing literature even in the case of no measurement error; see Section 3. The set of our estimators of the coefficient functions f_j in (1.1) is given as the solution of a system of integral equations and thus does not have a closed form. We suggest a backfitting algorithm to realize the solution. Our theory justifies the proposed method. We prove that the system of integral equations has a unique solution and that the iterative backfitting algorithm converges to the solution at a geometric speed. We also present numerical evidence that supports the theory.
2 Methodology
We assume that the densities of the original predictors X_j^o are supported on bounded intervals with known boundaries. Without loss of generality we take the supports to equal [0, 1]. We also assume that the eigenvalues of E(Z^o Z^{o⊤} | X^o = x) are bounded below by a positive constant λ_0 > 0. These assumptions are required to identify each coefficient function f_j in the model (1.1). To see this, we note that, for square integrable functions m_j,

    E[(m_1(X_1^o) Z_1^o + ⋯ + m_d(X_d^o) Z_d^o)²] = E[m(X^o)^⊤ Z^o Z^{o⊤} m(X^o)] ≥ λ_0 Σ_{j=1}^d E[m_j(X_j^o)²],

where m = (m_1, …, m_d)^⊤. Since the density of X^o is positive on [0,1]^d, we get from the above inequality that E[(m_1(X_1^o) Z_1^o + ⋯ + m_d(X_d^o) Z_d^o)²] = 0 implies m_j ≡ 0 a.e. on [0, 1] for all 1 ≤ j ≤ d.
Let K be a baseline kernel function supported on a compact set, say [−1, 1]. Throughout the paper, K_h(v) = K(v/h)/h. Let K_h(·, ·) be the normalized kernel function defined by

    K_h(x, u) = ( ∫_0^1 K_h(v − u) dv )^{-1} K_h(x − u) · I_{[0,1]²}(x, u).    (2.1)

The normalized kernel K_h(x, u) equals the conventional kernel K_h(x − u) for all u ∈ [0, 1] if x is in the 'interior' region [2h, 1 − 2h]. For the standard kernel K_h(·), the interior region is [h, 1 − h] since ∫_0^1 K_h(x − v) dv = 1 for x ∈ [h, 1 − h]. The above normalized kernel has been used in the smooth backfitting literature for the case of no measurement error in the predictors; see Mammen et al. (1999) or Yu et al. (2008), for example.

For m : R^d → R^d with m(x) = (m_1(x_1), …, m_d(x_d))^⊤, it makes sense to write m = (m_1, …, m_d)^⊤ if we interpret m_j(x) as m_j(x_j), where x_j is the jth entry of x. Throughout this paper, we use this convention. In this way, we may write f = (f_1, …, f_d)^⊤ for the vector of the true component functions f_j in the model (1.1).

2.1 Adjustment for noisy Z_i
To introduce the smooth backfitting (SBF) technique for estimating (1.1), suppose for the time being that we observe X_i^o ≡ (X_{i1}^o, …, X_{id}^o)^⊤ and Z_i^o ≡ (Z_{i1}^o, …, Z_{id}^o)^⊤ without measurement errors. A direct application of the SBF idea is to minimize

    L(m) := 2^{-1} n^{-1} Σ_{i=1}^n ∫_{[0,1]^d} [ Y_i − m(x)^⊤ Z_i^o ]² K_h(x, X_i^o) dx    (2.2)

over function tuples m with m(x) = (m_1(x_1), …, m_d(x_d))^⊤, where K_h(x, u) = Π_{j=1}^d K_{h_j}(x_j, u_j). Define f̃_j^o(x_j) = q̂_j^o(x_j)^{-1} n^{-1} Σ_{i=1}^n Z_{ij}^o Y_i K_{h_j}(x_j, X_{ij}^o) and q̂_j^o(x_j) = n^{-1} Σ_{i=1}^n (Z_{ij}^o)² K_{h_j}(x_j, X_{ij}^o). Let M̂_j^o be the d-vectors defined by (M̂_1^o(x), …, M̂_d^o(x)) =: M̂^o(x) = n^{-1} Σ_{i=1}^n K_h(x, X_i^o) Z_i^o Z_i^{o⊤}. Here and throughout this paper, x_{−j} = (x_1, …, x_{j−1}, x_{j+1}, …, x_d)^⊤. We note that M̂^o(x) estimates M(x) := E(Z^o Z^{o⊤} | X^o = x) p(x), and q̂_j^o(x_j) is an estimator of q_j(x_j) := E((Z_j^o)² | X_j^o = x_j) p_j(x_j), where p and p_j are the densities of X^o and X_j^o, respectively. By considering the Gâteaux derivatives of L(m), a solution f̂^o = (f̂_1^o, …, f̂_d^o)^⊤, if it exists, satisfies

    ∫_{[0,1]^{d−1}} [ f̃_j^o(x_j) q̂_j^o(x_j) − M̂_j^o(x)^⊤ f̂^o(x) ] dx_{−j} = 0,  1 ≤ j ≤ d.    (2.3)

In the derivation of (2.3), we used the normalization property that ∫_0^1 K_h(x, u) dx = 1 for all u ∈ [0, 1].
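The normalization property just used is what the boundary factor in (2.1) is designed to deliver, and it is easy to check numerically. The sketch below is our illustration (Epanechnikov baseline kernel, arbitrary bandwidth and grid, not code from the paper):

```python
import numpy as np

def K(v):
    # Epanechnikov baseline kernel on [-1, 1].
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v**2), 0.0)

def Kh(v, h):
    return K(v / h) / h

xg = np.linspace(0.0, 1.0, 4001)   # quadrature grid on [0, 1]

def K_norm(x, u, h):
    """Normalized kernel (2.1): K_h(x - u) divided by its integral over [0, 1]
    in the first argument, for x, u in [0, 1]^2."""
    denom = np.trapz(Kh(xg - u, h), xg)   # equals 1 when u lies in [h, 1 - h]
    return Kh(x - u, h) / denom

h = 0.1
for u in (0.02, 0.5, 0.97):            # boundary and interior locations
    total = np.trapz(K_norm(xg, u, h), xg)
    print(u, total)                     # ~ 1 for every u in [0, 1]
```

By construction the integral over x equals one for every u ∈ [0, 1]; for u ∈ [h, 1 − h] the denominator is 1 and the normalized kernel reduces to the conventional K_h(x − u).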
Now, suppose that we observe X_i^o and noisy Z_i = Z_i^o + V_i for random vectors V_i ≡ (V_{i1}, …, V_{id})^⊤ with E(V_i) = 0. To derive a proper estimator of M based on (X_i^o, Z_i), 1 ≤ i ≤ n, we note that

    E(Z Z^⊤ | X^o = x) = E(Z^o Z^{o⊤} | X^o = x) + Σ_V,

where Σ_V = E(V V^⊤). This follows from the assumption that V is independent of Z^o and X^o. This suggests

    M̂(x) = n^{-1} Σ_{i=1}^n K_h(x, X_i^o) ( Z_i Z_i^⊤ − Σ_V )

as an estimator of M(x). Define q̂_j and f̃_j as q̂_j^o and f̃_j^o, respectively, now with Z_{ij} replacing Z_{ij}^o in f̃_j^o and Z_{ij}² − Var(V_j) replacing (Z_{ij}^o)² in q̂_j^o. For k ≠ j, define

    q̂_{jk}(x_j, x_k) = n^{-1} Σ_{i=1}^n K_{h_j}(x_j, X_{ij}^o) K_{h_k}(x_k, X_{ik}^o) ( Z_{ij} Z_{ik} − Cov(V_j, V_k) ).

These estimate q_{jk}(x_j, x_k) := E(Z_j^o Z_k^o | X_j^o = x_j, X_k^o = x_k) p_{jk}(x_j, x_k), where p_{jk} is the density of (X_j^o, X_k^o). The contribution of V_{ij} in Z_{ij} to f̃_j is negligible since E(V_{ij} Y_i | X_i^o, Z_i^o) = 0, so no adjustment for the measurement errors V_{ij} is needed in this part. Replacing M̂_j^o, f̃_j^o and q̂_j^o by M̂_j, f̃_j and q̂_j, respectively, in (2.3) gives

    ∫_{[0,1]^{d−1}} [ f̃_j(x_j) q̂_j(x_j) − M̂_j(x)^⊤ f̂(x) ] dx_{−j} = 0,  1 ≤ j ≤ d,    (2.4)
which we solve to find f̂ = (f̂_1, …, f̂_d)^⊤. We see that the system of equations (2.4) is equivalent to

    f̂_j(x_j) = f̃_j(x_j) − Σ_{k=1, k≠j}^d ∫_0^1 f̂_k(x_k) q̂_{jk}(x_j, x_k) / q̂_j(x_j) dx_k,  1 ≤ j ≤ d.    (2.5)
In the derivation of the above system of equations, we used ∫ M̂_{jj}(x) f̂_j(x_j) dx_{−j} = q̂_j(x_j) f̂_j(x_j) and ∫ M̂_{jk}(x) f̂_k(x_k) dx_{−j} = ∫_0^1 q̂_{jk}(x_j, x_k) f̂_k(x_k) dx_k, where M̂_{jk} denotes the (j, k)th entry of M̂.

2.2 Adjustment for noisy X_i
We now come to the case where we observe the contaminated predictors (X_i, Z_i) instead of (X_i^o, Z_i^o). Recall X_i = X_i^o + U_i. We write U_i = (U_{i1}, …, U_{id})^⊤. The main difficulty with the normalized kernel K_h(x, u) as defined at (2.1) is that the weights K_{h_j}(x_j, X_{ij}) are quite different from the weights K_{h_j}(x_j, X_{ij}^o) that we would use if the X_{ij}^o were available. Depending on the value of U_{ij}, the weight K_{h_j}(x_j, X_{ij}) may be large even if X_{ij}^o is far away from x_j, and may be small even if X_{ij}^o is very close to x_j. Thus, the kernel scheme K_{h_j}(x_j, X_{ij}) fails to collect the information about the target function at a point x_j from those (X_{ij}, Y_i) with X_{ij}^o close to x_j.

Let L_h(x, u) = ∫_{−∞}^{∞} K_h(x, v) K_h(u − v) dv. For the present setting we use the normalized-and-deconvoluted (ND) kernel K_{h_j}^* defined by

    K_{h_j}^*(x_j, u) = (2π)^{-1} ∫_{−∞}^{∞} e^{−itu} φ_{L_{h_j}(x_j, ·)}(t) / φ_{U_j}(−t) dt,    (2.6)

instead of K_{h_j}(x_j, u), for smoothing across X_{ij}. Here and throughout the paper, φ_g for a function g is the Fourier transform of g, and φ_U for a random variable U is the characteristic function of U. In the definition (2.6), for simplicity we suppress the dependence of K_{h_j}^* on the index j through the characteristic function of U_j. The above ND kernel originates from Han and Park (2018). It has proved successful in the application of the SBF technique to additive models and partially additive models with measurement errors in the predictors; see Han and Park (2018) and Lee et al. (2018).
One of the salient features of the ND kernel K_h^* is the so-called "unbiased scoring" property:

    E[ K_{h_j}^*(x_j, X_j) | X_j^o ] = L_{h_j}(x_j, X_j^o)  for all x_j, X_j^o ∈ [0, 1].    (2.7)

This implies that the conditional expectation of the ND kernel weight for the contaminated variable X_j, given the original variable X_j^o, equals the conventional kernel weight for X_j^o. Thus, it effectively removes the bias caused by the measurement errors U_{ij}. The unbiased scoring property entails that the bias of n^{-1} Σ_{i=1}^n K_{h_j}^*(x_j, X_{ij}) Y_i, for example, as an estimator of E(Y | X_j^o = x_j) p_j(x_j), is the same as the bias of n^{-1} Σ_{i=1}^n L_{h_j}(x_j, X_{ij}^o) Y_i with the conventional convolution kernel L_h(·, ·). Another important property of K_h^* is that

    ∫_0^1 K_{h_j}^*(x_j, v_j) dx_j = 1  for all v_j ∈ R.    (2.8)

This normalization property, shared by K_{h_j}(·, ·) as defined at (2.1) in the case of no measurement errors, is crucial for the success of the SBF technique.
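To make (2.6)–(2.8) concrete, consider Laplace measurement errors with φ_U(t) = (1 + t²/θ²)^{-1} (the λ = 1 case of the error distribution used later in Section 4). Then 1/φ_U(−t) = 1 + t²/θ², and inverting the Fourier transform in (2.6) gives the closed form K_h^*(x, u) = L_h(x, u) − θ^{-2} ∂²L_h(x, u)/∂u². The sketch below is our illustration, not code from the paper: it computes L_h by quadrature, forms K_h^* by a central difference in u, and checks the normalization (2.8) numerically at interior points (the bandwidth, θ, grid and difference step are arbitrary choices):

```python
import numpy as np

def K(v):
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v**2), 0.0)   # Epanechnikov

def Kh(v, h):
    return K(v / h) / h

h, theta = 0.1, 10.0
xg = np.linspace(0.0, 1.0, 1001)                   # grid on [0, 1]

# Normalized kernel matrix N[i, j] = K_h(xg[i], xg[j]) as in (2.1).
raw = Kh(xg[:, None] - xg[None, :], h)             # K_h(x - v) on the grid
denom = np.trapz(raw, xg, axis=0)                  # integral over the first argument
N = raw / denom[None, :]

def L_of(u):
    # L_h(x, u) = int K_h(x, v) K_h(u - v) dv, for all grid x at once;
    # v runs over [0, 1] only, since the normalized kernel vanishes outside.
    return np.trapz(N * Kh(u - xg, h)[None, :], xg, axis=1)

def K_star_of(u, d=0.05):
    # ND kernel for Laplace errors: L_h(x, u) - theta^{-2} d^2/du^2 L_h(x, u).
    d2 = (L_of(u + d) - 2.0 * L_of(u) + L_of(u - d)) / d**2
    return L_of(u) - d2 / theta**2

for u in (0.3, 0.5, 0.7):                          # interior points
    total = np.trapz(K_star_of(u), xg)
    print(u, total)                                 # ~ 1, as claimed in (2.8)
```

For smoother error distributions (e.g. larger λ, or normal errors) no such derivative formula is available and (2.6) is evaluated by numerical Fourier inversion instead.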
To deconvolute the effect of the measurement errors U_i in X_i in the local smoothing, we basically replace the kernel K_{h_j}(·, X_{ij}^o) of Section 2.1 by K_{h_j}^*(·, X_{ij}). Define

    f̃_j^*(x_j) = q̂_j^*(x_j)^{-1} n^{-1} Σ_{i=1}^n Z_{ij} Y_i K_{h_j}^*(x_j, X_{ij}),
    q̂_j^*(x_j) = n^{-1} Σ_{i=1}^n ( Z_{ij}² − Var(V_j) ) K_{h_j}^*(x_j, X_{ij}),    (2.9)
    q̂_{jk}^*(x_j, x_k) = n^{-1} Σ_{i=1}^n ( Z_{ij} Z_{ik} − Cov(V_j, V_k) ) K_{h_j}^*(x_j, X_{ij}) K_{h_k}^*(x_k, X_{ik}).

The SBF estimator of f = (f_1, …, f_d) is defined as the solution f̂^* = (f̂_1^*, …, f̂_d^*) of the system of equations

    f̂_j^*(x_j) = f̃_j^*(x_j) − Σ_{k=1, k≠j}^d ∫_0^1 f̂_k^*(x_k) q̂_{jk}^*(x_j, x_k) / q̂_j^*(x_j) dx_k,  1 ≤ j ≤ d.    (2.10)
We cannot solve the SBF equation (2.10) explicitly, but only by the following backfitting iteration: for r ≥ 1,

    f̂_j^{*[r]}(x_j) = f̃_j^*(x_j) − Σ_{k=1}^{j−1} ∫_0^1 f̂_k^{*[r]}(x_k) q̂_{jk}^*(x_j, x_k) / q̂_j^*(x_j) dx_k
                      − Σ_{k=j+1}^d ∫_0^1 f̂_k^{*[r−1]}(x_k) q̂_{jk}^*(x_j, x_k) / q̂_j^*(x_j) dx_k,  1 ≤ j ≤ d.    (2.11)

In the next section we prove that there exists a unique solution f̂^* of the SBF equation (2.10), and that the SBF algorithm (2.11) converges to the solution at a geometric speed. For the initial estimators f̂_j^{*[0]} we only require ∫_0^1 f̂_j^{*[0]}(x_j)² q̂_j^*(x_j) dx_j < ∞ for the SBF iteration (2.11) to converge. We may take f̂_j^{*[0]} ≡ 0, in which case f̂_j^{*[1]} = f̃_j^*. This means we may also take f̂_j^{*[0]} = f̃_j^* from the start.
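The iteration (2.11) is straightforward to implement on a grid once f̃_j^*, q̂_j^* and q̂_{jk}^* have been evaluated. The sketch below is our illustration with synthetic, smooth stand-ins for those estimated quantities (not estimates from data); it runs the Gauss–Seidel updates with the stopping rule used later in Section 4.2 and then verifies that the limit satisfies the fixed-point equation (2.10):

```python
import numpy as np

d, G = 3, 201
x = np.linspace(0.0, 1.0, G)

# Synthetic stand-ins for the quantities in (2.9) (illustrative only):
f_tilde = np.array([np.sin(2 * np.pi * (j + 1) * x) for j in range(d)])
q = np.array([1.0 + 0.2 * np.cos(2 * np.pi * x + j) for j in range(d)])   # q_j > 0
q_jk = {(j, k): 0.2 * np.outer(np.sin(np.pi * x + j), np.cos(np.pi * x + k))
        for j in range(d) for k in range(d) if j != k}                    # small off-diagonals

def update_term(f_k, j, k):
    # int_0^1 f_k(x_k) q_jk(x_j, x_k) / q_j(x_j) dx_k, on the x_j grid.
    return np.trapz(q_jk[(j, k)] * f_k[None, :], x, axis=1) / q[j]

f = np.zeros((d, G))                       # initial estimates f_j^{*[0]} = 0
for it in range(1, 101):
    f_old = f.copy()
    for j in range(d):
        # (2.11): components k < j use the current sweep, k > j the previous one.
        f[j] = f_tilde[j] - sum(update_term(f[k], j, k) for k in range(d) if k != j)
    crit = sum(np.trapz((f[j] - f_old[j])**2, x) for j in range(d))
    if crit <= 1e-12:
        break

# Fixed-point check: the limit satisfies (2.10).
resid = max(np.max(np.abs(f[j] - (f_tilde[j] - sum(update_term(f[k], j, k)
            for k in range(d) if k != j)))) for j in range(d))
print(it, resid)    # converges in a modest number of sweeps; resid ~ 0
```

The small off-diagonal coupling here mimics the contraction condition established in Section 3; with strong coupling the iteration can converge slowly or not at all.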
We close this section with a remark on the case where M is a diagonal matrix. From the model (1.1) we get

    E(Z_j^o Y | X_j^o) = f_j(X_j^o) E((Z_j^o)² | X_j^o) + Σ_{k≠j} E( f_k(X_k^o) Z_j^o Z_k^o | X_j^o ),  1 ≤ j ≤ d.

If M is diagonal, then

    E( f_k(X_k^o) Z_j^o Z_k^o | X_j^o = x_j ) p_j(x_j) = ∫_{[0,1]^{d−1}} f_k(x_k) M_{jk}(x) dx_{−j} = 0

for all 1 ≤ j ≠ k ≤ d, where M_{jk} is the (j, k)th entry of M. This implies that, if M is diagonal, then

    f_j(x_j) = E(Z_j^o Y | X_j^o = x_j) / E((Z_j^o)² | X_j^o = x_j).

In view of the definitions at (2.9), f̃_j^*(x_j) estimates the right-hand side of the above identity and thus may be used as an estimator of f_j when M is diagonal. In this case, no backfitting is needed.
3 Theoretical justification

3.1 The space of coefficient functions

To discuss the existence of the solution of the SBF equation (2.10) and the convergence of its algorithm (2.11), we need to introduce some relevant function spaces. For x, u ∈ [0,1]^d, let K_h^*(x, u) = Π_{j=1}^d K_{h_j}^*(x_j, u_j) and define

    M̂^*(x) = n^{-1} Σ_{i=1}^n K_h^*(x, X_i) ( Z_i Z_i^⊤ − Σ_V ).

Recall that M(x) = E(Z^o Z^{o⊤} | X^o = x) p(x). We consider the space of component function tuples m = (m_1, …, m_d)^⊤ : [0,1]^d → R^d with m_j : [0, 1] → R such that ∫ m(x)^⊤ M(x) m(x) dx < ∞. We call it H. Let H_j denote the subspaces of H defined by

    m ∈ H_j  if and only if  m_k ≡ 0 for all k ≠ j.

It holds that H = H_1 + ⋯ + H_d. We consider the inner product ⟨·,·⟩ defined by

    ⟨g, m⟩ := ∫_{[0,1]^d} g(x)^⊤ M(x) m(x) dx.

Let ‖·‖ denote the induced norm, so that

    ‖m‖² := ∫_{[0,1]^d} m(x)^⊤ M(x) m(x) dx.

In this formulation, ‖·‖ is certainly a norm for H under the assumptions (A1) and (A2) in the Appendix. We recall that

    q_j(u_j) = ∫_{[0,1]^{d−1}} M_{jj}(u) du_{−j} = E( (Z_j^o)² | X_j^o = u_j ) p_j(u_j),
    q_{jk}(u_j, u_k) = ∫_{[0,1]^{d−2}} M_{jk}(u) du_{−(j,k)} = E( Z_j^o Z_k^o | X_j^o = u_j, X_k^o = u_k ) p_{jk}(u_j, u_k),

where M_{jk} are the (j, k)th entries of M. From the results (A.15) and (A.17) in the Appendix, there exist constants 0 < λ_{1L} < λ_{1U} < ∞ and 0 < λ_{2U} < ∞ such that

    λ_{1L} < inf_{u∈[0,1]} q_j(u) ≤ sup_{u∈[0,1]} q_j(u) < λ_{1U},  1 ≤ j ≤ d,    (3.1)
    sup_{(u,v)∈[0,1]²} |q_{jk}(u, v)| < λ_{2U},  1 ≤ j ≠ k ≤ d.
Remark 1. We note that the empirical version of ‖·‖, defined by ‖m‖_n² = ∫ m(x)^⊤ M̂^*(x) m(x) dx, is not appropriate as a norm for the space H since it may take negative values. If M̂^*(x) converges to M(x) = E(Z^o Z^{o⊤} | X^o = x) p(x) uniformly for x ∈ [0,1]^d and the assumptions (A1) and (A2) in the Appendix hold, then ‖·‖_n is equivalent to ‖·‖ on H with probability tending to one. Thus, one may use it as a norm in that case. But the convergence of M̂^*(x) to M(x) requires a strong condition on the bandwidths h_j, namely that they go to zero very slowly, since M̂^*(x) involves d-dimensional smoothing. Indeed, we need n Π_{j=1}^d h_j^{1+2β} → ∞ as n → ∞ for some β ≥ 0, where β depends on the tail behaviour of the characteristic function of U_{ij}; see the assumption (A5) in the Appendix.
3.2 Existence and uniqueness of the SBF solution

To discuss the theoretical justification of our methodology in Section 2.2, we first prove that H is closed. Define the d-vectors M_j by M = (M_1, …, M_d). We introduce relevant projection operators that map H to H_j for 1 ≤ j ≤ d. Recall that H_j consists of function tuples of the form 1_j · m_j for univariate functions m_j, where 1_j = (0, …, 0, 1, 0, …, 0)^⊤ is the unit d-vector with '1' appearing at the jth position. Define the integral operators Π(·|H_j) : H → H_j by

    Π(m|H_j)(x) = 1_j · q_j(x_j)^{-1} ∫_{[0,1]^{d−1}} M_j(x)^⊤ m(x) dx_{−j}.    (3.2)

Then it holds that, for all g ∈ H_j and for all m ∈ H,

    ⟨g, m − Π(m|H_j)⟩ = 0,

so that Π(·|H_j) is a projection operator onto H_j in H.
Remark 2. In the above definition (3.2) of Π(·|H_j), it is worthwhile to note that the left-hand side of the definition actually depends only on x_j, the jth coordinate of x. The 'dummy' x_{−j} for integration on the right-hand side is not meant to be that part of x on the left-hand side. Nevertheless, here and below we use the same notation x on both sides for succinct expression.

Let R_j^d = {a · 1_j : a ∈ R} for 1 ≤ j ≤ d, and let L(R_k^d, R_j^d) denote the space of all bounded linear operators that map R_k^d to R_j^d. If we restrict the domain of Π(·|H_j) to H_k for k ≠ j, then we get that, for m_k = 1_k · m_k ∈ H_k,

    Π(m_k|H_j)(u) = ∫_{[0,1]^d} [ q_{jk}(u_j, v_k) / (q_j(u_j) q_k(v_k)) ] 1_j · 1_k^⊤ m_k(v) M_{kk}(v) dv,

where we used the fact that ∫ M_{kk}(v) dv_{−k} = q_k(v_k). Define κ_{jk} : [0,1]^d × [0,1]^d → L(R_k^d, R_j^d) by

    κ_{jk}(u, v)(w) = [ q_{jk}(u_j, v_k) / (q_j(u_j) q_k(v_k)) ] 1_j · 1_k^⊤ w,  w ∈ R_k^d.

Then we may write Π(m_k|H_j)(u) = ∫ κ_{jk}(u, v)(m_k(v)) M_{kk}(v) dv, so that κ_{jk} is the kernel of the integral operator Π(·|H_j) restricted to H_k. Let ‖g‖_{L(R_k^d, R_j^d)} for g ∈ L(R_k^d, R_j^d) denote the operator norm of g. According to the theory of integral operators (see Theorem 3.1 of Jeon and Park (2019+) for a general result), the operators Π(·|H_j) : H_k → H_j are compact if

    ∫_{[0,1]^d × [0,1]^d} ‖κ_{jk}(u, v)‖²_{L(R_k^d, R_j^d)} M_{jj}(u) M_{kk}(v) du dv < ∞,  1 ≤ j ≠ k ≤ d.    (3.3)

Clearly, ‖κ_{jk}(u, v)‖_{L(R_k^d, R_j^d)} = |q_{jk}(u_j, v_k) / (q_j(u_j) q_k(v_k))|. Thus, (3.3) holds for all 1 ≤ j ≠ k ≤ d if

    ∫_{[0,1]²} q_{jk}(u_j, v_k)² / (q_j(u_j) q_k(v_k)) du_j dv_k < ∞,  1 ≤ j ≠ k ≤ d.    (3.4)
The property (3.4) holds due to (3.1). According to Proposition A.4.2 of Bickel et al. (1993), this implies that H is closed. Then, by Corollary 4.3 of Xu and Zikatanov (2002), the linear operator T := (I − π_d) ∘ ⋯ ∘ (I − π_1), where we write π_j := Π(·|H_j) for simplicity, is a contraction, i.e.,

    ‖T‖_op := sup{ ‖Tg‖ : g ∈ H with ‖g‖ ≤ 1 } < 1.    (3.5)

To prove that the solution of the SBF equation (2.10) exists and is unique, let us consider the empirical versions of the operators π_j = Π(·|H_j). Define

    π̂_j(m)(x) = 1_j · q̂_j^*(x_j)^{-1} ∫_{[0,1]^{d−1}} M̂_j^*(x)^⊤ m(x) dx_{−j},  1 ≤ j ≤ d.    (3.6)

We claim that the π̂_j : H → H_j are well defined. To see this, define ‖m‖_* by ‖m‖_*² = ∫ m(x)^⊤ m(x) dx = Σ_{j=1}^d ∫ m_j(x_j)² dx_j. Then there exist absolute constants 0 < c_1 < c_2 < ∞ such that

    c_1 ‖m‖ ≤ ‖m‖_* ≤ c_2 ‖m‖    (3.7)

for all m ∈ H. The inequalities follow from the assumptions (A1) and (A2) in the Appendix. Now, using Lemma 3, (3.1) and the fact that

    ‖1_j · g_j‖² = ∫_0^1 g_j(x_j)² q_j(x_j) dx_j,    (3.8)

we may prove that there exists an absolute constant 0 < c_3 < ∞ such that

    P( ‖π̂_j(m)‖² ≤ c_3 ‖m‖_*² for all m ∈ H ) → 1.

This with (3.7) establishes that the π̂_j are well defined on an event with probability tending to one. We stress, however, that the π̂_j are not projection operators in H with the inner product ⟨·,·⟩, while the π_j are.

As (2.5) is equivalent to (2.4), the system of equations (2.10) is equivalent to

    ∫_{[0,1]^{d−1}} [ f̃_j^*(x_j) q̂_j^*(x_j) − M̂_j^*(x)^⊤ f̂^*(x) ] dx_{−j} = 0,  1 ≤ j ≤ d,    (3.9)

where the M̂_j^* are d-vectors defined by M̂^* = (M̂_1^*, …, M̂_d^*). Define f̂_j^* = 1_j · f̂_j^* and f̃_j^* = 1_j · f̃_j^*. Then f̂^* = Σ_{j=1}^d f̂_j^* and f̃^* = Σ_{j=1}^d f̃_j^*. Also, we may express the SBF equation at (3.9) as

    f̂_j^* = f̃_j^* − π̂_j( Σ_{k=1}^{j−1} f̂_k^* + Σ_{k=j+1}^d f̂_k^* ),  1 ≤ j ≤ d.

This is equivalent to the system of equations

    f̂_j^* = f̃_j^* + (I − π̂_j) f̂^*,  1 ≤ j ≤ d,    (3.10)

where we used (I − π̂_j) f̂_j^* = 0. Now define π̂_j^⊥ = I − π̂_j, the orthogonal complement of π̂_j. We apply (3.10) successively from j = d to j = 1. Then we get

    f̂^* = f̃_⊕^* + T̂ f̂^*,    (3.11)

where T̂ = π̂_d^⊥ ∘ ⋯ ∘ π̂_1^⊥ and f̃_⊕^* = f̃_d^* + π̂_d^⊥ f̃_{d−1}^* + (π̂_d^⊥ ∘ π̂_{d−1}^⊥) f̃_{d−2}^* + ⋯ + (π̂_d^⊥ ∘ ⋯ ∘ π̂_2^⊥) f̃_1^*. In the Appendix we prove that there exists an absolute constant 0 < γ < 1 such that

    ‖T̂‖_op := sup{ ‖T̂g‖ : g ∈ H with ‖g‖ ≤ 1 } < γ    (3.12)

with probability tending to one. This gives the following theorem.
Theorem 1. Under the assumptions (A1)–(A7), the solution of the SBF equation (2.10) exists and is unique with probability tending to one.
3.3 Convergence of the SBF algorithm
Define f̂^{*[r]} = (f̂_1^{*[r]}, …, f̂_d^{*[r]})^⊤ and f̂_j^{*[r]} = 1_j · f̂_j^{*[r]}. Then f̂^{*[r]} = Σ_{j=1}^d f̂_j^{*[r]}. Along the lines of the arguments that lead to (3.11), we may also prove that the SBF algorithm (2.11) is equivalent to f̂^{*[r]} = f̃_⊕^* + T̂ f̂^{*[r−1]}. To see this, we first note that the SBF algorithm at (2.11) may be expressed as

    f̂_j^{*[r]} = f̃_j^* − π̂_j( Σ_{k=1}^{j−1} f̂_k^{*[r]} + Σ_{k=j+1}^d f̂_k^{*[r−1]} ),  1 ≤ j ≤ d.

This is equivalent to the system

    Σ_{k=1}^{j} f̂_k^{*[r]} + Σ_{k=j+1}^d f̂_k^{*[r−1]} = f̃_j^* + (I − π̂_j)( Σ_{k=1}^{j−1} f̂_k^{*[r]} + Σ_{k=j}^d f̂_k^{*[r−1]} ),  1 ≤ j ≤ d.    (3.13)

Starting with f̂^{*[r]} = Σ_{k=1}^d f̂_k^{*[r]}, the left-hand side of (3.13) for j = d, we apply (3.13) successively for j = d down to j = 1, changing the rth updates to the (r − 1)th updates one by one. Then we get

    f̂^{*[r]} = f̃_d^* + π̂_d^⊥( Σ_{k=1}^{d−1} f̂_k^{*[r]} + f̂_d^{*[r−1]} )
             = f̃_d^* + π̂_d^⊥ f̃_{d−1}^* + (π̂_d^⊥ ∘ π̂_{d−1}^⊥)( Σ_{k=1}^{d−2} f̂_k^{*[r]} + Σ_{k=d−1}^d f̂_k^{*[r−1]} )
             = ⋯
             = f̃_⊕^* + T̂( Σ_{k=1}^d f̂_k^{*[r−1]} ) = f̃_⊕^* + T̂ f̂^{*[r−1]}.    (3.14)

Now, applying (3.14) repeatedly, we obtain f̂^{*[r]} = Σ_{s=0}^{r−1} T̂^s f̃_⊕^* + T̂^r f̂^{*[0]} for all r ≥ 1. We also have f̂^* = Σ_{s=0}^∞ T̂^s f̃_⊕^* from (3.11). These imply

    ‖f̂^{*[r]} − f̂^*‖ ≤ Σ_{s=r}^∞ ‖T̂‖_op^s · ‖f̃_⊕^*‖ + ‖T̂‖_op^r · ‖f̂^{*[0]}‖
                     ≤ ‖T̂‖_op^r · ( ‖f̃_⊕^*‖ / (1 − ‖T̂‖_op) + ‖f̂^{*[0]}‖ ),  r ≥ 1.

Using (3.12) we now get the following theorem.

Theorem 2. Under the assumptions (A1)–(A7) in the Appendix, there exists an absolute constant 0 < γ < 1 such that

    ‖f̂^{*[r]} − f̂^*‖ ≤ γ^r · ( ‖f̃_⊕^*‖ / (1 − γ) + ‖f̂^{*[0]}‖ ),  r ≥ 1,

with probability tending to one.

4 Numerical evidence
We demonstrate the numerical performance of the proposed SBF estimators f̂_j^*. Below, we first describe an efficient way of data-driven bandwidth selection for h = (h_1, …, h_d) that we used in the simulation study.

4.1 Bandwidth selection
To choose the bandwidths h_j in K_{h_j}^*, we employed a shrinkage bandwidth selection strategy, introduced by Han et al. (2018) and Han and Park (2018), based on 5-fold cross-validation (CV). To get into some details, we denote by f̂_j^*(·; h, X) the component function estimates based on a sample X and a bandwidth vector h. Let X_n = {(Y_i, X_i, Z_i) : 1 ≤ i ≤ n} and let (I_k : 1 ≤ k ≤ 5) be a partition of the index set {1, …, n} such that ∪_{ℓ=1}^5 I_ℓ = {1, …, n} and I_k ∩ I_{k'} = ∅ for k ≠ k'. Also, let X_{n,(−k)} = {(Y_i, X_i, Z_i) : i ∉ I_k}. We chose a baseline bandwidth vector h_k = (h_{k,1}, …, h_{k,d}) for each of the subsamples X_{n,(−k)}, which we describe below. We computed the prediction error

    CV_k(α h_k, X_n) = Σ_{i∈I_k} [ Y_i − Σ_{j=1}^d f̂_j^*( X_{ij}; α h_k, X_{n,(−k)} ) Z_{ij} ]²,  α > 0,    (4.1)

for each 1 ≤ k ≤ 5. Here, α is a positive shrinkage factor commonly applied to all components. Then we found the optimal shrinkage factor α̂ such that

    α̂ = arg min_{α>0} Σ_{k=1}^5 CV_k(α h_k, X_n).    (4.2)

Our choice of the bandwidth vector was then ĥ = α̂ h_full, where h_full = (h_{full,1}, …, h_{full,d}) is the baseline bandwidth vector defined similarly to h_k but based on the entire sample X_n.

We chose the baseline bandwidth vectors h_k in (4.1) and (4.2) by calibrating optimal bandwidths for marginal varying coefficient regression. To be specific, let ĝ_j(·; b, X) denote the marginal estimator with (X_j, Z_j) being the only predictor set, based on a sample X and a bandwidth b. For example,

    ĝ_j(x_j; b_j, X_n) = [ Σ_{i=1}^n Z_{ij}² K_{b_j}(x_j, X_{ij}) ]^{-1} Σ_{i=1}^n Z_{ij} Y_i K_{b_j}(x_j, X_{ij}),

where K_{b_j}(x, u) is the normalized kernel function with bandwidth b_j as defined at (2.1). We used h_{k,j} = b_{k,j}^{5/12}, where

    b_{k,j} = arg min_{b_j>0} Σ_{i∈I_k} [ Y_i − ĝ_j( X_{ij}; b_j, X_{n,(−k)} ) Z_{ij} ]².

The target of the marginal varying coefficient estimator ĝ_j is g_j = E(Z_j Y | X_j = ·) / E(Z_j² | X_j = ·). The optimal size of b_j in this estimation problem equals n^{−1/5}. It turns out that the optimal sizes of h_j for the component estimators f̂_j^* equal n^{−1/(5+2β)} when β < 1/2 and n^{−1/(4+4β)} when β > 1/2, where β is the constant in the condition (A5). We may show this along the lines of the technical arguments developed in Han and Park (2018) for the case of additive regression; see Section 5 for further discussion. In our simulation setting β equals 2, which justifies the exponent 5/(4 + 4β) = 5/12 in h_{k,j} = b_{k,j}^{5/12}.
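The shrinkage step (4.2) is only a one-dimensional search, which is what keeps the d-dimensional bandwidth choice tractable: the relative sizes come from the marginal bandwidths b_{k,j} through h_{k,j} = b_{k,j}^{5/12}, and only the common factor α is tuned by CV. A minimal sketch of this logic follows; the baseline bandwidths and the fold CV curves are synthetic placeholders standing in for (4.1), since evaluating the full SBF estimator is beyond a few lines:

```python
import numpy as np

# Baseline marginal bandwidths b_{k,j} for 5 folds and d = 3 components
# (illustrative values standing in for the marginal CV minimizers).
b = np.array([[0.20, 0.25, 0.22],
              [0.21, 0.24, 0.23],
              [0.19, 0.26, 0.21],
              [0.22, 0.23, 0.22],
              [0.20, 0.25, 0.24]])
h_baseline = b ** (5.0 / 12.0)        # rate calibration h_{k,j} = b_{k,j}^{5/12}

def cv_fold(k, h_k):
    # Placeholder for CV_k(alpha * h_k, X_n) in (4.1): a smooth curve with a
    # common minimizer, standing in for the real prediction error.
    return np.sum((np.log(h_k) - np.log(0.8 * h_baseline[k])) ** 2) + 0.1 * k

alphas = np.linspace(0.2, 2.0, 181)   # grid for the common shrinkage factor
scores = [sum(cv_fold(k, a * h_baseline[k]) for k in range(5)) for a in alphas]
alpha_hat = alphas[int(np.argmin(scores))]
print(alpha_hat)                       # ~ 0.8, the common minimizer of the folds

h_full = h_baseline.mean(axis=0)       # stand-in for the full-sample baseline
h_hat = alpha_hat * h_full             # final bandwidths, as in Section 4.1
```

In practice the grid search over α costs only a handful of SBF fits per fold, instead of a d-dimensional bandwidth search.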
4.2 Simulation study

We employed the double gamma difference distribution (Augustyniak and Doray, 2012; Klar, 2015) for the distributions of the measurement errors U_j. A double gamma difference (DGD) random variable U with shape and rate parameters λ and θ, respectively, has the representation U = U_1 − U_2 in distribution, for independent U_1, U_2 ∼ Γ(λ, θ). The characteristic function of the DGD random variable U is given by φ_U(t) = (1 + t²/θ²)^{−λ}, so that it satisfies the condition (A5) with β = 2λ.
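The DGD specification is convenient in simulations: draws are just differences of two gamma variates, Var(U) = 2λ/θ², and the stated characteristic function can be verified directly for λ = 1, where the density is (θ/2)e^{−θ|u|}. A small sketch of ours (parameter values are arbitrary):

```python
import numpy as np

def sample_dgd(lam, theta, size, rng):
    # DGD draw: U = U1 - U2 with U1, U2 ~ Gamma(shape=lam, rate=theta).
    return rng.gamma(lam, 1.0 / theta, size) - rng.gamma(lam, 1.0 / theta, size)

lam, theta = 1.0, 2.0
rng = np.random.default_rng(1)
u = sample_dgd(lam, theta, 200000, rng)
print(u.var(), 2 * lam / theta**2)      # Var(U) = 2 lam / theta^2

# Check phi_U(t) = (1 + t^2/theta^2)^(-lam) against quadrature for lam = 1,
# using the Laplace density theta/2 * exp(-theta |u|):
t = 1.3
ug = np.linspace(-40.0 / theta, 40.0 / theta, 400001)
phi_quad = np.trapz(np.cos(t * ug) * 0.5 * theta * np.exp(-theta * np.abs(ug)), ug)
phi_formula = (1 + t**2 / theta**2) ** (-lam)
print(phi_quad, phi_formula)            # agree closely
```

The polynomial tail of φ_U is what makes β = 2λ in condition (A5); normal measurement errors, by contrast, have an exponentially decaying characteristic function and fall outside this ordinary-smooth class.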
We took λ = 1, which corresponds to a Laplace distribution with a scale parameter θ (Kotz et al., 2012). For the measurement errors V_j, we took normal distributions N(0, τ²).

We considered two scenarios, one for d = 3 and the other for d = 6. In the first case, we took the error e in the model (1.1) such that its conditional distribution given X^o = x and Z^o = z was N(0, σ²(x, z)). Here,

    σ(x, z) = (1/2) · (z_2² + z_3²) / (1 + z_2² + z_3²) + exp( −2 + (x_1 + x_2)/2 ),

as in Yang et al. (2006) and Lee et al. (2012a). We generated W = (W_1, W_2, W_3) from the multivariate normal distribution with mean zero and Cov(W_j, W_k) = 0.7^{|j−k|}. We took X_j^o = Φ(W_j), 1 ≤ j ≤ 3, where Φ is the cumulative distribution function of N(0, 1). In this way we allowed dependence between the X_j^o. We took Z_1^o ≡ 1 and generated the other two regression predictors (Z_2^o, Z_3^o) according to a bivariate normal distribution with mean zero and Cov(Z_j^o, Z_k^o) = 0.5^{|j−k|} for 2 ≤ j, k ≤ 3. We set the coefficient functions as f_1(x_1) = 1 + e^{2x_1−1}, f_2(x_2) = cos(2πx_2) and f_3(x_3) = x_3².

For the high-dimensional case with d = 6, we chose (X_1^o, X_2^o, X_3^o) and (Z_1^o, Z_2^o, Z_3^o) to be the same as in the low-dimensional case. We generated the additional predictors X_j^o for 4 ≤ j ≤ 6 from independent uniform distributions on [0, 1], and Z_j^o for 4 ≤ j ≤ 6 from independent N(0, 1) distributions. For the additional coefficient functions, we took f_4(x_4) = pdf_{Γ(4,5)}(x_4), f_5(x_5) = arctan(2π(x_5 − 0.5)) and f_6(x_6) = x_6 − 0.5. Here, pdf_{Γ(4,5)} is the probability density function of a gamma random variable with shape and rate parameters 4 and 5, respectively. For simplicity, we set the variance function σ(x, z) not to depend on (x_j, z_j) for 4 ≤ j ≤ 6.

We considered two levels of noise-to-signal ratio (NSR) for the measurement errors. Specifically, we chose the values of θ for the distribution of U_j and τ² for the distribution of V_j so that Var(U_j)/Var(X_j^o) = Var(V_j)/Var(Z_j^o) = 0.1 and 0.2.

To evaluate the estimators f̂_j^*, we used the Epanechnikov kernel as the baseline kernel function K and the data-driven bandwidth selectors ĥ = (ĥ_1, …, ĥ_d) described in Section 4.1. We applied the iterative algorithm (2.11). The convergence criterion we chose was

    Σ_{j=1}^d ∫ ( f̂_j^{*[r]}(x_j) − f̂_j^{*[r−1]}(x_j) )² dx_j ≤ 10^{−12}.

We found that the algorithm converged within r = 15 iterations in all simulation scenarios. This suggests that the smooth backfitting algorithm (2.11) works very well in this estimation problem.
We compared our deconvolution SBF estimators \hat f_j^⋆, defined at (2.10), with the naive SBF estimators. The naive SBF estimators are the same as \hat f_j^o, defined at (2.3), but with X_{ij}^o and Z_{ij}^o replaced by the contaminated X_{ij} and Z_{ij}, respectively. Thus, they simply ignore the measurement errors and follow the SBF procedure of Lee et al. (2012a) with the contaminated predictor values. For these naive SBF estimators we also employed the shrinkage bandwidth selection strategy described above, but with h_k and h_full replaced by b_k and b_full, respectively, and we used the Epanechnikov kernel and the 5-fold CV criterion as in the case of our deconvolution SBF. We found that the shrinkage bandwidth selection strategy worked quite well for both the deconvolution and the naive SBF.

We examined the numerical performance of each component function estimator in terms of the mean integrated squared error (MISE). We computed

    MISE(\hat f_j) ≈ B^{−1} \sum_{b=1}^{B} \int_0^1 ( \hat f_j^{(b)}(x_j) − f_j(x_j) )^2 dx_j.

Here, \hat f_j^{(b)}, 1 ≤ j ≤ d, are either the deconvolution SBF (D-SBF) estimators or the naive SBF (N-SBF) estimators of f_j based on the b-th Monte Carlo pseudo sample X_n^{(b)} = {(Y_i^{(b)}, X_i^{(b)}, Z_i^{(b)}) : 1 ≤ i ≤ n}. Tables 1–4 show the approximated values of MISE, where we also report the approximated values of the integrated squared bias (ISB) and the integrated variance (IV),

    ISB(\hat f_j) ≈ \int_0^1 ( \bar f_j(x_j) − f_j(x_j) )^2 dx_j,
    IV(\hat f_j) ≈ B^{−1} \sum_{b=1}^{B} \int_0^1 ( \hat f_j^{(b)}(x_j) − \bar f_j(x_j) )^2 dx_j,

where \bar f_j(x_j) = B^{−1} \sum_{b=1}^{B} \hat f_j^{(b)}(x_j) for j = 1, …, d.
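The three criteria above can be approximated on a grid as follows (a sketch with names of our choosing; by construction the identity MISE = ISB + IV holds up to numerical integration error):

```python
import numpy as np

def mise_isb_iv(f_hat, f_true, grid):
    """Approximate MISE, ISB and IV of Monte Carlo curve estimates.

    f_hat: array of shape (B, m) with B estimated curves on the grid;
    f_true: array of shape (m,) with the true curve on the same grid.
    Integrals use the trapezoidal rule."""
    f_hat = np.asarray(f_hat, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    f_bar = f_hat.mean(axis=0)                      # pointwise Monte Carlo mean
    w = np.diff(np.asarray(grid, dtype=float))
    trapz = lambda y: float(np.sum((y[1:] + y[:-1]) * w) / 2.0)
    mise = float(np.mean([trapz((fb - f_true) ** 2) for fb in f_hat]))
    isb = trapz((f_bar - f_true) ** 2)              # squared bias, integrated
    iv = float(np.mean([trapz((fb - f_bar) ** 2) for fb in f_hat]))
    return mise, isb, iv
```

For instance, two Monte Carlo curves equal to +1 and −1 around a true curve of 0 give ISB = 0 and IV = MISE = 1.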
We find that the overall MISE performance of D-SBF is better than that of N-SBF. For some components in Tables 1, 2 and 4, the MISE value of the D-SBF component estimator is larger than that of the corresponding N-SBF component. However, summing the MISE values over all components always gives a smaller value for the D-SBF estimators than for the N-SBF estimators. Also, in all simulation settings the MISE values of the D-SBF estimators decrease faster than those of the N-SBF estimators as the sample size n increases. In particular, the D-SBF estimators have smaller ISB values than the N-SBF estimators, which owes to the unbiased scoring property at (2.7). This indicates that our proposed deconvolution procedure successfully corrects for the estimation bias due to measurement errors. It is worth mentioning that N-SBF generally achieves lower estimation variance than D-SBF, but the gap narrows as the sample size increases. Figures 1–3 demonstrate the attenuation-to-null phenomenon for the N-SBF estimators, while our proposal recovers the true coefficient functions quite well in both the low- and high-dimensional cases. In the Supplement, we also report the simulation results when the measurement errors in X_{ij} were generated from DGD with β = 0.4. They gave the same lessons as those from Tables 1–4 and Figures 1–3.
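The attenuation-to-null effect seen for N-SBF is the same mechanism as in the simplest linear errors-in-variables model, where the naive least-squares slope shrinks toward zero by the reliability ratio Var(X^o)/{Var(X^o) + Var(U)}. The following is a generic illustration of that mechanism, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)                      # true predictor X^o
y = 2.0 * x + 0.5 * rng.standard_normal(n)      # Y = 2 X^o + error
w = x + np.sqrt(0.5) * rng.standard_normal(n)   # contaminated X, Var(U)/Var(X^o) = 0.5

naive_slope = np.cov(w, y)[0, 1] / np.var(w)    # attenuated toward zero
reliability = 1.0 / (1.0 + 0.5)                 # Var(X^o) / {Var(X^o) + Var(U)}
corrected_slope = naive_slope / reliability     # classical bias correction
```

With this large n, naive_slope is close to 2 × 2/3 ≈ 1.33, while corrected_slope is close to the true slope 2.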
Table 1: The mean integrated squared error (MISE), the integrated squared bias (ISB) and the integrated variance (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 3 and the noise-to-signal ratio (NSR) equals 10%, based on 500 Monte Carlo samples.

                              Deconvolution SBF            Naive SBF
  Sample size  Coefficient  MISE     ISB      IV         MISE     ISB      IV
  n = 200      f1           0.0303   0.0126   0.0177     0.0579   0.0419   0.0160
               f2           0.0758   0.0268   0.0490     0.1192   0.0780   0.0412
               f3           0.0537   0.0016   0.0521     0.0442   0.0078   0.0364
               Sum          0.1598   0.0410   0.1188     0.2213   0.1277   0.0936
  n = 400      f1           0.0205   0.0115   0.0090     0.0490   0.0405   0.0085
               f2           0.0495   0.0268   0.0227     0.0956   0.0759   0.0197
               f3           0.0272   0.0020   0.0252     0.0249   0.0079   0.0170
               Sum          0.0972   0.0403   0.0569     0.1695   0.1243   0.0452
  n = 800      f1           0.0153   0.0110   0.0043     0.0438   0.0399   0.0039
               f2           0.0376   0.0273   0.0103     0.0851   0.0760   0.0091
               f3           0.0134   0.0018   0.0116     0.0159   0.0073   0.0086
               Sum          0.0663   0.0401   0.0262     0.1448   0.1232   0.0216
Table 2: The mean integrated squared error (MISE), the integrated squared bias (ISB) and the integrated variance (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 3 and the noise-to-signal ratio (NSR) equals 20%, based on 500 Monte Carlo samples.

                              Deconvolution SBF            Naive SBF
  Sample size  Coefficient  MISE     ISB      IV         MISE     ISB      IV
  n = 200      f1           0.0434   0.0108   0.0326     0.0906   0.0739   0.0167
               f2           0.1282   0.0178   0.1104     0.1741   0.1375   0.0366
               f3           0.1262   0.0012   0.1250     0.0492   0.0172   0.0320
               Sum          0.2978   0.0298   0.2680     0.3139   0.2286   0.0853
  n = 400      f1           0.0278   0.0111   0.0167     0.0851   0.0756   0.0095
               f2           0.0915   0.0234   0.0681     0.1569   0.1375   0.0194
               f3           0.0691   0.0023   0.0668     0.0371   0.0194   0.0177
               Sum          0.1884   0.0368   0.1516     0.2791   0.2325   0.0466
  n = 800      f1           0.0185   0.0106   0.0079     0.0787   0.0742   0.0045
               f2           0.0512   0.0255   0.0257     0.1472   0.1378   0.0094
               f3           0.0299   0.0018   0.0281     0.0263   0.0177   0.0086
               Sum          0.0996   0.0379   0.0617     0.2522   0.2297   0.0225
Table 3: The mean integrated squared error (MISE), the integrated squared bias (ISB) and the integrated variance (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6 and the noise-to-signal ratio (NSR) equals 10%, based on 500 Monte Carlo samples.

                              Deconvolution SBF            Naive SBF
  Sample size  Coefficient  MISE     ISB      IV         MISE     ISB      IV
  n = 200      f1           0.0443   0.0216   0.0227     0.0745   0.0476   0.0269
               f2           0.1063   0.0637   0.0426     0.1543   0.1101   0.0442
               f3           0.0472   0.0034   0.0438     0.0522   0.0093   0.0429
               f4           0.0502   0.0099   0.0403     0.0708   0.0306   0.0402
               f5           0.0580   0.0197   0.0383     0.0977   0.0587   0.0390
               f6           0.0463   0.0029   0.0434     0.0491   0.0071   0.0420
               Sum          0.3523   0.1212   0.2311     0.4986   0.2634   0.2352
  n = 400      f1           0.0311   0.0205   0.0106     0.0594   0.0477   0.0117
               f2           0.0794   0.0592   0.0202     0.1269   0.1070   0.0199
               f3           0.0226   0.0028   0.0198     0.0273   0.0079   0.0194
               f4           0.0255   0.0094   0.0161     0.0473   0.0302   0.0171
               f5           0.0350   0.0173   0.0177     0.0733   0.0561   0.0172
               f6           0.0205   0.0022   0.0183     0.0244   0.0058   0.0186
               Sum          0.2141   0.1114   0.1027     0.3586   0.2547   0.1039
Table 4: The mean integrated squared error (MISE), the integrated squared bias (ISB) and the integrated variance (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6 and the noise-to-signal ratio (NSR) equals 20%, based on 500 Monte Carlo samples.

                              Deconvolution SBF            Naive SBF
  Sample size  Coefficient  MISE     ISB      IV         MISE     ISB      IV
  n = 200      f1           0.0597   0.0207   0.0390     0.1124   0.0827   0.0297
               f2           0.1503   0.0593   0.0910     0.2180   0.1754   0.0426
               f3           0.0979   0.0032   0.0947     0.0603   0.0199   0.0404
               f4           0.0966   0.0070   0.0896     0.1001   0.0604   0.0397
               f5           0.1034   0.0156   0.0878     0.1599   0.1215   0.0384
               f6           0.1012   0.0016   0.0996     0.0532   0.0125   0.0407
               Sum          0.6091   0.1074   0.5017     0.7039   0.4724   0.2315
  n = 400      f1           0.0378   0.0198   0.0180     0.0968   0.0834   0.0134
               f2           0.0962   0.0537   0.0425     0.1875   0.1669   0.0206
               f3           0.0487   0.0036   0.0451     0.0392   0.0188   0.0204
               f4           0.0443   0.0088   0.0355     0.0786   0.0604   0.0182
               f5           0.0521   0.0139   0.0382     0.1331   0.1154   0.0177
               f6           0.0400   0.0020   0.0380     0.0298   0.0113   0.0185
               Sum          0.3191   0.1018   0.2173     0.5650   0.4562   0.1088
Figure 1: The average curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators (top), and their variance curves (bottom), when d = 3, n = 800 and the noise-to-signal ratio (NSR) equals 20%, based on 500 Monte Carlo samples.
Figure 2: The average curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6, n = 400 and the noise-to-signal ratio (NSR) equals 20%, based on 500 Monte Carlo samples.
Figure 3: The variance curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6, n = 400 and the noise-to-signal ratio (NSR) equals 20%, based on 500 Monte Carlo samples.
5  Concluding remarks
The methodology we introduce in this paper requires knowledge of the covariance matrix of V and of the densities of the U_j, which is often assumed in the literature. In case they are unknown, we may estimate them from repeated measurements Z_{iℓ} and X_{iℓ} for 1 ≤ ℓ ≤ r_i (r_i ≥ 2) such that X_{iℓ} = X_i^o + U_{iℓ} and Z_{iℓ} = Z_i^o + V_{iℓ}. An immediate idea is to use the differences of repeated measurements to cancel the true unobserved predictors. For the estimation of Σ_V, noting that Z_{iℓ} − \bar Z_i = V_{iℓ} − \bar V_i, we may take

    \hat Σ_V = [ \sum_{i=1}^{n} \sum_{ℓ=1}^{r_i} (Z_{iℓ} − \bar Z_i)(Z_{iℓ} − \bar Z_i)^⊤ ] / [ \sum_{i=1}^{n} (r_i − 1) ],

where \bar Z_i = r_i^{−1} \sum_{ℓ=1}^{r_i} Z_{iℓ}. For the characteristic function φ_{U_j} of the measurement error U_j, which we actually need in the construction of the ND kernel K_{h_j}^⋆ at (2.6), we may estimate it by

    \hat φ_{U_j}(t) = ( [ \sum_{i=1}^{n} \sum_{1 ≤ ℓ_1 < ℓ_2 ≤ r_i} 2 cos( t (X_{iℓ_1,j} − X_{iℓ_2,j}) ) ] / [ \sum_{i=1}^{n} r_i (r_i − 1) ] )^{1/2},

where X_{iℓ,j} denotes the jth entry of X_{iℓ}. For more details we refer to Delaigle et al. (2008).

We also assumed that the boundaries of the supports of the p_j are known, as is usually done in the literature for the case of no measurement error. The reason is that, if we observed X_{ij}^o, 1 ≤ i ≤ n, for each 1 ≤ j ≤ d, then we could simply choose, as the support of X_j^o, the smallest interval [a_j, b_j] that covers all X_{ij}^o. This is not possible, however, when we observe noisy X_{ij} with unobservable measurement errors U_{ij}. In the latter case, one option is to estimate the densities p_j of X_j^o by \hat p_j^⋆(x_j) = n^{−1} \sum_{i=1}^{n} K_{h_j}^⋆(x_j, X_{ij}) and then approximate the supports of the p_j by those of the \hat p_j^⋆.

We have not discussed the rates of convergence of the estimators of the coefficient functions. Along the lines of the asymptotic arguments in the proof of Theorem 3.4 in Han and Park (2018), we may obtain a version of that theorem which gives stochastic expansions of \hat f_j^⋆. From the stochastic expansions, we may be able to get the following results, where β ≥ 0 is the constant in assumption (A5): (i) when β < 1/2 and bandwidths h_j of order n^{−1/(5+2β)} are used, then sup_{x_j ∈ [2h_j, 1−2h_j]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−2/(5+2β)} \sqrt{log n} ) and sup_{x_j ∈ [0,1]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−1/(5+2β)} ); (ii) when β > 1/2 and h_j of order n^{−1/(4+4β)} are used, then sup_{x_j ∈ [2h_j, 1−2h_j]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−1/(2+2β)} \sqrt{log n} ) and sup_{x_j ∈ [0,1]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−1/(4+4β)} ); (iii) when β = 1/2 and h_j of order n^{−1/6} are used, then sup_{x_j ∈ [2h_j, 1−2h_j]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−1/3} \sqrt{log n} ) and sup_{x_j ∈ [0,1]} |\hat f_j^⋆(x_j) − f_j(x_j)| = O_p( n^{−1/6} ). For these results we need some further assumptions in addition to (A1)–(A7) in the Appendix: that E(|e|^α) < ∞ for some α > 5/2, that the functions E(e^2 | X_j = ·), 1 ≤ j ≤ d, are continuous on [0, 1], and that the coefficient functions f_j are twice differentiable on [0, 1]. The results demonstrate that the rates do not depend on the dimension d, so our method does not suffer from the curse of dimensionality. We recall that the rate n^{−2/(5+2β)} is the optimal univariate rate (in the L_2 sense); see Fan and Truong (1993), for example. By contrast, full-dimensional smoothing with d-variate deconvolution gives the rate n^{−2/(4+d(2β+1))} in the interior and n^{−1/(4+d(2β+1))} on the boundary; see Fan and Masry (1992).

We may also extend the main idea of the proposed method to more general varying coefficient models. For example, we may allow some of the X_j^o to be multivariate, or let some of the Z_j^o enter the arguments of other coefficient functions f_k for k ≠ j. The first case can be handled in a straightforward manner, but the second requires more elaboration. In fact, the latter case was fully addressed in Lee et al. (2012b) in the case of no measurement error. We believe it is possible to extend the approach of Lee et al. (2012b) to the case of measurement errors, and we think this extension would be a challenging topic for future study.
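The two replicate-based estimators of Σ_V and φ_{U_j} described above can be implemented directly. The following is a sketch: the function names are ours, and the floor at zero inside the square root is an implementation choice of this sketch, guarding against negative empirical values of the cosine average.

```python
import numpy as np

def estimate_sigma_v(Z_reps):
    """Sigma_V-hat from replicates.  Z_reps is a list of (r_i, q) arrays,
    each holding r_i >= 2 replicated measurements of Z_i.  Implements
    sum_i sum_l (Z_il - Zbar_i)(Z_il - Zbar_i)^T / sum_i (r_i - 1)."""
    num, denom = 0.0, 0
    for Zi in Z_reps:
        D = Zi - Zi.mean(axis=0)          # centre within subject
        num = num + D.T @ D
        denom += Zi.shape[0] - 1
    return num / denom

def estimate_phi_u(t, X_reps):
    """phi_Uj-hat(t) from replicates of one contaminated covariate.
    X_reps is a list of 1-d arrays (the r_i replicates of X_ij).  Implements
    ( sum_i sum_{l1<l2} 2 cos(t (X_il1 - X_il2)) / sum_i r_i (r_i - 1) )^{1/2}."""
    num, denom = 0.0, 0
    for Xi in X_reps:
        diffs = Xi[:, None] - Xi[None, :]
        # full cosine matrix minus its diagonal = sum over ordered pairs l1 != l2,
        # which equals 2 * sum over l1 < l2 since cos is even
        num += np.sum(np.cos(t * diffs)) - len(Xi)
        denom += len(Xi) * (len(Xi) - 1)
    return np.sqrt(max(num / denom, 0.0))
```

At t = 0 the estimator returns 1, as any characteristic function must.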
Acknowledgement
Research of Young K. Lee was supported by the 2017 Research Grant from Kangwon National University (520170506) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2018R1A2B6001068). Research of Kyunghee Han was supported by the Fostering Next Generation Researchers Program in Basic Science and Engineering through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1A6A3A03031351). Research of Byeong U. Park was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2019R1A2C3007355).
Appendix

A.1  Assumptions

Recall the definition of the matrix-valued function M on [0,1]^d in Section 2.1. We make the following assumptions.
(A1) Each entry of the matrix function E(Z^o Z^{o⊤} | X^o = ·) is continuous on [0,1]^d. The smallest and largest eigenvalues of E(Z^o Z^{o⊤} | X^o = ·) are bounded away from zero and infinity, respectively, on [0,1]^d.

(A2) The joint density p is continuous and bounded away from zero and infinity on [0,1]^d.

(A3) The baseline kernel K is symmetric, nonnegative, bounded, Lipschitz continuous, supported on a compact set, say [−1,1], and satisfies ∫_{−1}^{1} K(u) du = 1.

(A4) For some ℓ > 2, the true predictors Z_j^o and the measurement errors V_j have finite (2ℓ)th moments. The conditional moments E[(Z_j^o)^4 | X_j^o = x_j] and E[(Z_j^o)^2 | X_j^o = x_j] are uniformly bounded for x_j ∈ [0,1].

(A5) The characteristic functions of the U_j satisfy c_1 (1 + |t|)^{−β} ≤ |φ_{U_j}(t)| ≤ c_2 (1 + |t|)^{−β} for some constants β ≥ 0 and 0 < c_1 < c_2 < ∞. They also satisfy |t^{β+1} φ'_{U_j}(t)| = O(1) as |t| → ∞, where φ'_{U_j} denotes the first derivative of φ_{U_j}.

(A6) The baseline kernel K is ⌊γ+1⌋-times continuously differentiable for some γ ≥ β, and K^{(j)}(−1) = K^{(j)}(1) = 0 for 0 ≤ j ≤ ⌊γ⌋, where β is the constant in (A5), ⌊γ⌋ denotes the largest integer less than or equal to γ, and K^{(j)} is the jth derivative of K. The constant γ also satisfies ∫ |t^γ φ_K(t)| dt < ∞.

(A7) There exists a constant α > 0 such that the bandwidths satisfy h_j → 0 for all 1 ≤ j ≤ d and n^{1−(ℓ/2)+α} (h_j h_k)^{−ℓ(1+2β)/2} → 0 as n → ∞ for all 1 ≤ j ≠ k ≤ d, where ℓ is the constant in assumption (A4) and β is the constant in (A5).
A.2  Proof of (3.12)

Recall that T = π_d^⊥ ∘ ⋯ ∘ π_1^⊥, as a map from H to itself, is a contraction; see (3.5). Thus, the claim follows if we prove that

    ‖\hat π_j − π_j‖_{op} = o_p(1).   (A.1)
To prove (A.1), we first note that (\hat π_j − π_j) m_j = 0 for m_j = 1_j · m_j. This entails that, for m = (m_1, …, m_d)^⊤ = \sum_{j=1}^{d} m_j,

    ((\hat π_j − π_j) m)(x) = 1_j · \sum_{k=1, k≠j}^{d} ∫_0^1 ( \hat q_{jk}^⋆(x_j, x_k)/\hat q_j^⋆(x_j) − q_{jk}(x_j, x_k)/q_j(x_j) ) m_k(x_k) dx_k,   1 ≤ j ≤ d.

From this expression and using (3.7)–(3.8) with the Hölder inequality, we get that there exists a constant 0 < C < ∞ such that

    ‖(\hat π_j − π_j) m‖
      = ‖ 1_j · \sum_{k=1, k≠j}^{d} ∫_0^1 ( \hat q_{jk}^⋆(·, x_k)/\hat q_j^⋆(·) − q_{jk}(·, x_k)/q_j(·) ) m_k(x_k) dx_k ‖
      ≤ \sum_{k=1, k≠j}^{d} [ ∫_0^1 { ∫_0^1 ( \hat q_{jk}^⋆(x_j, x_k)/(\hat q_j^⋆(x_j) q_k(x_k)) − q_{jk}(x_j, x_k)/(q_j(x_j) q_k(x_k)) ) m_k(x_k) q_k(x_k) dx_k }^2 q_j(x_j) dx_j ]^{1/2}
      ≤ \sum_{k=1, k≠j}^{d} ‖1_k · m_k‖ ( ∫_{[0,1]^2} [ \hat q_{jk}^⋆(x_j, x_k)/(\hat q_j^⋆(x_j) q_k(x_k)) − q_{jk}(x_j, x_k)/(q_j(x_j) q_k(x_k)) ]^2 q_j(x_j) q_k(x_k) dx_j dx_k )^{1/2}
      ≤ C ‖m‖_* \sum_{k=1, k≠j}^{d} ( ∫_{[0,1]^2} [ \hat q_{jk}^⋆(x_j, x_k)/(\hat q_j^⋆(x_j) q_k(x_k)) − q_{jk}(x_j, x_k)/(q_j(x_j) q_k(x_k)) ]^2 q_j(x_j) q_k(x_k) dx_j dx_k )^{1/2}

for all m ∈ H. Application of (3.7) then gives

    ‖\hat π_j − π_j‖_{op} ≤ c_2 C \sum_{k=1, k≠j}^{d} ( ∫_{[0,1]^2} [ \hat q_{jk}^⋆(x_j, x_k)/(\hat q_j^⋆(x_j) q_k(x_k)) − q_{jk}(x_j, x_k)/(q_j(x_j) q_k(x_k)) ]^2 q_j(x_j) q_k(x_k) dx_j dx_k )^{1/2}.

The right-hand side of the above inequality converges to zero in probability by Lemmas 1–3 in the next subsection.

A.3  Lemmas and their proofs

For simplicity of presentation, assume that the h_j are all of the same magnitude as h for 1 ≤ j ≤ d. Let I_j = [2h_j, 1 − 2h_j] and I_{jk} = I_j × I_k.
Lemma 1. Assume that (A2) and (A4)–(A7) hold. Then

    sup_{u ∈ [0,1]} | \hat q_j^⋆(u) − E \hat q_j^⋆(u) | = O_p( { log n / (n h^{1+2β}) }^{1/2} ),   1 ≤ j ≤ d,
    sup_{(u,v) ∈ [0,1]^2} | \hat q_{jk}^⋆(u,v) − E \hat q_{jk}^⋆(u,v) | = O_p( { log n / (n h^{2+4β}) }^{1/2} ),   1 ≤ j ≠ k ≤ d.
Proof. We give only the proof of the second part; the proof of the first part is similar. In the proof, C > 0 denotes a generic absolute constant, which may differ from place to place. We start by observing that

    K_{h_j}^⋆(x, u) = (2π)^{−1} ∫_{−∞}^{∞} e^{−itu} φ_K(−h_j t; x) φ_K(−h_j t) / φ_{U_j}(−t) dt,

where φ_K(t; x) = ∫_0^1 e^{it(x−u)/h_j} K_{h_j}(x, u) du and i denotes the imaginary unit. The Lipschitz continuity of K and the fact that |e^{itu}| ≤ 1 give

    h_j^{1+β} |K_{h_j}^⋆(x, u)| ≤ C h_j (h_j + |x − u|)^{−1},
    | e^{−itx} φ_K(−h_j t; x) − e^{−itx'} φ_K(−h_j t; x') | ≤ C h_j^{−2} |x − x'|.   (A.2)
Fix the indices j ≠ k and let

    κ_i(u,v) = K_{h_j}^⋆(u, X_{ij}) K_{h_k}^⋆(v, X_{ik}),   ε_i = Z_{ij} Z_{ik} − Cov(V_j, V_k),

ignoring their dependence on j and k. The inequalities at (A.2) imply

    sup{ |κ_i(u,v) − κ_i(u',v')| : |u − u'| ≤ δ, |v − v'| ≤ δ } ≤ C δ h^{−4−2β}   for all δ > 0.   (A.3)

Define ξ_i by

    ξ_i(u,v) = κ_i(u,v) ε_i I(|ε_i| ≤ n^{1/2−α} h^{1+2β}) − E[ κ_i(u,v) ε_i I(|ε_i| ≤ n^{1/2−α} h^{1+2β}) ].

Under the moment conditions in (A4), we get that

    P( |ε_i| ≤ n^{1/2−α} h^{1+2β} for all 1 ≤ i ≤ n ) → 1   (A.4)

if n^{1−ℓ/2+ℓα} h^{−ℓ(1+2β)} → 0 as n → ∞. We also find that

    sup{ | E[ κ_i(u,v) ε_i I(|ε_i| > n^{1/2−α} h^{1+2β}) ] | : (u,v) ∈ [0,1]^2 } = o(n^{−1/2} h^{−1−2β}).   (A.5)
The results (A.4) and (A.5) imply

    \hat q_{jk}^⋆(u,v) − E \hat q_{jk}^⋆(u,v) = n^{−1} \sum_{i=1}^{n} ξ_i(u,v) + o_p(n^{−1/2} h^{−1−2β})   (A.6)

uniformly for (u,v) ∈ [0,1]^2. For a positive number γ, let D(γ) ≡ D_n(γ) be a discrete subset of [0,1]^2 such that for any (u,v) ∈ [0,1]^2 there exists (ũ, ṽ) ∈ D(γ) with |u − ũ| ≤ n^{−γ} and |v − ṽ| ≤ n^{−γ}. Then, due to (A.3) and taking γ sufficiently large so that n^{−γ} h^{−4−2β} = o(n^{−1/2} h^{−1−2β}), we obtain

    sup_{(u,v) ∈ [0,1]^2} | n^{−1} \sum_{i=1}^{n} ξ_i(u,v) | ≤ sup_{(u,v) ∈ D(γ)} | n^{−1} \sum_{i=1}^{n} ξ_i(u,v) | + o_p(n^{−1/2} h^{−1−2β}).   (A.7)

The results (A.6) and (A.7) give the second part of the lemma if we prove

    sup_{(u,v) ∈ D(γ)} | n^{−1} \sum_{i=1}^{n} ξ_i(u,v) | = O_p( { log n / (n h^{2+4β}) }^{1/2} ).   (A.8)

To prove (A.8), we note that E ξ_i(u,v) = 0, |ξ_i(u,v)| ≤ C n^{1/2−α} h^{−1} and E |ξ_i(u,v)|^2 ≤ C h^{−2−4β} uniformly for (u,v) ∈ [0,1]^2. The latter two follow since h^{2+2β} |κ_i(u,v)| and E(|ε_i|^2 | X_{ij} = u, X_{ik} = v) are bounded, which are consequences of the assumption (A5) and the first inequality at (A.2). These give

    sup_{(u,v) ∈ [0,1]^2} P( | \sum_{i=1}^{n} ξ_i(u,v) | > A { n log n / h^{2+4β} }^{1/2} )
      ≤ n^{−A} sup_{(u,v) ∈ [0,1]^2} [ 1 + { h^{2+4β} log n / (2n) } · E |ξ_1(u,v)|^2 · exp( C n^{−α} h^{2β} \sqrt{log n} ) ]^n
      ≤ n^{−A} ( 1 + C log n / n )^n ≤ n^{−A+C}.

Taking A large enough so that −A + C + 2γ < 0 proves (A.8) and completes the proof of the second part of the lemma.
Lemma 2. Assume that (A1)–(A3), (A5) and (A6) hold, and that h_j → 0 as n → ∞. Then

    sup_{u ∈ I_j} | E \hat q_j^⋆(u) − q_j(u) | = o(1),   1 ≤ j ≤ d,
    sup_{(u,v) ∈ I_{jk}} | E \hat q_{jk}^⋆(u,v) − q_{jk}(u,v) | = o(1),   1 ≤ j ≠ k ≤ d.

Proof. We prove the second assertion first. Note that

    E(Z_j Z_k | X_j, X_k) − Cov(V_j, V_k) = E(Z_j^o Z_k^o | X_j, X_k) = E(Z_j^o Z_k^o | X_j^o, X_k^o).   (A.9)

Because of the unbiased scoring property (2.7) and from (A.9), it holds that

    E \hat q_{jk}^⋆(x_j, x_k) = E[ E(Z_j^o Z_k^o | X_j^o, X_k^o) L_{h_j}(x_j, X_j^o) L_{h_k}(x_k, X_k^o) ]
                             = q_{jk}(x_j, x_k) μ_j(x_j) μ_k(x_k) + o(1)   (A.10)

uniformly for (x_j, x_k) ∈ [0,1]^2, where μ_j(x_j) = ∫_0^1 L_{h_j}(x_j, u) du. The second equation of (A.10) follows from the assumptions (A1) and (A2). If x_j ∈ I_j, then μ_j(x_j) = 1. This gives the second result. The first result also follows from the fact that

    E \hat q_j^⋆(x_j) = q_j(x_j) μ_j(x_j) + o(1)   (A.11)

uniformly for x_j ∈ [0,1].
Lemma 3. Assume that the conditions of Lemmas 1 and 2 hold. Then, there exist some constants 0 < c_1 < C_1 < ∞ and 0 < C_2 < ∞ such that

    c_1 < inf_{u ∈ [0,1]} \hat q_j^⋆(u) ≤ sup_{u ∈ [0,1]} \hat q_j^⋆(u) < C_1,   1 ≤ j ≤ d,
    sup_{(u,v) ∈ [0,1]^2} | \hat q_{jk}^⋆(u,v) | < C_2,   1 ≤ j ≠ k ≤ d,

with probability tending to one.

Proof. For the proof of the first part, we observe that, from Lemma 1 and (A.11),

    sup_{x_j ∈ [0,1]} | \hat q_j^⋆(x_j) − q_j(x_j) μ_j(x_j) | = o_p(1).   (A.12)

From the symmetry of K and ∫_{−1}^{1} K(u) du = 1, it follows that

    (1/2) K_{h_j}(x_j − u) ≤ K_{h_j}(x_j, u) ≤ 2 K_{h_j}(x_j − u),   1/2 ≤ ∫_0^1 K_{h_j}(x_j − u) du ≤ 1   (A.13)

for all x_j, u ∈ [0,1]. This proves

    1/4 ≤ μ_j(x_j) ≤ 2   for all x_j ∈ [0,1].   (A.14)

Let λ_min and λ_max, respectively, be the lower and upper bounds of the eigenvalues of M(x). Then,

    λ_min ≤ q_j(x_j) = E((Z_j^o)^2 | X_j^o = x_j) p_j(x_j) = ∫_{[0,1]^{d−1}} 1_j^⊤ M(x) 1_j dx_{−j} ≤ λ_max   (A.15)

for all x_j ∈ [0,1]. This together with (A.12) and (A.14) gives the first part of the lemma.

For the proof of the second part, we note that, from Lemma 1 and (A.10),

    sup_{x_j, x_k ∈ [0,1]} | \hat q_{jk}^⋆(x_j, x_k) − q_{jk}(x_j, x_k) μ_j(x_j) μ_k(x_k) | = o_p(1).   (A.16)

We also get

    q_{jk}(x_j, x_k) = E(Z_j^o Z_k^o | X_j^o = x_j, X_k^o = x_k) p_{jk}(x_j, x_k) = ∫_{[0,1]^{d−2}} 1_j^⊤ M(x) 1_k dx_{−(j,k)}.

Since 1_j^⊤ M(x) 1_k ≤ (1_j^⊤ M(x) 1_j)^{1/2} · (1_k^⊤ M(x) 1_k)^{1/2}, it holds that

    | q_{jk}(x_j, x_k) | ≤ q_j(x_j)^{1/2} q_k(x_k)^{1/2}   for all x_j, x_k ∈ [0,1].   (A.17)

The result at (A.16), the bound (A.17) and the upper bounds at (A.14) and (A.15) imply the second part of the lemma.
References

Augustyniak, M. and Doray, L. G. (2012). Inference for a leptokurtic symmetric family of distributions represented by the difference of two gamma variates. Journal of Statistical Computation and Simulation 82, 1621–1634.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.

Delaigle, A., Hall, P. and Meister, A. (2008). On deconvolution with repeated measurements. Annals of Statistics 36, 665–685.

Fan, J. and Masry, E. (1992). Multivariate regression estimation with errors-in-variables: asymptotic normality for mixing processes. Journal of Multivariate Analysis 43, 237–271.

Fan, J. and Truong, Y. K. (1993). Nonparametric regression with errors in variables. Annals of Statistics 21, 1900–1925.

Han, K., Müller, H.-G. and Park, B. U. (2018). Smooth backfitting for additive modeling with small errors-in-variables, with an application to additive functional regression for multiple predictor functions. Bernoulli 24, 1233–1265.

Han, K. and Park, B. U. (2018). Smooth backfitting for errors-in-variables additive models. Annals of Statistics 46, 2216–2250.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B 55, 757–796.

Jeon, J. M. and Park, B. U. (2019+). Additive regression with Hilbertian responses. Annals of Statistics, in print.

Klar, B. (2015). A note on gamma difference distributions. Journal of Statistical Computation and Simulation 85, 3708–3715.

Kotz, S., Kozubowski, T. and Podgorski, K. (2012). The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer Science & Business Media.

Lee, E. R., Han, K. and Park, B. U. (2018). Estimation of errors-in-variables partially linear additive models. Statistica Sinica 28, 2353–2373.

Lee, Y. K., Mammen, E. and Park, B. U. (2012a). Projection-type estimation for varying coefficient regression models. Bernoulli 18, 177–205.

Lee, Y. K., Mammen, E. and Park, B. U. (2012b). Flexible generalized varying coefficient regression models. Annals of Statistics 40, 1906–1933.

Li, L. and Greene, T. (2008). Varying coefficients model with measurement error. Biometrics 64, 519–526.

Liu, Z., Liu, C. and Sun, Z. (2019). Consistent model check of error-in-variables varying coefficient model with auxiliary variables. Journal of Statistical Planning and Inference 198, 13–28.

Mammen, E., Linton, O. and Nielsen, J. P. (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Annals of Statistics 27, 1443–1490.

Park, B. U., Mammen, E., Lee, Y. K. and Lee, E. R. (2015). Varying coefficient regression models: a review and new developments (with discussion). International Statistical Review 83, 36–64.

Xu, J. and Zikatanov, L. (2002). The method of alternating projections and the method of subspace corrections in Hilbert space. Journal of the American Mathematical Society 15, 573–597.

Yang, L., Park, B. U., Xue, L. and Härdle, W. (2006). Estimation and testing for varying coefficients in additive models with marginal integration. Journal of the American Statistical Association 101, 1212–1227.

You, J., Zhou, Y. and Chen, G. (2006). Corrected local polynomial estimation in varying-coefficient models with measurement errors. Canadian Journal of Statistics 34, 391–410.

Yu, K., Park, B. U. and Mammen, E. (2008). Smooth backfitting in generalized additive models. Annals of Statistics 36, 228–260.