
Smooth Backfitting for Errors-in-Variables Varying Coefficient Regression Models


Kyunghee Han, Young K. Lee1 and Byeong U. Park

University of Pennsylvania, Kangwon National University, Seoul National University


Revised for CSDA: December 26, 2019

Abstract

Varying coefficient models inherit the simplicity and easy interpretation of classical linear models while enjoying the flexibility of nonparametric models. They are very useful in analyzing the relation between a response and a set of predictors. There has been no study, however, on the estimation of varying coefficients when the predictors, on which the varying coefficients depend, are contaminated by measurement errors. A new kernel smoothing technique is developed here that is tailored to the structure of an underlying varying coefficient model and corrects for the bias due to the measurement errors. The estimators of the varying coefficients are given implicitly by solving a system of integral equations, whose implementation requires an iterative backfitting procedure. The existence of a unique solution and the convergence of the associated backfitting algorithm are established theoretically. Numerical evidence that supports the theory and demonstrates the success of the proposed methodology is presented.

Key Words: Kernel smoothing, smooth backfitting, varying coefficient models, local polynomial regression, errors-in-variables.

Short Title: SBF for EV Varying Coefficient Models

1 Corresponding author. Postal address: Department of Statistics, Kangwon National University, 313 Global Business Hall, Chuncheon, Gangwon-do, 24341 South Korea. Email: [email protected]. Phone: +82 33 250 8438. Fax: +82 33 259 5663.


1 Introduction

Let $Y \in \mathbb{R}$ be the response variable, and let $X^o = (X_1^o, \ldots, X_d^o)^\top$ and $Z^o = (Z_1^o, \ldots, Z_d^o)^\top$ be the true predictors. We study the estimation of the varying coefficient model
$$Y = f_1(X_1^o)Z_1^o + \cdots + f_d(X_d^o)Z_d^o + e \qquad (1.1)$$
with $E(e \mid X^o, Z^o) = 0$. We consider the case where we do not observe the true predictors, but only versions contaminated by measurement errors. Suppose that we observe $X_i = X_i^o + U_i$ and $Z_i = Z_i^o + V_i$, where $(X_i^o, Z_i^o)$ are unobserved i.i.d. copies of $(X^o, Z^o)$ and $(U_i, V_i)$ are i.i.d. unobserved measurement error vectors. We assume that $(U_i, V_i)$ are independent of $(X_i^o, Z_i^o, e_i)$ and that $U_i$ and $V_i$ are independent of each other. In this paper we propose a powerful technique for estimating the coefficient functions $f_j$ based on the observations of the response and the contaminated predictors.

The varying coefficient model (1.1), introduced by Hastie and Tibshirani (1993), is an example of structural nonparametric regression with which one may avoid the curse of dimensionality. The model has a structure as simple and interpretable as the classical linear model, yet gains flexibility by allowing the regression coefficients to depend on some other predictors in a nonparametric way. The model has been well studied in the case of no measurement error in the predictors, see Yang et al. (2006) and Lee et al. (2012a), among others.

There is another type of varying coefficient model that has been more extensively studied than (1.1) in the case of no measurement error in the predictors. It lets all regression coefficients depend on a common predictor, say $X^o$. The model takes the form
$$Y = f_1(X^o)Z_1^o + \cdots + f_d(X^o)Z_d^o + e. \qquad (1.2)$$
The estimation of the latter model is much easier than that of (1.1) since
$$(f_1(x), \ldots, f_d(x))^\top = E\big(Z^o Z^{o\top} \mid X^o = x\big)^{-1} E(Z^o Y \mid X^o = x),$$
so that the estimation of the $f_j$ may be accomplished by applying a standard kernel smoothing technique to each entry of $Z^o Z^{o\top}$ and $Z^o Y$ as a response and $X^o$ as a single predictor. We refer to the review paper of Park et al. (2015) for a detailed account of the issues in the estimation of the models (1.1) and (1.2) in the case of no measurement error.


There have been a few works treating measurement errors in varying coefficient models. These include You et al. (2006), Li and Greene (2008) and Liu et al. (2019). All of these considered the simpler model (1.2) and the case where only $Z^o$ is contaminated by $V$. In the latter case, the effect of the measurement errors in $Z_i$ may be removed by correcting for the so-called 'attenuation effect'. The measurement errors in $X_i$ are more difficult to handle. There has been no attempt to estimate even the simpler model (1.2) when both $X_i^o$ and $Z_i^o$ are contaminated. We stress that the estimation of the model (1.1) is much more difficult than that of (1.2) when the observed $X_i$ also contain measurement errors. The very core of the matter with the measurement errors in $X_i$ is that the responses $Y_i$ do not contain relevant information about $f_j$ around the observed values of the predictors $X_i$, owing to the measurement errors $U_i$. Thus, local smoothing with a conventional kernel scheme breaks down. In the case of the simpler model (1.2), this difficulty may be resolved by employing the so-called deconvolution kernel (Fan and Truong, 1993). In the case of our model (1.1), however, the deconvolution kernel scheme does not work.

In this paper we develop a powerful technique for fitting the model (1.1) based on $(X_i, Z_i, Y_i)$, $1 \le i \le n$. To remove the effect of the measurement errors in $X_i$, we make use of the 'normalized-and-deconvoluted' kernel scheme designed for additive regression by Han and Park (2018), see Section 2.2. We find that the non-standard kernel weights successfully remove the effect of the measurement errors $U_i$ in the estimation of the varying coefficient model (1.1). The resulting method, combined with a correction for the attenuation effect of $V_i$, has a projection interpretation with relevant function spaces. We detail in this paper the relevant theory for the associated projection operators, which has been lacking in the existing literature even in the case of no measurement error, see Section 3. The set of our estimators of the coefficient functions $f_j$ in (1.1) is given as a solution of a system of integral equations and thus does not have a closed form. We suggest a backfitting algorithm to realize the solution. Our theory justifies the proposed method. We prove that the system of integral equations has a unique solution and that the iterative backfitting algorithm converges to the solution at a geometric speed. We also present numerical evidence that supports the theory.


2 Methodology

We assume that the densities of the original predictors $X_j^o$ are supported on bounded intervals with known boundaries. Without loss of generality we assume the supports equal $[0, 1]$. We also assume that the eigenvalues of $E(Z^o Z^{o\top} \mid X^o = x)$ are bounded below by a positive constant $\lambda_0 > 0$. These assumptions are required to identify each coefficient function $f_j$ in the model (1.1). To see this, we note that, for square integrable functions $m_j$,
$$E\big[(m_1(X_1^o)Z_1^o + \cdots + m_d(X_d^o)Z_d^o)^2\big] = E\big[m(X^o)^\top Z^o Z^{o\top} m(X^o)\big] \ge \lambda_0 \sum_{j=1}^d E\big[m_j(X_j^o)^2\big],$$
where $m = (m_1, \ldots, m_d)^\top$. Since the density of $X^o$ is positive on $[0, 1]^d$, we get from the above inequality that $E\big[(m_1(X_1^o)Z_1^o + \cdots + m_d(X_d^o)Z_d^o)^2\big] = 0$ implies $m_j \equiv 0$ a.e. on $[0, 1]$ for all $1 \le j \le d$.

Let $K$ be a baseline kernel function supported on a compact set, say $[-1, 1]$. Throughout the paper $K_h(v) = K(v/h)/h$. Let $K_h(\cdot, \cdot)$ be the normalized kernel function defined by
$$K_h(x, u) = \Big(\int_0^1 K_h(v - u)\, dv\Big)^{-1} K_h(x - u) \cdot I_{[0,1]^2}(x, u). \qquad (2.1)$$
The normalized kernel $K_h(x, u)$ equals the conventional kernel $K_h(x - u)$ for all $u \in [0, 1]$ if $x$ is in the 'interior' region $[2h, 1 - 2h]$. For the standard kernel $K_h(\cdot)$, the interior region is $[h, 1 - h]$ since $\int_0^1 K_h(x - v)\, dv = 1$ for $x \in [h, 1 - h]$. The above normalized kernel has been used in the smooth backfitting literature for the case of no measurement error in the predictors, see Mammen et al. (1999) or Yu et al. (2008), for example.
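To make the boundary normalization in (2.1) concrete, the following is a minimal numerical sketch. It is not from the paper; the Epanechnikov choice of the baseline kernel and all function names are our own assumptions.

```python
import numpy as np

def K(v):
    """Epanechnikov baseline kernel supported on [-1, 1]."""
    return 0.75 * (1.0 - v ** 2) * (np.abs(v) <= 1.0)

def K_h(v, h):
    """Rescaled kernel K_h(v) = K(v/h)/h."""
    return K(v / h) / h

def K_norm(x, u, h, n_grid=1001):
    """Normalized kernel K_h(x, u) of (2.1) for x, u in [0, 1]:
    K_h(x - u) divided by int_0^1 K_h(v - u) dv."""
    v = np.linspace(0.0, 1.0, n_grid)
    denom = np.trapz(K_h(v - u, h), v)   # equals 1 unless u is within h of the boundary
    return K_h(x - u, h) / denom

# for x in the interior region [2h, 1 - 2h] the normalization changes nothing
print(K_norm(0.5, 0.45, h=0.1), K_h(0.5 - 0.45, 0.1))
# near the boundary the weight is inflated so that it still integrates to one over [0, 1]
print(K_norm(0.02, 0.03, h=0.1), K_h(0.02 - 0.03, 0.1))
```

For $u$ in $[h, 1-h]$ the normalizing integral equals one, so the weight reduces to the conventional $K_h(x - u)$, in line with the remark above.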

For $m : \mathbb{R}^d \to \mathbb{R}^d$ with $m(x) = (m_1(x_1), \ldots, m_d(x_d))^\top$, it makes sense to write $m = (m_1, \ldots, m_d)^\top$ if we interpret $m_j(x)$ as $m_j(x_j)$, where $x_j$ is the $j$th entry of $x$. Throughout this paper, we use this convention. In this way, we may write $f = (f_1, \ldots, f_d)^\top$ for the vector of the true component functions $f_j$ in the model (1.1).

2.1 Adjustment for noisy $Z_i$

To introduce the smooth backfitting (SBF) technique for estimating (1.1), suppose for the time being that we observe $X_i^o \equiv (X_{i1}^o, \ldots, X_{id}^o)^\top$ and $Z_i^o \equiv (Z_{i1}^o, \ldots, Z_{id}^o)^\top$ without measurement errors.


A direct application of the SBF idea is to minimize
$$L(m) := \frac{1}{2}\, n^{-1} \sum_{i=1}^n \int_{[0,1]^d} \big[Y_i - m(x)^\top Z_i^o\big]^2 K_h(x, X_i^o)\, dx \qquad (2.2)$$
over function tuples $m$ with $m(x) = (m_1(x_1), \ldots, m_d(x_d))^\top$, where $K_h(x, u) = \prod_{j=1}^d K_{h_j}(x_j, u_j)$. Define
$$\tilde f_j^o(x_j) = \hat q_j^o(x_j)^{-1} n^{-1} \sum_{i=1}^n Z_{ij}^o Y_i K_{h_j}(x_j, X_{ij}^o) \quad\text{and}\quad \hat q_j^o(x_j) = n^{-1} \sum_{i=1}^n (Z_{ij}^o)^2 K_{h_j}(x_j, X_{ij}^o).$$
Let $\hat M^o =: (\hat M_1^o, \ldots, \hat M_d^o)$ be $d$-vectors defined by $(\hat M_1^o(x), \ldots, \hat M_d^o(x))^\top = n^{-1} \sum_{i=1}^n K_h(x, X_i^o)\, Z_i^o Z_i^{o\top}$. Here and throughout this paper, $x_{-j} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)^\top$. We note that $\hat M^o(x)$ estimates $M(x) := E(Z^o Z^{o\top} \mid X^o = x)\, p(x)$ and $\hat q_j^o(x_j)$ is an estimator of $q_j(x_j) := E((Z_j^o)^2 \mid X_j^o = x_j)\, p_j(x_j)$, where $p$ and $p_j$ are the densities of $X^o$ and $X_j^o$, respectively. By considering the Gâteaux derivatives of $L(m)$, a solution $\hat f^o = (\hat f_1^o, \ldots, \hat f_d^o)^\top$, if it exists, satisfies
$$\tilde f_j^o(x_j)\, \hat q_j^o(x_j) - \int_{[0,1]^{d-1}} \hat M_j^o(x)^\top \hat f^o(x)\, dx_{-j} = 0, \qquad 1 \le j \le d. \qquad (2.3)$$
In the derivation of (2.3), we used the normalization property that $\int_0^1 K_h(x, u)\, dx = 1$ for all $u \in [0, 1]$.

Now, suppose that we observe $X_i^o$ and noisy $Z_i = Z_i^o + V_i$ for random vectors $V_i \equiv (V_{i1}, \ldots, V_{id})^\top$ with $E(V_i) = 0$. To derive a proper estimator of $M$ based on $(X_i^o, Z_i)$, $1 \le i \le n$, we also note that
$$E(Z Z^\top \mid X^o = x) = E(Z^o Z^{o\top} \mid X^o = x) + \Sigma_V,$$
where $\Sigma_V = E(V V^\top)$. This follows from the assumption that $V$ is independent of $Z^o$ and $X^o$. This suggests
$$\hat M(x) = n^{-1} \sum_{i=1}^n K_h(x, X_i^o)\big(Z_i Z_i^\top - \Sigma_V\big)$$
as an estimator of $M(x)$. Define $\hat q_j$ and $\tilde f_j$ as $\hat q_j^o$ and $\tilde f_j^o$, respectively, now with $Z_{ij}$ replacing $Z_{ij}^o$ in $\tilde f_j^o$ and $Z_{ij}^2 - \operatorname{Var}(V_j)$ replacing $(Z_{ij}^o)^2$ in $\hat q_j^o$. For $k \ne j$, define
$$\hat q_{jk}(x_j, x_k) = n^{-1} \sum_{i=1}^n K_{h_j}(x_j, X_{ij}^o) K_{h_k}(x_k, X_{ik}^o)\big(Z_{ij} Z_{ik} - \operatorname{Cov}(V_j, V_k)\big).$$
They estimate $q_{jk}(x_j, x_k) := E(Z_j^o Z_k^o \mid X_j^o = x_j, X_k^o = x_k)\, p_{jk}(x_j, x_k)$, respectively, where $p_{jk}$ is the density of $(X_j^o, X_k^o)$. The contribution of $V_{ij}$ in $Z_{ij}$ to $\tilde f_j$ is negligible since $E(V_{ij} Y_i \mid X_i^o, Z_i^o) = 0$, so that no adjustment is needed in this part for the measurement errors $V_{ij}$. Replacing $\hat M_j^o$, $\tilde f_j^o$ and $\hat q_j^o$ by $\hat M_j$, $\tilde f_j$ and $\hat q_j$, respectively, in (2.3) gives
$$\tilde f_j(x_j)\, \hat q_j(x_j) - \int_{[0,1]^{d-1}} \hat M_j(x)^\top \hat f(x)\, dx_{-j} = 0, \qquad 1 \le j \le d, \qquad (2.4)$$


to solve in order to find $\hat f = (\hat f_1, \ldots, \hat f_d)^\top$. We see that the system of equations (2.4) is equivalent to
$$\hat f_j(x_j) = \tilde f_j(x_j) - \sum_{k=1, \ne j}^d \int_0^1 \hat f_k(x_k)\, \frac{\hat q_{jk}(x_j, x_k)}{\hat q_j(x_j)}\, dx_k, \qquad 1 \le j \le d. \qquad (2.5)$$
In the derivation of the above system of equations, we used $\int \hat M_{jj}(x)\, \hat f_j(x_j)\, dx_{-j} = \hat q_j(x_j)\, \hat f_j(x_j)$ and $\int \hat M_{jk}(x)\, \hat f_k(x_k)\, dx_{-j} = \int_0^1 \hat q_{jk}(x_j, x_k)\, \hat f_k(x_k)\, dx_k$, where $\hat M_{jk}$ denotes the $(j, k)$th entry of $\hat M$.

2.2 Adjustment for noisy $X_i$

We now come to the case where we observe contaminated predictors $(X_i, Z_i)$ instead of $(X_i^o, Z_i^o)$. Recall $X_i = X_i^o + U_i$. We write $U_i = (U_{i1}, \ldots, U_{id})^\top$. The main difficulty with the normalized kernel $K_h(x, u)$ as defined at (2.1) is that the $K_{h_j}(x_j, X_{ij})$ are quite different from the $K_{h_j}(x_j, X_{ij}^o)$ that we would use if the $X_{ij}^o$ were available. Depending on the value of $U_{ij}$, the weight $K_{h_j}(x_j, X_{ij})$ may be large even if $X_{ij}^o$ is far away from $x_j$, and may be small even if $X_{ij}^o$ is very close to $x_j$. Thus, the kernel scheme $K_{h_j}(x_j, X_{ij})$ fails to collect the information about the target function at a point $x_j$ from the $(X_{ij}, Y_i)$ with $X_{ij}^o$ close to $x_j$.

Let $L_h(x, u) = \int_{-\infty}^{\infty} K_h(x, v) K_h(u - v)\, dv$. For the present setting we use the normalized-and-deconvoluted (ND) kernel $K_{h_j}^\star$ defined by
$$K_{h_j}^\star(x_j, u) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itu}\, \frac{\phi_{L_{h_j}(x_j, \cdot)}(t)}{\phi_{U_j}(-t)}\, dt, \qquad (2.6)$$
instead of $K_{h_j}(x_j, u)$, for smoothing across $X_{ij}$. Here and throughout the paper, $\phi_g$ for a function $g$ is the Fourier transform of $g$, and $\phi_U$ for a random variable $U$ is the characteristic function of $U$. In the definition (2.6), for simplicity we suppressed the dependence of $K_{h_j}^\star$ on the index $j$ through the characteristic function of $U_j$. The above ND kernel originates from Han and Park (2018). It has proved successful in the application of the SBF technique to additive models and partially linear additive models with measurement errors in the predictors, see Han and Park (2018) and Lee et al. (2018).

One of the salient features of the ND kernel $K_h^\star$ is the so-called "unbiased scoring" property stated in the following equation:
$$E\big(K_{h_j}^\star(x_j, X_j) \mid X_j^o\big) = L_{h_j}(x_j, X_j^o) \quad \text{for all } x_j, X_j^o \in [0, 1]. \qquad (2.7)$$
This implies that the conditional expectation of the ND kernel weight for the contaminated variable $X_j$, given the original variable $X_j^o$, equals the conventional kernel weight for $X_j^o$. Thus, it effectively removes the bias caused by the measurement errors $U_{ij}$. The unbiased scoring property entails that the bias of $n^{-1} \sum_{i=1}^n K_{h_j}^\star(x_j, X_{ij}) Y_i$, for example, as an estimator of $E(Y \mid X_j^o = x_j)\, p_j(x_j)$, is the same as the bias of $n^{-1} \sum_{i=1}^n L_{h_j}(x_j, X_{ij}^o) Y_i$ with the conventional convolution kernel $L_h(\cdot, \cdot)$. Another important property of $K_h^\star$ is that
$$\int_0^1 K_{h_j}^\star(x_j, v_j)\, dx_j = 1 \quad \text{for all } v_j \in \mathbb{R}. \qquad (2.8)$$
The above normalization property, shared by $K_{h_j}(\cdot, \cdot)$ as defined at (2.1) in the case of no measurement errors, is crucial for the success of the SBF technique.
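The integral in (2.6) can be evaluated by direct quadrature once $\phi_{U_j}$ is known. The sketch below does this for a Laplace-type error, as used in the simulations of Section 4; it is only an illustration under our own naming, discretization and truncation choices, not the authors' implementation.

```python
import numpy as np

def nd_kernel(x, u, h, phi_U, K=lambda v: 0.75 * (1 - v ** 2) * (np.abs(v) <= 1),
              t_max=200.0, n_t=2001, n_grid=801):
    """Crude numerical evaluation of the ND kernel K*_h(x, u) in (2.6) at one
    pair (x, u), given the characteristic function phi_U of the measurement
    error in the covariate."""
    Kh = lambda a: K(a / h) / h
    grid = np.linspace(0.0, 1.0, n_grid)
    # normalized kernel v -> K_h(x, v) of (2.1): normalize over the first argument
    denom = np.trapz(Kh(grid[None, :] - grid[:, None]), grid, axis=1)
    Kh_norm_x = Kh(x - grid) / denom
    # L_h(x, u') = int_0^1 K_h(x, v) K_h(u' - v) dv, evaluated on a grid of u'
    u_grid = np.linspace(-h, 1.0 + h, n_grid)
    L = np.trapz(Kh_norm_x[None, :] * Kh(u_grid[:, None] - grid[None, :]), grid, axis=1)
    # Fourier transform of L_h(x, .), then the deconvolution integral of (2.6)
    t = np.linspace(-t_max, t_max, n_t)
    phi_L = np.trapz(np.exp(1j * np.outer(t, u_grid)) * L[None, :], u_grid, axis=1)
    integrand = np.exp(-1j * t * u) * phi_L / phi_U(-t)
    return np.real(np.trapz(integrand, t)) / (2.0 * np.pi)

# Laplace-type error (DGD with lambda = 1): phi_U(t) = (1 + t^2/theta^2)^(-1)
theta = 20.0
phi_laplace = lambda t: (1.0 + (t / theta) ** 2) ** (-1.0)
print(nd_kernel(0.5, 0.48, h=0.1, phi_U=phi_laplace))
```

The unbiased scoring property (2.7) can be checked by Monte Carlo with such a routine: averaging nd_kernel(x, X, ...) over draws $X = X^o + U$ with $X^o$ fixed should reproduce $L_h(x, X^o)$ up to simulation noise.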

To deconvolute the effect of the measurement errors $U_i$ in $X_i$ in the local smoothing, we basically replace the kernel $K_{h_j}(\cdot, X_{ij}^o)$ of Section 2.1 by $K_{h_j}^\star(\cdot, X_{ij})$. Define
$$\tilde f_j^\star(x_j) = \hat q_j^\star(x_j)^{-1} n^{-1} \sum_{i=1}^n Z_{ij} Y_i K_{h_j}^\star(x_j, X_{ij}),$$
$$\hat q_j^\star(x_j) = n^{-1} \sum_{i=1}^n \big(Z_{ij}^2 - \operatorname{Var}(V_j)\big) K_{h_j}^\star(x_j, X_{ij}), \qquad (2.9)$$
$$\hat q_{jk}^\star(x_j, x_k) = n^{-1} \sum_{i=1}^n \big(Z_{ij} Z_{ik} - \operatorname{Cov}(V_j, V_k)\big) K_{h_j}^\star(x_j, X_{ij})\, K_{h_k}^\star(x_k, X_{ik}).$$
The SBF estimator of $f = (f_1, \ldots, f_d)$ is defined as the solution $\hat f^\star = (\hat f_1^\star, \ldots, \hat f_d^\star)$ of the system of equations
$$\hat f_j^\star(x_j) = \tilde f_j^\star(x_j) - \sum_{k=1, \ne j}^d \int_0^1 \hat f_k^\star(x_k)\, \frac{\hat q_{jk}^\star(x_j, x_k)}{\hat q_j^\star(x_j)}\, dx_k, \qquad 1 \le j \le d. \qquad (2.10)$$
We cannot solve the SBF equation (2.10) explicitly, but only by the following backfitting iteration: for $r \ge 1$,
$$\hat f_j^{\star[r]}(x_j) = \tilde f_j^\star(x_j) - \sum_{k=1}^{j-1} \int_0^1 \hat f_k^{\star[r]}(x_k)\, \frac{\hat q_{jk}^\star(x_j, x_k)}{\hat q_j^\star(x_j)}\, dx_k - \sum_{k=j+1}^{d} \int_0^1 \hat f_k^{\star[r-1]}(x_k)\, \frac{\hat q_{jk}^\star(x_j, x_k)}{\hat q_j^\star(x_j)}\, dx_k, \qquad 1 \le j \le d. \qquad (2.11)$$
In the next section we prove that there exists a unique solution $\hat f^\star$ of the SBF equation (2.10), and that the SBF algorithm (2.11) converges to the solution at a geometric speed. For the initial estimators $\hat f_j^{\star[0]}$ we only require $\int_0^1 \hat f_j^{\star[0]}(x_j)^2\, \hat q_j^\star(x_j)\, dx_j < \infty$ for the SBF iteration (2.11) to converge. We may take $\hat f_j^{\star[0]} \equiv 0$, in which case $\hat f_j^{\star[1]} = \tilde f_j^\star$. This means we may also take $\hat f_j^{\star[0]} = \tilde f_j^\star$ from the start.
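For readers who want to see the whole pipeline in one place, the following is a schematic implementation of (2.9)–(2.11) on an equally spaced grid. It assumes that the ND kernel weights have already been evaluated (for instance with the nd_kernel sketch above) and that $\Sigma_V$ is known; the function and variable names are ours, and no numerical refinements are attempted.

```python
import numpy as np

def sbf_deconvolution(Y, Z, Kstar, Sigma_V, grid, n_iter=50, tol=1e-12):
    """Smooth backfitting for the errors-in-variables varying coefficient
    model: solves the empirical system (2.10) by the iteration (2.11).
    Kstar[j] is an (n, G) array with entries K*_{h_j}(grid[g], X_{ij})."""
    n, d = Z.shape
    G = grid.size
    # marginal quantities of (2.9) on the grid
    q = np.zeros((d, G))                  # \hat q_j^*(x_j)
    f_tilde = np.zeros((d, G))            # \tilde f_j^*(x_j)
    q_cross = np.zeros((d, d, G, G))      # \hat q_{jk}^*(x_j, x_k)
    for j in range(d):
        q[j] = np.mean((Z[:, j] ** 2 - Sigma_V[j, j])[:, None] * Kstar[j], axis=0)
        f_tilde[j] = np.mean((Z[:, j] * Y)[:, None] * Kstar[j], axis=0) / q[j]
        for k in range(d):
            if k != j:
                w = Z[:, j] * Z[:, k] - Sigma_V[j, k]
                q_cross[j, k] = np.einsum('i,ig,ih->gh', w, Kstar[j], Kstar[k]) / n
    # backfitting iteration (2.11), started from \hat f_j^{*[0]} = \tilde f_j^*
    f = f_tilde.copy()
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(d):
            corr = np.zeros(G)
            for k in range(d):
                if k != j:
                    # \int_0^1 \hat f_k^*(x_k) \hat q_{jk}^*(x_j, x_k) / \hat q_j^*(x_j) dx_k
                    corr += np.trapz(q_cross[j, k] * f[k][None, :], grid, axis=1) / q[j]
            f[j] = f_tilde[j] - corr
        if np.sum(np.trapz((f - f_old) ** 2, grid, axis=1)) < tol:
            break
    return f
```

Within one sweep, components with index $k < j$ enter with their current update and those with $k > j$ with the previous one, so the loop mirrors the Gauss-Seidel structure of (2.11); the stopping rule is the integrated squared change used in Section 4.2.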

We close this section with a remark on the case where $M$ is a diagonal matrix. From the model (1.1) we get
$$E(Z_j^o Y \mid X_j^o) = f_j(X_j^o)\, E\big((Z_j^o)^2 \mid X_j^o\big) + \sum_{k \ne j} E\big(f_k(X_k^o) Z_j^o Z_k^o \mid X_j^o\big), \qquad 1 \le j \le d.$$
If $M$ is diagonal, then
$$E\big(f_k(X_k^o) Z_j^o Z_k^o \mid X_j^o = x_j\big)\, p_j(x_j) = \int_{[0,1]^{d-1}} f_k(x_k)\, M_{jk}(x)\, dx_{-j} = 0$$
for all $1 \le j \ne k \le d$, where $M_{jk}$ is the $(j, k)$th entry of $M$. This implies that, if $M$ is diagonal, then
$$f_j(x_j) = \frac{E(Z_j^o Y \mid X_j^o = x_j)}{E\big((Z_j^o)^2 \mid X_j^o = x_j\big)}.$$
In the definitions at (2.9), $\tilde f_j^\star(x_j)$ estimates the right hand side of the above identity and thus may be used as an estimator of $f_j$ when $M$ is diagonal. In this case, no backfitting is needed.

3 Theoretical justification

3.1 The space of coefficient functions

To discuss the existence of the solution of the SBF equation (2.10) and the convergence of its algorithm (2.11), we need to introduce some relevant function spaces. For $x, u \in [0, 1]^d$, let $K_h^\star(x, u) = \prod_{j=1}^d K_{h_j}^\star(x_j, u_j)$ and define
$$\hat M^\star(x) = n^{-1} \sum_{i=1}^n K_h^\star(x, X_i)\big(Z_i Z_i^\top - \Sigma_V\big).$$
Recall that $M(x) = E(Z^o Z^{o\top} \mid X^o = x)\, p(x)$. We consider the space of component function tuples $m = (m_1, \ldots, m_d)^\top : [0, 1]^d \to \mathbb{R}^d$ with $m_j : [0, 1] \to \mathbb{R}$ such that $\int m(x)^\top M(x)\, m(x)\, dx < \infty$. We call it $\mathcal{H}$. Let $\mathcal{H}_j$ denote the subspaces of $\mathcal{H}$ defined by
$$m \in \mathcal{H}_j \quad \text{if and only if} \quad m_k \equiv 0 \text{ for all } k \ne j.$$
It holds that $\mathcal{H} = \mathcal{H}_1 + \cdots + \mathcal{H}_d$. We consider the inner product $\langle \cdot, \cdot \rangle$ defined by
$$\langle g, m \rangle := \int_{[0,1]^d} g(x)^\top M(x)\, m(x)\, dx.$$
Let $\| \cdot \|$ denote the induced norm, so that
$$\|m\|^2 := \int_{[0,1]^d} m(x)^\top M(x)\, m(x)\, dx.$$
In this formulation, $\| \cdot \|$ is certainly a norm for $\mathcal{H}$ under the assumptions (A1) and (A2) in the Appendix. We recall that
$$q_j(u_j) = \int_{[0,1]^{d-1}} M_{jj}(u)\, du_{-j} = E\big((Z_j^o)^2 \mid X_j^o = u_j\big)\, p_j(u_j),$$
$$q_{jk}(u_j, u_k) = \int_{[0,1]^{d-2}} M_{jk}(u)\, du_{-(j,k)} = E\big(Z_j^o Z_k^o \mid X_j^o = u_j, X_k^o = u_k\big)\, p_{jk}(u_j, u_k),$$
where $M_{jk}$ are the $(j, k)$th entries of $M$. From the results (A.15) and (A.17) in the Appendix, there exist constants $0 < \lambda_{1L} < \lambda_{1U} < \infty$ and $0 < \lambda_{2U} < \infty$ such that
$$\lambda_{1L} < \inf_{u \in [0,1]} q_j(u) \le \sup_{u \in [0,1]} q_j(u) < \lambda_{1U}, \quad 1 \le j \le d, \qquad \sup_{(u,v) \in [0,1]^2} |q_{jk}(u, v)| < \lambda_{2U}, \quad 1 \le j \ne k \le d. \qquad (3.1)$$

Remark 1. We note that the empirical version of $\| \cdot \|$, defined by $\|m\|_n^2 = \int m(x)^\top \hat M^\star(x)\, m(x)\, dx$, is not appropriate as a norm for the space $\mathcal{H}$ since it may take negative values. If $\hat M^\star(x)$ converges to $M(x) = E(Z^o Z^{o\top} \mid X^o = x)\, p(x)$ uniformly for $x \in [0, 1]^d$ and the assumptions (A1) and (A2) in the Appendix hold, then $\| \cdot \|_n$ is equivalent to $\| \cdot \|$ on $\mathcal{H}$ with probability tending to one. Thus, one may use it as a norm in such a case. But the convergence of $\hat M^\star(x)$ to $M(x)$ requires a strong condition on the bandwidths $h_j$, that they go to zero very slowly, since $\hat M^\star(x)$ involves $d$-dimensional smoothing. Indeed, we need $n \prod_{j=1}^d h_j^{1+2\beta} \to \infty$ as $n \to \infty$ for some $\beta \ge 0$, where $\beta$ depends on the tail behaviour of the characteristic function of $U_{ij}$, see the assumption (A5) in the Appendix.

3.2 Existence and uniqueness of SBF solution

To discuss the theoretical justification of our methodology in Section 2.2, we first prove that $\mathcal{H}$ is closed. Define the $d$-vectors $M_j$ by $M = (M_1, \ldots, M_d)$. We introduce relevant projection operators that map $\mathcal{H}$ to $\mathcal{H}_j$ for $1 \le j \le d$. Recall that $\mathcal{H}_j$ consists of function tuples of the form $\mathbf{1}_j \cdot m_j$ for univariate functions $m_j$, where $\mathbf{1}_j = (0, \ldots, 0, 1, 0, \ldots, 0)^\top$ is the unit $d$-vector with '1' appearing at the $j$th position. Define the integral operators $\Pi(\cdot \mid \mathcal{H}_j) : \mathcal{H} \to \mathcal{H}_j$ by
$$\Pi(m \mid \mathcal{H}_j)(x) = \mathbf{1}_j \cdot q_j(x_j)^{-1} \int_{[0,1]^{d-1}} M_j(x)^\top m(x)\, dx_{-j}. \qquad (3.2)$$
Then, it holds that
$$\langle g,\, m - \Pi(m \mid \mathcal{H}_j) \rangle = 0$$
for all $g \in \mathcal{H}_j$ and for all $m \in \mathcal{H}$, so that $\Pi(\cdot \mid \mathcal{H}_j)$ is a projection operator onto $\mathcal{H}_j$ in $\mathcal{H}$.

Remark 2. In the above definition (3.2) of $\Pi(\cdot \mid \mathcal{H}_j)$, it is worthwhile to note that the left hand side of the definition actually depends only on $x_j$, the $j$th coordinate of $x$. The 'dummy' $x_{-j}$ for integration on the right hand side is not meant to be that part of $x$ on the left hand side. Nevertheless, here and below we use the same notation $x$ on both sides for succinct expression.

Let $\mathbb{R}_j^d = \{a \cdot \mathbf{1}_j : a \in \mathbb{R}\}$ for $1 \le j \le d$, and let $\mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)$ denote the space of all bounded linear operators that map $\mathbb{R}_k^d$ to $\mathbb{R}_j^d$. If we restrict the domain of $\Pi(\cdot \mid \mathcal{H}_j)$ to $\mathcal{H}_k$ for $k \ne j$, then we get that, for $\mathbf{m}_k = \mathbf{1}_k \cdot m_k \in \mathcal{H}_k$,
$$\Pi(\mathbf{m}_k \mid \mathcal{H}_j)(u) = \int_{[0,1]^d} \Big(\mathbf{1}_j \cdot \frac{q_{jk}(u_j, v_k)}{q_j(u_j)\, q_k(v_k)} \cdot \mathbf{1}_k^\top\Big)\, m_k(v)\, M_{kk}(v)\, dv,$$
where we used the fact that $\int M_{kk}(v)\, dv_{-k} = q_k(v_k)$. Define $\kappa_{jk} : [0, 1]^d \times [0, 1]^d \to \mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)$ by
$$\kappa_{jk}(u, v)(w) = \Big(\mathbf{1}_j \cdot \frac{q_{jk}(u_j, v_k)}{q_j(u_j)\, q_k(v_k)} \cdot \mathbf{1}_k^\top\Big) w, \qquad w \in \mathbb{R}_k^d.$$
Then, we may write $\Pi(\mathbf{m}_k \mid \mathcal{H}_j)(u) = \int \kappa_{jk}(u, v)(\mathbf{m}_k(v))\, M_{kk}(v)\, dv$, so that $\kappa_{jk}$ is the kernel of the integral operator $\Pi(\cdot \mid \mathcal{H}_j)$ restricted to $\mathcal{H}_k$. Let $\|g\|_{\mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)}$ for $g \in \mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)$ denote the operator norm of $g$. According to the theory of integral operators (see Theorem 3.1 of Jeon and Park (2019+) for a general result), the $\Pi(\cdot \mid \mathcal{H}_j) : \mathcal{H}_k \to \mathcal{H}_j$ are compact if
$$\int_{[0,1]^d \times [0,1]^d} \|\kappa_{jk}(u, v)\|_{\mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)}^2\, M_{jj}(u)\, M_{kk}(v)\, du\, dv < \infty, \qquad 1 \le j \ne k \le d. \qquad (3.3)$$
Clearly, $\|\kappa_{jk}(u, v)\|_{\mathcal{L}(\mathbb{R}_k^d, \mathbb{R}_j^d)} = |q_{jk}(u_j, v_k)/(q_j(u_j)\, q_k(v_k))|$. Thus, (3.3) holds for all $1 \le j \ne k \le d$ if
$$\int_{[0,1]^2} \frac{q_{jk}(u_j, v_k)^2}{q_j(u_j)\, q_k(v_k)}\, du_j\, dv_k < \infty, \qquad 1 \le j \ne k \le d. \qquad (3.4)$$
The property (3.4) holds due to (3.1). According to Proposition A.4.2 of Bickel et al. (1993), this implies that $\mathcal{H}$ is closed. Then, by Corollary 4.3 of Xu and Zikatanov (2002), the linear operator $T := (I - \pi_d) \circ \cdots \circ (I - \pi_1)$ is a contraction, i.e.,
$$\|T\|_{\mathrm{op}} := \sup\{\|Tg\| : g \in \mathcal{H} \text{ with } \|g\| \le 1\} < 1. \qquad (3.5)$$

To prove that the solution of the SBF equation (2.10) exists and is unique, let us consider the empirical versions of the operators $\Pi(\cdot \mid \mathcal{H}_j)$. For simplicity, we denote $\Pi(\cdot \mid \mathcal{H}_j)$ by $\pi_j$. Define
$$\hat\pi_j(m)(x) = \mathbf{1}_j \cdot \hat q_j^\star(x_j)^{-1} \int_{[0,1]^{d-1}} \hat M_j^\star(x)^\top m(x)\, dx_{-j}, \qquad 1 \le j \le d. \qquad (3.6)$$
We claim that $\hat\pi_j : \mathcal{H} \to \mathcal{H}_j$ are well-defined. To see this, define $\|m\|_*$ by $\|m\|_*^2 = \int m(x)^\top m(x)\, dx = \sum_{j=1}^d \int m_j(x_j)^2\, dx_j$. Then, there exist absolute constants $0 < c_1 < c_2 < \infty$ such that
$$c_1 \|m\| \le \|m\|_* \le c_2 \|m\| \qquad (3.7)$$
for all $m \in \mathcal{H}$. The inequalities follow from the assumptions (A1) and (A2) in the Appendix. Now, using Lemma 3, (3.1) and the fact that
$$\|\mathbf{1}_j \cdot g_j\|^2 = \int_0^1 g_j(x_j)^2\, q_j(x_j)\, dx_j, \qquad (3.8)$$
we may prove that there exists an absolute constant $0 < c_3 < \infty$ such that
$$P\big(\|\hat\pi_j(m)\|^2 \le c_3 \|m\|_*^2 \text{ for all } m \in \mathcal{H}\big) \to 1.$$
This with (3.7) establishes that the $\hat\pi_j$ are well-defined on an event with probability tending to one. We stress that the $\hat\pi_j$ are not projection operators in $\mathcal{H}$ with the inner product $\langle \cdot, \cdot \rangle$, however, while the $\pi_j$ are.

As (2.5) is equivalent to (2.4), the system of equations (2.10) is equivalent to
$$\tilde f_j^\star(x_j)\, \hat q_j^\star(x_j) - \int_{[0,1]^{d-1}} \hat M_j^\star(x)^\top \hat f^\star(x)\, dx_{-j} = 0, \qquad 1 \le j \le d, \qquad (3.9)$$
where the $\hat M_j^\star$ are $d$-vectors defined by $\hat M^\star = (\hat M_1^\star, \ldots, \hat M_d^\star)$. Define $\hat{\mathbf f}_j^\star = \mathbf{1}_j \cdot \hat f_j^\star$ and $\tilde{\mathbf f}_j^\star = \mathbf{1}_j \cdot \tilde f_j^\star$. Then, $\hat f^\star = \sum_{j=1}^d \hat{\mathbf f}_j^\star$ and $\tilde f^\star = \sum_{j=1}^d \tilde{\mathbf f}_j^\star$. Also, we may express the SBF equation at (3.9) as
$$\hat{\mathbf f}_j^\star = \tilde{\mathbf f}_j^\star - \hat\pi_j\Big(\sum_{k=1}^{j-1} \hat{\mathbf f}_k^\star + \sum_{k=j+1}^{d} \hat{\mathbf f}_k^\star\Big), \qquad 1 \le j \le d.$$


This is equivalent to the system of equations
$$\hat f^\star = \tilde{\mathbf f}_j^\star + (I - \hat\pi_j)\hat f^\star, \qquad 1 \le j \le d, \qquad (3.10)$$
where we used $(I - \hat\pi_j)\hat{\mathbf f}_j^\star = 0$. Now, define $\hat\pi_j^\perp = I - \hat\pi_j$, the orthogonal complement of $\hat\pi_j$. We apply (3.10) successively from $j = d$ to $j = 1$. Then, we get
$$\hat f^\star = \tilde{\mathbf f}_\oplus^\star + \hat T \hat f^\star, \qquad (3.11)$$
where $\hat T = \hat\pi_d^\perp \circ \cdots \circ \hat\pi_1^\perp$ and $\tilde{\mathbf f}_\oplus^\star = \tilde{\mathbf f}_d^\star + \hat\pi_d^\perp \tilde{\mathbf f}_{d-1}^\star + (\hat\pi_d^\perp \circ \hat\pi_{d-1}^\perp)\tilde{\mathbf f}_{d-2}^\star + \cdots + (\hat\pi_d^\perp \circ \cdots \circ \hat\pi_2^\perp)\tilde{\mathbf f}_1^\star$. In the Appendix we prove that there exists an absolute constant $0 < \gamma < 1$ such that
$$\|\hat T\|_{\mathrm{op}} := \sup\big\{\|\hat T g\| : g \in \mathcal{H} \text{ with } \|g\| \le 1\big\} < \gamma \qquad (3.12)$$
with probability tending to one. This gives the following theorem.

Theorem 1. Under the assumptions (A1)–(A7), the solution of the SBF equation (2.10) exists and is unique with probability tending to one.

3.3 Convergence of SBF algorithm

Define $\hat f^{\star[r]} = (\hat f_1^{\star[r]}, \ldots, \hat f_d^{\star[r]})^\top$ and $\hat{\mathbf f}_j^{\star[r]} = \mathbf{1}_j \cdot \hat f_j^{\star[r]}$. Then, $\hat f^{\star[r]} = \sum_{j=1}^d \hat{\mathbf f}_j^{\star[r]}$. Along the lines of the above arguments that lead to (3.11), we may also prove that the SBF algorithm (2.11) is equivalent to $\hat f^{\star[r]} = \tilde{\mathbf f}_\oplus^\star + \hat T \hat f^{\star[r-1]}$. To see this, we first note that the SBF algorithm at (2.11) may be expressed as
$$\hat{\mathbf f}_j^{\star[r]} = \tilde{\mathbf f}_j^\star - \hat\pi_j\Big(\sum_{k=1}^{j-1} \hat{\mathbf f}_k^{\star[r]} + \sum_{k=j+1}^{d} \hat{\mathbf f}_k^{\star[r-1]}\Big), \qquad 1 \le j \le d.$$
This is equivalent to the system
$$\sum_{k=1}^{j} \hat{\mathbf f}_k^{\star[r]} + \sum_{k=j+1}^{d} \hat{\mathbf f}_k^{\star[r-1]} = \tilde{\mathbf f}_j^\star + (I - \hat\pi_j)\Big(\sum_{k=1}^{j-1} \hat{\mathbf f}_k^{\star[r]} + \sum_{k=j}^{d} \hat{\mathbf f}_k^{\star[r-1]}\Big), \qquad 1 \le j \le d. \qquad (3.13)$$
Starting with $\hat f^{\star[r]} = \sum_{k=1}^{d} \hat{\mathbf f}_k^{\star[r]} + 0$, the left hand side of (3.13) for $j = d$, we apply (3.13) successively for $j = d$ to $j = 1$, changing the $r$th update to the $(r-1)$th update one by one.


Then, we get X d−1

ˆf ?[r] + ˆf ?[r−1] k d

k=1

? ⊥ = ˜fd? + π ˆd⊥˜fd−1 +π ˆd⊥ π ˆd−1



X d−2

ˆf ?[r] + k

k=1

= ˜f⊕? + Tˆ

X d

ˆf ?[r−1] k

k=1



= ˜f⊕? + Tˆˆf ?[r−1] .

k=d−1

∞ X s=r



(3.14)

Pr−1 ˆs˜? ˆr ˆ?[0] for all r ≥ 1. We also s=0 T f⊕ + T f

Pr e-

Now, applying (3.14) repeatedly, we obtain ˆf ?[r] = P ˆs˜? have ˆf ? = ∞ s=0 T f⊕ from (3.11). These imply kˆf ?[r] − ˆf ? k ≤

ˆf ?[r−1] k

p ro

···

d X

of

ˆf ?[r] = ˜f ? + π ˆd⊥ d

kTˆksop · k˜f⊕? k + kTˆkrop · kˆf ?[0] k

≤ kTˆkrop ·

! k˜f⊕? k + kˆf ?[0] k , 1 − kTˆkop

r ≥ 1.

Using (3.12) we now get the following theorem.

Theorem 2. Under the assumptions (A1)–(A7) in the Appendix, there exists an absolute constant $0 < \gamma < 1$ such that
$$\|\hat f^{\star[r]} - \hat f^\star\| \le \gamma^r \cdot \bigg(\frac{\|\tilde{\mathbf f}_\oplus^\star\| + \|\hat f^{\star[0]}\|}{1 - \gamma}\bigg), \qquad r \ge 1,$$
with probability tending to one.

4 Numerical evidence

We demonstrate the numerical performance of the proposed SBF estimators $\hat f_j^\star$. Below, we first describe an efficient way of data-driven bandwidth selection for $h = (h_1, \ldots, h_d)$ that we used in the simulation study.

4.1 Bandwidth selection

To choose the bandwidths $h_j$ in $K_{h_j}^\star$, we employed a shrinkage bandwidth selection strategy introduced by Han et al. (2018) and Han and Park (2018), based on 5-fold cross-validation (CV). To get into some details, we denote by $\hat f_j^\star(\cdot\,; h, \mathcal{X})$ the component function estimates based on a sample $\mathcal{X}$ and a bandwidth vector $h$. Let $\mathcal{X}_n = \{(Y_i, X_i, Z_i) : 1 \le i \le n\}$ and let $(I_k : 1 \le k \le 5)$ be a partition of the index set $\{1, \ldots, n\}$ such that $\bigcup_{\ell=1}^5 I_\ell = \{1, \ldots, n\}$ and $I_k \cap I_{k'} = \emptyset$ for $k \ne k'$. Also, let $\mathcal{X}_{n,(-k)} = \{(Y_i, X_i, Z_i) : i \notin I_k\}$. We chose a baseline bandwidth vector $h_k = (h_{k,1}, \ldots, h_{k,d})$ for each of the subsamples $\mathcal{X}_{n,(-k)}$, which we describe below. We computed the prediction error
$$\mathrm{CV}_k(\alpha h_k, \mathcal{X}_n) = \sum_{i \in I_k}\Big\{Y_i - \sum_{j=1}^d \hat f_j^\star\big(X_{ij}; \alpha h_k, \mathcal{X}_{n,(-k)}\big)\, Z_{ij}\Big\}^2, \qquad \alpha > 0, \qquad (4.1)$$
for each $1 \le k \le 5$. Here, $\alpha$ is a positive shrinkage factor commonly applied to all components. Then, we found the optimal shrinkage factor $\hat\alpha$ such that
$$\hat\alpha = \arg\min_{\alpha > 0} \sum_{k=1}^5 \mathrm{CV}_k(\alpha h_k, \mathcal{X}_n). \qquad (4.2)$$
Our choice of the bandwidth vector was then $\hat h = \hat\alpha\, h_{\mathrm{full}}$, where $h_{\mathrm{full}} = (h_{\mathrm{full},1}, \ldots, h_{\mathrm{full},d})$ is the baseline bandwidth vector defined similarly to $h_k$ but based on the entire sample $\mathcal{X}_n$.

We chose the baseline bandwidth vector $h_k$ in (4.1) and (4.2) by calibrating optimal bandwidths for marginal varying coefficient regression. To be specific, let $\hat g_j(\cdot\,; b, \mathcal{X})$ denote the marginal estimator with $(X_j, Z_j)$ being the only predictor set, based on a sample $\mathcal{X}$ and a bandwidth $b$. For example,
$$\hat g_j(x_j; b_j, \mathcal{X}_n) = \bigg(\sum_{i=1}^n Z_{ij}^2 K_{b_j}(x_j, X_{ij})\bigg)^{-1} \sum_{i=1}^n Z_{ij} Y_i K_{b_j}(x_j, X_{ij}),$$
where $K_{b_j}(x, u)$ is the normalized kernel function with bandwidth $b_j$ as defined at (2.1). We used $h_{k,j} = b_{k,j}^{5/12}$, where
$$b_{k,j} = \arg\min_{b_j > 0} \sum_{i \in I_k}\big\{Y_i - \hat g_j\big(X_{ij}; b_j, \mathcal{X}_{n,(-k)}\big)\, Z_{ij}\big\}^2.$$
The target of the marginal varying coefficient estimator $\hat g_j$ is $g_j = E(Z_j Y \mid X_j = \cdot)/E(Z_j^2 \mid X_j = \cdot)$. The optimal size of $b_j$ in this estimation problem equals $n^{-1/5}$. It turns out that the optimal sizes of $h_j$ for the additive component estimators $\hat f_j^\star$ equal $n^{-1/(5+2\beta)}$ when $\beta < 1/2$ and $n^{-1/(4+4\beta)}$ when $\beta > 1/2$, where $\beta$ is the constant in the condition (A5). We may show this along the lines of the technical arguments developed in Han and Park (2018) for the case of additive regression, see Section 5 for further discussion. In our simulation setting $\beta$ equals 2, which justifies the exponent $5/(4+4\beta) = 5/12$ in $h_{k,j} = b_{k,j}^{5/12}$.
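A compact way to organise the shrinkage step (4.1)–(4.2) is to wrap the fold-wise prediction error in a function of the common factor and minimise it over a grid. The sketch below is our own schematic: the fitting routine fit_predict, the fold structure and the baseline bandwidths are assumed to be supplied by the user (for instance built on top of the sbf_deconvolution sketch of Section 2), and a grid search stands in for the one-dimensional minimisation over the shrinkage factor.

```python
import numpy as np

def shrink_bandwidths(folds, h_base, h_full, fit_predict,
                      alphas=np.linspace(0.2, 2.0, 19)):
    """Shrinkage bandwidth selection of Section 4.1.
    folds: list of (train_idx, test_idx, y_test) for the 5 folds.
    h_base[k]: baseline bandwidth vector for fold k; h_full: full-sample baseline.
    fit_predict(train_idx, test_idx, h): test-set predictions
    sum_j fhat_j(X_ij) Z_ij of the SBF fit with bandwidth vector h."""
    def cv(alpha):
        err = 0.0
        for k, (train_idx, test_idx, y_test) in enumerate(folds):
            pred = fit_predict(train_idx, test_idx, alpha * h_base[k])
            err += np.sum((y_test - pred) ** 2)   # prediction error (4.1)
        return err
    alpha_hat = min(alphas, key=cv)               # grid-search version of (4.2)
    return alpha_hat * h_full
```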


4.2 Simulation study

We employed a double gamma difference distribution (Augustyniak and Doray, 2012; Klar, 2015) for the distributions of the measurement errors Uj . A double gamma difference (DGD) random d

of

variable U with shape and rate parameters λ and θ, respectively, has the representation U = U1 − U2 for independent U1 , U2 ∼ Γ(λ, θ). The characteristic function of the DGD random

variable U is given by φU (t) = (1 + t2 /θ2 )−λ , so that it satisfies the condition (A5) with β = 2λ.

p ro

We took λ = 1, which corresponds to a Laplace distribution with a scale parameter θ (Kotz et al., 2012). For the measurement errors Vj , we took normal distributions N (0, τ 2 ). We considered two scenarios, one for d = 3 and the other for d = 6. In the first case, we took the error e in the model (1.1) such that its conditional distribution given Xo = x and Zo = z was N (0, σ 2 (x, z)). Here,

Pr e-

  1 z22 + z32 x1 + x2 σ(x, z) = + exp −2 + 2 1 + z22 + z32 2

as in Yang et al. (2006) and Lee et al. (2012a). We generated W = (W1 , W2 , W3 ) from the multivariate normal distribution with mean zero and Cov(Wj , Wk ) = 0.7|j−k| . We took Xjo = Φ(Wj ), 1 ≤ j ≤ 3, where Φ is the cumulative distribution function of N (0, 1). In this way

we allowed dependence between Xjo ’s. We took Z1o ≡ 1 and generated the other two regression predictors (Z2o , Z3o ) according to a bivariate normal distribution with mean zero and Cov(Zjo , Zko ) =

al

0.5|j−k| for 2 ≤ j, k ≤ 3. We set the coefficient functions as f1 (x1 ) = 1+e2x1 −1 , f2 (x2 ) = cos(2πx2 ) and f3 (x3 ) = x23 .
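For concreteness, here is one way to generate a Monte Carlo sample from the $d = 3$ design just described. It is our own sketch: the conditional error scale is passed in as a user-supplied function sigma_fun (the exact variance function of the paper is not reproduced here), and the parameters theta and tau would in practice be calibrated to the stated noise-to-signal ratios.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def dgd(lam, theta, size):
    """Double gamma difference: U = U1 - U2 with U1, U2 ~ Gamma(lam, rate=theta),
    so that phi_U(t) = (1 + t^2/theta^2)^(-lam); lam = 1 gives a Laplace-type error."""
    return rng.gamma(lam, 1.0 / theta, size) - rng.gamma(lam, 1.0 / theta, size)

def simulate_d3(n, theta, tau, sigma_fun):
    """One Monte Carlo sample of the d = 3 design of Section 4.2."""
    # latent X^o: Gaussian copula with Cov(W_j, W_k) = 0.7^{|j-k|}, X_j^o = Phi(W_j)
    S = 0.7 ** np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
    W = rng.multivariate_normal(np.zeros(3), S, size=n)
    Xo = norm.cdf(W)
    # regression predictors: Z_1^o = 1, (Z_2^o, Z_3^o) normal with Cov = 0.5^{|j-k|}
    Zo = np.ones((n, 3))
    Zo[:, 1:] = rng.multivariate_normal(np.zeros(2), np.array([[1.0, 0.5], [0.5, 1.0]]), size=n)
    f = [lambda x: 1 + np.exp(2 * x - 1), lambda x: np.cos(2 * np.pi * x), lambda x: x ** 2]
    m = sum(f[j](Xo[:, j]) * Zo[:, j] for j in range(3))
    Y = m + sigma_fun(Xo, Zo) * rng.standard_normal(n)
    # contaminated observations
    X = Xo + dgd(1.0, theta, (n, 3))              # DGD (Laplace-type) errors in X
    Z = Zo + tau * rng.standard_normal((n, 3))    # N(0, tau^2) errors in Z
    return Y, X, Z, Xo, Zo
```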

urn

For the high-dimensional case with d = 6, we chose (X1o , X2o , X3o ) and (Z1o , Z2o , Z3o ) to be the same as in the low-dimensional case. We generated the additional predictors Xjo for 4 ≤ j ≤ 6 from

independent uniform distributions on [0, 1] and Zjo for 4 ≤ j ≤ 6 from independent N (0, 1). For the additional coefficient functions, we took f4 (x4 ) = pdfΓ(4,5) (x4 ), f5 (x5 ) = arctan(2π(x5 − 0.5)) and f6 (x6 ) = x6 − 0.5. Here, pdfΓ(4,5) is the probability density function of a gamma random

Jo

variable with the shape and rate parameters 4 and 5, respectively. For simplicity, we set the variance function σ(x, z) not to depend on (xj , zj ) for 4 ≤ j ≤ 6. We considered two levels of noise-to-signal ratios (NSR) for the measurement errors. Specifically, we chose the values of θ for the distribution of Uj and τ 2 for the distribution of Vj so that Var(Uj )/Var(Xjo ) = Var(Vj )/Var(Zjo ) = 0.1 and 0.2. To evaluate the estimators fˆj? , we used the Epanechnikov kernel as the baseline kernel function 15

Journal Pre-proof

ˆ = (h ˆ 1, . . . , h ˆ d ) that we described in Section 4.1. We K and the data-driven bandwidth selectors h applied the iterative algorithm (2.11). The convergence criterion we chose was

j=1

2 ?[r−1] ?[r] fˆj (xj ) − fˆj (xj ) dxj ≤ 10−12 .

of

d Z  X

We found that the algorithm converged within r = 15 iterations in all simulation scenarios. This suggests that the smooth backfitting algorithm (2.11) works very well in the estimation problem.

p ro

We compared our deconvolution SBF estimators fˆj? as defined at (2.10) with the naive SBF estimators. The naive SBF estimators are the same as fˆjo as defined at (2.3) but with Xijo and o being replaced by the contaminated X and Z , respectively. Thus, they basically ignore Zij ij ij

the measurement errors and simply follow the SBF procedure as described in Lee et al. (2012a)

Pr e-

with the contaminated predictor values. For these naive SBF estimators, we also employed the shrinkage bandwidth selection strategy as described above, but with hk and hfull being replaced by bk and bfull , respectively. We used the Epanechnikov kernel and the 5-fold CV criterion as in the case of our deconvolution SBF. We found that the shrinkage bandwidth selection strategy worked quite well both for the deconvolution and the naive SBF. We examined the numerical performance of each component function estimator in terms of the mean integrated squared error (MISE). We computed B Z X

1

al

MISE(fˆj ) ≈ B −1

b=1

0

2 (b) fˆj (xj ) − fj (xj ) dxj .

(b) Here, fˆj , 1 ≤ j ≤ d, are either the deconvolution SBF (D-SBF) estimators or the naive SBF (N(b)

(b)

(b)

(b)

urn

SBF) estimators of fj (xj ) based on the b-th Monte Carlo pseudo sample Xn = {(Yi , Xi , Zi ) : 1 ≤ i ≤ n}. Tables 1–4 show the approximated values of MISE, where we also report the approximated values of the integrated squared bias (ISB) and the integrated variance (IV), Z 1 2 ¯(b) fˆj (xj ) − fj (xj ) dxj , ISB(fˆj ) ≈

Jo

0

IV(m ˆ j) ≈ B

−1

B Z X b=1

0

1

2 ¯ (b) fˆj (xj ) − fˆj (xj ) dxj .

P ¯(b) ˆ(b) Here, fˆj (xj ) = B −1 B b=1 fj (xj ) for j = 1, . . . , d.

We find that the overall MISE performance of D-SBF is better than N-SBF. For some compo-

nents in Tables 1, 2 and 4, the MISE value of the D-SBF component estimator is larger than that 16

Journal Pre-proof

of the corresponding N-SBF component. However, summing up the MISE values for all components always give a smaller value to the D-SBF estimators than to the N-SBF estimators. Also, in all simulation settings the MISE values of the D-SBF estimators decrease faster than the N-SBF

of

estimators as the sample size n increases. In particular, the D-SBF estimators have smaller ISB values than the N-SBF estimators, which owes to the unbiased scoring property at (2.7). This indicates that our proposed deconvolution procedure corrects successfully for the estimation bias

p ro

due to measurement errors. It is worth to mention that N-SBF generally achieves lower estimation variance than D-SBF but they are getting closer as the sample size increases. Figures 1–3 demonstrate the attenuation-to-null phenomenon for the N-SBF estimators, while our proposal recovers the true coefficient functions quite well both in the low- and high-dimensional cases. In the Supplement, we also report the simulation results when the measurement errors in Xij were

Pr e-

generated from DGD with β = 0.4. They gave the same lessons as those from Tables 1–4 and

Jo

urn

al

Figures 1–3.

17

of

Journal Pre-proof

p ro

Table 1: The mean integrated squared error (MISE), the integrated squared biases (ISB) and the integrated variances (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 3 and the noise-to-signal (NSR) ratio equals 10%, based on 500 Monte Carlo samples.

n = 400

Deconvolution SBF

Pr e-

n = 200

Coefficient function

MISE

ISB

f1

0.0303

f2

IV

MISE

ISB

0.0126

0.0177

0.0579

0.0419

0.0160

0.0758

0.0268

0.0490

0.1192

0.0780

0.0412

f3

0.0537

0.0016

0.0521

0.0442

0.0078

0.0364

Sum

0.1598

0.0410

0.1188

0.2213

0.1277

0.0936

f1

0.0205

0.0115

0.0090

0.0490

0.0405

0.0085

f2

0.0495

0.0268

0.0227

0.0956

0.0759

0.0197

f3

0.0272

0.0020

0.0252

0.0249

0.0079

0.0170

0.0972

0.0403

0.0569

0.1695

0.1243

0.0452

IV

0.0153

0.0110

0.0043

0.0438

0.0399

0.0039

f1 f2

0.0376

0.0273

0.0103

0.0851

0.0760

0.0091

f3

0.0134

0.0018

0.0116

0.0159

0.0073

0.0086

Sum

0.0663

0.0401

0.0262

0.1448

0.1232

0.0216

urn

Sum

Jo

n = 800

Naive SBF

al

Sample size

18

of

Journal Pre-proof

p ro

Table 2: The mean integrated squared error (MISE), the integrated squared biases (ISB) and the integrated variances (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 3 and the noise-to-signal (NSR) ratio equals 20%, based on 500 Monte Carlo samples.

n = 400

Deconvolution SBF

Pr e-

n = 200

Coefficient function

MISE

ISB

f1

0.0434

f2

IV

MISE

ISB

0.0108

0.0326

0.0906

0.0739

0.0167

0.1282

0.0178

0.1104

0.1741

0.1375

0.0366

f3

0.1262

0.0012

0.1250

0.0492

0.0172

0.0320

Sum

0.2978

0.0298

0.2680

0.3139

0.2286

0.0853

f1

0.0278

0.0111

0.0167

0.0851

0.0756

0.0095

f2

0.0915

0.0234

0.0681

0.1569

0.1375

0.0194

f3

0.0691

0.0023

0.0668

0.0371

0.0194

0.0177

0.1884

0.0368

0.1516

0.2791

0.2325

0.0466

IV

0.0185

0.0106

0.0079

0.0787

0.0742

0.0045

f1 f2

0.0512

0.0255

0.0257

0.1472

0.1378

0.0094

f3

0.0299

0.0018

0.0281

0.0263

0.0177

0.0086

Sum

0.0996

0.0379

0.0617

0.2522

0.2297

0.0225

urn

Sum

Jo

n = 800

Naive SBF

al

Sample size

19

of

Journal Pre-proof

Table 3: The mean integrated squared error (MISE), the integrated squared biases (ISB) and the

p ro

integrated variances (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6 and the noise-to-signal (NSR) ratio equals 10%, based on 500 Monte Carlo samples.

function

MISE

ISB

f1

0.0443

0.0216

0.0227

0.0745

0.0476

0.0269

f2

0.1063

0.0637

0.0426

0.1543

0.1101

0.0442

f3

0.0472

0.0034

0.0438

0.0522

0.0093

0.0429

f4

0.0502

0.0099

0.0403

0.0708

0.0306

0.0402

f5

0.0580

0.0197

0.0383

0.0977

0.0587

0.0390

f6

0.0463

0.0029

0.0434

0.0491

0.0071

0.0420

Sum

0.3523

0.1212

0.2311

0.4986

0.2634

0.2352

f1

0.0311

0.0205

0.0106

0.0594

0.0477

0.0117

0.0794

0.0592

0.0202

0.1269

0.1070

0.0199

0.0226

0.0028

0.0198

0.0273

0.0079

0.0194

urn

f3

ISB

IV

Naive SBF

MISE

f2 n = 400

Deconvolution SBF

Pr e-

n = 200

Coefficient

al

Sample size

IV

0.0255

0.0094

0.0161

0.0473

0.0302

0.0171

f5

0.0350

0.0173

0.0177

0.0733

0.0561

0.0172

f6

0.0205

0.0022

0.0183

0.0244

0.0058

0.0186

Sum

0.2141

0.1114

0.1027

0.3586

0.2547

0.1039

Jo

f4

20

of

Journal Pre-proof

Table 4: The mean integrated squared error (MISE), the integrated squared biases (ISB) and the

p ro

integrated variances (IV) of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6 and the noise-to-signal (NSR) ratio equals 20%, based on 500 Monte Carlo samples.

function

MISE

ISB

f1

0.0597

0.0207

0.0390

0.1124

0.0827

0.0297

f2

0.1503

0.0593

0.0910

0.2180

0.1754

0.0426

f3

0.0979

0.0032

0.0947

0.0603

0.0199

0.0404

f4

0.0966

0.0070

0.0896

0.1001

0.0604

0.0397

f5

0.1034

0.0156

0.0878

0.1599

0.1215

0.0384

f6

0.1012

0.0016

0.0996

0.0532

0.0125

0.0407

Sum

0.6091

0.1074

0.5017

0.7039

0.4724

0.2315

f1

0.0378

0.0198

0.0180

0.0968

0.0834

0.0134

0.0962

0.0537

0.0425

0.1875

0.1669

0.0206

0.0487

0.0036

0.0451

0.0392

0.0188

0.0204

urn

f3

ISB

IV

Naive SBF

MISE

f2 n = 400

Deconvolution SBF

Pr e-

n = 200

Coefficient

al

Sample size

IV

0.0443

0.0088

0.0355

0.0786

0.0604

0.0182

f5

0.0521

0.0139

0.0382

0.1331

0.1154

0.0177

f6

0.0400

0.0020

0.0380

0.0298

0.0113

0.0185

Sum

0.3191

0.1018

0.2173

0.5650

0.4562

0.1088

Jo

f4

21

1.0

0.5 0.6

0.8

−1.0

1.0

0.4

0.6

0.8

0.2

0.4

0.8

0.0

0.2

0.4

0.6

x2

urn

1.0

0.8

1.0

1.0 0.6

0.8

D−SBF N−SBF

0.0

al

0.6

x1

0.6

x3

Var3

0.8 0.6 0.4 0.4

0.0

D−SBF N−SBF

0.2

0.2

0.0

1.0

0.0

0.0

0.2

0.4

Var2

0.6

0.8

D−SBF N−SBF

0.0

0.4

f3

0.2

x2

1.0

1.0

0.2

0.0

x1

True D−SBF N−SBF

0.4

0.4

True D−SBF N−SBF

0.2

0.2

Pr e-

f2

−0.5

0.0

0.6

3.0

f1

2.5 2.0 1.5

True D−SBF N−SBF 0.0

Var1

0.8

3.5

1.0

p ro

of

Journal Pre-proof

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

x3

Figure 1: The average curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators (top), and their variance curves (bottom) when d = 3, n = 800

Jo

and the noise-to-signal (NSR) ratio equals 20%, based on 500 Monte Carlo samples.

22

1.0

0.5 0.6

0.8

−1.0

1.0

f3

0.2

0.4

0.6

0.8

0.2

0.4

0.8

1.0

0.4 −0.2

−1.0

0.0

0.2

True D−SBF N−SBF

0.4

0.6

x5

urn

1.0

al

0.8

0.6

x3

0.2

0.5

f5

0.0 −0.5

0.6

x4

−1.5

0.4

0.0

f6

1.5 1.0

1.2 1.0 0.8 0.6 0.4 0.2 0.0

0.2

0.0

1.0

x2

True D−SBF N−SBF 0.0

0.4 0.2

0.0

x1

True D−SBF N−SBF

0.0

0.4

True D−SBF N−SBF

0.8

1.0

−0.4

0.2

Pr e-

f2

−0.5

0.0

0.6

3.0

f1

2.5 2.0 1.5

True D−SBF N−SBF 0.0

f4

0.8

3.5

1.0

p ro

of

Journal Pre-proof

True D−SBF N−SBF 0.0

0.2

0.4

0.6

0.8

1.0

x6

Figure 2: The average curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6, n = 400 and the noise-to-signal (NSR) ratio equals

Jo

20%, based on 500 Monte Carlo samples.

23

0.6

0.8

1.0

1.0 0.8

Var3

0.2

0.4

0.6

0.8

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

0.6

x5

urn

x4

0.6

0.8

1.0

x3 1.0 0.6

Var6

0.8

D−SBF N−SBF

0.0

al

0.4

0.2

0.4

0.8 0.6 0.4 0.2

0.2

0.0

D−SBF N−SBF

0.0

0.0

0.2

0.4

Var5

0.6

0.8

D−SBF N−SBF

0.0

0.0

1.0

x2

1.0

1.0

0.4

0.0

x1

0.2

0.4

Pr e-

0.2

D−SBF N−SBF

0.6

0.8 0.6 0.2 0.0 0.0

Var4

D−SBF N−SBF

0.4

Var2

0.6 0.0

0.2

0.4

Var1

0.8

D−SBF N−SBF

p ro

1.0

1.0

of

Journal Pre-proof

0.8

1.0

0.0

0.2

0.4

0.6

0.8

1.0

x6

Figure 3: The variance curves of the proposed deconvolution SBF (D-SBF) estimators and the naive SBF (N-SBF) estimators when d = 6, n = 400 and the noise-to-signal (NSR) ratio equals

Jo

20%, based on 500 Monte Carlo samples.


5 Concluding remarks

The methodology we introduce in this paper requires knowledge of the covariance matrix of $V$ and of the densities of the $U_j$, which is often assumed in the literature. In case they are unknown, we may estimate them with repeated measurements $Z_{i\ell}$ and $X_{i\ell}$ for $1 \le \ell \le r_i$ ($r_i \ge 2$) such that $X_{i\ell} = X_i^o + U_{i\ell}$ and $Z_{i\ell} = Z_i^o + V_{i\ell}$. An immediate idea is to use the differences of repeated measurements to cancel the true unobserved predictors. For the estimation of $\Sigma_V$, noting that $Z_{i\ell} - \bar Z_i = V_{i\ell} - \bar V_i$, we may take
$$\hat\Sigma_V = \frac{\sum_{i=1}^n \sum_{\ell=1}^{r_i} (Z_{i\ell} - \bar Z_i)(Z_{i\ell} - \bar Z_i)^\top}{\sum_{i=1}^n (r_i - 1)},$$
where $\bar Z_i = r_i^{-1} \sum_{\ell=1}^{r_i} Z_{i\ell}$. For the characteristic function $\phi_{U_j}$ of the measurement error $U_j$, which we actually need in the construction of the ND kernel $K_{h_j}^\star$ at (2.6), we may estimate it by
$$\hat\phi_{U_j}(t) = \bigg(\frac{\sum_{i=1}^n \sum_{1 \le \ell_1 < \ell_2 \le r_i} 2\cos\big(t(X_{i\ell_1,j} - X_{i\ell_2,j})\big)}{\sum_{i=1}^n r_i(r_i - 1)}\bigg)^{1/2},$$
where $X_{i\ell,j}$ denotes the $j$th entry of $X_{i\ell}$. For more details we refer to Delaigle et al. (2008).

We also assumed that the boundaries of the supports of the $p_j$ are known, as is usually done in the literature for the case of no measurement error. The reason is that, if we observed $X_{ij}^o$, $1 \le i \le n$, for each $1 \le j \le d$, then we could simply choose, as the support of $X_j^o$, the smallest interval $[a_j, b_j]$ that covers all $X_{ij}^o$. However, this is not the case when we observe noisy $X_{ij}$ with unobservable measurement errors $U_{ij}$. In the latter case, one option is to estimate the densities $p_j$ of $X_j^o$ by $\hat p_j^\star(x_j) = n^{-1} \sum_{i=1}^n K_{h_j}^\star(x_j, X_{ij})$ and then approximate the supports of the $p_j$ by those of the $\hat p_j^\star$, respectively.
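If only replicated contaminated measurements are available, the two displays above translate directly into estimators. The following is a minimal sketch under our own naming conventions, assuming each subject $i$ contributes an $(r_i \times d)$ block of replicates; the absolute value inside the square root is only a numerical safeguard.

```python
import numpy as np

def estimate_Sigma_V(Z_rep):
    """Z_rep: list over subjects i of (r_i, d) arrays of replicates Z_{i,l}."""
    num = sum((Z - Z.mean(0)).T @ (Z - Z.mean(0)) for Z in Z_rep)
    den = sum(Z.shape[0] - 1 for Z in Z_rep)
    return num / den

def estimate_phi_U(X_rep_j, t):
    """Empirical characteristic function of U_j from replicates of X_j,
    following the display above (Delaigle et al., 2008).
    X_rep_j: list over subjects i of length-r_i arrays of replicates X_{i,l,j}."""
    t = np.atleast_1d(t)
    num = np.zeros_like(t, dtype=float)
    den = 0.0
    for x in X_rep_j:
        r = x.size
        diffs = np.subtract.outer(x, x)[np.triu_indices(r, k=1)]   # X_{i,l1,j} - X_{i,l2,j}
        num += 2.0 * np.cos(np.outer(t, diffs)).sum(axis=1)
        den += r * (r - 1)
    return np.sqrt(np.abs(num / den))
```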

Jo

the stochastic expansions, we may be able to get the following results: (i) when β < 1/2 and  √ if hj  n−1/(5+2β) is used, then supxj ∈[2hj ,1−2hj ] |fˆj? (xj ) − fj (xj )| = Op n−2/(5+2β) log n and  supxj ∈[0,1] |fˆj? (xj ) − fj (xj )| = Op n−1/(5+2β) ; (ii) when β > 1/2 and if hj  n−1/(4+4β) is used,  √ then supxj ∈[2hj ,1−2hj ] |fˆj? (xj ) − fj (xj )| = Op n−1/(2+2β) log n and supxj ∈[0,1] |fˆj? (xj ) − fj (xj )| =  Op n−1/(4+4β) ; (iii) when β = 1/2 and if hj  n−1/6 is used, then supxj ∈[2hj ,1−2hj ] |fˆj? (xj ) −   fj (xj )| = Op n−1/3 log n and supxj ∈[0,1] |fˆj? (xj ) − fj (xj )| = Op n−1/6 , where β ≥ 0 is the 25

Journal Pre-proof

constant in the assumption (A5). For the above results, we need some further assumptions in addition to (A1)–(A7) in the Appendix. Specifically, those are that E(|e|α ) < ∞ for some α > 5/2,

E(e2 |Xj = ·) are continuous on [0, 1] and that the coefficient functions fj are twice differentiable

of

on [0, 1]. The results demonstrate that the rates do not depend on the dimension d, so that our method does not suffer from ‘the curse of dimensionality’. We recall that the rate n−2/(5+2β) is the optimal univariate rate (in L2 sense), see Fan and Truong (1993) for example. On the

p ro

contrary, full-dimensional smoothing with d-variate deconvolution gives the rate n−2/(4+d(2β+1)) in the interior and n−1/(4+d(2β+1)) on the boundary, see Fan and Masry (1992). We may also extend the main idea of the proposed method to more general varying coefficient models. For example, we may allow some of Xjo to be multivariate, or let some of Zjo enter into the arguments of other coefficient functions fk for k 6= j. We may handle the first case in a

Pr e-

straightforward manner, but the second one with more elaboration. In fact, the latter case was fully addressed in Lee et al. (2012b) in the case of no measurement error. We believe it is possible to extend the approach of Lee et al. (2012b) to the case of measurement errors. We think the latter extension would be a challenging topic for future study.

Acknowledgement

al

Research of Young K. Lee was supported by 2017 Research Grant from Kangwon National University (520170506) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2018R1A2B6001068). Research of Kyunghee Han was

urn

supported by Fostering Next Generation Researchers Program in Basic Science and Engineering through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1A6A3A03031351). Research of Byeong U. Park was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No.

Jo

2019R1A2C3007355).

26

Journal Pre-proof

Appendix A.1

Assumptions

of

Recall the definition of the matrix-valued function M on [0, 1]d in Section 2.1. We make the following assumptions.

p ro

(A1) Each entry of the matrix function E(Zo Zo> |Xo = ·) is continuous on [0, 1]d . The smallest

and largest eigenvalues of E(Zo Zo> |Xo = ·), respectively, are bounded away from zero and

infinity on [0, 1]d .

(A2) The joint density p is continuous and bounded away from zero and infinity on [0, 1]d .

Pr e-

(A3) The baseline kernel K is symmetric, nonnegative, bounded, Lipschitz continuous, supported R1 on a compact set, say [−1, 1], and satisfies −1 K(u) du = 1. (A4) For some ` > 2, the true predictors Zjo and the measurement errors Vj have finite (2`)th moments. The conditional moments E[(Zjo )4 |Xjo = xj ] and E[(Zjo )2 |Xjo = xj ] are uniformly bounded for xj ∈ [0, 1].

(A5) The characteristic functions of Uj satisfy c1 (1 + |t|)−β ≤ |φUj (t)| ≤ c2 (1 + |t|)−β for some

constant β ≥ 0 and 0 < c1 < c2 < ∞. They also satisfy |tβ+1 φ0Uj (t)| = O(1) as |t| → ∞,

al

where φ0Uj denotes the first derivative of φUj .

(A6) The baseline kernel K is bγ + 1c-times continuously differentiable for some γ ≥ β, and

urn

K (j) (−1) = K (j) (1) = 0 for 0 ≤ j ≤ bγc, where β is the constant in (A5), bγc denotes the largest integer that is less than or equal to γ, and K (j) the jth derivative of K. For the R constant γ it also satisfies |tγ φK (t)| dt < ∞.

(A7) There exists a constant α > 0 such that the bandwidths satisfy hj → 0 for all 1 ≤ j ≤ d

Jo

and n1−(`/2)+α (hj hk )−`(1+2β)/2 → 0 as n → ∞ for all 1 ≤ j 6= k ≤ d, where ` is the constant

in the assumption (A4) and β is the constant in (A5).

27

Journal Pre-proof

A.2

Proof of (3.12)

Recall that T = πd⊥ ◦ · · · ◦ π1⊥ as a map from H to itself is contraction, see (3.5). Thus, the claim follows if we prove that

of

kˆ πj − πj kop = op (1).

(A.1)

((ˆ πj − πj )m) (x) = 1j ·

Z d X

? (x , x ) qˆjk j k

1

qˆj? (xj )

0

k=1,6=j

p ro

To prove (A.1), we first note that (ˆ πj − πj )mj = 0 for mj = 1j · mj . This entails that for P m = (m1 , . . . , md )> = dj=1 mj qjk (xj , xk ) − qj (xj )

!

mk (xk ) dxk ,

1 ≤ j ≤ d.

From this expression and using (3.7)–(3.8) with H¨older inequality, we get that there exists a

Pr e-

constant 0 < C < ∞ such that

! Z 1 d ? (·, x )

X qˆjk qjk (·, xk ) k

k(ˆ πj − πj )mk = − m (x ) dx

1j ·

k k k ?

qˆj (·) qj (·) 0 k=1,6=j " # !2 " Z 1 Z 1 d ? (x , x ) X qˆjk qjk (xj , xk ) j k − mk (xk )qk (xk ) dxk ≤ qˆj? (xj )qk (xk ) qj (xj )qk (xk ) 0 0 k=1,6=j #1/2 × qj (xj ) dxj

"

!1/2 #2 qjk (xj , xk ) ≤ k1k · mk k − qj (xj )qk (xk ) dxj dxk ˆj? (xj )qk (xk ) qj (xj )qk (xk ) [0,1]2 q k=1,6=j " #2 !1/2 Z ? (x , x ) qˆjk qjk (xj , xk ) j k ≤ Ckmk∗ − qj (xj )qk (xk ) dxj dxk ˆj? (xj )qk (xk ) qj (xj )qk (xk ) [0,1]2 q ? (x , x ) qˆjk j k

al

Z

urn

d X

for all m ∈ H. Application of (3.7) then gives " #2 !1/2 Z ? (x , x ) qˆjk qjk (xj , xk ) j k − qj (xj )qk (xk ) dxj dxk . kˆ πj − πj kop ≤ c2 C ˆj? (xj )qk (xk ) qj (xj )qk (xk ) [0,1]2 q The right hand side of the above inequality converges to zero in probability due to Lemmas 1–3

A.3

Jo

in the next subsection.

Lemmas and their proofs

For simplicity of presentation, assume hj are of the same magnitude as h for all 1 ≤ j ≤ d. Let Ij = [2hj , 1 − 2hj ] and Ijk = Ij × Ik . 28

Journal Pre-proof

1 ≤ j ≤ d, 1 ≤ j 6= k ≤ d.

of

Lemma 1. Assume that (A2), (A4)–(A7) hold. Then, ! r ? log n ? sup qˆj (u) − E qˆj (u) = Op , nh1+2β u∈[0,1] ! r ? log n ? sup qˆjk (u, v) − E qˆjk , (u, v) = Op nh2+4β (u,v)∈[0,1]2

Proof. We give only the proof of the second part. The proof of the first part is similar. In the

with observing that Kh?j (x, u)

1 = 2π

Z



e−itu

−∞

p ro

proof, C > 0 denotes the generic absolute constant, which is different in different places. We start

φK (−hj t; x)φK (−hj t) dt, φUj (−t)

R 1 it(x−u)/h jK hj (x, u) du and ‘i’ denotes the unit imaginary number. The 0 e Lipschitz continuity of K and the fact that |eitu | ≤ 1 give

Pr e-

where φK (t; x) =

h1+β |Kh?j (x, u)| ≤ Chj (hj + |x − u|)−1 , j −itx 0 0 e φK (−hj t; x) − e−itx φK (−hj t; x0 ) ≤ Ch−2 j |x − x |.

Fix the indices j 6= k and let

εi = Zij Zik − Cov(Vj , Vk ),

al

κi (u, v) = Kh?j (u, Xij )Kh?k (v, Xik ),

(A.2)

ignoring their dependence on j and k. The inequalities at (A.2) imply

Define ξi by

for all δ > 0.

(A.3)

urn

sup{|κi (u, v) − κi (u0 , v 0 )| : |u − u0 | ≤ δ, |v − v 0 | ≤ δ} ≤ C · δ · h−4−2β

ξi (u, v) = κi (u, v)εi I(|εi | ≤ n1/2−α h1+2β ) − E κi (u, v)εi I(|εi | ≤ n1/2−α h1+2β ). Under the moment conditions at (A4), we get that

Jo

  P |εi | ≤ n1/2−α h1+2β for all 1 ≤ i ≤ n → 1

(A.4)

if n1−`/2+`α h−`(1+2β) → 0 as n → ∞. We also find that

n o sup E κi (u, v)εi I(|εi | > n1/2−α h1+2β ) : (u, v) ∈ [0, 1]2 = o(n−1/2 h−1−2β ). 29

(A.5)

Journal Pre-proof

The results (A.4) and (A.5) imply ? qˆjk (u, v)

? E qˆjk (u, v)



=n

n X

−1

ξi (u, v) + op (n−1/2 h−1−2β )

(A.6)

i=1

of

uniformly for (u, v) ∈ [0, 1]2 . For a positive number γ, let D(γ) ≡ Dn (γ) be a discrete subset of

[0, 1]2 such that for any (u, v) ∈ [0, 1]2 there exists (˜ u, v˜) ∈ D(γ) with |u − u ˜| ≤ n−γ and |v − v˜| ≤

obtain sup (u,v)∈[0,1]2

n −1 X ξi (u, v) ≤ n i=1

sup (u,v)∈D(γ)

p ro

n−γ . Then, due to (A.4) and taking γ sufficiently large so that n−γ h−4−2β = o(n−1/2 h−1−2β ), we n −1 X ξi (u, v) + op (n−1/2 h−1−2β ). n

(A.7)

i=1

(A.8)

Pr e-

The results (A.6) and (A.7) give the second part of the lemma if we prove ! r n X log n . sup n−1 ξi (u, v) = Op nh2+4β (u,v)∈D(γ) i=1

To prove (A.8) we note that E ξi (u, v) = 0, |ξi (u, v)| ≤ Cn−1/2−α h−1 and E |ξi (u, v)|2 ≤

Ch−2−4β uniformly for (u, v) ∈ [0, 1]2 . The latter two follow since h2+2β |κi (u, v)| and E(|εi |2 |Xij = u, Xik = v) are bounded, which are consequences of the assumption (A5) and the first inequality at (A.2). These give

! n h2+4β log n X sup P ξi (u, v) > A log n n (u,v)∈[0,1]2 i=1    n p h2+4β log n 2 −α 2β −A ≤n sup 1+ · E |ξ1 (u, v)| · exp Cn h log n 2n (u,v)∈[0,1]2   log n n −A ≤n 1+C ≤ n−A+C . n

urn

al

r

Taking A large enough so that −A + C + 2γ < 0 proves (A.8) and completes the proof of the second part of the lemma. 

Jo

Lemma 2. Assume that (A1)–(A3), (A5) and (A6) hold, and that hj → 0 as n → ∞. Then,

sup

(u,v)∈Ijk

sup E qˆj? (u) − qj (u) = o(1),

u∈Ij

? E qˆ (u, v) − qjk (u, v) = o(1), jk

30

1 ≤ j ≤ d, 1 ≤ j 6= k ≤ d,

Journal Pre-proof

Proof. We prove the second assertion first. Note that E(Zj Zk |Xj , Xk ) − Cov(Vj , Vk ) = E(Zjo Zko |Xj , Xk ) = E(Zjo Zko |Xjo , Xko )

(A.9)

of

Because of the unbiased scoring property (2.7) and from (A.9) it holds that   ? E qˆjk (xj , xk ) = E E(Zjo Zko |Xjo , Xko )Lhj (xj , Xjo )Lhk (xk , Xko ) uniformly for (xj , xk ) ∈ [0, 1]2 , where µj (xj ) =

R1 0

p ro

= qjk (xj , xk )µj (xj )µk (xk ) + o(1)

(A.10)

Lhj (xj , u) du. The second equation of (A.10)

follows from the assumptions (A1) and (A2). If xj ∈ Ij , then µj (xj ) = 1. This gives the second result. The first result also follows from the fact that

uniformly for xj ∈ [0, 1]. 

Pr e-

E qˆj? (xj ) = qj (xj )µj (xj ) + o(1)

(A.11)

Lemma 3. Assume that the conditions of Lemmas 1 and 2 hold. Then, there exist some constants 0 < c1 < C2 < ∞ and 0 < C2 < ∞ such that

c1 < inf qˆj? (u) ≤ sup qˆj? (u) < C1 , u∈[0,1]

? |ˆ qjk (u, v)| < C2 ,

al

sup

(u,v)∈[0,1]2

1 ≤ j ≤ d,

u∈[0,1]

1 ≤ j 6= k ≤ d

urn

with probability tending to one.

Proof. For the proof of the first part, we observe that from Lemma 1 and (A.11) sup |ˆ qj? (xj ) − qj (xj )µj (xj )| = op (1).

(A.12)

xj ∈[0,1]

Jo

From the symmetry of K and

R1

−1 K(u) du

= 1, it follows that

Khj (xj − u) ≤ Khj (xj , u) ≤ 2Khj (xj − u),

1 ≤ 2

Z

0

1

Khj (xj − u) du ≤ 1

(A.13)

for all xj , u ∈ [0, 1]. This proves 1 ≤ µj (xj ) ≤ 2 4

for all xj ∈ [0, 1]. 31

(A.14)

Journal Pre-proof

Let λmin and λmax , respectively, be the lower and upper bounds of the eigenvalues of M(x). Then, Z o 2 o 1> (A.15) λmin ≤ qj (xj ) = E((Zj ) |Xj = xj )pj (xj ) = j M(x)1j dx−j ≤ λmax [0,1]d−1

of

for all xj ∈ [0, 1]. This together with (A.12) and (A.14) gives the first part of the lemma. For the proof of the second part, we note that from Lemma 1 and (A.10)

xj ,xk ∈[0,1]

? |ˆ qjk (xj , xk ) − qjk (xj , xk )µj (xj )µk (xk )| = op (1).

We also get

p ro

sup

qjk (xj , xk ) = E(Zjo Zko |Xjo = xj , Xko = xk )pjk (xj , xk ) =

Z

[0,1]d−1

(A.16)

1> j M(x)1k dx−(j,k) .

> 1/2 · (1> M(x)1 )1/2 , it holds that Since 1> k j M(x)1k ≤ (1j M(x)1j ) k

for all xj , xk ∈ [0, 1].

Pr e-

|qjk (xj , xk )| ≤ qj (xj )1/2 qk (xk )1/2

(A.17)

The result at (A.16) and the upper bounds at (A.14) and (A.15) imply the second part of the lemma. 

References

al

Augustyniak, M. and Doray, L. G. (2012). Inference for a leptokurtic symmetric family of distributions represented by the difference of two gamma variates. Journal of Statistical Computation and Simulation 82, 1621–1634.

urn

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press. Delaigle, A., Hall, P. and Meister, A. (2008). On deconvolution with repeated measurements. Annals of Statistics 36, 665–685.

Fan, J. and Masry, E. (1992). Multivariate regression estimation with errors-in-variables: Asymptotic

Jo

normality for mixing processes. Journal of Multivariate Analysis 43 237–271. Fan, J. and Truong, Y. K. (1993). Nonparametric regression with errors in variables. Annals of Statistics 21, 1900–1925.

Han, K., M¨ uller, H.-G. and Park, B. U. (2018). Smooth backfitting for additive modeling with small errorsin-variables with an application to additive functional regression for multiple predictor functions. Bernoulli 24, 1233–1265.

32

Journal Pre-proof

Han, K. and Park, B. U. (2018). Smooth backfitting for errors-in-variables additive models. Annals of Statistics 46, 2216–2250. Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of Royal Statistical Society,

of

Series B 55, 757–796. Jeon, J. M. and Park, B. U. (2019+). Additive regression with Hilbertian responses. Annals of Statistics, in print.

p ro

Lee, E. R., Han, K. and Park, B. U. (2018). Estimation of errors-in-variables partially linear additive models. Statistica Sinica 28, 2353–2373.

Lee, Y. K., Mammen, E. and Park, B. U. (2012a). Projection-type estimation for varying coefficient regression models. Bernoulli 18, 177–205.

Lee, Y. K., Mammen, E. and Park, B. U. (2012b). Flexible generalized varying coefficient regression

Pr e-

models. Annals of Statistics 40, 1906–1933.

Li, L. and Greene, T. (2008). Varying coefficients model with measurement error. Biometrics 64, 519–526. Liu, Z., Liu, C. and Sun, Z. (2019). Consistent model check of error-in-variables varying coefficient model with auxiliary variables. Journal of Statistical Planning and Inference 198, 13–28. Klar, B. (2015). A note on gamma difference distributions. Journal of Statistical Computation and Simulation 85, 3708–3715.

Kotz, S., Kozubowski, T. and Podgorski, K. (2012). The Laplace distribution and generalizations: a & Business Media.

al

revisit with applications to communications, economics, engineering, and finance. Springer Science

Mammen, E., Linton, O. and Nielsen, J. P. (1999). The existence and asymptotic properties of a backfit-

urn

ting projection algorithm under weak conditions. Annals of Statistics 27, 1443–1490. Park, B. U., Mammen, E., Lee, Y. K. and Lee, E. R. (2015). Varying coefficient regression models: a review and new developments (with discussion). International Statistical Review 83, 36–64. Xu, J. and Zikatanov, L. (2002). The method of alternating projections and the method of subspace corrections in Hilbert space. Journal of American Mathematical Society 15, 573–597.

Jo

Yang, L., Park, B. U., Xue, L. and H¨ardle, W. (2006). Estimation and testing for varying coefficients in additive models with marginal integration. Journal of American Statistical Association 101, 1212–1227.

You, J., Zhou, Y. and Chen, G. (2006). Corrected local polynomial estimation in varying-coefficient models with measurement errors. Canadian Journal of Statistics 34, 391–410.

33

Yu, K., Park, B. U. and Mammen, E. (2008). Smooth backfitting in generalized additive models. Annals of Statistics 36, 228–260.