Nonparametric regression with additional measurement errors in the dependent variable

Michael Kohler
Department of Mathematics, University of Stuttgart, Pfaffenwaldring 57, D-70569 Stuttgart, Germany
E-mail address: [email protected]

Received 7 September 2003; accepted 12 January 2005; available online 22 March 2005

Abstract

Estimation of a regression function from data which consist of an independent and identically distributed sample of the underlying distribution with additional measurement errors in the dependent variable is considered. The measurement errors are allowed to be dependent and to have nonzero mean. It is shown that the rate of convergence of least-squares estimates applied to such data is similar to the rate of convergence of least-squares estimates applied to an independent and identically distributed sample of the underlying distribution, as long as the measurement errors are small. As an application, estimation of conditional variance functions from residuals is considered.

MSC: primary 62G07; secondary 62G20

Keywords: Estimation of conditional variance functions; Least-squares estimates; Measurement error; Rate of convergence; Regression estimates; L2 error

1. Introduction

Let (X, Y), (X_1, Y_1), (X_2, Y_2), ... be independent identically distributed (i.i.d.) R^d × R-valued random vectors with E Y² < ∞. In regression analysis we want to estimate Y after having observed X, i.e., we want to determine a function f with f(X) "close" to Y. If "closeness" is measured by the mean squared error, then one wants to find a function f*


such that

$$ E\{|f^*(X) - Y|^2\} = \min_f E\{|f(X) - Y|^2\}. \qquad (1) $$

Let m(x) := E{Y | X = x} be the regression function and denote the distribution of X by μ. The well-known relation

$$ E\{|f(X) - Y|^2\} = E\{|m(X) - Y|^2\} + \int |f(x) - m(x)|^2 \,\mu(dx), \qquad (2) $$

which holds for each measurable function f, implies that m is the solution of the minimization problem (1), that E{|m(X) − Y|²} is the minimum in (1), and that for an arbitrary f the L2 error ∫ |f(x) − m(x)|² μ(dx) is the difference between E{|f(X) − Y|²} and E{|m(X) − Y|²}. In the regression estimation problem the distribution of (X, Y) (and consequently m) is unknown. Given a sequence D_n = {(X_1, Y_1), ..., (X_n, Y_n)} of independent observations of (X, Y), the goal is to construct an estimate m_n(x) = m_n(x, D_n) of m(x) such that the L2 error ∫ |m_n(x) − m(x)|² μ(dx) is small. For a general introduction to regression estimation see, e.g., Györfi et al. (2002). A famous principle for constructing regression estimates is the principle of least squares. The basic idea is to estimate the L2 risk E{|f(X) − Y|²} of a function f by the so-called empirical L2 risk

$$ \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - Y_i|^2 $$

and to choose as an estimate of the regression function a function which minimizes the empirical L2 risk over some given set F_n of functions f : R^d → R. More precisely, one defines the estimate by

$$ m_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - Y_i|^2. \qquad (3) $$

For notational simplicity we assume here and in the sequel that the minimum exists, but we do not require that it is unique. In case the minimum does not exist, the results of Theorem 1 and Corollaries 1–3 also hold for the estimate m_n defined by m_n(·) ∈ F_n and

$$ \frac{1}{n}\sum_{i=1}^{n} |m_n(X_i) - Y_i|^2 \le \inf_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - Y_i|^2 + \frac{1}{n}. $$
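To make the definition (3) concrete, the following minimal sketch (Python with NumPy; the function names, the polynomial class and the noise model are our own illustration, not part of the paper) computes a least-squares estimate over a finite-dimensional linear class, here ordinary polynomials of a fixed degree, by solving the corresponding least-squares problem.

```python
import numpy as np

def least_squares_estimate(x, y, degree=3):
    """Empirical L2 risk minimizer over the linear class of polynomials
    of the given degree, fitted by ordinary least squares."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    B = np.vander(x, degree + 1, increasing=True)
    # Coefficients minimizing (1/n) * sum_i |f(x_i) - y_i|^2.
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return lambda t: np.vander(np.atleast_1d(t), degree + 1, increasing=True) @ coef

# Illustration: noisy observations of a smooth regression function.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)   # Y = m(X) + noise
m_n = least_squares_estimate(x, y, degree=5)
print(float(np.mean((m_n(x) - y) ** 2)))                      # empirical L2 risk of the fit
```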

Consistency of least-squares regression estimates was studied, e.g., in Geman and Hwang (1982), Lugosi and Zeger (1995), van de Geer and Wegkamp (1996) and Kohler (1999). Concerning an analysis of the rate of convergence see, e.g., Cox (1984), van de Geer (1987, 1990, 2000), Birgé and Massart (1993, 1998), Shen and Wong (1994), Kohler (2000) and the literature cited therein. The behaviour of the error of the least-squares estimate (3) is well-known (see, e.g., Chapter 11 in Györfi et al., 2002). Its L2 error can be decomposed into two parts: The first


part is the approximation error

$$ \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx), $$

which comes from the fact that the estimate is contained in F_n and therefore cannot approximate the regression function better than the best function in F_n. The second part is the estimation error, which arises from the fact that one minimizes the empirical L2 risk instead of the L2 risk. The size of the estimation error depends on the "complexity" of the set F_n and is often measured by the size of suitable covering numbers. If F_n is a subset of a linear vector space of dimension K_n, then the estimation error is bounded by

$$ \mathrm{const} \cdot \frac{K_n}{n}, $$

i.e., up to a constant (which depends, e.g., on the conditional variance of Y given X) it is bounded by the number of parameters to be estimated divided by n.

Sometimes it is possible to observe data from the underlying distribution only with measurement errors. In this context one often considers the problem that the predictor X can be observed only with errors, i.e., instead of X_i one observes W_i = X_i + U_i for some random variable U_i which satisfies E{U_i | X_i} = 0, and the problem is to estimate the regression function from {(W_1, Y_1), ..., (W_n, Y_n)} (see, e.g., Carroll et al. (1999) and the references there). In this paper we consider the problem of estimating a regression function in case the dependent variable Y (also called the response variable) can be observed only with additional, possibly correlated, measurement errors. Since we do not assume that the means of these measurement errors are zero, errors of this kind are not already covered by standard regression models. More precisely, we assume that we are given data

$$ \bar{D}_n = \{(X_1, \bar{Y}_{1,n}), \ldots, (X_n, \bar{Y}_{n,n})\}, $$

where the only assumption on the random variables Ȳ_{1,n}, ..., Ȳ_{n,n} is that the average squared measurement error

$$ \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_{i,n}|^2 \qquad (4) $$

is small. In particular, D̄_n does not need to be independent or identically distributed, and E{Ȳ_{i,n} | X = x} does not need to be equal to m(x) = E{Y | X = x}. For notational simplicity we will suppress in the sequel the possible dependency of Ȳ_i = Ȳ_{i,n} on the sample size n in our notation. Since we assume that (4) is small, it is still reasonable to estimate the L2 risk by the empirical L2 risk

$$ \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - \bar{Y}_i|^2 $$


computed with the aid of the data with measurement errors, and to define least-squares estimates as if no measurement error is present by

$$ \bar{m}_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - \bar{Y}_i|^2 \qquad (5) $$

for some set F_n of functions f : R^d → R.

It is not clear how the L2 error of an arbitrary regression estimate is influenced by additional measurement errors. Since we assume nothing about the nature of these errors, there is no chance to get rid of them, so they will necessarily increase the L2 error of the estimate. Intuitively, one can expect that measurement errors do not influence the error of the estimate much as long as they are small. In this paper we prove that this is indeed true for least-squares estimates. More precisely, our main result is that the L2 error of least-squares estimates applied to data with additional measurement errors in the dependent variable can be bounded by a constant times the sum of three terms, where the first two terms are upper bounds on the approximation and the estimation error as described above, and the third term is the average squared measurement error. Application of this result to the estimation of conditional variance functions yields that the rate of convergence of a least-squares estimate of a conditional variance function is of the same order as the rate of convergence of the original least-squares regression estimate, as long as the conditional variance function is at least as smooth as the underlying regression function.

The main result is formulated in Section 2. Applications of it to estimation of conditional variance functions and to regression estimation from censored data are described in Section 3. Section 4 contains the proofs.

2. Main results

In order to be able to formulate our main result we need the following notation: let x_1, ..., x_n ∈ R^d and set x_1^n = (x_1, ..., x_n). Define the distance d_2(f, g) between f, g : R^d → R by

$$ d_2(f, g) = \left( \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - g(x_i)|^2 \right)^{1/2}. $$

An ε-cover of F (w.r.t. the distance d_2) is a set of functions f_1, ..., f_k : R^d → R with the property

$$ \min_{1 \le j \le k} d_2(f, f_j) < \varepsilon \quad \text{for all } f \in \mathcal{F}. $$

Let N_2(ε, F, x_1^n) denote the size k of the smallest ε-cover of F w.r.t. the distance d_2, and set N_2(ε, F, x_1^n) = ∞ if there does not exist any ε-cover of F of finite size. N_2(ε, F, x_1^n) is called the L2-ε-covering number of F on x_1^n.
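As a toy illustration of these quantities (Python/NumPy; the construction and all names are our own, not from the paper), the following sketch evaluates the empirical distance d_2 and builds a greedy ε-cover of a finite class of functions; the size of such a greedy cover is an upper bound on the covering number N_2(ε, F, x_1^n).

```python
import numpy as np

def d2(f, g, x):
    """Empirical L2 distance between f and g on the points x_1, ..., x_n."""
    return np.sqrt(np.mean((f(x) - g(x)) ** 2))

def greedy_cover_size(functions, x, eps):
    """Size of a greedily constructed eps-cover of a finite class;
    an upper bound on the covering number N_2(eps, F, x_1^n)."""
    centers = []
    for f in functions:
        if all(d2(f, c, x) >= eps for c in centers):
            centers.append(f)          # f is not yet covered, keep it as a center
    return len(centers)

# Toy class: scaled sine functions, evaluated on 100 design points.
x = np.linspace(0.0, 1.0, 100)
F = [lambda t, a=a: a * np.sin(2 * np.pi * t) for a in np.linspace(-1, 1, 201)]
print(greedy_cover_size(F, x, eps=0.05))
```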


In order to avoid measurability problems in the case of uncountable collections of functions, we assume throughout this paper that the class F of functions is permissible in the sense of Pollard (1984), Appendix C. This mild measurability condition is satisfied for most classes of functions used in applications, including the class of piecewise polynomials considered in this paper.

Theorem 1. Assume that Y − m(X) is sub-Gaussian in the sense that

$$ K^2\, E\left\{ e^{(Y - m(X))^2/K^2} - 1 \,\Big|\, X \right\} \le \sigma_0^2 \quad \text{almost surely} \qquad (6) $$

for some K, σ_0 > 0. Let L ≥ 1 and assume that the regression function is bounded in absolute value by L. Let F_n be a set of functions f : R^d → [−L, L] and define the estimate m̄_n by (5). Then there exist constants c_1, c_2, c_3 > 0 depending only on L, σ_0 and K such that for any δ_n which satisfies

δ_n → 0 (n → ∞) and n·δ_n → ∞ (n → ∞)

and

$$ \sqrt{n}\,\delta\, c_1 \ge \int_{c_2 \delta}^{\sqrt{\delta}} \left( \log N_2\!\left( \frac{u}{4L},\ \left\{ f - g : f \in \mathcal{F}_n,\ \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - g(x_i)|^2 \le \delta \right\},\ x_1^n \right) \right)^{1/2} du $$

for all δ ≥ δ_n, all x_1, ..., x_n ∈ R^d and all g ∈ F_n ∪ {m}, we have

$$ P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > c_3 \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \delta_n + \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \right) \right\} \to 0 $$

for n → ∞.

In the case of linear vector spaces, the condition on the covering number above can be simplified, as can be seen from the following corollary.

Corollary 1. Assume that Y − m(X) is sub-Gaussian, i.e., assume that (6) holds for some K, σ_0 > 0. Let L ≥ 1 and assume that the regression function is bounded in absolute value by L. Let F_n be a set of functions f : R^d → [−L, L] and assume that F_n is a subset of a linear vector space of dimension D_n ∈ N, where D_n → ∞ (n → ∞). Define the estimate


m̄_n by (5). Then there exists a constant c_4 > 0 depending only on L, σ_0 and K such that

$$ P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > c_4 \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \frac{D_n}{n} + \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \right) \right\} \to 0 $$

for n → ∞.

Proof. By Corollary 2.6 in van de Geer (2000) we have that if F is a linear vector space of dimension D, then

$$ N_2\!\left( v,\ \left\{ f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^{n} |f(z_i)|^2 \le R^2 \right\},\ z_1^n \right) \le \left( \frac{4R + v}{v} \right)^{D}. \qquad (7) $$

Since F_n is a subset of a linear vector space of dimension D_n, {f − g : f ∈ F_n} ⊆ {f + α·g : f ∈ F_n, α ∈ R} is a subset of a linear vector space of dimension D_n + 1, and we get

$$ N_2\!\left( \frac{u}{4L},\ \left\{ f - g : f \in \mathcal{F}_n,\ \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - g(x_i)|^2 \le \delta \right\},\ x_1^n \right) \le \left( \frac{4\sqrt{\delta} + u/(4L)}{u/(4L)} \right)^{D_n + 1}. $$

An easy computation (cf. Example 9.3.1 in van de Geer (2000) or the proof of Lemma 19.1 in Györfi et al., 2002) shows that in this case the bound on the covering number in Theorem 1 is implied by

$$ \delta_n \ge c_5\, \frac{D_n + 1}{n}. $$

Setting δ_n = c_5 (D_n + 1)/n, the assertion follows from Theorem 1. □

In order to illustrate Corollary 1 we apply it to piecewise polynomial partitioning estimates and to regression functions which are (p, C)-smooth according to the following definition.

Definition 1. Let C > 0 and p = k + β for some k ∈ N_0 and β ∈ (0, 1]. A function m : [0, 1] → R is called (p, C)-smooth if the kth derivative m^(k) of m exists and satisfies

$$ |m^{(k)}(x) - m^{(k)}(z)| \le C \cdot |x - z|^{\beta} \qquad (x, z \in [0, 1]). $$

Corollary 2. Let L ≥ 1, C > 0 and p = k + β for some k ∈ N_0 and β ∈ (0, 1]. Assume that X ∈ [0, 1] almost surely, that Y − m(X) is sub-Gaussian in the sense that (6) holds for some K, σ_0 > 0, that

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 > c_6\, C^{2/(2p+1)} n^{-2p/(2p+1)} \right\} \to 0 \quad (n \to \infty) \qquad (8) $$

for some constant c_6, that m is bounded in absolute value by L, and that m is (p, C)-smooth. Let F_n be the set of all piecewise polynomials of degree M ≥ k with respect to an equidistant partition of [0, 1] into K_n = ⌈C^{2/(2p+1)} n^{1/(2p+1)}⌉ intervals, which are bounded in absolute value by L + 1. Define the estimate m̄_n by (5). Then there exists a constant c_7 > 0 depending only on c_6, L, σ_0 and K such that

$$ P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > c_7\, C^{2/(2p+1)} n^{-2p/(2p+1)} \right\} \to 0 \quad (n \to \infty). $$

Proof. F_n is a subset of a linear vector space of dimension D_n = (M + 1) · K_n, and

$$ \frac{D_n}{n} \le c_8\, C^{2/(2p+1)} n^{-2p/(2p+1)}. $$

Furthermore, one can conclude from the (p, C)-smoothness of m and the definition of F_n that

$$ \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \le c_9\, C^{2/(2p+1)} n^{-2p/(2p+1)} $$

for n sufficiently large (cf. the proof of Corollary 19.1 in Györfi et al., 2002). This together with Corollary 1 (where we replace L by L + 1) implies the assertion. □

Remark 1. Condition (6) on (X, Y) is satisfied, for example, if Y − m(X) is bounded in absolute value by some constant σ > 0, or if Y − m(X) is normally distributed with mean zero and variance σ² and independent of X (take K = σ and σ_0² = (e − 1)·σ², or K = 2σ and σ_0² = 3σ², respectively).

Remark 2. It is well known that the rate of convergence in Corollary 2 is optimal if no measurement errors occur (i.e., if Ȳ_i = Y_i for i = 1, ..., n); see Stone (1982) or Chapter 3 in Györfi et al. (2002). Usually in the literature this rate of convergence is proven under additional assumptions on the distribution of X, in particular under the assumption that X has a density with respect to the Lebesgue–Borel measure. We would like to stress that there is no such condition in Theorem 1, Corollary 1 or Corollary 2.

Remark 3. For nonparametric regression without measurement errors and bounded X and Y it was shown in Kohler (2000) that there exist estimates which achieve the optimal rate of convergence for (p, C)-smooth regression functions even if X has no density with respect to the Lebesgue–Borel measure. Corollary 2 shows that this is also true if Y is unbounded but (X, Y) satisfies the sub-Gaussian condition (6).
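For concreteness, the estimate of Corollary 2 can be sketched as follows (Python/NumPy; the helper name, the default parameters and the choice p = 2, C = 1 are our own assumptions for illustration): on each cell of an equidistant partition of [0, 1] into K_n intervals a polynomial of degree M is fitted by least squares, and the resulting function is truncated at ±(L + 1).

```python
import numpy as np

def piecewise_poly_estimate(x, y, K_n, M=1, L=1.0):
    """Least-squares piecewise polynomial of degree M on an equidistant
    partition of [0, 1] into K_n intervals, truncated at +/-(L + 1)."""
    edges = np.linspace(0.0, 1.0, K_n + 1)
    coefs = []
    for j in range(K_n):
        mask = (x >= edges[j]) & (x < edges[j + 1]) if j < K_n - 1 else (x >= edges[j])
        if mask.sum() > M:                        # enough points for a degree-M fit
            coefs.append(np.polyfit(x[mask], y[mask], M))
        else:
            coefs.append(np.zeros(M + 1))         # default on empty or sparse cells
    def m_bar(t):
        t = np.atleast_1d(t)
        j = np.clip((t * K_n).astype(int), 0, K_n - 1)
        vals = np.array([np.polyval(coefs[k], ti) for k, ti in zip(j, t)])
        return np.clip(vals, -(L + 1.0), L + 1.0)
    return m_bar

# Smoothness-driven choice K_n ~ C^{2/(2p+1)} n^{1/(2p+1)} as in Corollary 2
# (p = 2 and C = 1 are assumed here purely for illustration).
n, p, C = 1000, 2.0, 1.0
K_n = max(1, int(np.ceil(C ** (2 / (2 * p + 1)) * n ** (1 / (2 * p + 1)))))
```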


Remark 4. In the definition of the estimate in Corollary 2 the parameters M and K_n are chosen according to the smoothness of the regression function. Clearly, this is not possible in an application. Instead one could use some automatic choice of the parameters, e.g., by splitting of the sample or by cross-validation. Analysis of such methods under the assumptions made in this paper should be possible, but is beyond the scope of this paper.

Remark 5. Instead of using a least-squares estimator, one could also consider so-called local averaging estimators of the form

$$ m_n(x) = \sum_{i=1}^{n} W_{ni}(x) \cdot \bar{Y}_i, $$

where W_ni(x) = W_ni(x, X_1, ..., X_n) ≥ 0 are weights satisfying Σ_{i=1}^n W_ni(x) ≤ 1 (x ∈ R^d). Examples of estimates of this form are kernel, nearest neighbour and partitioning estimates (cf., e.g., Chapters 4–6 in Györfi et al., 2002). The partitioning estimate is a special case of a least-squares estimate, so the above results imply corresponding results for partitioning estimates. But as was pointed out by a referee, such results for local averaging estimates can in fact be shown easily and directly. To do this, we use

$$ E \int \left( \sum_{i=1}^{n} W_{ni}(x)\bar{Y}_i - m(x) \right)^2 \mu(dx) \le 2\, E \int \left( \sum_{i=1}^{n} W_{ni}(x) Y_i - m(x) \right)^2 \mu(dx) + 2\, E \int \left( \sum_{i=1}^{n} W_{ni}(x)(\bar{Y}_i - Y_i) \right)^2 \mu(dx). $$

The first term on the right-hand side is (up to the factor 2) the expected L2 error of the local averaging estimate applied to data without additional measurement errors and can be bounded as in Chapters 4–6 in Györfi et al. (2002). To bound the second term, one can apply the Cauchy–Schwarz inequality, which yields

$$ E \int \left( \sum_{i=1}^{n} W_{ni}(x)(\bar{Y}_i - Y_i) \right)^2 \mu(dx) \le E \int \sum_{i=1}^{n} W_{ni}(x)(\bar{Y}_i - Y_i)^2 \,\mu(dx) \le E\left\{ \frac{1}{n}\sum_{i=1}^{n} (\bar{Y}_i - Y_i)^2 \cdot \max_{i=1,\ldots,n} \int n\, W_{ni}(x)\,\mu(dx) \right\}. $$

Finally, one can use that for partitioning and for nearest neighbour regression estimates, under some mild conditions, max_{i=1,...,n} ∫ n·W_ni(x) μ(dx) is bounded almost surely (cf. Györfi et al., 2002, pp. 469 and 491). It should be noted that this rather obvious result for local averaging estimates is much weaker than Theorem 1 above, since for very smooth regression functions (e.g., twice


continuously differentiable functions) the partitioning estimate does not achieve the optimal rate of convergence (cf. Chapter 4 in Györfi et al., 2002).
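As a concrete instance of such a local averaging estimator, the following sketch (Python/NumPy; the particular partitioning weights and all names are our own choices, not code from the paper) applies the partitioning estimate with weights W_ni(x) = I{X_i in the cell of x}/#{j : X_j in the cell of x} directly to the contaminated responses Ȳ_i.

```python
import numpy as np

def partitioning_estimate(x_train, y_bar, K=20):
    """Local averaging estimate sum_i W_ni(x) * Ybar_i with partitioning
    weights on an equidistant partition of [0, 1] into K cells."""
    cells = np.clip((x_train * K).astype(int), 0, K - 1)
    sums = np.bincount(cells, weights=y_bar, minlength=K)
    counts = np.bincount(cells, minlength=K)
    means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)  # 0 on empty cells
    def m_n(t):
        j = np.clip((np.atleast_1d(t) * K).astype(int), 0, K - 1)
        return means[j]
    return m_n
```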

3. Applications

3.1. Estimation of conditional variance functions

In our first application we consider estimation of the conditional variance of Y given X = x in nonparametric regression, i.e., we consider estimation of the conditional variance function

$$ \sigma^2(x) = E\{|Y - m(X)|^2 \mid X = x\} = E\{Y^2 \mid X = x\} - m(x)^2. $$

Estimates of σ²(x) can be used, e.g., to construct confidence intervals for Y given the value of X. Further applications of variance estimators can be found in Section 4 of Müller and Stadtmüller (1987). In order to construct an estimate σ²_n(x) of σ²(x), we consider Z = Y² − m(X)² and Z_i = Y_i² − m(X_i)² (i = 1, ..., n). σ²(x) is the regression function corresponding to (X, Z), and if we set Z̄_i = Y_i² − m_n(X_i)² (i = 1, ..., n), then

$$ \bar{D}_n = \{(X_1, \bar{Z}_1), \ldots, (X_n, \bar{Z}_n)\} $$

can be considered as a sample of (X, Z) with additional measurement errors in the dependent variable. In the sequel we use Corollary 2 to derive the rate of convergence of a least-squares estimate of a conditional variance function. For other nonparametric estimates of conditional variance functions see, e.g., Müller and Stadtmüller (1987), Neumann (1994), Stadtmüller and Tsybakov (1995) and the literature cited therein.

Let L, C > 0 and p = k + β for some k ∈ N_0 and β ∈ (0, 1]. For simplicity we assume in the sequel that Y is bounded in absolute value by L (which implies that (X, Y) and (X, Y² − m(X)²) satisfy the sub-Gaussian condition (6)), that m(x) = E{Y | X = x} and σ²(x) = E{Y² | X = x} − m(x)² are both (p, C)-smooth, and that X ∈ [0, 1] almost surely. Let m_n be the piecewise polynomial partitioning estimate of Corollary 2 applied to an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n) of (X, Y). In order to estimate σ² we apply a least-squares estimate to the data (X_1, Z̄_1), ..., (X_n, Z̄_n), where Z̄_i = Z̄_{i,n} = Y_i² − m_n(X_i)² (i = 1, ..., n).

More precisely, we set

$$ \sigma_n^2(\cdot) = \arg\min_{g \in \mathcal{G}_n} \frac{1}{n}\sum_{i=1}^{n} |g(X_i) - \bar{Z}_i|^2, $$

where G_n consists of all piecewise polynomials of degree M ≥ k with respect to an equidistant partition of [0, 1] into K_n = ⌈C^{2/(2p+1)} n^{1/(2p+1)}⌉ equidistant intervals, which are bounded in absolute value by L² + 1 (observe that Y² − m(X)² is bounded in absolute value by L² almost surely, since |Y| ≤ L almost surely).

Corollary 3. Under the above assumptions the estimate σ²_n of σ² satisfies

$$ P\left\{ \int |\sigma_n^2(x) - \sigma^2(x)|^2 \,\mu(dx) > c_{10}\, C^{2/(2p+1)} n^{-2p/(2p+1)} \right\} \to 0 \quad (n \to \infty) $$

for some constant c_10 which depends only on L.

Proof. Application of Lemma 4 below to m_n implies

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |m_n(X_i) - m(X_i)|^2 > c_{11}\, C^{2/(2p+1)} n^{-2p/(2p+1)} \right\} \to 0 \quad (n \to \infty). $$

From this together with

$$ \frac{1}{n}\sum_{i=1}^{n} |Z_i - \bar{Z}_i|^2 = \frac{1}{n}\sum_{i=1}^{n} |m_n(X_i)^2 - m(X_i)^2|^2 \le 4L^2\, \frac{1}{n}\sum_{i=1}^{n} |m_n(X_i) - m(X_i)|^2, $$

we conclude that Z_i = Y_i² − m(X_i)² and Z̄_i = Y_i² − m_n(X_i)² satisfy

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |Z_i - \bar{Z}_i|^2 > c_{12}\, C^{2/(2p+1)} n^{-2p/(2p+1)} \right\} \to 0 \quad (n \to \infty). $$

Application of Corollary 2 yields the desired result. □

Remark 6. The condition on the boundedness of Y can be relaxed. It suffices to assume that m and σ² are both bounded and that (X, Y) and (X, Y² − m(X)²) satisfy the sub-Gaussian condition (6).

Remark 7. In Corollary 3 it is required that m is at least as smooth as σ². This is in contrast to results from the literature, where it is shown that under regularity assumptions on the distribution of X only little smoothness of m is sufficient to obtain the optimal rate of convergence for smooth σ² (cf., e.g., Theorem 3.1 in Müller and Stadtmüller, 1987). It is an open problem whether this also holds without regularity assumptions on X.

Remark 8. In applications one has to use some data-dependent method to choose the parameters M and K_n, cf. Remark 4.

Remark 9. We would like to stress that, in contrast to the results usually proven in the literature, the bound of Corollary 3 is proven without the assumption that X has a density with respect to the Lebesgue–Borel measure.
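The two-stage construction of σ²_n described above can be sketched as follows (Python/NumPy, reusing the hypothetical piecewise_poly_estimate helper from the sketch after Corollary 2; the parameter choices are our own): a first-stage regression fit m_n produces the pseudo-responses Z̄_i = Y_i² − m_n(X_i)², to which a second least-squares fit is applied.

```python
import numpy as np
# assumes piecewise_poly_estimate from the sketch after Corollary 2 is in scope

def conditional_variance_estimate(x, y, K_n, M=1, L=1.0):
    """Two-stage least-squares estimate of sigma^2(x) = E{Y^2|X=x} - m(x)^2
    from residual-type pseudo-responses Zbar_i = Y_i^2 - m_n(X_i)^2."""
    m_n = piecewise_poly_estimate(x, y, K_n, M=M, L=L)        # first-stage regression fit
    z_bar = y ** 2 - m_n(x) ** 2                              # data with measurement errors in Z
    # Second-stage least squares on (X_i, Zbar_i); truncation level L^2 + 1.
    sigma2_n = piecewise_poly_estimate(x, z_bar, K_n, M=M, L=L ** 2)
    return sigma2_n
```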


3.2. Regression estimation from censored data

In our second application we consider estimation of a regression function from censored data. In many medical studies it is not possible to observe a sample of (X, Y). There Y is the survival time of a patient, and often some of the patients are still alive at the termination of the study, withdraw alive during the study, or die from causes other than those under study. Hence, instead of an observation of the survival time of a patient one observes only the minimum of the survival time and a censoring time. To be more precise, let (X, Y, C), (X_1, Y_1, C_1), (X_2, Y_2, C_2), ... be independent identically distributed R^d × R_+ × R_+-valued random variables. Set Z = min{Y, C}, δ = I_{{Y ≤ C}} and G(t|x) = P{C > t | X = x}. Then

$$ E\left\{ \frac{Z \cdot \delta}{G(Z|X)} \,\Big|\, X \right\} = E\left\{ \frac{Y \cdot I_{\{Y \le C\}}}{G(Y|X)} \,\Big|\, X \right\} = E\left\{ \frac{Y \cdot E\{ I_{\{Y \le C\}} \mid X, Y \}}{G(Y|X)} \,\Big|\, X \right\} = E\left\{ \frac{Y \cdot G(Y|X)}{G(Y|X)} \,\Big|\, X \right\} = E\{Y \mid X\}, $$

hence m(x) is the regression function corresponding to (X, Z·δ/G(Z|X)). With an appropriate estimate G_n(t|x) of G(t|x) (for an example see Beran (1981) or Section 26.3 in Györfi et al., 2002) we can estimate m by applying a nonparametric regression estimate to the data

$$ \left( X_1, \frac{Z_1 \cdot \delta_1}{G_n(Z_1|X_1)} \right), \ldots, \left( X_n, \frac{Z_n \cdot \delta_n}{G_n(Z_n|X_n)} \right), $$

which can be considered as a sample of (X, Z·δ/G(Z|X)) with additional measurement errors in the dependent variable. If Y is bounded in absolute value by L, then the average squared measurement error is bounded by

$$ \frac{1}{n}\sum_{i=1}^{n} \left| \frac{Z_i \cdot \delta_i}{G(Z_i|X_i)} - \frac{Z_i \cdot \delta_i}{G_n(Z_i|X_i)} \right|^2 \le L^2 \cdot \frac{1}{n}\sum_{i=1}^{n} \frac{|G(Z_i|X_i) - G_n(Z_i|X_i)|^2}{G(Z_i|X_i)^2 \cdot G_n(Z_i|X_i)^2}. \qquad (9) $$

Consistency of kernel estimates applied to such data was shown in Pintér (2001). Under the additional assumption that (X, Y ) and C are independent, consistency of various kernel, partitioning and nearest neighbour estimates applied to such data was studied in Carbonez et al. (1995) and Kohler et al. (2002). Theorem 1 or Corollary 1 can be used to analyze the rate of convergence of such estimates. This requires the determination of the rate of convergence of the right-hand side of (9) to zero. For details we refer to Kohler et al. (2003), where the results of this paper are used to investigate the rate of convergence of regression estimates applied to censored data under the additional assumption that (X, Y ) and C are independent.
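To illustrate the transformation of censored data into data with measurement errors in the dependent variable, the following sketch (Python/NumPy) plugs in a plain Kaplan–Meier estimate of the censoring survival function that ignores the covariate; this is a simplifying assumption of ours made only for brevity, whereas the text allows a conditional estimate G_n(t|x) such as Beran's estimate.

```python
import numpy as np

def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating the censored
    observations (delta == 0) as the events; covariates are ignored here."""
    order = np.argsort(z)
    z_sorted, d_sorted = z[order], delta[order]
    n = len(z)
    at_risk = n - np.arange(n)                       # number at risk just before each z_(i)
    factors = np.where(d_sorted == 0, 1.0 - 1.0 / at_risk, 1.0)
    surv = np.cumprod(factors)                       # G_n evaluated at the ordered z's
    def G_n(t):
        idx = np.searchsorted(z_sorted, np.atleast_1d(t), side="right") - 1
        return np.where(idx >= 0, surv[np.clip(idx, 0, n - 1)], 1.0)
    return G_n

def censored_pseudo_responses(z, delta, G_n, floor=1e-3):
    """Pseudo-responses Z_i * delta_i / G_n(Z_i); a regression estimate is
    then applied to the pairs (X_i, pseudo-response)."""
    return z * delta / np.maximum(G_n(z), floor)     # floor avoids division by ~0
```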


4. Proofs

4.1. A deterministic lemma

In this subsection we formulate and prove Lemma 1, which will enable us to bound the empirical L2 error of the estimate by an analysis of increments of a stochastic process.

Lemma 1. Let t > 0, x_1, ..., x_n ∈ R^d and y_1, ȳ_1, ..., y_n, ȳ_n ∈ R. Let m : R^d → R be a function and let F be a set of functions f : R^d → R. Set

$$ \bar{m}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - \bar{y}_i|^2 $$

and

$$ m_n^* = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 $$

and assume that both minima exist. Then

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > t + 512\, \frac{1}{n}\sum_{i=1}^{n} |y_i - \bar{y}_i|^2 + 18 \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \qquad (10) $$

implies

$$ \frac{t}{2} < \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{16}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (y_i - m(x_i)). \qquad (11) $$

Proof. Assume that (10) holds. Then

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 \le \frac{2}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 + \frac{2}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2 $$

implies

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 > \frac{t}{2} + 256\, \frac{1}{n}\sum_{i=1}^{n} |y_i - \bar{y}_i|^2 + \frac{8}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2. \qquad (12) $$


In addition,

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le 2\left( \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - \bar{y}_i|^2 - \frac{1}{n}\sum_{i=1}^{n} |m(x_i) - \bar{y}_i|^2 - \frac{2}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m(x_i)) \cdot (m(x_i) - \bar{y}_i) \right) + \frac{2}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2 \le \frac{4}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2 + \frac{4}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - m(x_i)). \qquad (13) $$

Furthermore, it is easy to see that (12) and (13) imply

$$ \frac{1}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2 \le \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - m(x_i)). \qquad (14) $$

In the sequel we show that (12) and (14) imply (11). To do this, we conclude from (13) and (14)

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{8}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - m(x_i)), $$

which together with (12) yields

$$ \frac{t}{2} + 256\, \frac{1}{n}\sum_{i=1}^{n} |y_i - \bar{y}_i|^2 < \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{8}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - m(x_i)). \qquad (15) $$

If

$$ \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - y_i) \le \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (y_i - m(x_i)) \qquad (16) $$

holds, then (15) implies (11). We finish the proof by showing that we get a contradiction if (16) does not hold. So assume that (16) does not hold. Then (15) together with the


Cauchy–Schwarz inequality implies

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{8}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (\bar{y}_i - y_i) + \frac{8}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot (y_i - m(x_i)) \le 16\, \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i))^2 } \cdot \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (\bar{y}_i - y_i)^2 }, $$

which in turn implies

$$ \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i))^2 } \le 16\, \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (\bar{y}_i - y_i)^2 }. $$

From this together with (12) we get a contradiction. □

Remark 10. It follows from the proof of Lemma 1 that it still holds if

$$ \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - \bar{y}_i|^2 \quad \text{or} \quad \min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 $$

do not exist, provided one chooses m̄_n, m*_n ∈ F such that

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - \bar{y}_i|^2 \le \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - \bar{y}_i|^2 + \frac{1}{n} $$

and

$$ \frac{1}{n}\sum_{i=1}^{n} |m_n^*(x_i) - m(x_i)|^2 \le \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 + \frac{1}{n}, $$

and one replaces (10) by

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > t + \frac{24}{n} + 2048\, \frac{1}{n}\sum_{i=1}^{n} |y_i - \bar{y}_i|^2 + 18 \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2. $$

In this case the lemmas which we will present in the sequel still hold but their proofs require some minor changes.


4.2. Results for fixed design regression

In this subsection we work conditionally on X_1, ..., X_n and measure the error by the empirical L2 error. For the sake of generality we formulate and prove all results in a fixed design regression model. Let

$$ Y_i = m(x_i) + W_i \qquad (i = 1, \ldots, n) $$

for some x_1, ..., x_n ∈ R^d, m : R^d → R and some random variables W_1, ..., W_n which are independent and have expectation zero. We assume that the W_i's are sub-Gaussian in the sense that

$$ \max_{i=1,\ldots,n} K^2\, E\{ e^{W_i^2/K^2} - 1 \} \le \sigma_0^2 \qquad (17) $$

for some K, σ_0 > 0. Our goal is to estimate m from (x_1, Ȳ_1), ..., (x_n, Ȳ_n), where Ȳ_1, ..., Ȳ_n are arbitrary random variables with the property that the average squared measurement error

$$ \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 \qquad (18) $$

is "small". Let F_n be a set of functions f : R^d → R and consider the least-squares estimate

$$ \bar{m}_n(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - \bar{Y}_i|^2. $$

Our main result is the following lemma, which bounds the empirical L2 error of m̄_n,

$$ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2, $$

by the sum of the approximation error

$$ \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2, $$

a term which depends on the complexity of F_n measured by covering numbers, and the average squared measurement error (18).

Lemma 2. There exist constants c_13, c_14 > 0 which depend only on σ_0 and K such that for any δ_n > 0 with

δ_n → 0 (n → ∞) and n·δ_n → ∞ (n → ∞)


and

$$ \sqrt{n}\,\delta\, c_{13} \ge \int_{\delta/(2^9 \sigma_0)}^{\sqrt{\delta}} \left( \log N_2\!\left( u,\ \left\{ f - g : f \in \mathcal{F}_n,\ \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - g(x_i)|^2 \le \delta \right\},\ x_1^n \right) \right)^{1/2} du \qquad (19) $$

for all δ ≥ δ_n and all g ∈ F_n, we have

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > c_{14} \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \delta_n + \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \right) \right\} \to 0 \quad (n \to \infty). $$

Lemma 2 follows more or less immediately from Lemma 1 and the techniques introduced in van de Geer (2000). For the sake of completeness we nevertheless give a detailed proof.

Proof. Set

$$ m_n^*(\cdot) = \arg\min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2. $$

By Lemma 1,

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > 512\, \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \delta_n + 18 \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \right\} \le P\left\{ \frac{\delta_n}{2} < \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{16}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot W_i \right\} \le P_1 + P_2, $$

where

$$ P_1 = P\left\{ \frac{1}{n}\sum_{i=1}^{n} W_i^2 > 2\sigma_0^2 \right\} $$

and

$$ P_2 = P\left\{ \frac{1}{n}\sum_{i=1}^{n} W_i^2 \le 2\sigma_0^2,\ \frac{\delta_n}{2} < \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m_n^*(x_i)|^2 \le \frac{16}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot W_i \right\}. $$


Application of Chernoff's exponential bounding method (cf. Chernoff, 1952) together with (17) yields

$$ P_1 = P\left\{ \sum_{i=1}^{n} W_i^2/K^2 > 2n\sigma_0^2/K^2 \right\} \le P\left\{ \exp\left( \sum_{i=1}^{n} W_i^2/K^2 \right) > \exp(2n\sigma_0^2/K^2) \right\} \le \exp(-2n\sigma_0^2/K^2) \cdot E\left\{ \exp\left( \sum_{i=1}^{n} W_i^2/K^2 \right) \right\} \le \exp(-2n\sigma_0^2/K^2) \cdot (1 + \sigma_0^2/K^2)^n \le \exp(-2n\sigma_0^2/K^2) \cdot \exp(n \cdot \sigma_0^2/K^2) \to 0 \quad (n \to \infty). $$

To bound P_2, we observe first that (1/n)Σ_{i=1}^n W_i² ≤ 2σ_0² together with the Cauchy–Schwarz inequality implies

$$ \frac{16}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i)) \cdot W_i \le 16 \cdot \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i))^2 } \cdot \sqrt{2\sigma_0^2}, $$

hence inside of P_2 we have

$$ \frac{1}{n}\sum_{i=1}^{n} (\bar{m}_n(x_i) - m_n^*(x_i))^2 \le 512\,\sigma_0^2. $$

S 

 P

s=0

1 2 1 Wi 220 , 2s−1 n < |m ¯ n (xi ) − m∗n (xi )|2 2s n , n n i=1 i=1  n n  16  1 ∗ 2 ∗ |m ¯ n (xi ) − mn (xi )|  (m ¯ n (xi ) − mn (xi )) · Wi , n n n

n

i=1



S  s=0



i=1

1 2 1 P Wi 220 , |m ¯ n (xi ) − m∗n (xi )|2 2s n , n n i=1 i=1  n 1 2 s n ∗ . (m ¯ n (xi ) − mn (xi )) · Wi > n 32 n

i=1

n


The probabilities in the above sum can be bounded by Corollary 8.3 in van de Geer (2000) (use there R = √(2^s δ_n), δ = 2^s δ_n/32 and σ = √2 σ_0). This yields

$$ P_2 \le \sum_{s=0}^{S} c_{15} \exp\left( - \frac{n\,(2^s \delta_n/32)^2}{4 c_{15}\, 2^s \delta_n} \right) = \sum_{s=0}^{S} c_{15} \exp\left( - \frac{n\, 2^s \delta_n}{4 \cdot 32^2\, c_{15}} \right) \le c_{16} \exp\left( - \frac{n \delta_n}{c_{16}} \right) \to 0 $$

for n → ∞. □

To illustrate the usefulness of Lemma 2 we show what happens if we apply it to linear least-squares estimates.

Lemma 3. Let F_n be a subset of a linear vector space of dimension D_n. Then

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > c_{17} \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \frac{D_n}{n} + \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \right) \right\} \to 0 \quad (n \to \infty). $$

Proof. The result follows immediately from Lemma 2 and the bound on the covering number of linear vector spaces given in Corollary 2.6 in van de Geer (2000) (cf. (7)), which implies that condition (19) is, in the case of linear vector spaces, satisfied for δ_n ≥ c_18 D_n/n (cf. Example 9.3.1 in van de Geer (2000) or the proof of Lemma 19.1 in Györfi et al., 2002). □

For particular choices of sets F_n we can derive, under appropriate smoothness assumptions on m, bounds on the approximation error min_{f∈F_n} (1/n)Σ_{i=1}^n |f(x_i) − m(x_i)|², which together with Lemma 3 yields bounds on the rate of convergence of the estimate. For example, for piecewise polynomials we get

Lemma 4. Let L, C > 0 and p = k + β for some k ∈ N_0 and β ∈ (0, 1]. Assume x_1, ..., x_n ∈ [0, 1], |m(x)| ≤ L for some L > 0 and m (p, C)-smooth. Define F_n as the set of all piecewise polynomials of degree M ≥ k with respect to an equidistant partition of [0, 1] into K_n = ⌈C^{2/(2p+1)} n^{1/(2p+1)}⌉ equidistant intervals, which are bounded in absolute value by L + 1. Then

$$ P\left\{ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(x_i) - m(x_i)|^2 > c_{19} \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + C^{2/(2p+1)} n^{-2p/(2p+1)} \right) \right\} \to 0 $$

for n → ∞.


Proof. The proof follows from Lemma 3 and Lemma 11.1 in Györfi et al. (2002) (cf. the proof of Corollary 19.1 in Györfi et al., 2002). □

4.3. Connection between L2 error and empirical L2 error

In this subsection we study the difference between the L2 error and some constant times the empirical L2 error. The results will enable us to extend the previous results from fixed design regression to random design regression.

Lemma 5. Let L ≥ 1, let m : R^d → [−L, L] and let F be a class of functions f : R^d → [−L, L]. Let 0 < ε < 1 and α > 0. Assume that

$$ \sqrt{n}\,\varepsilon\,\sqrt{\alpha} \ge 1152\, L $$

and that, for all x_1, ..., x_n ∈ R^d and all δ ≥ 2L²α,

$$ \frac{\sqrt{n}\,\varepsilon\,\delta}{768\sqrt{2}\,L^2} \ge \int_{\varepsilon\delta/(128 L^2)}^{\sqrt{\delta}} \left( \log N_2\!\left( \frac{u}{4L},\ \left\{ f - m : f \in \mathcal{F},\ \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \le \frac{\delta}{L^2} \right\},\ x_1^n \right) \right)^{1/2} du. $$

Then

$$ P\left\{ \sup_{f \in \mathcal{F}} \frac{ \left| E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right| }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \varepsilon \right\} \le 15 \exp\left( - \frac{n\,\alpha\,\varepsilon^2}{512 \cdot 2304\, L^2} \right). $$

In the proof we will use Theorem 19.2 of Györfi et al. (2002), which we reformulate here as

Lemma 6. Let Z, Z_1, ..., Z_n be independent and identically distributed random variables with values in R^d. Let K ≥ 1 and let F be a class of functions f : R^d → [0, K]. Let 0 < ε < 1 and α > 0. Assume that

$$ \sqrt{n}\,\varepsilon\,\sqrt{\alpha} \ge 576\,\sqrt{K} \qquad (20) $$

and that, for all z_1, ..., z_n ∈ R^d and all δ ≥ αK/2,

$$ \frac{\sqrt{n}\,\varepsilon\,\delta}{192\sqrt{2}\,K} \ge \int_{\varepsilon\delta/(32K)}^{\sqrt{\delta}} \left( \log N_2\!\left( u,\ \left\{ f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^{n} f(z_i) \le \frac{4\delta}{K} \right\},\ z_1^n \right) \right)^{1/2} du. \qquad (21) $$

Then

$$ P\left\{ \sup_{f \in \mathcal{F}} \frac{ \left| E\{f(Z)\} - \frac{1}{n}\sum_{i=1}^{n} f(Z_i) \right| }{ \alpha + E\{f(Z)\} + \frac{1}{n}\sum_{i=1}^{n} f(Z_i) } > \varepsilon \right\} \le 15 \exp\left( - \frac{n\,\alpha\,\varepsilon^2}{128 \cdot 2304\, K} \right). \qquad (22) $$


Proof of Lemma 5. Set G = {(f − m)² : f ∈ F}. Then

$$ P\left\{ \sup_{f \in \mathcal{F}} \frac{ \left| E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right| }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \varepsilon \right\} = P\left\{ \sup_{g \in \mathcal{G}} \frac{ \left| E\{g(X)\} - \frac{1}{n}\sum_{i=1}^{n} g(X_i) \right| }{ \alpha + E\{g(X)\} + \frac{1}{n}\sum_{i=1}^{n} g(X_i) } > \varepsilon \right\}. $$

For arbitrary f, g : R^d → [−L, L] we have

$$ \frac{1}{n}\sum_{i=1}^{n} \left| |f(x_i) - m(x_i)|^2 - |g(x_i) - m(x_i)|^2 \right|^2 = \frac{1}{n}\sum_{i=1}^{n} |f(x_i) + g(x_i) - 2m(x_i)|^2 \, |(f(x_i) - m(x_i)) - (g(x_i) - m(x_i))|^2 \le 16 L^2\, \frac{1}{n}\sum_{i=1}^{n} |(f(x_i) - m(x_i)) - (g(x_i) - m(x_i))|^2, $$

which implies

$$ N_2\!\left( u,\ \left\{ g \in \mathcal{G} : \frac{1}{n}\sum_{i=1}^{n} g(x_i) \le \frac{\delta}{L^2} \right\},\ x_1^n \right) \le N_2\!\left( \frac{u}{4L},\ \left\{ f - m : f \in \mathcal{F},\ \frac{1}{n}\sum_{i=1}^{n} |f(x_i) - m(x_i)|^2 \le \frac{\delta}{L^2} \right\},\ x_1^n \right). $$

The result follows by an application of Lemma 6. □

By Lemma 5 we can bound the L2 error by some constant times the empirical L2 error. Indeed,

$$ \frac{ \left| E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right| }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \frac{1}{2} $$

is implied by

$$ \frac{ E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \frac{1}{2}, $$

which in turn is equivalent to

$$ E\{|f(X) - m(X)|^2\} > \alpha + 3 \cdot \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2. $$

Hence

$$ P\left\{ \exists f \in \mathcal{F} : E\{|f(X) - m(X)|^2\} > \alpha + 3 \cdot \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right\} \le P\left\{ \sup_{f \in \mathcal{F}} \frac{ \left| E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right| }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \frac{1}{2} \right\}. \qquad (23) $$

Similarly one can show

$$ P\left\{ \exists f \in \mathcal{F} : \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 > \alpha + 3 \cdot E\{|f(X) - m(X)|^2\} \right\} \le P\left\{ \sup_{f \in \mathcal{F}} \frac{ \left| E\{|f(X) - m(X)|^2\} - \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right| }{ \alpha + E\{|f(X) - m(X)|^2\} + \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 } > \frac{1}{2} \right\}. \qquad (24) $$

4.4. Proof of Theorem 1

We have

$$ P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > 3c_{20}\, \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + (6c_{20} + 1)\delta_n + 9c_{20} \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \right\} \le P_{1,n} + P_{2,n} + P_{3,n}, $$

where

$$ P_{1,n} = P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > \delta_n + 3\, \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(X_i) - m(X_i)|^2 \right\}, $$

$$ P_{2,n} = P\left\{ \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 > \delta_n + 3 \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \right\} $$


and

$$ P_{3,n} = P\left\{ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) > 3c_{20}\, \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + (6c_{20} + 1)\delta_n + 9c_{20} \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx),\ \int |\bar{m}_n(x) - m(x)|^2 \,\mu(dx) \le \delta_n + 3\, \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(X_i) - m(X_i)|^2,\ \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \le \delta_n + 3 \inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \,\mu(dx) \right\}. $$

By (23), Lemma 5 and nδ_n → ∞ for n → ∞ we get P_{1,n} → 0 (n → ∞). By (24), Lemma 5 and nδ_n → ∞ for n → ∞ we get

$$ P_{2,n} \le P\left\{ \exists f \in \mathcal{F}_n : \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 > \delta_n + 3 \int |f(x) - m(x)|^2 \,\mu(dx) \right\} \to 0 $$

for n → ∞. Finally,

$$ P_{3,n} \le P\left\{ \delta_n + 3\, \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(X_i) - m(X_i)|^2 > 3c_{20}\, \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + (3c_{20} + 1)\delta_n + 3c_{20} \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right\} = P\left\{ \frac{1}{n}\sum_{i=1}^{n} |\bar{m}_n(X_i) - m(X_i)|^2 > c_{20} \left( \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}_i|^2 + \delta_n + \min_{f \in \mathcal{F}_n} \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - m(X_i)|^2 \right) \right\} \to 0 \quad (n \to \infty) $$

by Lemma 2. This completes the proof of Theorem 1. □

Acknowledgements

The author thanks two anonymous referees for several helpful comments.


References

Beran, R., 1981. Nonparametric regression with randomly censored survival data. Technical Report, University of California, Berkeley.
Birgé, L., Massart, P., 1993. Rates of convergence of minimum contrast estimators. Probab. Theory Related Fields 71, 113–150.
Birgé, L., Massart, P., 1998. Minimum contrast estimators on sieves: exponential bounds and rate of convergence. Bernoulli 4, 329–375.
Carbonez, A., Györfi, L., van der Meulen, E.C., 1995. Partition-estimates of a regression function under random censoring. Statist. Decisions 13, 21–37.
Carroll, R.J., Maca, J.D., Ruppert, D., 1999. Nonparametric regression in the presence of measurement error. Biometrika 86, 541–554.
Chernoff, H., 1952. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493–507.
Cox, D.D., 1984. Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21, 789–813.
Geman, S., Hwang, C.-R., 1982. Nonparametric maximum likelihood estimation by the method of sieves. Ann. Statist. 10, 401–414.
Györfi, L., Kohler, M., Krzyżak, A., Walk, H., 2002. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, Berlin.
Kohler, M., 1999. Universally consistent regression function estimation using hierarchical B-splines. J. Multivariate Anal. 67, 138–164.
Kohler, M., 2000. Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression. J. Statist. Plann. Inference 89, 1–23.
Kohler, M., Máthé, K., Pintér, M., 2002. Prediction from randomly right censored data. J. Multivariate Anal. 80, 73–100.
Kohler, M., Kul, S., Máthé, K., 2003. Least squares estimates for censored regression, submitted for publication.
Lugosi, G., Zeger, K., 1995. Nonparametric estimation via empirical risk minimization. IEEE Trans. Inform. Theory 41, 677–687.
Müller, H.-G., Stadtmüller, U., 1987. Estimation of heteroscedasticity in regression analysis. Ann. Statist. 15, 610–625.
Neumann, M.H., 1994. Fully data-driven nonparametric variance estimators. Statistics 25, 189–212.
Pintér, M., 2001. Consistency results in nonparametric regression estimation and classification. Ph.D. Thesis, Technical University of Budapest.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer, New York.
Shen, X., Wong, W.H., 1994. Convergence rate of sieve estimates. Ann. Statist. 22, 580–615.
Stadtmüller, U., Tsybakov, A.B., 1995. Nonparametric recursive variance estimation. Statistics 27, 55–63.
Stone, C.J., 1982. Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040–1053.
van de Geer, S., 1987. A new approach to least squares estimation, with applications. Ann. Statist. 15, 587–602.
van de Geer, S., 1990. Estimating a regression function. Ann. Statist. 18, 907–924.
van de Geer, S., 2000. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge.
van de Geer, S., Wegkamp, M., 1996. Consistency for the least squares estimator in nonparametric regression. Ann. Statist. 24, 2513–2523.