Journal of Statistical Planning and Inference 126 (2004) 55–72
www.elsevier.com/locate/jspi
Local polynomial M-smoothers in nonparametric regression

Ruey-Ching Hwang

Department of Accounting and Statistics, Dahan Institute of Technology, Hualien, Taiwan 971, R.O.C.

Received 2 September 2001; accepted 26 July 2003

Abstract

In the case of the equally spaced fixed design nonparametric regression, the local polynomial M-smoother (LPM) is proposed for estimating the regression function. The LPM is developed directly from both the local constant M-smoother in Chu et al. (J. Amer. Statist. Assoc. 93 (1998) 526) and the local linear M-smoother in Rue et al. (J. Nonparametric Statist. 14 (2002) 155). It is shown that, in the smooth region, our LPM has the same asymptotic bias as the local polynomial estimator, but has larger asymptotic variance. On the other hand, in the jump region, it has the interesting property of edge preservation, but is inconsistent. In practice, a fast algorithm for computing the LPM is presented, and the idea of cross-validation is applied to select the value of the smoothing parameter. More importantly, the results, including the fast computation algorithm, obtained for the LPM in the one-dimensional case can be extended directly to the multidimensional case. Simulation studies demonstrate that our LPM is competitive with alternatives, in the sense of yielding both smaller sample mean average squared error and better visual performance.
(c) 2003 Published by Elsevier B.V.

MSC: primary 62G05; secondary 62G20

Keywords: Asymptotic behavior; Jump-preserving smoothing; Local polynomial estimation; Local polynomial M-smoothing; M-estimation; Nonparametric regression

1. Introduction

In applications of the regression method, the underlying regression function may or may not have discontinuity points. For example, consider the cases of studying the impact of advertising, the effect of medicine, and the influence of sudden changes

in government policies and international relationships. The regression function may or may not be affected by these actions. If it is affected instantly, then the resulting regression function has discontinuity points; otherwise, it has no discontinuity point. See Shiau (1985) for many interesting examples where regression functions are not continuous. However, in practice, a suitable parametric method may not be available to analyze such situations. If the parametric model is chosen incorrectly, then we may make serious mistakes in drawing inferences about the process under study. Whenever there is no appropriate parametric method available, we may start from nonparametric regression.

Nonparametric regression is a smoothing method for recovering the regression function and its characteristics from noisy data. Due to both its simplicity of computation and its nice asymptotic properties, the local polynomial estimator (LPE) is one of the most popular smoothing methods. For asymptotic properties of the LPE, see, for example, Fan (1992, 1993) and Ruppert and Wand (1994). For other smoothing methods, see, for example, the monographs by Eubank (1988), Müller (1988), Härdle (1990, 1991), Scott (1992), Wand and Jones (1995), Fan and Gijbels (1996), and Simonoff (1996). However, these classical smoothing methods are inappropriate when the regression function has discontinuity points, because smoothing tends to blur the jumps. For the effect of discontinuity points on the asymptotic behavior of the classical smoothing methods, see, for example, Wu and Chu (1993). See also van Eeden (1985), van Es (1992), and Cline and Hart (1991) in the related field of kernel density estimation.

To lessen the adverse effect of discontinuity points on the classical smoothing methods, several remedies are available. For example, this adverse effect can be reduced by employing locally varying bandwidths (Fan and Gijbels, 1995; Ruppert, 1997). On the other hand, McDonald and Owen (1986), Chiu (1987), Lee (1990), Hall and Titterington (1992), Müller (1992), and Qiu et al. (1992) introduce smoothing methods that can produce discontinuous output. However, in practice, if the regression function under study has discontinuity points whose locations are close together, then, due to the essence of the local average, these remedies still suffer. The closer the locations of the discontinuity points, the less effective these remedies are. For more discussion of this fact, see Chu et al. (1998) and Rue et al. (2002).

To overcome these problems, Chu et al. (1998) propose the local constant M-smoother, constructed both by applying the idea of redescending M-estimation to local constant fits and by exploiting the controversial local minima in an essential way. See, for example, the monographs by Huber (1981) and Hampel et al. (1986) for an introduction to robust M-estimation. Similarly, Rue et al. (2002) consider the local linear M-smoother for one-dimensional data. Such local M-smoothers have the interesting property of edge preservation, but are weak in terms of efficiency of noise reduction. For other robust smoothing methods, see, for example, Härdle and Gasser (1984), Tsybakov (1986), Besl et al. (1989), Fan et al. (1994), and Fan and Jiang (1999). The difference between these robust smoothing methods and the above local M-smoothers is that the former are constructed by choosing "global" minima, whereas the latter are developed by finding "local" minima.
The purpose of this paper is to study the local polynomial M -smoother (LPM). The motivation for our LPM will be given in Section 2. It is shown that, in the smooth

region, the LPM for the regression function has the same asymptotic bias as the LPE, but has larger asymptotic variance. On the other hand, in the jump region, it has the interesting property of edge preservation, but is inconsistent. In practice, a fast algorithm for computing the LPM is presented, and the idea of cross-validation is applied to select the value of the smoothing parameter. More importantly, the results, including the fast computation algorithm, obtained for the LPM in the one-dimensional case can be extended directly to the multidimensional case. Simulation studies demonstrate that our LPM is competitive with alternatives, in the sense of yielding both smaller sample mean average squared error and better visual performance.

This article is organized as follows. Section 2 describes the regression settings and the precise formulation of the LPM. The asymptotic behavior of the LPM is studied in Section 3. A fast algorithm for calculating the LPM is presented in Section 4. Finally, simulation results which give additional insight are contained in Section 5.

2. Regression settings and proposed estimators

In this paper, the equally spaced fixed design nonparametric regression model is given by

Y_i = m(x_i) + \epsilon_i,     (2.1)

for i = 1, \ldots, n. Here m is an unknown regression function defined on the closed interval [0,1], x_i are equally spaced fixed design points, that is, x_i = i/n, \epsilon_i are independent and identically distributed regression errors with mean 0 and variance \sigma^2, 0 < \sigma^2 < \infty, and Y_i are noisy observations of m at x_i. The regression function m in (2.1) is defined by

m(x) = s(x) + \delta(x),     (2.2)

where s and \delta denote, respectively, the continuous and the discontinuous parts of m. The continuous function s is defined and has q derivatives on [0,1], where q is a nonnegative integer. On the other hand, the discontinuous function \delta is defined by \delta(x) = \sum_{j=1}^{w} d_j I_{[t_j, 1]}(x), for x \in [0,1]. Here w is a nonnegative integer representing the number of jump points of m, t_j are the locations of the jump points, and d_j are nonzero real numbers standing for the sizes of the jumps of m at t_j. If w = 0, then m = s. For simplicity of presentation, let 0 < t_1 < \cdots < t_w < 1.
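For concreteness, the following minimal NumPy sketch draws one sample from model (2.1)-(2.2). The smooth part, the jump location t_1 = 0.5, and the jump size d_1 = 1 are illustrative choices of ours, not taken from the paper; the noise level \sigma = 0.25 matches the simulation settings of Section 5.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 200
    x = np.arange(1, n + 1) / n                  # equally spaced design x_i = i/n

    def m(x):
        """Example regression function: smooth part sin(2*pi*x) plus one
        jump of size d_1 = 1 at t_1 = 0.5 (illustrative, not from the paper)."""
        return np.sin(2 * np.pi * x) + 1.0 * (x >= 0.5)

    y = m(x) + rng.normal(0.0, 0.25, size=n)     # Y_i = m(x_i) + eps_i, sigma = 0.25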

The formulation of our proposed LPM for m(x_i) is now introduced. It is generated by applying the idea of redescending M-estimation to the LPE. Given a probability density function K supported on [-1, 1], and a bandwidth h = h_n tending to 0 as n \to \infty, the LPE for m(x_i) is generated by minimizing the local weighted least squares

n^{-1} h^{-1} \sum_{j=1}^{n} K(z_j) \{Y_j - \Pi_{1,p}(z_j)\}^2,     (2.3)

for each x_i \in [0,1]. Here \Pi_{1,p}(z) = \sum_{k=0}^{p} c_k z^k, the subindices 1 and p stand, respectively, for one-dimensional data and a p-degree polynomial, p is a given nonnegative integer,

and z_j = (x_i - x_j)/h. Let \hat c_i^{\#}, for each i = 0, \ldots, p, denote the solution for c_i of this minimization problem. By Taylor's theorem, the LPE for m(x_i) is taken as \hat m_p^{\#}(x_i) = \hat c_0^{\#}.

Note that the criterion (2.3) is based on the least squares principle, and hence is not robust. To remedy this drawback of \hat m_p^{\#}(x_i), we suggest finding the local maximizer of the function

S_{1,p}(c; x_i) = n^{-1} h^{-1} \sum_{j=1}^{n} K(z_j) L_g\{Y_j - \Pi_{1,p}(z_j)\}.

Here c = (c_0, \ldots, c_p) is a 1 \times (p+1) vector of parameters, g = g_n tends to 0 as n \to \infty, and L_g(\cdot) = g^{-1} L(\cdot/g). In this paper, we take L as the Gaussian function. Note that the first derivative of this choice of L satisfies the condition |L^{(1)}(z)| \to 0 as z \to \pm\infty, and hence is redescending. See Chu et al. (1998) for more reasons for choosing the Gaussian function as L in the case of p = 0. Set \hat c = (\hat c_0, \ldots, \hat c_p) as the local maximizer of S_{1,p}(c; x_i) over c such that its associated \hat c_0 is closest to Y_i. Our proposed LPM for m(x_i) is taken as \hat m_p(x_i) = \hat c_0.

The asymptotic behavior of \hat m_p(x_i) will be studied in Section 3. If the value of p is given as 0 or 1, then the resulting \hat m_p(x_i) is the local constant M-smoother in Chu et al. (1998) or the local linear M-smoother in Rue et al. (2002), respectively. For constructing \hat m_p(x_i) in practice, see Remarks 3.1 and 3.2 for the choice of the kernel function K and the values of p, h, and g, and see Section 4 for a fast computation algorithm.

The rest of this section is devoted to giving an illustration of and the motivation for \hat m_p(x_i). To illustrate, a simulation study was performed. For clarity and simplicity of presentation, a simulated regression function with one discontinuity point was considered. Under our simulation settings, Fig. 1 shows that both the best and the practical performance of the regression function estimate derived from our LPM \hat m_p(x_i), for each p = 0, \ldots, 4, are better than those obtained by the LPE \hat m_p^{\#}(x_i), in each sense of having smaller sample average squared error, showing more accurately the location of the discontinuity point, and having a smoother appearance. Fig. 1d shows that the difficulty of approximating a discontinuity by \hat m_p(x_i) using cross-validated bandwidths increases with p, but Fig. 1c shows that this drawback of the cross-validated \hat m_p(x_i) does not occur when \hat m_p(x_i) takes its bandwidths as the minimizer of the sample average squared error. More comparisons between these estimators will be given in Section 5.

We now explain why \hat m_p(x_i) has to be defined in this way. For simplicity of explanation, the case of p = 1 is considered. Using the simulated data shown in Fig. 1b, Figs. 2a and b give 3-D surfaces of the function S_{1,1}(c_0, c_1; x_i) for x_i in the smooth region and in the jump region, respectively. The locations of the solid curves at the top of Fig. 2 represent those on which the components L_g(Y_j - c_0 - c_1 z_j) in S_{1,1}(c_0, c_1; x_i) with K(z_j) > 0 attain their own maximum values over (c_0, c_1). Since L is the Gaussian kernel, these solid

Fig. 1. An artificial example. (a) True curve. (b) Noisy curve. (c) Best performance of our LPM \hat m_p(x_i) using global bandwidths h and g as the minimizer of the sample average squared error. (d) Practical performance of our LPM \hat m_p(x_i) employing the cross-validated bandwidths \hat h_p and \hat g_p suggested in Remark 3.2. (e) Best performance of the LPE \hat m_p^{\#}(x_i) using the global bandwidth h as the minimizer of the sample average squared error. (f) Practical performance of the LPE \hat m_p^{\#}(x_i) employing the locally varying bandwidth introduced by Fan and Gijbels (1995). The regression function estimates shown in (c)-(f) have been vertically shifted for better visual performance. For these estimators, the kernel function K was taken as the Epanechnikov kernel, the value of h was chosen on an equally spaced grid of 101 values of h in [0.03, 0.5], and the value of g was selected on an equally spaced grid of 15 values of g in [0.02, 0.3].

Fig. 2. Plot of S_{1,1}(c_0, c_1; x_i) associated with \hat m_1(x_i) using h = 0.1, g = 0.1, and the simulated data shown in Fig. 1(b), for x_i = 0.75 in the smooth region (a), and for x_i = 0.51 in the jump region of m(x) (b).

curves are defined by c_0 + c_1 z_j = Y_j on the (c_0, c_1)-plane. Through a straightforward calculation, the solid curves corresponding to the effective design points x_j and x_u intersect at the point P_{j,u} = m_{j,u} + e_{j,u}. Here m_{j,u} and e_{j,u} denote, respectively, the nonrandom and the random parts of P_{j,u}, with m_{j,u} = [\{z_u m(x_j) - z_j m(x_u)\}/(z_u - z_j), \{m(x_u) - m(x_j)\}/(z_u - z_j)], and e_{j,u} is m_{j,u} with m(x_j) and m(x_u) replaced, respectively, by \epsilon_j and \epsilon_u. Note that Fig. 2 shows that these intersection points P_{j,u} form clusters roughly at the local maximizers of S_{1,1}(c_0, c_1; x_i). This fact will be used to explain what information is provided by the locations of the local maximizers of S_{1,1}(c_0, c_1; x_i).

If x_i is in the smooth region, then, by some regularity conditions and a straightforward calculation, m_{j,u} can be asymptotically expressed by p_i^* = \{m(x_i), (-h) m^{(1)}(x_i)\}. Hence the corresponding intersection points P_{j,u} cluster around the location of p_i^*. By this and the fact that each L_g(Y_j - c_0 - c_1 z_j) with K(z_j) > 0 has its maximum value at its associated P_{j,u}, S_{1,1}(c_0, c_1; x_i) for x_i in the smooth region has a unique maximizer located around p_i^*; see Fig. 2a. On the other hand, if x_i is in the jump region (t_1 - h, t_1 + h), then, using similar arguments, S_{1,1}(c_0, c_1; x_i) has two local maximizers asymptotically; see Fig. 2b. One is located around p_i^*. The other is located around p_i^{\#} = \{m(x_i) + d_1, (-h) m^{(1)}(x_i)\} or p_i^{\&} = \{m(x_i) - d_1, (-h) m^{(1)}(x_i)\}, for x_i \in (t_1 - h, t_1] or x_i \in (t_1, t_1 + h), respectively. By the value of the first element in each of p_i^*, p_i^{\#}, and p_i^{\&} and the explanations given in Chu et al. (1998) for constructing the local constant M-smoother, our proposed \hat m_1(x_i) can deal with discontinuities.

Finally, by the fact that the value (-h) m^{(1)}(x_i) of the second element in each of p_i^*, p_i^{\#}, and p_i^{\&} converges to 0 as the value of h decreases, the fast algorithm given in Section 4 for computing \hat m_1(x_i) takes the starting value of (c_0, c_1) as (Y_i, 0). The same remark applies to the case of p > 1 by taking the starting value of (c_0, c_1, \ldots, c_p) as (Y_i, 0, \ldots, 0).

3. Results

In this section, we shall study the asymptotic behavior of \hat m_p(x_i). For this, in addition to the assumptions given in Section 2, we add the following ones:

(A1) The function s in (2.2) has q Lipschitz continuous derivatives on [0,1].

(A2) The probability density function f of \epsilon_i is supported on R, symmetric about 0, and unimodal, and has four Lipschitz continuous derivatives. Also, the function f(x) + f(x - d_j) has two local maximizers at x = 0 and x = d_j, for each j = 1, \ldots, w.

(A3) The function K is a Lipschitz continuous and symmetric probability density function supported on [-1, 1], and L is the Gaussian kernel.

(A4) The values of h and g are selected on the interval H_n = [\kappa n^{-1+\kappa}, \kappa^{-1} n^{-\kappa}], where the positive constant \kappa is arbitrarily small. Also, they satisfy the conditions h^p \gg g \gg h^{p+1}, h^{p+2} \gg g^3, and n h^{2p+1} g^3 \gg 1 \gg n h^{2p+3} g^3. Here and throughout this paper, the notation a_n \gg b_n denotes b_n / a_n \to 0 as n \to \infty.

(A5) The total number of observations in this regression setting is n, with n \to \infty.

The following Theorem 3.1 gives the asymptotic bias and variance of \hat m_p(x_i). Its proof is essentially the same as that of Theorem 3.1 in Rue et al. (2002), and hence is omitted. To state this theorem, we introduce the following notation. Suppose the interval

[0,1] is decomposed into R_1 = \{x : x = \theta h, \theta \in [0,1]\}, R_2 = \{x : x = 1 + \theta h, \theta \in [-1,0]\}, R_3 = \{x : x \in \bigcup_{j=0}^{w} (t_j + h, t_{j+1} - h)\}, R_{4,j} = \{x : x = t_j + \theta h, \theta \in [-1,0)\}, and R_{5,j} = \{x : x = t_j + \theta h, \theta \in [0,1]\}, for each j = 1, \ldots, w. Here t_0 = 0 and t_{w+1} = 1. Define

\pi_j = \int_{-\infty}^{-|d_j|/2} f(z)\, dz,

a_{p,u} = (-1)^u (u!)^{-1} \left\{ \int_{-1}^{1} z^u K_p(z)\, dz \right\} m^{(u)}(x_i),

b_p = f(0)\, f^{(2)}(0)^{-2} \left\{ \int_{-1}^{1} L^{(1)}(z)^2\, dz \right\} \left\{ \int_{-1}^{1} K_p(z)^2\, dz \right\},

for j = 1, \ldots, w and p, u \ge 0. Here the notation \phi^{(u)} denotes the u-th derivative of the given function \phi, except that m^{(u)} stands for the regression function m itself for u = 0 and for s^{(u)} for u > 0. Also, the Lejeune-Sarda kernel K_p(z) (Lejeune and Sarda, 1992) is defined by K_p(z) = \det\{N_p(z)\}/\det(N_p), where \det(A) denotes the determinant of the matrix A, N_p is the (p+1) \times (p+1) matrix whose (i,j)th element is the quantity \mu_{i+j-2} = \int_{-1}^{1} z^{i+j-2} K(z)\, dz, and N_p(z) is the matrix N_p with its first column replaced by \{K(z), zK(z), \ldots, z^p K(z)\}^T. Note that K_p(z) is an order (0, u) kernel (Ruppert and Wand, 1994; Lai and Chu, 2001), where u = p + 1 when p is odd and u = p + 2 when p is even.

Given a value of \theta \in [-1, 1], set a_{p,u,\ell} and a_{p,u,r} as a_{p,u} with \int_{-1}^{1} z^u K_p(z)\, dz replaced, respectively, by \int_{-1}^{\theta} z^u K_{p,\ell}(z)\, dz and \int_{\theta}^{1} z^u K_{p,r}(z)\, dz. Here K_{p,\ell}(z) and K_{p,r}(z) are K_p(z) with the quantity \mu_u replaced, respectively, by \int_{-1}^{\theta} z^u K(z)\, dz and \int_{\theta}^{1} z^u K(z)\, dz. Both K_{p,\ell}(z) and K_{p,r}(z) are Lejeune-Sarda kernels of order (0, p+1) for p \ge 0. Finally, b_{p,\ell} and b_{p,r} are similarly defined.

Theorem 3.1. Under the regression model (2.1) and assumptions (A1)-(A5), for x_i in each of R_1, R_2, R_3, R_{4,j}, and R_{5,j}, the dominant term of the asymptotic bias and variance of \hat m_p(x_i) has the following respective expression. If p \ge q, then

AB\{\hat m_p(x_i)\} = o(h^q),\ o(h^q),\ o(h^q),\ \pi_j d_j,\ -\pi_j d_j.     (3.1)

If p is odd and q \ge p + 1, then

AB\{\hat m_p(x_i)\} = h^{p+1} a_{p,p+1,\ell},\ h^{p+1} a_{p,p+1,r},\ h^{p+1} a_{p,p+1},\ \pi_j d_j,\ -\pi_j d_j.     (3.2)

If p is even and q = p + 1, then

AB\{\hat m_p(x_i)\} = h^{p+1} a_{p,p+1,\ell},\ h^{p+1} a_{p,p+1,r},\ o(h^{p+1}),\ \pi_j d_j,\ -\pi_j d_j.     (3.3)

If p is even and q \ge p + 2, then

AB\{\hat m_p(x_i)\} = h^{p+1} a_{p,p+1,\ell},\ h^{p+1} a_{p,p+1,r},\ h^{p+2} a_{p,p+2},\ \pi_j d_j,\ -\pi_j d_j.     (3.4)

In any case,

AV\{\hat m_p(x_i)\} = n^{-1} h^{-1} g^{-3} b_{p,\ell},\ n^{-1} h^{-1} g^{-3} b_{p,r},\ n^{-1} h^{-1} g^{-3} b_p,\ \pi_j (1 - \pi_j) d_j^2,\ \pi_j (1 - \pi_j) d_j^2.     (3.5)
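Since the constants in Theorem 3.1 involve K_p only through simple moment integrals, K_p is straightforward to evaluate numerically. The following sketch computes K_p(z) directly from the determinant definition above; the Epanechnikov choice of K is only an example, and the function names are ours.

    import numpy as np
    from scipy.integrate import quad

    def K(z):
        """Epanechnikov kernel on [-1, 1] (an example choice of K)."""
        return 0.75 * (1.0 - z * z) if abs(z) <= 1.0 else 0.0

    def lejeune_sarda(z, p):
        """K_p(z) = det{N_p(z)} / det(N_p), where N_p has (i, j)th element
        mu_{i+j-2} = int_{-1}^{1} t^{i+j-2} K(t) dt (1-based i, j)."""
        mu = [quad(lambda t, u=u: t**u * K(t), -1.0, 1.0)[0] for u in range(2 * p + 1)]
        Np = np.array([[mu[i + j] for j in range(p + 1)] for i in range(p + 1)])
        Npz = Np.copy()
        Npz[:, 0] = [z**i * K(z) for i in range(p + 1)]   # first column -> {K(z), zK(z), ...}
        return np.linalg.det(Npz) / np.linalg.det(Np)

    # Moment check: int z^u K_p(z) dz should vanish for u = 1, ..., p.
    for u in range(1, 4):
        print(u, round(quad(lambda t: t**u * lejeune_sarda(t, 3), -1.0, 1.0)[0], 8))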

We now close this section with the following remarks.

Remark 3.1 (The asymptotic properties of \hat m_p(x_i)). By Theorem 3.1 in this paper and Theorem 4.2 in Ruppert and Wand (1994), we have the following conclusions. In the smooth region R_1 \cup R_2 \cup R_3, our LPM has the same asymptotic bias as the LPE, but has larger asymptotic variance. In this case, if p is odd, then the optimal kernel function K for constructing the LPM is the Epanechnikov kernel K(z) = (3/4)(1 - z^2) I_{[-1,1]}(z) (Fan et al., 1993), in the sense of having smaller asymptotic mean squared error. On the other hand, in the jump region R_{4,j} \cup R_{5,j}, for each j = 1, \ldots, w, our LPM is inconsistent for the regression function. The same conclusion for the LPM in the jump region applies to the LPE (Wu and Chu, 1993). Finally, by (3.1)-(3.4), we need to know the order of smoothness of the continuous part of the regression function in advance to be rewarded with the benefit of a fast-decaying bias as the value of h decreases.

Remark 3.2 (Practical choice of the values of h and g). For constructing \hat m_p(x_i), the values of h and g can be selected simultaneously by taking the minimizers \hat h_p and \hat g_p of the cross-validation score

CV_p(h, g) = \sum_{i=1}^{n} \{\hat m_{p,i}(x_i) - Y_i\}^2;

see Section 5.1 of Härdle (1990). Here \hat m_{p,i}(x_i) is \hat m_p(x_i) with the observation (x_i, Y_i) deleted, and the local maximizer of the resulting function S_{1,p}(c; x_i) over c is taken as \hat c = (\hat c_0, \ldots, \hat c_p) such that \hat c_0 is closest to Y_i^*. Here Y_i^* = Y_{i-1} for i > 1 and Y_i^* = Y_2 for i = 1.
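A minimal sketch of this selection rule follows, assuming a hypothetical function lpm_loo that returns the leave-one-out fits \hat m_{p,i}(x_i) for given bandwidths; only the grid search itself is shown.

    import numpy as np

    def select_bandwidths(x, y, h_grid, g_grid, lpm_loo):
        """Pick (h, g) minimizing CV_p(h, g) over a rectangular grid.
        `lpm_loo(x, y, h, g)` is a hypothetical leave-one-out LPM fitter."""
        best_cv, best_h, best_g = np.inf, None, None
        for h in h_grid:
            for g in g_grid:
                fits = lpm_loo(x, y, h, g)        # \hat m_{p,i}(x_i), i = 1, ..., n
                cv = np.sum((fits - y) ** 2)      # CV_p(h, g)
                if cv < best_cv:
                    best_cv, best_h, best_g = cv, h, g
        return best_h, best_g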

Remark 3.3 (A direct extension of the LPM to the case of the "multidimensional" equally spaced fixed design). For simplicity of presentation, we shall only introduce the two-dimensional LPM; other multidimensional LPMs can be constructed similarly. The bivariate regression model is given by Y_{i,j} = \nu(x_i, y_j) + \epsilon_{i,j}, for i, j = 1, \ldots, n. Here x_i = i/n, y_j = j/n, \epsilon_{i,j} are regression errors, \nu is the bivariate regression function defined on [0,1]^2, and Y_{i,j} are noisy observations of \nu at (x_i, y_j). Given the bandwidths h_x, h_y, and g, the two-dimensional LPM for \nu(x_i, y_j) is produced by finding the local maximizer of the function

S_{2,p}(c; x_i, y_j) = n^{-2} h_x^{-1} h_y^{-1} \sum_{u=1}^{n} \sum_{v=1}^{n} K(z_u) K(\upsilon_v) L_g\{Y_{u,v} - \Pi_{2,p}(z_u, \upsilon_v)\},

for each (x_i, y_j) \in [0,1]^2. Here c = (c_{00}, c_{10}, c_{01}, \ldots, c_{p0}, \ldots, c_{0p}) is a 1 \times \{(p+1)(p+2)/2\} vector of parameters, p is a given nonnegative integer, z_u = (x_i - x_u)/h_x,

\upsilon_v = (y_j - y_v)/h_y, \Pi_{2,p}(z, \upsilon) = \sum_{k=0}^{p} \sum_{\ell=0}^{k} c_{k-\ell,\ell}\, z^{k-\ell} \upsilon^{\ell}, and L_g(\cdot) = g^{-1} L(\cdot/g). Set \hat c = (\hat c_{00}, \hat c_{10}, \hat c_{01}, \ldots, \hat c_{p0}, \ldots, \hat c_{0p}) as the local maximizer of S_{2,p}(c; x_i, y_j) over c such that its associated \hat c_{00} is closest to Y_{i,j}. Our two-dimensional LPM \hat\nu_p(x_i, y_j) for \nu(x_i, y_j) is taken as \hat\nu_p(x_i, y_j) = \hat c_{00}.

In practice, the fast algorithm given in Section 4 for computing \hat m_p(x_i) in the one-dimensional case can be used directly to calculate \hat\nu_p(x_i, y_j) in the two-dimensional case. Also, the idea of cross-validation introduced in Remark 3.2 can be employed to select the values of the smoothing parameters.
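As a small illustration of the bivariate bookkeeping, the sketch below evaluates the polynomial \Pi_{2,p} with the coefficients stored in the ordering used above; the function name is ours.

    def poly2(c, z, v, p):
        """Evaluate Pi_{2,p}(z, v) = sum_{k=0}^{p} sum_{l=0}^{k} c_{k-l,l} z^{k-l} v^l,
        with c laid out degree by degree as (c00, c10, c01, c20, c11, c02, ...)."""
        assert len(c) == (p + 1) * (p + 2) // 2    # number of coefficients
        out, idx = 0.0, 0
        for k in range(p + 1):                     # total degree k
            for l in range(k + 1):                 # c_{k-l, l} multiplies z^{k-l} v^l
                out += c[idx] * z ** (k - l) * v ** l
                idx += 1
        return out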

4. An algorithm

In this section, an iterative algorithm for calculating the LPM \hat m_p(x_i) is given. Our algorithm is developed directly from that in Chu et al. (1998) for the local constant

Fig. 3. Comb-pieces example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.0775 and g = 0.41. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.1225 and g = 0.41. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.1325 and g = 0.33. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.1275 and g = 0.41.

Fig. 4. Broken-sine example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.05 and g = 0.36. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.05 and g = 0.36. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.0925 and g = 0.31. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.0825 and g = 0.28.

M-smoother. It is established by iteratively taking a step of small size in the direction of steepest ascent of S_{1,p}(c; x_i). On the other hand, the computation algorithm in Rue et al. (2002) is based on solving systems of differential equations. However, Rue et al. (2002) note that their algorithm has some practical drawbacks. For example, it suffers from the problem of producing noisy peaks and troughs, and it is difficult to extend their algorithm, designed for the local linear fit in the one-dimensional case, to the multidimensional case with higher-order polynomial fits. Fortunately, by the simulation results in Section 5, these drawbacks do not occur with our computation algorithm. The procedure, written using the software GAUSS, for computing our LPM \hat m_p(x_i) is available on request, and can be found at the web page http://134.208.26.104/faculty/chu/Lpmd1.g.

To state our algorithm, we introduce the following notation. Set \nabla_c and H_c as the gradient vector and the Hessian matrix of S_{1,p}(c; x_i), respectively. Specifically, \nabla_c = (\partial/\partial c) S_{1,p}(c; x_i) and H_c = (\partial^2/\partial c^T \partial c) S_{1,p}(c; x_i). Let e_0 \le \cdots \le e_p be the eigenvalues of H_c, and v_j the unit eigenvector corresponding to e_j, for each j = 0, \ldots, p. Let v be a 1 \times (p+1) unit vector and U a value of the pseudo uniform random variable U[\varphi_1, \varphi_2] produced independently in each of the following iterative procedures. Here \varphi_1 and \varphi_2 are given constants satisfying 0 < \varphi_1 < \varphi_2 < 1/2.
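The iteration described below repeatedly evaluates S_{1,p}, \nabla_c, and H_c. For the Gaussian choice of L these have closed forms, and a direct NumPy evaluation can be sketched as follows (a minimal sketch under our notation; the helper names are ours, and the Epanechnikov K is only an example):

    import numpy as np

    def gauss(z):
        """Standard Gaussian density, the choice of L in this paper."""
        return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

    def epanechnikov(z):
        """An example kernel K supported on [-1, 1]."""
        return 0.75 * np.maximum(1.0 - z * z, 0.0)

    def S_grad_hess(c, x0, x, y, h, g):
        """Value, gradient, and Hessian of S_{1,p}(c; x0) for Gaussian L."""
        z = (x0 - x) / h                              # z_j = (x_i - x_j)/h
        w = epanechnikov(z) / (len(x) * h)            # n^{-1} h^{-1} K(z_j)
        Z = np.vander(z, N=len(c), increasing=True)   # columns z^0, z^1, ..., z^p
        u = (y - Z @ c) / g                           # scaled residuals
        Lg = gauss(u) / g                             # L_g(Y_j - Pi_{1,p}(z_j))
        dLg = -u * gauss(u) / g**2                    # L_g' at the residuals
        ddLg = (u * u - 1.0) * gauss(u) / g**3        # L_g'' at the residuals
        S = np.sum(w * Lg)
        grad = -(w * dLg) @ Z                         # dS/dc_k
        hess = (Z * (w * ddLg)[:, None]).T @ Z        # d^2 S/dc_k dc_l
        return S, grad, hess

The eigenvalues e_j and eigenvectors v_j used in the stepping rules below can then be obtained from the returned Hessian with np.linalg.eigh.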

Fig. 5. Shifted-blocks example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.055 and g = 0.27. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.045 and g = 0.33. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.065 and g = 0.31. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.0825 and g = 0.33.

Recall that the idea of the LPM \hat m_p(x_i) is to find the local maximizer \hat c = (\hat c_0, \ldots, \hat c_p) of S_{1,p}(c; x_i) whose associated \hat c_0 is closest to Y_i. Our iterative algorithm for computing \hat m_p(x_i) is based on the following ideas.

(1) By the explanation given for Fig. 2 in Section 2, the starting value of c in our iterative algorithm is taken as (Y_i, 0, \ldots, 0), a 1 \times (p+1) row vector with Y_i in the first coordinate and 0 elsewhere.

(2) If \nabla_c = 0 and e_p < 0, then the corresponding c is the location of the local maximizer we are looking for.

(3) In regions where \nabla_c = 0 and e_p > 0, using the second-order Taylor expansion, we have S_{1,p}(c + gUv; x_i) = S_{1,p}(c; x_i) + (1/2) g^2 U^2 v H_c v^T + O(g^3). By this result, v_p is the steepest ascent direction in such regions, so we take a step of size gU in the direction of v_p.

(4) In regions where \nabla_c \ne 0 and e_p < 0, that is, where S_{1,p}(c; x_i) is concave, which means that Newton's method would head towards a local maximizer, we take a step of size not bigger than gU in the direction of (-1) H_c^{-1} \nabla_c^T.

(5) In regions where \nabla_c \ne 0 and e_p > 0, using the first-order Taylor expansion, we have S_{1,p}(c + gUv; x_i) = S_{1,p}(c; x_i) + gU \nabla_c v^T + O(g^2). By this result, \nabla_c is the

Fig. 6. HeaviSine example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.03 and g = 0.56. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.035 and g = 0.42. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.0675 and g = 0.37. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.0775 and g = 0.55.

steepest ascent direction in such regions, so we take a step of size gU in the direction of \nabla_c.

(6) In each of procedures (3)-(5), the value of S_{1,p}(\cdot; x_i) at c has to increase after one step is taken. If it does not, then the size of that step is halved, and the procedure is repeated. If no such step size can be obtained, then c is the location of the local maximizer we are looking for.

(7) In each of procedures (3)-(5), the pseudo independent uniform random variable U[\varphi_1, \varphi_2] is used to avoid possible oscillation. On the other hand, choosing small values of \varphi_1 and \varphi_2 in U[\varphi_1, \varphi_2] avoids moving beyond the local maximizer we are looking for, into a region where S_{1,p} is again concave. However, the smaller the values of \varphi_1 and \varphi_2 used, the more computation time is needed.

5. Simulations

In this section, a simulation study was performed to compare the performance of our LPM and the LPE. The kernel function K used by each estimator was the Epanechnikov kernel. The values of n = 200 and p = 0, \ldots, 4 were considered.

Fig. 7. Bumps example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.0625 and g = 0.27. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.05 and g = 0.47. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.0775 and g = 0.31. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.08 and g = 0.25.

The regression errors \epsilon_i were pseudo independent normal random variables N(0, \sigma^2), where \sigma = 0.25. The seven regression functions considered include the examples of Comb-pieces (m_1), Broken-sine (m_2), and Shifted-blocks (m_3) in Rue et al. (2002), HeaviSine (m_4), Bumps (m_5), and Doppler (m_6) in Donoho and Johnstone (1994), and Linear-peak m_7(x) = 2 - 2x + 3 \exp\{-100 (x - 0.5)^2\} in Müller (1988). For each regression function, 100 independent sets of observations were generated from the regression model (2.1). Each of m_4 - m_6 was scaled such that the root signal-to-noise ratio associated with its resulting observations was 7 (Donoho and Johnstone, 1994). The root signal-to-noise ratio for observations corresponding to each of m_1, m_2, m_3, and m_7 was 6.24, 4.57, 4.26, and 4.35, respectively.

For computing the LPM, the maximum iteration number of the algorithm given in Section 4 was 500, and the values of \varphi_1 and \varphi_2 required by the algorithm were \varphi_1 = 0.1 and \varphi_2 = 0.3. For each data set, the values of \hat m_p(x_i) and CV_p(h, g) were calculated on an equally spaced grid of 189 \times 14 values of h and g in [0.03, 0.5] \times [0.1, 0.75]. Given the values of h and g, the value of the mean average squared error MASE_p(h, g) of \hat m_p(x_i) was empirically approximated by the sample average of ASE_p(h, g) = (n+1)^{-1} \sum_{i=0}^{n} \{\hat m_p(x_i) - m(x_i)\}^2 over the 100 data sets.
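A sketch of how this performance measure is computed, assuming a hypothetical LPM fitter lpm_fit (the names are ours):

    import numpy as np

    def ase(fit, truth):
        """Average squared error of one fitted curve over the evaluation
        grid (the paper normalizes by (n+1)^{-1} over i = 0, ..., n)."""
        return np.mean((fit - truth) ** 2)

    def mase(datasets, truth, h, g, lpm_fit):
        """Empirical MASE_p(h, g): average ASE over the simulated data sets.
        `lpm_fit(x, y, h, g)` is a hypothetical LPM fitter."""
        return np.mean([ase(lpm_fit(x, y, h, g), truth) for (x, y) in datasets])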

Fig. 8. Doppler example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.03 and g = 0.29. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.0425 and g = 0.33. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.035 and g = 0.28. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.0375 and g = 0.3.

After evaluation on the grid, the optimal global bandwidths h_p and g_p for \hat m_p(x_i) and the cross-validated bandwidths \hat h_p and \hat g_p for \hat m_p(x_i) were taken, respectively, as the global minimizers of MASE_p(h, g) and CV_p(h, g) on the grid. Once the optimal global bandwidths h_p and g_p were obtained, the best performance of \hat m_p(x_i) was measured by the sample mean of ASE_p(h_p, g_p) over the 100 data sets. On the other hand, the practical performance of \hat m_p(x_i) was measured by the sample mean of ASE_p(\hat h_p, \hat g_p) over the 100 data sets. The same computation procedures were applied to the LPE \hat m_p^{\#}(x_i), using the optimal global bandwidth h_p^{\#} as well as the locally varying bandwidth \hat h_p^{\#}(x_i) produced by Fan and Gijbels (1995). Let ASE_p^{\#}(h_p^{\#}) and ASE_p^{\#}\{\hat h_p^{\#}(x_i)\} be similarly defined. Their sample averages over the 100 data sets were used to measure, respectively, the best and the practical performance of \hat m_p^{\#}(x_i). The simulation results are summarized in the following figures and tables.

For each regression function and for each value of p = 0, \ldots, 3, the attainable performance of our LPM \hat m_p(x_i) is presented in Figs. 3-9. Each local M-estimate in these figures used what we considered to be a good visual choice of h and g. From Fig. 3, the regression function estimate produced by the local linear M-smoother \hat m_1(x_i) for

Fig. 9. Linear-peak example. (a) True curve. (b) Noisy curve. (c) Local constant M-estimate \hat m_0(x_i) with h = 0.06 and g = 0.75. (d) Local linear M-estimate \hat m_1(x_i) with h = 0.0475 and g = 0.75. (e) Local quadratic M-estimate \hat m_2(x_i) with h = 0.1425 and g = 0.52. (f) Local cubic M-estimate \hat m_3(x_i) with h = 0.1275 and g = 0.69.

the complicated signal Comb-pieces m_1 has the best visual performance among the four local M-estimates, in both senses of showing more accurately the locations of the discontinuity points and having smaller sample average squared error. The same remark applies to the complicated signal Bumps m_5 in Fig. 7. From Fig. 4, the regression function estimate obtained by the local constant M-smoother \hat m_0(x_i) for the discontinuous signal Broken-sine m_2 has the best visual performance. A similar conclusion can be drawn for each of the discontinuous signals Shifted-blocks m_3 in Fig. 5 and HeaviSine m_4 in Fig. 6. From Fig. 8, the regression function estimates generated by both the local quadratic and the local cubic M-smoothers \hat m_p(x_i), for p = 2, 3, for the sufficiently smooth but very complicated signal Doppler m_6 have much better visual performance than those derived by the local constant and the local linear M-smoothers \hat m_p(x_i), for p = 0, 1. This comment can also be made for the typically and sufficiently smooth signal Linear-peak m_7 in Fig. 9. Note that both the local quadratic and the local cubic M-estimates in Fig. 8 are remarkably good at the left boundary, at which the oscillations are very dense, but have a rough appearance for x_i \in [0.7, 1]. Note also that, in Fig. 9, they are extremely good for the peak in the middle, but are wiggly for x_i \in [0, 0.3] \cup [0.7, 1].

Table 1
Values of the sample means and standard deviations (given in parentheses) of ASE_p(h_p, g_p) and ASE_p(\hat h_p, \hat g_p) for our LPM \hat m_p(x_i), and those of ASE^#_p(h^#_p) and ASE^#_p{\hat h^#_p(x_i)} for the LPE \hat m^#_p(x_i). Each of these values has been multiplied by 10^2.

                   LPM \hat m_p(x_i)                           LPE \hat m^#_p(x_i)
m(x)    p    ASE_p(h_p, g_p)   ASE_p(\hat h_p, \hat g_p)   ASE^#_p(h^#_p)   ASE^#_p{\hat h^#_p(x_i)}

m_1     0    1.66 (0.47)        4.60 (3.03)                141.54 (0.30)    152.24 (0.83)
        1    1.31 (0.46)        9.40 (4.42)                141.67 (0.29)    153.81 (0.75)
        2    1.70 (0.56)       17.00 (5.62)                133.95 (0.31)    153.50 (0.52)
        3    1.98 (0.64)       21.56 (8.63)                134.24 (0.34)    154.86 (0.69)
        4    2.05 (0.69)       16.63 (4.93)                130.49 (0.42)    151.13 (0.51)

m_2     0    1.18 (0.30)        1.55 (0.98)                 18.37 (0.39)     19.50 (1.63)
        1    1.61 (0.44)        4.74 (2.06)                 18.38 (0.39)     21.81 (3.93)
        2    1.63 (0.49)        4.54 (2.32)                 10.68 (0.34)     19.12 (6.59)
        3    1.98 (0.59)        5.28 (3.29)                 10.72 (0.35)     20.11 (4.69)
        4    2.00 (0.62)        4.35 (1.75)                  7.27 (0.36)     21.55 (3.26)

m_3     0    1.14 (0.33)        5.54 (2.52)                 14.01 (0.37)     15.23 (0.42)
        1    1.93 (0.44)        7.13 (2.98)                 14.04 (0.37)     15.35 (0.43)
        2    2.24 (0.68)        7.82 (2.24)                  8.89 (0.31)     11.94 (0.80)
        3    2.57 (0.53)        8.93 (2.79)                  8.93 (0.31)     12.65 (0.90)
        4    2.53 (0.61)        8.84 (2.17)                  6.94 (0.31)     11.82 (1.18)

m_4     0    1.49 (0.28)        1.49 (0.28)                  1.77 (0.22)      1.92 (0.25)
        1    1.64 (0.32)        1.64 (0.31)                  1.65 (0.22)      1.88 (0.27)
        2    1.57 (0.33)        1.62 (0.32)                  1.67 (0.22)      2.15 (0.30)
        3    1.66 (0.24)        1.88 (0.33)                  1.71 (0.23)      2.48 (0.32)
        4    1.76 (0.25)        1.81 (0.31)                  1.77 (0.24)      2.90 (0.30)

m_5     0    3.53 (0.53)        4.46 (0.83)                209.89 (0.55)    266.46 (1.98)
        1    3.46 (0.52)        4.47 (0.86)                209.92 (0.56)    263.28 (2.81)
        2    4.07 (0.57)        6.85 (1.11)                178.91 (0.72)    262.10 (1.42)
        3    3.98 (0.62)       28.47 (9.70)                178.95 (0.72)    260.14 (2.49)
        4    4.08 (0.72)       22.00 (9.55)                147.02 (0.84)    257.57 (5.78)

m_6     0    6.12 (0.65)        8.91 (3.29)                 36.58 (0.64)     39.09 (0.78)
        1    4.62 (0.70)        8.86 (3.06)                 36.59 (0.64)     39.17 (0.83)
        2    4.34 (0.71)       12.35 (5.32)                 17.52 (0.44)     23.09 (0.69)
        3    4.14 (0.83)        9.47 (3.67)                 17.60 (0.46)     22.95 (0.97)
        4    4.58 (0.68)       12.07 (4.17)                 11.76 (0.41)     20.10 (1.28)

m_7     0    0.59 (0.17)        0.63 (0.17)                  0.56 (0.17)      0.46 (0.16)
        1    0.60 (0.18)        0.69 (0.25)                  0.59 (0.18)      0.46 (0.19)
        2    0.46 (0.17)        0.54 (0.21)                  0.46 (0.16)      0.42 (0.17)
        3    0.49 (0.17)        0.61 (0.30)                  0.48 (0.17)      0.44 (0.18)
        4    0.49 (0.17)        0.57 (0.24)                  0.46 (0.16)      0.56 (0.20)

Such results in Figs. 8 and 9 indicate that we should develop both the idea of a locally varying smoothing parameter and that of locally varying order polynomial fitting for the LPM. On the other hand, by comparing Figs. 1 and 3-9 in this paper with Figs. 1, 4, and 5 in Rue et al.

(2002), it is clear that our computation algorithm for the LPM is better than that in Rue et al. (2002) for the local linear M-smoother, in the sense of yielding better visual performance.

Table 1 contains both the best and the practical performance of each estimator. For each p = 0, \ldots, 4, it shows that our LPM \hat m_p(x_i) has much better best and practical performance than the LPE \hat m_p^{\#}(x_i) for each interesting case m_1 - m_5 where the underlying signal is discontinuous or complicated. The same remark applies to the sufficiently smooth but very complicated signal m_6. Also, it shows that the LPE has best performance similar to, and practical performance better than, the LPM for the typically and sufficiently smooth signal m_7. This result indicates that our LPM has potential in dealing with the typically smooth signal.

From Table 1, we have the following results for our LPM \hat m_p(x_i) with p = 0, \ldots, 4. First, considering the practical performance, \hat m_0(x_i) is better for m_1, m_2, m_3, and m_4; \hat m_0(x_i) and \hat m_1(x_i) perform equally well and are better for m_5 and m_6; and \hat m_2(x_i) is better for m_7. By this, we have some practical recommendations for our LPM using cross-validated bandwidths. For both discontinuous and complicated signals, \hat m_0(x_i) is advisable, but \hat m_p(x_i) with local polynomials of order p \ge 2 does not bring any advantages. On the other hand, for the typically and sufficiently smooth signal, \hat m_p(x_i) with local polynomials of order p \ge 2 could be considered, although the LPE is advisable in that case. Second, considering the best performance, \hat m_0(x_i) is better for the discontinuous signals m_2, m_3, and m_4; \hat m_1(x_i) is better for the complicated signals m_1 and m_5; and \hat m_p(x_i) with local polynomials of order p \ge 2 is better for the smooth signals m_6 and m_7. Finally, by the results in these two remarks and by the comments given in Section 2 for Figs. 1c and d, the performance of \hat m_p(x_i) using cross-validated bandwidths does not coincide with that employing optimal bandwidths. Hence, our suggested cross-validation criterion is not a very good bandwidth selector for the LPM. In this case, our LPM requires another bandwidth selector so that its resulting practical performance coincides with its best performance.

Acknowledgements

The author is grateful to the referees, the associate editor, and the coordinating editor for particularly constructive comments on the first version of this article.

References

Besl, P.J., Birch, J.B., Watson, L.T., 1989. Robust window operators. Mach. Vision Appl. 2, 179-191.
Chiu, S.T., 1987. A feature-preserving smoother which preserves abrupt changes in mean. Unpublished paper, Department of Statistics, Colorado State University.
Chu, C.K., Glad, I., Godtliebsen, F., Marron, J.S., 1998. Edge-preserving smoothers for image processing (with discussion). J. Amer. Statist. Assoc. 93, 526-556.
Cline, D.B.H., Hart, J.D., 1991. Kernel estimation of densities with discontinuities or discontinuous derivatives. Statistics 22, 69-84.
Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425-455.

Eubank, R.L., 1988. Spline Smoothing and Nonparametric Regression. Marcel Dekker Inc., New York.
Fan, J., 1992. Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87, 998-1004.
Fan, J., 1993. Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196-216.
Fan, J., Gijbels, I., 1995. Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57, 371-394.
Fan, J., Gijbels, I., 1996. Local Polynomial Modeling and Its Application: Theory and Methodologies. Chapman & Hall, New York.
Fan, J., Jiang, J., 1999. Variable bandwidth and one-step local M-estimator. Sci. China Ser. A 29, 1-15.
Fan, J., Gasser, T., Gijbels, I., Brockmann, M., Engel, J., 1993. Local polynomial fitting: a standard for nonparametric regression. Discussion paper 9315, Institut de Statistique, Université Catholique de Louvain, Belgium.
Fan, J., Hu, T., Truong, Y., 1994. Robust non-parametric function estimation. Scand. J. Statist. 21, 433-446.
Hall, P., Titterington, D.M., 1992. Edge-preserving and peak-preserving smoothing. Technometrics 34, 429-440.
Hampel, F., Ronchetti, E., Rousseeuw, P., Stahel, W., 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Härdle, W., 1990. Applied Nonparametric Regression. Cambridge University Press, Cambridge.
Härdle, W., 1991. Smoothing Techniques: With Implementation in S. Springer Series in Statistics, Springer, Berlin.
Härdle, W., Gasser, T., 1984. Robust nonparametric function fitting. J. Roy. Statist. Soc. Ser. B 46, 42-51.
Huber, P.J., 1981. Robust Statistics. Wiley, New York.
Lai, C.J., Chu, C.K., 2001. Triple smoothing estimation of the regression function and its derivatives in nonparametric regression. J. Statist. Plann. Inference 98, 157-175.
Lee, D., 1990. Coping with discontinuities in computer vision: their detection, classification, and measurement. IEEE Trans. Pattern Anal. Machine Intell. 12, 321-344.
Lejeune, M., Sarda, P., 1992. Smooth estimators of distribution and density functions. Comput. Statist. Data Anal. 14, 457-471.
McDonald, J.A., Owen, A.B., 1986. Smoothing with split linear fits. Technometrics 28, 195-208.
Müller, H.G., 1988. Nonparametric Regression Analysis of Longitudinal Data. Lecture Notes in Statistics, Vol. 46. Springer, Berlin.
Müller, H.G., 1992. Change-points in nonparametric regression analysis. Ann. Statist. 20, 737-761.
Qiu, P., Asaro, C., Li, X., 1992. Estimation of jump regression function. Bull. Inform. Cybernet. 24, 197-212.
Rue, H., Chu, C.K., Godtliebsen, F., Marron, J.S., 2002. M-smoother with local linear fit. J. Nonparametric Statist. 14, 155-168.
Ruppert, D., 1997. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Amer. Statist. Assoc. 92, 1049-1062.
Ruppert, D., Wand, M.P., 1994. Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346-1370.
Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Shiau, J.H., 1985. Smoothing spline estimation of functions with discontinuities. Ph.D. Dissertation, Department of Statistics, University of Wisconsin, Madison, WI, unpublished.
Simonoff, J.S., 1996. Smoothing Methods in Statistics. Springer, New York.
Tsybakov, A.B., 1986. Robust reconstruction of functions by the local approximation method. Problems Inform. Transmission 22, 133-146.
van Eeden, C., 1985. Mean integrated squared error of kernel estimators when the density and its derivatives are not necessarily continuous. Ann. Inst. Statist. Math. 37, Part A, 461-472.
van Es, B., 1992. Asymptotics for least squares cross-validation bandwidths in non-smooth cases. Ann. Statist. 20, 1647-1657.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, London.
Wu, J.S., Chu, C.K., 1993. Nonparametric function estimation and bandwidth selection for discontinuous regression functions. Statistica Sinica 3, 557-576.