A comparison of six smoothers when there are multiple predictors


Statistical Methodology 2 (2005) 49–57 www.elsevier.com/locate/stamet

Rand R. Wilcox
Department of Psychology, University of Southern California, Los Angeles, CA 90089-1061, United States
Received 10 May 2004

Abstract

The paper compares six smoothers, in terms of mean squared error and bias, when there are multiple predictors and the sample size is relatively small. The focus is on methods that use robust measures of location (primarily a 20% trimmed mean) and where there are four predictors. To add perspective, some methods designed for means are included. The smoothers include the locally weighted (loess) method derived by Cleveland and Devlin [W.S. Cleveland, S.J. Devlin, Locally-weighted regression: an approach to regression analysis by local fitting, Journal of the American Statistical Association 83 (1988) 596–610], a variation of a so-called running interval smoother where distances from a point are measured via a particular group of projections of the data, a running interval smoother where distances are measured based in part on the minimum volume ellipsoid estimator, a generalized additive model based on the running interval smoother, a generalized additive model based on the robust version of the smooth derived by Cleveland [W.S. Cleveland, Robust locally weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74 (1979) 829–836], and a kernel regression method stemming from [J. Fan, Local linear smoothers and their minimax efficiencies, The Annals of Statistics 21 (1993) 196–216]. When the regression surface is a plane, the method stemming from Fan was found to dominate, and indeed offers a substantial advantage in various situations, even when the error term has a heavy-tailed distribution. However, if there is curvature, this method can perform poorly compared to the other smoothers considered. In that case the projection-type smoother used in conjunction with a 20% trimmed mean is recommended, with the minimum volume ellipsoid method a close second.
© 2004 Elsevier B.V. All rights reserved.

doi:10.1016/j.stamet.2004.11.004

Keywords: Robust estimation; Generalized additive models; Multivariate measures of depth

1. Introduction

Consider a random sample of n vectors from some (p + 1)-variate distribution: (Y_i, X_i1, ..., X_ip), i = 1, ..., n. Let m(X) be some conditional measure of location associated with Y, given X = (X_1, ..., X_p). A fundamental goal is estimating m(X) without specifying any form for the function m, and there are a variety of so-called smoothers aimed at addressing this problem. Certainly one of the better-known methods is a locally weighted regression method called loess introduced by Cleveland and Devlin [3]; it represents a natural extension of a smooth introduced by Cleveland [2] which was restricted to the case p = 1.

A basic issue here is whether, when n is relatively small, there are situations where some alternative smoother offers a distinct advantage in terms of mean squared error or bias. When the error term is heavy-tailed, an obvious concern is that methods based on estimating the conditional mean of Y might have large mean squared error relative to a method based on a robust measure of location. There are several methods for estimating a robust measure of location for Y, given X, but virtually nothing is known about how they compare in terms of mean squared error. Preliminary simulations suggested that for the case p = 1, there is little difference. But with p = 4, it was found that the choice of method can matter. Indeed, among all situations considered, two smoothers were found to be uniformly superior to loess in terms of mean squared error, even when the error term has a normal distribution, and for various situations the improvement was substantial. Another technique (stemming from [4]) was found to be substantially more effective than all other methods in some situations, but in other instances it performs poorly. A generalized additive model based on the running interval smoother was considered, but it is not recommended for general use (and a variation, based in part on the robust smoother proposed by Cleveland [2], is not recommended either).

A general concern with the case p > 1 is the so-called curse of dimensionality: neighborhoods with a fixed number of points become less local as the dimensions increase [1]. So a practical issue is how much is gained or lost by using a smooth having the form $\hat{Y} = m(\mathbf{X})$ versus a generalized additive model where

$$\hat{Y} = \beta_0 + m_1(X_1) + \cdots + m_p(X_p), \tag{1}$$

and m_j is some unknown function of X_j, j = 1, ..., p. Conditions are found where smooths having the form $\hat{Y} = m(\mathbf{X})$ offer a distinct advantage over (1), even when data are generated from an additive model, while in other situations the reverse is true.

2. Description of the alternative estimators

This section provides the details of the smooths to be compared, some of which are based on a 20% trimmed mean. Consider the values Y_1, ..., Y_n and put the observations in ascending order, yielding Y_(1) ≤ ··· ≤ Y_(n). Let g = [.2n], where [·] is the greatest integer function. Then the 20% trimmed mean is

$$\bar{Y}_t = \frac{1}{n - 2g} \sum_{i=g+1}^{n-g} Y_{(i)}.$$
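In R, this trimmed mean can be computed directly; a minimal sketch, using base R's mean() with the trim argument, which drops the [.2n] smallest and [.2n] largest order statistics before averaging, matching the definition above:

```r
y <- c(2.1, 3.4, 2.8, 9.7, 3.0, 2.5, 3.3, 2.9, 3.1, 45.2)  # toy data with outliers
ybar.t <- mean(y, trim = 0.2)   # drops the [.2 * 10] = 2 smallest and 2 largest values
```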

In terms of efficiency, 20% trimming generally performs very well under normality and continues to perform well in situations where the sample mean performs poorly (e.g., [10]). Of course, other robust measures of location might be considered, such as an M-estimator with Huber's Ψ. Comments about using this M-estimator are given in Section 3.

To describe the Cleveland and Devlin [3] method, it helps to first describe the method in [2], which is restricted to p = 1. Briefly, given X, the method looks for a prespecified number of points among the X_i values that are close to X. It then scales these distances, yielding values in the half-open interval [0, 1), and these scaled values are transformed via the tricube function, yielding weights, which in turn yield a weighted mean of the Y values that estimates the mean of Y, given X. More precisely, let δ_i = |X_i − X|, i = 1, ..., n, and let δ_(1) ≤ ··· ≤ δ_(n) be the δ_i values written in ascending order. Choose some constant κ, 0 ≤ κ < 1, and let K be the value of κn rounded to the nearest integer. (The default value for κ used by S-PLUS is 0.75.) Set

$$Q_i = \frac{|X - X_i|}{\delta_{(K)}},$$

and if 0 ≤ Q_i < 1, set w_i = (1 − Q_i^3)^3; otherwise w_i = 0. Finally, use weighted least squares regression to estimate m(X_i) using the w_i as weights. Cleveland suggests a robust extension that is applied as follows. Let r_i = Y_i − m(X_i) and let M_r be the median of these residuals. Let γ_i = B(r_i/(6M_r)), where B(x) = (1 − x^2)^2 for |x| < 1; otherwise B(x) = 0. Then m(X_i) is recomputed with weights γ_i w_i; this yields a new set of residuals, and the process is repeated until convergence.

Now consider the more general case p ≥ 1. Cleveland and Devlin [3] proceed as follows. Let η(X, X_i) be the Euclidean distance between X and X_i, and let d(X) be the distance of the Kth-nearest X_i to X. Now Q_i = η(X, X_i)/d(X) and the w_i are used as weights in weighted least squares to compute m(X). The computations are performed by the R and S-PLUS functions loess, and for convenience this will be called method L henceforth.
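To make the weighting concrete, here is a minimal R sketch of the p = 1 local fit just described; it is an illustration only, the helper name is hypothetical, and in practice the built-in loess and lowess functions carry out these computations.

```r
# Local tricube-weighted least squares at a single point x0 (p = 1), as described above.
local.tricube.fit <- function(x, y, x0, kappa = 0.75) {
  n <- length(x)
  K <- round(kappa * n)                  # number of nearest neighbours retained
  delta <- abs(x - x0)                   # distances from x0
  Q <- delta / sort(delta)[K]            # scaled distances
  w <- ifelse(Q < 1, (1 - Q^3)^3, 0)     # tricube weights; zero outside the neighbourhood
  fit <- lm(y ~ x, weights = w)          # weighted least squares
  unname(predict(fit, data.frame(x = x0)))  # estimate of m(x0)
}
# For p >= 1 (method L), one would instead call, e.g., loess(y ~ x1 + x2 + x3 + x4, span = 0.75).
```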

The first alternative smoother is a generalized additive model based on a so-called running interval smoother. It is designed to estimate the conditional 20% trimmed mean of Y. Momentarily consider p = 1 and for convenience write the random sample as (Y_1, X_1), ..., (Y_n, X_n). Let κ_1 be some constant that is chosen in a manner to be described. The constant κ_1 is called the span (and the subscript 1 is used merely to indicate that it is the span associated with the first of the alternative methods to be considered). The median absolute deviation (MAD), based on X_1, ..., X_n, is the median of the n values |X_1 − M|, ..., |X_n − M|, where M is the usual sample median. Let MADN = MAD/.6745. Under normality, MADN estimates σ, the standard deviation. Then the point X is said to be close to X_i if |X_i − X| ≤ κ_1 × MADN. Thus, for normal distributions, X is close to X_i if X is within κ_1 standard deviations of X_i. Then Ŷ_i is just the 20% trimmed mean of the Y_j values for which X_j is close to X_i. In exploratory work, a good choice for the span is often κ_1 = .8 or 1 [13], and κ_1 = 1 is used here. Note that unlike Cleveland's method, the running interval smoother uses all points close to X_i when computing Ŷ_i rather than the K nearest points, K fixed.

Virtually any smoother can be extended to the generalized additive model given by (1) using the backfitting algorithm in [7]. Set k = 0 and let m_j^0 be some initial estimate of m_j. Here, m_j^0 = S_j(Y | X_j), where S_j(Y | X_j) is the running interval smooth based on the jth predictor, ignoring the other p − 1 predictors that are available. Next, iterate as follows (a code sketch follows the steps below).

1. Increment k by 1.
2. For each j, j = 1, ..., p, let
$$m_j^k = S_j\Bigl(Y - \sum_{\ell \neq j} m_\ell^{k-1} \,\Big|\, X_j\Bigr).$$

3. Repeat steps 1 and 2 until convergence.
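Below is a minimal R sketch of the running interval smoother that plays the role of S_j in the steps above, using the stated choices (κ_1 = 1 and a 20% trimmed mean); it is a sketch for illustration, not the author's adrun function.

```r
# Running interval smoother for p = 1: for each X_i, take the 20% trimmed mean of the
# Y_j values whose X_j lies within kappa1 * MADN of X_i.
run.interval <- function(x, y, kappa1 = 1) {
  madn <- mad(x)   # R's mad() already rescales by 1/.6745 (constant = 1.4826)
  sapply(x, function(x0) {
    close <- abs(x - x0) <= kappa1 * madn   # all points close to x0
    mean(y[close], trim = 0.2)              # 20% trimmed mean of the corresponding Y values
  })
}
```

In the backfitting iteration, each S_j would be this smoother applied to the partial residuals for the jth predictor.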

Finally, estimate β_0 with the 20% trimmed mean of the values $Y_i - \sum_j m_j^k(Y_i \mid X_{ij})$, i = 1, ..., n. This will be called method R. The computations can be performed with the software R or S-PLUS in conjunction with the function adrun in [13].

A simple alternative to method R is to replace the running interval smoother with the robust version of the smoother derived by Cleveland [2], which is computed by the built-in S-PLUS function lowess. This will be called method RL. Here the default choice for the span used by the software R, κ_2 = .75, is used.

The next method is a simple extension of the kernel regression method derived by Fan [4] to p > 1 and is aimed at estimating the conditional mean of Y. (The extension stems from basic results on kernel density estimators covered by Silverman [12].) It is included here mainly because there are situations where it performed well even when the error term has a heavy-tailed distribution. For convenience, let x_i = X_i / min(s, (q_2 − q_1)/1.34), where s, q_2 and q_1 are, respectively, the standard deviation and the upper and lower quartiles based on X_1, ..., X_n. (Here, the quartiles are estimated via the so-called ideal fourths; see [6,13].) If x'x < 1, the multivariate Epanechnikov kernel is

$$K_e(\mathbf{x}) = \frac{(p + 2)(1 - \mathbf{x}'\mathbf{x})}{2c_p};$$

otherwise K_e(x) = 0. Here, c_p is the volume of the unit p-sphere: c_1 = 2, c_2 = π, and for p > 2, c_p = 2πc_{p−2}/p. Following [12, p. 86], the span is taken to be κ_3 = A(p)n^{−1/(p+4)}, where A(1) = 1.77, A(2) = 2.78 and for p > 2,

$$A(p) = \left\{ \frac{8p(p + 2)(p + 4)(2\sqrt{\pi})^p}{(2p + 1)c_p} \right\}^{1/(p+4)}.$$
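For reference, a small R sketch of the kernel and span constant just defined; this is an illustration under the formulas above, not the exact implementation used in the simulations.

```r
# Volume of the unit p-sphere: c_1 = 2, c_2 = pi, c_p = 2*pi*c_{p-2}/p for p > 2.
cp <- function(p) if (p == 1) 2 else if (p == 2) pi else 2 * pi * cp(p - 2) / p

# Multivariate Epanechnikov kernel; x is a numeric vector of length p.
Ke <- function(x) {
  p <- length(x)
  if (sum(x^2) < 1) (p + 2) * (1 - sum(x^2)) / (2 * cp(p)) else 0
}

# Silverman's constant A(p); the span is then kappa3 = Ap(p) * n^(-1/(p + 4)).
Ap <- function(p) {
  if (p == 1) return(1.77)
  if (p == 2) return(2.78)
  (8 * p * (p + 2) * (p + 4) * (2 * sqrt(pi))^p / ((2 * p + 1) * cp(p)))^(1 / (p + 4))
}
```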


For fixed i, let w_j = K_e((x_j − x_i)/κ_3). Then the estimated mean of Y, m(X_i), is β_0 + β_1 X_1 + ··· + β_p X_p, where β_0, β_1, ..., β_p are estimated via weighted least squares with weights w_1, ..., w_n. This will be called method K.

The next method is of the form m(X) and represents a generalization of the running interval method based on the minimum volume ellipsoid (MVE) estimate of location and scatter (e.g., [11]). Now the goal is to estimate the conditional 20% trimmed mean of Y. The basic strategy behind the MVE estimator is, among all ellipsoids containing half of the data, to search for the one having the smallest volume and then, based on this subset, compute the mean and the usual covariance matrix. Typically the covariance matrix is rescaled to obtain consistency at the multivariate normal model (e.g., [9, p. 254]). Generally it is difficult to find the smallest ellipsoid containing half of the data, but effective approximations are available via functions in the software R as well as the commercial software S-PLUS and SAS. Let M be the MVE estimate of scatter. Then a measure of the distance between X_i and X_j is

$$D_{ij} = \sqrt{(\mathbf{X}_i - \mathbf{X}_j)'\mathbf{M}^{-1}(\mathbf{X}_i - \mathbf{X}_j)}.$$

The value of m(X_i) is simply the 20% trimmed mean of all Y_j values such that X_j is close to X_i. That is, compute the 20% trimmed mean of all the Y_j values for which the subscript j satisfies D_ij ≤ κ_4. The choice κ_4 = 1 typically gives good results and is used here. This will be called method M.

A variation of method M is to judge distances from the point X_k using a particular collection of projections based on X_1, ..., X_n. Roughly, when judging the distance of points from X_k, k fixed, fix i and orthogonally project all n points onto the line between X_k and X_i, and compute the distance among the projected points between X_k and X_j, j = 1, ..., n. Repeat this for each i, i = 1, ..., n. The maximum distance between X_k and X_j, among these n projections, is the projection distance between X_k and X_j. The notion of using projection distances is not new [13]; it is related to projection-type methods for measuring the depth of a point in a multivariate cloud, but its use when dealing with smoothers appears to be new. More formally, for any i, i = 1, ..., n, let U_i = X_i − X_k, B_i = U_i'U_i, and for any j (j = 1, ..., n) let

$$W_{ij} = \sum_{k=1}^{p} U_{ik} U_{jk}, \qquad
T_{ij} = \frac{W_{ij}}{B_i}\,(U_{i1}, \ldots, U_{ip}),$$

and

$$D_{ij} = \|T_{ij}\|. \tag{2}$$
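As an illustration of (2), here is a small R sketch that computes the unscaled distances D_ij from a fixed point X_k; the helper name is hypothetical and this is not the author's runpd function. The rescaling and maximization that complete the projection distance are described next.

```r
# Projection distances of Eq. (2) for a fixed row index k of the n x p data matrix X:
# D[i, j] is the distance of the projection of X_j onto the line through X_k and X_i.
proj.dist <- function(X, k) {
  U <- sweep(X, 2, X[k, ])     # U_i = X_i - X_k, stored as rows
  B <- rowSums(U^2)            # B_i = U_i' U_i
  W <- U %*% t(U)              # W_ij = sum over coordinates of U_ik * U_jk
  D <- abs(W) / sqrt(B)        # ||T_ij|| = |W_ij / B_i| * ||U_i|| = |W_ij| / sqrt(B_i)
  D[k, ] <- 0                  # row i = k is degenerate (B_k = 0); set to 0 by convention here
  D
}
```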

So for the projection of the data onto the line connecting X_i and X_k, D_ij is the distance between X_j and X_i. Now let

$$d_{ij} = \frac{D_{ij}}{q_2 - q_1}, \tag{3}$$

where for fixed i, q_2 and q_1 are the upper and lower quartiles, respectively. Here, the quartiles are estimated based on the values D_i1, ..., D_in. For fixed j, the projection distance associated with X_j from X_k, say pd(X_j), is the maximum value of d_ij, the maximum being taken over i = 1, ..., n. Now m(X_i) is the 20% trimmed mean of all the Y_j values for which the subscript j satisfies pd(X_j) ≤ κ_5, and κ_5 = 1 is used here. This will be called method P.

For completeness, another projection-type method, aimed at estimating the conditional mean of Y, was derived by Friedman and Stuetzle [5]. In some cases it was found to offer a bit of an advantage over method L (loess), but when dealing with error terms having heavy-tailed distributions, it did not perform as well as method K or the robust methods considered here, so further details are omitted.

3. Simulation results

Details regarding the simulations are as follows. Observations were generated where the marginal distributions have a g-and-h distribution [8], which includes the normal distribution as a special case. More precisely, observations Z_ij (i = 1, ..., n; j = 1, 2) were initially generated from a multivariate normal distribution having a common correlation ρ, and then the marginal distributions were transformed to

$$X_{ij} = \begin{cases} \dfrac{\exp(g Z_{ij}) - 1}{g}\, \exp(h Z_{ij}^2/2), & \text{if } g > 0, \\[1ex] Z_{ij}\, \exp(h Z_{ij}^2/2), & \text{if } g = 0, \end{cases}$$

where g and h are parameters that determine the third and fourth moments. The four g-and-h distributions that were used were the standard normal (g = h = 0), a symmetric heavy-tailed distribution (g = 0, h = .2), an asymmetric distribution with relatively light tails (g = .2, h = 0), and an asymmetric distribution with heavy tails (g = h = .2). Here, two choices for ρ were considered: 0 and .8. Table 1 shows the skewness (γ_1) and kurtosis (γ_2) for each distribution considered. Additional properties of the g-and-h distribution are summarized by Hoaglin [8].

Table 1
Some properties of the g-and-h distribution

g     h     γ1      γ2
0.0   0.0   0.00    3.00
0.0   0.2   0.00    3.68
0.2   0.0   0.61    8.9
0.2   0.2   2.81    155.98
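A brief R sketch of this data-generation step is given below; the sample size, number of predictors, correlation value, and the use of MASS::mvrnorm are illustrative assumptions rather than the exact settings of the reported simulations.

```r
library(MASS)   # for mvrnorm()

# g-and-h transformation of a standard normal value z, as defined above.
ghtrans <- function(z, g = 0, h = 0) {
  if (g > 0) (exp(g * z) - 1) / g * exp(h * z^2 / 2) else z * exp(h * z^2 / 2)
}

n <- 40; p <- 4; rho <- 0.8                      # illustrative choices
Sigma <- matrix(rho, p, p); diag(Sigma) <- 1     # common correlation rho
Z <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)   # multivariate normal Z_ij
X <- apply(Z, 2, ghtrans, g = 0.2, h = 0.2)      # transform marginals to g-and-h
```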


Table 2
Estimated mean squared error, Y = ⋯

                            ρ = 0                       ρ = .8
g     h     g     h     G1     G2     G3     G4      G1     G2     G3     G4
0.0   0.0   0.0   0.0   1.25   1.27   1.77   3.12    1.98   1.70   0.73    4.42
0.0   0.0   0.0   0.2   2.19   2.22   3.32   3.09    2.00   1.69   0.82    4.83
0.0   0.0   0.2   0.0   1.21   1.24   4.04   3.01    1.98   1.69   0.01    4.10
0.0   0.0   0.2   0.2   1.21   1.25   5.17   3.04    2.00   1.69   0.01    4.12
0.0   0.2   0.0   0.0   2.19   2.19   2.81   5.64    4.83   3.82   1.60   11.33
0.0   0.2   0.0   0.2   2.18   2.22   3.32   5.61    4.71   4.11   0.96   11.49
0.0   0.2   0.2   0.0   2.19   2.25   4.92   5.40    4.38   3.82   0.00    9.31
0.0   0.2   0.2   0.2   2.19   2.23   5.44   5.49    4.38   3.80   0.00    9.32
0.2   0.0   0.0   0.0   1.25   1.28   3.92   3.21    2.09   1.87   0.00    4.51
0.2   0.0   0.0   0.2   1.25   1.28   4.83   3.23    2.12   1.78   0.00    4.51
0.2   0.0   0.2   0.0   1.25   1.28   3.97   3.09    2.10   1.78   0.00    4.31
0.2   0.0   0.2   0.2   1.25   1.28   4.99   3.12    2.13   1.79   0.00    4.37
0.2   0.2   0.0   0.0   2.26   2.30   5.00   5.76    4.72   4.11   0.00   10.43
0.2   0.2   0.0   0.2   2.25   2.29   5.52   5.79    4.76   4.11   0.00   10.06
0.2   0.2   0.2   0.0   2.25   2.29   5.08   5.55    4.75   4.14   0.00   10.06
0.2   0.2   0.2   0.2   2.25   2.29   5.53   5.62    4.86   4.18   0.00   10.29

Let Ŷ be the estimated value of m(X_i). For each replication, estimates of Y were computed for each X_i, yielding Ŷ_i. Let

$$V = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{Y}_i - m(\mathbf{X}_i)\bigr)^2.$$

(When the error term has a skewed distribution, m(X_i) was determined via numerical quadrature.) Bias was estimated with

$$\frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{Y}_i - m(\mathbf{X}_i)\bigr).$$

A total of 1000 replications was used to estimate E(V) and the expected bias. Bias was found to be negligible, so no results are reported. For convenience, the estimates of E(V) corresponding to methods L, P, M, R, and K are labeled S_1^2, S_2^2, S_3^2, S_4^2, and S_5^2. For brevity, results for method RL are not reported; in some situations it performed a bit better than method R, but in others it was a bit worse.

The initial set of simulations generated data from the model Y = ⋯ with n = 40. Table 2 reports the values of G_i = S_1^2/S_{i+1}^2, i = 1, ..., 4. Note that among all distributions considered, both G_1 and G_2 always have values greater than one, indicating that methods P and M improve upon method L in terms of mean squared error. Method R performs well when ρ = 0 but poorly when ρ = .8. Based solely on Table 2, method K performed the best. (As a partial check on the case where the regression surface is a plane, simulations were run with Y = X_1 − X_2 + ε as well as Y = X_1 + X_2 + X_3 + X_4 + ε, where ε denotes the error term, yielding results similar to those in Table 2.)
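For reference, a minimal R sketch of these two accuracy measures for a single replication; yhat and mtrue are hypothetical vectors holding the estimates Ŷ_i and the true values m(X_i) at the design points.

```r
V    <- mean((yhat - mtrue)^2)   # squared-error measure V defined above
bias <- mean(yhat - mtrue)       # bias measure defined above
```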

56

R.R. Wilcox / Statistical Methodology 2 (2005) 49–57

However, as we move toward situations where there is curvature, the advantages of method K in terms of mean squared error can diminish, and in some cases method L gives better results. For example, under normality with Y = X_1 X_2 + ε, the values for G_1, ..., G_4 were 1.20, 1.23, 0.79 and 1.15 with ρ = 0, and 1.78, 1.67, 0.21 and 1.30 with ρ = .8. Now methods P and M are best, although method K performs reasonably well. But if Y = X_1 X_2 + X_3^2 + ε, the values for G are 1.17, 1.21, 0.76, and 0.71 for ρ = 0, and 1.53, 1.69, 0.00, and 0.67 for ρ = .8. So now methods P and M are best, with K not performing well.

As is evident, when using methods R, M, and P, it is a simple matter to replace 20% trimmed means with an M-estimator based on Huber's Ψ. However, direct comparisons with the other methods considered here are impossible. The reason is that when using methods R, M, and P, typically one or more points are isolated. That is, when computing m(X_i), it might be that there are no points X_j, j ≠ i, flagged as being close to X_i, other than X_i itself. The result is that these methods attempt to compute this particular M-estimator based on a single observation, but because it depends in part on a measure of variation (MAD), it cannot be computed. But otherwise, of course, this particular M-estimator is a viable option.

4. Concluding remarks

A basic issue here was whether some smoother could be found that uniformly improves on method L (loess) and, among the situations considered here, two such smoothers were found in the four-predictor case. In some cases methods P and M have mean squared error less than half the mean squared error associated with method L. In fairness, altering the spans of the various smoothers can alter the mean squared error. Here it is found that, using default settings for the spans, the choice of smoother can make a difference. Method K can be substantially better than all other smoothers considered when the regression surface is a plane, but there are situations where methods P and M beat method K.

Another basic issue was how generalized additive models compare to estimates having the form m(X). When data are not generated from a generalized additive model, not surprisingly, estimation via a generalized additive model can be unsatisfactory. When data are generated from a generalized additive model, estimation via a generalized additive model performed well compared to methods P and M when ρ = 0, but it did not compete well when ρ = .8, at least with p = 4. That is, even when data are generated from a generalized additive model, some alternative estimator might offer a practical advantage. Perhaps with p > 4 a generalized additive model will be more advantageous, but this remains to be seen.

Finally, an R or S-PLUS function that performs method P is available from the author upon request. Please ask for the function runpd.

References

[1] R.E. Bellman, Adaptive Control Processes, Princeton University Press, 1961.
[2] W.S. Cleveland, Robust locally weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74 (1979) 829–836.


[3] W.S. Cleveland, S.J. Devlin, Locally-weighted regression: an approach to regression analysis by local fitting, Journal of the American Statistical Association 83 (1988) 596–610.
[4] J. Fan, Local linear smoothers and their minimax efficiencies, The Annals of Statistics 21 (1993) 196–216.
[5] J.H. Friedman, W. Stuetzle, Projection pursuit regression, Journal of the American Statistical Association 76 (1981) 817–823.
[6] M. Frigge, D.C. Hoaglin, B. Iglewicz, Some implementations of the Boxplot, American Statistician 43 (1989) 50–54.
[7] T.J. Hastie, R.J. Tibshirani, Generalized Additive Models, Chapman and Hall, New York, 1990.
[8] D.C. Hoaglin, Summarizing shape numerically: the g-and-h distribution, in: D. Hoaglin, F. Mosteller, J. Tukey (Eds.), Exploring Data Tables, Trends, and Shapes, Wiley, New York, 1985.
[9] A. Marazzi, Algorithms, Routines, and S Functions for Robust Statistics, Chapman and Hall, New York, 1993.
[10] J.L. Rosenberger, M. Gasko, Comparing location estimators: trimmed means, medians, and trimean, in: D. Hoaglin, F. Mosteller, J. Tukey (Eds.), Understanding Robust and Exploratory Data Analysis, Wiley, New York, 1983, pp. 297–336.
[11] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987.
[12] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York, 1986.
[13] R.R. Wilcox, Introduction to Robust Estimation and Hypothesis Testing, 2nd edition, Academic Press, San Diego, CA (in press).