Nonparametric estimation for quadratic regression

Samprit Chatterjee (New York University and Mount Sinai Medical School, NY, USA)
Ingram Olkin (Stanford University, CA, USA)

Statistics & Probability Letters 76 (2006) 1156–1163

Received 25 February 2005; received in revised form 22 November 2005; available online 25 January 2006

Abstract

The method of least squares provides the most widely used algorithm for fitting a linear model. A variety of nonparametric procedures have been developed that are designed to be robust against model violations and resistant against aberrant points. One such method, introduced by Theil [1950. A rank-invariant method of linear and polynomial regression analysis. I, II, III. Proc. Ned. Akad. Wet. 53, 386–392, 521–525, 1397–1412], is based on pairwise estimates. There are many examples in which the data are nonlinear and, in particular, in which a quadratic fit may be more appropriate. We here propose a nonparametric method for fitting a quadratic regression.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Distribution-free regression; Theil estimator; Least squares regression; Robust regression

1. Introduction

The method of least squares provides the most widely used algorithm for fitting the linear model $y = \alpha + \beta x + \epsilon$ to a set of points $(x_i, y_i)$, $i = 1, \ldots, n$. The assumption that the error term $\epsilon$ is normally distributed is generally adopted, in which case the least squares estimate is also the maximum likelihood estimate. Corrective measures have been developed for different types of violations of the assumptions. Robust procedures focus on violations of model assumptions and data contamination. Errors-in-variables procedures deal with the case in which $x$ and $y$ are measured with error. Although the origins of these violations may differ, some of the corrective measures are similar.

Our starting point consists of $n$ independent observations $(y_i, x_i)$, $i = 1, \ldots, n$, to be used in estimating the parameters in the linear model

$$y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, \ldots, n. \tag{1.1}$$

No assumptions are made on the error terms. Wald (1940) focused on the case in which both $x$ and $y$ are subject to error. For simplicity of discussion let $n$ be even ($n = 2m$) and let the $x$'s be distinct. (The case of nondistinct $x$'s can be treated, but requires more


complicated rules for breaking ties.) Divide the $n$ observations, based on the ordered $x$'s, into two groups, with $x_1 < \cdots < x_m$ as group 1 and $x_{m+1} < \cdots < x_n$ as group 2. The Wald slope $\beta_w$ is determined by

$$\beta_w = (\bar{y}_2 - \bar{y}_1)/(\bar{x}_2 - \bar{x}_1), \tag{1.2}$$

where $\bar{x}_i$ and $\bar{y}_i$ are the means of the respective groups. Then

$$\alpha_w = \bar{y} - \beta_w \bar{x}. \tag{1.3}$$
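For concreteness, here is a minimal Python sketch of the Wald two-group fit; the function name wald_fit is ours, not from the paper:

```python
import numpy as np

def wald_fit(x, y):
    """Two-group (Wald, 1940) line fit: split the data at the median of the
    ordered x's and join the two group means, eqs. (1.2)-(1.3)."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    m = len(x) // 2
    slope = (y[m:].mean() - y[:m].mean()) / (x[m:].mean() - x[:m].mean())
    intercept = y.mean() - slope * x.mean()
    return intercept, slope
```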

The Wald method yields consistent, unbiased estimators. A more efficient modification was introduced by Nair and Shrivastava (1942) and independently by Bartlett (1949), in which the ordered $x$'s are divided into three groups, with a lower group based on $x_1 < \cdots < x_\ell$ and an upper group based on $x_u < \cdots < x_n$, $\ell < u$. The middle group is omitted, and the slope is evaluated as in (1.2). Although the estimate of the slope is identical for the Nair and Shrivastava (1942) and Bartlett (1949) methods, the intercept estimates differ. With equally spaced $x$'s not subject to measurement error, the Wald method has efficiency relative to least squares of at least 3/4, and the three-group method has efficiency of at least 8/9.

Many other methods designed to be more robust and resistant have been proposed: Mood (1950), Brown and Mood (1951), Hampel (1971), Hogg and Randles (1975), Gentle (1977), and others. These methods are described in the excellent review by Emerson and Hoaglin (1983). For more recent discussions of errors-in-variables methods see Gleser (1981), Fuller (1987), and Ruppert et al. (1995).

2. Pairwise estimates

The method of pairwise estimates was introduced by Theil (1950) for linear regressions. Assuming distinct $x$-values, for every pair $(x_i, y_i)$ and $(x_j, y_j)$, estimate the slope by the two-point formula

$$\beta_{ij} = (y_i - y_j)/(x_i - x_j), \quad 1 \le i < j \le n, \tag{2.1}$$

and define

$$\tilde\beta = \operatorname{med}\{\beta_{ij}\} \tag{2.2}$$

over all $\binom{n}{2}$ pairs of points. Then define

$$\tilde\alpha = \operatorname{med}\{y_i - \tilde\beta x_i\}. \tag{2.3}$$
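A minimal sketch of the Theil pairwise estimator (2.1)–(2.3); the function name theil_fit is ours:

```python
from itertools import combinations
import numpy as np

def theil_fit(x, y):
    """Theil (1950) pairwise-median line fit, eqs. (2.1)-(2.3)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)]
    beta = np.median(slopes)           # eq. (2.2)
    alpha = np.median(y - beta * x)    # eq. (2.3)
    return alpha, beta
```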

A modified version developed by Siegel (1982) uses a repeated-median method, in which

$$\beta^* = \operatorname{med}_i\{\operatorname{med}_j \beta_{ij}\} = \operatorname{med}\{\operatorname{med}_j \beta_{1j}, \ldots, \operatorname{med}_j \beta_{nj}\}. \tag{2.4}$$

The intercept is obtained from

$$\alpha^* = \operatorname{med}\{y_i - \beta^* x_i\}, \quad i = 1, \ldots, n. \tag{2.5}$$
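The repeated median of (2.4)–(2.5) can be sketched similarly (again, the function name is ours):

```python
from itertools import combinations
import numpy as np

def siegel_fit(x, y):
    """Siegel (1982) repeated-median line fit, eqs. (2.4)-(2.5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # inner medians: for each i, the median slope over all j != i
    inner = [np.median([(y[j] - y[i]) / (x[j] - x[i])
                        for j in range(n) if j != i]) for i in range(n)]
    beta = np.median(inner)            # eq. (2.4)
    alpha = np.median(y - beta * x)    # eq. (2.5)
    return alpha, beta
```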

An approximate confidence interval for the slope is obtained by Hollander and Wolfe (1973) as follows. Order the $N = \binom{n}{2}$ slopes, $\beta_{(1)} \le \cdots \le \beta_{(N)}$, and define

$$L = (N - C_\alpha)/2, \quad U = (N + C_\alpha)/2,$$

where

$$C_\alpha \approx z_{\alpha/2}\sqrt{n(n-1)(2n+5)/18}$$

and $z_{\alpha/2}$ denotes the upper $\alpha/2$ point of the standard normal distribution. Then $(\beta_{(L)}, \beta_{(U)})$ constitutes a $(1-\alpha)100\%$ confidence interval. (Note that when $L$ and $U$ are not integers, the closest integer values are used.)
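A sketch of this confidence interval (function name ours; SciPy is assumed for the normal quantile, with $L$ and $U$ rounded to the nearest integers as noted above):

```python
from itertools import combinations
import numpy as np
from scipy.stats import norm

def theil_slope_ci(x, y, alpha=0.05):
    """Hollander-Wolfe approximate (1 - alpha) CI for the Theil slope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    slopes = np.sort([(y[j] - y[i]) / (x[j] - x[i])
                      for i, j in combinations(range(n), 2)])
    N = len(slopes)
    C = norm.ppf(1 - alpha / 2) * np.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    L = int(round((N - C) / 2))
    U = int(round((N + C) / 2))
    # slopes are the order statistics beta_(1) <= ... <= beta_(N), 1-based
    return slopes[max(L - 1, 0)], slopes[min(U - 1, N - 1)]
```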

3. The quadratic case

Although linearity is still the primary model in applications, there has been an increase in examples of data that are nonlinear and, in particular, for which a quadratic fit may be more appropriate.


The method of Theil (1950) can readily be modified for the quadratic case. Consider the model

$$y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i, \quad i = 1, \ldots, n, \tag{3.1}$$

and assume that the $x$-values are distinct and $\operatorname{Var}(\epsilon_i) = \sigma^2$. For all $\binom{n}{3}$ triplets solve the equations

$$y_i = \alpha + \beta x_i + \gamma x_i^2, \quad y_j = \alpha + \beta x_j + \gamma x_j^2, \quad y_k = \alpha + \beta x_k + \gamma x_k^2 \tag{3.2}$$

to yield

$$\tilde\gamma_{ijk} = \frac{1}{x_k - x_j}\left(\frac{y_k - y_i}{x_k - x_i} - \frac{y_j - y_i}{x_j - x_i}\right). \tag{3.3}$$

The median estimator of the coefficient of the quadratic term is

$$\tilde\gamma = \operatorname{med}\{\tilde\gamma_{ijk}\}. \tag{3.4}$$

Then for all $\binom{n}{2}$ pairs let

$$\tilde\beta_{ij} = \frac{y_j - y_i}{x_j - x_i} - \tilde\gamma(x_j + x_i), \tag{3.5}$$

and define

$$\tilde\beta = \operatorname{med}\{\tilde\beta_{ij}\}. \tag{3.6}$$

Finally, for all $i = 1, \ldots, n$, let

$$\tilde\alpha_i = y_i - \tilde\beta x_i - \tilde\gamma x_i^2, \tag{3.7}$$

and define

$$\tilde\alpha = \operatorname{med}\{\tilde\alpha_i\}. \tag{3.8}$$

The fitted model is

$$\tilde y_i = \tilde\alpha + \tilde\beta x_i + \tilde\gamma x_i^2, \quad i = 1, \ldots, n, \tag{3.9}$$

and the estimate of variance is

$$\tilde\sigma^2 = \sum (y_i - \tilde y_i)^2/(n - 3). \tag{3.10}$$
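Collecting (3.3)–(3.10), a minimal sketch of the proposed quadratic estimator (the function name is ours):

```python
from itertools import combinations
import numpy as np

def quadratic_theil_fit(x, y):
    """Nonparametric quadratic fit via triplet/pairwise medians,
    eqs. (3.3)-(3.10). Returns (alpha, beta, gamma, sigma2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # eq. (3.3): quadratic coefficient from each triplet of points
    g = [((y[k] - y[i]) / (x[k] - x[i]) - (y[j] - y[i]) / (x[j] - x[i]))
         / (x[k] - x[j])
         for i, j, k in combinations(range(n), 3)]
    gamma = np.median(g)                              # eq. (3.4)
    # eqs. (3.5)-(3.6): linear coefficient from each pair
    b = [(y[j] - y[i]) / (x[j] - x[i]) - gamma * (x[j] + x[i])
         for i, j in combinations(range(n), 2)]
    beta = np.median(b)                               # eq. (3.6)
    alpha = np.median(y - beta * x - gamma * x ** 2)  # eqs. (3.7)-(3.8)
    resid = y - (alpha + beta * x + gamma * x ** 2)   # eq. (3.9)
    sigma2 = np.sum(resid ** 2) / (n - 3)             # eq. (3.10)
    return alpha, beta, gamma, sigma2
```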

The variance of $\tilde\gamma_{ijk}$ is

$$\tilde\sigma^2_{ijk} = \operatorname{Var}(\tilde\gamma_{ijk}) = \sigma^2\,\frac{(x_i - x_j)^2 + (x_i - x_k)^2 + (x_j - x_k)^2}{(x_i - x_j)^2 (x_i - x_k)^2 (x_j - x_k)^2}; \tag{3.11}$$

the covariance between the $\tilde\gamma$'s that have one observation in common is

$$\operatorname{Cov}(\tilde\gamma_{ijk}, \tilde\gamma_{i\ell m}) = \sigma^2\left[(x_i - x_j)(x_i - x_k)(x_i - x_\ell)(x_i - x_m)\right]^{-1}. \tag{3.12}$$

Thus the correlation between the $\tilde\gamma$'s that have one observation in common is

$$\tilde\rho_{ijk,\,i\ell m} = \frac{\operatorname{Cov}(\tilde\gamma_{ijk}, \tilde\gamma_{i\ell m})}{\tilde\sigma_{ijk}\,\tilde\sigma_{i\ell m}}. \tag{3.13}$$

The correlation between $\tilde\gamma$'s with two observations in common can also be calculated, but is somewhat more complicated.

Remark. The estimates of $\alpha, \beta, \gamma$ have been obtained only for triplets $x_i, x_j, x_k$ that are distinct. For any triplet that is not distinct, define that particular $\tilde\gamma_{ijk} = 0$. Using the approach outlined in Sen (1968), we can show that the $\tilde\gamma$ calculated from the distinct triplets converges to $\gamma$.


To judge the significance of the quadratic component, an asymptotic standard error of its estimate is derived in the Appendix.

4. Simultaneous least squares estimation

For every triplet, say $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, minimize

$$\sum_{i=1,2,3} (y_i - \alpha - \beta x_i - \gamma x_i^2)^2$$

to yield the solution

$$(\hat\alpha, \hat\beta, \hat\gamma) = (y_1, y_2, y_3)\begin{pmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \end{pmatrix}^{-1},$$

and more explicitly

$$\hat\gamma_{123} = \frac{y_1}{(x_2 - x_1)(x_3 - x_1)} + \frac{y_2}{(x_1 - x_2)(x_3 - x_2)} + \frac{y_3}{(x_1 - x_3)(x_2 - x_3)},$$

$$\hat\beta_{123} = -\frac{y_1 (x_2 + x_3)}{(x_2 - x_1)(x_3 - x_1)} - \frac{y_2 (x_1 + x_3)}{(x_1 - x_2)(x_3 - x_2)} - \frac{y_3 (x_1 + x_2)}{(x_1 - x_3)(x_2 - x_3)},$$

$$\hat\alpha_{123} = \frac{y_1 x_2 x_3}{(x_2 - x_1)(x_3 - x_1)} + \frac{y_2 x_1 x_3}{(x_1 - x_2)(x_3 - x_2)} + \frac{y_3 x_1 x_2}{(x_1 - x_3)(x_2 - x_3)}.$$

Consequently, over all triplets let

$$\hat\alpha_{LS} = \operatorname{med}\{\hat\alpha_{123}\}, \quad \hat\beta_{LS} = \operatorname{med}\{\hat\beta_{123}\}, \quad \hat\gamma_{LS} = \operatorname{med}\{\hat\gamma_{123}\}.$$
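A sketch of these simultaneous triplet estimates; rather than coding the explicit formulas above, it solves each 3 × 3 Vandermonde system numerically, which is equivalent up to round-off (function name ours):

```python
from itertools import combinations
import numpy as np

def quadratic_triplet_ls_fit(x, y):
    """Median over all triplets of the exact per-triplet quadratic fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    coefs = []
    for i, j, k in combinations(range(len(x)), 3):
        # each triplet determines (alpha, beta, gamma) exactly:
        # columns of V are [1, x, x^2]
        V = np.vander(x[[i, j, k]], 3, increasing=True)
        coefs.append(np.linalg.solve(V, y[[i, j, k]]))
    return np.median(np.array(coefs), axis=0)  # (alpha_LS, beta_LS, gamma_LS)
```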

It is possible to estimate the three parameters simultaneously for each triplet in closed form because the design matrix is a Vandermonde matrix, which has a closed-form inverse. The estimates of $\gamma$ given by this method and by the method proposed in Section 3 are the same, because in both methods the estimator of $\gamma$ is the median of the estimates generated by all triplets of points. The estimates of $\alpha$ and $\beta$ are, however, different. In Section 3, the constant and the linear coefficient are estimated after a smooth estimate of the quadratic component is subtracted from the data; for the simultaneous least squares estimate this is not done. We do not recommend this procedure, but present it here for the sake of completeness and novelty.

5. Example

We illustrate our method with an example in which the standard least squares approach performs very poorly. The data are simulated under realistic conditions. Ten observations are generated from a polynomial with a negative linear and a positive quadratic coefficient. The random disturbance is generated by a long-tailed probability distribution (a t-distribution with 10 degrees of freedom). Two observations have been perturbed. The data are provided in Table 1.

Table 1
Illustrative data for quadratic regression fitting

x       y        x       y
3.4     18.8     12.5    92.7
5.7     20.5     14.2    124.2
7.5     29.4     15.2    142.4
8.8     45.0     15.8    154.7
11.1    72.4     17.9    118.4
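As a usage illustration, the data of Table 1 can be passed to the quadratic_theil_fit sketch given after (3.10); we do not reproduce the output here:

```python
x = [3.4, 5.7, 7.5, 8.8, 11.1, 12.5, 14.2, 15.2, 15.8, 17.9]
y = [18.8, 20.5, 29.4, 45.0, 72.4, 92.7, 124.2, 142.4, 154.7, 118.4]
alpha, beta, gamma, sigma2 = quadratic_theil_fit(x, y)
print(f"y = {alpha:.3f} + {beta:.3f} x + {gamma:.3f} x^2")
```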


Fig. 1. Scatterplot of y and x, with least squares (LS) and nonparametric robust (RNP) quadratic fits.

Fig. 1 shows the observations, the least squares fit, and the nonparametric robust fit. The observations show a clear quadratic pattern, and the superiority of the nonparametric robust fit in this situation is clear. The least squares fit is

$$\hat y_{LS} = -28.772 + 9.329x + 0.0414x^2,$$

where both the linear and the quadratic terms are not significant. It is to be noted that the value of $R^2$ is 0.878! In spite of the high value of $R^2$, the poor least squares fit arises from two conditions present in the data: multicollinearity (the variance inflation factor is 29.6) and the presence of two outliers with high leverage. Multicollinearity will almost always be present in a quadratic fit because of the nature of the variables. Outliers with high leverage are also a commonly occurring feature of data. The nonparametric method produces the model

$$\hat y_{NP} = 11.586 - 2.556x + 0.724x^2;$$

here the linear and quadratic components are significant. The observed value of $R^2$ is 0.841. An iteratively robust weighted least squares (IRWLS) method given by Chatterjee and Machler (1997) gives

$$\hat y_R = 15.256 - 3.572x + 0.785x^2.$$

The observed value of $R^2$ is 0.837. In this example the model fitted by IRWLS and the nonparametric robust model give nearly identical fits; when superposed on the scatterplot they are indistinguishable. The nonparametric method uses an $L_1$ norm, whereas IRWLS uses a quadratic norm. Consequently, the nonparametric method will be more robust against long-tailed error distributions.

6. Discussion

The least squares estimates for linear models in the presence of outliers and leverage points are known to be highly nonrobust. The extended Theil method of Section 3 guards against these aberrant points. The IRWLS estimates downweight these points; the extended Theil method developed here also downweights them, and consequently is close to the IRWLS estimates. This method will perform well in the presence of long-tailed model errors (being based on the $L_1$ norm), gross errors, and leverage points. The method presented


in Section 3 will provide a reliable quadratic fit without requiring an extensive analysis of the data. As with other nonparametric procedures, it provides a quick examination of the quadratic fit.

Acknowledgements

We would like to thank P. K. Sen and S. R. S. Varadhan for their helpful comments and suggestions on the work in this paper.

Appendix A. Asymptotic standard error of the quadratic coefficient

Let $F_n(z)$ denote the empirical cumulative distribution function of the $\tilde\gamma_{ijk}$:

$$F_n(z) = \frac{1}{\binom{n}{3}}\sum_{i,j,k} I(\tilde\gamma_{ijk} \le z), \tag{A.1}$$

and let $f(\cdot)$ denote its corresponding density.

Theorem. The variance of $\tilde\gamma_m$, the median of all triplet estimates $\tilde\gamma_{ijk}$, is

$$\operatorname{Var}(\tilde\gamma_m) \simeq \operatorname{Var}[F_n(\gamma)]/[f(\gamma)]^2. \tag{A.2}$$

Proof. Because $\tilde\gamma_m$ is the median of all triplet estimates, $F_n(\tilde\gamma_m) = \frac{1}{2}$. Further, because $\tilde\gamma_m$ converges to $\gamma$ (see, for example, Sen, 1968; Hollander and Wolfe, 1973), it follows that

$$\lim_{n\to\infty} F_n(\tilde\gamma_m) = F(\gamma) = \tfrac{1}{2}, \tag{A.3}$$

so that

$$F_n(\tilde\gamma_m) - F(\gamma) = 0. \tag{A.4}$$

Expand $F(\gamma)$ in a Taylor series about $\tilde\gamma_m$:

$$F_n(\tilde\gamma_m) - F(\tilde\gamma_m) + (\gamma - \tilde\gamma_m)F'(\gamma) \simeq 0, \tag{A.5}$$

from which

$$\tilde\gamma_m - \gamma \simeq [F_n(\tilde\gamma_m) - F(\tilde\gamma_m)]/F'(\gamma) \tag{A.6}$$

and

$$E(\tilde\gamma_m - \gamma)^2 = E[F_n(\tilde\gamma_m) - F(\tilde\gamma_m)]^2/[F'(\gamma)]^2. \tag{A.7}$$

The asymptotic variance, $\operatorname{Var}(\tilde\gamma_m)$, is then

$$\operatorname{Var}(\tilde\gamma_m) = \lim_{n\to\infty} E(\tilde\gamma_m - \gamma)^2 = \lim_{n\to\infty} E[F_n(\tilde\gamma_m) - F(\tilde\gamma_m)]^2/[f(\gamma)]^2 = \operatorname{Var}[F_n(\gamma)]/[f(\gamma)]^2. \tag{A.8} \quad\square$$

Note that for independent random variables, (A.8) reduces to the asymptotic variance of the median.

Appendix B. Asymptotic variance under a normality assumption

Under the assumption that $\tilde\gamma_{ijk} \sim N(\gamma, \sigma^2(x_i, x_j, x_k))$, we obtain

$$F(z) = P\{\tilde\gamma_{ijk} \le z\} = \int_X \left[\int_{-\infty}^{z} \varphi(t; \gamma, \sigma(x_i, x_j, x_k))\, dt\right] f(x)\, dx = \int_X \Phi(z; \gamma, \sigma(x_i, x_j, x_k))\, f(x)\, dx,$$


$$f(z) = F'(z) = \int_X \varphi(z; \gamma, \sigma(x_i, x_j, x_k))\, f(x)\, dx,$$

where $\varphi(x; \mu, \sigma)$ and $\Phi(x; \mu, \sigma)$ denote the density and distribution function of the normal distribution with mean $\mu$ and standard deviation $\sigma$. Also

$$f(\gamma) = \int_X \frac{1}{\sqrt{2\pi}\,\sigma(x_i, x_j, x_k)}\, f(x)\, dx = \frac{1}{\sqrt{2\pi}}\,A,$$

where

$$A = \operatorname{Average}\left[\frac{1}{\sigma(x_i, x_j, x_k)}\right],$$

the average being taken over all distinct triplets.

To compute $\operatorname{Var}[F_n(\gamma)]$, note from (A.1) that

$$\operatorname{Var}[F_n(z)] = \frac{1}{\binom{n}{3}^2}\operatorname{Var}\left[\sum_{i,j,k} I(\tilde\gamma_{ijk} \le z)\right] \approx \frac{36}{n}\operatorname{Average}\, P\{\tilde\gamma_{ijk} \le z,\ \tilde\gamma_{i\ell m} \le z\} + O\!\left(\frac{1}{n^2}\right) = \frac{36}{n}\operatorname{Average}\, P\{\tilde\gamma_{ijk} - \gamma \le z - \gamma,\ \tilde\gamma_{i\ell m} - \gamma \le z - \gamma\} + O\!\left(\frac{1}{n^2}\right).$$

Consequently,

$$\operatorname{Var}[F_n(\gamma)] \approx \frac{36}{n}\operatorname{Average}\, P\{\tilde\gamma_{ijk} - \gamma \le 0,\ \tilde\gamma_{i\ell m} - \gamma \le 0\},$$

which, assuming a bivariate normal distribution, is the probability contained in the positive quadrant of a standard bivariate normal distribution with correlation $\rho_{ijk,\,i\ell m}$, namely

$$\operatorname{Var}[F_n(\gamma)] \approx \frac{36}{n}\operatorname{Average}\left[\frac{1}{4} + \frac{1}{2\pi}\arcsin \rho_{ijk,\,i\ell m}\right]$$

(see, e.g., Kendall and Stuart, 1968, vol. 1, Chapter 15). In conclusion,

$$\operatorname{Var}(\tilde\gamma) = \frac{72\pi}{nA^2}\left(\frac{1}{4} + \frac{B}{2\pi}\right),$$

where $B = \operatorname{Average}(\arcsin \rho_{ijk,\,i\ell m})$ and $A = \operatorname{Average}(1/\sigma(x_i, x_j, x_k))$. An estimate of the asymptotic $\operatorname{Var}(\tilde\gamma)$ is obtained by replacing the population quantities by their sample estimates.
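A numerical sketch of this variance estimate, combining (3.10)–(3.13) with the formula above; our reading is that the averages run over distinct triplets, and over pairs of triplets sharing exactly one observation (all names are ours):

```python
from itertools import combinations
import numpy as np

def gamma_tilde_avar(x, sigma2):
    """Asymptotic Var(gamma_tilde) = (72*pi / (n*A^2)) * (1/4 + B/(2*pi)),
    with A and B the averages defined in Appendix B and sigma2 an estimate
    of the error variance, e.g. eq. (3.10)."""
    x = np.asarray(x, float)
    n = len(x)
    triplets = list(combinations(range(n), 3))

    def sigma_ijk(i, j, k):
        # eq. (3.11): standard deviation of gamma_tilde_ijk
        d = [(x[i] - x[j]) ** 2, (x[i] - x[k]) ** 2, (x[j] - x[k]) ** 2]
        return np.sqrt(sigma2 * sum(d) / (d[0] * d[1] * d[2]))

    A = np.mean([1.0 / sigma_ijk(*t) for t in triplets])

    arcs = []
    for t1, t2 in combinations(triplets, 2):
        common = set(t1) & set(t2)
        if len(common) != 1:
            continue  # average over pairs of triplets sharing one index
        i = common.pop()
        j, k = [v for v in t1 if v != i]
        l, m = [v for v in t2 if v != i]
        # eqs. (3.12)-(3.13): covariance and correlation
        cov = sigma2 / ((x[i] - x[j]) * (x[i] - x[k])
                        * (x[i] - x[l]) * (x[i] - x[m]))
        rho = cov / (sigma_ijk(i, j, k) * sigma_ijk(i, l, m))
        arcs.append(np.arcsin(np.clip(rho, -1.0, 1.0)))
    B = np.mean(arcs)

    return (72 * np.pi / (n * A ** 2)) * (0.25 + B / (2 * np.pi))
```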


References

Bartlett, M.S., 1949. Fitting a straight line when both variables are subject to error. Biometrics 5, 207–212.
Brown, G.W., Mood, A.M., 1951. On median tests for linear hypotheses. Proceedings of the Second Berkeley Symposium, pp. 159–166.
Chatterjee, S., Machler, M., 1997. Robust regression: a weighted least squares approach. Comm. Statist. Theory Methods 26, 1381–1394.
Emerson, J.D., Hoaglin, D.C., 1983. Resistant lines for y versus x. In: Hoaglin, D.C., Mosteller, F., Tukey, J.W. (Eds.), Understanding Robust and Exploratory Data Analysis. Wiley, New York, pp. 129–165.
Fuller, W., 1987. Measurement Error Models. Wiley, New York.
Gentle, J.E., 1977. Least absolute values estimation: an introduction. Comm. Statist. Simulation Comput. 6, 313–328.
Gleser, L.J., 1981. Estimation in a multivariate "errors-in-variables" regression model: large sample results. Ann. Statist. 9, 24–44.
Hampel, F.R., 1971. A general qualitative definition of robustness. Ann. Math. Statist. 42, 1887–1896.
Hogg, R.V., Randles, R.H., 1975. Adaptive distribution-free regression methods and their applications. Technometrics 17, 399–407.
Hollander, M., Wolfe, D.A., 1973. Nonparametric Statistical Methods. Wiley, New York.
Kendall, M.G., Stuart, A., 1968. The Advanced Theory of Statistics, vol. 1. Hafner Publishing Company, New York.
Mood, A.M., 1950. Introduction to the Theory of Statistics. McGraw-Hill, New York.
Nair, K.R., Shrivastava, M.P., 1942. On a simple method of curve fitting. Sankhyā 6, 121–132.
Ruppert, D., Carroll, R.J., Stefanski, L.A., 1995. Measurement Error in Nonlinear Models. Wiley, New York.
Sen, P.K., 1968. Estimates of the regression coefficient based on Kendall's tau. J. Amer. Statist. Assoc. 63, 1379–1389.
Siegel, A.F., 1982. Robust regression using repeated medians. Biometrika 69, 242–244.
Theil, H., 1950. A rank-invariant method of linear and polynomial regression analysis. I, II, III. Proceedings Nederlandse Akademie van Wetenschappen 53, 386–392, 521–525, 1397–1412.
Wald, A., 1940. The fitting of straight lines if both variables are subject to error. Ann. Math. Statist. 11, 284–300.