Statistics & Probability Letters 76 (2006) 1156–1163
www.elsevier.com/locate/stapro

Nonparametric estimation for quadratic regression

Samprit Chatterjee (New York University, NY, USA; Mount Sinai Medical School, NY, USA)
Ingram Olkin (Stanford University, CA, USA)

Received 25 February 2005; received in revised form 22 November 2005; available online 25 January 2006
Abstract

The method of least squares provides the most widely used algorithm for fitting a linear model. A variety of nonparametric procedures have been developed that are designed to be robust against model violations and resistant to aberrant points. One such method, introduced by Theil [1950. A rank-invariant method of linear and polynomial regression analysis. I, II, III. Proc. Ned. Akad. Wet. 53, 386–392, 521–525, 1397–1412], is based on pairwise estimates. There are many examples in which the data are nonlinear and, in particular, in which a quadratic fit may be more appropriate. We here propose a nonparametric method for fitting a quadratic regression.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Distribution-free regression; Theil estimator; Least squares regression; Robust regression
1. Introduction

The method of least squares provides the most widely used algorithm for fitting the linear model $y = \alpha + \beta x + \epsilon$ to a set of points $(x_i, y_i)$, $i = 1, \ldots, n$. The assumption that the error term is normally distributed is generally adopted, in which case the least squares estimate is also the maximum likelihood estimate. Corrective measures have been developed for different types of violations of the assumptions. Robust procedures focus on violations of model assumptions and data contamination. Errors-in-variables procedures deal with the case in which x and y are measured with error. Although the origins of these violations may differ, some of the corrective measures are similar.

Our starting point consists of n independent observations $(y_i, x_i)$, $i = 1, \ldots, n$, to be used in estimating the parameters in the linear model

  $y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, \ldots, n$.  (1.1)

No assumptions are made on the error terms. Wald (1940) focused on the case in which both x and y are subject to error. For simplicity of discussion let n be even ($n = 2m$) and let the x's be distinct. (The case of nondistinct x's can be treated, but requires more
Corresponding author. Fax: +1 650 725 8977.
E-mail address: [email protected] (I. Olkin).
doi:10.1016/j.spl.2005.12.022
complicated rules for breaking ties.) Divide the n observations, based on the ordered x's, into two groups, with $x_1 < \cdots < x_m$ as group 1 and $x_{m+1} < \cdots < x_n$ as group 2. The Wald slope $b_w$ is determined by

  $b_w = (\bar{y}_2 - \bar{y}_1)/(\bar{x}_2 - \bar{x}_1)$,  (1.2)

where $\bar{x}_i$ and $\bar{y}_i$ are the means of the respective groups. Then

  $a_w = \bar{y} - b_w \bar{x}$.  (1.3)
This method yields consistent unbiased estimators. A more efficient modification of the Wald method was introduced by Nair and Shrivastava (1942) and independently by Bartlett (1949), in which the ordered x's are divided into three groups, with a lower group based on $x_1 < \cdots < x_\ell$ and an upper group based on $x_u < \cdots < x_n$, $\ell < u$. The middle group is omitted. The slope is again evaluated as in (1.2). Although the estimate of the slope is identical for the Nair and Shrivastava (1942) and Bartlett (1949) methods, the intercept estimates differ. With equally spaced x's not subject to measurement error, the Wald method has efficiency relative to least squares of at least 3/4, and the three-group method has efficiency of at least 8/9.

Many other methods designed to be more robust and resistant have been proposed: Mood (1950), Brown and Mood (1951), Hampel (1971), Hogg and Randles (1975), Gentle (1977), and others. These methods are described in the excellent review by Emerson and Hoaglin (1983). For more recent discussions of errors-in-variables methods see Gleser (1981), Fuller (1987), and Ruppert et al. (1995).

2. Pairwise estimates

The method of pairwise estimates was introduced by Theil (1950) for linear regression. Assuming distinct x-values, for every pair $(x_i, y_i)$ and $(x_j, y_j)$ estimate the slope by the two-point formula

  $b_{ij} = (y_i - y_j)/(x_i - x_j), \quad 1 \le i < j \le n$,  (2.1)
and define

  $\tilde{b} = \mathrm{med}\,(b_{ij})$  (2.2)

over all $\binom{n}{2}$ pairs of points. Then define

  $\tilde{a} = \mathrm{med}\,\{y_i - \tilde{b} x_i\}$.  (2.3)
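A minimal sketch of the Theil estimator (2.1)–(2.3); the function name is ours and the x-values are assumed distinct:

```python
import numpy as np
from itertools import combinations

def theil_fit(x, y):
    """Theil's pairwise-median estimator for y = a + b*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # eq. (2.1): one slope per pair of points
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)]
    b = np.median(slopes)        # eq. (2.2): median over all C(n, 2) pairs
    a = np.median(y - b * x)     # eq. (2.3): median of the residual intercepts
    return a, b
```

The resistance of the median shows up immediately: for x = (0, 1, 2, 3, 4) and y = (0, 1, 2, 3, 100), with the last point grossly perturbed, the pairwise medians still return a = 0 and b = 1.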
A modified version developed by Siegel (1982) uses a repeated median, in which

  $b^* = \mathrm{med}_i\,\{\mathrm{med}_j\, b_{1j}, \ldots, \mathrm{med}_j\, b_{nj}\}$.  (2.4)

The intercept is obtained from

  $a^* = \mathrm{med}\,\{y_i - b^* x_i\}, \quad i = 1, \ldots, n$.  (2.5)
An approximate confidence interval for the slope is given by Hollander and Wolfe (1973) as follows. Order the $N = \binom{n}{2}$ slopes $b_{(1)} \le \cdots \le b_{(N)}$ and define

  $L = (N - C_\alpha)/2, \quad U = (N + C_\alpha)/2$,

where

  $C_\alpha \approx z_{\alpha/2} \sqrt{n(n-1)(2n+5)/18}$

and $z_{\alpha/2}$ denotes the upper $\alpha/2$ point of the standard normal distribution. Then $(b_{(L)}, b_{(U)})$ constitutes a $(1-\alpha)100\%$ confidence interval. (Note that when L and U are not integers, the closest integer values are used.)

3. The quadratic case

Although linearity is still the primary model in applications, there has been an increase in examples of data that are nonlinear and, in particular, for which a quadratic fit may be more appropriate.
The method of Theil (1950) can readily be modified for the quadratic case. Consider the model

  $y_i = \alpha + \beta x_i + \gamma x_i^2 + \epsilon_i, \quad i = 1, \ldots, n$,  (3.1)

and assume that the x-values are distinct and that $\mathrm{Var}(\epsilon_i) = \sigma^2$. For all $\binom{n}{3}$ triplets solve the equations

  $y_i = \alpha + \beta x_i + \gamma x_i^2$,
  $y_j = \alpha + \beta x_j + \gamma x_j^2$,  (3.2)
  $y_k = \alpha + \beta x_k + \gamma x_k^2$,

to yield

  $\tilde{\gamma}_{ijk} = \dfrac{1}{x_k - x_j}\left[\dfrac{y_k - y_i}{x_k - x_i} - \dfrac{y_j - y_i}{x_j - x_i}\right]$.  (3.3)
The median estimator of the coefficient of the quadratic term is

  $\tilde{\gamma} = \mathrm{med}\,\{\tilde{\gamma}_{ijk}\}$.  (3.4)

Then for all $\binom{n}{2}$ pairs let

  $\tilde{\beta}_{ij} = \dfrac{y_j - y_i}{x_j - x_i} - \tilde{\gamma}\,(x_j + x_i)$,  (3.5)

and define

  $\tilde{\beta} = \mathrm{med}\,\{\tilde{\beta}_{ij}\}$.  (3.6)

Finally, for all $i = 1, \ldots, n$, let

  $\tilde{\alpha}_i = y_i - \tilde{\beta} x_i - \tilde{\gamma} x_i^2$,  (3.7)

and define

  $\tilde{\alpha} = \mathrm{med}\,\{\tilde{\alpha}_i\}$.  (3.8)
The fitted model is

  $\tilde{y}_i = \tilde{\alpha} + \tilde{\beta} x_i + \tilde{\gamma} x_i^2, \quad i = 1, \ldots, n$,  (3.9)

and the estimate of variance is

  $\tilde{\sigma}^2 = \sum (y_i - \tilde{y}_i)^2/(n-3)$.  (3.10)
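The full procedure (3.3)–(3.8) translates into a short routine; this is our own sketch (the function name is not from the paper), assuming distinct x-values:

```python
import numpy as np
from itertools import combinations

def theil_quadratic_fit(x, y):
    """Triplet/pairwise median fit of y = alpha + beta*x + gamma*x**2,
    following eqs. (3.3)-(3.8)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # eq. (3.3): one gamma estimate per triplet (i, j, k)
    g = [((y[k] - y[i]) / (x[k] - x[i]) - (y[j] - y[i]) / (x[j] - x[i]))
         / (x[k] - x[j])
         for i, j, k in combinations(range(n), 3)]
    gamma = np.median(g)                                  # eq. (3.4)
    # eq. (3.5): one beta estimate per pair, quadratic part removed
    b = [(y[j] - y[i]) / (x[j] - x[i]) - gamma * (x[j] + x[i])
         for i, j in combinations(range(n), 2)]
    beta = np.median(b)                                   # eq. (3.6)
    alpha = np.median(y - beta * x - gamma * x**2)        # eqs. (3.7)-(3.8)
    return alpha, beta, gamma
```

On exact quadratic data, every triplet reproduces the true coefficients: for y = 1 − 2x + 3x² at x = (0, 1, 2, 3), i.e. y = (1, 2, 9, 22), the routine returns (1, −2, 3).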
The variance of $\tilde{\gamma}_{ijk}$ is

  $\tilde{\sigma}^2_{ijk} = \mathrm{Var}(\tilde{\gamma}_{ijk}) = \sigma^2\,\dfrac{(x_i - x_j)^2 + (x_i - x_k)^2 + (x_j - x_k)^2}{(x_i - x_j)^2 (x_i - x_k)^2 (x_j - x_k)^2}$;  (3.11)

the covariance between the $\tilde{\gamma}$'s that have one observation in common is

  $\mathrm{Cov}(\tilde{\gamma}_{ijk}, \tilde{\gamma}_{i\ell m}) = \sigma^2\,[(x_i - x_j)(x_i - x_k)(x_i - x_\ell)(x_i - x_m)]^{-1}$.  (3.12)

Thus the correlation between the $\tilde{\gamma}$'s that have one observation in common is

  $\tilde{\rho}_{ijk,i\ell m} = \dfrac{\mathrm{Cov}(\tilde{\gamma}_{ijk}, \tilde{\gamma}_{i\ell m})}{\tilde{\sigma}_{ijk}\,\tilde{\sigma}_{i\ell m}}$.  (3.13)
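Formulas (3.11)–(3.13) translate directly into code; a sketch (function names are ours, and the error variance σ² defaults to 1):

```python
import numpy as np

def var_gamma_triplet(xi, xj, xk, sigma2=1.0):
    """Variance of gamma~_ijk, eq. (3.11)."""
    d = [(xi - xj)**2, (xi - xk)**2, (xj - xk)**2]
    return sigma2 * sum(d) / (d[0] * d[1] * d[2])

def corr_one_common(xi, xj, xk, xl, xm, sigma2=1.0):
    """Correlation, eq. (3.13), between gamma~_ijk and gamma~_ilm
    when the two triplets share only the observation at xi."""
    # eq. (3.12): covariance for triplets with one common point
    cov = sigma2 / ((xi - xj) * (xi - xk) * (xi - xl) * (xi - xm))
    return cov / np.sqrt(var_gamma_triplet(xi, xj, xk, sigma2)
                         * var_gamma_triplet(xi, xl, xm, sigma2))
```

For example, for the triplet (0, 1, 2) with σ² = 1, eq. (3.11) gives (1 + 4 + 1)/(1·4·1) = 1.5.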
The correlation between the $\tilde{\gamma}$'s with two observations in common can also be calculated, but is somewhat more complicated.

Remark. The estimates of $\alpha, \beta, \gamma$ have been obtained only for triplets $x_i, x_j, x_k$ that are distinct. For any triplet that is not distinct, define that particular $\tilde{\gamma}_{ijk} = 0$. Using the approach outlined in Sen (1968), we can show that the $\tilde{\gamma}$ calculated from the distinct triplets converges to $\gamma$.
To judge the significance of the quadratic component, an asymptotic standard error of the quadratic coefficient is derived in the Appendix.

4. Simultaneous least squares estimation

For every triplet, say $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, minimize

  $\sum_{i=1,2,3} (y_i - \alpha - \beta x_i - \gamma x_i^2)^2$

to yield the solution

  $(\hat{\alpha}, \hat{\beta}, \hat{\gamma}) = (y_1, y_2, y_3) \begin{pmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \end{pmatrix}^{-1}$,

and more explicitly

  $\hat{\gamma}_{123} = \dfrac{y_1}{(x_2 - x_1)(x_3 - x_1)} + \dfrac{y_2}{(x_1 - x_2)(x_3 - x_2)} + \dfrac{y_3}{(x_1 - x_3)(x_2 - x_3)}$,

  $\hat{\beta}_{123} = -\dfrac{y_1 (x_2 + x_3)}{(x_2 - x_1)(x_3 - x_1)} - \dfrac{y_2 (x_1 + x_3)}{(x_1 - x_2)(x_3 - x_2)} - \dfrac{y_3 (x_1 + x_2)}{(x_1 - x_3)(x_2 - x_3)}$,

  $\hat{\alpha}_{123} = \dfrac{y_1 x_2 x_3}{(x_2 - x_1)(x_3 - x_1)} + \dfrac{y_2 x_1 x_3}{(x_1 - x_2)(x_3 - x_2)} + \dfrac{y_3 x_1 x_2}{(x_1 - x_3)(x_2 - x_3)}$.
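As a sketch of this section's construction (our function name; distinct x-values assumed), each triplet's quadratic can be obtained by solving the Vandermonde system numerically rather than through the explicit formulas, and the medians then taken over all triplets:

```python
import numpy as np
from itertools import combinations

def median_triplet_fit(x, y):
    """Median, over all triplets, of the exact quadratic through each
    triplet; the Vandermonde system is solved directly."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    coefs = []
    for i, j, k in combinations(range(len(x)), 3):
        V = np.vander(x[[i, j, k]], 3, increasing=True)  # rows: 1, x, x^2
        coefs.append(np.linalg.solve(V, y[[i, j, k]]))   # (a, b, g) per triplet
    a, b, g = np.median(coefs, axis=0)                   # componentwise medians
    return a, b, g
```

On exact quadratic data, e.g. y = 1 − 2x + 3x² at x = (0, 1, 2, 3), every triplet produces the same coefficients and the medians return (1, −2, 3).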
Consequently, let

  $\hat{\alpha}_{LS} = \mathrm{med}\,\{\hat{\alpha}_{123}\}, \quad \hat{\beta}_{LS} = \mathrm{med}\,\{\hat{\beta}_{123}\}, \quad \hat{\gamma}_{LS} = \mathrm{med}\,\{\hat{\gamma}_{123}\}$,

over all triplets. It is possible to estimate the three parameters simultaneously for each triplet in closed form because the design matrix is a Vandermonde matrix, which has a closed-form inverse.

The estimates of $\gamma$ given by this method and by the method proposed in Section 3 are the same, because in both methods the estimator of $\gamma$ is the median of the estimates generated by all triplets of points. The estimates of $\alpha$ and $\beta$ are, however, different. In Section 3, the constant and the linear coefficient are estimated after a smooth estimate of the quadratic component is subtracted from the data. For the simultaneous least squares estimate this is not done. We do not recommend this procedure, but present it here for the sake of completeness and novelty.

5. Example

We illustrate our method with an example in which the standard least squares approach performs very poorly. The data are simulated under realistic conditions. Ten observations are generated from a polynomial with a negative linear and a positive quadratic coefficient. The random disturbance is generated by a long-tailed probability distribution (a t-distribution with 10 degrees of freedom). Two observations have been perturbed. The data are provided in Table 1.

Table 1
Illustrative data for quadratic regression fitting

  x       y        x       y
  3.4     18.8     12.5     92.7
  5.7     20.5     14.2    124.2
  7.5     29.4     15.2    142.4
  8.8     45.0     15.8    154.7
  11.1    72.4     17.9    118.4
Fig. 1. Scatterplot of y and x, with the least squares (LS) and robust nonparametric (RNP) quadratic fits.
Fig. 1 shows the observations, the least squares fit, and the nonparametric robust fit. The observations show a clear quadratic pattern, and the superiority of the nonparametric robust fit in this situation is clear. The least squares fit is

  $y_{LS} = -28.772 + 9.329x + 0.0414x^2$,

where both the linear and the quadratic terms are not significant. It is to be noted that the value of $R^2$ is 0.878! In spite of the high value of $R^2$, the poor least squares fit occurs because of two conditions present in the data. One is multicollinearity (the variance inflation factor is 29.6), and the other is the presence of two outliers with high leverage. Multicollinearity will almost always be present in any quadratic situation because of the nature of the variables. Outliers with high leverage are also a commonly occurring situation in data analysis.

The nonparametric method produces the model

  $y_{NP} = 11.586 - 2.556x + 0.724x^2$;

here the linear and quadratic components are significant. The observed value of $R^2$ is 0.841. An iteratively robust weighted least squares method (IRWLS) given by Chatterjee and Machler (1997) gives

  $y_{R} = 15.256 - 3.572x + 0.785x^2$.

The observed value of $R^2$ is 0.837. In this example, the model fitted by IRWLS and the nonparametric robust model give nearly identical fits, and when superposed on the scatterplot they are indistinguishable. The nonparametric method uses an $L_1$ norm, whereas IRWLS uses a quadratic norm. Consequently, the nonparametric method will be more robust against long-tailed error distributions.

6. Discussion

The least squares estimates for linear models in the presence of outliers and leverage points are known to be highly nonrobust. The extended Theil method of Section 3 guards against these aberrant points. The IRWLS estimate downweights these points. The extended Theil method developed here also downweights these points, and consequently is close to the IRWLS estimates. This method will perform well in the presence of long-tailed model errors (being based on the $L_1$ norm), gross errors, and leverage points. The method presented
in Section 3 will provide a reliable quadratic fit without requiring an extensive analysis of the data. As with other nonparametric procedures, it provides a quick examination of the quadratic fit.

Acknowledgements

We would like to thank P. K. Sen and S. R. S. Varadhan for their helpful comments and suggestions on the work in this paper.

Appendix A. Asymptotic standard error of the quadratic coefficient

Let $F_n(z)$ denote the empirical cumulative distribution function of the $\tilde{\gamma}_{ijk}$:

  $F_n(z) = \binom{n}{3}^{-1} \sum_{i,j,k} I(\tilde{\gamma}_{ijk} \le z)$,  (A.1)

and let $f(\cdot)$ denote its corresponding density.

Theorem. The variance of $\tilde{\gamma}_m$, the median of all the triplet estimates $\tilde{\gamma}_{ijk}$, is

  $\mathrm{Var}(\tilde{\gamma}_m) \simeq \mathrm{Var}[F_n(\gamma)]/[f(\gamma)]^2$.  (A.2)

Proof. Because $\tilde{\gamma}_m$ is the median of all the triplet estimates, $F_n(\tilde{\gamma}_m) = \frac{1}{2}$. Further, because $\tilde{\gamma}_m$ converges to $\gamma$ (see, for example, Sen, 1968; Hollander and Wolfe, 1973), it follows that

  $\lim_{n \to \infty} F_n(\tilde{\gamma}_m) = F(\gamma) = \frac{1}{2}$.  (A.3)

Consequently,

  $F_n(\tilde{\gamma}_m) - F(\gamma) = 0$.  (A.4)
Expanding $F(\gamma)$ in a Taylor series about $\tilde{\gamma}_m$ gives

  $F_n(\tilde{\gamma}_m) - F(\tilde{\gamma}_m) + (\gamma - \tilde{\gamma}_m)F'(\gamma) \simeq 0$,  (A.5)

from which

  $\tilde{\gamma}_m - \gamma \simeq [F_n(\tilde{\gamma}_m) - F(\tilde{\gamma}_m)]/F'(\gamma)$  (A.6)

and

  $E(\tilde{\gamma}_m - \gamma)^2 = E[F_n(\tilde{\gamma}_m) - F(\tilde{\gamma}_m)]^2/[F'(\gamma)]^2$.  (A.7)

The asymptotic variance, $\mathrm{Var}(\tilde{\gamma}_m)$, is then

  $\mathrm{Var}(\tilde{\gamma}_m) = \lim_{n \to \infty} E(\tilde{\gamma}_m - \gamma)^2 = \lim_{n \to \infty} E[F_n(\tilde{\gamma}_m) - F(\tilde{\gamma}_m)]^2/[f(\gamma)]^2 = \mathrm{Var}[F_n(\gamma)]/[f(\gamma)]^2$.  (A.8)  $\square$

Note that for independent random variables, (A.8) reduces to the asymptotic variance of the median.

Appendix B. Asymptotic variance under a normality assumption

Under the assumption that $\tilde{\gamma}_{ijk} \sim N(\gamma, \sigma^2(x_i, x_j, x_k))$, we obtain

  $F(z) = P\{\tilde{\gamma}_{ijk} \le z\} = \int_X \int_{-\infty}^{z} \varphi(t; \gamma, \sigma(x_i, x_j, x_k))\,dt\,f(x)\,dx = \int_X \Phi(z; \gamma, \sigma(x_i, x_j, x_k))\,f(x)\,dx$,
  $f(z) = F'(z) = \int_X \varphi(z; \gamma, \sigma(x_i, x_j, x_k))\,f(x)\,dx$,

where $\varphi(x; \mu, \sigma)$ and $\Phi(x; \mu, \sigma)$ denote the density and distribution function of the normal distribution with mean $\mu$ and standard deviation $\sigma$. Also

  $f(\gamma) = \int_X \dfrac{1}{\sqrt{2\pi}\,\sigma(x_i, x_j, x_k)}\,f(x)\,dx = \dfrac{1}{\sqrt{2\pi}}\,A$,

where

  $A = \mathrm{Average}\left[\dfrac{1}{\sigma(x_i, x_j, x_k)}\right]$,
the average being taken over all distinct triplets.

To compute $\mathrm{Var}[F_n(\gamma)]$, note from (A.1) that

  $\mathrm{Var}[F_n(z)] = \binom{n}{3}^{-2}\,\mathrm{Var}\Big[\sum_{i,j,k} I(\tilde{\gamma}_{ijk} \le z)\Big] \simeq \frac{36}{n}\,\mathop{\mathrm{Average}}_{ijk,\,i\ell m}\,P\{\tilde{\gamma}_{ijk} \le z,\ \tilde{\gamma}_{i\ell m} \le z\} + O(n^{-2}) = \frac{36}{n}\,\mathop{\mathrm{Average}}_{ijk,\,i\ell m}\,P\{\tilde{\gamma}_{ijk} - \gamma \le z - \gamma,\ \tilde{\gamma}_{i\ell m} - \gamma \le z - \gamma\}$.

Consequently,

  $\mathrm{Var}[F_n(\gamma)] \simeq \frac{36}{n}\,\mathop{\mathrm{Average}}_{ijk,\,i\ell m}\,P\{\tilde{\gamma}_{ijk} - \gamma \le 0,\ \tilde{\gamma}_{i\ell m} - \gamma \le 0\}$,

which, assuming the bivariate normal distribution, is the probability contained in the positive quadrant of a standard bivariate normal distribution with correlation $\rho_{ijk,i\ell m}$, namely

  $\mathrm{Var}[F_n(\gamma)] = \frac{36}{n}\,\mathrm{Average}\Big[\frac{1}{4} + \frac{1}{2\pi}\arcsin \rho_{ijk,i\ell m}\Big]$

(see, e.g., Kendall and Stuart, 1968, vol. 1, Chapter 15). In conclusion,

  $\mathrm{Var}(\tilde{\gamma}) = \frac{72\pi}{nA^2}\Big(\frac{1}{4} + \frac{B}{2\pi}\Big)$,

where $B = \mathrm{Average}(\arcsin \rho_{ijk,i\ell m})$ and $A = \mathrm{Average}(1/\sigma(x_i, x_j, x_k))$. An estimate of the asymptotic $\mathrm{Var}(\tilde{\gamma})$ is obtained by replacing the population quantities by the corresponding sample estimates.

References

Bartlett, M.S., 1949. Fitting a straight line when both variables are subject to error. Biometrics 5, 207–212.
Brown, G.W., Mood, A.M., 1951. On median tests for linear hypotheses. Proceedings of the Second Berkeley Symposium, pp. 159–166.
Chatterjee, S., Machler, M., 1997. Robust regression: a weighted least squares approach. Comm. Statist. Theory Methods 26, 1381–1394.
Emerson, J.D., Hoaglin, D.C., 1983. Resistant lines for y versus x. In: Hoaglin, D.C., Mosteller, F., Tukey, J.W. (Eds.), Understanding Robust and Exploratory Data Analysis. Wiley, New York, pp. 129–165.
Fuller, W., 1987. Measurement Error Models. Wiley, New York.
Gentle, J.E., 1977. Least absolute values estimates: an introduction. Comm. Statist. Simulation Comput. 6, 313–328.
Gleser, L.J., 1981. Estimation in a multivariate "error-in-variables" regression model: large sample results. Ann. Statist. 9, 24–44.
Hampel, F.R., 1971. A general qualitative definition of robustness. Ann. Math. Statist. 42, 1887–1896.
Hogg, R.V., Randles, R.H., 1975. Adaptive distribution-free regression methods and their applications. Technometrics 17, 399–407.
Hollander, M., Wolfe, D.A., 1973. Nonparametric Statistical Methods. Wiley, New York.
Kendall, M.G., Stuart, A., 1968. The Advanced Theory of Statistics, vol. 1. Hafner Publishing Company, New York.
Mood, A.M., 1950. Introduction to the Theory of Statistics. McGraw-Hill, New York.
Nair, K.L., Shrivastava, M.P., 1942. On a simple method of curve fitting. Sankhyā 6, 121–132.
Ruppert, D., Carroll, R.J., Stefanski, L.A., 1995. Measurement Error in Nonlinear Models. Wiley, New York.
Sen, P.K., 1968. Estimates of the regression coefficient based on Kendall's tau. J. Amer. Statist. Assoc. 63, 1379–1389.
Siegel, A.F., 1982. Robust regression using repeated medians. Biometrika 69, 242–244.
Theil, H., 1950. A rank-invariant method of linear and polynomial regression analysis. I, II, III. Proceedings Nederlandse Akademie van Wetenschappen 53, 386–392, 521–525, 1397–1412.
Wald, A., 1940. The fitting of straight lines if both variables are subject to error. Ann. Math. Statist. 11, 284–300.