Statistics & Probability Letters 10 (1990) 283-289
North-Holland
September 1990

MEAN SQUARED ERROR PROPERTIES OF KERNEL ESTIMATES OF REGRESSION QUANTILES

M.C. JONES
Mathematical Sciences Department, IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA

Peter HALL
Department of Statistics, Australian National University, P.O. Box 4, Canberra, ACT 2601, Australia

Received March 1989
Revised November 1989
Abstract: Mean squared error properties of kernel estimates of regression quantiles, for both fixed and random design cases, are derived and discussed.

Keywords: Conditional estimation, nonparametric regression, reference data, smoothing.
1. Introduction

Nonparametric estimation of a smooth regression mean curve has received much attention in the literature in recent years; see, for example, Eubank (1988) for a recent review. Smooth nonparametric estimation of other aspects of conditional distributions has been rather less popular. This paper discusses an important example of the latter, namely, the nonparametric estimation of quantiles of the conditional distribution, or regression quantiles for short.

The application which motivated this research is the provision of smooth 'centile curves' fit to 'reference data' in medicine (e.g. Cole, 1988). In that context, charts comprising a few selected (estimated) regression quantiles of an attribute of interest (e.g. height), usually plotted against age, give a useful and interpretable summary of the mainstream development of a population with respect to that attribute. In the discussion of Cole's paper, Jones (1988) suggests using a spline smoothing approach to providing smooth regression quantiles, patterned after Bloomfield and Steiger (1983, Chapter 5). While that method has been applied in various practical situations, we switch attention here to the kernel smoothing version of the same, for reasons of greater amenability to theoretical investigation.

The data we have are (x_i, y_i), i = 1, ..., n, and the y_i's are considered to be realisations from the conditional distribution of Y given X = x, the distribution function of which will be written as F_x(y) or F(x; y), whichever proves the more convenient (the corresponding density is f_x(y)). It is important to note the quite general dependence of F on x; we will, in general, assume only properties of F relating to a smooth variation of F with x. For given α in (0, 1), the associated regression quantile q_α(x) is defined by F_x(q_α(x)) = α. An alternative characterisation of q_α(x) is as the function θ that minimises E{ρ_α(Y - θ)}, where
0167-7152/90/$3.50 © 1990 - Elsevier Science Publishers B.V. (North-Holland)
ρ_α(z) = α z I_(0,∞)(z) - (1 - α) z I_(-∞,0)(z),   (1.1)

and I_A(z) is the usual indicator function. This motivates the natural kernel estimate q̂_α(x) of q_α(x), namely, the solution θ to the equation

H_n(θ) ≡ Σ_{i=1}^n w_i(x) ψ_α(y_i - θ) = 0,   (1.2)

where

ψ_α(z) = α I_(0,∞)(z) - (1 - α) I_(-∞,0)(z),   z ≠ 0,   (1.3)

is the derivative of ρ_α, except for being undefined at z = 0. (In practice, this is overcome by a slight 'rounding off' of ρ_α about zero; q̂_α can then be computed by the method of iteratively reweighted least squares, for example.) The weights w_i(x), i = 1, ..., n, depend on the distances |x - x_i| through a kernel function K as detailed below.

One way of assessing the performance of q̂_α as an estimate of q_α is by its mean squared error (MSE) at any point x. The purpose of this paper is to derive and discuss the leading terms in the asymptotic expansion of MSE. (We ignore boundary effects, so our analysis is directed at performance in the interior of the range of x's.) The idea is to get some qualitative insight into factors affecting bias and variance of q̂_α, with particular reference to the appropriate choice of smoothing parameter, h = h_α (the 'bandwidth') associated with the kernel weights w_i(x). It is well known (e.g. Eubank, 1988) that for the regression mean the bias depends on the second derivative of the true mean (a measure of its 'roughness') and the variance on the variance corresponding to f_x (together with the 'usual' terms arising from the smoothing). It turns out that for q̂_α the bias depends on the second derivative of F_x with respect to x evaluated at q_α(x), divided by f_x, and the variance on the asymptotic variance of the sample α-quantile. Comparison of the mean and quantile estimators is, therefore, tricky to understand in general, as is the optimal variation of h_α with α (see Section 3). The latter is disappointing in terms of its lack of simplicity. It seems that Jones's (1988) hope that a single h would suffice for all α and Owen's (1987) guess that h_α should be greater for extreme quantiles than for moderate ones are overoptimistic in general, although it remains possible that either could be a reasonable approximation in many cases of practical interest.
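Since ψ_α(y - θ) equals α for y > θ and -(1 - α) for y < θ, solving (1.2) amounts to taking a weighted α-quantile of the y_i's. A minimal Python sketch of this (the Gaussian kernel and random-design-style weights are illustrative choices, not prescribed by the paper):

```python
import math

def kernel_weights(x, xs, h):
    """Kernel weights of the random-design form (1.5), with a Gaussian
    kernel K (the paper only asks that K be a symmetric density)."""
    n = len(xs)
    return [math.exp(-0.5 * ((x - xi) / h) ** 2) / (math.sqrt(2 * math.pi) * n * h)
            for xi in xs]

def regression_quantile(x, xs, ys, alpha, h):
    """Solve sum_i w_i(x) psi_alpha(y_i - theta) = 0, i.e. equation (1.2):
    the solution is the weighted alpha-quantile of the y_i."""
    w = kernel_weights(x, xs, h)
    total = sum(w)
    pairs = sorted(zip(ys, w))
    cum = 0.0
    for y, wi in pairs:
        cum += wi
        if cum >= alpha * total:
            return y
    return pairs[-1][0]
```

With all weights equal (e.g. all x_i at the evaluation point), this reduces to the ordinary sample α-quantile, which is a useful sanity check on any implementation.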
Global versions of our pointwise MSE results follow by integration provided the resulting integrals remain finite, so corresponding interpretations follow. It is also supposed that results for the kernel estimation case have some relevance for the spline smoothing approach alluded to above.

We make the usual distinction between fixed and random designs, and treat both. In the fixed design case, we take the x's to be in (0, 1) with x_i = (i - ½)/n, i = 1, ..., n. The corresponding MSE of q̂_α is derived in Section 2 and discussed in Section 3. Aside from its own practical interest, the fixed design case allows properties of q̂_α, such as those outlined in the previous paragraph, to appear free of the added complication of randomness in the x_i's. The MSE for the latter case is derived in Section 4 and the effect of the density g, from which x_1, ..., x_n is then a random sample, is briefly discussed in Section 5. It will prove convenient, if not compelling, to use different weights w_i(x) in each case. For the fixed design, we take
w_i(x) = ∫_{(i-1)/n}^{i/n} h_α^{-1} K{h_α^{-1}(x - u)} du,   (1.4)
and for the random design,
w_i(x) = (n h_α)^{-1} K{h_α^{-1}(x - x_i)}.   (1.5)
The kernel K will be taken to be a symmetric probability density function. In the regression mean case, (1.4) and (1.5) would lead to the well known Gasser-Müller and Nadaraya-Watson versions of the kernel estimate ((6d) and (6c) of Eubank, 1988, p. 113), respectively.
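For concreteness, the two weight schemes can be coded directly; the Gaussian kernel below is an illustrative choice, under which the integral in (1.4) has a closed form in terms of the standard normal cdf:

```python
import math

def Phi(t):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gasser_muller_weights(x, n, h):
    """Fixed-design weights (1.4): the integral of h^{-1} K{h^{-1}(x - u)}
    over ((i-1)/n, i/n); for Gaussian K this is a difference of normal cdfs.
    Weights far from x may underflow to exactly zero in floating point."""
    return [Phi((x - (i - 1) / n) / h) - Phi((x - i / n) / h)
            for i in range(1, n + 1)]

def nadaraya_watson_weights(x, xs, h):
    """Random-design weights (1.5): w_i(x) = (n h)^{-1} K{h^{-1}(x - x_i)}."""
    n = len(xs)
    return [math.exp(-0.5 * ((x - xi) / h) ** 2) / (math.sqrt(2 * math.pi) * n * h)
            for xi in xs]
```

For an interior point x, the weights (1.4) sum to essentially 1, while the weights (1.5) sum to approximately g(x) (so to roughly 1 for an evenly spread design on (0, 1)), consistent with the manipulations used in Sections 2 and 4.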
The estimator q̂_α, being the solution to equation (1.2), fits the prescription for a kernel M-estimator (e.g. Eubank, 1988, Section 4.11). Of course, these are usually considered as robust estimates of the mean function, a role suited only to the median (the case α = ½) out of our class of estimates. More importantly, q̂_α is a special case arising out of the work of Stone (1977) and Owen (1987) on nonparametrically estimating conditional distributions and associated functionals of them. Stone and Owen estimate each F_x as a weighted empirical distribution function, F̂_x, the weights controlling the contribution of y_i's corresponding to x_i's 'near' x to F̂_x. Replacing F_x by F̂_x (using our kernel weights) in the definition of q_α given just above (1.1) results in precisely the estimator q̂_α. That the quantile case remains worthy of individual attention as in the current paper follows from the fact, familiar in nonparametric (unconditional) density estimation, that degrees of smoothing suitable for estimating particular aspects of the distribution (as measured by appropriate functionals) differ from those appropriate for estimation of the distribution as a whole. By way of notation, write
F_x^{jk}(q_α(x)) = (∂^{j+k}/∂z^j ∂y^k) F(z; y) evaluated at z = x, y = q_α(x).   (1.6)

As is usual, our asymptotic results hold as n → ∞, h_α = h_α(n) → 0 and n h_α(n) → ∞.
2. The MSE of q̂_α for a fixed design

For the fixed design described in Section 1, recall that q̂_α is the solution to (1.2) with ψ_α given by (1.3) and w_i(x) by (1.4). The main work involved in obtaining the asymptotic mean and variance of q̂_α consists in obtaining the same quantities for H_n(θ), where θ = θ(x). Write μ_n(θ) = E H_n(θ). From (1.2) and (1.3), we have

μ_n(θ) = Σ_{i=1}^n w_i(x)[α P{y_i - θ(x) > 0} - (1 - α) P{y_i - θ(x) < 0}]
       = Σ_{i=1}^n w_i(x){α - F_{x_i}(θ(x))}.   (2.1)

Writing δ(x) = θ(x) - q_α(x), Taylor series expansion of F_{x_i}(θ(x)) about (x, q_α(x)) gives

F_{x_i}(θ(x)) ≈ F_x(q_α(x)) + δ(x) f_x(q_α(x)) + (x_i - x) F_x^{10}(q_α(x)) + ½(x_i - x)^2 F_x^{20}(q_α(x)).

Standard manipulations involving the kernel weights give

Σ_{i=1}^n w_i(x) = 1,   Σ_{i=1}^n w_i(x)(x_i - x) = O(n^{-1}),   Σ_{i=1}^n w_i(x)(x_i - x)^2 ≈ σ_K^2 h_α^2,

where σ_K^2 = ∫ x^2 K(x) dx. Then (2.1) becomes

μ_n(θ) ≈ - {θ(x) - q_α(x)} f_x(q_α(x)) - ½σ_K^2 h_α^2 F_x^{20}(q_α(x)),   (2.2)

since F_x(q_α(x)) = α. The analogue of (2.1) for the variance is, writing σ_n^2(θ) = var{H_n(θ)},

σ_n^2(θ) ≈ Σ_{i=1}^n w_i^2(x){α^2 + (1 - 2α) F_{x_i}(θ(x))}.   (2.3)
Then, since F_{x_i}(θ(x)) ≈ α and Σ_{i=1}^n w_i^2(x) ≈ (n h_α)^{-1} R(K), where R(K) = ∫ K^2(x) dx, we have

σ_n^2(θ) ≈ (n h_α)^{-1} R(K) α(1 - α).   (2.4)
Now notice that q̂_α is the solution θ(x) to the equation H_n(θ) - μ_n(θ) = -μ_n(θ) and that σ_n^{-1}(θ){H_n(θ) - μ_n(θ)} is asymptotically standard normal. The asymptotic MSE of q̂_α(x) follows as the sum of its squared bias and variance obtained explicitly from (2.2) and (2.4) (under assumptions which ensure the existence and finiteness of the quantities involved).

Result 1. In the fixed design case,

MSE(q̂_α(x)) = ¼σ_K^4 h_α^4 {F_x^{20}(q_α(x))/f_x(q_α(x))}^2 + (n h_α)^{-1} R(K) α(1 - α)/{f_x(q_α(x))}^2.   (2.5)

It follows that the asymptotically optimal value, h_α*, say, of h_α (which is not a practical proposition because of its dependence on unknowns) is given by

h_α* = [R(K) α(1 - α)/{σ_K^4 F_x^{20}(q_α(x))^2}]^{1/5} n^{-1/5},   (2.6)

and the corresponding best possible MSE is

MSE* = (5/4)[{σ_K R(K) α(1 - α)}^4 {F_x^{20}(q_α(x))}^2/{f_x(q_α(x))}^{10}]^{1/5} n^{-4/5}.   (2.7)
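Expression (2.5) has the generic form A h^4 + B/(n h), whose minimiser is h* = {B/(4An)}^{1/5}; (2.6) is this with the problem-specific A and B substituted. A quick numerical check of the generic calculus step, with arbitrary illustrative constants:

```python
# Check that h* = {B/(4An)}^{1/5} minimises A h^4 + B/(n h), the generic
# form of (2.5); A, B and n here are arbitrary illustrative values.
A, B, n = 2.0, 3.0, 1000

def mse(h):
    return A * h**4 + B / (n * h)

h_star = (B / (4 * A * n)) ** 0.2

# h* should beat every nearby bandwidth on a grid around it
grid = [h_star * (1 + d) for d in (-0.2, -0.1, 0.1, 0.2)]
assert all(mse(h_star) < mse(h) for h in grid)
```

The same check applies verbatim to the random-design expression (4.3), since it shares the A h^4 + B/(n h) form with different A and B.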
3. Discussion of the fixed design case

It is instructive to compare (2.5) with the corresponding expression for the kernel estimate m̂, say, of the regression mean m. This is (e.g. Eubank, 1988, p. 131)

MSE(m̂(x)) = ¼σ_K^4 h^4 {m''(x)}^2 + (n h)^{-1} R(K) s^2(x),   (3.1)

where h is m̂'s bandwidth and s^2(x) is the conditional variance of Y. We first note that the orders of magnitude of terms on the right-hand sides of (2.5) and (3.1) are precisely the same: O(λ^4) for the squared bias and O((nλ)^{-1}) for the variance, where λ denotes either h or any h_α. The optimal h is thus O(n^{-1/5}) and the optimal MSE is O(n^{-4/5}). Standard multipliers, ¼σ_K^4 of λ^4 and R(K) of (nλ)^{-1}, due to the effect of the kernel, also appear in each (we are, therefore, justified in taking K to be the same in estimating m or any q_α; the 'Epanechnikov kernel' is optimal, Marron and Nolan, 1988, but other reasonable choices have a similar performance).

The other factors in the MSE expressions are problem specific. The variance of m̂ depends on s^2(x), which is the variance of √n × the sample mean of a hypothetical sample of n y's at x; the variance of q̂_α depends in the same way on α(1 - α)/{f_x(q_α(x))}^2, which is the leading term in the variance of √n × the sample α-quantile at x, as n → ∞. While the bias of m̂ depends on m'', the corresponding quantity in the bias of q̂_α is F_x^{20}(q_α(x))/f_x(q_α(x)). The latter, too, depends on a second derivative and this is also a measure of roughness, but one adapted to the quantile regression problem: it is the second derivative of F
in the x-direction as we track the curve q_α(x). Since F(x; q_α(x)) is a constant, we can differentiate twice with respect to x and set the answer equal to zero to get the expression

-F_x^{20}(q_α(x)) = q_α''(x) f_x(q_α(x)) + {q_α'(x)}^2 f_x'(q_α(x)) + 2q_α'(x) F_x^{11}(q_α(x))   (3.2)

(here, f_x'(q_α(x)) ≡ F_x^{02}(q_α(x))), which has the slightly more transparent alternative form

-F_x^{20}(q_α(x)) = q_α''(x) f_x(q_α(x)) - {q_α'(x)}^2 f_x'(q_α(x)) + 2q_α'(x) (d/dx){f(x; q_α(x))}.   (3.3)
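Identity (3.2) is just the second total derivative of F(x; q_α(x)) = α set to zero, so it can be checked numerically by finite differences for any smooth model. A sketch of such a check, assuming a hypothetical Gaussian location-scale model (the choices of m, s and the evaluation point are arbitrary):

```python
import math

# Hypothetical model F(x; y) = Phi((y - m(x))/s(x)), used only to verify
# identity (3.2) numerically; m, s and x0 are illustrative choices.
def m(x): return x * x
def s(x): return 1.0 + 0.25 * x
def F(x, y):
    return 0.5 * (1.0 + math.erf((y - m(x)) / (s(x) * math.sqrt(2.0))))

x0, z = 0.5, 1.0            # take alpha = Phi(1), so q_alpha(x) = m(x) + z s(x)
q = m(x0) + z * s(x0)
qp = 2 * x0 + 0.25 * z      # q_alpha'(x0)
qpp = 2.0                   # q_alpha''(x0); s'' = 0 here

h = 1e-3                    # central finite-difference step
F20 = (F(x0 + h, q) - 2 * F(x0, q) + F(x0 - h, q)) / h**2
f = (F(x0, q + h) - F(x0, q - h)) / (2 * h)               # f_x(q_alpha(x0))
fp = (F(x0, q + h) - 2 * F(x0, q) + F(x0, q - h)) / h**2  # f_x'(q_alpha(x0))
F11 = (F(x0 + h, q + h) - F(x0 + h, q - h)
       - F(x0 - h, q + h) + F(x0 - h, q - h)) / (4 * h**2)

lhs = -F20
rhs = qpp * f + qp**2 * fp + 2 * qp * F11   # right-hand side of (3.2)
assert abs(lhs - rhs) < 1e-4
```

Because (3.2) is an exact identity for any smooth F, the two sides agree to within finite-difference error for any such model, not just this one.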
Notice how, for both m̂ and q̂_α, the variance factor depends on conditional properties of Y, reflecting the data production process, while the bias factor depends on (normalised) smoothness properties of the model in the x-direction.

It is useful to identify special cases which simplify the above comparisons. If |q_α''(x)| is much larger than |q_α'(x)|, then {F_x^{20}(q_α(x))/f_x(q_α(x))}^2 ≈ {q_α''(x)}^2 and the comparison with {m''(x)}^2 is much clearer. If F_x(y) has the location/scale form H[{y - m(x)}/s(x)], for some fixed distribution H, then (d/dx){f(x; q_α(x))} in (3.3) is zero. Specialising to the case of the median, the second term in (3.3) is also zero if H is symmetric, in which case q_{1/2}(x) is the same as m(x) and the bias terms for median and mean are identical. A comparison of h*_{1/2} from (2.6) and the mean-optimal bandwidth

h* = [R(K) s^2(x)/{σ_K^4 {m''(x)}^2}]^{1/5} n^{-1/5}   (3.4)
(which minimises (3.1)) then depends only on the variance comparison of {2f_x(q_{1/2}(x))}^{-2} with s^2(x). It is clear, however, that it is too much to expect any simple modification of a bandwidth chosen appropriately for the mean (for data-driven methods in that situation, see Eubank, 1988) to be generally available for application to choosing h*_{1/2} for the median.

The relationship between bandwidths appropriate for estimating different quantiles is also of great interest. Note that f_x'(q_α(x)) ≠ 0 in general. Only when q_α'' dominates q_α' (in which case h_α* is likely to be small) does the bias term simplify appreciably; even then, unless a location-shift-only model holds, q_{α1}''(x) and q_{α2}''(x) (for α1 ≠ α2) may differ, obscuring the variance-based comparison of h*_{α1} and h*_{α2} based on α1(1 - α1) f_x^2(q_{α2}(x))/{α2(1 - α2) f_x^2(q_{α1}(x))}. In this very special case, one would thus expect h_α* to increase as α moves away from ½ for many distributions of interest, although there are examples where this is not so. In general, a Taylor series expansion yields

h*_{α1}/h*_{α2} ≈ 1 - (2/5)(α1 - α2) F_x^{21}(q_{α2}(x))/{F_x^{20}(q_{α2}(x)) f_x(q_{α2}(x))},

and it can be shown that

-3q_α''(x) F_x^{21}(q_α(x)) = q_α'''(x) f_x(q_α(x)) + terms in (q_α'', q_α', 1);

it follows that, in cases where |q_α'''(x)| dominates lower order derivatives, h*_{α1} and h*_{α2} can be made to differ considerably and in either direction. Again, it had been hoped that a general relationship of a much simpler kind might hold. That is not to say, however, that use of a simple relationship (that of equal h_α's is most appealing) might not suffice in most practical instances.

4. The MSE of q̂_α for a random design

Now consider the random design case: q̂_α solves (1.2) with ψ_α given by (1.3) and w_i(x) by (1.5), and X = (x_1, ..., x_n) is a random sample from the distribution with density g.
Conditional on X, expression (2.1) and the Taylor expansion following it still hold, yielding

E{H_n(θ) | X} = -Σ_{i=1}^n w_i(x){δ(x) f_x(q_α(x)) + (x_i - x) F_x^{10}(q_α(x)) + ½(x_i - x)^2 F_x^{20}(q_α(x))}.

Taking expectation over X requires the readily obtained results

E Σ_{i=1}^n w_i(x) ≈ g(x),   E Σ_{i=1}^n w_i(x)(x_i - x) ≈ σ_K^2 h_α^2 g'(x)

and

E Σ_{i=1}^n w_i(x)(x_i - x)^2 ≈ σ_K^2 h_α^2 g(x).

It follows that, for the case of the random design,

μ_n(θ) ≈ - {θ(x) - q_α(x)} g(x) f_x(q_α(x)) - ½σ_K^2 h_α^2 {g(x) F_x^{20}(q_α(x)) + 2g'(x) F_x^{10}(q_α(x))}.   (4.1)
From (2.3),

var{H_n(θ) | X} ≈ Σ_{i=1}^n w_i^2(x) α(1 - α),

and we also have

E Σ_{i=1}^n w_i^2(x) ≈ (n h_α)^{-1} g(x) R(K).

It can be shown that var[E{H_n(θ) | X}] comprises only smaller order terms, so that

σ_n^2(θ) ≈ (n h_α)^{-1} g(x) R(K) α(1 - α).   (4.2)
Combining (4.1) and (4.2) (as we did (2.2) and (2.4)) and using the fact that F_x^{10}(q_α(x)) = -q_α'(x) f_x(q_α(x)) (differentiate the equation F(x; q_α(x)) = α once with respect to x) gives the following.

Result 2. In the random design case,

MSE(q̂_α(x)) = ¼σ_K^4 h_α^4 [{g(x) F_x^{20}(q_α(x)) - 2g'(x) q_α'(x) f_x(q_α(x))}/{g(x) f_x(q_α(x))}]^2 + (n h_α)^{-1} R(K) α(1 - α)/[g(x){f_x(q_α(x))}^2].   (4.3)

From (4.3) we see that

h_α* = [R(K) α(1 - α) g(x)/(σ_K^4 {g(x) F_x^{20}(q_α(x)) - 2g'(x) q_α'(x) f_x(q_α(x))}^2)]^{1/5} n^{-1/5}   (4.4)

and

MSE* = (5/4)[{σ_K R(K) α(1 - α)}^4 {g(x) F_x^{20}(q_α(x)) - 2g'(x) q_α'(x) f_x(q_α(x))}^2/(g^6(x){f_x(q_α(x))}^{10})]^{1/5} n^{-4/5}.   (4.5)
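As a sanity check on the random-design setting, a small seeded simulation can be run; the model, bandwidth and sample size below are illustrative assumptions, not taken from the paper. With x_i uniform on (0, 1), g is flat, the 2g'(x)q_α'(x)/g(x) bias contribution vanishes, and the weighted-quantile estimate at an interior point should sit close to the true conditional median:

```python
import math
import random

# Illustrative random-design simulation: x_i ~ Uniform(0,1) (so g' = 0),
# y_i = m(x_i) + symmetric noise, estimated at an interior point x0.
random.seed(42)

def m(x): return x * x

n, h, x0, alpha = 2000, 0.1, 0.5, 0.5
xs = [random.random() for _ in range(n)]
ys = [m(xi) + random.gauss(0.0, 0.25) for xi in xs]

# weights of the form (1.5) with a Gaussian kernel; constant factors
# cancel when taking the weighted quantile, so they are omitted
w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]

# solve (1.2): take the weighted alpha-quantile of the y_i
pairs = sorted(zip(ys, w))
total, cum, qhat = sum(w), 0.0, pairs[-1][0]
for y, wi in pairs:
    cum += wi
    if cum >= alpha * total:
        qhat = y
        break

# symmetric noise: the true conditional median at x0 = 0.5 is m(0.5) = 0.25
assert abs(qhat - 0.25) < 0.08
```

The tolerance is deliberately loose; by (4.3), the bias here is of order h^2 and the standard deviation of order (nh)^{-1/2}, both comfortably inside it for these settings.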
5. The random design effect

We finally comment that the effect of the random x-design on the MSE of q̂_α is precisely the same as its effect on the MSE of m̂. The latter is (e.g. Collomb, 1981)

MSE(m̂(x)) = ¼σ_K^4 h^4 {m''(x) + 2g'(x) m'(x)/g(x)}^2 + (n h)^{-1} R(K) s^2(x)/g(x).   (5.1)

In both (4.3) and (5.1), the divisor g(x) is introduced into the variance term, while an additional term of the form 2g'(x)η'(x)/g(x), where η is q_α in the case of (4.3) and m in the case of (5.1), appears in the bias (note that signs are the same because of the negative relationship between F_x^{20}(q_α(x)) and q_α''(x) in both (3.2) and (3.3)).
Acknowledgements We are grateful to Robert Kohn for bringing Art Owen’s work to our attention, and to Art Owen for supplying us with a copy of his thesis. Comments of Peter Bloomfield and a referee were helpful. This work was done while Chris Jones held a Visiting Fellowship, funded by the Mathematical Sciences Research Centre, at the Australian National University.
References

Bloomfield, P. and W.L. Steiger (1983), Least Absolute Deviations: Theory, Applications, and Algorithms (Birkhäuser, Boston).
Cole, T.J. (1988), Fitting smoothed centile curves to reference data (with discussion), J. Roy. Statist. Soc. Ser. A 151, 385-418.
Collomb, G. (1981), Estimation non-paramétrique de la régression: revue bibliographique, Internat. Statist. Rev. 49, 75-93.
Eubank, R.L. (1988), Spline Smoothing and Nonparametric Regression (Dekker, New York).
Jones, M.C. (1988), Contribution to the discussion of Cole (1988), J. Roy. Statist. Soc. Ser. A 151, 412-413.
Marron, J.S. and D. Nolan (1988), Canonical kernels for density estimation, Statist. Probab. Lett. 7, 195-199.
Owen, A.B. (1987), Nonparametric conditional estimation, Tech. Rept. No. 265, Dept. of Statist., Stanford Univ. (Stanford, CA).
Stone, C.J. (1977), Consistent nonparametric regression (with discussion), Ann. Statist. 5, 595-645.