Consistent bandwidth selection for kernel binary regression


Journal of Statistical Planning and Inference 70 (1998) 121-137


Naomi Altman (Biometrics Unit, Cornell University, Ithaca, NY 14853, USA) and Brenda MacGibbon (corresponding author; Département de Mathématiques, Université du Québec à Montréal, C.P. 8888, Succ. Centre-ville, Montréal, Canada H3C 3P8)

Received 3 August 1995; received in revised form 20 October 1997

Abstract. The use of nonparametric regression techniques for binary regression is a promising alternative to parametric methods. As in other nonparametric smoothing problems, the choice of smoothing parameter is critical to the performance of the estimator and the appearance of the resulting estimate. In this paper, we discuss the use of selection criteria based on estimates of squared prediction risk and show consistency and asymptotic normality of the selected bandwidths. The usefulness of the methods is explored on a data set and in a small simulation study. © 1998 Elsevier Science B.V. All rights reserved.

AMS classification: Primary 62G07; Secondary 62G20

Keywords: Cross-validation; $C_L$; Nonparametric regression; U-statistics

1. Introduction

Nonparametric regression can be particularly useful in the binary regression problem since scatterplots of the raw data and of regression residuals are often difficult to interpret. The model is

\[ y_i \sim \mathrm{Bernoulli}[\mu(t_i)], \qquad i = 1, \ldots, n, \tag{1.1} \]

where $\mu(t)$ is a smooth function with square integrable $p$th derivative, and the $t_i$ are either fixed or random design points on $[0, 1]$, depending on the sample size $n$. Several authors have noted the advantages of using smoothing for diagnostics in binary regression (Altman, 1992; Azzalini et al., 1989; Copas, 1983; Fowlkes, 1987). However, a selection criterion for the choice of bandwidth for this problem has not been previously studied, although one has been suggested (Azzalini et al., 1989; Hastie and Tibshirani, 1990, p. 159). In this article we examine the use of cross-validation (CV)


(Allen, 1974; Stone, 1974) and Mallows' $C_L$ (Mallows, 1973) for selecting bandwidths for kernel binary regression. Consistency of the adaptive estimators and convergence of the selected bandwidth are shown under average squared error loss.

In the ordinary nonparametric regression problem,

\[ y_i = \mu(t_i) + \varepsilon_i, \tag{1.2} \]

$\mu$ and the $t_i$ are as in Eq. (1.1) and the $\varepsilon_i$ are independent and identically distributed (i.i.d.) random variables with mean zero and finite variance. In this case, linear smoothers such as kernel regression estimators, smoothing splines, and local polynomial smoothers are known to provide consistent estimators of the regression function (Stone, 1977; Wahba and Wold, 1975). Good performance of these estimators depends critically on the choice of a smoothing parameter, which controls the variance and bias. In this paper we consider bandwidth selection for binary regression (1.1) using the kernel nonparametric regression estimator

\[ \hat\mu_\lambda(t) = \sum_{i=1}^{n} h_{\lambda i}(t)\, y_i, \]

where $h_{\lambda i}(t)$ can represent either of the kernel weights:

• Nadaraya-Watson weights (Nadaraya, 1964; Watson, 1964):
\[ h_{\lambda i}(t) = \frac{K\!\left(\dfrac{t - t_i}{\lambda}\right)}{\sum_{j} K\!\left(\dfrac{t - t_j}{\lambda}\right)}; \]

• Gasser-Müller weights (Gasser and Müller, 1979):
\[ h_{\lambda i}(t) = \frac{1}{\lambda} \int_{s_{i-1}}^{s_i} K\!\left(\frac{t - s}{\lambda}\right) ds, \]
with $s_0 = 0$, $s_n = 1$ and $s_i = \tfrac{1}{2}(t_i + t_{i+1})$ for $1 \le i \le n - 1$.
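To make the estimator concrete, here is a minimal sketch in Python (not from the paper; the quadratic kernel and all function names are our own choices) of kernel binary regression with both weight schemes:

```python
import numpy as np

def K(u):
    """Quadratic (Epanechnikov-type) kernel with support [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

def nw_weights(t, ti, lam):
    """Nadaraya-Watson weights: K((t - t_i)/lam) normalized to sum to 1."""
    w = K((t - ti) / lam)
    return w / w.sum()

def gm_weights(t, ti, lam, m=400):
    """Gasser-Mueller weights: (1/lam) * integral of K((t - s)/lam) over (s_{i-1}, s_i),
    with s_0 = 0, s_n = 1, s_i = (t_i + t_{i+1})/2; a simple Riemann sum is used."""
    s = np.concatenate(([0.0], (ti[:-1] + ti[1:]) / 2.0, [1.0]))
    w = np.empty(len(ti))
    for i in range(len(ti)):
        grid = np.linspace(s[i], s[i + 1], m)
        w[i] = K((t - grid) / lam).mean() * (s[i + 1] - s[i]) / lam
    return w

def mu_hat(t, ti, yi, lam, weights=nw_weights):
    """Kernel estimate of mu(t) = P(y = 1 | t) as the weighted sum of the y_i."""
    return weights(t, ti, lam) @ yi

# Toy usage on a smooth binary regression with design points on [0, 1].
rng = np.random.default_rng(0)
ti = np.sort(rng.uniform(0.0, 1.0, 200))
yi = rng.binomial(1, 0.2 + 0.6 * ti)
print(mu_hat(0.5, ti, yi, 0.15))              # Nadaraya-Watson fit at t = 0.5
print(mu_hat(0.5, ti, yi, 0.15, gm_weights))  # Gasser-Mueller fit at t = 0.5
```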
The paper proceeds as follows. In Section 2 we show the asymptotic optimality of CV and $C_L$; a central limit theorem for degenerate U-statistics based on independent but not necessarily identically distributed random variables (analogous to that of Hall (1984) in the i.i.d. case) is essential and is proved here. Section 3 contains an example. Section 4 gives simulation results related to the example. In Section 5 we discuss other methods for bandwidth selection, including minimizing (mean) integrated squared error as in Jennen-Steinmetz and Gasser (1988), Gasser and Engel (1990) and Chu and Marron (1991) for the case of ordinary nonparametric regression. We also discuss weighted least squares where the weights depend on estimates of the variance function, and the "cross-validated likelihood" method of Hastie and Tibshirani (1990, p. 159) and Azzalini et al. (1989).

2. Asymptotic optimality of CV and $C_L$

Performance of a nonparametric regression estimator is generally assessed by the average squared error (ASE) or its expectation (MASE):

\[ n\,\mathrm{ASE}(\lambda) = \| H(\lambda) y - \mu \|^2, \]
\[ n\,\mathrm{MASE}(\lambda) = n\,E[\mathrm{ASE}(\lambda)] = \mathrm{tr}\,\Sigma H(\lambda) H(\lambda)' + \| H(\lambda)\mu - \mu \|^2, \]

where $\mu$ denotes the vector $\mu_i = \mu(t_i)$, $\Sigma$ is the diagonal matrix $\Sigma_{ii} = \sigma^2(t_i) = \mu(t_i)[1 - \mu(t_i)]$ and $\|x\|^2 = x'x$. The $t_i$ are a selected set of points, generally the design points. $H(\lambda)$ is the matrix $H(\lambda)_{ij} = h_{\lambda ij} = h_{\lambda j}(t_i)$. The largest eigenvalue of $H(\lambda)H(\lambda)'$ will be denoted by $\xi(\lambda)$.

We consider two bandwidth selection criteria based on $\mathrm{ASE}(\lambda)$ and show that these lead to asymptotically optimal estimators of the regression function. Mallows' $C_L$ selects a bandwidth $\lambda_{C_L}$ which minimizes

\[ C_L(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [y_i - \hat\mu_\lambda(t_i)]^2 + \frac{2}{n} \sum_{i=1}^{n} \hat\sigma^2_\lambda(t_i)\, h_{\lambda ii}, \]

where $\hat\sigma^2_\lambda(t_i) = \hat\mu_\lambda(t_i)[1 - \hat\mu_\lambda(t_i)]$. The CV criterion selects a bandwidth $\lambda_{CV}$ which minimizes

\[ \mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [y_i - \hat\mu^{-i}_\lambda(t_i)]^2, \tag{2.1} \]

where $\hat\mu^{-i}_\lambda(t_i)$ is the kernel regression estimator of $\mu(t_i)$ for the data set including all of the data except $y_i$. We will show that

\[ \frac{\mathrm{ASE}(\lambda_C)}{\min_{\lambda \in L_n} \mathrm{ASE}(\lambda)} \stackrel{p}{\to} 1, \]

where $L_n$ is defined in (A) below and $\lambda_C$ is the bandwidth selected by one of the criteria. We also show that, letting $\lambda_{\mathrm{opt}} = \arg\min_{\lambda \in L_n} \mathrm{ASE}(\lambda)$ or $\lambda_{\mathrm{opt}} = \arg\min_{\lambda \in L_n} \mathrm{MASE}(\lambda)$,

\[ n^{3/10}(\lambda_C - \lambda_{\mathrm{opt}}) \stackrel{D}{\to} N(0, \sigma_C^2), \]

where the value of $\sigma_C^2$ depends on which criterion is being optimized. We will require that the design points be regular in the sense that either the design points come from a distribution $F$ with density $f$ bounded away from 0 on the interval of estimation, or the design points are fixed and approach $t_i = F^{-1}((i - 0.5)/n)$ for such a distribution.

The proofs in this section are inspired by Li (1987), and equation numbers of the form La.b refer to equation a.b in that paper. We first show the uniform convergence of loss to risk. For this, we need the following assumptions concerning the discrete index set $L_n$ of bandwidths:

(A) $L_n \subset (A n^{-1/(2p+1)}, B n^{-1/(2p+1)})$, where $A$ and $B$ are positive constants and the first nonzero moment of the kernel has order $p$.

(B) The cardinality of $L_n$ is $O(n^{m-4/5})$, where $m \ge 2$.

We also need that, for $n$ sufficiently large, there is a bound $C$ with

(C) $|h_{\lambda ii}| < C/(n\lambda)$.

These assumptions are not too restrictive. We find that the pointwise asymptotic bias of the kernel estimator is the same as in the continuous case (Gasser and Engel, 1990):

\[ \mathrm{Bias}[\hat\mu_\lambda(t)] = C^*(t)\lambda^p + o(\lambda^p) \tag{2.2} \]

and the pointwise asymptotic variance is

\[ \mathrm{Var}[\hat\mu_\lambda(t)] = \frac{C^{**}\sigma^2(t)}{n\lambda} + o\!\left(\frac{1}{n\lambda}\right), \tag{2.3} \]

where $C^*(t)$ is a function and $C^{**}$ is a constant depending on the kernel, $\mu$, the "density" of the design points, and whether the design points are fixed or random. Therefore any sequence of bandwidths that leads to estimators converging at the optimal rate must satisfy Assumption A. Assumptions A and B also ensure that for $m \ge 2$, $n^{-m} \sum_{\lambda \in L_n} \mathrm{MASE}(\lambda)^{-m} \to 0$, which is a key fact in the proofs. The Lipschitz continuity of $\mathrm{MASE}(\lambda)$ as a function of $\lambda$ ensures that the results extend to the entire interval $(A n^{-1/(2p+1)}, B n^{-1/(2p+1)})$. We obtain the following convergence result (the proof is in the Appendix):

Lemma 1. $\sup_{\lambda \in L_n} |\mathrm{ASE}(\lambda)/\mathrm{MASE}(\lambda) - 1| \stackrel{P}{\to} 0$ as $n \to \infty$.

Once Lemma 1 is established, we need only show that

\[ \frac{\mathrm{ASE}(\lambda_C) - \min_{\lambda} \mathrm{MASE}(\lambda)}{\min_{\lambda} \mathrm{MASE}(\lambda)} \stackrel{p}{\to} 0. \]
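Before turning to the individual criteria, here is a concrete sketch of both selectors (illustrative only; it assumes `mu_hat`, `nw_weights`, `ti` and `yi` from the sketch in Section 1 are in scope, and the helper names are ours):

```python
import numpy as np

def hat_matrix(ti, lam, weights=nw_weights):
    """H(lam), with H[i, j] = h_{lam, j}(t_i): row i gives the weights used at t_i."""
    return np.vstack([weights(t, ti, lam) for t in ti])

def C_L(lam, ti, yi, weights=nw_weights):
    """Mallows-type criterion: mean squared residual plus (2/n) sum sigma_hat^2(t_i) h_{lam,ii},
    using the Bernoulli plug-in variance sigma_hat^2 = mu_hat (1 - mu_hat)."""
    H = hat_matrix(ti, lam, weights)
    mu = H @ yi
    sig2 = mu * (1.0 - mu)
    return np.mean((yi - mu) ** 2) + 2.0 * np.sum(sig2 * np.diag(H)) / len(yi)

def CV(lam, ti, yi, weights=nw_weights):
    """Leave-one-out criterion (2.1): each y_i is predicted from the other n - 1 points."""
    n = len(yi)
    idx = np.arange(n)
    loo = [weights(ti[i], ti[idx != i], lam) @ yi[idx != i] for i in range(n)]
    return np.mean((yi - np.array(loo)) ** 2)

def select_bandwidth(criterion, ti, yi, grid, weights=nw_weights):
    """Return the grid bandwidth minimizing the given criterion."""
    scores = [criterion(lam, ti, yi, weights) for lam in grid]
    return grid[int(np.argmin(scores))]

grid = np.linspace(0.05, 0.5, 30)
print(select_bandwidth(CV, ti, yi, grid))   # lambda_CV
print(select_bandwidth(C_L, ti, yi, grid))  # lambda_CL
```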

2.1. $C_L$

Theorem 1. For $n \to \infty$, if $|h_{\lambda ii}| \le C/(n\lambda)$ and $\xi(\lambda)$ is bounded, then Mallows' $C_L$ is asymptotically optimal.

Proof. Let $\hat\Sigma_\lambda$ be the diagonal matrix with $\hat\Sigma_{\lambda ii} = \hat\sigma^2_\lambda(t_i)$ and $y$ be the vector $(y_1, \ldots, y_n)'$. Then

\[ n[C_L(\lambda) - \mathrm{ASE}(\lambda)] = \|y - \mu\|^2 + 2(y - \mu)'[\mu - H(\lambda)\mu] + 2\,\mathrm{tr}\,\Sigma H(\lambda) - 2(y - \mu)'H(\lambda)(y - \mu) - 2\,\mathrm{tr}\,(\Sigma - \hat\Sigma_\lambda)H(\lambda). \]

The proof exactly parallels the proofs of L2.2 and L2.3 after noting: (1) the moments of a Bernoulli random variable are bounded by 1; (2) by Hölder's inequality,

\[ r_i = [E|y_i - \mu(t_i)|^{4m}]^{1/(4m)} \le \sigma(t_i)\,[E|y_i - \mu(t_i)|^{4m-2}]^{1/(4m-2)}; \]

and (3) for $n$ sufficiently large, $|\hat\sigma^2_\lambda(t) - \sigma^2(t)| \le 4|\hat\mu_\lambda(t) - \mu(t)|$.

2.2. CV

The proof of asymptotic optimality rests on the fact that for large $n$, the difference between $\hat\mu_\lambda(t_i)$ and $\hat\mu^{-i}_\lambda(t_i)$ is small. Define $\tilde\mu_\lambda$ as the vector with $i$th element $\hat\mu^{-i}_\lambda(t_i)$, and $K(\lambda)$ as the matrix such that $K(\lambda)y = \tilde\mu_\lambda$.

Theorem 2. If the largest eigenvalue of $K(\lambda)K'(\lambda)$ is bounded, then $\mathrm{CV}(\lambda)$ is asymptotically optimal.

Proof. Let $n\,\widetilde{\mathrm{ASE}}(\lambda) = \|\mu - K(\lambda)y\|^2$ and $\widetilde{\mathrm{MASE}}(\lambda) = E[\widetilde{\mathrm{ASE}}(\lambda)]$. The proof parallels the proof of Theorem 1 if we can show that for any sequence $\{\lambda_n\}$ with $\lambda_n \in L_n$,

\[ \frac{\widetilde{\mathrm{MASE}}(\lambda_n)}{\mathrm{MASE}(\lambda_n)} \to 1. \tag{2.4} \]

The proof of Eq. (2.4) is in the Appendix.

Remark. It can easily be shown that the Gasser-Müller weights satisfy $|h_{\lambda ii}| < C/(n\lambda)$. This implies the boundedness of the largest eigenvalue in the case of positive ($p = 2$) kernels. Consequently it follows that the largest eigenvalue of $K(\lambda)K'(\lambda)$ is also bounded, because the corresponding entries of $H(\lambda)H'(\lambda)$ and $K(\lambda)K'(\lambda)$ differ by at most $C/n$. It is well known (Milnor, 1963) that if corresponding entries of two symmetric $n \times n$ matrices differ by at most $\varepsilon$, then their corresponding eigenvalues differ by at most $n\varepsilon$. For the Nadaraya-Watson kernels, however, we must assume that the largest eigenvalue is bounded for Theorem 2.
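The closeness of the full and deleted fits can be checked numerically. The sketch below (continuing the earlier code; an illustration, not the paper's verification) confirms that for Nadaraya-Watson weights the honest leave-one-out refit equals the algebraic shortcut used in the Appendix, so CV never requires $n$ separate refits:

```python
import numpy as np

lam, i = 0.15, 42
h = nw_weights(ti[i], ti, lam)         # full-data weights at t_i
full = h @ yi                          # mu_hat_lambda(t_i)
mask = np.arange(len(ti)) != i
refit = nw_weights(ti[i], ti[mask], lam) @ yi[mask]  # honest leave-one-out refit
shortcut = (full - h[i] * yi[i]) / (1.0 - h[i])      # identity from the Appendix
assert np.isclose(refit, shortcut)
print(refit, shortcut)
```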

2.3. Convergence of the selected bandwidth

Härdle et al. (1988) addressed the problem of smoothing parameter selection for nonparametric curve estimators in the ordinary nonparametric regression problem (1.2). They proved a central limit theorem which quantifies the convergence rate of a class of automatically selected bandwidths to the "optimal bandwidth", that is, the minimizer of $\mathrm{ASE}(\lambda)$ or $\mathrm{MASE}(\lambda)$. Their result is true for a large list of bandwidth selectors denoted by $C(\lambda)$. Here we sketch the reasons why an analogous result holds for binary regression. Essentially, we are interested in showing that a bandwidth selector $\lambda_C$, minimizing a selection criterion $C(\lambda)$ which satisfies Eq. (2.5) of Härdle et al. (1988), converges to $\lambda_{\mathrm{ASE}}$, the bandwidth minimizing $\mathrm{ASE}(\lambda)$, and that an analogous result holds with $\mathrm{MASE}(\lambda)$ replacing $\mathrm{ASE}(\lambda)$. We obtain the following theorem for binary regression.

Theorem 3. Under assumptions A and B, when the kernel has order $p = 2$ and Hölder continuous second derivative on its support, and the regression function $\mu(t)$ has a uniformly continuous integrable second derivative, then

\[ n^{3/10}(\lambda_C - \lambda_{\mathrm{ASE}}) \stackrel{D}{\to} N(0, \sigma^2), \qquad n[\mathrm{ASE}(\lambda_C) - \mathrm{ASE}(\lambda_{\mathrm{ASE}})] \stackrel{D}{\to} D\chi^2_1, \]

where $D$ and $\sigma^2$ are defined as part of the proof in the Appendix. An analogous result with different $D$ and $\sigma^2$ also holds when $\lambda_{\mathrm{ASE}}$ is replaced by $\lambda_{\mathrm{MASE}}$, the bandwidth minimizing $\mathrm{MASE}(\lambda)$.

3. Analysis of the periparturient recumbent cows data

Clark et al. (1987) collected a set of biochemical and haematological measures on dairy cows suffering from periparturient recumbency, a common problem often leading to death or euthanasia, in order to evaluate these measures as predictors of recovery. For the purposes of this analysis, 110 cows were classified as either recovered or not recovered. Although a number of measures were collected, the focus here is on a single variable, serum urea. In cattle, increased serum urea may be due to a number of causes such as shock, increased protein catabolism and/or kidney damage. Both high and low values are considered indicators of poor health.

Fig. 1 shows the raw data and fits for Nadaraya-Watson and Gasser-Müller kernel estimators using quadratic weights and bandwidths selected by either CV or $C_L$. The raw data, indicated by asterisks on the plot, are the percentage of cows recovering at each value of serum urea. As many as 3 cows were measured at some levels of urea, but most levels had only 1 animal, leading to proportions of 0, 1/3, 1/2, 2/3 or 1. CV selected a bandwidth of 0.51 for the Gasser-Müller kernel, and 0.11 for the Nadaraya-Watson kernel, leading to undersmoothing of the latter. (See Altman (1992) for an assessment of the fit of the smooth using the method of Azzalini et al. (1989), and Aragaki and Altman (1997) for a fit using bootstrap bandwidth selection with local linear regression.) Conversely, $C_L$ undersmoothed using the Gasser-Müller kernel ($\lambda = 0.19$) and produced a smoother fit with the Nadaraya-Watson kernel ($\lambda = 0.52$). These results indicate that the sample sizes required for automatic bandwidth selection in binary regression are somewhat larger than those needed for continuous data.


[Fig. 1 here: two panels, "Nadaraya-Watson kernel" and "Gasser-Müller kernel" — survival probability vs. log(serum urea), each showing the CV- and $C_L$-selected smooths.]

Fig. 1. Kernel regression smooths of survival probability of periparturient recumbent cows as a function of log(serum urea) with bandwidth selected by CV and $C_L$. The dots denote the raw data and include replicates at some values.

A simulation study which explores this idea in more detail is discussed in Section 4.

4. Simulations

In order to understand the results of the periparturient recumbent cows example, we undertook a small simulation study. Design points were simulated from the mixture of normals

\[ 0.25\,N(1.4, 0.4) + 0.60\,N(2.2, 0.4) + 0.15\,N(3.0, 0.2), \]

truncated to the interval $[0.7, 3.4]$. This mixture approximates a kernel density estimate of the design points for the data set. The average curve was defined by

\[ \mu(t) = 19.1 - 57.1t + 63.0t^2 - 31.9t^3 + 7.6t^4 - 0.69t^5, \]

which is the 5th degree polynomial fit to the periparturient recumbent cows data, slightly adjusted to ensure positivity on the region of definition. Then 100 Bernoulli response "curves" were generated for sample sizes of 100, 500 and 1000 design points. To reduce computation time, design points were generated just once for each sample size.
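A sketch of this simulation design (our own rendering of the recipe just described; the 0.4 and 0.2 are read as standard deviations, which the notation leaves ambiguous, and $\mu(t)$ is clipped to $[0, 1]$ to guard against rounding in the printed coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_design(n):
    """Design points from 0.25 N(1.4, 0.4) + 0.60 N(2.2, 0.4) + 0.15 N(3.0, 0.2),
    truncated to [0.7, 3.4], via rejection sampling."""
    means, sds, probs = [1.4, 2.2, 3.0], [0.4, 0.4, 0.2], [0.25, 0.60, 0.15]
    pts = []
    while len(pts) < n:
        k = rng.choice(3, p=probs)
        t = rng.normal(means[k], sds[k])
        if 0.7 <= t <= 3.4:
            pts.append(t)
    return np.sort(pts)

def mu(t):
    """The quintic average curve from the text."""
    return 19.1 - 57.1*t + 63.0*t**2 - 31.9*t**3 + 7.6*t**4 - 0.69*t**5

for n in (100, 500, 1000):
    t = draw_design(n)                          # one design per sample size
    p = np.clip(mu(t), 0.0, 1.0)
    curves = rng.binomial(1, p, size=(100, n))  # 100 Bernoulli response "curves"
    print(n, curves.shape)
```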


[Fig. 2 here: panel (a), "Bandwidth Densities" — density estimates and boxplots of the selected bandwidths (x-axis "Selected Bandwidth"); panels (b)-(d) — fitted curves for ASE, CV and $C_L$ (x-axis "simulated data").]

Fig. 2. Simulation results for $n = 100$, the Gasser-Müller kernel estimator and bandwidth selected by minimizing ASE, CV and $C_L$. Fig. 2a shows kernel density estimates of the distribution of the selected bandwidths. (ASE is the solid line; CV is dotted; $C_L$ is dashed.) The boxplots show the median and interquartile range for each method, demonstrating the tendency for $C_L$ to choose bandwidths which are too small, and the large variance of CV. Figs. 2b-2d show the true mean function (solid) and curves fitted using the lower quartile (long dash), median (dotted) and upper quartile (short dash) of the bandwidths selected by each method. The main differences in the fitted curves occur in the interval (2.5, 3.2), where the effects of picking too small a bandwidth appear as a valley and peak which are too large. The figures also show that while the main contributions to ASE occur in the ascending region (1.0, 1.9), edge effects also make a substantial contribution.

Nadaraya-Watson and Gasser-Müller kernel regression estimators were fitted to each generated curve, using quadratic kernels. Bandwidths for each curve were selected using one of three criteria: ASE, CV and $C_L$. Fig. 2a shows kernel estimates of the distributions of bandwidths for ASE, CV and $C_L$ for $n = 100$, along with boxplots indicating the median and interquartile range. (These were smoothed on the logarithmic scale and then back-transformed.) It is clear that $C_L$ is picking bandwidths that are, in general, too small, while CV has excessive variance. Figs. 2b-2d show the true mean function and the smooths obtained by the lower quartile, median and upper quartile bandwidths selected by each method. The most notable feature is that the 3 smooths on each plot do not differ too greatly, at least in visual assessment of the plots. Features on Figs. 2b-2d are similar. The smoothing broadens the "valley" near $x = 1.0$. The most evident effect of the choice of $\lambda$ occurs in the region (2.5, 3.2). The lower quartile choice of $\lambda$ accentuates the dip and peak in this region, while the upper quartile choice produces more of a plateau.


[Fig. 3 here: boxplots in four panels — (a) "Relative Bandwidth CV", (b) "Relative Bandwidth $C_L$", (c) "Relative ASE CV", (d) "Relative ASE $C_L$" — each for $n = 100$, 500 and 1000.]

Fig. 3. Comparison of results for CV and $C_L$ for the Gasser-Müller kernel. Figs. 3a and 3b show the logarithms of the ratio of the selected to optimal bandwidths for each method. Figs. 3c and 3d show the ratio $\mathrm{ASE}(\lambda_C)/\min_\lambda \mathrm{ASE}(\lambda)$ for each selector. Notice that both the bandwidth ratio and the ASE ratio are converging to 1, but there is still considerable variability, even at $n = 1000$.

The median choices are similar for ASE and CV, but the tendency of $C_L$ to pick bandwidths which are too small is evident. Pronounced edge effects evident in the plots demonstrate the desirability of down-weighting design points near the edge when selecting a bandwidth.

Figs. 3a and 3b show the difference between the logarithms of the selected and optimal bandwidths for each selector for sample sizes of 100, 500 and 1000. As the sample size increases, the distribution of ratios slowly becomes less spread out, although there is considerable variability even at $n = 1000$. Figs. 3c and 3d show the ratio $\mathrm{ASE}(\lambda_C)/\min_\lambda \mathrm{ASE}(\lambda)$ for each selector. The relative ASE does approach 1 as the sample size increases, but convergence is slow. There is considerable variability even at $n = 1000$. The plots for the Nadaraya-Watson kernel were similar. Optimal bandwidths differ slightly depending on the weighting scheme.

In the simulation study, both CV and $C_L$ were quite variable, as the asymptotics would suggest. However, CV was preferable as it was somewhat less spread out, and had less tendency to pick very small bandwidths. Simulations in Aragaki and Altman (1997) showed that plug-in methods do not work well for local linear binary regression, due to upwards bias in estimating the integrated derivative functionals required for plug-in estimation. Bootstrap methods appeared promising, but consistency has not yet been shown. Methods based on estimated squared prediction error are therefore competitive for kernel binary regression, and CV seems to be preferable to $C_L$ (and is easier to compute). Although the choice of kernel has very little effect on consistency and asymptotic normality, it does affect the bandwidth selection when sample sizes are small or moderate. The example in Section 3 illustrates this point. For an excellent discussion of problems of kernel choice see Gasser and Engel (1990) and Chu and Marron (1991).


5. Further remarks

We could also have considered different methods of bandwidth selection. Because the variance of $y_i$ depends on its mean in the binary regression problem, an average weighted squared error criterion of the form

\[ \mathrm{AWSE}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [\hat\mu_\lambda(t_i) - \mu(t_i)]^2\, w(t_i) \]

with weights depending on the local variance may be preferable to the use of $\mathrm{ASE}(\lambda)$ for bandwidth selection. Although we favor this idea on philosophical grounds, for reasons discussed below we were unable to find an appropriate way to estimate the weights from the data.

For Bernoulli data, $\sigma^2(t) = 0$ whenever $\mu(t)$ is 0 or 1. Eq. (2.2) shows that, if there is curvature at such points, the asymptotic bias of the regression estimator is nonzero. Hence if the weights are chosen to be inversely proportional to the variance for general mean functions $\mu(t)$, $\mathrm{AWSE}(\lambda)$ would become infinite. However, even if the mean function satisfies $0 < \delta^* \le \mu(t) \le \delta^{**} < 1$ in some region, the use of weighted least squares is problematic, due to the need to estimate the weights from the data. When the weights are estimated by $\hat w(t_i) = 1/\hat\sigma^2_\lambda(t_i)$, weighted versions of $C_L(\lambda)$ and $\mathrm{CV}(\lambda)$ do not have the correct order of magnitude. We have been unable to find a method of estimating weights that does not suffer from this problem.

Another method of bandwidth selection would be to consider criteria such as the integrated squared error (ISE),

\[ \mathrm{ISE}(\lambda) = \int [\hat\mu_\lambda(t) - \mu(t)]^2\, dt, \]

or the mean integrated squared error (MISE) (Jennen-Steinmetz and Gasser, 1988),

\[ \mathrm{MISE}(\lambda) = \int E[\hat\mu_\lambda(t) - \mu(t)]^2\, dt. \]

Our consistency and convergence results remain true for these criteria provided the bandwidth is selected by $C_L$ or CV on

\[ \frac{1}{n} \sum_{i=1}^{n} [\hat\mu_\lambda(t_i) - \mu(t_i)]^2 f^{-1}(t_i) \]

for ISE, and on its expectation for MISE, and provided these sums are good approximations of the required integrals. (However, when $f$ is unknown, we need to use a consistent estimator of $f^{-1}$. See Chu and Marron (1991), with discussion, for suggestions in the ordinary nonparametric regression case.) The methods of proof used here would remain valid for these quantities, which are weighted sums of the $[\hat\mu_\lambda(t_i) - \mu(t_i)]^2$ or $E[\hat\mu_\lambda(t_i) - \mu(t_i)]^2$, respectively; only the constants in the proofs change.

The bandwidth selection methods explored in this paper have been based on quadratic loss. In analogy with maximum likelihood methods in parametric problems, the use of modified maximum likelihood has been suggested for bandwidth selection (Azzalini


et al., 1989; Hastie and Tibshirani, 1990, p. 159). The bandwidth is selected to maximize the leave-one-out estimator of the log-likelihood:

\[ \mathrm{ML}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \log\!\left( \hat\mu^{-i}_\lambda(t_i)^{\,y_i}\, [1 - \hat\mu^{-i}_\lambda(t_i)]^{\,1 - y_i} \right), \]

where $\hat\mu^{-i}_\lambda(t_i)$ is defined following Eq. (2.1). Bowman (1984) showed that for density estimation $\mathrm{ML}(\lambda)$ is a cross-validation criterion based on Kullback-Leibler loss for discrete density estimation problems and can be viewed in that context for continuous density estimation problems. However, Schuster and Gregory (1981) showed that $\mathrm{ML}(\lambda)$ is not asymptotically optimal for density estimation of continuous distributions with long tails. Using the argument of Bowman (1984), it is easy to derive a loss function for binary regression for which $\mathrm{ML}(\lambda)$ is a leave-one-out estimator. This has the form

\[ n L(\mu, \hat\mu_\lambda) = \sum_{i=1}^{n} \hat\mu_\lambda(t_i) \log\!\frac{\mu(t_i)}{\hat\mu_\lambda(t_i)} + \sum_{i=1}^{n} [1 - \hat\mu_\lambda(t_i)] \log\!\frac{1 - \mu(t_i)}{1 - \hat\mu_\lambda(t_i)}. \]

There do not appear to be any compelling reasons to consider this as an appropriate loss function for this problem.
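For comparison with the quadratic-loss criteria, here is a minimal sketch (ours, not the authors'; it reuses `nw_weights`, `ti` and `yi` from the Section 1 sketch) of the cross-validated likelihood $\mathrm{ML}(\lambda)$; the clipping constant guards against $\log 0$ when a leave-one-out fit hits 0 or 1 exactly, a detail the formula above leaves open:

```python
import numpy as np

def ML(lam, ti, yi, weights=nw_weights, eps=1e-10):
    """Leave-one-out log-likelihood; the bandwidth maximizing ML(lam) is selected."""
    n = len(yi)
    idx = np.arange(n)
    ll = 0.0
    for i in range(n):
        p = weights(ti[i], ti[idx != i], lam) @ yi[idx != i]  # mu_hat^{-i}(t_i)
        p = min(max(p, eps), 1.0 - eps)                        # keep log() finite
        ll += yi[i] * np.log(p) + (1 - yi[i]) * np.log(1.0 - p)
    return ll / n

# usage: pick the maximizer over a bandwidth grid
# lam_ml = max(grid, key=lambda lam: ML(lam, ti, yi))
```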

Appendix

Lemma 1.

\[ \sup_{\lambda \in L_n} \left| \frac{\mathrm{ASE}(\lambda)}{\mathrm{MASE}(\lambda)} - 1 \right| \stackrel{P}{\to} 0. \]

Proof.

\[ n[\mathrm{ASE}(\lambda) - \mathrm{MASE}(\lambda)] = 2(\mu - H(\lambda)\mu)' H(\lambda)(y - \mu) + \|H(\lambda)(y - \mu)\|^2 - \mathrm{tr}\,\Sigma H(\lambda)H(\lambda)'. \]

By Whittle's inequality for sums,

\[ E\{[(y - \mu)' H(\lambda)'(\mu - H(\lambda)\mu)]^{2m}\} \le D\, \|H(\lambda)'(\mu - H(\lambda)\mu)\|^{2m} \le D\, \xi(\lambda)^m\, \|\mu - H(\lambda)\mu\|^{2m}, \]

so

\[ \sup_{\lambda \in L_n} \frac{|(y - \mu)' H(\lambda)'(\mu - H(\lambda)\mu)|}{n\,\mathrm{MASE}(\lambda)} \stackrel{p}{\to} 0. \]

By Whittle's inequality for quadratic forms and the proof of L2.3,

\[ E|\mathrm{tr}\,\Sigma H(\lambda)H(\lambda)' - (y - \mu)' H(\lambda)' H(\lambda)(y - \mu)|^{2m} \le D\, \xi(\lambda)^m\, [\mathrm{tr}\,\Sigma H(\lambda)H(\lambda)']^m, \]

so

\[ \sup_{\lambda \in L_n} \frac{|\mathrm{tr}\,\Sigma H(\lambda)H(\lambda)' - (y - \mu)' H(\lambda)' H(\lambda)(y - \mu)|}{n\,\mathrm{MASE}(\lambda)} \stackrel{p}{\to} 0. \]

Proof of Eq. (2.4). For any sequence $\{\lambda_n\}$ with $\lambda_n \in L_n$,

\[ \frac{\widetilde{\mathrm{MASE}}(\lambda_n)}{\mathrm{MASE}(\lambda_n)} \to 1. \]

The proof proceeds separately for the Nadaraya-Watson and Gasser-Müller weights.

Nadaraya-Watson weights:

\[ \hat\mu^{-i}_\lambda(t_i) - \mu(t_i) = \frac{\hat\mu_\lambda(t_i) - \mu(t_i) - h_{\lambda ii}[y_i - \mu(t_i)]}{1 - h_{\lambda ii}}, \]

so

\[ E[\hat\mu^{-i}_\lambda(t_i) - \mu(t_i)]^2 = \frac{E[\hat\mu_\lambda(t_i) - \mu(t_i)]^2 - h_{\lambda ii}^2\, \sigma^2(t_i)}{(1 - h_{\lambda ii})^2}. \]

Therefore,

\[ \frac{\widetilde{\mathrm{MASE}}(\lambda) - \mathrm{MASE}(\lambda)}{\mathrm{MASE}(\lambda)} = \sum_{i=1}^{n} \frac{(2h_{\lambda ii} - h_{\lambda ii}^2)\, E[\hat\mu_\lambda(t_i) - \mu(t_i)]^2 - h_{\lambda ii}^2\, \sigma^2(t_i)}{(1 - h_{\lambda ii})^2\, n\,\mathrm{MASE}(\lambda)} \le \frac{2C}{n\lambda}\,[1 + o(1)] \to 0. \]

Gasser-Müller weights:

\[ \hat\mu^{-i}_\lambda(t_i) = \hat\mu_\lambda(t_i) + (y_{i-1} - y_i)\, h_{\lambda ii}, \]

so

\[ n\,\widetilde{\mathrm{MASE}}(\lambda) = n\,\mathrm{MASE}(\lambda) + \sum_{i=1}^{n} E(y_{i-1} - y_i)^2\, h_{\lambda ii}^2 + 2 \sum_{i=1}^{n} h_{\lambda ii}\, E\{[y_{i-1} - y_i][\hat\mu_\lambda(t_i) - \mu(t_i)]\}. \]

Now $|y_{i-1} - y_i| \le 1$, so $E(y_{i-1} - y_i)^2 \le 1$. By the Cauchy-Schwarz inequality and assumption (C),

\[ \sum_{i=1}^{n} E(y_{i-1} - y_i)^2\, h_{\lambda ii}^2 \le \frac{C^2}{n\lambda^2}, \]

yielding

\[ \sum_{i=1}^{n} h_{\lambda ii}\, E[y_{i-1} - y_i][\hat\mu_\lambda(t_i) - \mu(t_i)] \le \frac{C}{n\lambda} \sum_{i=1}^{n} |E[\hat\mu_\lambda(t_i) - \mu(t_i)]| = O(\lambda^{p-1}). \]

Both terms are $o(n\,\mathrm{MASE}(\lambda))$, which proves Eq. (2.4).


Proof of Theorem 3. Here the method of proof follows that in Härdle et al. (1988). The sequence of steps outlined in the Appendix there is meticulously followed here, with the appropriate changes for binary regression. For the sake of completeness, the two main differences are given below. The first lemma they used is listed here, and a proof is outlined which differs slightly from theirs in order to accommodate Bernoulli random variables instead of mean zero i.i.d. errors. It is also necessary to develop a central limit theorem for degenerate U-statistics based on independent but not necessarily identically distributed random variables (see Theorem 4 below). The proofs of the remaining results necessary for Theorem 3 follow analogously to those in Härdle et al. (1988), and are omitted here. As the constants in our Theorem 3 differ from theirs, we give them first. Let

\[ \sigma_*^2 = 8 C_0^3 \left( \int \sigma^4 \right) \left( \int (K - L)^2 \right) + 4 C_0^2 \left( \int \sigma^2 \right) \left( \int (C^*(t))^2 \right), \]

where $C^*(t)$ and $C^{**}$ are defined in Eqs. (2.2) and (2.3) and $L(x) \equiv -xK'(x)$. The constants in Theorem 3 can now be defined as $\sigma^2 = C_1^{-2}\sigma_*^2$ and $D = C_1 \sigma_*^2/2$. (Analogous constants can also be defined for the limiting distribution of $\lambda_{\mathrm{MASE}}$.) As in Härdle et al. (1988), the following two expansions play an important role:

\[ \mathrm{ASE}'(\lambda_{\mathrm{ASE}}) = 0 = \mathrm{MASE}'(\lambda_{\mathrm{ASE}}) + \Delta'(\lambda_{\mathrm{ASE}}) = (\lambda_{\mathrm{ASE}} - \lambda_{\mathrm{MASE}})\,\mathrm{MASE}''(\lambda^*) + \Delta'(\lambda_{\mathrm{ASE}}), \]

where the prime ($'$) denotes derivative, $\lambda^*$ is between $\lambda_{\mathrm{ASE}}$ and $\lambda_{\mathrm{MASE}}$, and $\Delta(\lambda) = \mathrm{ASE}(\lambda) - \mathrm{MASE}(\lambda)$. Now let

\[ C(\lambda) = [\mathrm{ASE}(\lambda) + \hat\sigma^2 + \delta_1(\lambda)]\left[1 + \frac{2}{n\lambda} K(0) + O_p(n^{-2}\lambda^{-2})\right], \]

where $n\hat\sigma^2 = \sum_{i=1}^{n} [y_i - \mu(t_i)]^2$ and $n\delta_1(\lambda) = 2\sum_{i=1}^{n} [\hat\mu_\lambda(t_i) - \mu(t_i)][\mu(t_i) - y_i]$.

Lemma 2 (Härdle, Hall and Marron, Lemma 1). Let $\tilde\Delta$ represent either $\Delta$ or $\delta_1$. For $m = 1, 2, \ldots$ there exist constants $D_3$ and $D_4$ and an $\eta_1 > 0$ such that

the moment bounds (A.1) and (A.2) of Härdle et al. (1988) hold for $\tilde\Delta'(\lambda)$ and for the normalized increments $[\tilde\Delta'(\lambda) - \tilde\Delta'(\lambda_0)]/(\lambda - \lambda_0)$, respectively, whenever $\lambda, \lambda_0 \in L_n$ with $\lambda \ne \lambda_0$ and $|\lambda/\lambda_0 - 1| \le 1$.

Proof. Here we discuss the case of $\delta_1(\lambda)$. It should be noted that $\Delta_1(\lambda) = -\tfrac{1}{2}\lambda \Delta'(\lambda)$ can be expanded into

\[ \Delta_1(\lambda) = \Delta_{11}(\lambda) - \Delta_{12}(\lambda) + \Delta_{21}(\lambda) - \Delta_{22}(\lambda) + \Delta_{31}(\lambda) - \Delta_{32}(\lambda), \]

where the $\Delta_{ij}$ have definitions analogous to those of the $S_{ij}$ given by Härdle et al. (1988) in the proof of their Lemma 1. In order to prove Eq. (A.2), for example, it suffices to consider each of the $\Delta_{ij}$ separately; by standard arguments involving the compactness of the support of $K$, each can be bounded as required.
Theorem 4. Let $U_n = \sum_{1 \le i < j \le n} H_n(Y_i, Y_j)$, where $H_n$ is a symmetric function and $Y_1, \ldots, Y_n$ are independent non-identically distributed random variables (or vectors), and $H_n$ satisfies for each $i \ne j$, $1 \le i, j \le n$:

1. $E[H_n(Y_i, Y_j)] = 0$ and $E[H_n(Y_i, Y_j) \mid Y_i] = 0$ a.e.;
2. if $b_n = \max_{1 \le i, j \le n} E[H_n^2(Y_i, Y_j)]$, then $\sup_n (b_n) < \infty$.

Now let

\[ \sigma^2(U_n) = \sum_{1 \le i < j \le n} E[H_n^2(Y_i, Y_j)], \tag{A.3} \]

and for each $i = 1, \ldots, n$ let $X_{ni} = \sum_{j=1}^{i-1} H_n(Y_i, Y_j)$ and $G_{n,i}(x, y) = E[H_n(Y_i, x) H_n(Y_i, y)]$. Now if

\[ \frac{\sum_{i} \sum_{j} E(G_{n,i}^2(Y_i, Y_j)) + \sum_{i=2}^{n} E(X_{ni}^4)}{\sigma^4(U_n)} \to 0, \tag{A.4} \]

then $U_n$ is asymptotically normally distributed with mean 0 and variance given by $\sigma^2(U_n)$.

Proof. It follows from Hoeffding (1948, p. 300) that $\sigma^2(U_n)$ is given by Eq. (A.3). A straightforward application of Brown's martingale central limit theorem (Brown, 1971; Hall and Heyde, 1980), as used by Hall (1984) in the case of i.i.d. random variables, yields the result. More explicitly, let

\[ v_{nl} = E[X_{nl}^2 \mid Y_1, \ldots, Y_{l-1}] \qquad \text{and} \qquad V_n^2 = \sum_{l=2}^{n} v_{nl}. \]

Clearly,

\[ v_{nl} = \sum_{j=1}^{l-1} \sum_{k=1}^{l-1} G_{n,l}(Y_j, Y_k) = 2 \sum_{1 \le j < k \le l-1} G_{n,l}(Y_j, Y_k) + \sum_{j=1}^{l-1} G_{n,l}(Y_j, Y_j). \]

Now $E[G_{n,l}(Y_j, Y_k) G_{n,m}(Y_p, Y_r)] = 0$ unless $j = k = p = r$, or $j = k \ne p = r$, or $j = p$ and $k = r$. Thus, if $l \le m$,

\[ E(v_{nl} v_{nm}) = 4 \sum_{1 \le j < k \le l-1} E[G_{n,l}(Y_j, Y_k) G_{n,m}(Y_j, Y_k)] + \sum_{j=1}^{l-1} \sum_{p=1}^{m-1} E[G_{n,l}(Y_j, Y_j)]\, E[G_{n,m}(Y_p, Y_p)] + \sum_{r=1}^{l-1} \mathrm{var}\, G_{n,l}(Y_r, Y_r). \]

Clearly, if $s_n^2 = \sigma^2(U_n)$, condition (A.4) implies that

\[ \frac{1}{s_n^4}\, E(V_n^2 - s_n^2)^2 \to 0. \tag{A.5} \]

Also,

\[ E(X_{ni}^2) = \sum_{j=1}^{i-1} E[H_n^2(Y_i, Y_j)], \qquad \sigma^2(U_n) = s_n^2 = \sum_{i=2}^{n} \sum_{j=1}^{i-1} E[H_n^2(Y_i, Y_j)], \]

\[ E[X_{ni}^4] = \sum_{j=1}^{i-1} E[H_n^4(Y_i, Y_j)] + 3 \sum_{\substack{j,k=1 \\ j \ne k}}^{i-1} E[H_n^2(Y_i, Y_j)\, H_n^2(Y_i, Y_k)]. \]

Hence it follows from condition (A.4) that

\[ \frac{1}{s_n^4} \sum_{i=2}^{n} E(X_{ni}^4) \to 0. \tag{A.6} \]

Now conditions (A.5) and (A.6) suffice for the martingale central limit theorem of Brown (1971) to hold (Hall and Heyde, 1980; Hall, 1984).
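The flavor of Theorem 4 can be checked by simulation. The sketch below (an illustration with arbitrary choices, not the paper's construction) builds a degenerate U-statistic from independent, non-identically distributed Bernoulli errors with an $n$-dependent, kernel-type $H_n$, standardizes by $s_n$ from Eq. (A.3), and inspects the first two moments, which should be near 0 and 1 when the normal approximation holds:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, lam = 400, 2000, 0.1
t = np.linspace(0.0, 1.0, n)
p = 0.2 + 0.6 * t                                 # non-identical Bernoulli means
sig2 = p * (1.0 - p)

# H_n(Y_i, Y_j) = W_ij (Y_i - p_i)(Y_j - p_j); E[H_n | Y_i] = 0, so U_n is degenerate.
W = np.exp(-(((t[:, None] - t[None, :]) / lam) ** 2))
np.fill_diagonal(W, 0.0)

s2 = 0.5 * np.sum(W ** 2 * np.outer(sig2, sig2))  # s_n^2 = sum_{i<j} E[H_n^2(Y_i, Y_j)]

U = np.empty(reps)
for r in range(reps):
    e = rng.binomial(1, p) - p
    U[r] = 0.5 * e @ W @ e                        # U_n = sum_{i<j} H_n(Y_i, Y_j)

Z = U / np.sqrt(s2)
print(Z.mean(), Z.std())                          # approximately 0 and 1
```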


Acknowledgements

The authors would like to thank the referees for their helpful comments. The work of the first author was supported by Hatch Grant NYC 410 and NSF Grant DMS-9525350. The work of the second author was supported in part by grants from NSERC of Canada and FCAR of Quebec.

References

Allen, D.M., 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16, 1307-1325.
Altman, N.S., 1992. An introduction to kernel and nearest-neighbor nonparametric regression. Amer. Statist. 46, 175-185.
Aragaki, A., Altman, N.S., 1997. Local polynomial regression for binary response. Biometrics Unit Technical Report BU-1397-M.
Azzalini, A., Bowman, A.W., Härdle, W., 1989. On the use of nonparametric regression for model checking. Biometrika 76, 1-11.
Bowman, A.W., 1984. An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71, 353-360.
Brown, B.M., 1971. Martingale central limit theorems. Ann. Math. Statist. 42, 59-66.
Buja, A., Hastie, T., Tibshirani, R., 1989. Linear smoothers and additive models (with discussion). Ann. Statist. 17, 453-555.
Chu, C.-K., Marron, J.S., 1991. Choosing a kernel regression estimator (with discussion). Statist. Sci. 6, 406-436.
Clark, R.G., Henderson, H.V., Hoggard, G.K., Ellison, R.S., Young, B.J., 1987. The ability of biochemical and haematological tests to predict recovery in periparturient recumbent cows. New Zealand Vet. J. 35, 126-133.
Copas, J.B., 1983. Plotting p against x. Appl. Statist. 32, 25-31.
Fowlkes, E.B., 1987. Some diagnostics for binary logistic regression via smoothing. Biometrika 74, 503-515.
Gasser, T., Engel, J., 1990. The choice of weights in kernel regression estimation. Biometrika 77, 377-381.
Gasser, T., Müller, H.G., 1979. Kernel estimation of regression functions. In: Gasser, T., Rosenblatt, M. (Eds.), Smoothing Techniques for Curve Estimation. Lecture Notes in Mathematics, vol. 757. Springer, Heidelberg, pp. 23-67.
Hall, P., 1984. Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal. 14, 1-16.
Hall, P., Heyde, C.C., 1980. Martingale Limit Theory and its Application. Academic Press, New York.
Härdle, W., Hall, P., Marron, J.S., 1988. How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). J. Amer. Statist. Assoc. 83, 86-95.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman and Hall, New York.
Hoeffding, W., 1948. A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293-325.
Jennen-Steinmetz, C., Gasser, T., 1988. A unifying approach to nonparametric regression estimation. J. Amer. Statist. Assoc. 83, 1084-1089.
Li, K.-C., 1987. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist. 15, 958-975.
Mallows, C., 1973. Some comments on C_p. Technometrics 15, 661-675.
Milnor, J., 1963. Morse Theory. Princeton University Press, Princeton, NJ.
Nadaraya, E.A., 1964. On estimating regression. Theory Probab. Appl. 9, 141-142.
Priestley, M.B., Chao, M.T., 1972. Non-parametric function fitting. J. Roy. Statist. Soc. Ser. B 34, 385-392.
Schuster, E.F., Gregory, G.G., 1981. On the nonconsistency of maximum likelihood nonparametric density estimators. In: Eddy, W.F. (Ed.), 13th Annual Symp. on the Interface of Computer Science and Statistics. Springer, New York, pp. 295-298.
Stone, C.J., 1977. Consistent nonparametric regression. Ann. Statist. 5, 595-645.


Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 39, 44-47.
Wahba, G., Wold, S., 1975. A completely automatic french curve: fitting spline functions by cross-validation. Commun. Statist. 4, 1-47.
Watson, G.S., 1964. Smooth regression analysis. Sankhyā Ser. A 26, 359-372.
Whittle, P., 1960. Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab. Appl. 5, 302-305.