Journal of Econometrics 75 (1996) 291-316
ELSEVIER
Optimal bandwidth choice for density-weighted averages
James L. Powell^a, Thomas M. Stoker^{*,b}
^a Department of Economics, Princeton University, Princeton, NJ 08544, USA
^b Sloan School of Management, MIT, Cambridge, MA 02139, USA
(Received May 1992; final version received May 1995)
Abstract
This paper characterizes the optimal bandwidth value for estimating density-weighted averages, statistics that arise in semiparametric estimation methods for index models and models of selected samples based on nonparametric kernel estimators. The optimal bandwidth is derived by minimizing the leading terms of mean squared error of the density-weighted average. The optimal bandwidth formulation is developed by comparison to the optimal pointwise bandwidth of a naturally associated nonparametric estimation problem, highlighting the role of sample size and the structure of nonparametric estimation bias. The methods are illustrated by estimators of average density, density-weighted average derivatives, and conditional covariances, and bandwidth values are calculated for normal designs. A simple 'plug-in' estimator for the optimal bandwidth is proposed. Finally, the optimal bandwidth for estimating ratios of density-weighted averages is derived, showing that the earlier optimal formulae can be implemented directly using naturally defined 'residual' values.
Key words: Smoothing; Optimal bandwidth; Semiparametric estimation; Nonparametric estimation; Density weighting; 'Plug-in' estimator
JEL classification: C13; C14; C25
1. Introduction
Recent advances in the study of semiparametric methods in econometrics have yielded a number of new tools for studying empirical economic relationships. An important class of these methods involves 'plug-in' estimators, where estimation of
* Corresponding author. This research was supported by National Science Foundation Grants SES 9010881 and SES 9122886. The authors wish to thank the reviewers for several valuable comments. 0304-4076/96/$15.00 © 1996 Elsevier Science S.A. All rights reserved. SSDI 0304-4076(95)01761-2
parameters of interest is facilitated by using nonparametric estimates of functions in place of the true, but unknown functions. Typical examples of such unknown functions include the density of disturbances in a model, or unknown features of the regression function of the response on the predictor variables. Examples of nonparametric estimators include kernel estimators and related local smoothing methods, or series estimators such as truncated polynomials or spline methods.

The issues of precision of nonparametric estimators are well known. In particular, suppose a function is estimated by averaging over a window of nearby data values, and consider the difference between setting a large or small window size. A large window includes more observations, thereby reducing variance, but masks subtle nonlinearity, or increases bias. Alternatively, a small window better facilitates detecting nonlinearity, or reduces bias, but involves fewer observations, thereby increasing variance. For estimating the function at a point, the optimal window size, or bandwidth value, is given by balancing variance with squared bias, thereby assuring the smallest mean squared error. The tradeoff between bias and variance will vary over different ranges of the function to be estimated, as will the optimal bandwidth or window size. A single, global choice of bandwidth can be based on minimizing average or integrated (pointwise) mean squared error values, or some other weighting of error across different ranges of the unknown function. 1 The literature on bandwidth choice in estimation of functions is quite extensive, and includes several automatic (data-based) methods for choosing bandwidths in applications. 2

When nonparametric estimators are used as ingredients in semiparametric estimation, the concerns regarding their precision are different.
Since the parameters to be estimated are a primary focus, the relative importance of (pointwise) bias and variance of the nonparametric estimators is different than in the purely nonparametric case. For instance, if the parameter estimates are adversely affected by bias in the nonparametric estimators, then it may be sensible to lower the bandwidth size, reducing pointwise bias relative to variance. As such, semiparametric use of nonparametric estimators involves different criteria for nonparametric approximation than optimal estimation of the unknown functions. This feature is evident from the now standard results of asymptotic theory for semiparametric estimators. For example, procedures that employ kernel estimators typically involve 'asymptotic undersmoothing' - if $N$ denotes sample size, 'asymptotic undersmoothing' refers to the notion that for parameter estimates to be $\sqrt{N}$-consistent, the bandwidth for kernel estimation must be shrunk
1 The same issues apply for any nonparametric method, such as choosing the degree of a polynomial expansion, or the degree of spline functions used for approximation.
2 Textbook treatments of bandwidth choice in nonparametric estimation are given in Silverman (1986), Härdle (1991), and Hastie and Tibshirani (1990). References to more recent literature are given in Härdle, Hall, and Marron (1988), Gasser, Kneip, and Köhler (1991), and Nychka (1991), among others.
more rapidly to zero than it would be for optimal pointwise estimation. This feature was noted for the estimators studied in Robinson (1988), Powell, Stock, and Stoker (1989), and Härdle and Stoker (1989), among others, and more recently is highlighted in the unifying theory of Goldstein and Messer (1990). 3 This work does not address the issues of choosing bandwidths for particular applications, but rather just indicates how the conditions of limiting theory differ between nonparametric and semiparametric estimation.

In this paper we characterize bandwidth choice in perhaps the simplest substantive semiparametric estimation problem, namely, the estimation of density-weighted averages. This problem is interesting because it covers procedures for a wide range of semiparametric models, including situations where the precision of the nonparametric kernel estimators is a central focus. 4 We derive the optimal bandwidth by minimizing mean squared error of the estimator, and the nature of the solution is simple because the technical details of the analysis are kept to a minimum. This simplicity has the added bonus of permitting a straightforward comparison between optimal bandwidth values for pointwise estimation and for semiparametric estimation. Practical methods for bandwidth choice follow naturally from the development.

Related to our derivation is work on estimation of integrated squared density derivatives. In particular, Hall and Marron (1987) study that problem using kernel estimators, 5 and derive an optimal bandwidth formula. Our development can be viewed as a generalization of their results to more general estimation problems. Also related is a recent paper by Härdle, Hart, Marron, and Tsybakov (1992), which studies the problem of bandwidth choice for estimation of one-dimensional unweighted average derivatives. Their results are specific to this case, and require strong conditions on the distribution of the covariates which are not imposed here. 6 Härdle and Tsybakov (1993) independently derived a result on bandwidth choice for weighted average derivatives similar to that in Section 4; our results specialize to their formulae for the weighted average derivative case.

Section 2 presents the estimator, estimand, and a series of examples for motivation. Section 3 presents our assumptions, in the context of a 'pointwise' nonparametric estimator that is closely associated with the density-weighted average, and reviews the optimal bandwidth formula for the pointwise estimator. Section
3 Newey (1994) notes some differences between pointwise function estimation and semiparametric estimation when truncated polynomials are used.
4 Andrews (1989) discusses situations where the precision of nonparametric estimators does not affect the asymptotic theory for 'plug-in' semiparametric methods.
5 Work on other aspects of estimating integrated squared density derivatives includes Bickel and Ritov (1988) and Jones and Sheather (1991), among others.
6 Kernel estimators of unweighted average derivatives (Härdle and Stoker, 1989) are nonlinear combinations of kernel estimators computed with trimming of the data sample. These two features substantially complicate the analysis of bandwidth choice.
4 derives the optimal bandwidth for the density-weighted average, and spells out how the optimal bandwidth differs from the pointwise bandwidth in terms of an adjustment for sample size and an adjustment for the structure of nonparametric bias. These features are illustrated by computed bandwidth values for designs based on normal random variables. Section 4 closes with a simple 'plug-in' estimator of the optimal bandwidth. Section 5 then characterizes bandwidth choice for ratios of density-weighted averages, as motivated in certain examples of Section 2. Section 6 gives some concluding remarks.
2. The estimation problem and examples
We assume that the data represent an i.i.d. sample of observations $\{z_i,\ i = 1, \ldots, N\}$, where $z$ is the vector of responses and predictor variables as outlined below. We study estimators of 'density-weighted averages' of the following form:
$$\hat\delta(h) = \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} p(z_i, z_j, h), \tag{2.1}$$
where the function $p(\cdot)$ is symmetric in pairs of observations - that is, $p(z_i, z_j, h) = p(z_j, z_i, h)$. As such, $\hat\delta(h)$ is a second-order U-statistic with kernel $p$. The bandwidth is the parameter $h$, and the limiting theory for $\hat\delta(h)$ has $h$ decreasing with sample size, or $h = h(N) \to 0$ as $N \to \infty$. If the expectation of $\hat\delta(h)$ is denoted
$$\delta(h) \equiv E[\hat\delta(h)] = E[p(z_i, z_j, h)], \tag{2.2}$$
then the object of estimation is
$$\delta_0 \equiv \lim_{h \to 0} \delta(h). \tag{2.3}$$
The object of the paper is to characterize the optimal bandwidth $h^+$ for computing $\hat\delta(h)$. We also characterize the optimal bandwidth for ratios of density-weighted averages, or ratios of estimators in the form (2.1). We refer to the U-statistic in (2.1) as a 'density-weighted average' because this form often arises when kernel methods are used to estimate density-weighted expectations, as in each of the following examples. Example 2.1 is a useful pedagogical device for illustrating our results, and Examples 2.2 and 2.3 arise from standard semiparametric problems in econometrics.
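As a concrete illustration of the form (2.1), the following Python sketch computes a second-order U-statistic for a user-supplied symmetric pair function. The names `density_weighted_average` and `p_avg_density` are hypothetical, and the Gaussian pair function (scalar data, standard normal kernel) is only an assumed example of a symmetric $p$, in the spirit of the average-density case of Example 2.1.

```python
import itertools
import math

def density_weighted_average(z, p, h):
    """Second-order U-statistic of the form (2.1):
    the average of p(z_i, z_j, h) over all pairs i < j."""
    pairs = list(itertools.combinations(z, 2))
    return sum(p(zi, zj, h) for zi, zj in pairs) / len(pairs)

def p_avg_density(xi, xj, h):
    """Assumed illustrative pair function for scalar data (k = 1):
    h^{-1} K((xi - xj) / h), with K the standard normal density."""
    u = (xi - xj) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))

data = [0.1, -0.4, 1.2, 0.7, -1.5]
print(density_weighted_average(data, p_avg_density, h=0.5))
```

Because $p$ is symmetric, the result does not depend on the ordering of the observations, which is the property that makes the statistic a U-statistic.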
Example 2.1 (average density). Here $z_i = x_i \in R^k$ is a continuous random vector, $x_i \sim f(x)\,dx$, and the object of estimation is the average density value:
$$\delta_0 = \int f(x)^2\,dx = E[f(x_i)]. \tag{2.4}$$
The estimator $\hat\delta(h)$ of $\delta_0$ is constructed by computing density estimates for each data point as
$$\hat f_i(x_i, h) = \frac{1}{N-1} \sum_{j \ne i} h^{-k}\,\mathcal{K}\!\left(\frac{x_i - x_j}{h}\right), \tag{2.5}$$
and taking their average $\hat\delta(h) = N^{-1} \sum_{i=1}^{N} \hat f_i(x_i, h)$, which gives (2.1) with
$$p(z_i, z_j, h) = h^{-k}\,\mathcal{K}\!\left(\frac{x_i - x_j}{h}\right). \tag{2.6}$$
Here $\mathcal{K}: R^k \to R$ is a (kernel) function with $\mathcal{K}(u) = \mathcal{K}(-u)$ and $\int \mathcal{K}(u)\,du = 1$. Here and elsewhere, we utilize 'leave-out' kernel estimators such as (2.5); we could incorporate the '$i = j$' terms without changing any results, but unnecessarily complicate the notation to account for the terms. 7 For parts of our bias discussion below, we will refer to the order of the kernel $\mathcal{K}$: this is defined as an integer $P\ (\ge 2)$ such that $\int u_1^{l_1} \cdots u_k^{l_k}\,\mathcal{K}(u)\,du = 0$ if $\{l_j\}$ are nonnegative integers with $0 < l_1 + \cdots + l_k < P$, and $\int u_1^{l_1} \cdots u_k^{l_k}\,\mathcal{K}(u)\,du \ne 0$ for some $l_1 + \cdots + l_k = P$. (In the typical case where $\mathcal{K}$ is a positive symmetric density function, $P = 2$.)

Example 2.2 (density-weighted average derivative). Here $z_i = (y_i, x_i')'$, where $y_i \in R^1$ denotes a response or dependent variable, $x_i \in R^k$ denotes a continuous predictor variable, $x_i \sim f(x)\,dx$, and the regression of $y$ on $x$ is denoted $E[y_i \mid x_i] = g(x_i)$. The object of estimation is the density-weighted average derivative:
$$\delta_0 = E\left[f(x_i)\,\frac{\partial g(x_i)}{\partial x}\right] = -2\,E\left[y_i\,\frac{\partial f(x_i)}{\partial x}\right], \tag{2.7}$$
assuming $f(x)g(x) \to 0$ as $|x| \to \infty$, and all derivatives and moments exist (Powell, Stock, and Stoker, 1989). The density-weighted average derivative gives an estimator of index model coefficients up to scale; that is, when $g(x_i) = G(x_i'\beta_0)$, then $\delta_0$ is proportional to the coefficients $\beta_0$. The estimator $\hat\delta(h)$ is the sample analogue of the second equality, or the average of $-2 y_i\,\partial \hat f_i(x_i, h)/\partial x$, where $\hat f_i(x_i, h)$ is the kernel estimator (2.5). This gives $\hat\delta(h)$ in the (vector) form (2.1), with
$$p(z_i, z_j, h) = -\left(\frac{1}{h}\right)^{k+1} \mathcal{K}'\!\left(\frac{x_i - x_j}{h}\right)(y_i - y_j), \tag{2.8}$$
7 The average density is covered as the integrated squared (zero-order) density derivative in Hall and Marron (1987), who also discuss the omission of the '$i = j$' terms. While these terms can produce inferior asymptotic behavior of the statistic, Jones and Sheather (1991) discuss using the '$i = j$' terms to reduce finite-sample mean squared error. Estimation of the average density also arises in the measurement of precision of various rank estimators (Jaeckel, 1972; Jurečková, 1971).
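where $\mathcal{K}'(u) = \partial \mathcal{K}(u)/\partial u$, for $\mathcal{K}: R^k \to R$ assumed to have the same properties as discussed for Example 2.1. A related estimator $\hat d(h)$ uses $\partial \hat f_i(x_i, h)/\partial x$ as instruments in a linear regression of $y_i$ on $x_i$; namely the sample analog of
$$d_0 = \left\{E\left[\frac{\partial f(x_i)}{\partial x}\,x_i'\right]\right\}^{-1} E\left[\frac{\partial f(x_i)}{\partial x}\,y_i\right]. \tag{2.9}$$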
With an index model, $d_0$ is likewise proportional to the coefficients $\beta_0$, and the estimator $\hat d(h)$ is a ratio of density-weighted averages of the form (2.1) (see Powell, Stock, and Stoker (1989) for details). We study bandwidth choice for such ratios in Section 5.

Example 2.3 (density-weighted conditional covariances). Here $z_i = (y_i, x_i', w_i')'$, where $y_i \in R^1$ is a response, $x_i$ is a vector of predictor variables (discrete or continuous), and $w_i \in R^k$ is a set of continuous predictor variables, $w_i \sim f(w)\,dw$. The objects of estimation are density-weighted conditional covariances, given as
$$\delta_0^{xy} = E\big[f(w_i)\,(x_i - E[x_i \mid w_i])\,(y_i - E[y_i \mid w_i])\big] \tag{2.10}$$
and
$$\delta_0^{xx} = E\big[f(w_i)\,(x_i - E[x_i \mid w_i])\,(x_i - E[x_i \mid w_i])'\big]. \tag{2.11}$$
Density-weighted conditional covariances are relevant for estimation of the partially linear model
$$y_i = x_i'\beta_0 + \theta(w_i) + u_i, \qquad E[u_i \mid x_i, w_i] = 0, \tag{2.12}$$
where $\beta_0$ can be written as
$$\beta_0 = [\delta_0^{xx}]^{-1}\,\delta_0^{xy}, \tag{2.13}$$
assuming $\delta_0^{xx}$ is nonsingular and $\theta(\cdot)$ is sufficiently smooth (cf. Powell, 1987, among others). An estimator $\hat\delta^{xy}(h)$ of $\delta_0^{xy}$ is given by (2.1) with
$$p_{xy}(z_i, z_j, h) = \frac{1}{2}\,h^{-k}\,\mathcal{K}\!\left(\frac{w_i - w_j}{h}\right)(x_i - x_j)\,(y_i - y_j), \tag{2.14}$$
where again $\mathcal{K}(u)$ is a kernel function satisfying the properties discussed in Example 2.1. Analogously, $\delta_0^{xx}$ is estimated by $\hat\delta^{xx}(h)$ of (2.1), where
$$p_{xx}(z_i, z_j, h) = \frac{1}{2}\,h^{-k}\,\mathcal{K}\!\left(\frac{w_i - w_j}{h}\right)(x_i - x_j)\,(x_i - x_j)'. \tag{2.15}$$
Again, our main focus is on bandwidth choice for $\hat\delta^{xy}(h)$ and $\hat\delta^{xx}(h)$ in Section 4, with bandwidth choice for the ratio $\hat\beta(h) = [\hat\delta^{xx}(h)]^{-1}\,\hat\delta^{xy}(h)$ discussed in Section 5.
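To make the ratio estimator of Example 2.3 concrete, here is a minimal Python sketch for scalar $x$ and scalar $w$ (so $k = 1$). The function names, the standard normal kernel, the simulated design ($\theta(w) = \sin 2w$), and the bandwidth value are all assumptions chosen for illustration; none of them come from the paper.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an illustrative choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def partially_linear_beta(y, x, w, h):
    """Ratio of the pairwise statistics built from (2.14) and (2.15), scalar x and w."""
    n = len(y)
    sum_xy = 0.0
    sum_xx = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # (1/2) h^{-k} K((w_i - w_j) / h) with k = 1
            weight = 0.5 * gaussian_kernel((w[i] - w[j]) / h) / h
            dx = x[i] - x[j]
            sum_xy += weight * dx * (y[i] - y[j])
            sum_xx += weight * dx * dx
    # Both sums share the factor binom(N, 2)^{-1} of (2.1), so it cancels in the ratio.
    return sum_xy / sum_xx

def simulate_partially_linear(n, beta, seed):
    """Hypothetical design: w uniform on (-1, 1), x = w + noise, theta(w) = sin(2 w)."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x = [wi + rng.gauss(0.0, 1.0) for wi in w]
    y = [beta * xi + math.sin(2.0 * wi) + rng.gauss(0.0, 0.1) for xi, wi in zip(x, w)]
    return y, x, w

y, x, w = simulate_partially_linear(n=300, beta=2.0, seed=0)
beta_hat = partially_linear_beta(y, x, w, h=0.2)
print(beta_hat)
```

Differencing within pairs whose $w$ values are close removes most of the unknown function $\theta(\cdot)$, which is why the ratio recovers the linear coefficient without estimating $\theta$ directly.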
3. The 'pointwise' structure of density-weighted averages
At this point, we could derive the optimal bandwidth for $\hat\delta(h)$ directly, as we do in Section 4. However, we first develop the structure of $\hat\delta(h)$ by analyzing a nonparametric estimation problem that is closely associated with it. This is useful for a couple of reasons. First, the issues involved with bandwidth choice for estimating a function are well known, and provide a reasonable backdrop for discussing bandwidth choice for $\hat\delta(h)$. Second, one of our main aims is to spell out the differences between bandwidth choice for estimating functions and bandwidth choice in semiparametric procedures, such as $\hat\delta(h)$. The associated nonparametric problem provides the relevant grounds for comparison - $\hat\delta(h)$ is just the average of the associated nonparametric estimator evaluated over the data sample, and our comparative analysis is based on the difference between pointwise criteria and the averaged criterion appropriate for the performance of $\hat\delta(h)$.

Define the functions $r(z_i, h)$ and $r_0(z_i)$ as the conditional expectations 8
$$r(z_i, h) = E[p(z_i, z_j, h) \mid z_i], \tag{3.1}$$
$$r_0(z_i) = \lim_{h \to 0} r(z_i, h) = r(z_i, 0). \tag{3.2}$$
These functions are related to $\delta(h)$ and $\delta_0$ through
$$\delta(h) = E[r(z_i, h)], \tag{3.3}$$
$$\delta_0 = E[r_0(z_i)]. \tag{3.4}$$
The natural nonparametric estimator of $r_0(z_i)$ is obtained by averaging $p(z_i, z_j, h)$ over $j$, or
$$\hat r(z_i, h) = \frac{1}{N-1} \sum_{j \ne i} p(z_i, z_j, h), \tag{3.5}$$
with the density-weighted average $\hat\delta(h)$ just the sample average of this nonparametric estimator:
$$\hat\delta(h) = \frac{1}{N} \sum_{i=1}^{N} \hat r(z_i, h). \tag{3.6}$$
The 'pointwise' bandwidth of interest is the optimal bandwidth choice for the estimator $\hat r(z_i, h)$ of $r_0(z_i)$. 9
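The identity between (2.1) and the average (3.6) of the leave-one-out estimators (3.5) can be checked numerically. The sketch below is a minimal illustration; the function names and the Gaussian pair function for scalar data are assumptions made for the example.

```python
import math

def pair_function(xi, xj, h):
    """Assumed symmetric pair function: h^{-1} K((xi - xj) / h), K standard normal."""
    u = (xi - xj) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))

def r_hat(z, i, p, h):
    """Leave-one-out pointwise estimator (3.5) evaluated at observation i."""
    n = len(z)
    return sum(p(z[i], z[j], h) for j in range(n) if j != i) / (n - 1)

def delta_hat_from_r(z, p, h):
    """Density-weighted average as the sample mean (3.6) of the pointwise estimators."""
    return sum(r_hat(z, i, p, h) for i in range(len(z))) / len(z)

def delta_hat_ustat(z, p, h):
    """Density-weighted average in the U-statistic form (2.1)."""
    n = len(z)
    total = sum(p(z[i], z[j], h) for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2.0)

sample = [0.3, -1.1, 0.8, 2.0, -0.2, 0.5]
print(delta_hat_from_r(sample, pair_function, 0.4))
print(delta_hat_ustat(sample, pair_function, 0.4))
```

The two numbers agree up to floating-point rounding because each unordered pair appears twice in the double sum behind (3.6) and the symmetry of $p$ makes the two appearances equal.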
8 These functions arise from the projection of the U-statistic (2.1); cf. Hoeffding (1948).
9 For Example 2.1, the estimator $\hat r(z_i, h)$ of (3.5) is the kernel density estimator $\hat f_i(x_i, h)$ of (2.5). In this case, comparison of the optimal bandwidths for $\hat r(z_i, h)$ and for $\hat\delta(h)$ will indicate how bandwidth choice for density estimation differs from that for estimation of average density.
Further intuition can be gained from noting the role that the function $r_0(z_i)$ plays in the asymptotic theory for $\hat\delta(h)$. As outlined in Section 4, when $\hat\delta(h)$ is a $\sqrt{N}$ asymptotically normal estimator, then $2[r_0(z_i) - \delta_0]$ is the 'influence' of the $i$th observation in the asymptotic variation of $\hat\delta(h)$. 10 For Example 2.1, it is easy to verify that $r_0(z)$ equals $f(x)$; for Example 2.2, $r_0(z)$ equals $f(x) \cdot \partial g(x)/\partial x - [y - g(x)] \cdot \partial f(x)/\partial x$; and for Example 2.3, $r_0(z)$ equals $\tfrac{1}{2} f(w) \cdot (x - E[x \mid w]) \cdot (y - E[y \mid w])$. Moreover, the sample variance of $2\hat r(z_i, h)$ gives a natural estimator of the asymptotic variance of $\hat\delta(h)$, as proposed by Powell, Stock, and Stoker (1989). 11

We state our assumptions in terms of properties of $\hat r(z_i, h)$ as an estimator of $r_0(z_i)$, which are easily derived for each of our examples (under standard primitive conditions). First, we assume that the bias of $\hat r(z_i, h)$ is of polynomial order in $h$:

Assumption 1 (rate of convergence of pointwise bias of $\hat r(z_i, h)$). The function $r(z_i, h)$ satisfies
$$r(z_i, h) - r_0(z_i) = s(z_i) \cdot h^{\alpha} + s^*(z_i, h), \tag{3.7}$$
for some $\alpha > 0$, where $E[s(z_i)] \ne 0$, and the remainder term $s^*(\cdot)$ satisfies
$$E\|s^*(z_i, h)\|^2 = o(h^{2\alpha}). \tag{3.8}$$
In each of our examples, the power $\alpha$ is the order of the kernel $\mathcal{K}(u)$ - namely $\alpha = P$ - but generally, $\alpha$ depends on the structure of the kernel $p(\cdot)$ of the U-statistic. Assumption 1 clearly implies a polynomial order for the bias of $\hat\delta(h)$ for $\delta_0$: (3.3), (3.4) and (3.7) imply that
$$\delta(h) - \delta_0 = E[s(z_i)] \cdot h^{\alpha} + o(h^{\alpha}). \tag{3.9}$$
We next structure the variance of $\hat r(z_i, h)$ by assuming:

Assumption 2 (series expansion for second moment). The function $p(z_i, z_j, h)$ satisfies
$$E\big[\|p(z_i, z_j, h)\|^2 \mid z_i\big] = q(z_i) \cdot h^{-\gamma} + q^*(z_i, h), \tag{3.10}$$
for some $\gamma > 0$, where the remainder term $q^*$ satisfies
$$E\|q^*(z_i, h)\| = o(h^{-\gamma}). \tag{3.11}$$
10 While $\hat\delta(h)$ is the average of $\hat r(z_i, h)$ of (3.6), the fact that its asymptotic variance is given by $2[r_0(z_i) - \delta_0]$ is due to the many common components (overlaps) in $\hat r(z_i, h)$ for different observations $i = 1, \ldots, N$. Basic discussion of this overlap structure is given in Stoker (1992).
11 This procedure was subsequently proposed in 'linearized' form in the analysis of unweighted average derivative estimators by Härdle and Stoker (1989), among others.
In Examples 2.1, 2.2, and 2.3, the coefficient $\gamma$ of Assumption 2 takes the values $k$, $k + 2$, or $k$, respectively. Assumption 2 clearly restricts the unconditional variance of $p(\cdot)$, as
$$E\big[\|p(z_i, z_j, h)\|^2\big] = E[q(z_i)] \cdot h^{-\gamma} + o(h^{-\gamma}). \tag{3.12}$$
All our optimal bandwidth values are derived by minimizing mean squared error, employing the leading terms of squared bias and variance. For these calculations, we take $p(\cdot)$ to be a scalar function for simplicity. For cases where $p(\cdot)$ is a vector function, our derivations apply immediately to a single component of $p(\cdot)$ and of $\hat\delta(h)$, and can immediately be extended to any particular linear combination $\lambda' p(\cdot)$, by computing optimal bandwidths for the corresponding scalar statistic $\lambda'\hat\delta(h)$. 12

These assumptions allow an immediate derivation of the optimal pointwise bandwidth for the estimation of the function $r_0(z)$. For a given argument value, the pointwise mean squared error of $\hat r(z_i, h)$ is
$$\begin{aligned}
\mathrm{PMSE}[\hat r(z, h)] &= E\big\{[\hat r(z, h) - r_0(z)]^2 \mid z\big\} \\
&= (N - 1)^{-1}\,\mathrm{var}[p(z, z_j, h) \mid z] + [r(z, h) - r_0(z)]^2 \\
&= N^{-1} q(z) \cdot h^{-\gamma} + s(z)^2 \cdot h^{2\alpha} + o(1/(N h^{\gamma})) + o(h^{2\alpha}),
\end{aligned} \tag{3.13}$$
and by minimizing the first two terms, we can derive the optimal bandwidth $h^*(z)$ for estimation of $r_0(z)$ at $z = z_i$ as
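$$h^*(z) = \left[\frac{\gamma\,q(z)}{2\alpha\,s(z)^2}\right]^{1/(2\alpha+\gamma)} \left[\frac{1}{N}\right]^{1/(2\alpha+\gamma)} + o\big(N^{-1/(2\alpha+\gamma)}\big), \tag{3.14}$$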
where we have assumed that $s(z) \ne 0$. 13 The bandwidth $h^*(z)$ will vary with $z$, depending on the (local) sensitivity of bias and variance to bandwidth value, through $s(z)$ and $q(z)$. To choose a single bandwidth value for estimating $r_0(z)$ over its domain, we can minimize a global fitting criterion, such as the integrated mean squared error. For our problem, it is convenient to consider the average of
12 Likewise, we could solve for the bandwidth that minimizes $E[(\hat\delta(h) - \delta_0)' W (\hat\delta(h) - \delta_0)]$ for any positive semidefinite matrix $W$, etc. By diagonalizing the weight matrix $W$, we can rewrite this problem as the minimization of the mean squared error of a linear combination $\lambda'\hat\delta(h)$ of U-statistics, which is itself a scalar U-statistic. In the formulae below, we would replace the terms for the squared bias and variance with the corresponding quadratic form in bias and $E[(\hat\delta(h) - E(\hat\delta(h)))' W (\hat\delta(h) - E(\hat\delta(h)))]$, respectively.
13 See, for example, Silverman (1986) for a discussion of this calculation.
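the pointwise mean squared error values, or
$$\begin{aligned}
\mathrm{AMSE}[\hat r(z, h)] &= E[\hat r(z, h) - r_0(z)]^2 \\
&= (N - 1)^{-1} E\big(\mathrm{var}[p(z, z_j, h) \mid z]\big) + E[r(z, h) - r_0(z)]^2 \\
&= N^{-1} E[q(z)] \cdot h^{-\gamma} + E[s(z)^2] \cdot h^{2\alpha} + o(1/(N h^{\gamma})) + o(h^{2\alpha}),
\end{aligned} \tag{3.15}$$
where we have assumed that $E[s(z_i)^2] \ne 0$. Likewise, minimizing (the leading terms of) the AMSE gives
$$h^* = \left[\frac{\gamma\,E[q(z)]}{2\alpha\,E[s(z)^2]}\right]^{1/(2\alpha+\gamma)} \left[\frac{1}{N}\right]^{1/(2\alpha+\gamma)} + o\big(N^{-1/(2\alpha+\gamma)}\big). \tag{3.16}$$
We will use $h^*$ as the 'pointwise' optimal bandwidth in our comparisons later. It is easy to verify that $h^*$ [and $h^*(z)$ for any $z$] displays standard rates for large sample nonparametric estimation - the orders of pointwise variance and squared bias are equated as $O(1/(N (h^*)^{\gamma})) = O((h^*)^{2\alpha})$, with $h^* = O(N^{-1/(2\alpha+\gamma)})$, and the (best) average pointwise mean squared error rate is $\mathrm{AMSE}[\hat r(z_i, h^*)] = O(N^{-2\alpha/(2\alpha+\gamma)})$. 14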
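The bandwidth $h^*$ is approximated by its leading term, which we denote as $h^{**}$. For later reference, we summarize this discussion as

Proposition 3.1. Given Assumptions 1 and 2, if $h^*$ denotes the bandwidth value that minimizes $\mathrm{AMSE}[\hat r(z_i, h)]$ and
$$h^{**} = \left[\frac{\gamma\,E[q(z)]}{2\alpha\,E[s(z)^2]}\right]^{1/(2\alpha+\gamma)} \left[\frac{1}{N}\right]^{1/(2\alpha+\gamma)}, \tag{3.17}$$
then we have that $(h^{**} - h^*) = o(N^{-1/(2\alpha+\gamma)})$.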
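The leading-term bandwidth (3.17) can be checked against the two leading terms of (3.15) numerically. The following Python sketch is illustrative only: the function names are hypothetical, and the moment values passed in (for $E[q(z)]$ and $E[s(z)^2]$) are arbitrary assumed numbers rather than quantities from the paper.

```python
def pointwise_bandwidth(n, alpha, gamma, mean_q, mean_s2):
    """Leading-term pointwise bandwidth h** of (3.17)."""
    power = 1.0 / (2.0 * alpha + gamma)
    return (gamma * mean_q / (2.0 * alpha * mean_s2)) ** power * (1.0 / n) ** power

def amse_leading(h, n, alpha, gamma, mean_q, mean_s2):
    """Leading terms of (3.15): averaged variance plus averaged squared bias."""
    return mean_q * h ** (-gamma) / n + mean_s2 * h ** (2.0 * alpha)

# Arbitrary illustrative inputs (assumptions, not values from the paper).
args = dict(n=500, alpha=2.0, gamma=1.0, mean_q=0.3, mean_s2=0.05)
h_star = pointwise_bandwidth(**args)
print(h_star, amse_leading(h_star, **args))
```

Perturbing the bandwidth in either direction raises the leading-term criterion, consistent with the first-order condition that balances the derivative of the variance term against that of the squared-bias term.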
4. Analysis of the density-weighted average
4.1. Root-N asymptotic normality of the density-weighted average
The density-weighted average $\hat\delta(h)$ has substantively different statistical properties than its associated nonparametric estimator, and involves different considerations for bandwidth choice. In particular, under certain rate conditions on the bandwidth $h$, $\hat\delta(h)$ is a $\sqrt{N}$-consistent, asymptotically normal estimator of
14 The slow rate of convergence of the bandwidth $h^*$ given here is well known; see, among others, Härdle, Hall, and Marron (1988).
$\delta_0$, and the pointwise optimal bandwidth $h^*$ does not generally satisfy those conditions. 15 In order to ensure that the bias of $\hat\delta(h)$ vanishes at rate $\sqrt{N}$, or $\sqrt{N}[\delta(h) - \delta_0] = o(1)$, the bandwidth $h$ must satisfy
$$h = o\big(N^{-1/(2\alpha)}\big) \tag{4.1}$$
[from (3.9)], which is a condition not obeyed by $h^*$. The variance of $\hat\delta(h)$, as a U-statistic, vanishes at rate $N$ provided that $E[|p(z_i, z_j, h)|^2] = o(N)$ (Lemma 3.1 of Powell, Stock, and Stoker, 1989), which [from (3.12)] requires
$$h^{-1} = o\big(N^{1/\gamma}\big). \tag{4.2}$$
The expressions (4.1) and (4.2) bound the rate at which the bandwidth $h$ converges to 0, and for both to hold simultaneously, we need $2\alpha > \gamma$. Further, $\hat\delta(h)$ is $\sqrt{N}$ asymptotically normally distributed if $E[\|r_0(z_i)\|^2] < \infty$, since we can write
$$\hat\delta(h) - \delta_0 = \frac{1}{N} \sum_{i=1}^{N} 2\,[r_0(z_i) - \delta_0] + o_p(N^{-1/2}), \tag{4.3}$$
so $\sqrt{N}[\hat\delta(h) - \delta_0] \to \mathcal{N}(0, V_0)$ with $V_0 = 4 \cdot E\{[r_0(z_i) - \delta_0][r_0(z_i) - \delta_0]'\}$. 16
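Section 3 notes that the sample variance of $2\hat r(z_i, h)$ gives a natural estimator of the asymptotic variance. A minimal Python sketch of that idea for a scalar pair function follows; the function names are hypothetical, and dividing by $N$ rather than $N - 1$ in the variance is an assumption of this sketch.

```python
import math

def pair_function(xi, xj, h):
    """Assumed symmetric pair function: h^{-1} K((xi - xj) / h), K standard normal."""
    u = (xi - xj) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))

def estimate_with_variance(z, p, h):
    """Return (delta_hat, v_hat), where v_hat is the sample variance of 2 * r_hat(z_i, h)."""
    n = len(z)
    r = [sum(p(z[i], z[j], h) for j in range(n) if j != i) / (n - 1) for i in range(n)]
    delta_hat = sum(r) / n
    # Variance of 2 * r_hat equals 4 times the variance of r_hat (divisor n is an assumption).
    v_hat = 4.0 * sum((ri - delta_hat) ** 2 for ri in r) / n
    return delta_hat, v_hat

sample = [0.3, -1.1, 0.8, 2.0, -0.2, 0.5, 1.4, -0.7]
delta_hat, v_hat = estimate_with_variance(sample, pair_function, 0.5)
print(delta_hat, math.sqrt(v_hat / len(sample)))  # point estimate and a rough standard error
```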
4.2. The optimal bandwidth for the density-weighted average
The optimal bandwidth for computing $\hat\delta(h)$ is computed by minimizing the leading terms of its mean squared error. Since $\hat\delta(h)$ is a U-statistic, its finite-sample variance is found using the standard formulation in Serfling (1980, Sect. 5.5). With $p(\cdot)$ a scalar function as before, the finite-sample variance of $\hat\delta(h)$ is
$$\begin{aligned}
\mathrm{var}[\hat\delta(h)] &= \binom{N}{2}^{-1} \big[2(N - 2)\,\mathrm{var}[r(z_i, h)] + \mathrm{var}[p(z_i, z_j, h)]\big] \\
&= 4N^{-1}\,\mathrm{var}[r(z_i, h)] + 2N^{-2}\,E[p(z_i, z_j, h)^2] + o(N^{-2}).
\end{aligned} \tag{4.4}$$
Consequently, we require a characterization for the variance of the conditional expectation $r(z_i, h) = E[p(z_i, z_j, h) \mid z_i]$. From Assumption 1 it follows that
$$\mathrm{var}[r(z_i, h)] = \mathrm{var}[r_0(z_i)] + C_0 \cdot h^{\alpha} + o(h^{\alpha}), \tag{4.5}$$
where $C_0 = 2\,\mathrm{cov}[r_0(z_i), s(z_i)]$.
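The expansion (4.5) follows by substituting (3.7) into the variance. A brief sketch of the step, assuming $E[s(z_i)^2] < \infty$ and $\mathrm{var}[r_0(z_i)] < \infty$ (as implicitly required by (3.15) and (4.3)) and using the Cauchy-Schwarz inequality together with (3.8):

```latex
\begin{aligned}
\mathrm{var}[r(z_i,h)]
  &= \mathrm{var}\bigl[r_0(z_i) + s(z_i)\,h^{\alpha} + s^*(z_i,h)\bigr] \\
  &= \mathrm{var}[r_0(z_i)] + 2\,\mathrm{cov}[r_0(z_i), s(z_i)]\,h^{\alpha}
     + \underbrace{\mathrm{var}[s(z_i)]\,h^{2\alpha}}_{O(h^{2\alpha})}
     + \underbrace{2\,\mathrm{cov}\bigl[r_0(z_i) + s(z_i)h^{\alpha},\, s^*(z_i,h)\bigr]}_{o(h^{\alpha})}
     + \underbrace{\mathrm{var}[s^*(z_i,h)]}_{o(h^{2\alpha})}, \\
\bigl|\mathrm{cov}[r_0(z_i), s^*(z_i,h)]\bigr|
  &\le \bigl\{\mathrm{var}[r_0(z_i)]\; E\|s^*(z_i,h)\|^2\bigr\}^{1/2} = o(h^{\alpha}).
\end{aligned}
```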
15 Powell, Stock, and Stoker (1989) give a rigorous derivation of the following conditions for Example 2.2, and Powell (1987) covers Example 2.3.
16 This confirms our earlier assertion that $2[r_0(z_i) - \delta_0]$ represents the influence of the $i$th observation in the asymptotic variance of $\hat\delta(h)$.
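We can now formulate the mean squared error of $\hat\delta(h)$ for $\delta_0$. By combining (3.9), (3.12), (4.4), and (4.5), we have
$$\begin{aligned}
\mathrm{MSE}[\hat\delta(h)] &= [\delta(h) - \delta_0]^2 + \mathrm{var}[\hat\delta(h)] \\
&= \{E[s(z_i)]\}^2 h^{2\alpha} + 4N^{-1}\,\mathrm{var}[r_0(z_i)] + 4N^{-1} C_0 h^{\alpha} + 2N^{-2} E[q(z_i)]\,h^{-\gamma} \\
&\quad + o(h^{2\alpha}) + o(h^{\alpha}/N) + o(1/(N^2 h^{\gamma})) + o(N^{-2}).
\end{aligned} \tag{4.6}$$
For characterizing the optimal bandwidth, we subtract the variance term $4N^{-1}\,\mathrm{var}[r_0(z_i)]$ because it does not vary with the bandwidth $h$, writing (net) mean squared error as
$$\begin{aligned}
\mathrm{MSE}[\hat\delta(h)] - 4N^{-1}\,\mathrm{var}[r_0(z_i)] &= \underbrace{\{E[s(z_i)]\}^2 h^{2\alpha}}_{T_1} + \underbrace{4N^{-1} C_0 h^{\alpha}}_{T_2} + \underbrace{2N^{-2} E[q(z_i)]\,h^{-\gamma}}_{T_3} \\
&\quad + \big[o(h^{2\alpha}) + o(h^{\alpha}/N) + o(1/(N^2 h^{\gamma})) + o(N^{-2})\big] \\
&= T_1 + T_2 + T_3 + R.
\end{aligned} \tag{4.7}$$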
It is clear that $T_1$ and $T_2$ are increasing in $h$, while $T_3$ is decreasing in $h$. At a minimizing bandwidth sequence $h^+$, the term $T_1$ must be of larger order than $T_2$, since if $T_2$ were of larger order, the bandwidth which equates orders of squared bias and variance would be $O(N^{-1/(\alpha+\gamma)})$, which would imply $T_1$ is $O(N^{-2\alpha/(\alpha+\gamma)})$, which is of greater order than $T_2$, which would be $O(N^{-(2\alpha+\gamma)/(\alpha+\gamma)})$. Therefore, minimizing on the basis of the leading terms $T_1$ and $T_3$ gives the optimal bandwidth as
$$h^+ = \left[\frac{\gamma\,E[q(z)]}{\alpha\,\{E[s(z)]\}^2}\right]^{1/(2\alpha+\gamma)} \left[\frac{1}{N}\right]^{2/(2\alpha+\gamma)} + o\big(N^{-2/(2\alpha+\gamma)}\big). \tag{4.8}$$
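For completeness, the first-order condition behind (4.8) can be written out. This sketch simply differentiates the sum of the two leading terms, squared bias $\{E[s(z)]\}^2 h^{2\alpha}$ and variance $2N^{-2}E[q(z)]h^{-\gamma}$, with respect to $h$ and solves:

```latex
\begin{aligned}
0 &= \frac{\partial}{\partial h}\Bigl[\{E[s(z)]\}^2 h^{2\alpha} + 2N^{-2}E[q(z)]\,h^{-\gamma}\Bigr]
   = 2\alpha\,\{E[s(z)]\}^2 h^{2\alpha-1} - 2\gamma\,N^{-2}E[q(z)]\,h^{-\gamma-1}, \\
h^{2\alpha+\gamma} &= \frac{\gamma\,E[q(z)]}{\alpha\,\{E[s(z)]\}^2}\cdot\frac{1}{N^{2}}
\quad\Longrightarrow\quad
h = \left[\frac{\gamma\,E[q(z)]}{\alpha\,\{E[s(z)]\}^2}\right]^{1/(2\alpha+\gamma)} N^{-2/(2\alpha+\gamma)}.
\end{aligned}
```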
This formula captures how $h^+$ equates the leading orders of variance and squared bias - setting $O(1/(N^2 h^{\gamma})) = O(h^{2\alpha})$ gives $h^+ = O(N^{-2/(2\alpha+\gamma)})$. Moreover, the (net) mean squared error of $\hat\delta(h^+)$ vanishes at rate
$$\mathrm{MSE}[\hat\delta(h^+)] - 4N^{-1}\,\mathrm{var}[r_0(z_i)] = O\big(N^{-4\alpha/(2\alpha+\gamma)}\big). \tag{4.9}$$
Under the conditions for asymptotic normality, namely $2\alpha > \gamma$, the right-hand remainder term is $o(N^{-1})$, but necessarily greater than $O(N^{-2})$. 17
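The rate statements above reduce to comparisons of exponents, which can be verified with exact rational arithmetic. In this hedged sketch a bandwidth is written as $h = O(N^{-\kappa})$; conditions (4.1) and (4.2) then read $\kappa > 1/(2\alpha)$ and $\kappa < 1/\gamma$. The function name `rate_summary` and the dictionary keys are hypothetical.

```python
from fractions import Fraction

def rate_summary(alpha, gamma):
    """Compare bandwidth exponents kappa (h = O(N^{-kappa})) with conditions (4.1)-(4.2)."""
    alpha, gamma = Fraction(alpha), Fraction(gamma)
    pointwise = 1 / (2 * alpha + gamma)   # exponent of h* from (3.16)
    optimal = 2 / (2 * alpha + gamma)     # exponent of h+ from (4.8)
    lower = 1 / (2 * alpha)               # (4.1) requires kappa > lower
    upper = 1 / gamma                     # (4.2) requires kappa < upper
    return {
        "pointwise_ok": lower < pointwise < upper,
        "optimal_ok": lower < optimal < upper,
        "net_mse_exponent": 4 * alpha / (2 * alpha + gamma),  # rate in (4.9)
    }

print(rate_summary(2, 1))  # second-order kernel, gamma = 1
print(rate_summary(2, 5))  # a case with 2 * alpha < gamma
```

The pointwise exponent never exceeds $1/(2\alpha)$, so $h^*$ always violates (4.1), while the exponent of $h^+$ lies strictly between the two bounds exactly when $2\alpha > \gamma$; in that case the exponent in (4.9) lies strictly between 1 and 2.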
17 Note that this result holds even if the conditions for $\sqrt{N}$ consistency of $\hat\delta(h)$ do not hold. In particular, in our examples, the condition $2\alpha > \gamma$ requires the use of a higher-order kernel $\mathcal{K}(\cdot)$. If, on grounds of superior finite-sample performance of the estimator, we took $\mathcal{K}(\cdot)$ as a standard positive density function, then our bandwidth analysis will still apply. This is relevant if a positive kernel $\mathcal{K}(\cdot)$ is used in any of our examples when $k > 3$ (or $k > 1$ in Example 2.2).
As before, we can approximate $h^+$ by its leading term, which we denote as $h^{++}$. We can summarize this discussion as:

Proposition 4.1. Given Assumptions 1 and 2, if $h^+$ denotes the bandwidth value that minimizes $\mathrm{MSE}[\hat\delta(h)]$ and
$$h^{++} = \left[\frac{\gamma\,E[q(z)]}{\alpha\,\{E[s(z)]\}^2}\right]^{1/(2\alpha+\gamma)} \left[\frac{1}{N}\right]^{2/(2\alpha+\gamma)}, \tag{4.10}$$
then we have that $(h^{++} - h^+) = o(N^{-2/(2\alpha+\gamma)})$.

4.3. Interpretation and examples
Proposition 4.1 gives the main result of the paper. We have developed the structure of density-weighted averages in some detail, to facilitate comparing the optimal bandwidth $h^+$ with the pointwise optimal bandwidth $h^*$. A quick comparison of (3.16) with (4.8) indicates that the bandwidth $h^+$ shrinks to 0 at twice the rate of the pointwise optimal bandwidth $h^*$. This feature of $h^+$ is referred to as 'asymptotic undersmoothing', and occurs because pointwise squared bias must decrease at a faster rate than pointwise variance when estimating $\delta_0$. Therefore $h^+$ must become smaller than $h^*$ for sufficiently large samples. However, one cannot conclude that $h^+$ is smaller than $h^*$ in any particular application, or for a given sample size, because of differences in the leading constants of $h^+$ and $h^*$. We can spell this out by comparing the approximations $h^{++}$ and $h^{**}$. In particular, we have that $h^{++} = A_N \cdot B \cdot h^{**}$,
(4.11)
where
$$A_N = \left[\frac{2}{N}\right]^{1/(2\alpha+\gamma)} \tag{4.12}$$
is an adjustment factor for sample size, and
$$B = \left[\frac{E[s(z)^2]}{\{E[s(z)]\}^2}\right]^{1/(2\alpha+\gamma)} \tag{4.13}$$
is an adjustment factor for the structure of the pointwise bias. The adjustment term $A_N$ for sample size is less than 1 (for $N > 2$) and decreases to 0 as $N \to \infty$, reflecting the different rates of convergence discussed above. The adjustment term $B$ for the structure of bias does not vary with sample size and is greater than 1 unless $s(z)$ is a constant function; in particular,
$B^{2\alpha+\gamma}$ is 1 plus the squared coefficient of variation of $s(z_i)$. The factor $B$ arises because the average pointwise mean squared error depends on the variance of the pointwise bias of $\hat r(z, h)$, whereas the bias of $\hat\delta(h)$ depends only on the mean of the pointwise bias. Variation in the pointwise bias dictates a smaller pointwise bandwidth $h^{**}$ relative to the bandwidth $h^{++}$. Whether the adjustment for sample size is larger or smaller than the adjustment for bias structure depends on the particular application. To get a clearer notion of the sizes of the optimal bandwidths and these two effects, we compute bandwidths for Examples 2.1 and 2.2 based on normal designs.
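The decomposition (4.11) of $h^{++}$ into a size factor, a bias factor, and the pointwise bandwidth $h^{**}$ can be confirmed numerically. In the sketch below the function names are hypothetical and the moment inputs (values standing in for $E[q(z)]$, $E[s(z)]$, and $E[s(z)^2]$) are arbitrary assumed numbers.

```python
def h_pointwise(n, alpha, gamma, mean_q, mean_s2):
    """h** of (3.17)."""
    power = 1.0 / (2.0 * alpha + gamma)
    return (gamma * mean_q / (2.0 * alpha * mean_s2)) ** power * n ** (-power)

def h_average(n, alpha, gamma, mean_q, mean_s):
    """h++ of (4.10)."""
    power = 1.0 / (2.0 * alpha + gamma)
    return (gamma * mean_q / (alpha * mean_s ** 2)) ** power * n ** (-2.0 * power)

def size_factor(n, alpha, gamma):
    """A_N of (4.12)."""
    return (2.0 / n) ** (1.0 / (2.0 * alpha + gamma))

def bias_factor(alpha, gamma, mean_s, mean_s2):
    """B of (4.13)."""
    return (mean_s2 / mean_s ** 2) ** (1.0 / (2.0 * alpha + gamma))

# Arbitrary illustrative moments (assumptions); mean_s2 >= mean_s**2 must hold.
n, alpha, gamma = 200, 2.0, 3.0
mean_q, mean_s, mean_s2 = 0.4, -0.1, 0.03
lhs = h_average(n, alpha, gamma, mean_q, mean_s)
rhs = (size_factor(n, alpha, gamma) * bias_factor(alpha, gamma, mean_s, mean_s2)
       * h_pointwise(n, alpha, gamma, mean_q, mean_s2))
print(lhs, rhs)
```

Since $E[s(z)^2] \ge \{E[s(z)]\}^2$, the bias factor is at least 1, while the size factor falls below 1 once $N > 2$; which effect dominates depends on the inputs.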
4.3.1. Bandwidths for average density with normal variables
Recall from Example 2.1 that $z_i = x_i$ here, where $x_i \sim f(x)\,dx$, and the object of estimation is the average density value $\delta_0 = E[f(x_i)]$. Moreover, $\hat r(x_i, h)$ is the kernel density estimator $\hat f_i(x_i, h)$ of (2.5), and $r_0(x)$ is the density $f(x)$. For simplicity, we denote partial derivatives of $f$ using subscripts: $f_j = \partial f/\partial x_j$, $f_{jl} = \partial^2 f/\partial x_j \partial x_l$, etc., where each function is evaluated at $x$ unless another argument value is indicated. Here $r(x, h) - r_0(x)$ is given from the familiar expression for bias of the kernel density estimator:
$$r(x, h) - r_0(x) = E\left[h^{-k}\,\mathcal{K}\!\left(\frac{x - x_j}{h}\right)\right] - f(x) = \int \mathcal{K}(u)\,[f(x + hu) - f(x)]\,du. \tag{4.14}$$
We take $\mathcal{K}(u)$ to be a positive density function, with $\int u\,\mathcal{K}(u)\,du = 0$ and $P = 2$, so that $\alpha = 2$; in particular,
$$s(x_i) = \frac{1}{2} \sum_{j} \sum_{l} \left[\int u_j u_l\,\mathcal{K}(u)\,du\right] f_{jl}(x_i). \tag{4.15}$$
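The second-order bias expansion can be checked exactly in one dimension when both the density and the kernel are standard normal: the smoothed density $r(x, h)$ is then the $N(0, 1 + h^2)$ density, and (4.15) reduces to $s(x) = \tfrac12 f''(x)$. The Python sketch below uses this special case, which is an assumption chosen for tractability, and hypothetical function names.

```python
import math

def normal_pdf(x, var=1.0):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def smoothed_density(x, h):
    """r(x, h) for f = N(0, 1) and a N(0, 1) kernel: the N(0, 1 + h^2) density."""
    return normal_pdf(x, 1.0 + h * h)

def s_leading(x):
    """Leading bias coefficient for k = 1: 0.5 * f''(x), with f''(x) = (x^2 - 1) f(x)."""
    return 0.5 * (x * x - 1.0) * normal_pdf(x)

h = 0.01
for x in (0.0, 0.5, 2.0):
    scaled_bias = (smoothed_density(x, h) - normal_pdf(x)) / h ** 2
    print(x, scaled_bias, s_leading(x))
```

For small $h$ the scaled bias $(r(x,h) - f(x))/h^2$ agrees closely with $s(x)$, and it changes sign with the curvature of $f$, which is the source of the variation in pointwise bias that the bias factor $B$ measures.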
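For the variance term, we have that
$$\begin{aligned}
E[p(x_i, x_j, h)^2 \mid x_i] &= E\left[h^{-2k}\,\mathcal{K}\!\left(\frac{x_i - x_j}{h}\right)^2 \,\Big|\, x_i\right] = h^{-k} \int \mathcal{K}(u)^2 f(x_i + hu)\,du \\
&= h^{-k} \left[\int \mathcal{K}(u)^2\,du\right] f(x_i) + o(h^{-k}),
\end{aligned} \tag{4.16}$$
so that
$$q(x_i) = \left[\int \mathcal{K}(u)^2\,du\right] f(x_i) \quad \text{and} \quad \gamma = k. \tag{4.17}$$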
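To compute bandwidth values, we specialize the general formulae to the case where $f(x)$ is the spherical normal, $\mathcal{N}(0, I)$, density, and $\mathcal{K}(u)$ is likewise chosen to be the $\mathcal{N}(0, I)$ density. These specifications imply first that

$$s(x_i) = \frac12 \sum_j f_{jj}(x_i). \tag{4.18}$$

If we define $c_k = (2\pi)^{-k/2}$, then some tedious arithmetic yields

$$E[s(x_i)]^2 = k^2 c_k^2 \left(\tfrac12\right)^{k+4}, \tag{4.19}$$

$$E[s(x_i)^2] = c_k^2\,k\left(k - \tfrac23\right)\left(\tfrac13\right)^{k/2+1}, \tag{4.20}$$

$$E[q(x_i)] = c_k^2 \left(\tfrac12\right)^{k}, \tag{4.21}$$

and $2\alpha + \gamma = k + 4$. Therefore, the approximate pointwise optimal bandwidth (3.17) is

$$h^{**} = \left[\frac{(1/2)^{k}}{4\,(k - 2/3)\,(1/3)^{k/2+1}}\right]^{1/(k+4)} \left(\frac{1}{N}\right)^{1/(k+4)}. \tag{4.22}$$

The bias factor (4.13) is

$$B = \left[\frac{(k - 2/3)\,(1/3)^{k/2+1}}{k\,(1/2)^{k+4}}\right]^{1/(k+4)}, \tag{4.23}$$

and the size factor (4.12) is

$$A_N = \left(\frac{2}{N}\right)^{1/(k+4)}. \tag{4.24}$$

Consequently, the approximate optimal bandwidth (4.10) for estimating the average density is

$$h^{++} = A_N \cdot B \cdot h^{**} = \left(\frac{8}{k}\right)^{1/(k+4)} \left(\frac{1}{N}\right)^{2/(k+4)}. \tag{4.25}$$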
Table 1 contains computed bandwidth values and bias and size factors for various dimensions $k$ and sample sizes $N$. In terms of estimating average density versus estimating the density function, here the size factor always outweighs the bias factor, so that a smaller bandwidth should be used for estimating the average density. For increases in sample size, all bandwidths shrink, with the size effect much more pronounced in lower-dimensional problems. Interestingly, for high dimension and low sample size, the optimal bandwidths for the pointwise and average density problems are nearly equal; also, for $k = 1$ and small $N$ the pointwise bandwidths are larger than for $k = 2$, then increase monotonically in $k$.
Table 1
Optimal bandwidths for estimating density and average density
Spherical normal regressors: $f(x) = \mathcal{N}(0, I)$ density
Spherical normal kernel: $\mathcal{K}(u) = \mathcal{N}(0, I)$ density

Dimension k              1      2      3      4      5     10

Bias factor          1.155  1.296  1.303  1.295  1.284  1.243

N = 50
  Bandwidth (PW)     0.523  0.451  0.457  0.474  0.492  0.570
  Size factor        0.525  0.585  0.631  0.669  0.699  0.795
  Bandwidth (Ave)    0.317  0.342  0.376  0.410  0.442  0.563

N = 100
  Bandwidth (PW)     0.455  0.402  0.414  0.434  0.455  0.542
  Size factor        0.457  0.521  0.572  0.613  0.647  0.756
  Bandwidth (Ave)    0.240  0.271  0.309  0.345  0.379  0.510

N = 500
  Bandwidth (PW)     0.330  0.307  0.329  0.355  0.381  0.483
  Size factor        0.331  0.398  0.454  0.501  0.541  0.674
  Bandwidth (Ave)    0.126  0.159  0.195  0.231  0.265  0.405

N = 1000
  Bandwidth (PW)     0.287  0.274  0.298  0.326  0.353  0.460
  Size factor        0.289  0.355  0.412  0.460  0.501  0.642
  Bandwidth (Ave)    0.096  0.126  0.160  0.194  0.227  0.367

N = 5000
  Bandwidth (PW)     0.208  0.209  0.237  0.266  0.295  0.410
  Size factor        0.209  0.271  0.327  0.376  0.419  0.572
  Bandwidth (Ave)    0.050  0.074  0.101  0.130  0.159  0.292

N = 10000
  Bandwidth (PW)     0.181  0.187  0.214  0.244  0.273  0.390
  Size factor        0.182  0.242  0.296  0.345  0.388  0.544
  Bandwidth (Ave)    0.038  0.058  0.083  0.109  0.136  0.264

N = 100000
  Bandwidth (PW)     0.114  0.127  0.154  0.183  0.211  0.331
  Size factor        0.115  0.165  0.213  0.259  0.301  0.462
  Bandwidth (Ave)    0.015  0.027  0.043  0.061  0.082  0.190
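The entries of Table 1 can be recomputed from the moment formulae for the normal design. The sketch below is an illustration under stated assumptions rather than the authors' program: it takes $E[s]^2 = k^2 c_k^2 (1/2)^{k+4}$, $E[s^2] = c_k^2 k(k - 2/3)(1/3)^{k/2+1}$ and $E[q] = c_k^2 (1/2)^k$ (the common factor $c_k^2$ cancels), and it assumes the pointwise rule has the form $h^{**} = [\gamma E[q] / (2\alpha E[s^2] N)]^{1/(2\alpha+\gamma)}$ with size factor $(2/N)^{1/(2\alpha+\gamma)}$. The function name `table1_row` is hypothetical.

```python
def table1_row(k, n):
    """Return (pointwise bandwidth, size factor, bias factor, average-density bandwidth)."""
    alpha, gamma = 2, k                      # bias order and variance exponent, Example 2.1
    e_s_sq = k ** 2 * 0.5 ** (k + 4)         # (E[s])^2 divided by c_k^2
    e_s2 = k * (k - 2.0 / 3.0) * (1.0 / 3.0) ** (k / 2 + 1)  # E[s^2] divided by c_k^2
    e_q = 0.5 ** k                           # E[q] divided by c_k^2
    p = 1.0 / (2 * alpha + gamma)
    h_pw = (gamma * e_q / (2 * alpha * e_s2 * n)) ** p  # assumed form of the pointwise rule
    size = (2.0 / n) ** p                    # assumed form of the size factor A_N
    bias = (e_s2 / e_s_sq) ** p              # bias factor B
    return h_pw, size, bias, size * bias * h_pw


if __name__ == "__main__":
    for k in (1, 2, 3, 4, 5, 10):
        print(k, ["%.3f" % v for v in table1_row(k, 100)])
```

The product of the three factors collapses algebraically to $(8/k)^{1/(k+4)} N^{-2/(k+4)}$, which gives a quick consistency check on any single table entry.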
4.3.2. Bandwidths for density-weighted average derivatives: linear model with normal regressors

Recall from Example 2.2 that $z_i = (y_i, x_i')'$ here, where $x_i \sim f(x)\,dx$ and $E(y_i \mid x_i) = g(x_i)$. We compute bandwidth values for estimation of the first component of the density-weighted average derivative, $\delta_{01} = E[f(x_i)\,g_1(x_i)]$, where, as above, $g_1 \equiv \partial g/\partial x_1$. We denote the first component of $\mathcal{K}' \equiv \partial\mathcal{K}/\partial u$ as $\mathcal{K}_1 \equiv \partial\mathcal{K}/\partial u_1$. The function $r(\cdot,h)$ is computed directly as the conditional expectation of the first component of $p(\cdot)$ of (2.8), as

$$r(z,h) = E[p(z,z_j,h) \mid z] = -\int \mathcal{K}(u)\,y\,f_1(x+hu)\,du + \int \mathcal{K}(u)\,[g_1(x+hu)\,f(x+hu) + g(x+hu)\,f_1(x+hu)]\,du, \tag{4.26}$$

where the latter equality employs integration by parts. As above, we have $\alpha = 2$, with $s(z_i) = \tfrac12\,\partial^2 r/\partial h^2\,|_{h=0}$. Therefore

$$s(z_i) = \frac12 \sum_j \sum_l \int u_j u_l\,\mathcal{K}(u)\,du \cdot \Bigl\{ -y_i f_{1jl}(x_i) + g(x_i) f_{1jl}(x_i) + g_1(x_i) f_{jl}(x_i) + g_j(x_i) f_{1l}(x_i) + g_l(x_i) f_{1j}(x_i) + g_{jl}(x_i) f_1(x_i) + g_{1j}(x_i) f_l(x_i) + g_{1l}(x_i) f_j(x_i) + g_{1jl}(x_i) f(x_i) \Bigr\}. \tag{4.27}$$

The variance term is computed similarly as

$$q(z_i) = \int \mathcal{K}_1(u)^2\,du \cdot [y_i^2 - 2 y_i\,g(x_i) + E(y_i^2 \mid x_i)], \tag{4.28}$$

where $\gamma = k + 2$. To compute bandwidth values, we again specialize the general formulae to the case where $f(x)$ is the spherical normal, $\mathcal{N}(0, I)$, density, and $\mathcal{K}(u)$ is likewise the $\mathcal{N}(0, I)$ density. We assume the true model is linear,

$$y_i = \sum_{j=1}^{k} x_{ji} + \varepsilon_i, \tag{4.29}$$

where $\varepsilon_i$ is univariate normal with mean 0 and variance $\sigma_\varepsilon^2$, and independent of $x$. This allows $s(z_i)$ to be simplified as

$$s(z_i) = \frac12 \sum_j \bigl[ -\varepsilon_i f_{1jj}(x_i) + f_{jj}(x_i) + 2 f_{1j}(x_i) \bigr]. \tag{4.30}$$

Considerably more tedious algebra gives

$$E[s(z_i)]^2 = k^2 c_k^2 \left(\tfrac12\right)^{k+4}, \tag{4.31}$$

$$E[s(z_i)^2] = c_k^2 \left(\tfrac13\right)^{k/2+2} \left[ 3k^2 - 1.25k + 7.25 + \sigma_\varepsilon^2 \left( \frac{k^2}{3} + \frac{k}{6} + 3 \right) \right], \tag{4.32}$$

$$E[q(z_i)] = c_k \left(\tfrac12\right)^{k/2+1} \sigma_\varepsilon^2, \tag{4.33}$$

and $2\alpha + \gamma = k + 6$. The approximate pointwise bandwidth $h^{**}$, the bias factor $B$, the size factor $A_N$, and the approximate optimal bandwidth $h^{++}$ are computed from these terms as above. Tables 2a and 2b contain computed bandwidth values
Table 2a
Optimal bandwidths for density-weighted average derivatives: $R^2 = 0.80$
Spherical normal regressors: $f(x) = \mathcal{N}(0, I)$ density
Spherical normal kernel: $\mathcal{K}(u) = \mathcal{N}(0, I)$ density
Linear model: $y_i = \sum_{j=1}^{k} x_{ji} + \varepsilon_i$, $i = 1, \ldots, N$, $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $\sigma_\varepsilon^2 = k[(1 - R^2)/R^2]$

Dimension k              1      2      3      4      5     10

Bias factor          1.537  1.354  1.302  1.279  1.265  1.235

N = 50
  Bandwidth (PW)     0.472  0.621  0.743  0.852  0.953  1.355
  Size factor        0.631  0.669  0.699  0.725  0.746  0.818
  Bandwidth (Ave)    0.458  0.563  0.677  0.790  0.900  1.368

N = 100
  Bandwidth (PW)     0.428  0.570  0.688  0.795  0.895  1.297
  Size factor        0.572  0.613  0.647  0.676  0.701  0.783
  Bandwidth (Ave)    0.376  0.473  0.580  0.688  0.793  1.254

N = 500
  Bandwidth (PW)     0.340  0.466  0.575  0.677  0.773  1.173
  Size factor        0.454  0.501  0.541  0.576  0.605  0.708
  Bandwidth (Ave)    0.237  0.316  0.406  0.498  0.592  1.026

N = 1000
  Bandwidth (PW)     0.308  0.427  0.533  0.632  0.726  1.123
  Size factor        0.412  0.460  0.501  0.537  0.568  0.678
  Bandwidth (Ave)    0.195  0.266  0.348  0.434  0.522  0.941

N = 5000
  Bandwidth (PW)     0.245  0.349  0.445  0.538  0.627  1.016
  Size factor        0.327  0.376  0.419  0.457  0.491  0.613
  Bandwidth (Ave)    0.123  0.178  0.243  0.315  0.390  0.769

N = 10000
  Bandwidth (PW)     0.222  0.320  0.412  0.502  0.589  0.973
  Size factor        0.296  0.345  0.388  0.427  0.461  0.587
  Bandwidth (Ave)    0.101  0.150  0.208  0.274  0.343  0.705

N = 100000
  Bandwidth (PW)     0.159  0.240  0.319  0.399  0.478  0.842
  Size factor        0.213  0.259  0.301  0.339  0.374  0.509
  Bandwidth (Ave)    0.052  0.084  0.125  0.173  0.226  0.529
and bias and size factors for various dimensions $k$ and sample sizes $N$. To make the values comparable across dimensions, the linear model (4.29) is assumed to have a constant value of $R^2$, so that the variance of $\varepsilon_i$ varies with $k$ as

$$\sigma_\varepsilon^2 = k\left[\frac{1 - R^2}{R^2}\right]. \tag{4.34}$$
Table 2b
Optimal bandwidths for density-weighted average derivatives: $R^2 = 0.20$
Spherical normal regressors: $f(x) = \mathcal{N}(0, I)$ density
Spherical normal kernel: $\mathcal{K}(u) = \mathcal{N}(0, I)$ density
Linear model: $y_i = \sum_{j=1}^{k} x_{ji} + \varepsilon_i$, $i = 1, \ldots, N$, $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $\sigma_\varepsilon^2 = k[(1 - R^2)/R^2]$

Dimension k              1      2      3      4      5     10

Bias factor          1.734  1.543  1.475  1.440  1.418  1.359

N = 50
  Bandwidth (PW)     0.622  0.771  0.893  0.999  1.094  1.463
  Size factor        0.631  0.669  0.699  0.725  0.746  0.818
  Bandwidth (Ave)    0.681  0.796  0.921  1.042  1.158  1.627

N = 100
  Bandwidth (PW)     0.563  0.707  0.827  0.932  1.027  1.401
  Size factor        0.572  0.613  0.647  0.676  0.701  0.783
  Bandwidth (Ave)    0.559  0.669  0.789  0.908  1.021  1.492

N = 500
  Bandwidth (PW)     0.448  0.579  0.691  0.793  0.887  1.267
  Size factor        0.454  0.501  0.541  0.576  0.605  0.708
  Bandwidth (Ave)    0.353  0.448  0.552  0.658  0.762  1.220

N = 1000
  Bandwidth (PW)     0.405  0.531  0.640  0.740  0.833  1.214
  Size factor        0.412  0.460  0.501  0.537  0.568  0.678
  Bandwidth (Ave)    0.289  0.376  0.473  0.573  0.672  1.119

N = 5000
  Bandwidth (PW)     0.322  0.434  0.535  0.630  0.720  1.097
  Size factor        0.327  0.376  0.419  0.457  0.491  0.613
  Bandwidth (Ave)    0.183  0.252  0.331  0.415  0.501  0.915

N = 10000
  Bandwidth (PW)     0.292  0.398  0.495  0.588  0.676  1.051
  Size factor        0.296  0.345  0.388  0.427  0.461  0.587
  Bandwidth (Ave)    0.150  0.212  0.284  0.361  0.442  0.839

N = 100000
  Bandwidth (PW)     0.210  0.298  0.384  0.467  0.548  0.910
  Size factor        0.213  0.259  0.301  0.339  0.374  0.509
  Bandwidth (Ave)    0.078  0.119  0.170  0.228  0.291  0.629
Table 2a gives values for $R^2 = 0.80$, and Table 2b gives values for $R^2 = 0.20$. These bandwidth values are qualitatively similar to those in Table 1, but there are some notable differences. As before, the sample size effect is much more pronounced in low dimensions. The bandwidths for estimating average derivatives are generally larger than those for estimating average density, and increase as the $R^2$ value decreases. The bias factor is more evident for average derivatives as well: for instance, the bandwidths for estimating average derivatives are larger than the pointwise bandwidths for smaller sample sizes in Table 2b. Finally, for $R^2 = 0.20$, several of the optimal 'averaged' bandwidths actually exceed their pointwise counterparts, though this possibility vanishes as $N$ increases relative to $k$.
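The entries of Tables 2a and 2b can be recomputed in the same way as those of Table 1. The sketch below is illustrative only. It assumes the moment expressions $E[s]^2 = k^2 c_k^2 (1/2)^{k+4}$, $E[s^2] = c_k^2 (1/3)^{k/2+2}[3k^2 - 1.25k + 7.25 + \sigma_\varepsilon^2(k^2/3 + k/6 + 3)]$ and $E[q] = c_k (1/2)^{k/2+1}\sigma_\varepsilon^2$, with $\sigma_\varepsilon^2 = k(1 - R^2)/R^2$; these coefficients are read from a partly illegible source and were checked against the tabulated values, so they should be treated as assumptions. The pointwise rule and size factor take the same assumed forms as for Table 1, and the function name `table2_row` is hypothetical.

```python
import math


def table2_row(k, n, r2):
    """Return (pointwise bandwidth, size factor, bias factor, average-derivative bandwidth)."""
    sigma2 = k * (1.0 - r2) / r2                       # error variance implied by a fixed R^2
    c_k = (2.0 * math.pi) ** (-k / 2.0)
    e_s_sq = k ** 2 * c_k ** 2 * 0.5 ** (k + 4)        # (E[s])^2, assumed coefficients
    e_s2 = c_k ** 2 * (1.0 / 3.0) ** (k / 2 + 2) * (
        3 * k ** 2 - 1.25 * k + 7.25 + sigma2 * (k ** 2 / 3 + k / 6 + 3)
    )                                                  # E[s^2], assumed coefficients
    e_q = c_k * 0.5 ** (k / 2 + 1) * sigma2            # E[q], assumed coefficients
    alpha, gamma = 2, k + 2
    p = 1.0 / (2 * alpha + gamma)
    h_pw = (gamma * e_q / (2 * alpha * e_s2 * n)) ** p  # assumed form of the pointwise rule
    size = (2.0 / n) ** p
    bias = (e_s2 / e_s_sq) ** p
    return h_pw, size, bias, size * bias * h_pw
```

For example, `table2_row(1, 50, 0.8)` corresponds to the first column of the $N = 50$ block of Table 2a, and `table2_row(10, 50, 0.2)` to the last column of the $N = 50$ block of Table 2b.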
4.4. A 'plug-in' estimator of the optimal bandwidth

One approach for approximating the optimal bandwidth $h^{+}$ is to make use of cross-validation or another data-based method for estimating the pointwise optimal bandwidth $h^{*}$,$^{18}$ and then applying the factorization (4.11) using the appropriate $A_N$ value and an approximate value of $B$. To the extent that the application at hand is similar to one of the examples above, the $B$ values in Tables 1, 2a, or 2b may give reasonable performance.$^{19}$ Alternatively, we propose a simple 'plug-in' estimator of $h^{+}$, based on empirical implementations of the bias and variance formulations above.$^{20}$ In particular, to approximate (4.8) or (4.9), we need consistent estimators of $Q_0 \equiv E[q(z_i)]$ and $S_0 \equiv E[s(z_i)]$. Denoting such estimators as $\hat Q$ and $\hat S$, respectively, the optimal bandwidth is estimated as

$$\hat h = \left[\frac{\gamma\,\hat Q}{\alpha\,\hat S^{2}}\right]^{1/(2\alpha+\gamma)} \left(\frac{1}{N}\right)^{2/(2\alpha+\gamma)}. \tag{4.35}$$

From the consistency of $\hat Q$ and $\hat S$, it follows immediately that

$$\hat h - h^{++} = o_p\bigl(N^{-2/(2\alpha+\gamma)}\bigr), \tag{4.36}$$

for $h^{++}$ of (4.10), and therefore we can conclude that

$$\hat h - h^{+} = o_p\bigl(N^{-2/(2\alpha+\gamma)}\bigr), \tag{4.37}$$

for the optimal bandwidth $h^{+}$ of (4.8). The variance coefficient $Q_0$ is estimated by empirically implementing (3.12). Using an initial bandwidth value $h_0$, define

$$\hat Q = \binom{N}{2}^{-1} \sum_{i<j} h_0^{\gamma}\,p(z_i,z_j,h_0)^{2}. \tag{4.38}$$

$^{18}$ See Nychka (1991) for a recent discussion of smoothing parameter choice and references to automatic methods for kernel estimation.
$^{19}$ Hall and Marron (1987) propose an interesting 'plug-in' method of estimating the pointwise optimal bandwidth for density estimation using an estimate of the integrated squared density derivative. They exploit a connection due to integration-by-parts that is not available in our general setting.
$^{20}$ Gasser, Kneip, and Kohler (1991) discuss iterative 'plug-in' methods for approximating optimal pointwise bandwidths.
Consistency of $\hat Q$ for $Q_0$ is obtained by allowing $h_0$ to converge to 0 at a different rate than the optimal bandwidth $h^{+}$. From the U-statistic structure of (4.38), if

$$E[p(z_i,z_j,h_0)^{4}] = O(h_0^{-\eta}), \tag{4.39}$$

for some $\eta > 0$, then Lemma 3.1 of Powell, Stock, and Stoker (1989) implies that $\hat Q$ is consistent for $Q_0$ if $h_0 \to 0$ and $N h_0^{\eta - 2\gamma} \to \infty$ as $N \to \infty$. To estimate the bias coefficient $S_0$, we exploit (3.9) in a differenced form. Since $S_0$ is the leading term in the bias expansion of $\hat\delta(h)$, estimate $S_0$ by

$$\hat S = \frac{\hat\delta(\tau h_0) - \hat\delta(h_0)}{(\tau h_0)^{\alpha} - h_0^{\alpha}}, \tag{4.40}$$

for some positive $\tau \neq 1$. This has expectation

$$E(\hat S) = [(\tau h_0)^{\alpha} - h_0^{\alpha}]^{-1} \cdot \{ S_0\,[(\tau h_0)^{\alpha} - h_0^{\alpha}] + o(h_0^{\alpha}) \} = S_0 + o(1), \tag{4.41}$$

assuming $h_0 \to 0$ as $N \to \infty$. Since $\hat S$ is a U-statistic with kernel

$$p_S(z_i,z_j,h_0) = h_0^{-\alpha}\,[p(z_i,z_j,\tau h_0) - p(z_i,z_j,h_0)]/(\tau^{\alpha} - 1), \tag{4.42}$$

we have that

$$E[p_S^{2}] = O(h_0^{-2\alpha}\,E[p^{2}]) = O(h_0^{-2\alpha-\gamma}), \tag{4.43}$$

so that $\hat S$ is consistent for $S_0$ provided $h_0 \to 0$ and $N h_0^{2\alpha+\gamma} \to \infty$ as $N \to \infty$.$^{21}$ In summary, we have shown:
Proposition 4.2. Under Assumptions 1 and 2, suppose that (4.39) is valid for $\eta > 0$, and let $\rho = \max\{\eta - 2\gamma,\ 2\alpha + \gamma\}$. If $h_0 \to 0$ and $N h_0^{\rho} \to \infty$ as $N \to \infty$, then $\hat Q = \hat Q(h_0)$ and $\hat S = \hat S(h_0)$ are consistent estimators of $Q_0 = E[q(z_i)]$ and $S_0 = E[s(z_i)]$, respectively, and the 'plug-in' bandwidth estimator $\hat h$ obeys

$$\hat h - h^{+} = o_p\bigl(N^{-2/(2\alpha+\gamma)}\bigr). \tag{4.44}$$
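The plug-in steps can be sketched for the simplest case, the average density of scalar data with a Gaussian kernel, where $p(x_i, x_j, h) = h^{-1}\mathcal{K}((x_i - x_j)/h)$, $\alpha = 2$ and $\gamma = 1$. This is a minimal illustration under those assumptions, not the authors' implementation: $\hat Q$ averages $h_0^{\gamma} p^2$ over pairs, $\hat S$ is the differenced estimate of the bias coefficient, and the final bandwidth is assumed to take the form $\hat h = [\gamma \hat Q / (\alpha \hat S^2)]^{1/(2\alpha+\gamma)} N^{-2/(2\alpha+\gamma)}$. All function names are hypothetical.

```python
import math
from itertools import combinations


def gauss(u):
    """Standard normal density, used here as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def p_density(xi, xj, h):
    """U-statistic kernel p(x_i, x_j, h) for the average density of scalar data (k = 1)."""
    return gauss((xi - xj) / h) / h


def pair_mean(x, h, power=1):
    """Average of p(x_i, x_j, h)**power over all pairs i < j."""
    vals = [p_density(a, b, h) ** power for a, b in combinations(x, 2)]
    return sum(vals) / len(vals)


def plug_in_bandwidth(x, h0, tau=0.5, alpha=2, gamma=1):
    """Return (Q_hat, S_hat, h_hat) from an initial bandwidth h0 and a scale tau != 1."""
    n = len(x)
    q_hat = h0 ** gamma * pair_mean(x, h0, power=2)          # variance coefficient estimate
    s_hat = (pair_mean(x, tau * h0) - pair_mean(x, h0)) / (
        (tau * h0) ** alpha - h0 ** alpha
    )                                                        # differenced bias estimate
    p = 1.0 / (2 * alpha + gamma)
    h_hat = (gamma * q_hat / (alpha * s_hat ** 2)) ** p * n ** (-2.0 * p)  # assumed form
    return q_hat, s_hat, h_hat
```

In use, the initial bandwidth `h0` should shrink more slowly than the target bandwidth, in line with the rate condition of Proposition 4.2; the returned `s_hat` is also a direct diagnostic of how large the smoothing bias is at `h0`.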
While Proposition 4.2 gives a solution to the problem of estimating $h^{+}$, we must mention one technical proviso of the result. We have not shown that our conditions will guarantee that the 'plug-in' estimator $\hat\delta(\hat h)$ will be asymptotically equivalent to $\hat\delta(h^{+})$, because the mean squared error calculations used to derive the form of the optimal bandwidth held $h$ fixed in calculating the moments of the U-statistic in (2.1). The MSE formulae do not follow immediately if the bandwidth $h$ were replaced by a (stochastic) bandwidth value that was constructed using the same data as appear in the U-statistic, and derivation of general large-sample properties of $\hat\delta(\hat h)$ would require much stronger conditions (e.g., higher-order differentiability) on the kernel function $p(z_i,z_j,h)$.

$^{21}$ With reference to the remarks following (4.3), the estimator $\hat S$ can estimate the bias in situations where $\hat\delta(h)$ does not obey all the tenets of $\sqrt{N}$-consistency for $\delta_0$. While our justification of $\hat S$ is asymptotic, the estimator $\hat S$ may provide a practical method of studying small-sample bias issues, such as those discussed in Stoker (1993a, 1993b).

A straightforward but inelegant solution to this technical problem could be based upon a familiar 'sample-splitting' device: the bandwidth $\hat h$ could be constructed using, say, the first $N^{*} = O(\ln(N))$ observations on $z_i$, with the remaining observations being used to form the U-statistic in (2.1). While this sample-splitting approach would ensure the equivalence of $\hat\delta(\hat h)$ and $\hat\delta(h^{+})$, it would clearly lead to very imprecise estimates of the optimal bandwidth in practice, and an approach that made use of the whole sample in both steps seems more likely to be well-behaved.

Another straightforward solution would be to discretize the set of possible scaling constants, replacing the estimated constant term with the closest value in some finite set. While the optimal constant will generally not be in this set, it can be arbitrarily well approximated if the mesh of the set is small enough; the rate of convergence of the discretized constant will be arbitrarily high in probability, so this 'plug-in' bandwidth will not affect the asymptotic properties of the U-statistic.
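The discretization device can be sketched as follows. This is an illustration only: the grid spacing `mesh`, the upper limit `c_max`, and both function names are hypothetical choices, and the bandwidth is assumed to be written as a scaling constant times $N^{-2/(2\alpha+\gamma)}$.

```python
def discretize_constant(c_hat, grid):
    """Return the grid value closest to c_hat (ties go to the smaller value)."""
    return min(sorted(grid), key=lambda g: abs(g - c_hat))


def discretized_bandwidth(c_hat, n, alpha, gamma, mesh=0.05, c_max=10.0):
    """Snap the estimated scaling constant to a finite grid, then apply the N-rate."""
    steps = int(round(c_max / mesh))
    grid = [mesh * j for j in range(1, steps + 1)]   # finite set of candidate constants
    return discretize_constant(c_hat, grid) * n ** (-2.0 / (2 * alpha + gamma))
```

A finer `mesh` approximates the estimated constant more closely, while keeping the set of candidate constants finite.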
5. The optimal bandwidth for estimating ratios of density-weighted averages

As outlined for Examples 2.2 and 2.3, there are various estimators of interest that take the form of ratios of density-weighted averages. We can solve the optimal bandwidth problem for estimators of this type by a straightforward modification of the derivations above. We now spell out this modification.$^{22}$ In particular, as motivated by (2.9) and (2.13) of Examples 2.2 and 2.3, we are often interested in the estimation of a parameter of the form

$$\beta_0 = [\delta_0^{x}]^{-1}\,\delta_0^{y}. \tag{5.1}$$

The corresponding estimator takes the form

$$\hat\beta(h) = [\hat\delta^{x}(h)]^{-1}\,\hat\delta^{y}(h). \tag{5.2}$$

The optimal bandwidth for the estimator (5.2) can be studied by manipulations similar to those familiar from linear regression analysis. In particular, we focus the variation of $\hat\beta(h)$ on the 'residuals' of the problem by defining

$$\hat\delta^{u}(h) \equiv \hat\delta^{y}(h) - \hat\delta^{x}(h)\,\beta_0. \tag{5.3}$$

$^{22}$ We do not consider the possibility of using different bandwidths for the 'numerator' and 'denominator' of the ratio, but rather a single bandwidth for each. Our version is directly motivated by certain problems, such as the instrumental variables estimator of Example 2.2. Different bandwidths would imply that different instruments are used for the 'x' moments than for the 'y' moments.

$\hat\delta^{u}(h)$ is a second-order U-statistic of the form (2.1) with kernel

$$u(z_i,z_j,h) \equiv p^{y}(z_i,z_j,h) - p^{x}(z_i,z_j,h)\,\beta_0, \tag{5.4}$$

where $p^{y}(\cdot)$ and $p^{x}(\cdot)$ are the kernels of the U-statistics $\hat\delta^{y}(h)$ and $\hat\delta^{x}(h)$, respectively. By construction, $\hat\delta^{u}(h)$ is an estimator of

$$\delta_0^{u} = \delta_0^{y} - \delta_0^{x}\,\beta_0 = 0. \tag{5.5}$$

Returning to $\hat\beta(h)$, we have that

$$\hat\beta(h) - \beta_0 = [\hat\delta^{x}(h)]^{-1}\,\hat\delta^{u}(h) = [\delta_0^{x}]^{-1}\,\hat\delta^{u}(h) + \bigl([\hat\delta^{x}(h)]^{-1} - [\delta_0^{x}]^{-1}\bigr)\,\hat\delta^{u}(h). \tag{5.6}$$

As long as $\hat\delta^{x}(h)$ is $N^{\eta}$-consistent for some $\eta > 0$, the second term on the right-hand side of (5.6) is of smaller order (in $N$) than the first, so that the rate of convergence of $\hat\beta(h)$ to $\beta_0$ is the same as that of $\hat\delta^{u}(h)$ to zero. Thus the optimal bandwidth for $\hat\beta(h)$ is the same as that for the U-statistic based on the scaled residuals $[\delta_0^{x}]^{-1}\,u(z_i,z_j,h)$.

It is easy to check that the use of an estimated 'residual' does not change any of the conclusions above, provided the rate of convergence of $\hat\delta^{x}(h)$ is the same as for $\hat\delta^{u}(h)$. To be more precise, suppose that $\hat\delta^{x}$ and $\hat\beta$ are consistent estimators of $\delta_0^{x}$ and $\beta_0$, say, based on an initial bandwidth value, and define

$$[\hat\delta^{x}]^{-1}\,\hat\delta^{\hat u}(h) \equiv [\hat\delta^{x}]^{-1}\,[\hat\delta^{y}(h) - \hat\delta^{x}(h)\,\hat\beta]. \tag{5.7}$$

Then we have that

$$[\hat\delta^{x}]^{-1}\,\hat\delta^{\hat u}(h) = [\delta_0^{x}]^{-1}\,\hat\delta^{u}(h) - \bigl([\delta_0^{x}]^{-1} - [\hat\delta^{x}]^{-1}\bigr)\,\hat\delta^{u}(h) - [\hat\delta^{x}]^{-1}\,\hat\delta^{x}(h)\,[\hat\beta - \beta_0] = [\delta_0^{x}]^{-1}\,\hat\delta^{u}(h) + o_p\bigl([\delta_0^{x}]^{-1}\,\hat\delta^{u}(h)\bigr) + o_p\bigl(\hat\delta^{x}(h)\bigr). \tag{5.8}$$

Replacement of $[\hat\delta^{x}]^{-1}\,\hat\delta^{\hat u}(h)$ by $[\delta_0^{x}]^{-1}\,\hat\delta^{u}(h)$ does not affect the leading terms of mean squared error, nor does it alter the optimal bandwidth value. In summary, we have shown:

Proposition 5.1. Under Assumptions 1 and 2 applied to $\hat\delta^{y}(h)$ and $\hat\delta^{x}(h)$, the optimal bandwidth for estimating the ratio $\hat\beta(h) = [\hat\delta^{x}(h)]^{-1}\,\hat\delta^{y}(h)$ is the same as that for estimating $[\delta_0^{x}]^{-1}\,\hat\delta^{u}(h)$, the density-weighted average with scaled 'residual' kernel $[\delta_0^{x}]^{-1}\,u(z_i,z_j,h)$ of (5.4). This bandwidth is consistently estimated using consistent estimates $\hat\delta^{x}$ and $\hat\beta$ to construct the scaled residual kernel.
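The construction in Proposition 5.1 can be sketched for the scalar case, where the 'denominator' average is a single number rather than a matrix (a simplifying assumption made here for illustration). Given pairwise kernels for the 'y' and 'x' averages, the code forms the ratio estimate at an initial bandwidth and returns the scaled residual kernel, which can then be passed to any routine that estimates the bias and variance coefficients of a density-weighted average. The function names and the kernel call signature `kernel(z_i, z_j, h)` are hypothetical.

```python
from itertools import combinations


def pair_average(z, kernel, h):
    """Second-order U-statistic: average of kernel(z_i, z_j, h) over all pairs i < j."""
    vals = [kernel(a, b, h) for a, b in combinations(z, 2)]
    return sum(vals) / len(vals)


def ratio_and_residual_kernel(z, p_y, p_x, h_init):
    """Scalar case: return beta_hat and the scaled 'residual' kernel built from it."""
    d_y = pair_average(z, p_y, h_init)
    d_x = pair_average(z, p_x, h_init)
    beta_hat = d_y / d_x                      # ratio estimate at the initial bandwidth

    def scaled_residual(a, b, h):
        # residual kernel p_y - p_x * beta_hat, scaled by the inverse 'denominator' average
        return (p_y(a, b, h) - p_x(a, b, h) * beta_hat) / d_x

    return beta_hat, scaled_residual
```

Evaluated at the initial bandwidth itself, the residual kernel averages to zero in the sample by construction, so the bias and variance coefficients should be estimated from it at other bandwidth values, as in the differenced estimator of Section 4.4.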
It is worthwhile noting one special case of our results, because of its appearance in common model designs. In particular, there are settings where the biases in $\hat\beta(h)$ and $\hat\delta^{u}(h)$ are identically zero. For instance, consider the instrumental variables estimators for Example 2.2 [implementing (2.7)]. If the true model is linear, since the 'instruments' $\partial \hat f_i(x_i)/\partial x$ are solely functions of $x$, the coefficient estimator $\hat d(h)$ is conditionally unbiased for the linear coefficients. This implies that the optimal bandwidth calculation only contains the terms for the variance of $\hat d(h)$, which are decreasing in $h$. Consequently, one will want to set the bandwidth to a 'large' value in this setting.$^{23}$ This is reflected in the optimal bandwidth formula (4.11) by noting that $E[s(z_i)]^2 = 0$ in this case. At any rate, our 'plug-in' estimation method of Section 4 permits determining whether this is the case empirically, in that a small estimate of the bias in $\hat\delta^{u}(h)$ will translate to a large bandwidth value.
$^{23}$ More precisely, the optimal bandwidth value may not shrink with sample size in this case. However, while larger bandwidth values decrease variance, at the limit $h = \infty$, $\partial \hat f_i(x_i)/\partial x$ is everywhere zero, and therefore fails to be a proper instrument.

6. Conclusion

In this paper we have characterized the optimal bandwidth for estimating density-weighted averages and ratios of density-weighted averages. Our main purpose was to provide a guide for choice of bandwidth for estimators that employ kernel estimators in the form of density-weighted averages. Most of the existing asymptotic theory for these estimators had little to say about how to set bandwidth values in applications, so our results give more specific help for this problem. A natural next step is to study the performance of fully automatic methods (namely estimators that use estimates of optimal bandwidths), to see whether approximating an optimal bandwidth gives rise to real practical benefits.

One concern raised by the computed bandwidth values of Section 4.3 is the quality of the asymptotic approximations we have employed. In particular, if the bandwidth is of the same order (say one half) as the standard error of the data components, then one could question the quality of our analysis based on leading terms in a series expansion. Further research is indicated to see whether the remainder terms substantially affect the optimal bandwidth value. Nevertheless, computing the estimators of Section 4.4 (of the leading coefficients of bias and variance) will be informative in applications, for indicating how sensitive the estimated results are to the bandwidth values used.

Another object of the paper was to give a concrete comparison of bandwidth settings for optimal function approximation versus optimal performance of a derived semiparametric estimator. The simple framework above gave rise to an immediate relationship of this type, and permitted us to compare a pointwise optimal bandwidth with an optimal bandwidth for the semiparametric problem. Since the nonparametric estimator for this exercise is intrinsically connected to the density-weighted average of interest, our results somewhat beg the question of bandwidth choice in other semiparametric contexts. However, the simple structure of our results may provide some general insight in how to adjust for different uses of nonparametric estimators, as well as how to practically implement 'asymptotic undersmoothing'.

References

Andrews, D.W.K., 1989, Asymptotics for semiparametric econometric models: I. Estimation, Working paper (Cowles Foundation, Yale University, New Haven, CT).
Bickel, P.J. and Y. Ritov, 1988, Estimating integrated squared density derivatives: Sharp best order of convergence estimates, Sankhyā 50 A, 381-393.
Gasser, T., A. Kneip, and W. Kohler, 1991, A flexible and fast method for automatic smoothing, Journal of the American Statistical Association 86, 643-653.
Goldstein, L. and K. Messer, 1990, Optimal plug-in estimators for nonparametric functional estimation, Technical report no. 277 (Stanford University, Stanford, CA).
Hall, P. and J.S. Marron, 1987, Estimation of integrated squared density derivatives, Statistics and Probability Letters 6, 109-115.
Härdle, W., 1991, Applied nonparametric regression, Econometric Society monographs (Cambridge University Press, Cambridge).
Härdle, W. and T.M. Stoker, 1989, Investigating smooth multiple regression by the method of average derivatives, Journal of the American Statistical Association 84, 986-995.
Härdle, W. and A.B. Tsybakov, 1993, How sensitive are average derivatives?, Journal of Econometrics 58, 31-48.
Härdle, W., P. Hall, and S. Marron, 1988, How far are automatically chosen regression smoothing parameters from their optimum?, Journal of the American Statistical Association 83, 86-95.
Härdle, W., J. Hart, J.S. Marron, and A.B. Tsybakov, 1992, Bandwidth choice for average derivative estimation, Journal of the American Statistical Association 87, 227-233.
Hastie, T.J. and R.J. Tibshirani, 1990, Generalized additive models (Chapman and Hall, London).
Hoeffding, W., 1948, A class of statistics with asymptotically normal distribution, Annals of Mathematical Statistics 19, 293-325.
Jaeckel, L.A., 1972, Estimating regression coefficients by minimizing the dispersion of the residuals, Annals of Mathematical Statistics 43, 1449-1458.
Jones, M.C. and S.J. Sheather, 1991, Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives, Statistics and Probability Letters 11, 511-514.
Jurečková, J., 1971, Nonparametric estimate of regression coefficients, Annals of Mathematical Statistics 42, 1328-1338.
Newey, W.K., 1994, The asymptotic variance of semiparametric estimators, Econometrica 62, 1349-1382.
Nychka, D., 1991, Choosing a range for the amount of smoothing in nonparametric regression, Journal of the American Statistical Association 86, 653-665.
Powell, J.L., 1987, Semiparametric estimation of bivariate latent variables models, Draft (University of Wisconsin, Madison, WI).
Powell, J.L., J.H. Stock, and T.M. Stoker, 1989, Semiparametric estimation of index coefficients, Econometrica 57, 1403-1430.
Robinson, P.M., 1988, Root-N-consistent semiparametric regression, Econometrica 56, 931-954.
Serfling, R.J., 1980, Approximation theorems of mathematical statistics (Wiley, New York, NY).
Stoker, T.M., 1992, Lectures on semiparametric econometrics, CORE lecture series (CORE Foundation, Louvain-la-Neuve).
Stoker, T.M., 1993a, Smoothing bias in the estimation of density derivatives, Journal of the American Statistical Association 88, 855-863.
Stoker, T.M., 1993b, Smoothing bias in the measurement of marginal effects, Journal of Econometrics, forthcoming.