Journal of Statistical Planning and Inference 136 (2006) 2506 - 2534
www.elsevier.com/locate/jspi
Estimation of semi-parametric additive coefficient model Lan Xue∗ , Lijian Yang Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA Received 25 February 2004; received in revised form 29 September 2004; accepted 5 November 2004 Available online 24 December 2004

Abstract

In the multivariate regression setting, we propose a flexible varying coefficient model in which the regression coefficients of some predictors are additive functions of other predictors. Marginal integration estimators of the coefficients are developed and their asymptotic properties investigated. Under α-mixing, it is found that the estimators of the parameters in the regression coefficients have rate of convergence 1/√n, and the nonparametric additive components are estimated at the same rate of convergence as in univariate smoothing. A data-driven bandwidth selection method is developed based on asymptotic considerations; its effectiveness is confirmed in a Monte-Carlo study. The procedure is applied to the real German GNP and Wolf's sunspot data, where the semi-parametric additive coefficient model demonstrates superior performance in terms of out-of-sample forecasts. © 2004 Elsevier B.V. All rights reserved.

Keywords: German real GNP; Local polynomial; Marginal integration; Out-of-sample forecast; Rate of convergence; Varying coefficient model; Wolf's sunspot number

1. Introduction

The nonparametric regression model has been widely used in various applications due to its ability to discover data structures that linear and parametric models fail to detect. A serious limitation of the general nonparametric model, one that gives statisticians reservations, is the "curse of dimensionality" phenomenon. This term refers to the fact that the convergence rate of nonparametric smoothing estimators becomes rather slow when the estimation target is a

∗ Corresponding author.

E-mail address: [email protected] (L. Xue). 0378-3758/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2004.11.003


general function of a large number of variables without additional structure. Many efforts have been made to impose structure on the regression function to partly alleviate the "curse of dimensionality"; such efforts are broadly described as dimension reduction. Some well-known dimension reduction approaches are: (generalized) additive models (Chen and Tsay, 1993a; Hastie and Tibshirani, 1990; Sperlich et al., 2002; Stone, 1985), partially linear models (Härdle et al., 2000) and varying coefficient models (Hastie and Tibshirani, 1993). The idea of the varying coefficient model is especially appealing. It allows a response variable to depend linearly on some regressors, with coefficients that are smooth functions of other predictor variables. The additive-linear structure enables simple interpretation and avoids the curse of dimensionality in high-dimensional cases. Specifically, consider a multivariate regression model in which a sample {(Y_i, X_i, T_i)}_{i=1}^n is drawn that satisfies

Y_i = m(X_i, T_i) + σ(X_i, T_i) ε_i,   (1.1)

where, for the response variables Y_i and predictor vectors X_i and T_i, m and σ² are the conditional mean and variance functions

m(X_i, T_i) = E(Y_i | X_i, T_i),  σ²(X_i, T_i) = var(Y_i | X_i, T_i)   (1.2)

and E(ε_i | X_i, T_i) = 0, var(ε_i | X_i, T_i) = 1. For the varying coefficient model, the conditional mean takes the following form:

m(X, T) = Σ_{l=1}^{d} α_l(X_l) T_l   (1.3)

in which all tuning variables X_l, l = 1,...,d, make up the vector X, and all linear predictor variables T_l, l = 1,...,d, are univariate and distinct. Hastie and Tibshirani (1993) proposed a backfitting algorithm to estimate the varying coefficient functions {α_l(x_l)}_{1≤l≤d}, but gave no asymptotic justification of the algorithm. A somewhat restricted model, the functional coefficient model, was proposed in the time series context by Chen and Tsay (1993b) and later in the context of longitudinal data by Hoover et al. (1998), in which all the tuning variables X_l, l = 1,...,d, are the same and univariate. For more recent developments of the functional coefficient model, see Cai et al. (2000). In a different direction, Yang et al. (2004) studied inference for model (1.3) when all the tuning variables {X_il}_{1≤l≤d} are univariate but have a joint d-dimensional density. This model breaks the restrictive nature of the functional coefficient model, which requires all the tuning variables X_l, l = 1,...,d, to be equal. On the other hand, it requires that none of the tuning variables X_l, l = 1,...,d, are equal. In this paper, we propose the following additive coefficient model, which has a more flexible form, namely

m(X, T) = Σ_{l=1}^{d_1} α_l(X) T_l,  α_l(X) = Σ_{s=1}^{d_2} α_ls(X_s),  ∀ 1 ≤ l ≤ d_1   (1.4)

in which the coefficient functions {α_l(X)}_{l=1}^{d_1} are additive functions of the tuning variables X = (X_1,...,X_{d_2})^T. Note that without the additivity restriction on the coefficient functions {α_l(X)}_{l=1}^{d_1}, model (1.4) would be a kind of functional coefficient model with a multivariate tuning variable X instead of a univariate one as in the existing literature. The additive structure is imposed on the coefficient functions {α_l(X)}_{l=1}^{d_1} so that inference can be made on them without the "curse of dimensionality".

Table 1
German real GNP: the ASEs and ASPEs of four fits

                                                      ASE        ASPE
Additive coefficient model (1.5), local linear fit    0.000201   0.000085
Additive coefficient model (1.5), local cubic fit     0.000205   0.000077
Linear AR(24) model (4.2)                             0.000253   0.000112
Linear AR(248) model (4.3)                            0.000258   0.000116

To understand the flexibility of this model, we look at some of the models that are included as special cases:

1. When the dimension of X is 1 (d_2 = 1), (1.4) reduces to the functional coefficient model of Chen and Tsay (1993b).
2. When the linear regressor vector T is constant (d_1 = 1 and T_1 ≡ 1), (1.4) reduces to the additive model of Chen and Tsay (1993a) and Hastie and Tibshirani (1990).
3. When, for any fixed l = 1,...,d_1, α_ls(x_s) ≡ 0 for all but one s = 1,...,d_2, (1.4) reduces to the varying coefficient model (1.3) of Hastie and Tibshirani (1993).
4. When d_1 = d_2 = d and α_ls(x_s) ≡ 0 for l ≠ s, (1.4) reduces to the varying-coefficient model of Yang et al. (2004).

The additive coefficient model is a useful nonparametric alternative to parametric models. To gain some insight into it, consider the application of our estimation procedure to the quarterly West German real GNP data from January 1960 to April 1990. Denote this time series by {G_t}_{t=1}^{124}, where G_t is the real GNP in the tth quarter (the first quarter being January 1, 1960 to April 1, 1960, and the 124th quarter being January 1, 1990 to April 1, 1990). Yang and Tschernig (2002) deseasonalized this series by removing the four seasonal means from the series log(G_{t+4}/G_{t+3}), t = 1,...,120. Denote the transformed time series by {Y_t}_{t=1}^{120}. As the nonparametric alternative to the best fitting linear autoregressive model (4.2) in Section 4.2, we have fitted the following additive coefficient model (details in Section 4.2):

Y_t = {c_1 + α_11(Y_{t-1}) + α_12(Y_{t-8})} Y_{t-2} + {c_2 + α_21(Y_{t-1}) + α_22(Y_{t-8})} Y_{t-4} + ε_t.   (1.5)

Using this model, we can efficiently take into account the phenomenon that the effects of Y_{t-2} and Y_{t-4} on Y_t vary with Y_{t-1} and Y_{t-8}. The efficiency is evidenced by superior out-of-sample one-step prediction at each of the last 10 quarters. The averaged squared prediction error (ASPE) is 0.000112 for the linear autoregressive fit (4.2), and 0.000077 or 0.000085 for the two fits of the additive coefficient model (1.5). Hence the reduction in ASPE is between 31% and 46%; see Table 1. Fig. 1 clearly illustrates this improvement in prediction power.


Fig. 1. German real GNP: one-step prediction performance for the last 10 quarters (observed vs predicted values of the seasonally adjusted GNP data). Circle denotes the observed value, triangle denotes the prediction by the additive coefficient model (1.5), and cross denotes the prediction by the linear autoregressive model (4.2).

One can see that the additive coefficient model outperforms the linear autoregressive model in prediction for 8 of the 10 quarters.

We organize the paper as follows. In Section 2, we present the estimation procedure for the coefficient functions in model (1.4) and the asymptotic properties of the estimators. In Section 3, we discuss computational issues relating to the implementation of the marginal integration method of Section 2. In Section 4, simulation results and applications to two empirical examples are presented. Technical proofs are contained in the Appendix.

2. The estimators

2.1. Model identification

For the additive coefficient model, the regression function m(X, T) in (1.4) needs to be identified. One practical solution is to rewrite it as

m(X, T) = Σ_{l=1}^{d_1} α_l(X) T_l,  α_l(X) = c_l + Σ_{s=1}^{d_2} α_ls(X_s),  ∀ 1 ≤ l ≤ d_1   (2.1)

with the identification conditions

E{w(X) α_ls(X_s)} ≡ 0,  l = 1,...,d_1,  s = 1,...,d_2   (2.2)

for some nonnegative weight function w with E{w(X)} = 1. The weight function w is introduced so that estimation of the unknown functions {α_l(X)}_{1≤l≤d_1} is carried out only on the support of w, supp(w), which is compact according to assumption (A7). This is important, as most asymptotic results for kernel-type estimators are developed only for values over compact sets. With this weight function, the support of the distribution of X is not required to be compact. This relaxation is very desirable, since most time series distributions are not compactly supported. See Yang and Tschernig (2002, p. 1414) for a similar use of the weight function.

Note that (2.2) does not impose any restriction on the model, since any regression function m(X, T) = Σ_{l=1}^{d_1} Σ_{s=1}^{d_2} α*_ls(X_s) T_l can be reorganized to satisfy (2.2), by writing

m(X, T) = Σ_{l=1}^{d_1} {c_l + Σ_{s=1}^{d_2} α_ls(X_s)} T_l

with c_l = E{w(X) Σ_{s=1}^{d_2} α*_ls(X_s)} and α_ls(X_s) = α*_ls(X_s) - E{w(X) α*_ls(X_s)}.

In addition, for the functions {α_ls(X_s)}_{1≤l≤d_1, 1≤s≤d_2} and the parameters {c_l}_{1≤l≤d_1} to be uniquely determined, one imposes an additional assumption:

(A0) There exists a constant C > 0 such that, for any set of measurable functions {b_ls(X_s)}_{1≤l≤d_1, 1≤s≤d_2} that satisfy (2.2) and any set of constants {a_l}_{1≤l≤d_1}, the following holds:

E[Σ_{l=1}^{d_1} {a_l + Σ_{s=1}^{d_2} b_ls(X_s)} T_l]² ≥ C[Σ_{l=1}^{d_1} a_l² + Σ_{l=1}^{d_1} Σ_{s=1}^{d_2} E{b_ls²(X_s)}].   (2.3)
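The centering that enforces (2.2) can be sketched numerically: given arbitrary raw components α*_ls, subtracting the weighted mean E{w(X) α*_ls(X_s)} and absorbing it into the constant yields components satisfying (2.2). A minimal Monte-Carlo sketch (the component functions and the weight below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-np.pi, np.pi, size=(n, 2))          # tuning variables X1, X2

# hypothetical raw additive components alpha*_ls (not from the paper)
alpha_star = [np.sin(X[:, 0]), X[:, 1]]

# nonnegative weight with compact support, normalized so that E{w(X)} = 1
w = ((np.abs(X[:, 0]) <= 2.5) & (np.abs(X[:, 1]) <= 2.5)).astype(float)
w /= w.mean()

# absorb the weighted means into the constant c; components become w-centered
means = [np.mean(w * a) for a in alpha_star]
c = sum(means)
alpha = [a - m for a, m in zip(alpha_star, means)]

# identification condition (2.2): E{w(X) alpha_ls(X_s)} = 0
for a in alpha:
    assert abs(np.mean(w * a)) < 1e-8
```

The same algebra works for any nonnegative weight once it is normalized, which is why (2.2) costs no generality.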

Lemma 1. Under assumptions (A0) and (A5) in the Appendix, the representation (2.1) subject to (2.2) is unique.

Proof. Suppose that

m(X, T) = Σ_{l=1}^{d_1} {c_l + Σ_{s=1}^{d_2} α_ls(X_s)} T_l = Σ_{l=1}^{d_1} {c̃_l + Σ_{s=1}^{d_2} α̃_ls(X_s)} T_l

with both the set {α_ls(X_s)}_{1≤l≤d_1, 1≤s≤d_2}, {c_l}_{1≤l≤d_1} and the set {α̃_ls(X_s)}_{1≤l≤d_1, 1≤s≤d_2}, {c̃_l}_{1≤l≤d_1} satisfying (2.2). Then, upon defining for all s, l

b_ls(X_s) ≡ α̃_ls(X_s) - α_ls(X_s),  a_l ≡ c̃_l - c_l,

one has Σ_{l=1}^{d_1} {a_l + Σ_{s=1}^{d_2} b_ls(X_s)} T_l ≡ 0. Hence by assumption (A0)

0 = E[Σ_{l=1}^{d_1} {a_l + Σ_{s=1}^{d_2} b_ls(X_s)} T_l]² ≥ C[Σ_{l=1}^{d_1} a_l² + Σ_{l=1}^{d_1} Σ_{s=1}^{d_2} E{b_ls²(X_s)}],

entailing that for all s, l, a_l ≡ 0 and b_ls²(X_s) ≡ 0 almost surely. Since assumption (A5) requires that all X_s are continuous random variables, one has b_ls(x) ≡ 0 for all s, l. □


2.2. The model

Consider {(Y_i, X_i, T_i)}_{i=1}^n, a sample that follows (1.1) and (1.2), whose conditional mean function is described by (2.1) with the identifiability conditions (2.2), (2.3). The error terms {ε_i}_{i=1}^n are i.i.d. with Eε_i = 0, Eε_i² = 1, and with the additional property that ε_i is independent of {(X_j, T_j), j ≤ i}, i = 1,...,n. With this error structure, the explanatory variable vector (X_i, T_i) can contain exogenous variables and/or lagged values of Y_i. If (X_i, T_i) contains only lags of Y_i, it is a semi-parametric time series model, a useful extension of many existing nonlinear and nonparametric time series models such as the exponential autoregressive (EXPAR), threshold autoregressive (TAR) and functional autoregressive (FAR) models.

In (2.1), for every l = 1,...,d_1, the coefficient of T_il consists of two parts: the unknown parameter c_l and the unknown univariate functions {α_ls}_{1≤s≤d_2}. The marginal integration method will be applied to estimate both. The marginal integration method was first discussed in Linton and Nielsen (1995) in the context of additive models; see also the marginal integration method for generalized additive models in Linton and Härdle (1996). To see how the marginal integration method works in our context, observe that, according to the identification condition (2.2), for every l = 1,...,d_1 one has

c_l = E{w(X) α_l(X)} = ∫ w(x) α_l(x) φ(x) dx   (2.4)

and for every point x = (x_1,...,x_{d_2})^T and every l = 1,...,d_1, s = 1,...,d_2, one has

c_l + α_ls(x_s) = E{w_{-s}(X_{-s}) α_l(x_s, X_{-s})} = ∫ w_{-s}(u_{-s}) α_l(x_s, u_{-s}) φ_{-s}(u_{-s}) du_{-s},   (2.5)

where u_{-s} = (u_1,...,u_{s-1}, u_{s+1},...,u_{d_2})^T, (x_s, u_{-s}) = (u_1,...,u_{s-1}, x_s, u_{s+1},...,u_{d_2})^T, the density of X is φ, the marginal density of X_{-s} = (X_1,...,X_{s-1}, X_{s+1},...,X_{d_2})^T is φ_{-s}, and w_{-s}(x_{-s}) = E{w(X_s, x_{-s})} = ∫ w(u, x_{-s}) φ_s(u) du, where φ_s denotes the marginal density of X_s. Intuitively, one has

c_l ≈ Σ_{i=1}^n w(X_i) α_l(X_i) / Σ_{i=1}^n w(X_i),  α_ls(x_s) ≈ Σ_{i=1}^n w_{-s}(X_{i,-s}) α_l(x_s, X_{i,-s}) / Σ_{i=1}^n w_{-s}(X_{i,-s}) - c_l   (2.6)

and the d_2-dimensional functions {α_l(x)}_{l=1}^{d_1} in (2.6) can be replaced by the usual local polynomial estimators. This is the essential idea behind the marginal integration method.
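The mechanics of (2.6) can be isolated with an oracle sketch: if the coefficient surface α_l were known, averaging α_l(x_s, X_{i,-s}) over the sample of X_{-s} recovers c_l + α_ls(x_s). The surface below is our own illustrative choice (in practice it is replaced by a local polynomial estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X1, X2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)

# hypothetical coefficient surface alpha(x1, x2) = c + a1(x1) + a2(x2),
# with both components mean-zero under the uniform weight w = 1
c = 2.0
alpha = lambda x1, x2: c + np.sin(np.pi * x1) + (x2 ** 2 - 1 / 3)

# marginal integration at x1 = 0.5: average over the empirical X2 distribution
x1 = 0.5
c_plus_a1 = np.mean(alpha(x1, X2))     # approximates c + a1(0.5), as in (2.5)-(2.6)
c_hat = np.mean(alpha(X1, X2))         # approximates c, as in (2.4)
a1_hat = c_plus_a1 - c_hat             # approximates a1(0.5) = sin(pi/2) = 1
```

Averaging over X2 "integrates out" the second direction, which is exactly what reduces the variance of the final estimator below.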

2.3. Estimators of the constants {c_l}

According to the first approximation in (2.6), to estimate the constants {c_l}_{l=1}^{d_1} we first estimate the unknown functions {α_l(x)}_{l=1}^{d_1} at those data points X_i that lie in the support of the weight function w. More generally, for any fixed x ∈ supp(w), we approximate α_l(x) locally by a constant β_l, and estimate {α_l(x)}_{l=1}^{d_1} by minimizing the following weighted sum of squares with respect to β = (β_1,...,β_{d_1})^T:

Σ_{i=1}^n (Y_i - Σ_{l=1}^{d_1} β_l T_il)² K_H(X_i - x),   (2.7)

where K is a d_2-variate kernel function of order q_1 (see assumption (A1) in the Appendix), H = diag{h_{0,1},...,h_{0,d_2}} is a diagonal matrix of positive numbers h_{0,1},...,h_{0,d_2}, called bandwidths, and

K_H(x) = (Π_{s=1}^{d_2} h_{0,s})^{-1} K(x_1/h_{0,1},...,x_{d_2}/h_{0,d_2}).

Let β̂ = (β̂_1,...,β̂_{d_1})^T be the solution to the least-squares problem (2.7). Note that β̂ depends on x, as does (2.7), and the components of β̂ give the estimators of {α_l(x)}_{l=1}^{d_1}. To emphasize the dependence on x, we write β̂ = β̂(x) = (α̂_1(x),...,α̂_{d_1}(x))^T. More precisely, let

W(x) = diag{K_H(X_i - x)/n}_{1≤i≤n},  Z = (T_il)_{1≤i≤n, 1≤l≤d_1} (the n × d_1 matrix with (i, l)th entry T_il),

Y = (Y_1,...,Y_n)^T, and let e_l be the d_1-dimensional vector with all entries 0 except the lth entry, which is 1. Then {α̂_l(x)}_{1≤l≤d_1} is given by

α̂_l(x) = e_l^T {Z^T W(x) Z}^{-1} Z^T W(x) Y.   (2.8)

By (2.6), the parameter c_l can be estimated as a weighted average of the α̂_l(X_i), i.e.,

ĉ_l = Σ_{i=1}^n w(X_i) α̂_l(X_i) / Σ_{i=1}^n w(X_i),  l = 1,...,d_1.   (2.9)

Theorem 1. Under assumptions (A1)-(A7) in the Appendix, for any l = 1,...,d_1,

√n (ĉ_l - c_l) →_L N{0, σ_l²},

where the asymptotic variance σ_l² is defined in (A.11) in the Appendix.

The rate of 1/√n at which ĉ_l converges to c_l is due to two special features of α̂_l(x). First, the bias of α̂_l(x) in estimating α_l(x) consists of terms of order h_{0,1}^{q_1},...,h_{0,d_2}^{q_1}, bounded by 1/√n according to assumption (A6)(a); see the derivation in Lemma A.5 concerning the term P_c1. Second, the usual variance of α̂_l(x) in estimating α_l(x) is proportional to n^{-1} h_{0,1}^{-1} ··· h_{0,d_2}^{-1}, which is reduced to 1/n by the averaging in (2.9); see the derivation of the term P_c2 in (A.9) and (A.10). This technique of simultaneously reducing the bias by the use of a higher-order kernel and "integrating out the variance" is the common feature of all marginal integration procedures.
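The local constant fit (2.7)-(2.8) is a kernel-weighted least squares of Y on T, and (2.9) averages it over the design points. A sketch with a one-dimensional X, two linear regressors, constant true coefficients, and w ≡ 1 (all design choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(-1, 1, n)                            # univariate tuning variable
T = rng.standard_normal((n, 2))                      # linear regressors T_i1, T_i2
c_true = np.array([2.0, 1.0])
Y = T @ c_true + 0.1 * rng.standard_normal(n)        # constant coefficients alpha_l(x) = c_l

def alpha_hat(x, h=0.3):
    """Local constant fit (2.8): weighted LS of Y on T with kernel weights in X."""
    K = np.maximum(1 - ((X - x) / h) ** 2, 0.0)      # Epanechnikov-type weights
    W = K / (n * h)
    A = T.T @ (W[:, None] * T)                       # Z^T W(x) Z
    b = T.T @ (W * Y)                                # Z^T W(x) Y
    return np.linalg.solve(A, b)

# (2.9) with w = 1: average the pointwise fits over (a subsample of) the X_i
c_hat = np.mean([alpha_hat(x) for x in X[:200]], axis=0)
```

Each pointwise fit is noisy at the nonparametric rate, but the averaging step is what restores the parametric 1/√n behaviour described above.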


2.4. Estimators of the functions {α_ls}_{1≤l≤d_1, 1≤s≤d_2}

In the following, we illustrate the procedure for estimating the functions {α_ls(x_s)}_{1≤l≤d_1}, for any fixed s = 1,...,d_2. Let x_s be a point at which we want to evaluate the functions {α_ls(x_s)}_{1≤l≤d_1}. According to (2.6), we need to estimate {α_l(x)}_{l=1}^{d_1} at those points (x_s, X_{i,-s}) that lie in the support of w. For any x ∈ supp(w), differently from estimating the constants, we approximate the function α_l(u) locally at x by α_l(u) ≈ β_l + Σ_{j=1}^p γ_lj (u_s - x_s)^j, and estimate {α_l(x)}_{l=1}^{d_1} by minimizing the following weighted sum of squares with respect to β = (β_1,...,β_{d_1})^T, γ = (γ_11,...,γ_1p,...,γ_{d_1 1},...,γ_{d_1 p})^T:

Σ_{i=1}^n [Y_i - Σ_{l=1}^{d_1} {β_l + Σ_{j=1}^p γ_lj (X_is - x_s)^j} T_il]² k_{h_s}(X_is - x_s) L_{G_s}(X_{i,-s} - x_{-s})

in which k is a univariate kernel, L is a (d_2 - 1)-variate kernel of order q_2, as in assumption (A1) in the Appendix, the bandwidth matrix G_s = diag{g_1,...,g_{s-1}, g_{s+1},...,g_{d_2}}, and

k_{h_s}(u_s) = h_s^{-1} k(u_s/h_s),  L_{G_s}(u_{-s}) = (Π_{s'≠s} g_{s'})^{-1} L(u_1/g_1,...,u_{s-1}/g_{s-1}, u_{s+1}/g_{s+1},...,u_{d_2}/g_{d_2})

for u_{-s} = (u_1,...,u_{s-1}, u_{s+1},...,u_{d_2}). Let β̂, γ̂ be the solution of the above least-squares problem. Then the components of β̂ give the estimators of {α_l(x)}_{l=1}^{d_1}, which are given by

α̂_l(x) = e_l^T {Z_s^T W_s(x) Z_s}^{-1} Z_s^T W_s(x) Y,   (2.10)

where e_l is the (p+1)d_1-dimensional vector with all entries 0 except the lth entry being 1, W_s(x) ≡ diag{n^{-1} k_{h_s}(X_is - x_s) L_{G_s}(X_{i,-s} - x_{-s})}_{1≤i≤n} and

Z_s = [T_i^T, {(X_is - x_s)/h_s} T_i^T,..., {(X_is - x_s)/h_s}^p T_i^T]_{1≤i≤n} = [[p{(X_is - x_s)/h_s}]^T ⊗ T_i^T]_{1≤i≤n}   (2.11)

in which p(u) = (1, u,...,u^p)^T and ⊗ denotes the Kronecker product of matrices. Then for each s, we can construct the marginal integration estimators of α_ls for l = 1,...,d_1 simultaneously, which are given by

α̂_ls(x_s) = Σ_{i=1}^n w_{-s}(X_{i,-s}) α̂_l(x_s, X_{i,-s}) / Σ_{i=1}^n w_{-s}(X_{i,-s}) - ĉ_l,   (2.12)
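The Kronecker form of the design matrix in (2.11) is convenient to assemble numerically: row i is p{(X_is - x_s)/h_s}^T ⊗ T_i^T. A small sketch (all dimensions and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, p = 5, 2, 3                       # sample size, number of regressors, polynomial degree
T = rng.standard_normal((n, d1))
Xs = rng.uniform(-1, 1, n)               # the s-th tuning variable
xs, hs = 0.0, 0.5

u = (Xs - xs) / hs
P = u[:, None] ** np.arange(p + 1)       # row i is p(u_i)^T = (1, u_i, ..., u_i^p)

# row-wise Kronecker product: Z_s has rows p(u_i)^T (x) T_i^T, shape n x (p+1)*d1
Zs = np.einsum('ij,ik->ijk', P, T).reshape(n, (p + 1) * d1)

# equivalent to stacking the blocks [T_i^T, u_i T_i^T, ..., u_i^p T_i^T]
Zs_check = np.hstack([u[:, None] ** j * T for j in range(p + 1)])
assert np.allclose(Zs, Zs_check)
```

Both constructions agree; the einsum version makes the Kronecker structure of (2.11) explicit.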

where ĉ_l is the √n-consistent estimator of c_l in Theorem 1. The estimator α̂_ls(x_s) is referred to as the pth order local polynomial estimator, where p is the highest polynomial degree of the variables X_is - x_s, i = 1,...,n, in the definition of the design matrix Z_s in (2.11). In particular, the local linear (p = 1) and local cubic (p = 3) estimators are the most commonly used.

Theorem 2. Under assumptions (A1)-(A7) in the Appendix, for any x = (x_1,...,x_{d_2})^T ∈ supp(w), one has for l = 1,...,d_1, s = 1,...,d_2

√(n h_s) {α̂_ls(x_s) - α_ls(x_s) - h_s^{p+1} b_ls(x_s)} →_L N{0, σ_ls²(x_s)},   (2.13)

where b_ls(x_s) and σ_ls²(x_s) are defined in (A.16) and (A.18), respectively. Finally, based on (2.9) and (2.12), one can predict Y given any realization (x, t) of (X, T) by the predictor

m̂(x, t) = Σ_{l=1}^{d_1} {ĉ_l + Σ_{s=1}^{d_2} α̂_ls(x_s)} t_l.   (2.14)

To appreciate why α_ls can be estimated by α̂_ls at the rate 1/√(n h_s), which is the same as the rate of estimating a nonparametric function in the univariate case, we discuss two special features of α̂_l(x) in (2.10), similar to those discussed in Section 2.3. First, the bias of α̂_l(x) in estimating α_l(x) is of order h_s^{p+1} + g_max^{q_2}, where the first term can be understood as the approximation bias caused by locally approximating α_ls by a pth degree polynomial (see the derivation of P_s2 in Lemma A.9), and the second term as the approximation bias from locally approximating the functions {α_ls'}_{s'≠s} by constants, which is bounded by g_max^{q_2} since the kernel L is of order q_2 (see P_s3 in Lemma A.9). The order g_max^{q_2} of the second bias term is negligible compared to the rescaling factor of order 1/√(n h_s), according to (A6)(b). Hence only the first bias term appears in the asymptotic distribution formula (2.13). As for the variance of α̂_l(x) in estimating α_l(x), it is proportional to n^{-1} h_s^{-1} g_1^{-1} ··· g_{s-1}^{-1} g_{s+1}^{-1} ··· g_{d_2}^{-1}, but due to the marginal averaging over X_{i,-s}, the bandwidths g_1,...,g_{s-1}, g_{s+1},...,g_{d_2} related to X_{i,-s} are integrated out (see P_s1 in Lemma A.9). The variance of α̂_ls is thus reduced to the order n^{-1} h_s^{-1}. If the same bandwidth h_s were used for all variable directions in X, assumption (A6)(b) would imply n^{1-d_2/(2p+3)} → ∞, hence restricting d_2 to be less than 2p+3 for the asymptotic results of Theorem 2 to hold. That is why we prefer the flexibility of using a set of bandwidths g_1,...,g_{s-1}, g_{s+1},...,g_{d_2} different from h_s.
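To make the two-step construction concrete, the following sketch implements a simplified version of (2.10) and (2.12): d_2 = 2, local linear fitting (p = 1), w ≡ 1, one pair of bandwidths, and Epanechnikov kernels. The data-generating coefficients are our own illustrative choices, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1500
X = rng.uniform(-1, 1, (n, 2))
T = rng.standard_normal((n, 2))
a11 = lambda x: np.sin(np.pi * x)                     # hypothetical centered component
Y = (2 + a11(X[:, 0])) * T[:, 0] + 1.0 * T[:, 1] + 0.1 * rng.standard_normal(n)

def ker(u):                                           # Epanechnikov kernel
    return np.maximum(0.75 * (1 - u ** 2), 0.0)

def alpha_hat(xs, x2, hs=0.25, g=0.35, p=1):
    """Local polynomial fit (2.10) in direction s = 1 at the point (xs, x2)."""
    u = (X[:, 0] - xs) / hs
    w = ker(u) * ker((X[:, 1] - x2) / g)
    P = u[:, None] ** np.arange(p + 1)
    Z = np.einsum('ij,ik->ijk', P, T).reshape(n, -1)  # rows p(u_i)^T (x) T_i^T
    A = Z.T @ (w[:, None] * Z) + 1e-8 * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ (w * Y))[:2]      # (alpha_1, alpha_2) at (xs, x2)

# (2.12) with w = 1: average the local fits over a subsample of the X_{i,2}
xs = 0.4
c1_plus_a11 = np.mean([alpha_hat(xs, x2)[0] for x2 in X[:150, 1]])
# c1_plus_a11 should lie near 2 + sin(0.4*pi)
```

Subtracting an estimate of c_1 as in (2.9) would then recover α̂_11(x_s) itself.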


3. Implementation

Practical implementation of the estimators defined in (2.9) and (2.12) requires a rather intelligent choice of the bandwidths H = diag{h_{0,1},...,h_{0,d_2}}, G_s = diag{g_1,...,g_{s-1}, g_{s+1},...,g_{d_2}} and {h_s}_{1≤s≤d_2}. In the following, we discuss the choices of these bandwidths.

• Note from Theorem 1 that the asymptotic distributions of the estimators {ĉ_l}_{l=1}^{d_1} depend only on the quantity σ_l², not on the bandwidths in H. Hence we have only specified that H satisfy the order assumptions in (A6)(a), by taking ĥ_{0,1} = ··· = ĥ_{0,d_2} = {var̂(X)}^{1/2} log(n) n^{-1/(2q_1-1)}, where q_1, the order of the kernel K, is required to be greater than (d_2+1)/2, and var̂(X) = {Π_{s=1}^{d_2} var̂(X_s)}^{1/d_2}, in which var̂(X_s) denotes the sample variance of X_s, s = 1,...,d_2.

• The asymptotic distributions of the estimators {α̂_ls}_{1≤l≤d_1, 1≤s≤d_2} depend not only on the functions b_ls(x_s) and σ_ls²(x_s) but also crucially on the choice of the bandwidth h_s. Moreover, for each s = 1,...,d_2, the coefficient functions {α_ls(x_s), l = 1,...,d_1} are estimated simultaneously. So we define the optimal bandwidth h_s,opt as the minimizer of the total asymptotic mean integrated squared error of {α̂_ls(x_s), l = 1,...,d_1}, which is defined as

Σ_{l=1}^{d_1} AMISE{α̂_ls} = h_s^{2(p+1)} Σ_{l=1}^{d_1} ∫ b_ls²(x_s) dx_s + (n h_s)^{-1} Σ_{l=1}^{d_1} ∫ σ_ls²(x_s) dx_s.

Then h_s,opt is found to be

h_s,opt = [Σ_{l=1}^{d_1} ∫ σ_ls²(x_s) dx_s / {2n(p+1) Σ_{l=1}^{d_1} ∫ b_ls²(x_s) dx_s}]^{1/(2p+3)}

in which b_ls(x_s) and σ_ls²(x_s) are the asymptotic bias and variance of α̂_ls as in (A.16) and (A.18). According to the definitions of b_ls(x_s) and σ_ls²(x_s), ∫ b_ls²(x_s) dx_s and ∫ σ_ls²(x_s) dx_s can be approximated, respectively, by

∫ [{1/(p+1)!} Σ_{l'=1}^{d_1} α_{l's}^{(p+1)}(x_s) (1/n) Σ_{i=1}^n {w_{-s}(X_{i,-s}) T_{il'} ∫ u^{p+1} K*_ls(u, x_s, X_{i,-s}, T_i) du}]² dx_s,

(1/n) Σ_{i=1}^n {w_{-s}²(X_{i,-s}) φ_{-s}²(X_{i,-s}) σ²(X_i, T_i) / φ²(X_i)} ∫ K*_ls²(u, X_i, T_i) du,

where the functions K*_ls are defined in (A.17).

To implement this, one needs to evaluate terms such as α_{l's}^{(p+1)}(x_s), σ²(x, t), φ(x), φ_{-s}(x_{-s}) and K*_ls. We propose the following simple estimation methods for these quantities. The resulting bandwidth is denoted by ĥ_s,opt.

1. The derivative functions α_{l's}^{(p+1)}(x_s) are estimated by fitting a polynomial regression model of degree p+2,

E(Y|X, T) = Σ_{l=1}^{d_1} Σ_{s=1}^{d_2} Σ_{k=0}^{p+2} a_{ls,k} X_s^k T_l.

Then α_{l's}^{(p+1)}(x_s) is estimated as (p+1)! a_{l's,p+1} + (p+2)! a_{l's,p+2} x_s. As a by-product, the mean squared error of this model is used as an estimate of σ²(x).

2. The density functions φ(x) and φ_{-s}(x_{-s}) are estimated as

φ̂(x) = (1/n) Σ_{i=1}^n Π_{s=1}^{d_2} {1/h(X, d_2)} ϕ{(X_is - x_s)/h(X, d_2)},

φ̂_{-s}(x_{-s}) = (1/n) Σ_{i=1}^n Π_{s'≠s} {1/h(X_{-s}, d_2-1)} ϕ{(X_is' - x_s')/h(X_{-s}, d_2-1)}

with the standard normal density ϕ and the rule-of-thumb bandwidth

h(X, m) = {var̂(X)}^{1/2} {4/(m+2)}^{1/(m+4)} n^{-1/(m+4)}.

3. According to the definition in (A.17), the dependence of the functions K*_ls(u, x, t) on u and t is explicitly known. The only unknown term, E(TT^T | X = x) contained in S^{-1}(x), is estimated by fitting the matrix polynomial regression

E(TT^T | X = x) = c + Σ_{s=1}^{d_2} Σ_{k=1}^{p} c_{s,k} x_s^k

in which the coefficients c, c_{s,k} are d_1 × d_1 matrices.

This procedure simply uses polynomial regression to estimate some of the unknown quantities, which is easy to implement, but may lead the estimated optimal bandwidths to be biased relative to the true optimal bandwidths. The development of a more sophisticated bandwidth selection method requires further investigation.

• Since Theorem 2 implies that the asymptotic distributions of the estimators {α̂_ls}_{1≤l≤d_1, 1≤s≤d_2} do not depend on {G_s}_{s=1}^{d_2}, we only specify that G_s satisfy the order assumption in (A6)(b), g_1 = ··· = g_{s-1} = g_{s+1} = ··· = g_{d_2} = ĥ_s,opt^{(p+1)/q_2} / log(n), in which q_2, the order of the kernel function L, is required to be greater than (d_2-1)/2, and ĥ_s,opt is the optimal bandwidth obtained by the above procedure.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2517

and L can have order 2. We have used the quadratic kernel k(u) = 15/16(1 − u2 )2 1{|u|  1} , where 1{|u|  1} is the indicator function of [−1, 1] and the kernels K, L are product kernels. −1 TTT , and the matrix Lastly, the matrix ZT W(x)Z in (2.7) is computedas ZT W(x)Z+n  ˆ k(u)p(u)p(u)T du} TTT , ZTs Ws (x)Zs in (2.10) as ZTs Ws (x)Zs +(nhˆ s,opt )−1 var(X){ following the ridge regression idea of Seifert and Gasser (1996).

4. Examples 4.1. A simulated example The data are generated from the following model Y = {c1 + 11 (X1 ) + 12 (X2 )}T1 + {c2 + 21 (X1 ) + 22 (X2 )}T2 + 

(4.1)

with c1 = 2,

c2 = 1,

11 (x1 ) = 21 (x1 ) = sin(x1 ),

12 (x2 ) = x2 ,

22 (x2 ) = 0,

where X=(X1 , X2 )T is uniformly distributed on [−, ]×[−, ], and T=(T1 , T2 )T follows the bivariate standard normal distribution. The vectors X, T are generated independently. The error term  is a standard normal random variable and independent of (X, T). We use sample sizes n = 100, 250 and 500. The number of replications in the simulation is 100. First, to assess the performance of the data-driven bandwidth selector in Section 3, we plot in Fig. 2 the kernel estimates of the sampling distribution density of the ratio hˆ 1,opt / h1,opt , where h1,opt is the optimal bandwidth for estimating 11 and 21 . One can see that the sampling distribution of the ratio hˆ 1,opt / h1,opt converges to 1 rapidly as the

20

15

10

5

0 1.0

1.5

2.0

Fig. 2. Bandwidth selection results of the simulated example: kernel density estimates of hˆ 1,opt / h1,opt (h1,opt is the theoretical optimal bandwidth for estimating the functions of x1 in (4.1)). Solid curve is for n = 100, dotted curve is for n = 250, and dot–dashed curve is for n = 500.

2518

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Table 2 Results of the 100 simulations: the means and standard errors (in parentheses) of the estimators cˆ1 and cˆ2 of c1 and c2 , and AISEs of the estimators ˆ 11 , ˆ 12 , ˆ 21 , ˆ 22 of the coefficient functions 11 , 12 , 21 , 22

n = 100 n = 250 n = 500

c1 = 2

c2 = 1

11

12

21

22

1.9737 (0.3574) 2.0299 (0.2410) 1.9786 (0.1680)

1.0406 (0.2503) 1.0056 (0.1490) 1.0026 (0.1111)

0.1609 0.0568 0.0295

0.2541 0.0963 0.0483

0.1205 0.0338 0.0191

0.2761 0.0649 0.0310

sample size increases. Similar results are also obtained for h2,opt , the optimal bandwidth for estimating 12 and 22 . The plot is omitted. The simulation results indicate that the proposed bandwidth selection method is reliable in this instance. The fact that the distribution of the selected bandwidth seems skewed toward larger values is due to the use of a simple polynomial function as a plug-in substitute of the true regression function. Second, we estimate the functions ls on a grid of equally spaced grid of points xm , m = 1, . . . , ngrid with x1 =−0.975, xngrid =0.975, ngrid =62. We summarize the fitting results in Table 2, which includes the means and standard errors (in the parentheses) of the cˆl , l=1, 2 and the averaged integrated squared errors (AISE) of ˆ ls . By denoting the estimated function ˆ ls of ls in the ith replication by ˆ ils , we define ISE(ˆils ) =

1 ngrid

ngrid



{ˆils (xm ) − ls (xm )}2

and

AISE(ˆls ) =

m=1

100 1  ISE(ˆils ). 100 i=1

Fig. 3 gives the plots of ˆ 11 , ˆ 12 , ˆ 21 , ˆ 22 obtained in the 100 replications for sample size n = 100, 250, 500, respectively. Also plotted are the typical estimators, whose ISE is the median of the ISEs in the 100 replications. 4.2. The West German real GNP In this section, we discuss in detail the West German real GNP data first mentioned in the introduction. Yang and Tschernig (2002) found that it had an autoregressive structure on lags 4, 2, 8 according to FPE and AIC, lags 4, 2 according to BIC, where the FPE, AIC and BIC are lag selection criteria for linear time series models as in Brockwell and Davis (1991). On the other hand, lags 4, 2, 8 are selected by the semi-parametric seasonal shift criterion, and lags 4, 1, 7 are selected by the semi-parametric seasonal dummy criterion for the slightly different series {log(Gt+4 /Gt+3 )}120 t=1 . Both semi-parametric criteria are developed in Yang and Tschernig (2002). According to Brockwell and Davis (1991, p. 304), the lag selection criteria AIC and FPE of the linear time series models are asymptotically efficient but inconsistent, while BIC selects the correct set of variables consistently. Therefore, one may fit a linear autoregressive model with either Yt−2 , Yt−4 or Yt−2 , Yt−4 , Yt−8 as the regressors, with the understanding that the variable Yt−8 may be redundant for linear modeling Linear AR (24): Yt = a1 Yt−2 + a2 Yt−4 + t ,

(4.2)

Linear AR (248): Yt = b1 Yt−2 + b2 Yt−4 + b3 Yt−8 + t .

(4.3)

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 2

3 2 1

1.0

1.0

0.0

1

0.0

0

-1 -1.5

-1

-2

-1

0

1

2

3

-2

-1.5

-3 -3

-3

-2

-1

a1

0

1

2

3

-3

-2

-1

a2

0

1

2

3

-1

0

1

2

3

-3

-3

-2

-1

0

1

2

3

-1.5

-2

-1

0

1

2

3

1

2

3

-3

-2

-1

0

1

2

3

-3

-2

-1

0

1

2

3

d1

1

2

3

-3

-2

-1

0

1

2

3

1

2

3

2 1.0

1 0

-1 0

3

c4

0.0

-1

-3

c3

0.0

-2

2

-2

-1.5 -3

3 2 1

-3

1

0

c2

-1.5

0

-1

c1 1.0

-1

1

0.0

0

-2

2

-1 -1

-3

b4

1.0

0.0

-2

3

-2 -3

b3

3 2 1

-3

2

0

b2

-1.5

1

-1

b1 1.0

0

1

-1 -2

-1

2 0.0

-3

-2

a4

1.0

0.0

-1.5

-3

a3

3 2 1

1.0

2519

-1 -2

-1.5 -3

-2

-1

0

d2

1

2

3

-3

-2

-1

0

d3

1

2

3

-3

-2

-1

0

d4

Fig. 3. Plots of the estimated coefficient functions in the simulated example. (a1–a4) are plots of the 100 estimated curves for 11 (x1 ) = sin(x1 ), 12 (x2 ) = x2 , 21 (x1 ) = sin(x1 ), 22 (x2 ) = 0 with n = 100. (b1–b4) and (c1–c4) are the same as (a1–a4), but with sample size n = 250 and 500, respectively. (d1–d4) are plots of the typical estimators, the solid curve represents the true curve, the dotted curve is the typical estimated curve with n = 100, the dot–dashed curve is with n = 250 and the dashed curve is with n = 500.

From Table 1, it is clear that besides being more parsimonious, the linear model (4.2) has a smaller average squared prediction error than model (4.3). Thus model (4.2) is the preferred linear autoregressive model. Moreover, Figs. 4 and 5 show that the scatter plots of Y_t against the two significant linear predictors, Y_{t−2} and Y_{t−4}, along with the least-squares regression lines, actually vary significantly at different levels of Y_{t−1} and Y_{t−8}. So we have fitted the additive coefficient model (1.5) in the introduction. We use the first 110 observations for estimation and perform one-step prediction using the last 10 observations. When estimating the coefficient functions in model (1.5), we use local cubic fitting (i.e., taking p = 3 in Z_s of (2.11)). According to the bandwidth selection method in Section 3, we use bandwidths 0.0031 and 0.0020 for estimating the functions of Y_{t−1} and Y_{t−8}, respectively. The estimated coefficient functions are plotted in Fig. 6. We have also generated 500 wild bootstrap (Mammen, 1992) samples and obtained 95% pointwise bootstrap confidence intervals for the estimated coefficient functions. From Fig. 6, one may observe that the estimated functions have obviously nonconstant forms. In addition, their 95% confidence intervals cannot completely cover a horizontal line passing through zero in any of the four plots. This supports the hypothesis that the coefficient functions in (1.5) are significantly different from a constant. (Notice that by the restrictions proposed in (2.2), if a coefficient function is constant, it has to be zero.)
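The pointwise wild bootstrap intervals can be sketched as follows; `fit_curve` and the toy polynomial smoother are hypothetical stand-ins for the marginal integration estimator of Section 2, while the two-point multiplier distribution is Mammen's (1992):

```python
import numpy as np

def wild_bootstrap_bands(x, y, fit_curve, grid, n_boot=500, level=0.95, seed=1):
    """Pointwise wild-bootstrap bands for a curve estimator.

    fit_curve(x, y, grid) -> (fitted values at x, estimated curve on grid);
    here it is a hypothetical stand-in for the marginal integration estimator.
    """
    rng = np.random.default_rng(seed)
    fitted, curve = fit_curve(x, y, grid)
    resid = y - fitted
    # Mammen's two-point multipliers: mean 0 and variance 1.
    a, b = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    boot = np.empty((n_boot, grid.size))
    for m in range(n_boot):
        v = np.where(rng.random(y.size) < p, a, b)
        boot[m] = fit_curve(x, fitted + resid * v, grid)[1]   # refit on Y* = Yhat + e*v
    lo, hi = np.quantile(boot, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return curve, lo, hi

# Toy illustration: a global cubic fit stands in for the local cubic smoother.
def poly_fit(x, y, grid, deg=3):
    c = np.polyfit(x, y, deg)
    return np.polyval(c, x), np.polyval(c, grid)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-2, 2, 200))
y = np.sin(x) + 0.2 * rng.standard_normal(200)
grid = np.linspace(-2, 2, 50)
curve, lo, hi = wild_bootstrap_bands(x, y, poly_fit, grid)
```

A coefficient function is flagged as significantly nonconstant when no horizontal line through zero fits inside the band (lo, hi) over the whole grid.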


Fig. 4. Scatter plot of Y_t, Y_{t−2} at three levels of Y_{t−1}: H, M, L. Here the levels are defined as follows: H, the high level, is the top 33% of the data; L, the low level, is the bottom 33%; and M, the middle level, is the rest of the data: (a) scatter plot of Y_t, Y_{t−2}; (b) scatter plot of Y_t, Y_{t−2} at the high level of Y_{t−1}; (c) scatter plot at the middle level of Y_{t−1}; (d) scatter plot at the low level of Y_{t−1}.


Fig. 5. Scatter plot of Y_t, Y_{t−2} at three levels of Y_{t−8}: H, M, L. Here the levels are defined as follows: H, the high level, is the top 33% of the data; L, the low level, is the bottom 33%; and M, the middle level, is the rest of the data: (a) scatter plot of Y_t, Y_{t−2}; (b) scatter plot of Y_t, Y_{t−2} at the high level of Y_{t−8}; (c) scatter plot at the middle level of Y_{t−8}; (d) scatter plot at the low level of Y_{t−8}.

For the two linear autoregressive models, we estimate the constant coefficients by the maximum likelihood method. The estimated coefficients are â_1 = −0.2436, â_2 = 0.5622 and b̂_1 = −0.1191, b̂_2 = 0.6458, b̂_3 = 0.0704. Lastly, to assess the sensitivity of the marginal integration estimation method to the degree of the local polynomial, we have also fitted model (1.5) using local linear estimation (i.e., taking p = 1 in Z_s). Table 1 gives the ASEs (averaged squared estimation errors) and ASPEs (averaged squared prediction errors) of the above four fits. One can see that overall the marginal integration estimation of model (1.5) is not sensitive to the order of the local polynomial used. Local cubic fits outperform the
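The ASPE criterion used here is simply the mean squared one-step forecast error over the hold-out points; a minimal sketch for the linear models (the series `y` passed in is a placeholder for the transformed data):

```python
import numpy as np

def one_step_aspe(y, coefs, lags, n_test):
    """Averaged squared one-step prediction error of the linear predictor
    sum_j a_j * Y_{t-lag_j} over the last n_test observations of y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    sq_err = [(y[t] - sum(a * y[t - lag] for a, lag in zip(coefs, lags))) ** 2
              for t in range(n - n_test, n)]
    return float(np.mean(sq_err))

# e.g. for model (4.2) with the estimates reported above:
# one_step_aspe(y, [-0.2436, 0.5622], lags=[2, 4], n_test=10)
```

The ASE is computed in the same way, but averaging the squared residuals over the estimation sample instead of the hold-out sample.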


Fig. 6. German real GNP: estimated functions and their pointwise wild bootstrap 95% confidence intervals, based on model (1.5): (a) α̂_{11}, (b) α̂_{12}, (c) α̂_{21}, (d) α̂_{22}.

local linear fits for prediction. Both local linear and local cubic fitting provide significant improvements over the linear autoregressive models.

4.3. Wolf's annual sunspot number

In this example, we consider Wolf's annual sunspot number data for the period 1700–1987. Many authors have analyzed this data set. Tong (1990) used a TAR model with lag 8 as the tuning variable. Chen and Tsay (1993b) and Cai et al. (2000) both used a FAR model with lag 3 as the tuning variable. Xia and Li (1999) proposed a single-index model using a linear combination of lag 3 and lag 8 as the tuning variable. Motivated by those models, we propose the additive coefficient model (4.4), in which both lag 3 and lag 8 serve as the additive tuning variables:

Y_t = {c_1 + α_{11}(Y_{t−3}) + α_{12}(Y_{t−8})}Y_{t−1} + {c_2 + α_{21}(Y_{t−3}) + α_{22}(Y_{t−8})}Y_{t−2} + {c_3 + α_{31}(Y_{t−3}) + α_{32}(Y_{t−8})}Y_{t−3} + ε_t.   (4.4)

Following the convention in the literature, we use the transformed data Y_t = 2(√(1 + X_t) − 1), where X_t denotes the observed sunspot number in year t. We use the first 280 data points (years 1700–1979) to estimate the coefficient functions, and leave out years 1980–1987 for prediction. The bandwidths 6.87 and 6.52 are selected for estimating the functions of Y_{t−3} and Y_{t−8}, respectively. The estimated coefficient functions are plotted in Fig. 7, and the time plot of the fitted values in Fig. 8. The ASE is 4.18. Finally, we use our estimated model to predict the sunspot numbers in 1980–1987, and compare these predictions with those based on the TAR model of Tong (1990), the FAR model of Chen and Tsay (1993b), denoted FAR1, and the following two models: the FAR model of Cai et al. (2000), denoted FAR2,

Y_t = α_1(Y_{t−3})Y_{t−1} + α_2(Y_{t−3})Y_{t−2} + α_3(Y_{t−3})Y_{t−3} + α_6(Y_{t−3})Y_{t−6} + α_8(Y_{t−3})Y_{t−8} + ε_t   (4.5)
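A minimal sketch of the variance-stabilizing transformation and its inverse (the inverse maps forecasts of Y_t back to the observed sunspot scale):

```python
import numpy as np

def to_y(x):
    """Transform Y_t = 2(sqrt(1 + X_t) - 1) of the observed sunspot count X_t."""
    return 2.0 * (np.sqrt(1.0 + np.asarray(x, dtype=float)) - 1.0)

def to_x(y):
    """Inverse transform, mapping forecasts of Y_t back to the sunspot scale."""
    y = np.asarray(y, dtype=float)
    return (y / 2.0 + 1.0) ** 2 - 1.0
```

All models in Table 3 are fitted on the transformed scale; the absolute prediction errors reported there are on the original X_t scale after back-transforming.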


Fig. 7. Wolf's Sunspot Number: estimated functions based on model (4.4): (a) α̂_{11}, (b) α̂_{21}, (c) α̂_{31}, (d) α̂_{12}, (e) α̂_{22}, (f) α̂_{32}.


Fig. 8. Wolf’s Sunspot Number: time plot of the fitted values based on model (4.4) (solid line), with the observed values (circles).

and the single-index coefficient model of Xia and Li (1999), denoted SIND,

Y_t = λ_0{g_4(θ, Y_{t−3}, Y_{t−8})} + λ_1{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−1} + λ_2{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−2} + λ_3{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−3} + λ_4{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−8} + ε_t,   (4.6)

in which g_4(θ, Y_{t−3}, Y_{t−8}) = cos(θ)Y_{t−3} + sin(θ)Y_{t−8}. According to Condition (A.1)b of Cai et al. (2000, p. 952), the conditional density of Y_{t−3} given the variables (Y_{t−1}, Y_{t−2}, Y_{t−3}, Y_{t−6}, Y_{t−8}) should be bounded. It is clear, however, that Y_{t−3} is completely predictable from (Y_{t−1}, Y_{t−2}, Y_{t−3}, Y_{t−6}, Y_{t−8}), and hence the distribution of Y_{t−3} given these variables is a probability mass at one point, not a continuous distribution with any kind of density. Thus, the


Table 3
Out-of-sample absolute prediction errors for sunspot data, under different models

Year   X_t     TAR    FAR1    FAR2   SIND   Additive coefficient
1980   154.7    5.5    13.8     1.4    2.1   14.9
1981   140.5    1.3     0.0    11.4    1.7    2.4
1982   115.9   19.5    10.0    15.7    2.6   17.5
1983    66.6    4.8     3.3    10.3    2.4    1.37
1984    45.9   14.8     3.8     1.0    2.3    5.92
1985    17.9    0.2     4.6     2.6    7.6    1.96
1986    13.4    5.5     1.3     3.1    4.2    0.57
1987    29.2    0.7    21.7    12.3   13.2    0.7
AAPE            6.6     7.3     7.2    4.5    5.7
ASPE           85.6   101.1    81.6   34.3   71.9

Results on models TAR, FAR1 and FAR2 are from Cai et al. (2000, Table 4, p. 951), while results on model SIND are from Xia and Li (1999, Table 2, p. 1280).

use of model (4.5) has not been theoretically justified. Similarly, model (4.6) is not theoretically justified either, since according to Condition C5 of Xia and Li (1999, p. 1277), the conditional density of g_4(θ, Y_{t−3}, Y_{t−8}) given the variables (Y_{t−1}, Y_{t−2}, Y_{t−3}, Y_{t−8}, Y_t) should be bounded, whereas again, the distribution of g_4(θ, Y_{t−3}, Y_{t−8}) given (Y_{t−1}, Y_{t−2}, Y_{t−3}, Y_{t−8}, Y_t) is a point mass. In addition, we illustrate that model (4.6) is unidentifiable. For any set of functions {λ_0, ..., λ_4} that satisfies (4.6), one can always pick an arbitrary nonzero function f(u) and define

λ̃_0(u) = λ_0(u) + u f(u),  λ̃_1(u) = λ_1(u),  λ̃_2(u) = λ_2(u),

λ̃_3(u) = λ_3(u) − cos(θ) f(u),  λ̃_4(u) = λ_4(u) − sin(θ) f(u).
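As a quick numerical sanity check of this construction (all ingredients below — θ, the λ_j and f — are arbitrary hypothetical choices), both sets of functions produce exactly the same regression function:

```python
import numpy as np

theta = 0.7
lam = [np.sin, np.cos, np.tanh, np.arctan, np.cosh]   # arbitrary lambda_0, ..., lambda_4
f = lambda u: u ** 2 + 1.0                            # arbitrary nonzero f

rng = np.random.default_rng(1)
y1, y2, y3, y8 = rng.standard_normal(4)               # stand-ins for Y_{t-1}, Y_{t-2}, Y_{t-3}, Y_{t-8}
u = np.cos(theta) * y3 + np.sin(theta) * y8           # the index g4(theta, Y_{t-3}, Y_{t-8})

def rhs(l0, l1, l2, l3, l4):
    # The regression function of (4.6), without the noise term.
    return l0(u) + l1(u) * y1 + l2(u) * y2 + l3(u) * y3 + l4(u) * y8

# The modified set defined in the display above.
lam_t = [lambda u: lam[0](u) + u * f(u), lam[1], lam[2],
         lambda u: lam[3](u) - np.cos(theta) * f(u),
         lambda u: lam[4](u) - np.sin(theta) * f(u)]

print(np.isclose(rhs(*lam), rhs(*lam_t)))   # True
```

The difference between the two evaluations is f(u){u − cos(θ)Y_{t−3} − sin(θ)Y_{t−8}}, which vanishes identically because u is the index itself.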

It is straightforward to verify that the new set of functions {λ̃_0, ..., λ̃_4} satisfies (4.6) as well. One possible fix is to drop either one of the terms λ_3{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−3} and λ_4{g_4(θ, Y_{t−3}, Y_{t−8})}Y_{t−8} from (4.6); the model is then fully identifiable and satisfies Condition C5 of Xia and Li (1999, p. 1277). Hence the current form of (4.6) may be considered an overfitting anomaly. Despite the fact that models (4.5) and (4.6) suffer from these theoretical deficiencies, we have listed the average absolute prediction errors (AAPE) and ASPEs of models TAR, FAR1, FAR2, SIND and our proposed model in Table 3. Comparing the AAPEs and ASPEs, our model outperforms TAR, FAR1 and FAR2, while the unidentifiable SIND model has the smallest AAPE and ASPE. We believe that this superior forecasting power of (4.6) is due to the prediction advantage of overfitting models. For example, in forecasting of linear time series, the overfitting AIC/FPE selects more powerful models than the consistent BIC; see, for instance, the discussion in Brockwell and Davis (1991, p. 304).

Acknowledgements

Xue's research has been supported in part by NSF Grants BCS 0308420 and DMS 0405330. Yang's research has been supported in part by NSF Grants SES 0127722 and


DMS 0405330. The constructive comments by the coordinating editor and a referee are appreciated.

Appendix A.

A.1. Assumptions and an auxiliary lemma

We have listed below some assumptions necessary for proving Theorems 1 and 2. Throughout this appendix, we denote by the same letters c, C, etc., any positive constants, without distinction in each case.

(A1) The kernel functions k, K and L are symmetric, Lipschitz continuous and compactly supported. The function k is a univariate probability density function, while K is d_2-variate and of order q_1, i.e. ∫K(u) du = 1 while ∫K(u) u_1^{r_1} ··· u_{d_2}^{r_{d_2}} du = 0 for 1 ≤ r_1 + ··· + r_{d_2} ≤ q_1 − 1. The kernel L is (d_2 − 1)-variate and of order q_2. Denote p* = max(p + 1, q_1, q_2). We then assume further that:

(A2) The functions α_{ls}(x_s) have bounded continuous p*-th derivatives for l = 1, ..., d_1, s = 1, ..., d_2.

(A3) The vector process {ξ_i}_{i=1}^∞ = {(Y_i, X_i, T_i)}_{i=1}^∞ is strictly stationary and β-mixing, with β-mixing coefficient β(k) ≤ cλ^k for some constants c > 0, 0 < λ < 1. The β-mixing coefficient is defined as

β(k) = sup_{n ≥ 1} E sup_{A ∈ F_{n+k}^∞} |P(A | F_0^n) − P(A)|,

where F_{n+k}^∞ and F_0^n denote the σ-algebras generated by {ξ_i, i ≥ n + k} and {ξ_0, ..., ξ_n}, respectively. According to (1.1) of Bosq (1998), the strong mixing coefficient satisfies α(k) ≤ β(k)/2, hence

α(k) ≤ cλ^k/2.   (A.1)

(A4) The error term satisfies:
(a) The innovations {ε_i}_{i=1}^∞ are i.i.d. with Eε_i = 0, Eε_i^2 = 1 and E|ε_i|^{2+δ} < +∞ for some δ > 0. Also, the term ε_i is independent of {(X_j, T_j), j ≤ i} for all i ≥ 1.
(b) The conditional standard deviation function σ(x, t) is bounded and Lipschitz continuous.

(A5) The vector (X, T) has a joint probability density φ(x, t). The marginal densities of X, X_s and X_{−s} are denoted by φ, φ_s and φ_{−s}, respectively.
(a) Letting q* = max(q_1, q_2) − 1, we assume that φ(x, t) has bounded continuous q*-th partial derivatives with respect to x, and that the marginal density φ is bounded away from zero on the support of the weight function w.
(b) Let S(x) = E(TT^T | X = x). We assume there exists a c > 0 such that S(x) ≥ c I_{d_2} uniformly for x ∈ supp(w). Here I_{d_2} is the d_2 × d_2 identity matrix.


(c) The random matrix TT^T satisfies Cramér's moment condition, i.e., there exists a positive constant c such that E|T_l T_{l'}|^k ≤ c^{k−2} k! E|T_l T_{l'}|^2 and E|T_l T_{l'}|^2 ≤ c hold uniformly for k = 3, 4, ... and 1 ≤ l, l' ≤ d_1.

(A6) The bandwidths satisfy:
(a) For the bandwidth matrix H = diag{h_{01}, ..., h_{0d_2}} of Theorem 1, √n h_max^{q_1} → 0 and n h_prod^2 ∝ n^γ for some γ > 0, where h_max = max{h_{01}, ..., h_{0d_2}}, h_prod = ∏_{i=1}^{d_2} h_{0i}, and ∝ means "proportional to".
(b) For the bandwidths h_s and G_s = diag{g_1, ..., g_{s−1}, g_{s+1}, ..., g_{d_2}} of Theorem 2, h_s = O{n^{−1/(2p+3)}}, n^{1/2} h_s g_prod ∝ n^γ for some γ > 0 and (n h_s/ln n)^{1/2} g_max^{q_2} → 0, where g_max = max{g_1, ..., g_{s−1}, g_{s+1}, ..., g_{d_2}} and g_prod = ∏_{s'≠s} g_{s'}.

(A7) The weight function w is nonnegative, has compact support with nonempty interior, and is Lipschitz continuous on its support.

The proofs of many results in this paper make use of inequalities for U-statistics and von Mises statistics of dependent variables derived from Yoshihara (1976). In general, let ξ_i, 1 ≤ i ≤ n, denote a strictly stationary sequence of random variables with values in R^d and β-mixing coefficients β(k), k = 1, 2, ..., and let r be a fixed positive integer. Let {θ_n(F)} denote the functionals of the distribution function F of ξ_i,

θ_n(F) = ∫ g_n(x_1, ..., x_m) dF(x_1) ··· dF(x_m),

where the {g_n} are measurable functions symmetric in their m arguments such that

∫ |g_n(x_1, ..., x_m)|^{2+δ} dF(x_1) ··· dF(x_m) ≤ M_n < +∞,

sup_{(i_1,...,i_m) ∈ S_c} ∫ |g_n(x_1, ..., x_m)|^{2+δ} dF_{i_1,...,i_m}(x_1, ..., x_m) ≤ M_{n,c} < +∞,  c = 0, ..., m − 1,   (A.2)

for some δ > 0, where S_c = {(i_1, ..., i_m) | #_r(i_1, ..., i_m) = c}, c = 0, ..., m − 1, and for every (i_1, ..., i_m) with 1 ≤ i_1 ≤ ··· ≤ i_m ≤ n, #_r(i_1, ..., i_m) is the number of j = 1, ..., m − 1 satisfying i_{j+1} − i_j < r. Clearly, the cardinality of each set S_c is less than n^{m−c}. The von Mises differentiable statistic and the U-statistic,

θ_n(F_n) = ∫ g_n(x_1, ..., x_m) dF_n(x_1) ··· dF_n(x_m) = n^{−m} Σ_{i_1=1}^n ··· Σ_{i_m=1}^n g_n(ξ_{i_1}, ..., ξ_{i_m}),

U_n = (n choose m)^{−1} Σ_{1≤i_1<···<i_m≤n} g_n(ξ_{i_1}, ..., ξ_{i_m}),

allow the decompositions

θ_n(F_n) = θ_n(F) + Σ_{c=1}^m (m choose c) V_n^{(c)},  V_n^{(c)} = ∫ g_{n,c}(x_1, ..., x_c) ∏_{j=1}^c [dF_n(x_j) − dF(x_j)],

U_n = θ_n(F) + Σ_{c=1}^m (m choose c) U_n^{(c)},  U_n^{(c)} = {(n − c)!/n!} Σ_{1≤i_1<···<i_c≤n} ∫ g_{n,c}(x_1, ..., x_c) ∏_{j=1}^c [dI_{R_+^d}(x_j − ξ_{i_j}) − dF(x_j)],

where the g_{n,c} are the projections of g_n,

g_{n,c}(x_1, ..., x_c) = ∫ g_n(x_1, ..., x_m) dF(x_{c+1}) ··· dF(x_m),  c = 0, 1, ..., m,

so that g_{n,0} = θ_n(F), g_{n,m} = g_n, and I_{R_+^d} is the indicator function of the nonnegative part of R^d, R_+^d = {(y_1, ..., y_d) ∈ R^d | y_j ≥ 0, j = 1, ..., d}.


Lemma A.1. If β(k) ≤ C_1 k^{−(2+δ')/δ'} for some δ > δ' > 0, then

E(V_n^{(c)})^2 + E(U_n^{(c)})^2 ≤ C(m, δ, r) n^{−c} {M_n^{2/(2+δ)} Σ_{k=r+1}^n k β(k)^{δ'/(2+δ')} + Σ_{c'=0}^{m−1} n^{−c'} M_{n,c'}^{2/(2+δ)} Σ_{k=1}^r k β(k)^{δ'/(2+δ')}}   (A.3)

for some constant C(m, δ, r) > 0. In particular, if one has β(k) ≤ C_2 λ^k, 0 < λ < 1, then

E(V_n^{(c)})^2 + E(U_n^{(c)})^2 ≤ C(m, δ, r) C_2 C(λ) n^{−c} {M_n^{2/(2+δ)} + Σ_{c'=0}^{m−1} n^{−c'} M_{n,c'}^{2/(2+δ)}}.   (A.4)

Proof. The proof of Lemma 2 in Yoshihara (1976), which dealt with the special case g_n ≡ g, r = 1, M_{n,c} ≡ M_n and yielded (A.3), provides an obvious avenue of extension to the more general setup. Elementary arguments then establish (A.4) under the geometric mixing condition. □
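For i.i.d. data and m = 2, the decomposition above reads θ_n(F_n) = θ_n(F) + 2V_n^{(1)} + V_n^{(2)}; a small numerical illustration with a simple hypothetical kernel g shows the identity holding exactly and the second-order term being of smaller order:

```python
import numpy as np

rng = np.random.default_rng(3)
xs = rng.standard_normal(200)            # i.i.d. sample from F = N(0, 1)

g = lambda x, y: x * y + x + y           # a symmetric kernel with m = 2
theta_F = 0.0                            # theta(F) = E g(X, Y) = 0 for independent X, Y
g1 = lambda x: x                         # projection g_{n,1}(x) = E_Y g(x, Y) = x

theta_Fn = np.mean(g(xs[:, None], xs[None, :]))     # von Mises statistic theta_n(F_n)
V1 = np.mean(g1(xs)) - theta_F                      # first-order term, O_p(n^{-1/2})
V2 = theta_Fn - 2 * np.mean(g1(xs)) + theta_F       # second-order remainder, O_p(n^{-1})

assert np.isclose(theta_Fn, theta_F + 2 * V1 + V2)  # the m = 2 decomposition holds
print(V1, V2)
```

Here V2 equals the squared sample mean, illustrating why the remainder terms are negligible relative to the leading projection term in the proofs below.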


A.2. Proofs of Theorems 1 and 2

For any x ∈ supp(w), we can write

Z^T W(x) Z = (1/n) Σ_{i=1}^n K_H(X_i − x) T_i T_i^T,

Z_s^T W_s(x) Z_s = (1/n) Σ_{i=1}^n k_{h_s}(X_{is} − x_s) L_{G_s}(X_{i,−s} − x_{−s}) {p((X_{is} − x_s)/h_s) p((X_{is} − x_s)/h_s)^T} ⊗ (T_i T_i^T),

in which ⊗ denotes the Kronecker product of matrices. Define also the matrix

S̃(x) = {∫ k(u) p(u) p(u)^T du} ⊗ S(x),   (A.5)

where S(x) = E(TT^T | X = x) is as defined in (A5)(b). For any matrix A, |A| denotes the maximum absolute value of all elements in A.

Lemma A.2. Let b_1 = ln n (h_max^{q_1} + 1/√(n h_prod)), b_2 = ln n (h_s^{p+1} + g_max^{q_2} + 1/√(n h_s g_prod)), and define the compact set B = supp(w) ⊂ R^{d_2}. Under assumptions (A1)–(A6), as n → ∞, with probability one

sup_{x∈B} |Z^T W(x) Z − φ(x) S(x)| = o(b_1)  and  sup_{x∈B} |Z_s^T W_s(x) Z_s − φ(x) S̃(x)| = o(b_2).

Proof. The proof of this lemma makes use of the covering technique and Bernstein's inequality for mixing data (Bosq, 1998, Theorem 1.4). The complete proof of this lemma can be found in Xue and Yang (2004). □

A.2.1. Proof of Theorem 1

Now let v_1 be the integer satisfying b_1^{v_1+1} = o(h_max^{q_1+2}). It follows immediately from Lemma A.2 that

{Z^T W(x) Z}^{−1} = (S^{−1}(x)/φ(x)) Σ_{v=0}^{v_1} {I_{d_1} − Z^T W(x) Z S^{−1}(x)/φ(x)}^v + Q_c(x) = S^{−1}(x)/φ(x) + Σ_{v=1}^{v_1} Q_{cv}(x) + Q_c(x),

where Q_{cv}(x) = (S^{−1}(x)/φ(x)){I_{d_1} − Z^T W(x) Z S^{−1}(x)/φ(x)}^v, and the matrix Q_c(x) satisfies sup_{x∈B} |Q_c(x)| = o(h_max^{q_1+2}) w.p.1.


Also, observing that e_l^T {Z^T W(X_i) Z}^{−1} Z^T W(X_i) Z e_{l'} = δ_{ll'}, where δ_{ll'} equals 1 if l = l' and 0 otherwise, one has the following decomposition:

(1/n) Σ_{i=1}^n w(X_i){ĉ_l − c_l} = Σ_{i=1}^2 {D_{ci} + P_{ci}} + Σ_{i=1}^2 Σ_{ℓ=1}^{v_1} F_{iℓ} + R_c,   (A.6)

with R_c = (1/n) Σ_{i=1}^n w(X_i) Σ_{s=1}^{d_2} α_{ls}(X_{is}), and the terms D_{ci}, P_{ci} and F_{iℓ} given as follows. Denoting by Φ_c the operator

Φ_c(A, B) = (1/n) Σ_{i=1}^n w(X_i) A(X_i) Z^T W(X_i) B,

we set, for i = 1, 2 and ℓ = 1, ..., v_1,

D_{ci} = Φ_c(Q_c(x), B_i),  P_{ci} = Φ_c(S^{−1}(x)/φ(x), B_i),  F_{iℓ} = Φ_c(Q_{cℓ}(x), B_i),   (A.7)

with

B_1 = [σ(X_j, T_j) ε_j]_{j=1,...,n},  B_2 = [Σ_{l'=1}^{d_1} Σ_{s=1}^{d_2} {α_{l's}(X_{js}) − α_{l's}(X_{is})} T_{jl'}]_{j=1,...,n}.   (A.8)

To prove Theorem 1, we need the following lemmas.

Lemma A.3. As n → +∞, |D_{c1}| + |D_{c2}| = o(h_max^{q_1+2}) w.p.1.

Lemma A.4. For fixed ℓ = 1, ..., v_1, as n → +∞, |F_{1ℓ}| + |F_{2ℓ}| = o(b_1^ℓ/√n) w.p.1.

Proof. For simplicity, consider only the case of F_{11}, i.e.,

F_{11} = (1/n) Σ_{i=1}^n w(X_i) (S^{−1}(X_i)/φ(X_i)) {I_{d_1} − Z^T W(X_i) Z S^{−1}(X_i)/φ(X_i)} Z^T W(X_i) E
     = (1/n) Σ_{i=1}^n w(X_i) S^{−1}(X_i) {S(X_i)/φ(X_i) − E{Z^T W(X_i) Z}/φ^2(X_i)} S^{−1}(X_i) Z^T W(X_i) E
       − (1/n) Σ_{i=1}^n w(X_i) S^{−1}(X_i) {Z^T W(X_i) Z/φ^2(X_i) − E{Z^T W(X_i) Z}/φ^2(X_i)} S^{−1}(X_i) Z^T W(X_i) E
     = P_1 − P_2,

where E = B_1 denotes the vector of errors.

(Xi )ZT W(Xi )E = P1 − P2 .

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2529

Let i = (Xi , Ti , i ), and define & S(Xi ) E{ZT W(Xi )Z} − (Xi ) 2 (Xi ) × S −1 (Xi )KH (Xj − Xi )Tj (Xj , Tj )j & $ S(Xj ) E{ZT W(Xj )Z} − + w(Xj )S −1 (Xj ) (Xj ) 2 (Xj )

gn (i , j ) = w(Xi )S −1 (Xi )

$

× S −1 (Xj )KH (Xi − Xj )Ti (Xi , Ti )i . Then P1 can be written as the von Mises’ differentiable statistic P1 = 2n1 2 (1)

n

(2)

i,j =1 gn (i , j ),

which can be decomposed as P1 = 21 {n (F ) + 2Vn + Vn }, in which, n (F ) = 0. (1)

(2)

In order to write down the explicit expressions of Vn , Vn , let Ei (En,i ) denote taking expectation with respect to the random vector indexed by i using the theoretical (empirical) (1) measure, both under the presumption of independence between i and j . One has Vn =  Ei En,j gn (i , j ) = n1 nj=1 gn,1 (j ), in which $

& S(z) E{ZT W(z)Z} − (z) 2 (z) × S −1 (z)KH (Xj − z)(z) dzTj (Xj , Tj )j .

 gn,1 (j ) =

w(z)S −1 (z)

(1)

Clearly gn,1 has mean 0 and variance of order b12 . So Vn = n1 For

(2) Vn ,

n



j =1 gn,1 (j ) = op (b1 /

n).

by Lemma A.1, under assumption (A3), one has for some small > 0  2  2 2 2+ 2+ −1 E(Vn(2) )2 cn−2 Mn2+ + Mn,0 , + Mn,1 n

where Mn , Mn,0 and Mn,1 satisfy the inequalities as in (A.2). Observe that 2+ Ei,j |gn (i , j )|2+ cb2+ 1 E|w(Xi )KH (Xj − Xi )Tj (Xj , Tj )j | −(1+2 )(2+ )/(2+2 )

hprod

−(1+2 )(2+ )/(2+2 )

So we can take Mn,0 = hprod

cb2+ 1 , and by setting the mixing coefficient

−(1+2 )(2+ )/(2+2 )

to 0, one also gets Mn = hprod −(2+ )

cb2+ 1 hprod

cb2+ 1 c( ).

cb2+ 1 . Similarly, we can show that Mn,1 =

. So by taking small, one has E(P12 ) is less or equal to

(1+2 )(2+ ) 2/(2+ ) " # − −(2+ ) 2/(2+ ) cn−2 hprod (2+2 ) b12+ + cn−3 b12+ hprod + cb21 /n cn−1 b12 . √ Similarly, we can show that EP 22 cn−1 b12 . So we have F11 = op (b1 / n).  1 Lemma A.5. Pc1 = Op (hmax ) = op (n−1/2 ) as n → ∞.

q


Proof. Let K_l^*(X, T) = e_l^T S^{−1}(X) T; then P_{c1} is a von Mises statistic, with θ_n of the form

∫ (w(z)/φ(z)) K_l^*(z, t) K_H(x − z) Σ_{l'=1}^{d_1} Σ_{s=1}^{d_2} {α_{l's}(x_s) − α_{l's}(z_s)} t_{l'} φ(z) φ(x, t) dz dx dt.

After a change of variable u = H^{−1}(x − z), the above becomes

∫ w(z) K_l^*(z, t) K(u) Σ_{l'=1}^{d_1} Σ_{s=1}^{d_2} {α_{l's}(z_s + h_{0s} u_s) − α_{l's}(z_s)} t_{l'} φ(z + Hu, t) du dz dt,

which is O(h_max^{q_1}) by a Taylor expansion of α_{l's}(z_s + h_{0s} u_s) to q_1-th degree and of φ(z + Hu, t) to (q_1 − 1)-th degree, which exist according to assumptions (A2) and (A5)(a). By assumption (A1), all the terms of order smaller than h_max^{q_1} vanish, so the leading term is of order h_max^{q_1}. It is routine to verify that V_n^{(1)} and V_n^{(2)} are O_p(h_max^{q_1}) as well. Hence P_{c1} = O_p(h_max^{q_1}), and assumption (A6)(a) entails that O_p(h_max^{q_1}) = o_p(n^{−1/2}). □

We can finish the proof of Theorem 1 as follows. By a von Mises statistic argument,

P_{c2} = (1/n) Σ_{j=1}^n ∫ (w(x)/φ(x)) K_l^*(x, T_j) K_H(X_j − x) φ(x) dx σ(X_j, T_j) ε_j + o_p{(log n/(n^2 h_prod))^{1/2}},   (A.9)

which, after the change of variable X_j = x + Hu, becomes

(1/n) Σ_{j=1}^n w(X_j) K_l^*(X_j, T_j) σ(X_j, T_j) ε_j + o_p(n^{−1/2}).   (A.10)

Now let us return to the decomposition of (1/n) Σ_{i=1}^n w(X_i)(ĉ_l − c_l) as in (A.6). By Lemmas A.2–A.5, one has

(1/n) Σ_{i=1}^n w(X_i)(ĉ_l − c_l) = (1/n) Σ_{j=1}^n w(X_j){K_l^*(X_j, T_j) σ(X_j, T_j) ε_j + Σ_{s=1}^{d_2} α_{ls}(X_{js})} + o_p(n^{−1/2}).

If we define η_j = w(X_j){K_l^*(X_j, T_j) σ(X_j, T_j) ε_j + Σ_{s=1}^{d_2} α_{ls}(X_{js})} = η_{j1} + η_{j2}, then by assumption (A4)(a) and the identification condition (2.2) we have E(η_j) = 0. Together with assumption (A3), {η_j} is a stationary β-mixing process with geometric β-mixing coefficient. By Minkowski's inequality and assumptions (A1), (A4), (A5) and (A7), we have, for some δ > 0, E|η_j|^{2+δ} < +∞. Next, define

σ_l^2 = 2 Σ_{j=1}^{+∞} cov(η_0, η_{j2}) + var(η_0),   (A.11)


which is finite by Theorem 1.5 of Bosq (1998). Applying the central limit theorem for strongly mixing processes (Bosq, 1998, Theorem 1.7), we have (1/√n) Σ_{j=1}^n η_j ⇒ N(0, σ_l^2). Theorem 1 now follows immediately from assumption (A6)(a) on the bandwidths and the fact that (1/n) Σ_{i=1}^n w(X_i) → 1 a.s. □

A.2.2. Proof of Theorem 2

Proceeding as in the proof of Theorem 1, let v_s be an integer satisfying b_2^{v_s} = o(h_s^{p+2}). Then by Lemma A.2, one has

{Z_s^T W_s(x) Z_s}^{−1} = S̃^{−1}(x)/φ(x) + Σ_{v=1}^{v_s} Q_{sv}(x) + Q_s(x),   (A.12)

where Q_{sv}(x) = (S̃^{−1}(x)/φ(x)){I_{(p+1)d_1} − Z_s^T W_s(x) Z_s S̃^{−1}(x)/φ(x)}^v, and the matrix Q_s(x) satisfies sup_{x∈B} |Q_s(x)| = o(h_s^{p+2}) w.p.1. Also, as in the proof of Theorem 1, by the equation e_l^T {Z_s^T W_s(X_{is}) Z_s}^{−1} Z_s^T W_s(X_{is}) Z_s e_{l'} = δ_{ll'}, l = 1, ..., d_1, for fixed l = 1, ..., d_1 and s = 1, ..., d_2 we have the following decomposition:

(1/n) Σ_{i=1}^n w_{−s}(X_{i,−s}){α̂_{ls}(x_s) − α_{ls}(x_s)} = Σ_{i=1}^3 {D_{si}(x_s) + P_{si}(x_s) + Σ_{v=1}^{v_s} R_{vi}(x_s)} + R_1 + R_2,   (A.13)

where

R_1 = (1/n) Σ_{i=1}^n w_{−s}(X_{i,−s}){Σ_{s'≠s} α_{ls'}(X_{is'})},  R_2 = (1/n) Σ_{i=1}^n w_{−s}(X_{i,−s})(ĉ_l − c_l),   (A.14)

and the other terms in (A.13) are given as follows. Defining the operator

Φ_s(A, C) = (1/n) Σ_{i=1}^n w_{−s}(X_{i,−s}){e_l^T A Z_s^T W_s(x_s, X_{i,−s}) C},

we set, for i = 1, 2, 3 and v = 1, ..., v_s,

D_{si}(x_s) = Φ_s(Q_s(x_s, X_{i,−s}), C_i),  P_{si}(x_s) = Φ_s(S̃^{−1}/φ(x_s, X_{i,−s}), C_i),  R_{vi}(x_s) = Φ_s(Q_{sv}(x_s, X_{i,−s}), C_i),

where C_1 is the vector of errors, the same as B_1 in (A.8), and

C_2 = C_2(x_s) = [Σ_{l'=1}^{d_1} {α_{l's}(X_{js}) − Σ_{v=0}^p (α_{l's}^{(v)}(x_s)/v!)(X_{js} − x_s)^v} T_{jl'}]_{j=1,...,n},

C_3 = C_3(X_{i,−s}) = [Σ_{l'=1}^{d_1} Σ_{s'≠s} {α_{l's'}(X_{js'}) − α_{l's'}(X_{is'})} T_{jl'}]_{j=1,...,n}.   (A.15)


The proof of Theorem 2 is completed by applying assumption (A6)(b) and the asymptotic results for each term in (A.13), which are presented in the following lemmas.

Lemma A.6. As n → +∞, √(n h_s) R_1 = O_p(√h_s) and √(n h_s) R_2 = O_p(√h_s).

Lemma A.7. As n → +∞, sup_{x_s ∈ supp(w_s)} |D_{s1}(x_s) + D_{s2}(x_s) + D_{s3}(x_s)| = o(h_s^{p+2}) w.p.1.

Lemma A.8. For any fixed v = 1, ..., v_s, as n → +∞, sup_{x_s ∈ supp(w_s)} |R_{v1}(x_s)| + |R_{v2}(x_s)| + |R_{v3}(x_s)| = o(b_2^v/√(n h_s)) w.p.1.

Lemma A.9. As n → +∞,

P_{s1} = (1/n) Σ_{j=1}^n (w_{−s}(X_{j,−s})/φ(x_s, X_{j,−s})) (1/h_s) K_{ls}^*((X_{js} − x_s)/h_s, x_s, X_{j,−s}, T_j) φ_{−s}(X_{j,−s}) σ(X_j, T_j) ε_j + o_p{(n h_s log n)^{−1/2}},

P_{s2}(x_s) = h_s^{p+1} β_{ls}(x_s) + o_p(h_s^{p+1}), and P_{s3}(x_s) = O_p(g_max^{q_2}) = o_p{(n h_s log n)^{−1/2}}, in which

β_{ls}(x_s) = (1/(p+1)!) Σ_{l'=1}^{d_1} α_{l's}^{(p+1)}(x_s) ∫ u^{p+1} E{w_{−s}(X_{−s}) T_{l'} K_{ls}^*(u, x_s, X_{−s}, T)} du   (A.16)

with

K_{ls}^*(u, x, T) = e_l^T S̃^{−1}(x) q^*(u, T) k(u),  q^*(u, T) = (T, uT, ..., u^p T)^T.   (A.17)

Furthermore, √(n h_s) P_{s1} →_L N{0, σ_{ls}^2(x_s)}, in which

σ_{ls}^2(x_s) = ∫ (w_{−s}^2(z_{−s})/φ^2(x_s, z_{−s})) K_{ls}^{*2}(u, x_s, z_{−s}, t) φ_{−s}^2(z_{−s}) σ^2(x_s, z_{−s}, t) φ(x_s, z_{−s}, t) du dz_{−s} dt.   (A.18)

Proof of Lemma A.6. The lemma follows immediately by applying the central limit theorem for strongly mixing processes (Bosq, 1998, Theorem 1.7) to R_1 and Theorem 1 to R_2, respectively. □

Proof of Lemmas A.7 and A.8. These are omitted, as they are similar to the proofs of Lemmas A.3 and A.4. □


Proof of Lemma A.9. From the definition in (A.15) and using the von Mises statistic and change-of-variable arguments, P_{s1} equals

(1/(n h_s)) Σ_{j=1}^n (w_{−s}(X_{j,−s})/φ(x_s, X_{j,−s})) K_{ls}^*((X_{js} − x_s)/h_s, x_s, X_{j,−s}, T_j) φ_{−s}(X_{j,−s}) σ(X_j, T_j) ε_j + o_p{(n h_s log n)^{−1/2}}.

By assumption (A4)(a), the first term is the average of a sequence of martingale differences. Then, by the martingale central limit theorem of Liptser and Shirjaev (1980), the term √(n h_s) P_{s1} is asymptotically normal with mean 0 and variance

∫ (w_{−s}^2(z_{−s})/φ^2(x_s, z_{−s})) K_{ls}^{*2}(u, x_s, z_{−s}, t) φ_{−s}^2(z_{−s}) σ^2(x_s, z_{−s}, t) φ(x_s, z_{−s}, t) du dz_{−s} dt + o(h_s) = σ_{ls}^2(x_s) + o(h_s),

in which the leading term σ_{ls}^2(x_s) is as defined in (A.18). For the term P_{s2}(x_s), we have

P_{s2}(x_s) = ∫ (w_{−s}(x_{−s})/φ(x_s, x_{−s})) (1/h_s) K_{ls}^*((z_s − x_s)/h_s, x_s, x_{−s}, t) L_{G_s}(z_{−s} − x_{−s}) Σ_{l'=1}^{d_1} {α_{l's}(z_s) − Σ_{v=0}^p (α_{l's}^{(v)}(x_s)/v!)(z_s − x_s)^v} t_{l'} φ(z, t) φ_{−s}(x_{−s}) dz dx_{−s} dt {1 + o_p(1)}.

After the change of variables z_s = x_s + h_s u and z_{−s} = x_{−s} + G_s v, it equals h_s^{p+1} β_{ls}(x_s) + o_p(h_s^{p+1}), with β_{ls}(x_s) as defined in (A.16). Lastly, the term P_{s3} is seen to be

∫ (w_{−s}(x_{−s})/φ(x_s, x_{−s})) (1/h_s) K_{ls}^*((z_s − x_s)/h_s, x_s, x_{−s}, t) L_{G_s}(z_{−s} − x_{−s}) [Σ_{l'=1}^{d_1} Σ_{s'≠s} {α_{l's'}(z_{s'}) − α_{l's'}(x_{s'})} t_{l'}] φ(z, t) φ_{−s}(x_{−s}) dz dx_{−s} dt {1 + o_p(1)},

which, after the same change of variables, is seen to be O_p(g_max^{q_2}) = o_p{(n h_s log n)^{−1/2}} by a Taylor expansion to q_2-th degree for α_{l's'} and to (q_2 − 1)-th degree for φ, using assumptions (A2) and (A5)(a). The result then follows from assumption (A1), namely that L is a kernel function of order q_2. □

which after the change of variables, zs = xs + hs u and z−s = x−s + Gs v, is seen to be q2 Op (gmax )=op {(nh log n)−1/2 }, by a Taylor expansion to q2 th degree for l s and (q2 − 1)th degree for , using assumptions (A2) and (A5) (a). Then the result follows from assumption (A1) that L is a kernel function of q2 th order.  References Bosq, D., 1998. Nonparametric Statistics for Stochastic Processes. Springer, New York. Brockwell, P.J., Davis, R.A., 1991. Time Series: Theory and Methods. Springer, New York. Cai, Z., Fan, J., Yao, Q.W., 2000. Functional-coefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95, 941–956. Chen, R., Tsay, R.S., 1993a. Nonlinear additive ARX models. J. Amer. Statist. Assoc. 88, 955–967. Chen, R., Tsay, R.S., 1993b. Functional-coefficient autoregressive models. J. Amer. Statist. Assoc. 88, 298–308.


Härdle, W., Liang, H., Gao, J.T., 2000. Partially Linear Models. Springer, Heidelberg.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall, London.
Hastie, T.J., Tibshirani, R.J., 1993. Varying-coefficient models. J. Roy. Statist. Soc. Ser. B 55, 757–796.
Hoover, D.R., Rice, J.A., Wu, C.O., Yang, L.P., 1998. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85, 809–822.
Linton, O.B., Härdle, W., 1996. Estimation of additive regression models with known links. Biometrika 83, 529–540.
Linton, O.B., Nielsen, J.P., 1995. A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100.
Liptser, R.Sh., Shirjaev, A.N., 1980. A functional central limit theorem for martingales. Theory Probab. Appl. 25, 667–688.
Mammen, E., 1992. When Does Bootstrap Work: Asymptotic Results and Simulations. Lecture Notes in Statistics, vol. 77. Springer, Berlin.
Seifert, B., Gasser, T., 1996. Finite-sample variance of local polynomials: analysis and solutions. J. Amer. Statist. Assoc. 91, 267–275.
Sperlich, S., Tjøstheim, D., Yang, L., 2002. Nonparametric estimation and testing of interaction in additive models. Econom. Theory 18, 197–251.
Stone, C.J., 1985. Additive regression and other nonparametric models. Ann. Statist. 13, 689–705.
Tong, H., 1990. Nonlinear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, UK.
Xia, Y.C., Li, W.K., 1999. On single-index coefficient regression models. J. Amer. Statist. Assoc. 94, 1275–1285.
Xue, L., Yang, L., 2004. Estimation of semiparametric additive coefficient model. Research Manuscript 628, Department of Statistics and Probability, MSU. Downloadable at http://www.msu.edu/∼yangli/addeoefffinal.pdf.
Yang, L., Tschernig, R., 2002. Non- and semi-parametric identification of seasonal nonlinear autoregression models. Econom. Theory 18, 1408–1448.
Yang, L., Härdle, W., Park, B.U., Xue, L., 2004. Estimation and testing of varying coefficients with marginal integration. J. Amer. Statist. Assoc., under revision.
Yoshihara, K., 1976. Limiting behavior of U-statistics for stationary, absolutely regular processes. Z. Wahrscheinlichkeitstheorie verwandte Gebiete 35, 237–252.