Estimation of semi-parametric additive coefficient model

Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 www.elsevier.com/locate/jspi Estimation of semi-parametric additive coefﬁcient m...

Download PDF

2MB Sizes 0 Downloads 25 Views

Report

PDF Reader
Full Text

Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 www.elsevier.com/locate/jspi

Estimation of semi-parametric additive coefﬁcient model Lan Xue∗ , Lijian Yang Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA Received 25 February 2004; received in revised form 29 September 2004; accepted 5 November 2004 Available online 24 December 2004

Abstract In the multivariate regression setting, we propose a ﬂexible varying coefﬁcient model in which the regression coefﬁcients of some predictors are additive functions of other predictors. Marginal integration estimators of the coefﬁcients are developed and their asymptotic properties investigated. Under -mixing, it√ is found that the estimators of the parameters in the regression coefﬁcients have rate of convergence 1/ n, and the nonparametric additive components are estimated at the same rate of convergence as in univariate smoothing. A data-driven bandwidth selection method is developed based on asymptotic considerations. Its effectiveness is conﬁrmed in a Monte-Carlo study. The procedure is applied to the real German GNP and Wolf’s Sunspot data, where the semi-parametric additive coefﬁcient model demonstrates superior performance in terms of out-of-sample forecasts. © 2004 Elsevier B.V. All rights reserved. Keywords: German real GNP; Local polynomial; Marginal integration; Out-of-sample forecast; Rate of convergence; Varying coefﬁcient model; Wolf’s sunspot number

1. Introduction The nonparametric regression model has been widely used in various applications due to its ability to discover data structures that linear and parametric models fail to detect.A serious limitation of the general nonparametric model which gives statisticians reservation is the “curse of dimensionality” phenomenon. This term refers to the fact that the convergence rate of nonparametric smoothing estimators becomes rather slow when the estimation target is a ∗ Corresponding author.

E-mail address: [email protected] (L. Xue). 0378-3758/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2004.11.003

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2507

general function of a large number of variables without additional structures. Many efforts have been made to impose structures on the regression function to partly alleviate the “curse of dimensionality”, which is broadly described as dimension reduction. Some well-known dimension reduction approaches are: (generalized) additive models (Chen and Tsay, 1993a; Hastie and Tibshirani, 1990; Sperlich et al., 2002; Stone, 1985), partially linear models (Härdle et al., 2000) and varying coefﬁcient models (Hastie and Tibshirani, 1993). The idea of the varying coefﬁcient model is especially appealing. It allows a response variable to depend linearly on some regressors, with coefﬁcients as smooth functions of some other predictor variables. The additive-linear structure enables simple interpretation and avoids the curse of dimensionality problem in high-dimensional cases. Speciﬁcally, consider a multivariate regression model in which a sample {(Yi , Xi , Ti )}ni=1 is drawn that satisﬁes Yi = m(Xi , Ti ) + (Xi , Ti )i ,

(1.1)

where for the response variables Yi and predictor vectors Xi and Ti , m and 2 are the conditional mean and variance functions m(Xi , Ti ) = E(Yi |Xi , Ti ),

2 (Xi , Ti ) = var(Yi |Xi , Ti )

(1.2)

and E(i |Xi , Ti ) = 0, var(i |Xi , Ti ) = 1. For the varying coefﬁcient model, the conditional mean takes the following form: m(X, T) =

d

l (Xl )Tl

(1.3)

l=1

in which all tuning variables Xl , l = 1, . . . , d make up the vector X, and all linear predictor variables Tl , l = 1, . . . , d are univariate and distinct. Hastie and Tibshirani (1993) proposed a backﬁtting algorithm to estimate the varying coefﬁcient functions {l (xl )}1 l d , but gave no asymptotic justiﬁcation of the algorithm. A somewhat restricted model, the functional coefﬁcient model, was proposed in the time series context by Chen and Tsay (1993b) and later in the context of longitudinal data by Hoover et al. (1998), in which all the tuning variables Xl , l = 1, . . . , d are the same and univariate. For more recent developments of the functional coefﬁcient model, see Cai et al. (2000). In a different direction, Yang et al. (2004) studied inference for model (1.3) when all the tuning variables {Xil }1 l d are univariate but have a joint d-dimensional density. This model breaks the restrictive nature of the functional coefﬁcient model that all the tuning variables Xl , l = 1, . . . , d have to be equal. On the other hand, it requires that none of the tuning variables Xl , l = 1, . . . , d are equal. In this paper, we propose the following additive coefﬁcient model which has a more ﬂexible form, namely m(X, T) =

d1 l=1

l (X)Tl ,

l (X) =

d2

ls (Xs ),

∀1 l d1

(1.4)

s=1

1 in which the coefﬁcient functions {l (X)}dl=1 are additive functions of the tuning variables T X = (X1 , . . . , Xd2 ) . Note that without the additivity restriction on the coefﬁcient functions

2508

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Table 1 German real GNP: the ASEs and ASPEs of four ﬁts

Additive coefﬁcient model (1.5), local linear ﬁt Additive coefﬁcient model (1.5), local cubic ﬁt Linear AR (24) model (4.2) Linear AR (248) model (4.3)

ASE

ASPE

0.000201 0.000205 0.000253 0.000258

0.000085 0.000077 0.000112 0.000116

1 {l (X)}dl=1 , model (1.4) would be a kind of functional coefﬁcient model with a multivariate tuning variable X instead of a univariate one as in the existing literature. The additive 1 structure is imposed on the coefﬁcient functions {l (X)}dl=1 , so that inference can be made on them without the “curse of dimensionality”. To understand the ﬂexibility of this model, we look at some of the models that are included as special cases:

1. When the dimension of X is 1 (d2 = 1), (1.4) reduces to the functional coefﬁcient model of Chen and Tsay (1993b). 2. When the linear regressor vector T is constant (d1 = 1, and T1 ≡ 1), (1.4) reduces to the additive model of Chen and Tsay (1993a), Hastie and Tibshirani (1990). 3. When for any ﬁxed l = 1, . . . , d1 , ls (xs ) ≡ 0 for all but one s = 1, . . . , d2 , (1.4) reduces to the varying coefﬁcient model (1.3) of Hastie and Tibshirani (1993). 4. When d1 = d2 = d, and ls (xs ) ≡ 0 for l = s, (1.4) reduces to the varying-coefﬁcient model of Yang et al. (2004). The additive coefﬁcient model is a useful nonparametric alternative to the parametric models. To gain some insight into it, consider the application of our estimation procedure to the quarterly West German real GNP data from January 1960 to April 1990. Denote this time series by {Gt }124 t=1 , where Gt is the real GNP in the tth quarter (the ﬁrst quarter being from January 1, 1960 to April 1, 1960, the 124th quarter being from January 1, 1990 to April 1, 1990). Yang and Tschernig (2002) deseasonalized this series by removing the four seasonal means from the series log(Gt+4 /Gt+3 ), t = 1, . . . , 120. Denote the transformed time series as {Yt }120 t=1 . As the nonparametric alternative to the best ﬁtting linear autoregressive model (4.2) in Section 4.2, we have ﬁtted the following additive coefﬁcient model (details in Section 4.2) Yt = {c1 + 11 (Yt−1 ) + 12 (Yt−8 )}Yt−2 + {c2 + 21 (Yt−1 ) + 22 (Yt−8 )}Yt−4 + t . (1.5) Using this model, we can efﬁciently take into account the phenomenon that the effect of Yt−2 , Yt−4 on Yt vary with Yt−1 , Yt−8 . The efﬁciency is evidenced by its superior out-ofsample one-step prediction at each of the last 10 quarters. The averaged squared prediction error (ASPE) is 0.000112 for the linear autoregressive ﬁt in (4.2), and 0.000077 or 0.000085 for two ﬁts of the additive coefﬁcient model (1.5). Hence the reduction in ASPE is between 31% and 46%, see Table 1. Fig. 1 clearly illustrates this improvement in prediction power.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2509

Observed vs predicted values of seasonally adjusted GNP data

0.02 0.01 0.0 -0.01 -0.02 -0.03 112

114

116

118

120

Fig. 1. German real GNP: one-step prediction performance for the last 10 quarters. Circle denotes the observed value, triangle denotes the prediction by additive coefﬁcient model (1.5), and cross denotes the prediction by linear autoregressive model (4.2).

One can see that the additive coefﬁcient model out-performs the linear autoregressive model in prediction for 8 of the 10 quarters. We organize the paper as follows. In Section 2, we present the estimation procedure for the coefﬁcient functions in model (1.4) and the asymptotic properties of the estimators. In Section 3, we discuss computing issues relating to the implementation of the marginal estimation method as in Section 2. In Section 4, simulation results and applications to two empirical examples will be presented. Technical proofs are contained in the Appendix. 2. The estimators 2.1. Model identiﬁcation For the additive coefﬁcient model, the regression function m(X, T) in (1.4) needs to be identiﬁed. One practical solution is to rewrite it as m(X, T) =

d1

l (X)Tl ,

l (X) = cl +

d2

ls (Xs ),

∀1 l d1

(2.1)

s=1

l=1

with the identiﬁcation conditions E{w(X)ls (Xs )} ≡ 0,

l = 1, . . . , d1 ,

s = 1, . . . , d2

(2.2)

for some nonnegative weight function w, with E{w(X)} = 1. The weight function w is introduced so that estimation of the unknown functions {l (X)}1 l d1 will be carried out only on the support of w, supp(w), which is compact according to assumption (A7). This is important as most of the asymptotic results for kernel-type estimators are developed only

2510

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

for values over compact sets. By having this weight function, the support of the distribution of X is not required to be compact. This relaxation is very desirable since most time series distributions are not compactly supported. See Yang and Tschernig (2002, p. 1414) for similar use of the weight function. Note that (2.2) does not impose any restriction on the model, since any regression function 1 d2 ∗ m(X, T) = dl=1 s=1 ls (Xs )Tl can be reorganized to satisfy (2.2), by writing m(X, T) =

d1

cl +

d2

ls (Xs ) Tl

s=1

l=1

2 ∗ with cl = E{w(X) ds=1 ls (Xs )}, ls (Xs ) = ∗ls (Xs ) − E{w(X)∗ls (Xs )}. s d2 In addition, for the functions {ls (Xs )}11 l d1 and parameters {cl }1 l d1 to be uniquely determined, one imposes an additional assumption (A0) There exists a constant C > 0 such that for any set of measurable functions s d2 {bls (Xs )}11 l d1 that satisfy (2.2) and any set of constants {al }1 l d1 , the following holds: d 2 d d2 d1 d2 1 1 2 2 E al + bls (Xs ) Tl C al + E{bls (Xs )} . (2.3) s=1

l=1

l=1 s=1

l=1

Lemma 1. Under assumptions (A0) and (A5) in the Appendix, the representation in (2.1) subject to (2.2) is unique. Proof. Suppose that m(X, T) =

d1

cl +

d2

ls (Xs ) Tl =

s=1

l=1

d1

cl +

d2

ls (Xs ) Tl

s=1

l=1

s d2 s d2 with both the set {ls (Xs )}11 ls (Xs )}11 cl }1 l d1 l d1 , {cl }1 l d1 and the set { l d1 , { satisfying (2.2). Then upon deﬁning for all s, l

ls (Xs ) − ls (Xs ), al ≡ cl − cl bls (Xs ) ≡ 1 2 one has dl=1 {al + ds=1 bls (Xs )}Tl ≡ 0. Hence by assumption (A0) 0=E

d 1 l=1

al +

d2 s=1

bls (Xs ) Tl

2 C

d 1 l=1

al2

+

d1 d2

2 E{bls (Xs )}

l=1 s=1

2 (X ) ≡ 0 almost surely. Since assumption (A5) entailing that for all s, l, al ≡ 0 and bls s requires that all Xs are continuous random variables, one has bls (x) ≡ 0 for all s, l.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2511

2.2. The model Consider {(Yi , Xi , Ti )}ni=1 , a sample that follows (1.1) and (1.2) whose conditional mean function is described by (2.1) and the identiﬁability conditions (2.2), (2.3). The error terms {i }ni=1 are i.i.d with Ei =0, E2i =1, and with the additional property that i is independent of {(Xj , Tj ), j i}, i = 1, . . . , n. With this error structure, the explanatory variable vector (Xi , Ti ) can contain exogenous variables and/or lag variables of Yi . If (Xi , Ti ) contains only the lags of Yi , it is a semi-parametric time series model, which is a useful extension of many existing nonlinear and nonparametric time series models such as exponential autoregressive model (EXPAR), threshold autoregressive model (TAR), and functional autoregressive model (FAR). In (2.1), for every l = 1, . . . , d2 , the coefﬁcient of Til consists of two parts, the unknown parameter cl , and the unknown univariate functions {ls }1 s d2 . The marginal integration method will be applied to estimate both. The marginal integration method was ﬁrst discussed in Linton and Nielsen (1995) in the context of additive models, see also the marginal integration method for generalized additive models in Linton and Härdle (1996). To see how the marginal integration method works in our context, observe that according to the identiﬁcation condition (2.2), for every l = 1, . . . , d1 one has (2.4) cl = E{w(X)l (X)} = w(x)l (x)(x) dx and for every point x = (x1 , . . . , xd2 )T , and every l = 1, . . . , d1 , s = 1, . . . , d2 , one has cl + ls (xs ) = E{w−s (X−s )l (xs , X−s )} = w−s (u−s )l (xs , u−s )−s (u−s ) du−s ,

(2.5)

where u−s = (u1 , . . . , us−1 , us+1 , . . . , ud2 )T , (xs , u−s ) = (u1 , . . . , us−1 , xs , us+1 , . . . , ud2 )T , the density of X is , and the marginal density of X−s = (X1 , . . . , Xs−1 , Xs+1 , . . . , Xd2 )T is −s , and w−s (x−s ) = E{w(Xs , x−s )} = w(u, x−s ) ds (u). In addition, the marginal density of Xs is denoted by s . Intuitively, one has n n ws (Xi,−s )l (xs , Xi,−s ) i=1 w(Xi )l (Xi ) cl ≈ (2.6) , ls (xs ) ≈ i=1 n − cl n w(X ) i i=1 i=1 ws (Xi,−s ) 1 and the d2 -dimensional functions {l (x)}dl=1 in the above equations (2.6) can be replaced by the usual local polynomial estimators. This is the essential idea behind the marginal integration method.

2.3. Estimators of constants {cl } 1 According to the ﬁrst approximation equation in (2.6), to estimate the constants {cl }dl=1 , d1 we ﬁrst estimate the unknown functions {l (x)}l=1 at those data points Xi that are in the support of the weight function w. More generally, for any ﬁxed x ∈ supp(w), we

2512

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

1 approximate l (x) locally by a constant l , and estimate {l (x)}dl=1 by minimizing the T following weighted sum of squares with respect to = (1 , . . . , d1 ) :

n i=1

Yi −

d1

2 l Til

KH (Xi − x),

(2.7)

l=1

where K is a d2 -variate kernel function of order q1 , see assumption (A1) in the Appendix, H = diag{h0,1 , . . . , h0,d2 } is a diagonal matrix of positive numbers h0,1 , . . . , h0,d2 , called bandwidths, and x1 xd2 1 KH (x) = d . K ,..., 2 h0,1 h0,d2 h0,s s=1

Let ˆ = (ˆ1 , . . . , ˆ d1 )T be the solution to the least-squares problem in (2.7). Note that ˆ is 1 dependent on x, as is (2.7), and the components in ˆ give the estimators for {l (x)}dl=1 . To T emphasize the dependence on x, we write ˆ = ˆ (x) = (ˆ1 (x), . . . , ˆ d1 (x)) . More precisely, let ⎛T ··· T ⎞ 1d1

11

W(x) = diag{KH (Xi − x)/n}1 i n ,

Z = ⎝ ... Tn1

.. ⎠ , . . · · · Tnd 1 ..

Y = (Y1 , . . . , Yn )T and el be a d1 -dimensional vector with all entries 0 except the lth entry being 1. Then {ˆl (x)}1 l d1 is given by ˆ l (x) = elT {ZT W(x)Z}−1 ZT W(x)Y.

(2.8)

By (2.6), the parameter cl can be estimated as a weighted average of ˆ l (Xi )’s, i.e., n w(Xi )ˆl (Xi ) n cˆl = i=1 (2.9) , l = 1, . . . , d1 . i=1 w(Xi ) Theorem 1. Under assumptions (A1)–(A7) in the Appendix, for any l = 1, . . . , d1 √

L

n(cˆl − cl ) → N{0, 2l },

where the asymptotic variance 2l is deﬁned in (A.11) in the Appendix. √ The rate of 1/ n at which cˆl converges to cl is due to two special features of ˆ l (x). First, q1 q1 , . . . , h0,d , bounded by the bias of ˆ l (x) in estimating l (x) consists of terms of order h0,1 2 √ 1/ n according to assumption (A6) (a), see the derivation of Lemma A.5 about the term Pc1 . −1 Second, the usual variance of ˆ l (x) in estimating l (x) is proportional to n−1 h−1 0,1 · · · h0,d2 , which gets reduced to 1/n due to the effect of averaging in (2.9), see the derivation of the term Pc2 in (A.9) and (A.10). This technique of simultaneously reducing the bias by the use of a higher-order kernel and “integrating out the variance” is the common feature of all marginal integration procedures.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2513

s d2 2.4. Estimators of functions {ls }11 l d1

In the following, we illustrate the procedure for estimating the functions {ls (xs )}1 l d1 , for any ﬁxed s = 1, . . . , d1 . Let xs be a point at which we want to evaluate the func1 tions {ls (xs )}1 l d1 . According to (2.6), we need to estimate {l (x)}dl=1 at those points (xs , Xi,−s ) that lie in the support of w. For any x ∈ supp(w), differently fromestimating the p constants, we approximate the function l (u) locally at x by l (u) ≈ l + j =1 lj (us −

1 xs )j , and estimate {l (x)}dl=1 by minimizing the following weighted sum of squares with respect to = (1 , . . . , d1 )T , = (11 , . . . , 1p , . . . , d1 1 , . . . , d1 p )T

n

⎡ ⎣Yi −

i=1

⎧ d1 ⎨

l +

l=1

⎩

p

lj (Xis − xs )j

j =1

⎫ ⎬ ⎭

⎤2 Til ⎦ khs (Xis − xs )LGs (Xi,−s − x−s )

in which k is a univariate kernel, L is a (d2 − 1)-variate kernel of order q2 , as in assumption (A1) in the Appendix, the bandwidth matrix Gs = diag{g1 , . . . , gs−1 , gs+1 , . . . , gd2 }, and khs (us ) =

1 k hs

us hs

, 1

u1 us−1 us+1 ud LGs (u−s ) =

L ,..., , ,..., 2 g1 gs−1 gs+1 gd2 1 s d2 ,s =s gs

for u−s = (u1 , . . . , us−1 , us+1 , . . . , ud2 ). Let ˆ , ˆ be the solution of the above least-squares 1 problem. Then the components in ˆ give the estimators for {l (x)}dl=1 , which is given by ˆl (x) = elT {ZTs Ws (x)Zs }−1 ZTs Ws (x)Y,

(2.10)

where el is a (p + 1)d1 -dimensional vector with all entries 0 except the lth entry being 1, Ws (x) ≡ diag{n−1 khs (Xis − xs )LGs (Xi,−s − x−s )}1 i n and ⎡ TT , {(X − x )/ h }TT , . . . , {(X − x )/ h }p TT ⎤ 1s s s 1 1s s s 1 1 . ⎦ ⎣ .. Zs = TTn , {(Xns − xs )/ hs }TTn , . . . , {(Xns − xs )/ hs }p TTn ⎡ [p{(X − x )/ h }]T ⊗ TT ⎤ 1s s s 1 .. ⎦ =⎣ . T T [p{(Xns − xs )/ hs }] ⊗ Tn

(2.11)

in which p(u) = (1, u, . . . , up )T and ⊗ denotes the Kronecker product of matrices. Then for each s, we can construct the marginal integration estimators of ls for l = 1, . . . , d1

2514

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

simultaneously, which are given by n ˆ ls (xs ) =

l (xs , Xi,−s ) i=1 w−s (Xi,−s )ˆ n i=1 w−s (Xi,−s )

− cˆl ,

(2.12)

√ where the term cˆl is the n-consistent estimator of cl in Theorem 1. The estimator ˆ ls (xs ) is referred to as the pth order local polynomial estimator, where p is the highest polynomial degree of variables Xis − xs , i = 1, . . . , n, in the deﬁnition of the design matrix Zs in (2.11). In particular, the local linear (p = 1) and the local cubic estimators (p = 3) are the most commonly used. Theorem 2. Under assumptions A1–A7 in the Appendix, for any x = (x1 , . . . , xd2 )T ∈ supp(w), one has for l = 1, . . . , d1 , s = 1, . . . , d2

p+1

nhs {ˆls (xs ) − ls (xs ) − hs

L

ls (xs )} → N{0, 2ls (xs )},

(2.13)

where ls (xs ) and 2ls (xs ) are deﬁned in (A.16) and (A.18), respectively. Finally, based on (2.9) and (2.12), one can predict Y given any realization (x, t) of (X, T) by the predictor

m(x, ˆ t) =

d1 l=1

cˆl +

d2

ˆ ls (xs ) tl .

(2.14)

s=1

√ To appreciate why ls can be estimated by ˆ ls at the rate of 1/ nhs , which is the same as the rate of estimating a nonparametric function in the univariate case, we discuss two special features of ˆ l (x) given in (2.10), which are similar to those discussed in Section 2.3. p+1 q2 First the bias of ˆ l (x) in estimating l (x) is of order hs + gmax , where the ﬁrst term can be understood as the approximation bias caused by locally approximating ls using a pth degree polynomial, see the derivation of Ps2 in Lemma A.9, and the second term can be considered as the approximation bias by locally approximating functions {ls }s =s using q2 since the kernel L is of order q2 , see Ps3 in Lemma a constant, which is bounded by gmax q2 of the second bias term is negligible compared to the rescaling factor A.9. The order g √ max of order 1/ nhs , according to (A6) (b). Hence, only the ﬁrst bias term appears in the asymptotic distribution formula (2.13). As for the variance of ˆ l (x) in estimating l (x), it is −1 −1 −1 −1 proportional to n−1 h−1 s g1 · · · gs−1 gs+1 · · · gd2 , but due to marginal averaging of variables Xi,−s , the bandwidths g1 , . . . , gs−1 , gs+1 , . . . , gd2 related to Xi,−s are integrated out, see Ps1 in Lemma A.9. Then the variance of ˆ ls is reduced to the order n−1 h−1 s . If the same bandwidth hs is used for all variable directions in X, then Assumption (A6) (b) would imply that nn−d2 /(2p+3) → ∞ and hence restricting d2 to be less than 2p + 3, for the asymptotic results of Theorem 2 to be true. That is why we prefer the ﬂexibility of using a set of bandwidths g1 , . . . , gs−1 , gs+1 , . . . , gd2 different from hs .

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2515

3. Implementation Practical implementation of the estimators deﬁned in (2.9) and (2.12) requires a rather intelligent choice of bandwidths H = diag{h0,1 , . . . , h0,d2 }, Gs = diag{g1 , . . . , gs−1 , gs+1 , . . . , gd2 } and {hs }1 s d2 . In the following, we discuss the choices of such bandwidths. 1 • Note from Theorem 1 that the asymptotic distributions of the estimators {cˆl }dl=1 de2 pend only on the quantity l , not on the bandwidths in H. Hence we have only specˆ ˆ iﬁed that H satisfy the order assumptions in (A6) (a) by taking h01 = · · · = h0d2 = var(X) ˆ log(n)n−1/(2q1 −1) , where q1 is the order of the kernel K, required to be greater

2 1/d2 ,in which var(X ˆ = { ds=1 var(X ˆ ˆ than (d2 + 1)/2, and var(X) s )} s ) denotes the sample variance of Xs , s = 1, . . . , d2 . s d2 • The asymptotic distributions of the estimators {ˆls }11 l d1 depend not only on the func2 tions ls (xs ) and ls (xs ) but also crucially on the choice of bandwidths hs . Moreover, for each s = 1, . . . , d2 , the coefﬁcient functions {ls (xs ), l = 1, . . . , d1 } are estimated simultaneously. So we deﬁne the optimal bandwidth of hs , denoted by hs,opt , as the minimizer of the total asymptotic mean integrated squared errors of {ˆls (xs ), l = 1, . . . , d1 }, which is deﬁned as

d1

2(p+1) AMISE{ˆls } = hs

d1

2ls (xs ) dxs

l=1

l=1

d1 1 2ls (xs ) dxs . + nhs l=1

Then hs,opt is found to be hs,opt =

d1

2ls (xs ) dxs d1 2 2n(p + 1) l=1 ls (xs ) dxs

1/(2p+3)

l=1

in which ls (xs ) and 2ls (xs ) are the asymptotic bias and variance of ˆ ls as in (A.16) and (A.18).According to the deﬁnitions of ls (xs ) and 2ls (xs ), 2ls (xs ) dxs and 2ls (xs ) dxs can be approximated, respectively, by

1 1 (p+1) l s (xs ) (p + 1)! l =1 2

d

×(u, xs, Xi,−s , Ti )}du

up+1

n 1 {w−s (Xi,−s )Til Kls∗ n i=1

dxs ,

n 2 (X 2 2 1 w−s i,−s )−s (Xi,−s ) (Xi , Ti ) Kls∗2 (u, Xi , Ti ) du, n 2 (Xi ) i=1

where the functions Kls∗ are deﬁned in (A.17).

2516

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 (p+1)

To implement this, one needs to evaluate terms such as l s (xs ), 2 (x, t), (x), (x−s ) and Kls∗ . We propose the following simple estimation methods for those quantities. The resulting bandwidth is denoted as hˆ s,opt . (p+1)

1. The derivative functions l s model of degree p + 2 E(Y |X, T) =

(xs ) are estimated by ﬁtting a polynomial regression

d1 d2 p+2

als,k Xsk Tl .

l=1 s=1 k=0 (p+1) Then l s (xs ) is estimated as (p+1)!al s,p+1 +(p+2)!al s,p+2 xs . As a by-product, the mean-squared error of this model, is used as an estimate of 2 (x).

2. Density functions (x) and (x−s ), are estimated as n d2 Xis − xs 1 1 , (x) ˆ = n h(X, d2 ) h(X, d2 ) i=1 s=1

ˆ −s (x−s ) =

n 1 Xis − xs 1 h(X−s , d2 − 1) h(X−s , d2 − 1) n i=1 s =s

with the standard normal density and the rule-of-the-thumb bandwidth ˆ + 2)}1/(m+4) n−1/(m+4) . h(X, m) = var(X){4/(m 3. According to the deﬁnition in (A.17), the dependence of the functions Kls∗ (u, x, t) on u and t is explicitly known. The only unknown term E(TTT |X = x) contained in S−1 (x) is estimated by ﬁtting matrix polynomial regression E(TT |X = x) = c + T

p d2

cs,k xsk

s=1 k=1

in which the coefﬁcients c, cs,k are d1 × d1 matrices. In this procedure, one simply uses polynomial regression to estimate some of the unknown quantities, which is easy to implement, but may lead the estimated optimal bandwidths to be biased relative to the true optimal bandwidths. The development of a more sophisticated bandwidth selection method requires further investigation. s d2 • Since Theorem 2 implies that the asymptotic distributions of the estimators {ˆls }11 l d1

2 do not depend on {Gs }ds−1 , we only specify that the Gs satisﬁes the order assumption (p+1)/q in (A6) (b) g1 = · · · = gs−1 = gs+1 = · · · = gd2 = hˆ s,opt 2 / log(n), in which q2 , the order of the kernel function L, is required to be greater than (d2 − 1)/2, and hˆ s,opt is the optimal bandwidth obtained using the above procedure.

Following the above discussion, the order of the kernels K and L are required to be greater than (d2 + 1)/2 and (d2 − 1)/2, respectively. If the dimension of X equals to 2, kernels K

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2517

and L can have order 2. We have used the quadratic kernel k(u) = 15/16(1 − u2 )2 1{|u| 1} , where 1{|u| 1} is the indicator function of [−1, 1] and the kernels K, L are product kernels. −1 TTT , and the matrix Lastly, the matrix ZT W(x)Z in (2.7) is computedas ZT W(x)Z+n ˆ k(u)p(u)p(u)T du} TTT , ZTs Ws (x)Zs in (2.10) as ZTs Ws (x)Zs +(nhˆ s,opt )−1 var(X){ following the ridge regression idea of Seifert and Gasser (1996).

4. Examples 4.1. A simulated example The data are generated from the following model Y = {c1 + 11 (X1 ) + 12 (X2 )}T1 + {c2 + 21 (X1 ) + 22 (X2 )}T2 +

(4.1)

with c1 = 2,

c2 = 1,

11 (x1 ) = 21 (x1 ) = sin(x1 ),

12 (x2 ) = x2 ,

22 (x2 ) = 0,

where X=(X1 , X2 )T is uniformly distributed on [−, ]×[−, ], and T=(T1 , T2 )T follows the bivariate standard normal distribution. The vectors X, T are generated independently. The error term is a standard normal random variable and independent of (X, T). We use sample sizes n = 100, 250 and 500. The number of replications in the simulation is 100. First, to assess the performance of the data-driven bandwidth selector in Section 3, we plot in Fig. 2 the kernel estimates of the sampling distribution density of the ratio hˆ 1,opt / h1,opt , where h1,opt is the optimal bandwidth for estimating 11 and 21 . One can see that the sampling distribution of the ratio hˆ 1,opt / h1,opt converges to 1 rapidly as the

20

15

10

5

0 1.0

1.5

2.0

Fig. 2. Bandwidth selection results of the simulated example: kernel density estimates of hˆ 1,opt / h1,opt (h1,opt is the theoretical optimal bandwidth for estimating the functions of x1 in (4.1)). Solid curve is for n = 100, dotted curve is for n = 250, and dot–dashed curve is for n = 500.

2518

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Table 2 Results of the 100 simulations: the means and standard errors (in parentheses) of the estimators cˆ1 and cˆ2 of c1 and c2 , and AISEs of the estimators ˆ 11 , ˆ 12 , ˆ 21 , ˆ 22 of the coefﬁcient functions 11 , 12 , 21 , 22

n = 100 n = 250 n = 500

c1 = 2

c2 = 1

11

12

21

22

1.9737 (0.3574) 2.0299 (0.2410) 1.9786 (0.1680)

1.0406 (0.2503) 1.0056 (0.1490) 1.0026 (0.1111)

0.1609 0.0568 0.0295

0.2541 0.0963 0.0483

0.1205 0.0338 0.0191

0.2761 0.0649 0.0310

sample size increases. Similar results are also obtained for h2,opt , the optimal bandwidth for estimating 12 and 22 . The plot is omitted. The simulation results indicate that the proposed bandwidth selection method is reliable in this instance. The fact that the distribution of the selected bandwidth seems skewed toward larger values is due to the use of a simple polynomial function as a plug-in substitute of the true regression function. Second, we estimate the functions ls on a grid of equally spaced grid of points xm , m = 1, . . . , ngrid with x1 =−0.975, xngrid =0.975, ngrid =62. We summarize the ﬁtting results in Table 2, which includes the means and standard errors (in the parentheses) of the cˆl , l=1, 2 and the averaged integrated squared errors (AISE) of ˆ ls . By denoting the estimated function ˆ ls of ls in the ith replication by ˆ ils , we deﬁne ISE(ˆils ) =

1 ngrid

ngrid

{ˆils (xm ) − ls (xm )}2

and

AISE(ˆls ) =

m=1

100 1 ISE(ˆils ). 100 i=1

Fig. 3 gives the plots of ˆ 11 , ˆ 12 , ˆ 21 , ˆ 22 obtained in the 100 replications for sample size n = 100, 250, 500, respectively. Also plotted are the typical estimators, whose ISE is the median of the ISEs in the 100 replications. 4.2. The West German real GNP In this section, we discuss in detail the West German real GNP data ﬁrst mentioned in the introduction. Yang and Tschernig (2002) found that it had an autoregressive structure on lags 4, 2, 8 according to FPE and AIC, lags 4, 2 according to BIC, where the FPE, AIC and BIC are lag selection criteria for linear time series models as in Brockwell and Davis (1991). On the other hand, lags 4, 2, 8 are selected by the semi-parametric seasonal shift criterion, and lags 4, 1, 7 are selected by the semi-parametric seasonal dummy criterion for the slightly different series {log(Gt+4 /Gt+3 )}120 t=1 . Both semi-parametric criteria are developed in Yang and Tschernig (2002). According to Brockwell and Davis (1991, p. 304), the lag selection criteria AIC and FPE of the linear time series models are asymptotically efﬁcient but inconsistent, while BIC selects the correct set of variables consistently. Therefore, one may ﬁt a linear autoregressive model with either Yt−2 , Yt−4 or Yt−2 , Yt−4 , Yt−8 as the regressors, with the understanding that the variable Yt−8 may be redundant for linear modeling Linear AR (24): Yt = a1 Yt−2 + a2 Yt−4 + t ,

(4.2)

Linear AR (248): Yt = b1 Yt−2 + b2 Yt−4 + b3 Yt−8 + t .

(4.3)

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 2

3 2 1

1.0

1.0

0.0

1

0.0

0

-1 -1.5

-1

-2

-1

0

1

2

3

-2

-1.5

-3 -3

-3

-2

-1

a1

0

1

2

3

-3

-2

-1

a2

0

1

2

3

-1

0

1

2

3

-3

-3

-2

-1

0

1

2

3

-1.5

-2

-1

0

1

2

3

1

2

3

-3

-2

-1

0

1

2

3

-3

-2

-1

0

1

2

3

d1

1

2

3

-3

-2

-1

0

1

2

3

1

2

3

2 1.0

1 0

-1 0

3

c4

0.0

-1

-3

c3

0.0

-2

2

-2

-1.5 -3

3 2 1

-3

1

0

c2

-1.5

0

-1

c1 1.0

-1

1

0.0

0

-2

2

-1 -1

-3

b4

1.0

0.0

-2

3

-2 -3

b3

3 2 1

-3

2

0

b2

-1.5

1

-1

b1 1.0

0

1

-1 -2

-1

2 0.0

-3

-2

a4

1.0

0.0

-1.5

-3

a3

3 2 1

1.0

2519

-1 -2

-1.5 -3

-2

-1

0

d2

1

2

3

-3

-2

-1

0

d3

1

2

3

-3

-2

-1

0

d4

Fig. 3. Plots of the estimated coefﬁcient functions in the simulated example. (a1–a4) are plots of the 100 estimated curves for 11 (x1 ) = sin(x1 ), 12 (x2 ) = x2 , 21 (x1 ) = sin(x1 ), 22 (x2 ) = 0 with n = 100. (b1–b4) and (c1–c4) are the same as (a1–a4), but with sample size n = 250 and 500, respectively. (d1–d4) are plots of the typical estimators, the solid curve represents the true curve, the dotted curve is the typical estimated curve with n = 100, the dot–dashed curve is with n = 250 and the dashed curve is with n = 500.

From Table 1, it is clear that besides being more parsimonious, the linear model (4.2) has smaller average squared prediction error, compared with model (4.3). Thus model (4.2) is the preferred linear autoregressive model. Moreover, Figs. 4 and 5 show that the scatter plots of Yt against the two signiﬁcant linear predictors, Yt−2 and Yt−4 , along with the leastsquares regression lines, actually vary signiﬁcantly at different levels of Yt−1 and Yt−8 . So we have ﬁtted the additive coefﬁcient model (1.5) in the introduction. We use the ﬁrst 110 observations for estimation and perform one-step prediction using the last 10 observations. When estimating the coefﬁcient functions in model (1.5), we use local cubic ﬁtting (i.e. taking p = 3 in Zs (2.11)). According to the bandwidth selection method in Section 3, we use bandwidths 0.0031 and 0.0020 for estimating the functions of Yt−1 and Yt−8 , respectively. The estimated coefﬁcient functions are plotted in Fig. 6. We have also generated 500 wild bootstrap (Mammen, 1992) samples and obtain 95% pointwise bootstrap conﬁdence intervals of the estimated coefﬁcient functions. From Fig. 6, one may observe that the estimated functions have obviously nonconstant forms. In addition, their 95% conﬁdence intervals cannot completely cover a horizontal line passing zero in any of the four plots. This supports the hypothesis that the coefﬁcient functions in (1.5) are signiﬁcantly different from a constant. (Notice that by the restrictions proposed in (2.2), if a coefﬁcient function is constant, it has to be zero.)

2520

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 L LLL L L L

L

0.02

0.02

0.0

0.0

L

L L

L

L

L LL

L L

L L LL L L L LL L

L L

(a)

0.02

-0.04 -0.02

M

M M

0.0

M

0.0

M

MM M M M MM

-0.04

(c)

L

-0.04

-0.04

-0.04 -0.02

0.02

0.04

M

0.02

MM M M M

M

MM

M M M M M M MM M M M M

0.0

(b)

0.02

0.04

0.0

-0.04 -0.02

L

0.0

0.02 H

H H H H H HHH HH H HH H H HH H

H H H HH

H

H H H

-0.04

(d)

-0.04 -0.02

0.04

0.0

HH H H

0.02

H

0.04

Fig. 4. Scatter plot of Yt , Yt−2 at three levels of Yt−1 : H, M, L. Here the levels are deﬁned as: H, the high level, is the top 33% of the data, L, the lower level, is the lower 33% of the data, and M, the middle level, is the rest of the data: (a) scatter plot of Yt , Yt−2 ; (b) scatter plot of Yt , Yt−2 at high level of Yt−1 ; (c) scatter plot at middle level of Yt−1 ; (d) scatter plot at lower level of Yt−1 .

L

0.02

L

0.02 L

0.0

0.0

-0.04

-0.04

(a)

-0.04 -0.02

0.0

0.02

0.04

(b)

L L L L L L

-0.04 -0.02

M

0.02 0.0

M

H H M M M M M

M M M M M

M

-0.04

(c)

-0.04 -0.02

M

H

M M MM M M M MM M M MM M M M M M M

0.0

0.02

0.04

LL L L L LL L L LL L L L LL L L L LL L L

0.0

H H H H HH

0.02 HH HH H H HH H HH H H H H 0.0 H H

0.02

0.04

0.02

0.04

H

HH H

HHH

-0.04

(d)

-0.04 -0.02

0.0

Fig. 5. Scatter plot of Yt , Yt−2 at three levels of Yt−8 : H, M, L. Here the levels are deﬁned as: H, the high level, is the top 33% of the data, L, the lower level, is the lower 33% of the data, and M, the middle level, is the rest of the data: (a) scatter plot of Yt , Yt−2 ; (b) scatter plot of Yt , Yt−2 at high level of Yt−8 ; (c) scatter plot at middle level of Yt−8 ; (d) scatter plot at lower level of Yt−8 .

For the two linear autoregressive models, we estimate their constant coefﬁcients by maximum likelihood method. The estimated coefﬁcients are aˆ 1 = −0.2436, aˆ 2 = 0.5622 and bˆ1 = −0.1191, bˆ2 = 0.6458, bˆ3 = 0.0704. Lastly, to assess the sensitivity of marginal integration estimation method to the degree of the local polynomial, we have also ﬁtted model (1.5) using local linear estimation (i.e. taking p =1 in Zs ). Table 1 gives the ASEs (averaged squared estimation error) and ASPEs (averaged squared prediction error) from the above four estimations. One can see that overall the marginal integration estimation for model (1.5) is not sensitive to the order of local polynomial used. Local cubic ﬁts outperform the

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

0.4

0.4

0.2

0.2

0.0

0.0

-0.4

(a)

-0.4 -0.04 -0.02 0.0

0.02 0.04 0.06

(b)

0.4

0.4

0.2

0.2

0.0

0.0

-0.4

(c)

2521

-0.04 -0.02 0.0 0.02 0.04 0.06

-0.4 -0.04 -0.02 0.0

0.02 0.04 0.06

(d)

-0.04 -0.02 0.0 0.02 0.04 0.06

Fig. 6. German real GNP: estimated functions and their point-wise wild bootstrap 95% conﬁdence intervals, based on model (1.5): (a) ˆ 11 , (b) ˆ 12 , (c) ˆ 21 , (d) ˆ 22 .

local linear ﬁts for prediction. Both local linear and local cubic ﬁtting provide signiﬁcant improvements over the linear autoregressive models. 4.3. Wolf’s annual sunspot number In this example, we consider Wolf’s annual sunspot number data for the period 1700–1987. Many authors have analyzed this data set. Tong (1990) used a TAR model with lag 8 as the tuning variable. Chen and Tsay (1993b) and Cai et al. (2000) both used a FAR model with lag 3 as the tuning variable. Xia and Li (1999) proposed a single index model using a linear combination of lag 3 and lag 8 as the tuning variable. Motivated by those models, we propose our additive coefﬁcient model (4.4), in which we use both lag 3 and lag 8 as the additive tuning variables: i.e. Yt = {c1 + 11 (Yt−3 ) + 12 (Yt−8 )}Yt−1 + {c2 + 21 (Yt−3 ) + 22 (Yt−8 )}Yt−2 × {c3 + 31 (Yt−3 ) + 32 (Yt−8 )}Yt−3 + t . (4.4) Following the convention in the literature, we use the transformed data, where Yt = √ 2( 1 + Xt − 1), Xt denotes the observed sunspot number at year t. We use the ﬁrst 280 data points (year 1700–1979) to estimate the coefﬁcient functions, and leave out years 1980–1987 for prediction. The bandwidths 6.87 and 6.52 are selected for estimating functions of Yt−3 and functions of Yt−8 , respectively. The estimated coefﬁcient functions are plotted in Fig. 7, and the time plot of the ﬁtted values in Fig. 8. The ASE is 4.18. Finally, we use our estimated model to predict the sunspot numbers in 1980–1987, and compare these predictions with those based on the TAR model of Tong (1990), the FAR model of Chen and Tsay (1993), denoted as FAR1, and the following two models; the FAR model of Cai et al. (2000) denoted as FAR2 Yt = 1 (Yt−3 )Yt−1 + 2 (Yt−3 )Yt−2 + 3 (Yt−3 )Yt−3 + 6 (Yt−3 )Yt−6 + 8 (Yt−3 )Yt−8 + t

(4.5)

2522

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534 1.5

0.2 0.0

0.2 0.0

1.0

-0.4

0.5

-0.4

0.0

-0.8

(a)

0.4

0

5 10 15 20 25

(b)

-0.8

0

5 10 15 20 25

(c)

0

5 10 15 20 25

0

5 10 15 20 25

0.5

0.3

0.0

0.2

0.0

0.1

-0.5 -0.5

0.0 -0.1

-1.0

-1.0

-0.2

(d)

0

5 10 15 20 25

(e)

0

5 10 15 20 25

(f)

Fig. 7. Wolf’s Sunspot Number: estimated functions based on model (4.4): (a) ˆ 11 , (b) ˆ 21 , (c) ˆ 31 , (d) ˆ 12 , (e) ˆ 22 , (f) ˆ 32 .

25

20

15

10

5

0 1700

1750

1800

1850 Year

1900

1950

Fig. 8. Wolf’s Sunspot Number: time plot of the ﬁtted values based on model (4.4) (solid line), with the observed values (circles).

and the single index coefﬁcient model of Xia and Li (1999) denoted as SIND Yt = 0 {g4 ( , Yt−3 , Yt−8 )} + 1 {g4 ( , Yt−3 , Yt−8 )}Yt−1 + 2 {g4 ( , Yt−3 , Yt−8 )} × Yt−2 + 3 {g4 ( , Yt−3 , Yt−8 )}Yt−3 + 4 {g4 ( , Yt−3 , Yt−8 )}Yt−8 + t , (4.6) in which g4 ( , Yt−3 , Yt−8 ) = cos( )Yt−3 + sin( )Yt−8 . According to Condition (A.1) b, of Cai et al. (2000, p. 952), the conditional density of Yt−3 given the variables (Yt−1 , Yt−2 , Yt−3 , Yt−6 , Yt−8 ) should be bounded. It is clear, however, that Yt−3 is completely predictable from (Yt−1 , Yt−2 , Yt−3 , Yt−6 , Yt−8 ), and hence the distribution of Yt−3 given the variables (Yt−1 , Yt−2 , Yt−3 , Yt−6 , Yt−8 ) is a probability mass at one point, not a continuous distribution with any kind of density. Thus, the

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2523

Table 3 Out-of-sample absolute prediction errors for sunspot data, under different models Year

Xt

TAR

FAR1

FAR2

SIND

Additive coefﬁcient

1980 1981 1982 1983 1984 1985 1986 1987 AAPE ASPE

154.7 140.5 115.9 66.6 45.9 17.9 13.4 29.2

5.5 1.3 19.5 4.8 14.8 0.2 5.5 0.7 6.6 85.6

13.8 0.0 10.0 3.3 3.8 4.6 1.3 21.7 7.3 101.1

1.4 11.4 15.7 10.3 1.0 2.6 3.1 12.3 7.2 81.6

2.1 1.7 2.6 2.4 2.3 7.6 4.2 13.2 4.5 34.3

14.9 2.4 17.5 1.37 5.92 1.96 0.57 0.7 5.7 71.9

Results on models TAR, FAR1 and FAR2 are from Cai et al. (2000, Table 4, p. 951), while results on model SIND are from Xia and Li (1999, Table 2, p. 1280).

use of model (4.5) has not been theoretically justiﬁed. Similarly, model (4.6) is also not theoretically justiﬁed, since according to Condition C5, of Xia and Li (1999, p. 1277), the conditional density of g4 ( , Yt−3 , Yt−8 ) given the variables (Yt−1 , Yt−2 , Yt−3 , Yt−8 , Yt ) should be bounded, whereas again, the distribution of g4 ( , Yt−3 , Yt−8 ) given the variables (Yt−1 , Yt−2 , Yt−3 , Yt−8 , Yt ) is also a point mass. In addition, we illustrate that model (4.6) is unidentiﬁable. For any set of functions {0 , . . . , 4 } that satisfy (4.6), one can always pick an arbitrary nonzero function f (u) and deﬁne 0 (u) = 0 (u) + uf (u),

1 (u) = 1 (u),

3 (u) = 3 (u) − cos( )f (u),

2 (u) = 2 (u),

4 (u) = 4 (u) − sin( )f (u).

It is straightforward to verify that the new set of functions { 0 , . . . , 4 } satisfy (4.6) as well. One possible ﬁx of this problem is to drop either one of the terms 3 {g4 ( , Yt−3 , Yt−8 )}Yt−3 and 4 {g4 ( , Yt−3 , Yt−8 )}Yt−8 from (4.6), then the model is fully identiﬁable and satisﬁes Condition C5, of Xia and Li (1999, p. 1277). Hence the current form of (4.6) may be considered an overﬁtting anomaly. Despite the fact that models (4.5) and (4.6) suffer these theoretical deﬁciencies, we have listed the average absolute prediction errors (AAPE) and ASPE of model TAR, FAR1, FAR2, SIND and our proposed model in Table 3. By comparing the AAPEs and ASPEs, our model outperforms TAR, FAR1 and FAR2, while the unidentiﬁable SIND model has smallest AAPE and ASPE. We believe that this superior forecasting power of (4.6) is due to the prediction advantage of overﬁtting models. For example, in forecasting of linear time series, the overﬁtting AIC/FPE selects models more powerful than the consistent BIC, see, for instance, the discussion of Brockwell and Davis (1991, p. 304). Acknowledgements Xue’s research has been supported in part by NSF Grants BCS 0308420 and DMS 0405330. Yang’s research has been supported in part by NSF Grants SES 0127722 and

2524

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

DMS 0405330. The constructive comments by the coordinating editor and a referee are appreciated. Appendix A. A.1. Assumptions and an auxiliary lemma We have listed below some assumptions necessary for proving Theorems 1 and 2. Throughout this appendix, we denote by the same letters c, C, etc., any positive constants, without distinction in each case. (A1) The kernel functions k, K and L are symmetric, Lipschitz continuous and compactly supported. The function k is a univariate probability density function, while K is d2 rd variate, and of order q1 , i.e. K(u) du = 1 while K(u)ur11 · · · ud22 du = 0, for 1r1 + · · · + rd2 q1 − 1. Kernel L is (d2 − 1) variate and of order q2 . Denote p∗ = max(p + 1, q1 , q2 ). Then we assume further that (A2) The functions ls (xs ) have bounded continuous p ∗ th derivatives for l = 1, . . . , d1 , s = 1, . . . , d2 . ∞ (A3) The vector process {i }∞ i=1 = {(Yi , Xi , Ti )}i=1 is strictly stationary and -mixing k with the -mixing coefﬁcient (k)c , for some constants c > 0, 0 < < 1. The -mixing coefﬁcient is deﬁned as (k) = sup E n1

sup

A∈F∞ n+k

|P (A|Fn0 ) − P (A)|,

n where F∞ n+k and F0 denote the -algebras generated by {i , i n + k} and {0 , . . . , n } separately. According to (1.1) of Bosq (1998), the strong mixing coefﬁcient (k)(k)/2, hence

(k)c k /2.

(A.1)

(A4) The error term satisﬁes: 2 2+ < + ∞ for (a) The innovations {i }∞ i=1 are i.i.d with Ei = 0, Ei = 1 and E|i | some > 0. Also, the term i is independent of {(Xj , Tj ), j i} for all i > 1. (b) The conditional standard deviation function (x, t) is bounded and Lipschitz continuous.

(A5) The vector (X, T) has a joint probability density (x, t). The marginal densities of X, Xs and X−s are denoted by , s and −s , respectively. (a) Letting q ∗ = max(q1 , q2 ) − 1, we assume that (x, t) has bounded continuous q ∗ th partial derivatives with respect to x. And the marginal density is bounded away from zero on the support of the weight function w. (b) Let S(x) = E(TTT |X = x). We assume there exists a c > 0, such that S(x) cId2 uniformly for x ∈ supp(w). Here Id2 is the d2 × d2 identity matrix.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2525

(c) The random matrix TTT satisﬁes the Cramer’s moment condition, i.e., there exists a positive constant c, such that E|Tl Tl |k ck−2 k!E|Tl Tl |2 , and E|Tl Tl |2 c holds uniformly for k = 3, 4, . . . , and 1 l, l d1 . (A6) The bandwidths satisfy: √ q1 (a) For the bandwidth matrix H = diag{h01 , . . . , h0d2 } of Theorem 1, nhmax → 0 and

2 nhprod ∝ n for some > 0, where hmax = max{h01 , . . . , h0d2 }, hprod = di=1 h0i , and ∝ means proportional to. (b) For the bandwidths hs , and Gs = diag{g1 , . . . , gs−1 , gs+1 , . . . , gd2 −1 } of Theorem 1/2 q2 2, hs = O{n−1/(2p+3) }, nhs gprod ∝ n for some > 0 and (nh

s ln n) gmax → 0, where gmax = max{g1 , . . . , gs−1 , gs+1 , . . . , gd2 −1 }, gprod = s =s gs . (A7) The weight function w is nonnegative, has compact support with nonempty interior, and is Lipschitz continuous on its support. The proof of many results in this paper makes use of some inequalities about U-statistics and von Mises’ statistics of dependent variables derived from Yoshihara (1976). In general, let i , 1i n denote a strictly stationary sequence of random variables with values in R d and -mixing coefﬁcients (k), k = 1, 2, . . ., and r a ﬁxed positive integer. Let { n (F )} denote the functionals of the distribution function F of i n (F ) = gn (x1 , . . . , xm ) dF (x1 ) · · · dF (xm ), where {gn } are measurable functions symmetric in their m arguments such that |gn (x1 , . . . , xm )|2+ dF (x1 ) · · · dF (xm ) Mn < + ∞, sup

|gn (x1 , . . . , xm )|2+ dF i

1

(i1 ,...,im )∈Sc

,..., im (x1 , . . . , xm ) Mn,c

c = 0, . . . , m − 1

< + ∞, (A.2)

for some > 0, where Sc = {(i1 , . . . , im )|#r (i1 , . . . , im ) = c}, c = 0, . . . , m − 1 and for every (i1 , . . . , im ), 1i1 · · · im n, #r (i1 , . . . , im )= the number of j = 1, . . . , m − 1 satisfying ij +1 − ij < r. Clearly, the cardinality of each set Sc is less than nm−c . The von Mises’ differentiable statistic and the U-statistic n (Fn ) = gn (x1 , . . . , xm ) dFn (x1 ) · · · dFn (xm ) n n 1 · · · gn ( i1 , . . . , im ), nm i1 =1 im =1 1 Un = n ! gn ( i1 , . . . , im )

=

m

1 i1 <···
2526

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

allow decompositions as n (Fn ) = n (F ) +

m " # m

c

c=1

Vn(c) =

Vn(c) ,

c

gn,c (x1 , . . . , xc )

[dFn (xj ) − dF (xj )],

j =1

Un = n (F ) +

m " # m

c

c=1

Un(c) =

(n − c)! n!

Un(c) ,

gn,c (xi1 , . . . , xic )

1 i1 <···
c

[dIR d (xj − ij ) − dF (xj )], +

j =1

where gn,c are the projections of gn gn,c (x1 , . . . , xc ) =

gn (x1 , . . . , xm ) dF (xc+1 ) · · · dF (xm ),

c = 0, 1, . . . , m,

so that gn,0 = n (F ), gn = gn,m and IR d is the indicator function of the nonnegative part of +

d = {(y , . . . , y ) ∈ R d |y 0, j = 1, . . . , d}. R d , R+ 1 d j

Lemma A.1. If (k)C1 k −(2+ )/ , > > 0, then EV (c)2 n

−c + EU (c)2 n C(m, , r)n

2/(2+)

Mn

n

k/(2+) (k)

k=r+1

+

m−1 c =0

n

−c

2/(2+) Mn,c

r

/(2+)

k

(A.3)

(k)

k=1

for some constant C(m, , r) > 0. In particular, if one has (k) C2 k , 0 < < 1 then EV (c)2 n

−c + EU (c)2 n C(m, , r)C2 C( )n

2/(2+) Mn

+

m−1 c =0

n

−c

2/(2+) Mn,c

. (A.4)

Proof. The proof of Lemma 2 in Yoshihara (1976), which dealt with the special case of gn ≡ g, r = 1, Mn = Mn and yielded (A.3), provides an obvious venue of extension to the more general setup. Elementary arguments then establish (A.4) under geometric mixing conditions.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2527

A.2. Proofs of Theorems 1 and 2 For any x ∈ supp(w), we can write ZT W(x)Z =

n 1 KH (Xi − x)Ti TTi , n i=1

1 khs (Xis − xs )LGs (Xi,−s − x−s ) n i=1 % & $ Xis − xs T Xis − xs T p (Ti Ti ) × p hs hs n

ZTs Ws (x)Zs =

in which,

denotes the Kronecker product of matrices. Deﬁne also the following matrix

S (x) =

%

k(u)p(u)p(u)T du

(A.5)

S(x),

where S(x) = E(TTT |X = x) as deﬁned in (A5) (b). For any matrix A, |A| denotes the maximum absolute value of all elements in A. q1 q2 Lemma A.2. Let b1 = ln n(hmax + 1/ nhprod ), b2 = ln n(hs + gmax + 1/ nhs gprod ), and deﬁne the compact set B = supp(w) ⊂ R d2 . Under assumptions (A1)–(A6), as n → ∞, with probability one sup |ZT W(x)Z − (x)S(x)| = o(b1 ),

and

x∈B

sup |ZTs Ws (x)Zs − (x)S (x)| = o(b2 ).

x∈B

Proof. The proof of this lemma makes use of the covering technique and the Bernstain’s inequality for mixing data (Bosq, 1998, Theorem 1.4). The complete proof of this lemma can be found in Xue and Yang (2004). A.2.1. Proof of Theorem 1 q1 +2 Now let v1 be the integer satisfying b1v1 +1 = o(hmax ). Following immediately from Lemma A.2, one has {Z W(x)Z} T

−1

v1 S(x)−1 S(x)−1 ZT W(x)ZS −1 (x) = Id1 − − + Qc (x) (x) (x) (x) =

v1

=1

Qcv (x) + Qc (x),

v=1 q +2

1 where the matrix Qc (x) satisﬁes supx∈B |Qc (x)| = o(hmax ) w.p.1.

2528

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Also by observing that, elT {ZT W(Xi )Z}−1 ZT W(Xi )Zel = ll , where ll equals to 1 if l = l and equals to 0 otherwise, one has the following decomposition: v1 n 2 2 1 w(Xi ){cˆl − cl } = {Dci + Pci } + Fi + R c , n i=1

(A.6)

i=1 =1

i=1

2 with Rc = n1 ni=1 w(Xi ) ds=1 ls (Xis ), and the terms Dci , Pci , and Fi are given as follows: By denoting the operator c as c (A, B) =

n 1 w(Xi )A(Xi )ZT W(Xi )B, n i=1

we set, for i = 1, 2, and v = 1, . . . , v1 , Dci = c (Qc (x), Bi );

Pci = c (S−1 (x)/(x), Bi );

Fi = c (Qcv (x), Bi ), (A.7)

with B1 = [(Xj , Tj )j ]j =1,...,n ,

B2 =

d2 d1

{l s (Xj s ) − l s (Xis )}Tj l

l =1 s=1

. j =1,...,n

(A.8) To prove Theorem 1, we need the following lemmas. q +2

1 Lemma A.3. As n → +∞, |Dc1 | + |Dc2 | = o(hmax ) w.p.1.

√ Lemma A.4. For ﬁxed = 1, . . . , v1 , as n → +∞, |F1 | + |F2 | = o(b1 / n) w.p.1. Proof. For simplicity, consider only the case of F1 with = 1: i.e., n 1 S −1 (Xi ) ZT W(Xi )ZS −1 (Xi ) Id1 − ZT W(Xi )E w(Xi ) n (Xi ) (Xi ) i=1 & $ n 1 S(Xi ) E{ZT W(Xi )Z} −1 −1 = − S (Xi )ZT W(Xi )E w(Xi )S (Xi ) n (Xi ) 2 (Xi ) i=1 $ T & n 1 Z W(Xi )Z E{ZT W(Xi )Z} − w(Xi )S −1 (Xi ) − n 2 (Xi ) 2 (Xi )

F11 (x) =

i=1 −1

×S

(Xi )ZT W(Xi )E = P1 − P2 .

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2529

Let i = (Xi , Ti , i ), and deﬁne & S(Xi ) E{ZT W(Xi )Z} − (Xi ) 2 (Xi ) × S −1 (Xi )KH (Xj − Xi )Tj (Xj , Tj )j & $ S(Xj ) E{ZT W(Xj )Z} − + w(Xj )S −1 (Xj ) (Xj ) 2 (Xj )

gn (i , j ) = w(Xi )S −1 (Xi )

$

× S −1 (Xj )KH (Xi − Xj )Ti (Xi , Ti )i . Then P1 can be written as the von Mises’ differentiable statistic P1 = 2n1 2 (1)

n

(2)

i,j =1 gn (i , j ),

which can be decomposed as P1 = 21 {n (F ) + 2Vn + Vn }, in which, n (F ) = 0. (1)

(2)

In order to write down the explicit expressions of Vn , Vn , let Ei (En,i ) denote taking expectation with respect to the random vector indexed by i using the theoretical (empirical) (1) measure, both under the presumption of independence between i and j . One has Vn = Ei En,j gn (i , j ) = n1 nj=1 gn,1 (j ), in which $

& S(z) E{ZT W(z)Z} − (z) 2 (z) × S −1 (z)KH (Xj − z)(z) dzTj (Xj , Tj )j .

gn,1 (j ) =

w(z)S −1 (z)

(1)

Clearly gn,1 has mean 0 and variance of order b12 . So Vn = n1 For

(2) Vn ,

n

√

j =1 gn,1 (j ) = op (b1 /

n).

by Lemma A.1, under assumption (A3), one has for some small > 0 2 2 2 2+ 2+ −1 E(Vn(2) )2 cn−2 Mn2+ + Mn,0 , + Mn,1 n

where Mn , Mn,0 and Mn,1 satisfy the inequalities as in (A.2). Observe that 2+ Ei,j |gn (i , j )|2+ cb2+ 1 E|w(Xi )KH (Xj − Xi )Tj (Xj , Tj )j | −(1+2)(2+)/(2+2)

hprod

−(1+2)(2+)/(2+2)

So we can take Mn,0 = hprod

cb2+ 1 , and by setting the mixing coefﬁcient

−(1+2)(2+)/(2+2)

to 0, one also gets Mn = hprod −(2+)

cb2+ 1 hprod

cb2+ 1 c( ).

cb2+ 1 . Similarly, we can show that Mn,1 =

. So by taking small, one has E(P12 ) is less or equal to

(1+2)(2+) 2/(2+) " # − −(2+) 2/(2+) cn−2 hprod (2+2) b12+ + cn−3 b12+ hprod + cb21 /n cn−1 b12 . √ Similarly, we can show that EP 22 cn−1 b12 . So we have F11 = op (b1 / n). 1 Lemma A.5. Pc1 = Op (hmax ) = op (n−1/2 ) as n → ∞.

q

2530

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Proof. Let Kl∗ (X, T) = elT S−1 (X)T, then Pc1 is a von Mises’ statistic, with n of the form d d 2 1 w(z) ∗ {l s (xs ) − l s (zs )}tl (z) (x, t) dz dx dt. K (z, t)KH (x − z) (z) l l =1 s=1

After a change of variable u = H −1 (x − z), the above becomes d d 1 2 {l s (zs + h0,s us ) − l s (zs )}tl w(z)Kl∗ (z, t)K(u) l =1 s=1

× (z + H u, t) du dz dt, q

1 which is O(hmax ) by a Taylor expansion of ls (zs +h0,s us ) to q1 th degree and of (z+H u, t) to (q1 −1)th degree, which exist according to assumptions (A2) and (A5) (a). By assumption q1 q1 (A1), all the terms with order smaller than hmax disappear. So the leading term left is of hmax q1 q1 (1) (2) order. It is routine to verify that Vn and Vn are Op (hmax ) as well. Hence Pc1 = Op (hmax ) q1 and assumption (A6) (a) entails that Op (hmax ) = op (n−1/2 ).

We can ﬁnish the proof of Theorem 1 as follows. By a von Mises’ statistic argument, n log n 1 w(x) ∗ Pc2 = K (x, Tj )KH (Xj − x)(x) dx(Xj , Tj )j + op n (x) l n2 hprod j =1

(A.9) which, after the change of variable Xj = x + H u becomes 1 w(Xj )Kl∗ (Xj , Tj )(Xj , Tj )j + op (n−1/2 ). n n

(A.10)

j =1

Now let us return to the decomposition of A.2–A.5, one has

1 n i=1 w(Xi )(cˆl − cl ) as in (A.6). By Lemmas n

d2 n n 1 1 ∗ w(Xi )(cˆl − cl ) = w(Xj ) Kl (Xj , Tj )(Xj , Tj )j + ls (Xj s ) n n i=1

i=1

s=1

+ op (n−1/2 ). 2 If we deﬁne j = w(Xj ){Kl∗ (Xj , Tj )(Xj , Tj )j + ds=1 ls (Xj s )} = j 1 + j 2 , then by assumption (A4) (a) and the identiﬁcation condition (2.2), we have E(j )=0. Together with assumption (A3), {j } is a stationary -mixing process, with geometric -mixing coefﬁcient. By Minkowski’s inequality and assumptions (A1), (A4), (A5) and (A7), we have, for some > 0, E|j |2+ < + ∞. Next, deﬁne 2l = 2

+∞ j =1

cov(0 , j 2 ) + var(0 )

(A.11)

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2531

which is ﬁnite by Theorem 1.5 of Bosq (1998). Applying the central limit theorem for strongly mixing processes (Bosq, 1998, Theorem 1.7), we have √1n ni=1 j ⇒ N(0, 2l ). Theorem 1now follows immediately by assumption (A6) (a) on the bandwidths and the fact that n1 ni=1 w(Xi ) → 1 a.s. A.2.2. Proof of Theorem 2 Following similarly as in the proof of Theorem 1, let vs be an integer which satisﬁes p+2 b2vs = o(hs ). Then by Lemma A.2, one has s S −1 (x) {ZTs Ws (x)Zs }−1 − Qsv (x) + Qs (x), = (x)

v

(A.12)

v=1

" # ZT W (x)Z S −1 (x) v S −1 (x) where Qsv (x) = (x) I(p+1)d1 − s s (x)s , and the matrix Qs (x) satisﬁes p+2 supx∈B |Qs (x)| = o(hs )w.p.1. Also as in the proof of Theorem 1, by the equation that el {ZTs Ws (Xis )Zs }−1 ZTs Ws (Xis )Zs el = ll , l = 1, . . . , d1 , for ﬁxed l = 1, . . . , d1 and s = 1, . . . , d2 , we have the following decomposition: vs n 3 1 Dsi (xs ) + Psi (xs ) + w−s (Xi,−s ){ˆls (xs ) − ls (xs )} = Rvi (xs ) n i=1

where

v=1

i=1

+ R1 + R 2 ,

⎧ ⎫ n ⎨ ⎬ 1 R1 = w−s (Xi,−s ) ls (Xis ) , ⎩ ⎭ n s =s

i=1

(A.13)

n 1 R2 = w−s (Xi,−s ) (cˆl − cl ) n i=1

(A.14) and the other terms in (A.13) are given as follows. By deﬁning an operator s (A, C) =

n 1 w−s (Xi,−s ){elT AZTs Ws (xs , Xi,−s )C}, n i=1

we set for i = 1, 2, 3, and v = 1, . . . , vs , Dsi (xs ) = s (Qs (xs , Xi,−s ), Ci ), Psi (xs ) = s (S−1 /(xs , Xi,−s ), Ci ), Rvi (xs ) = s (Qsv {(xs , Xi,−s )}r , Ci ), where C1 is the vector of errors same as B1 in (A.8), and d p (v) 1 l s (xs ) v C2 = C2 (xs ) = l s (Xj s ) − (Xj s − xs ) Tj l , v! v=0 l =1 j =1,...,n ⎤ ⎡ d2 d1 C3 = C3 (Xi,−s ) = ⎣ {l s (Xj s ) − l s (Xis )}Tj l ⎦ . l =1 s =s

j =1,...,n

(A.15)

2532

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

The proof of Theorem 2 is completed by applying assumption (A6) (b) and the asymptotic results on each term in (A.13), which are presented in the following lemmas: Lemma A.6. As n → +∞,

√

√ √ √ nhs R1 = Op ( hs ), nhs R2 = Op ( hs ). p+2

Lemma A.7. As n → +∞, supxs ∈supp(ws ) |Ds1 (xs ) + Ds2 (xs ) + Ds3 (xs )| = o(hs w.p. 1.

)

Lemma A.8. For any ﬁxed v=1, . . . , vs , as n → +∞, supxs ∈supp(ws ) |Rv1 (xs )|+|Rv2 (xs )|+ √ |Rv3 (xs )| = o(b2v / nhs ) w.p. 1. Lemma A.9. As n → +∞ Ps1 =

n 1 w−s (Xj,−s ) 1 ∗ Xj s − xs Kls , xs , Xj,−s , Tj (xs , Xj,−s ) hs hs n j =1

× −s (Xj,−s )(Xj , Tj )j + op {(nhs log n)−1/2 }, p+1

Ps2 (xs ) = hs which

p+1

ls (xs ) + op (hs

1 1 (p+1) l s (xs ) (p + 1)!

d

ls (xs ) =

2 ), and Ps3 (xs ) = Op (gmax ) = op {(nhs log n)−1/2 }, in

q

up+1 E{w−s (X−s )Tl Kls∗ (u, xs , X−s , T)} du

l =1

(A.16) with Kls∗ (u, x, T) = elT S−1 (x)q ∗ (u, T)k(u), Furthermore

q ∗ (u, T) = (T, uT, . . . , up T)T .

(A.17)

√ L nhs Ps1 → N{0, 2ls (xs )}, in which

2ls (xs ) =

2 (z ) w−s −s K ∗2 (u, xs , z−s , t)2−s (z−s )2 (xs , z−s , t) 2 (xs , z−s ) ls × (xs , z−s , t) du dz−s dt.

(A.18)

Proof of Lemma A.6. This lemma follows immediately by applying the central limit theorem for strongly mixing process (Bosq, 1998, Theorem 1.7) on R1 and Theorem 1 on R2 , respectively. Proof of Lemmas A.7 and A.8. We have omitted these as they are similar to the proofs of Lemmas A.3, A.4.

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

2533

Proof of Lemma A.9. From the deﬁnition in (A.15) and using the von Mises’ statistic and change of variable arguments, Ps1 equals to n 1 w−s (Xj,−s ) ∗ Xj s − xs , xs , Xj,−s , Tj −s (Xj,−s )(Xj , Tj )j Kls hs nhs (xs , Xj,−s ) j =1

+ op {(nhs log n)−1/2 }. By assumption (A4) (a), the ﬁrst term is the average of a sequence of martingale differences. Then by the martingale central limit theorem of Liptser and Shirjaev (1980), the term √ nhs Ps1 is asymptotically normal with mean 0 and variance 2 (z ) w−s −s K ∗2 (u, xs , z−s , t)2−s (z−s )2 (xs , z−s , t) (xs , z−s , t) du dz−s dt 2 (xs , z−s ) ls + o(hs ) = 2ls (xs ) + o(hs ) in which the leading term 2ls (xs ) is as deﬁned in (A.18). For the term Ps2 (xs ), we have w−s (x−s ) 1 ∗ zs − xs K , xs , x−s , t LGs (z−s − x−s ) Ps2 (xs ) = (xs , x−s ) hs ls hs d p (v) 1 l s (xs ) v × l s (zs ) − (zs − xs ) tl v! l =1

v=0

× (z, t)−s (x−s ) dz dx−s dt{1 + op (1)}. p+1

After the change of variables zs =xs +hs u and z−s =x−s +Gs v, it equals to hs (xs )ls (xs )+ p+1 op (hs ), with ls (xs ) as deﬁned in (A.16). Lastly, the term Ps3 is seen to be w−s (x−s ) 1 ∗ zs − xs Kls , xs , x−s , t LGs (z−s − x−s ) (x , x ) h hs ⎤ ⎡ s −s s d d 2 1 ×⎣ {l s (zs ) − l s (xs )}tl ⎦ (z, t)−s (x−s ) dz dx−s dt{1 + op (1)} l =1 s =s

which after the change of variables, zs = xs + hs u and z−s = x−s + Gs v, is seen to be q2 Op (gmax )=op {(nh log n)−1/2 }, by a Taylor expansion to q2 th degree for l s and (q2 − 1)th degree for , using assumptions (A2) and (A5) (a). Then the result follows from assumption (A1) that L is a kernel function of q2 th order. References Bosq, D., 1998. Nonparametric Statistics for Stochastic Processes. Springer, New York. Brockwell, P.J., Davis, R.A., 1991. Time Series: Theory and Methods. Springer, New York. Cai, Z., Fan, J., Yao, Q.W., 2000. Functional-coefﬁcient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95, 941–956. Chen, R., Tsay, R.S., 1993a. Nonlinear additive ARX models. J. Amer. Statist. Assoc. 88, 955–967. Chen, R., Tsay, R.S., 1993b. Functional-coefﬁcient autoregressive models. J. Amer. Statist. Assoc. 88, 298–308.

2534

L. Xue, L. Yang / Journal of Statistical Planning and Inference 136 (2006) 2506 – 2534

Härdle, W., Liang, H., Gao, J.T., 2000. Partially Linear Models. Springer, Heidelberg. Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Chapman & Hall, London. Hastie, T.J., Tibshirani, R.J., 1993. Varying-coefﬁcient models. J. Roy. Statist. Soc. Ser. B 55, 757–796. Hoover, D.R., Rice, J.A., Wu, C.O., Yang, L.P., 1998. Nonparametric smoothing estimates of time-varying coefﬁcient models with longitudinal data. Biometrika 85, 809–822. Linton, O.B., Härdle, W., 1996. Estimation of additive regression models with known links. Biometrika 83, 529–540. Linton, O.B., Nielsen, J.P., 1995. A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82, 93–100. Liptser, R.Sh., Shirjaev, A.N., 1980. A functional central limit theorem for martingales. Theory Probab. Appl. 25, 667–688. Mammen, E., 1992. When Does Bootstrap Work: Asymptotic Results and Simulations. Lecture Notes in Statistics, vol. 77, Springer, Berlin. Seifert, B., Gasser, T., 1996. Finite-sample variance of local polynomial: analysis and solutions. J. Amer. Statist. Assoc. 91, 267–275. Sperlich, S., TjZstheim, D.,Yang, L., 2002. Nonparametric estimation and testing of interaction in additive models. Econom. Theory 18, 197–251. Stone, C.J., 1985. Additive regression and other nonparametric models. Ann. Statist. 13, 689–705. Tong, H., 1990. Nonlinear Time Series: A Dynamical System Approach. Oxford University Press, Oxford, UK. Xia, Y.C., Li, W.K., 1999. On single-index coefﬁcient regression models. J. Amer. Statist. Assoc. 94, 1275–1285. Xue, L., Yang, L, 2004. Estimation of semiparametric additive coefﬁcient model. Research Manuscript 628, Department of Statistics and Probability, MSU. Downloadable at http://www.msu.edu/∼ yangli/addeoeffﬁnal.pdf. Yang, L., Tschernig, R., 2002. Non- and semi-parametric identiﬁcation of seasonal nonlinear autoregression models. Econom. Theory 18, 1408–1448. Yang, L., Härdle, W., Park, B.U., Xue, L., 2004. Estimation and testing of varying coefﬁcients with marginal integration. J. Amer. Statist. Assoc., under revision. Yoshihara, K., 1976. Limiting behavior of U-statistics for stationary, absolutely regular processes. Z. Wahrscheinlichkeitstheorie verwandte Gebiete 35, 237–252.

Estimation of semi-parametric additive coefficient model

Estimation of semi-parametric additive coefficient model

Recommend Documents