On splines approximation for sliced average variance estimation

Journal of Statistical Planning and Inference 139 (2009) 1493 -- 1505


Zhou Yu a, Li-Ping Zhu a, Li-Xing Zhu a,b,∗

a East China Normal University, Shanghai, China
b Hong Kong Baptist University, Hong Kong, China

ARTICLE INFO

Article history:
Received 27 April 2006
Received in revised form 18 July 2008
Accepted 25 July 2008
Available online 20 August 2008

Keywords:
Asymptotic normality
B-spline
Bayes information criterion
Dimension reduction
Sliced average variance estimation
Structural dimension

ABSTRACT

To avoid the inconsistency and the slow convergence rate of the slicing estimator of sliced average variance estimation (SAVE), particularly in the continuous response case, we suggest a B-spline approximation that makes the estimator √n consistent while keeping the easy implementation that slicing estimation enjoys. Compared with the kernel estimation used in the literature, the B-spline approximation is of higher accuracy and is easier to implement. To estimate the structural dimension of the central dimension reduction space, a modified Bayes information criterion is suggested, which makes the leading term and the penalty term comparable in magnitude. This modification helps to enhance the efficacy of estimation. The methodologies and theoretical results are illustrated through an application to the horse mussel data and simulation comparisons with existing methods. © 2008 Elsevier B.V. All rights reserved.

1. Introduction

Consider the regression of a one-dimensional response Y on a p-dimensional predictor vector X = (X_1, . . . , X_p)^T, where T denotes the transpose operator. When p becomes large, the well-known curse of dimensionality causes data sparseness, and the estimation of the regression function suffers from it badly. Recent developments to circumvent this issue focus on semiparametric models, such as multi-index models in which Y is related to X through K linear combinations of X, say B^T X, where B is a p × K matrix. This is equivalent to saying that Y is independent of X when B^T X is given. When K is much smaller than p, this model achieves the goal of dimension reduction, and the column space of B is called a dimension reduction subspace. As B is not unique, Cook (1998) defined the central dimension reduction (CDR) subspace as the intersection of all dimension reduction subspaces satisfying this conditional independence, denoted by S_{Y|X}. Cook (1996, 1998) showed that the CDR subspace exists under some mild conditions. Denote the positive definite covariance matrix of X by Σ. When the standardized version of X, Z = Σ^{-1/2}(X − E(X)), is used, the CDR subspace satisfies S_{Y|Z} = Σ^{1/2} S_{Y|X} (see Cook, 1998, Chapters 10, 11). That is, the CDR subspace based on the standardized variable Z can be easily transformed back to that based on the original predictor vector X. Therefore, we shall use the standardized variable Z throughout this paper.
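As a quick illustration of this standardization step, here is a minimal Python sketch (our own, not part of the original paper; the function name standardize is ours). It forms Z = Σ̂^{-1/2}(X − X̄) from a data matrix and returns Σ̂^{-1/2}, which maps directions estimated on the Z scale back to the X scale.

```python
import numpy as np

def standardize(X):
    """Standardize an n x p data matrix: Z = Sigma^{-1/2} (X - mean).

    Also returns Sigma^{-1/2}; since S_{Y|Z} = Sigma^{1/2} S_{Y|X}, a direction
    eta estimated on the Z scale corresponds to Sigma^{-1/2} eta on the X scale.
    """
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)                      # Sigma = V diag(vals) V^T
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T  # symmetric inverse square root
    return Xc @ Sigma_inv_half, Sigma_inv_half
```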

夡 The first author was supported by the National Social Science Foundation of China (08CTJ001), and the second author was supported by a scholarship under the State Scholarship Fund ([2007]3020). The third author was supported by an RGC grant from the Research Grants Council of Hong Kong, Hong Kong, China. The authors thank the editor, the associate editor and the referees for their constructive comments and suggestions, which led to a significant improvement in the presentation of an early draft.
∗ Corresponding author at: Hong Kong Baptist University, Hong Kong, China. E-mail address: [email protected] (L.-X. Zhu).
0378-3758/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2008.07.017


Sliced inverse regression (SIR) (Li, 1991) and sliced average variance estimation (SAVE) (Cook and Weisberg, 1991) are two promising tools that can be used to recover the CDR subspace. SIR estimates the CDR subspace through the eigenvectors that are associated with the non-zero eigenvalues of the candidate matrix Cov[E(Z|Y)], while SAVE uses the candidate matrix E[I_p − Cov(Z|Y)]^2, where I_p is the p × p identity matrix. When SIR is used, we need the following linearity condition:

E(Z | P_{S_{Y|Z}} Z) = P_{S_{Y|Z}} Z,   (1.1)

and for SAVE, we need in addition the constant variance condition:

Cov(Z | P_{S_{Y|Z}} Z) = I_p − P_{S_{Y|Z}},   (1.2)

where P_{(·)} stands for the projection operator based on the standard inner product (Li, 1991; Cook, 1998). How to estimate the candidate matrices of SIR and of SAVE from the observations is naturally of importance. There are two proposals in the literature: slicing estimation and kernel smoothing. Slicing estimation, proposed by Li (1991), has become one of the standard methods in this area. The idea of slicing estimation is to divide the range of Y into several slices, and then to estimate the SIR matrix through the average of the covariance matrices of Z within each slice. Zhu and Ng (1995) established the asymptotic normality of the slicing estimator of the SIR matrix when the number of slices ranges from √n to n/2. Note that the slicing algorithm is easy to implement in practice and can also be applied to estimate the SAVE matrix (Cook and Weisberg, 1991). However, further study demonstrates that the asymptotic behavior of the slicing estimator of SAVE is very different from that of SIR. Unlike SIR, which is insensitive to the number of slices, Cook (2000) pointed out that for SAVE the number of slices plays the role of a smoothing parameter, and thus SAVE may be affected by this choice. The empirical study of Zhu et al. (2007) also indicated that the slicing estimator of SAVE is sensitive to the selection of the number of slices. A rather surprising finding was obtained by Li and Zhu (2007), who provided a systematic study on the convergence of the slicing estimator of the SAVE matrix. A rigorous proof shows that when Y is continuous, the estimator cannot be √n consistent, and when each slice contains a fixed number of data points not depending on n, it is even inconsistent. This observation motivates us to consider alternatives to slicing estimation. Clearly, any sophisticated nonparametric smoother can be an alternative. Zhu and Zhu (2007) suggested kernel estimation, following the idea of Zhu and Fang (1996) for SIR. Although the kernel estimator can achieve good theoretical properties, its implementation is much more difficult than that of the slicing estimator. To keep computational efficacy and √n consistency simultaneously, in the first part of this paper we suggest a B-spline approximation, owing to its least-squares nature and parametric form. The asymptotic normality can be achieved for a wide range of numbers of knots. Compared with the kernel method, the B-spline approximation gains higher efficiency, especially when the sample size is relatively small.

To determine the structural dimension of the CDR subspace, the sequential test proposed by Li (1991) has received much attention. However, this test depends heavily on the asymptotic normality of the related estimator and its covariance matrix. In Section 2, we will see that although asymptotic normality can be achieved by the B-spline approximation, the covariance matrix has a very complicated structure, which is always the case when estimating the SAVE matrix. The sequential test procedure thus suffers from a complicated plug-in estimation. Zhu et al. (2006) proposed a BIC-type criterion to estimate the structural dimension when SIR is applied. For SAVE, Zhu and Zhu (2007) suggested a Bayes information criterion (BIC) to consistently estimate the dimension when kernel estimation is employed. Our empirical studies suggest that, when the sample size is relatively small, the estimate of the structural dimension under this criterion tends to vary more than we would expect. We address this problem in the second part of our paper.
We modify the BIC suggested in Zhu and Zhu (2007) by rescaling the leading term and the penalty term to make them comparable in magnitude. The empirical study shows that, after such an adjustment, this modified BIC-type criterion achieves higher accuracy. We also prove the consistency of the resulting estimator of the structural dimension.

The rest of the paper is organized as follows. The asymptotic results of the B-spline approximation are presented in Section 2. In Section 3, we discuss the estimation of the structural dimension. Section 4 reports a simulation study that demonstrates the efficiency of the B-spline approximation and of the modified BIC-type criterion. We use the horse mussel data to show that the convergence of the B-spline approximation holds for a wide range of numbers of knots. It is also shown that our proposed BIC-type criterion for estimating the structural dimension is very insensitive to the choice of the number of knots. All the technical details are postponed to the Appendix.

2. Asymptotic behavior of the splines approximation

Let Z and its independent copies z_j be

Z = (Z_1, . . . , Z_p)^T,   z_j = (z_{1j}, . . . , z_{pj})^T,   j = 1, . . . , n,

and define A^2 = AA for any square symmetric matrix A. Then the population version of the SAVE matrix is

Λ = E(I_p − Cov(Z|Y))^2 = I_p − 2E(Cov(Z|Y)) + E(Cov(Z|Y))^2.

For notational simplicity, let R_kl(y) = E(Z_k Z_l | Y = y) and r_k(y) = E(Z_k | Y = y), and let δ_kl = 1 if k = l and δ_kl = 0 otherwise, for 1 ≤ k, l ≤ p. Then the kl-th element Λ_kl of Λ can be written as

Λ_kl = δ_kl − 2E( R_kl(Y) − r_k(Y) r_l(Y) ) + E( Σ_{h=1}^{p} ( R_kh(Y) R_hl(Y) − R_kh(Y) r_h(Y) r_l(Y) − r_k(Y) r_h(Y) R_hl(Y) + r_k(Y) r_l(Y) r_h^2(Y) ) ).


Now we introduce the B-spline approximation to estimate the conditional expectations involved. A spline is a piecewise polynomial that is smoothly connected at its knots. Specifically, for any fixed integer m > 1, denote by S(m, t) the set of spline functions with knots t = {0 = t_0 < t_1 < · · · < t_{k_0+1} = 1}; then for m ≥ 2,

S(m, t) = {s ∈ C^{m−2}[0, 1] : s(y) is a polynomial of degree (m − 1) on each subinterval [t_i, t_{i+1}]}.

The common choices of m are two for linear splines, three for quadratic splines and four for cubic splines. Here k_0 is referred to as the number of internal knots. It is convenient to express elements of S(m, t) in terms of B-splines. For any fixed m and t, let

N_{i,m}(y) = (t_i − t_{i−m}) [t_{i−m}, . . . , t_i] (t − y)_+^{m−1},   i = 1, . . . , J = k_0 + m,

where [t_{i−m}, . . . , t_i]g denotes the m-th order divided difference of the function g and t_i = t_{min(max(i,0), k_0+1)} for any i = 1 − m, . . . , J. Then {N_{i,m}(·)}_{i=1}^{J} form a basis for S(m, t); see Schumaker (1981, p. 124). That is, for any s(y) ∈ S(m, t) there exists an α such that s(y) = α^T N_m(y), where N_m(y) = (N_{1,m}(y), . . . , N_{J,m}(y))^T. For notational convenience, N_m(·) will be abbreviated as N(·) in the sequel.

Consider the spline approximation of a conditional expectation, say r_j(y), based on the sample. The estimator of order m for r_j(y) is defined to be the least squares minimizer r̂_j(y) ∈ S(m, t) satisfying

Σ_{i=1}^{n} (z_{ji} − r̂_j(y_i))^2 = min_{s_j(y) ∈ S(m,t)} Σ_{i=1}^{n} (z_{ji} − s_j(y_i))^2.

Write G_{J,n} = (1/n) Σ_{j=1}^{n} N(y_j) N^T(y_j) and its expectation G(q) = E(N(Y) N^T(Y)). The B-spline approximation of r_k(y) is then given by

r̂_k(y) = N^T(y) (n G_{J,n})^{-1} Σ_{l=1}^{n} N(y_l) z_{kl}.

In a similar fashion, R_kl(y) is estimated by

R̂_kl(y) = N^T(y) (n G_{J,n})^{-1} Σ_{i=1}^{n} N(y_i) z_{ki} z_{li}.

Therefore, replacing the unknowns by their estimators, the kl-th element Λ̂_kl of the B-spline approximation Λ̂ can be written as

Λ̂_kl = δ_kl − (2/n) Σ_{j=1}^{n} ( R̂_kl(y_j) − r̂_k(y_j) r̂_l(y_j) ) + (1/n) Σ_{j=1}^{n} Σ_{h=1}^{p} ( R̂_kh(y_j) R̂_hl(y_j) − R̂_kh(y_j) r̂_h(y_j) r̂_l(y_j) − r̂_k(y_j) r̂_h(y_j) R̂_hl(y_j) + r̂_k(y_j) r̂_l(y_j) r̂_h^2(y_j) ).   (2.1)
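To make the estimator concrete, here is a minimal Python sketch of the pipeline behind (2.1) — our own illustration, not code from the paper, and the function names bspline_basis and save_bspline are ours. It evaluates a B-spline basis by the Cox–de Boor recursion with clamped boundary knots (a convention equivalent to the repeated boundary knots t_i = t_{min(max(i,0),k_0+1)} above), computes the least-squares fits r̂_k(y_j) and R̂_kl(y_j) at the sample points, and assembles Λ̂; note that (2.1) is exactly the kl-th element of (1/n) Σ_j (I_p − Ĉ(y_j))^2 with Ĉ(y_j) = R̂(y_j) − r̂(y_j) r̂(y_j)^T.

```python
import numpy as np

def bspline_basis(y, internal_knots, order=3):
    """J = k0 + m B-spline basis functions of the given order (= degree + 1),
    evaluated at y via the Cox-de Boor recursion with clamped boundary knots."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()
    t = np.concatenate([np.repeat(lo, order),
                        np.sort(np.asarray(internal_knots, dtype=float)),
                        np.repeat(hi, order)])
    J = len(t) - order
    B = np.zeros((len(y), len(t) - 1))           # order-1 (piecewise constant) basis
    for i in range(len(t) - 1):
        B[:, i] = (t[i] <= y) & (y < t[i + 1])
    B[y == hi, J - 1] = 1.0                      # include the right endpoint
    for m in range(2, order + 1):                # Cox-de Boor recursion; 0/0 terms dropped
        Bm = np.zeros((len(y), len(t) - m))
        for i in range(len(t) - m):
            left = right = 0.0
            if t[i + m - 1] > t[i]:
                left = (y - t[i]) / (t[i + m - 1] - t[i]) * B[:, i]
            if t[i + m] > t[i + 1]:
                right = (t[i + m] - y) / (t[i + m] - t[i + 1]) * B[:, i + 1]
            Bm[:, i] = left + right
        B = Bm
    return B                                     # n x J design matrix

def save_bspline(Z, y, internal_knots, order=3):
    """B-spline SAVE estimator: returns Lambda_hat of (2.1) for standardized Z (n x p)."""
    n, p = Z.shape
    N = bspline_basis(y, internal_knots, order)
    coef, *_ = np.linalg.lstsq(N, Z, rcond=None)         # spline regression of each Z_k on Y
    rhat = N @ coef                                      # rhat[j, k]  ~ r^_k(y_j)
    ZZ = np.einsum("jk,jl->jkl", Z, Z).reshape(n, p * p)
    coef2, *_ = np.linalg.lstsq(N, ZZ, rcond=None)       # spline regression of each Z_k Z_l on Y
    Rhat = (N @ coef2).reshape(n, p, p)                  # Rhat[j, k, l] ~ R^_kl(y_j)
    Chat = Rhat - np.einsum("jk,jl->jkl", rhat, rhat)    # fitted Cov(Z | y_j)
    M = np.eye(p)[None, :, :] - Chat
    return np.mean(M @ M, axis=0)                        # (1/n) sum_j (I_p - C^(y_j))^2

# Estimated CDR directions on the Z scale: leading eigenvectors of Lambda_hat,
# e.g. evals, evecs = np.linalg.eigh(save_bspline(Z, y, knots)); keep the largest K.
```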

We adopt matrix vectorization before presenting the main theorem. For a symmetric p × p matrix C = (c_kl)_{p×p}, let Vech(C) = (c_11, . . . , c_p1, c_22, . . . , c_p2, c_33, . . . , c_pp) be the p(p + 1)/2-dimensional vector of its lower-triangular elements. Then C and Vech(C) are in one-to-one correspondence. We are now in a position to introduce the theoretical results. Define the kl-th element of the matrix H(Z, Y) as

H_kl(Z, Y) = −2 ( Z_k Z_l + R_kl(Y) − Z_k r_l(Y) − Z_l r_k(Y) − r_k(Y) r_l(Y) − 2E(R_kl(Y)) + 3E(r_k(Y) r_l(Y)) )
  + Σ_{h=1}^{p} ( Z_h Z_l R_kh(Y) + Z_k Z_h R_hl(Y) + R_kh(Y) R_hl(Y) − 3E(R_kh(Y) R_hl(Y)) − Z_k Z_h r_h(Y) r_l(Y) − Z_h r_l(Y) R_kh(Y) − Z_l r_h(Y) R_kh(Y) − R_kh(Y) r_l(Y) r_h(Y) − Z_l Z_h r_h(Y) r_k(Y) − Z_h r_k(Y) R_lh(Y) − Z_k r_h(Y) R_lh(Y) − R_lh(Y) r_k(Y) r_h(Y) + r_h^2(Y) r_k(Y) r_l(Y) + Z_l r_h^2(Y) r_k(Y) + Z_k r_h^2(Y) r_l(Y) + 2 Z_h r_h(Y) r_l(Y) r_k(Y) + 4E(R_kh(Y) r_l(Y) r_h(Y)) + 4E(R_lh(Y) r_k(Y) r_h(Y)) − 5E(r_h^2(Y) r_k(Y) r_l(Y)) ).   (2.2)

The asymptotic normality is stated in the following theorem.

Theorem 1. In addition to (1.1) and (1.2), assume that conditions (1)–(4) in Appendix A.1 hold. Then as n → ∞, we have

√n (Λ̂ − Λ) → H(Z, Y)   in distribution,   (2.3)

where Vech(H(Z, Y)) is distributed as N(0, Cov(Vech(H(Z, Y)))).

From Theorem 1, we can derive the asymptotic normality of the eigenvalues and their corresponding normalized eigenvectors by using perturbation theory. The following result is parallel to those for SIR obtained by Zhu and Fang (1996) and Zhu and Ng (1995). Let λ_1(A) ≥ λ_2(A) ≥ · · · ≥ λ_p(A) ≥ 0 and b_i(A) = (b_{1i}(A), . . . , b_{pi}(A))^T, i = 1, . . . , p, denote, respectively, the eigenvalues and the corresponding normalized eigenvectors of a p × p matrix A.


Theorem 2. In addition to the conditions of Theorem 1, assume that the nonzero λ_l(Λ)'s are distinct. Then for each nonzero eigenvalue λ_i(Λ) and the corresponding normalized eigenvector b_i(Λ), we have

√n ( λ_i(Λ̂) − λ_i(Λ) ) = √n b_i(Λ)^T (Λ̂ − Λ) b_i(Λ) + o_p(√n ‖Λ̂ − Λ‖)
  = b_i(Λ)^T H b_i(Λ) + o_p(√n ‖Λ̂ − Λ‖),   (2.4)

where H is given in Theorem 1, and as n → ∞,

√n ( b_i(Λ̂) − b_i(Λ) ) = √n Σ_{l=1, l≠i}^{p} b_l(Λ) b_l(Λ)^T (Λ̂ − Λ) b_i(Λ) / ( λ_i(Λ) − λ_l(Λ) ) + o_p(√n ‖Λ̂ − Λ‖)
  = Σ_{l=1, l≠i}^{p} b_l(Λ) b_l(Λ)^T H b_i(Λ) / ( λ_i(Λ) − λ_l(Λ) ) + o_p(√n ‖Λ̂ − Λ‖),   (2.5)

where ‖Λ̂ − Λ‖ = max_{1≤i,j≤p} |a_ij| with (a_ij)_{p×p} = Λ̂ − Λ.

3. Estimation of the structural dimension

When we use SAVE, sequential tests for estimating the structural dimension may be inappropriate for the following reasons: asymptotic normality does not hold when slicing estimation is used, and the plug-in estimation of the limiting covariance matrix of the estimator is difficult to implement when kernel or spline approximation is applied. Zhu et al. (2006) built a bridge between the classical model selection criterion BIC and SIR to overcome this problem. Their modified BIC-type criterion can be extended to handle the SAVE case, since consistency of their structural dimension estimator only requires convergence of the eigenvalues of the associated matrix.

Let Λ* = Λ + I_p and Λ̂* = Λ̂ + I_p. Recall the definition of λ_i(A). Clearly, λ_i(Λ*) = λ_i(Λ) + 1. Determining the structural dimension now becomes estimating K, the number of eigenvalues of Λ* that are greater than 1. Let τ denote the number of λ_i(Λ̂*) greater than 1.

Zhu and Zhu (2007) define the criterion

V(k) = (n/2) Σ_{i=1+min(τ,k)}^{p} ( ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*) ) + (p − k) ln(n).   (3.1)

The estimator of the structural dimension K is then defined as the maximizer K̂ of V(k) over k ∈ {0, 1, . . . , p − 1}. Since the constant p ln(n) does not affect the estimation, it can simply be ignored. Based on our empirical studies, this criterion tends to give an incorrect estimate of the structural dimension with a relatively large probability when the sample size is small. This might be due to the fact that the penalty term in (3.1) is sometimes problematic: the eigenvalue term and the dimension p may not be comparable in magnitude. We therefore modify this criterion by rescaling both the leading term and the penalty term, and define G(·) as

G(k) = n [ Σ_{i=1}^{k} ( ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*) ) ] / [ Σ_{i=1}^{p} ( ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*) ) ] − 2 ln(n) k^2 / p.   (3.2)

Note that we divide Σ_{i=1}^{k} ( ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*) ) by Σ_{i=1}^{p} ( ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*) ) to make this ratio not exceed one, and we multiply 2k ln(n) by k/p, which does not exceed one either. The estimator of K is defined as the maximizer K̂ of G(k) over k ∈ {1, . . . , p}, that is,

G(K̂) = max_{1≤k≤p} G(k).   (3.3)
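A minimal Python sketch of criterion (3.2)–(3.3) follows (ours; the function name estimate_dimension is not from the paper), assuming Lambda_hat is the estimated SAVE matrix of Section 2:

```python
import numpy as np

def estimate_dimension(Lambda_hat, n):
    """Modified BIC-type criterion (3.2)-(3.3): maximize G(k) over k = 1, ..., p,
    using the eigenvalues of Lambda_hat* = Lambda_hat + I_p."""
    p = Lambda_hat.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Lambda_hat + np.eye(p)))[::-1]  # lambda_1 >= ... >= lambda_p
    terms = np.log(lam) + 1.0 - lam              # ln(lambda_i) + 1 - lambda_i, each <= 0
    ks = np.arange(1, p + 1)
    G = n * np.cumsum(terms) / terms.sum() - 2.0 * np.log(n) * ks**2 / p
    K_hat = int(ks[np.argmax(G)])
    return K_hat, G
```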

The estimator obtained through this modified BIC-type criterion inherits the consistency property, as stated in the following theorem.

Theorem 3. Under the conditions of Theorem 1, K̂ converges to K in probability.

Not surprisingly, rescaling the leading term and the penalty term results in better estimation accuracy, as we will see in the following section.

4. Illustrative examples

In this section, we demonstrate our methodologies and theoretical results through the horse mussel dataset and a simulation study. As in other B-spline approximation settings, the first concern is the choice of the number of knots and their locations. Generalized cross-validation (GCV) is a frequently used method. However, GCV cannot be applied directly here, because we need a large number of knots to obtain an undersmoothing estimator of the regression curve; see Stute and Zhu (2005). We therefore suggest a semi-data-driven algorithm, which is easy to implement (a small sketch follows below): write k_{0n} for the number of knots selected by GCV, and choose k_n to be the integer closest to k_{0n} × n^{2/15}; take t_i to be the (i/k_n)-th quantile of the observed values of the response, so that the knots are uniformly scattered over its range.
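The following short sketch (ours) gives one literal reading of this knot rule; k0n is assumed to come from a GCV run, which is not shown, and the helper name select_knots is our own.

```python
import numpy as np

def select_knots(y, k0n):
    """Semi-data-driven knot rule: inflate the GCV knot number by n^{2/15}
    and place the internal knots at equally spaced quantiles of y."""
    n = len(y)
    kn = int(round(k0n * n ** (2.0 / 15.0)))
    probs = np.arange(1, kn + 1) / kn        # the i/kn-th quantiles, i = 1, ..., kn
    return np.quantile(y, probs)
```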


Table 1
The empirical R^2 with n = 400.

                              Model 1   Model 2   Model 3   Model 4
SAVE B-spline (σ^2 = 1)       0.9886    0.9870    0.9758    0.9645
SAVE Kernel   (σ^2 = 1)       0.9837    0.9824    0.9716    0.9498
SAVE B-spline (σ^2 = 1.2)     0.9882    0.9942    0.9819    0.9650
SAVE Kernel   (σ^2 = 1.2)     0.9865    0.9863    0.9840    0.9332
SAVE B-spline (σ^2 = 1.4)     0.9702    0.9689    0.9640    0.9522
SAVE Kernel   (σ^2 = 1.4)     0.9697    0.9665    0.9760    0.9308
SAVE B-spline (σ^2 = 1.6)     0.9704    0.9687    0.9598    0.9444
SAVE Kernel   (σ^2 = 1.6)     0.9645    0.9546    0.9557    0.9201
SAVE B-spline (σ^2 = 1.8)     0.9760    0.9669    0.9648    0.9305
SAVE Kernel   (σ^2 = 1.8)     0.9602    0.9424    0.9449    0.9106

Table 2
The empirical R^2 with n = 200.

                              Model 1   Model 2   Model 3   Model 4
SAVE B-spline (σ^2 = 1)       0.9759    0.9780    0.9743    0.9530
SAVE Kernel   (σ^2 = 1)       0.9724    0.9750    0.9565    0.9313
SAVE B-spline (σ^2 = 1.2)     0.9614    0.9643    0.9644    0.9497
SAVE Kernel   (σ^2 = 1.2)     0.9401    0.9511    0.9434    0.9106
SAVE B-spline (σ^2 = 1.4)     0.9663    0.9623    0.9522    0.9391
SAVE Kernel   (σ^2 = 1.4)     0.9280    0.9348    0.9223    0.9002
SAVE B-spline (σ^2 = 1.6)     0.9541    0.9481    0.9546    0.9349
SAVE Kernel   (σ^2 = 1.6)     0.8908    0.9046    0.9058    0.8958
SAVE B-spline (σ^2 = 1.8)     0.9479    0.9388    0.9451    0.9212
SAVE Kernel   (σ^2 = 1.8)     0.8647    0.8891    0.8866    0.8534

4.1. Simulation study

In this subsection we compare the performance of the B-spline approximation with the kernel method. The efficiency of our proposed BIC-type criterion for determining the structural dimension is also shown here. The trace correlation coefficient R^2 = trace(P_B P_B̂)/K proposed by Ferre (1998) is adopted to measure the distance between the estimated CDR subspace span(B̂) and the true CDR subspace span(B); see Ferre (1998) for more details. The data are generated from the following four models with n = 200 and 400:

Model 1: y = (β^T x)^2 + ε;
Model 2: y = (β^T x)^4 + ε;
Model 3: y = (β_1^T x)^4 + (β_2^T x)^2 + ε;
Model 4: y = (β_1^T x)^2 + (β_2^T x)^2 × ε.

In these models, the predictor x and the error ε are independent and follow, respectively, the normal distributions N(0, I_10) and N(0, σ^2), where σ^2 = 1, 1.2, 1.4, 1.6 and 1.8. In the simulations, β = (1, 1, 0, 0, . . . , 0)^T for Models 1 and 2, and β_1 = (1, 0, 0, 0, . . . , 0)^T, β_2 = (0, 1, 0, 0, . . . , 0)^T for Models 3 and 4. Zhu and Zhu (2007) showed the superiority of kernel estimation over slicing estimation with n = 400 and σ^2 = 1. To make a comprehensive comparison, Tables 1 and 2 report the median of R^2, evaluating the performance of the B-spline approximation and the kernel estimation for different n and σ^2, based on 100 data replications. For the kernel estimation, we choose the resulting bandwidth h_final suggested in Zhu and Zhu (2007), and k_n as the working number of knots for the B-spline approximation.

As can be seen from Table 1, when the sample size n is large, both methods work well; this verifies their consistency for large samples. Table 2 indicates that the B-spline approximation outperforms the kernel method for a moderate sample size of 200. The kernel smoothing deteriorates quickly with increasing σ^2, while the B-spline approximation retains much higher accuracy. We also tried different numbers of knots, from 2 to 6, for sample sizes 200 and 400. The values of R^2 did not change much, so we do not report these results here. This indicates that, for the B-spline approximation of SAVE, the choice of the number of knots is not crucial.
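For reference, here is a minimal sketch (ours; the helper name trace_correlation is not from the paper) of the trace correlation used above.

```python
import numpy as np

def trace_correlation(B, B_hat):
    """Ferre's (1998) trace correlation R^2 = trace(P_B P_Bhat) / K, where P_A is the
    orthogonal projection onto the column space of A and K is the number of columns of B."""
    def proj(A):
        Q, _ = np.linalg.qr(np.asarray(A, dtype=float))   # orthonormal basis of span(A)
        return Q @ Q.T
    K = B.shape[1]
    return np.trace(proj(B) @ proj(B_hat)) / K
```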


Table 3
The frequency of the decisions of dimension with n = 400 when criterion G(·) is used.

          D=1     D=2     D=3   D=4   D=5   D=6   D=7   D=8   D=9   D=10
Model 1   1       0       0     0     0     0     0     0     0     0
Model 2   1       0       0     0     0     0     0     0     0     0
Model 3   0       1       0     0     0     0     0     0     0     0
Model 4   0.020   0.980   0     0     0     0     0     0     0     0

D stands for dimension.

Table 4
The frequency of the decisions of dimension with n = 400 when criterion V(·) is used.

          D=0   D=1     D=2     D=3     D=4   D=5   D=6   D=7   D=8   D=9
Model 1   0     0.970   0.030   0       0     0     0     0     0     0
Model 2   0     0.985   0.015   0       0     0     0     0     0     0
Model 3   0     0.080   0.915   0.005   0     0     0     0     0     0
Model 4   0     0.050   0.950   0       0     0     0     0     0     0

D stands for dimension.

Table 5
The frequency of the decisions of dimension with n = 200 when criterion G(·) is used.

          D=1     D=2     D=3     D=4   D=5   D=6   D=7   D=8   D=9   D=10
Model 1   1       0       0       0     0     0     0     0     0     0
Model 2   0.980   0.010   0.010   0     0     0     0     0     0     0
Model 3   0.080   0.920   0       0     0     0     0     0     0     0
Model 4   0.060   0.940   0       0     0     0     0     0     0     0

D stands for dimension.

Table 6
The frequency of the decisions of dimension with n = 200 when criterion V(·) is used.

          D=0   D=1     D=2     D=3   D=4   D=5   D=6   D=7   D=8   D=9
Model 1   0     0.915   0.085   0     0     0     0     0     0     0
Model 2   0     0.930   0.070   0     0     0     0     0     0     0
Model 3   0     0.080   0.920   0     0     0     0     0     0     0
Model 4   0     0.105   0.895   0     0     0     0     0     0     0

D stands for dimension.

Now let us investigate the efficiency of the two criteria, V(·) proposed in Zhu and Zhu (2007) and G(·) proposed in this paper, for determining the structural dimension. Two hundred Monte Carlo samples are generated for n = 200 and 400 with σ^2 = 1. From Tables 3 and 4 we can see that both criteria perform well when the sample size is large. When the sample size is moderate, Tables 5 and 6 indicate that the criterion V(·) tends to misspecify the true structural dimension with a much higher probability. This empirical study implies that the adjustment to V(·), namely rescaling the leading term and the penalty term, does help to enhance the efficacy.

4.2. Horse mussel data

This dataset consists of a sample of 201 measurements collected in December 1984 from five sites in the Marlborough Sounds, located off the north-east coast of New Zealand's South Island. The response variable is the muscle mass M, the edible portion of the mussel, in grams. The quantitative predictors are the shell width W in millimeters, the shell height H in millimeters, the shell length L in millimeters and the shell mass S in grams. The observations are assumed to be independent and identically distributed; see also the description in Cook (1998). Bura and Cook (2001) pointed out that the transformed variables ln(W) and ln(S) should be used in place of W and S to satisfy the linearity condition (1.1) and the constant variance condition (1.2). Our analysis is also based on this transformed dataset.

The quadratic B-spline approximation is applied. First, the G(k) values of (3.2) are reported in Fig. 1 to determine the structural dimension, where the knots are selected as suggested at the beginning of this section. According to Fig. 1, a proper estimate of the structural dimension is K = 1 when we adopt the BIC-type criterion. This agrees with Bura and Cook's (2001) finding. However, the sequential test gives different results when applying SIR: it estimates the structural dimension to be 1 or 2, depending on the number of slices (Bura and Cook, 2001). We also tried several numbers of knots, k_n = 2, 3, 4, 5, and all led to the same conclusion. For this dataset, our decision on the structural dimension is insensitive to the choice of knots.


Fig. 1. The G(k) values versus k (k = 1, . . . , 4) for the horse mussel data.

Fig. 2. The scatter plots of M versus the projected directions β̂_0^T X and β̂^T X.

To further study the relationship between the response M and the estimated direction, Fig. 2 reports the scatter plots of M against the combinations β̂^T X and β̂_0^T X, respectively, where X = (ln(W), H, L, ln(S))^T, the direction β̂ = (−0.5250, −0.0360, −0.0171, 0.6272)^T was identified by the B-spline approximation with five knots, and β̂_0 = (0.028, −0.029, −0.593, 0.804)^T was suggested by Bura and Cook (2001). Similar trends can be seen in the two scatter plots, with only a scale difference on the horizontal axis. The correlation coefficient between β̂_0^T X and β̂^T X is 0.9763, so the scale difference is not crucial for further nonparametric statistical modeling.

Appendix A

A.1. Assumptions

The following four conditions are required for Theorems 1–3.

(1) R_kl(y) = E(Z_k Z_l | Y = y) ∈ C^m[0, 1] and r_i(y) = E(Z_i | Y = y) ∈ C^m[0, 1], for 1 ≤ k, l ≤ p;
(2) E|Z_k|^4 < ∞, for all 1 ≤ k ≤ p;


(3) The B-spline knots satisfy max_{1≤i≤k_0} |h_{i+1} − h_i| = o(k_0^{-1}) and h / min_{1≤i≤k_0} h_i ≤ M, where h_i = t_i − t_{i−1}, h = max_{1≤i≤k_0} h_i and M > 0 is a predetermined constant;
(4) As n → ∞, h ∼ n^{−c_1} with a positive number c_1 satisfying 1/(2m) < c_1 < 1/2, where the notation ∼ means that the two quantities have the same convergence order.

Remark A.1. Condition (1) concerns the smoothness of the inverse regression curve E(Z|Y = y) and is similar to the conditions used in Lemma 4 of Bura (2003). Condition (2) is necessary for the asymptotic normality of Λ̂. Condition (3) is commonly used in spline approximation; it ensures that M^{-1} < k_0 h < M, which is necessary for numerical computation. Condition (4) restricts the range of the number of knots for which asymptotic normality holds. Although this range is fairly wide, undersmoothing is needed, because the optimal number of knots, of order O(n^{1/(2m+1)}), is not within this range. This phenomenon is essentially the same as for the kernel estimation in Zhu and Fang (1996).

A.2. Some lemmas

Since the proof of Theorem 1 is rather long, we split the main steps into several lemmas. The following lemmas show that the elements of Λ̂ can be written as U-statistics and can then be approximated by sums of i.i.d. random variables.

Lemma A.1. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) Σ_{j=1}^{n} ( T̂_1(y_j) − T_1(y_j) )( T̂_2(y_j) − T_2(y_j) ) = o_p(1),

where both T_1(·) and T_2(·) can be r_k(·) or R_kl(·) for each pair 1 ≤ k, l ≤ p.

Lemma A.2. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) Σ_{j=1}^{n} ( T(y_j) r̂_k(y_j) − T(y_j) r_k(y_j) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} T(y_j) − E(T(Y) r_k(Y)) ) + o_p(1),

where T(·) can be r_l(·), r_l(·) r_h^2(·), r_k(·) r_l(·) r_h(·) and r_h(·) R_hl(·).

Lemma A.3. Suppose conditions (1)–(4) are satisfied. Then

(1/√n) Σ_{j=1}^{n} ( T(y_j) R̂_kh(y_j) − T(y_j) R_kh(y_j) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} z_{hj} T(y_j) − E(T(Y) R_kh(Y)) ) + o_p(1),

where T(·) can be 1, R_hl(·) and r_k(·) r_h(·). The proofs of these lemmas are postponed to Appendix A.4.

A.3. Proof of theorems

Proof of Theorem 1. We need only deal with the kl-th element Λ̂_kl of Λ̂. We divide the proof into five steps; in each step, our major target is to approximate each term by a sum of i.i.d. random variables. Without confusion, we write E(·|Y = y) = E(·|y) and Ê(·|Y = y) = Ê(·|y) throughout the proof. The proof proceeds through the asymptotic linear representations of U-statistics given in the lemmas.

Step 1: By Lemma A.3, we have

(1/√n) Σ_{j=1}^{n} ( R̂_kl(y_j) − E R_kl(Y) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} z_{lj} + R_kl(y_j) − 2E R_kl(Y) ) + o_p(1).   (A.1)

Step 2: Using Lemma A.1, we obtain

(1/√n) Σ_{j=1}^{n} r̂_k(y_j) r̂_l(y_j) = (1/√n) Σ_{j=1}^{n} ( r̂_k(y_j) − r_k(y_j) + r_k(y_j) )( r̂_l(y_j) − r_l(y_j) + r_l(y_j) )
  = (1/√n) Σ_{j=1}^{n} r_l(y_j)( r̂_k(y_j) − r_k(y_j) ) + (1/√n) Σ_{j=1}^{n} r_k(y_j)( r̂_l(y_j) − r_l(y_j) ) + (1/√n) Σ_{j=1}^{n} r_k(y_j) r_l(y_j) + o_p(1).

By Lemma A.2, we have

(1/√n) Σ_{j=1}^{n} ( r̂_k(y_j) r̂_l(y_j) − E r_k(Y) r_l(Y) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} r_l(y_j) + z_{lj} r_k(y_j) + r_k(y_j) r_l(y_j) − 3E r_k(Y) r_l(Y) ) + o_p(1).   (A.2)

Step 3: By Lemma A.1, we have

(1/√n) Σ_{j=1}^{n} R̂_kh(y_j) R̂_hl(y_j) = (1/√n) Σ_{j=1}^{n} ( R̂_kh(y_j) − R_kh(y_j) ) R_hl(y_j) + (1/√n) Σ_{j=1}^{n} ( R̂_hl(y_j) − R_hl(y_j) ) R_kh(y_j) + (1/√n) Σ_{j=1}^{n} R_kh(y_j) R_hl(y_j) + o_p(1).

Therefore, Lemma A.3 yields

(1/√n) Σ_{j=1}^{n} ( R̂_kh(y_j) R̂_hl(y_j) − E(R_kh(Y) R_hl(Y)) ) = (1/√n) Σ_{j=1}^{n} ( z_{hj} z_{lj} R_kh(y_j) + z_{kj} z_{hj} R_hl(y_j) + R_kh(y_j) R_hl(y_j) − 3E(R_kh(Y) R_hl(Y)) ) + o_p(1).   (A.3)

Step 4: Using Lemma A.1 again, we obtain

(1/√n) Σ_{j=1}^{n} r̂_h^2(y_j) r̂_k(y_j) r̂_l(y_j) = (1/√n) Σ_{j=1}^{n} ( r_h^2(y_j) r_l(y_j)( r̂_k(y_j) − r_k(y_j) ) + 2 r_k(y_j) r_l(y_j) r_h(y_j)( r̂_h(y_j) − r_h(y_j) ) + r_h^2(y_j) r_k(y_j)( r̂_l(y_j) − r_l(y_j) ) + r_h^2(y_j) r_k(y_j) r_l(y_j) ) + o_p(1).

By Lemmas A.2 and A.3, we have

(1/√n) Σ_{j=1}^{n} ( r̂_h^2(y_j) r̂_k(y_j) r̂_l(y_j) − E(r_h^2(Y) r_k(Y) r_l(Y)) ) = (1/√n) Σ_{j=1}^{n} ( z_{lj} r_h^2(y_j) r_k(y_j) + z_{kj} r_h^2(y_j) r_l(y_j) + 2 z_{hj} r_h(y_j) r_l(y_j) r_k(y_j) + r_h^2(y_j) r_k(y_j) r_l(y_j) − 5E(r_h^2(Y) r_k(Y) r_l(Y)) ) + o_p(1).   (A.4)

Step 5: Invoking Lemma A.1 again, we derive

(1/√n) Σ_{j=1}^{n} R̂_hl(y_j) r̂_k(y_j) r̂_h(y_j) = (1/√n) Σ_{j=1}^{n} ( ( R̂_hl(y_j) − R_hl(y_j) ) r_k(y_j) r_h(y_j) + ( r̂_h(y_j) − r_h(y_j) ) R_hl(y_j) r_k(y_j) + ( r̂_k(y_j) − r_k(y_j) ) r_h(y_j) R_hl(y_j) + r_k(y_j) r_h(y_j) R_hl(y_j) ) + o_p(1).

Therefore, by Lemmas A.2 and A.3, we obtain

(1/√n) Σ_{j=1}^{n} ( R̂_hl(y_j) r̂_k(y_j) r̂_h(y_j) − E(R_hl(Y) r_k(Y) r_h(Y)) ) = (1/√n) Σ_{j=1}^{n} ( z_{hj} z_{lj} r_k(y_j) r_h(y_j) + z_{hj} r_k(y_j) R_hl(y_j) + z_{kj} r_h(y_j) R_hl(y_j) + R_hl(y_j) r_k(y_j) r_h(y_j) − 4E(R_hl(Y) r_k(Y) r_h(Y)) ) + o_p(1).   (A.5)

Similarly, we have

(1/√n) Σ_{j=1}^{n} ( R̂_kh(y_j) r̂_h(y_j) r̂_l(y_j) − E(R_kh(Y) r_h(Y) r_l(Y)) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} z_{hj} r_h(y_j) r_l(y_j) + z_{hj} r_l(y_j) R_kh(y_j) + z_{lj} r_h(y_j) R_kh(y_j) + R_kh(y_j) r_h(y_j) r_l(y_j) − 4E(R_kh(Y) r_h(Y) r_l(Y)) ) + o_p(1).   (A.6)

Finally, combining (A.1)–(A.6), we see that Λ̂_kl can be written asymptotically as a sum of i.i.d. random variables. Hence the central limit theorem yields the desired result, with the covariance determined by (2.2). □


Proof of Theorem 2. The proof is almost identical to that of Theorem 2 of Zhu and Fang (1996), and hence we omit the details. □

Proof of Theorem 3. Let K be the true value of the dimension of the CDR subspace. As can be seen from Theorem 2, λ_k(Λ̂) − λ_k(Λ) = O_p(1/√n) when K > k. Therefore, if K > k,

G(K) − G(k) = ( n [Σ_{i=1}^{K} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] − 2 ln(n) K^2/p ) − ( n [Σ_{i=1}^{k} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] − 2 ln(n) k^2/p )
  = n [Σ_{i=k+1}^{K} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] + 2 ln(n) (k^2 − K^2)/p
  = n [Σ_{i=k+1}^{K} ( λ_i^2(Λ̂) + o(λ_i^2(Λ̂)) )] / [Σ_{i=1}^{p} ( λ_i^2(Λ̂) + o(λ_i^2(Λ̂)) )] + 2 ln(n) (k^2 − K^2)/p
  → n [Σ_{i=k+1}^{K} λ_i^2(Λ)] / [Σ_{i=1}^{p} λ_i^2(Λ)] + 2 ln(n) (k^2 − K^2)/p > 0.   (A.7)

If K < k,

G(K) − G(k) = ( n [Σ_{i=1}^{K} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] − 2 ln(n) K^2/p ) − ( n [Σ_{i=1}^{k} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] − 2 ln(n) k^2/p )
  = −n [Σ_{i=K+1}^{k} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] / [Σ_{i=1}^{p} (ln λ_i(Λ̂*) + 1 − λ_i(Λ̂*))] + 2 ln(n) (k^2 − K^2)/p
  = −n [Σ_{i=K+1}^{k} ( λ_i^2(Λ̂) + o(λ_i^2(Λ̂)) )] / [Σ_{i=1}^{p} ( λ_i^2(Λ̂) + o(λ_i^2(Λ̂)) )] + 2 ln(n) (k^2 − K^2)/p
  → 2 ln(n) (k^2 − K^2)/p > 0.   (A.8)

It follows from (A.7) and (A.8) that K̂ → K in probability. □

A.4. Proof of lemmas

Proof of Lemma A.1. We only need to show that the lemma holds when T_1(·) and T_2(·) are the same, because the case of different T_1(·) and T_2(·) then follows easily from the Cauchy inequality. We only prove the case T_1(·) = T_2(·) = r_k(·), because the proof for the other cases is essentially the same. Note that

(1/√n) Σ_{j=1}^{n} ( r̂_k(y_j) − r_k(y_j) )^2 = (1/√n) Σ_{j=1}^{n} r̂_k^2(y_j) − 2 (1/√n) Σ_{j=1}^{n} r̂_k(y_j) r_k(y_j) + (1/√n) Σ_{j=1}^{n} r_k^2(y_j).

First, we compute the expectation of the first term. To this end, we write it as a U-statistic:

(1/√n) Σ_{j=1}^{n} r̂_k^2(y_j) = (1/(√n n^2)) Σ_{j=1}^{n} Σ_{i_0=1}^{n} Σ_{j_0=1}^{n} N^T(y_j) G_{J,n}^{-1} N(y_{i_0}) z_{k i_0} N^T(y_j) G_{J,n}^{-1} N(y_{j_0}) z_{k j_0}
  = (1/(√n n^2)) Σ_{j=1}^{n} Σ_{i_0=1}^{n} Σ_{j_0=1}^{n} N^T(y_j) G^{-1}(q) N(y_{i_0}) z_{k i_0} N^T(y_j) G^{-1}(q) N(y_{j_0}) z_{k j_0} + o_p(1)
  = √n (C_n^3)^{-1} Σ_{i_0 < j_0 < j} u_1( z_{k i_0}, y_{i_0}, z_{k j_0}, y_{j_0}, z_{kj}, y_j ) + o_p(1) =: √n U_1 + o_p(1),

where u_1 denotes the symmetrized kernel of the triple sum. We then compute the expectation of U_1 as follows:

E U_1 = E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} N^T(Y_2) G^{-1}(q) N(Y_3) Z_{k3} ) = E( Z_{k1} N^T(Y_1) G^{-1}(q) N(Y_2) N^T(Y_2) G^{-1}(q) N(Y_3) Z_{k3} ) = E( r_k(Y_1) N^T(Y_1) ) E( G^{-1}(q) N(Y_3) r_k(Y_3) ) = E( r_k^2(Y_1) ) + O(h^m).

Similarly, we have

E( (1/n) Σ_{j=1}^{n} r_k(y_j) r̂_k(y_j) ) = E( r_k^2(Y) ) + O(h^m).

The conclusion of Lemma A.1 then follows. Since the proofs of Lemmas A.2 and A.3 are similar, we only prove Lemma A.2 here. □

Proof of Lemma A.2. Write (1/√n) Σ_{j=1}^{n} T(y_j) r̂_k(y_j) as a U-statistic:

(1/√n) Σ_{j=1}^{n} T(y_j) r̂_k(y_j) = (1/√n) Σ_{j=1}^{n} Σ_{i=1}^{n} T(y_j) N^T(y_j) (n G_{J,n})^{-1} N(y_i) z_{ki}
  = √n (C_n^2)^{-1} Σ_{i<j} ( T(y_j) N^T(y_j) G^{-1}(q) N(y_i) z_{ki} + T(y_i) N^T(y_i) G^{-1}(q) N(y_j) z_{kj} ) / 2 + o_p(1)
  =: √n (C_n^2)^{-1} Σ_{i<j} u_n( z_{ki}, y_i, z_{kj}, y_j ) + o_p(1) = √n U_n + o_p(1).   (A.9)

For the second equality, write Δ = G_{J,n} − G(q); Lemma 6.4 of Zhou et al. (1998) yields ‖Δ‖ = o(h), and hence the relationship G_{J,n}^{-1} = G^{-1}(q) − G^{-1}(q) Δ (I + G^{-1}(q) Δ)^{-1} G^{-1}(q) can be invoked. The projection of the U-statistic (A.9) is

Û_n = Σ_{j=1}^{n} E( U_n | z_{kj}, y_j ) − (n − 1) E u_n(Z_{k1}, Y_1, Z_{k2}, Y_2),

where u_n(·) is the kernel of the U-statistic U_n. We show that Û_n is asymptotically equal to a sum of i.i.d. random variables. To compute E U_n first, we obtain

E U_n = E u_n(Z_{k1}, Y_1; Z_{k2}, Y_2) = E( T(Y_1) N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} ) = E( T(Y_1) N^T(Y_1) ) E( G^{-1}(q) N(Y_2) r_k(Y_2) ) = E( T(Y) r_k(Y) ) + O(h^m).

The last equality holds by invoking Theorem 2.1 of Zhou et al. (1998). Note that

u_{n1}(z_{k1}, y_1) := E( u_n(z_{k1}, y_1; Z_{k2}, Y_2) | z_{k1}, y_1 ) = ( T(y_1) N^T(y_1) E( G^{-1}(q) N(Y_2) Z_{k2} ) + E( T(Y_2) N^T(Y_2) ) G^{-1}(q) N(y_1) z_{k1} ) / 2 = (1/2)( T(y_1) r_k(y_1) + z_{k1} T(y_1) ) + O(h^m).

Thus, the centered conditional expectation is

ũ_{n1}(z_{k1}, y_1) = E( u_n(z_{k1}, y_1; Z_{k2}, Y_2) | z_{k1}, y_1 ) − E( u_n(Z_{k1}, Y_1; Z_{k2}, Y_2) ) = (1/2)( T(y_1) r_k(y_1) + z_{k1} T(y_1) ) − E( T(Y) r_k(Y) ) + O(h^m).

Following the classical theory of U-statistics (see Serfling, 1980), we obtain

Û_n − E u_n(Z_{k1}, Z_{k2}, Y_1, Y_2) = (2/n) Σ_{j=1}^{n} ũ_{n1}(z_{kj}, y_j) = (2/n) Σ_{j=1}^{n} ( (1/2)( T(y_j) r_k(y_j) + z_{kj} T(y_j) ) − E( T(Y) r_k(Y) ) ) + O(h^m).   (A.10)

Now we verify that U_n can be approximated by its projection Û_n at the rate 1/(√n h), that is,

√n ( Û_n − U_n ) = O_p( 1/(√n h) ).   (A.11)

Similar to the proof of Lemma 5.3 of Zhu and Zhu (2007), we only need to show that E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) )^2 = O(1/h^2), where u_n(·) is defined in (A.9). Clearly,

E( u_n(Z_{k1}, Y_1, Z_{k2}, Y_2) )^2 ≤ (1/2) ( E( N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} T(Y_1) )^2 + E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} T(Y_2) )^2 ).

It is easy to see that

E( N^T(Y_1) G^{-1}(q) N(Y_2) Z_{k2} T(Y_1) )^2 = E( T(Y_1)^2 ( N^T(Y_1) G^{-1}(q) N(Y_2) )^2 E( Z_{k2}^2 | Y_2 ) )
  = E( T(Y_1)^2 N^T(Y_1) G^{-1}(q) N(Y_2) N^T(Y_2) G^{-1}(q) N(Y_1) E( Z_{k2}^2 | Y_2 ) )
  = trace( E( T(Y_1)^2 G^{-1}(q) N(Y_1) N^T(Y_1) ) E( E( Z_{k2}^2 | Y_2 ) G^{-1}(q) N(Y_2) N^T(Y_2) ) )
  ≤ trace( E‖Z_1‖^4 E( N(Y_2) N^T(Y_2) G^{-1}(q) )^2 ) = O(1/h^2),

and likewise E( N^T(Y_2) G^{-1}(q) N(Y_1) Z_{k1} T(Y_2) )^2 = O(1/h^2), where trace(A) is the sum of the diagonal elements of the matrix A. Therefore (A.11) holds, which means that √n U_n and √n Û_n are asymptotically equivalent. From the above results, together with (A.10) and (A.11), we have

(1/√n) Σ_{j=1}^{n} T(y_j) r̂_k(y_j) = √n U_n + o_p(1) = √n Û_n + o_p(1) = (1/√n) Σ_{j=1}^{n} ( T(y_j) r_k(y_j) + E( T(Y) N^T(Y) ) G^{-1}(q) N(y_j) z_{kj} ) − √n E( T(Y) r_k(Y) ) + o_p(1).

Then

(1/√n) Σ_{j=1}^{n} ( T(y_j) r̂_k(y_j) − T(y_j) r_k(y_j) ) = (1/√n) Σ_{j=1}^{n} ( z_{kj} E( T(Y) N^T(Y) ) G^{-1}(q) N(y_j) − E( T(Y) r_k(Y) ) ) + o_p(1) = (1/√n) Σ_{j=1}^{n} ( z_{kj} T(y_j) − E( T(Y) r_k(Y) ) ) + o_p(1).
The proof is finished. □

References

Bura, E., 2003. Using linear smoothers to assess the structural dimension of regressions. Statist. Sinica 13, 143–162.
Bura, E., Cook, R.D., 2001. Extending sliced inverse regression: the weighted chi-square test. J. Amer. Statist. Assoc. 96, 996–1003.
Cook, R.D., 1996. Graphics for regressions with a binary response. J. Amer. Statist. Assoc. 91, 983–992.
Cook, R.D., 1998. Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York.
Cook, R.D., 2000. SAVE: a method for dimension reduction and graphics in regression. Comm. Statist. Theory Methods 29, 2109–2121.
Cook, R.D., Weisberg, S., 1991. Discussion of "Sliced inverse regression for dimension reduction". J. Amer. Statist. Assoc. 86, 316–342.
Ferre, L., 1998. Determining the dimension in sliced inverse regression and related methods. J. Amer. Statist. Assoc. 93, 132–140.
Li, K.C., 1991. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc. 86, 316–342.
Li, Y.X., Zhu, L.X., 2007. Asymptotics for sliced average variance estimation. Ann. Statist. 35, 41–69.
Schumaker, L.L., 1981. Spline Functions. Wiley, New York.
Serfling, R., 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Stute, W., Zhu, L.X., 2005. Nonparametric checks for single-index models. Ann. Statist. 33, 1048–1083.
Zhou, S., Shen, X., Wolfe, D.A., 1998. Local asymptotics for regression splines and confidence regions. Ann. Statist. 26, 1760–1782.
Zhu, L.P., Zhu, L.X., 2007. On kernel method for sliced average variance estimation. J. Multivariate Anal. 98, 970–991.
Zhu, L.X., Fang, K.T., 1996. Asymptotics for kernel estimate of sliced inverse regression. Ann. Statist. 24, 1053–1068.
Zhu, L.X., Miao, B.Q., Peng, H., 2006. On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc. 101, 630–643.
Zhu, L.X., Ng, K.W., 1995. Asymptotics of sliced inverse regression. Statist. Sinica 5, 727–736.
Zhu, L.X., Ohtaki, M., Li, Y.X., 2007. On hybrid methods of inverse regression-based algorithms. Comput. Statist. Data Anal. 51, 2621–2635.