Journal of Statistical Planning and Inference 139 (2009) 3570–3578
The sliced inverse regression algorithm as a maximum likelihood procedure☆

María Eugenia Szretter^a, Víctor Jaime Yohai^b,*

^a Universidad de Buenos Aires, Argentina
^b Departamento de Matemáticas, Universidad de Buenos Aires, Pabellón I, Ciudad Universitaria, 1428 Buenos Aires, and CONICET, Argentina
ARTICLE INFO

Article history:
Received 6 December 2008
Received in revised form 5 April 2009
Accepted 7 April 2009
Available online 18 April 2009

Keywords:
Sliced inverse regression
Maximum likelihood estimates
Central mean subspace

ABSTRACT

It is shown that the sliced inverse regression procedure proposed by Li corresponds to the maximum likelihood estimate where the observations in each slice are samples of multivariate normal distributions with means in an affine manifold.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

Regression models are used to describe the relationship between a response variable $y$ and a group of covariates $x=(x_1,\dots,x_p)\in\mathbb{R}^p$. When there is no adequate parametric model that gives a satisfactory fit to the data, non-parametric techniques appear as more flexible tools. However, when $p$ increases, the number of observations required by local smoothing techniques grows exponentially, and consequently these methods become unfeasible. This limitation of non-parametric techniques is known in the statistical literature as the curse of dimensionality. One way to overcome this difficulty is to use models where $y$ depends in a non-parametric way on only a few projections of the vector $x$ along some directions, i.e., the response $y$ depends on $x$ only through $b_1'x,\dots,b_K'x$, where $K$ is much smaller than $p$ and $b_i$, $1\le i\le K$, are vectors in $\mathbb{R}^p$ that can, without loss of generality, be assumed to have norm one. Li (1991) considers the following model:
$$
y = g(b_1'x,\dots,b_K'x,\varepsilon), \tag{1}
$$
where $g$ is a smooth function and the error $\varepsilon$ is independent of $x$. Note that once the directions $b_1,\dots,b_K$ are known, the local smoothing procedure used to determine $g$ involves only $K$ variables, and therefore if $K$ is small the curse of dimensionality is overcome. A more general approach to the problem of dimension reduction can be found in Cook and Li (2002). Let $b_1,\dots,b_K$ be vectors in $\mathbb{R}^p$; then the subspace $V$ generated by these vectors is called a mean dimension-reduction subspace if
$$
E(y\mid x) = g(b_1'x,\dots,b_K'x). \tag{2}
$$
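For instance (a hypothetical illustration, not an example taken from the paper), with a large $p$ and $K=2$ one might have
$$
E(y\mid x)=\frac{b_1'x}{1+(b_2'x)^2},
$$
so that, although $x$ lies in $\mathbb{R}^p$, the regression function only needs to be estimated non-parametrically over the two-dimensional projection $(b_1'x,b_2'x)$.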
☆ This research was partially supported by Grants X-094 from Universidad de Buenos Aires, PID 5505 from CONICET and PAV 120 and PICT 21407 from ANPCYT, Argentina.
* Corresponding author. Tel.: +54 11 4576 3375.
E-mail addresses: [email protected] (M.E. Szretter), [email protected], [email protected] (V.J. Yohai).
doi:10.1016/j.jspi.2009.04.008
Note that the validity of (2) does not depend on the basis of $V$ which is used. Let
$$
V_0 = \bigcap V_m,
$$
where the intersection is over all mean dimension-reduction subspaces $V_m$. If $V_0$ is itself a mean dimension-reduction subspace, it is called the central mean subspace (CMS). Cook and Li (2002) gave mild conditions for the existence of $V_0$. Additional existence results can be found in Cook (1994, 1996, 1998). Note that if the CMS exists, it is unique. We will suppose in the remainder of this paper that $V_0$ exists.

The CMS can be estimated by using the sliced inverse regression (SIR) algorithm proposed by Li (1991). The procedure consists of dividing the observations into slices according to the value of $y$ belonging to intervals $I_1,\dots,I_H$ and assuming that $E(x\mid y\in I_h)-E(x)$, $1\le h\le H$, belong to the space
$$
V_0^* = \{c = \Sigma b : b\in V_0\},
$$
where $\Sigma$ is the covariance matrix of $x$. We will also assume that we know the dimension $K$ of $V_0$. Several authors have studied how to determine $K$. Among them, we can cite Li (1991), Schott (1994) and Ferré (1998).

Since the seminal paper by Li (1991), several procedures using the inverse regression approach have been proposed. We can cite, among many others: sliced average variance estimation (Cook and Weisberg, 1991), principal Hessian directions (Li, 1992; Cook, 1998), inverse regression (Cook and Ni, 2005), graphical methods (Cook, 1994; Cook and Wetzel, 1993), simple contour regression (Li et al., 2005), Fourier estimation (Zhu and Zeng, 2006) and principal fitted components (PFC) (Cook, 2007).

In this paper we show that the SIR procedure is equivalent to estimating a $K$-dimensional subspace $V_0\subset\mathbb{R}^p$ by maximum likelihood, assuming that the distribution of $x$ given that $y\in I_h$ is $N_p(\mu_h,\Sigma)$, where $N_p(\mu,\Sigma)$ denotes the $p$-dimensional multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, and that the conditional means $\mu_1,\dots,\mu_H$ belong to an affine manifold of dimension $K$. In Section 3 we discuss the connection between this result and those in Cook (2007).

In Section 2 we describe the SIR algorithm. In Section 3 we state the main results regarding the equivalence between the estimates of the subspace obtained with the SIR algorithm and the maximum likelihood estimates. The Appendix contains the proofs.

2. Sliced inverse regression model

Let $(x_i,y_i)$, $1\le i\le N$, be a random sample of a distribution satisfying (2). Call $\Sigma$ the covariance matrix of $x$. Let $\bar x_{\cdot\cdot}$ and $\hat\Sigma$ be the sample mean and sample covariance estimates of the $x_i$'s, respectively. The SIR algorithm is as follows:

(1) Standardize $x_i$ by computing $z_i=\hat\Sigma^{-1/2}(x_i-\bar x_{\cdot\cdot})$.
(2) Divide the range of $y$ into $H$ slices $I_1,\dots,I_H$, where $I_h=(q_h,q_{h+1}]$, $-\infty=q_1<q_2<\cdots<q_{H+1}=\infty$. Let $(z_{hj},y_{hj})$, $1\le j\le n_h$, be the standardized observations of the sample whose $y$ value falls in $I_h$.
(3) For each slice, compute $\bar z_h$, the sample mean of the observations $z_i$ whose $y$ value falls in $I_h$, and perform a (weighted) principal components analysis of the $\bar z_h$ as follows. Compute the eigenvalues and eigenvectors of the weighted covariance matrix
$$
\hat V = \frac{1}{N}\sum_{h=1}^H n_h \bar z_h \bar z_h'.
$$
(4) Let $\hat g_1,\dots,\hat g_K$ be the eigenvectors corresponding to the $K$ largest eigenvalues of $\hat V$. The estimate of the CMS is the subspace spanned by
$$
\hat b_k = \hat\Sigma^{-1/2}\hat g_k,\qquad 1\le k\le K. \tag{3}
$$
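The following is a minimal sketch of the four steps above in Python (an illustration written for this edition, not the authors' code; the simulated single-index model, the equal-count slicing rule and all function names are assumptions).

```python
import numpy as np

def sir(x, y, H=10, K=1):
    """Steps (1)-(4) of the SIR algorithm; returns a p x K basis of the CMS estimate."""
    N, p = x.shape
    xbar = x.mean(axis=0)
    Sigma_hat = np.cov(x, rowvar=False, bias=True)          # sample covariance of the x_i's
    w, U = np.linalg.eigh(Sigma_hat)
    Sigma_inv_half = U @ np.diag(w ** -0.5) @ U.T            # symmetric inverse square root
    z = (x - xbar) @ Sigma_inv_half                          # step (1): standardized observations
    edges = np.quantile(y, np.linspace(0, 1, H + 1))         # step (2): slices (equal counts here)
    labels = np.digitize(y, edges[1:-1])
    V_hat = np.zeros((p, p))                                 # step (3): weighted covariance of slice means
    for h in range(H):
        zh = z[labels == h]
        if zh.size:
            V_hat += (len(zh) / N) * np.outer(zh.mean(axis=0), zh.mean(axis=0))
    evals, evecs = np.linalg.eigh(V_hat)                     # step (4): top-K eigenvectors, back to x scale
    g = evecs[:, np.argsort(evals)[::-1][:K]]
    return Sigma_inv_half @ g

# Usage on simulated data from a model of the form (1) with K = 1
rng = np.random.default_rng(0)
N, p = 500, 6
x = rng.normal(size=(N, p))
b = np.zeros(p); b[0] = 1.0
y = (x @ b) ** 3 + 0.5 * rng.normal(size=N)
print(sir(x, y, H=10, K=1).ravel())   # roughly proportional to b (up to sign)
```

Quantile-based slices are used only as a common practical choice; any partition of the range of $y$ as in step (2) fits the description above.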
3. The SIR estimate as a maximum likelihood procedure

We will show that the SIR procedure described in Section 2 is equivalent to estimating the CMS subspace by maximum likelihood, assuming that the sliced samples $x_{hj}$, $1\le j\le n_h$, are independent random samples from $N_p(\mu_h,\Sigma)$, where $\mu_h$ belongs to an affine manifold of dimension $K$. Observe that Theorem 3.1 of Li (1991) implies that, under some assumptions on the distribution of $x$, the conditional means $\mu_h=E(x\mid y\in I_h)$ belong to the affine manifold of dimension $K$, $V_0^*+a$, where $a\in\mathbb{R}^p$ can be taken equal to $E(x)$. To guarantee that the SIR algorithm estimates the CMS, we will assume the coverage condition that $\mu_1-E(x),\dots,\mu_H-E(x)$ generate $V_0^*$. We need the following definitions:
$$
\bar x_{\cdot\cdot}=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}x_{hj},\qquad \bar x_{h\cdot}=\frac{1}{n_h}\sum_{j=1}^{n_h}x_{hj},
$$
$$
B=\frac{1}{N}\sum_{h=1}^H n_h(\bar x_{h\cdot}-\bar x_{\cdot\cdot})(\bar x_{h\cdot}-\bar x_{\cdot\cdot})' \tag{4}
$$
and
$$
W=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\bar x_{h\cdot})(x_{hj}-\bar x_{h\cdot})'. \tag{5}
$$
Note that $\bar x_{\cdot\cdot}$ and $\bar x_{h\cdot}$ are $p$-dimensional column vectors and $B$ and $W$ are $p\times p$ matrices. Let $C$ be the orthogonal matrix $[c_1\cdots c_p]$, where $c_i$ is an eigenvector of $B^{-1/2}WB^{-1/2}$ corresponding to the eigenvalue $\lambda_i$, and $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$. Denote by $\Lambda$ the diagonal matrix with $\lambda_1,\dots,\lambda_p$ on the diagonal; then $B^{-1/2}WB^{-1/2}=C\Lambda C'$. Let $C_h$ be the matrix with the first $h$ columns of $C$.

Theorem 1. Assume that for $1\le h\le H$, $(x_{hj})_{1\le j\le n_h}$ are i.i.d. column vectors in $\mathbb{R}^p$ with distribution $N_p(\mu_h,\Sigma)$, where $\mu_h$ belongs to $V_0^*+a$, $V_0^*$ is a $K$-dimensional subspace of $\mathbb{R}^p$ and $a\in\mathbb{R}^p$. We also assume that the $H$ samples are independent. Then,

(a) The maximum likelihood estimate of $\Sigma$ is
$$
\hat\Sigma=W+B^{1/2}C_{p-K}C_{p-K}'B^{1/2}.
$$
(b) Let $\hat t_i$, $1\le i\le p$, be orthogonal eigenvectors of norm one of $\hat\Sigma^{-1/2}B\hat\Sigma^{-1/2}$ corresponding to the eigenvalues $\hat\delta_1\ge\hat\delta_2\ge\cdots\ge\hat\delta_p$. The maximum likelihood estimate of $\mu_h$ is
$$
\hat\mu_h=\hat\Sigma^{1/2}\hat T_K\hat T_K'\hat\Sigma^{-1/2}(\bar x_{h\cdot}-\bar x_{\cdot\cdot})+\bar x_{\cdot\cdot},\qquad 1\le h\le H,
$$
where $\hat T_K=[\hat t_1\cdots\hat t_K]$. Then $\hat\mu_h$ is the orthogonal projection, using the norm associated to $\hat\Sigma$, of $\bar x_{h\cdot}$ on the $K$-dimensional affine manifold $\hat V^*+\bar x_{\cdot\cdot}$, where $\hat V^*$ is the subspace spanned by $\hat\Sigma^{1/2}\hat t_1,\dots,\hat\Sigma^{1/2}\hat t_K$.
(c) $\hat\Sigma^{1/2}\hat t_i$ is an eigenvector of $BW^{-1}$ corresponding to the eigenvalue $1/\lambda_{p-i+1}$.
(d) $\hat\Sigma$ can also be written as
$$
\hat\Sigma=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\hat\mu_h)(x_{hj}-\hat\mu_h)'
$$
and it satisfies
$$
\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\hat\mu_h)'\hat\Sigma^{-1}(x_{hj}-\hat\mu_h)=p. \tag{6}
$$
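As a numerical illustration of Theorem 1 (a sketch written for this edition, not the authors' code; the simulated slice model and all variable names are assumptions), one can build $B$, $W$, and the estimates $\hat\Sigma$ and $\hat\mu_h$ exactly as in parts (a) and (b), and then check the representation in part (d) and identity (6):

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, H = 5, 2, 8
n_h = rng.integers(30, 60, size=H)                        # slice sizes n_h
N = n_h.sum()
# simulate slice samples N_p(mu_h, Sigma) with the mu_h in a K-dimensional affine manifold
A0 = rng.normal(size=(p, p)); Sigma = A0 @ A0.T + p * np.eye(p)
V = rng.normal(size=(p, K)); a = rng.normal(size=p)
mu = a + rng.normal(size=(H, K)) @ V.T                    # rows mu_h
x = [rng.multivariate_normal(mu[h], Sigma, size=n_h[h]) for h in range(H)]

xbar_h = np.array([xh.mean(axis=0) for xh in x])          # slice means
xbar = np.concatenate(x).mean(axis=0)                     # overall mean
B = sum(n_h[h] * np.outer(xbar_h[h] - xbar, xbar_h[h] - xbar) for h in range(H)) / N
W = sum((xh - xbar_h[h]).T @ (xh - xbar_h[h]) for h, xh in enumerate(x)) / N

def msqrt(M):                                             # symmetric square root
    w, U = np.linalg.eigh(M)
    return U @ np.diag(np.sqrt(w)) @ U.T

B_half = msqrt(B); B_half_inv = np.linalg.inv(B_half)
lam, C = np.linalg.eigh(B_half_inv @ W @ B_half_inv)
idx = np.argsort(lam)[::-1]; lam, C = lam[idx], C[:, idx]  # lambda_1 >= ... >= lambda_p
Sigma_hat = W + B_half @ C[:, :p - K] @ C[:, :p - K].T @ B_half        # part (a)

S_half = msqrt(Sigma_hat); S_half_inv = np.linalg.inv(S_half)
d, T = np.linalg.eigh(S_half_inv @ B @ S_half_inv)
T_K = T[:, np.argsort(d)[::-1][:K]]
mu_hat = (S_half @ T_K @ T_K.T @ S_half_inv @ (xbar_h - xbar).T).T + xbar  # part (b)

resid_cov = sum((x[h] - mu_hat[h]).T @ (x[h] - mu_hat[h]) for h in range(H)) / N
print(np.allclose(Sigma_hat, resid_cov))                  # part (d): alternative representation
Si = np.linalg.inv(Sigma_hat)
quad = sum(((x[h] - mu_hat[h]) @ Si * (x[h] - mu_hat[h])).sum() for h in range(H)) / N
print(np.isclose(quad, p))                                # identity (6)
```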
Remark 1. We should mention that Cook (2007) proposed a general inverse regression model called the principal fitted components (PFC) model. The PFC model assumes that the distribution of $x$ given $y$ is $N_p(\mu(y),\Sigma)$, where $\mu(y)=a+\Gamma\beta f(y)$, $\Gamma$ and $\beta$ are unknown matrices of dimensions $p\times K$ and $K\times H$, respectively, $a\in\mathbb{R}^p$ is an unknown vector and $f:\mathbb{R}\to\mathbb{R}^H$ is a known function. Therefore $\mu(y)$ belongs to the manifold $V+a$, where $V$ is the subspace of dimension $K$ generated by the columns of $\Gamma$. If the observations $x_{hj}$, $1\le j\le n_h$, are those for which $y\in I_h$, as in the SIR algorithm, the model assumed in Theorem 1 is the PFC model corresponding to $f(y)=(1_{I_1}(y),\dots,1_{I_H}(y))'$, where $1_A$ denotes the indicator function of the set $A$. In Cook (2007) the maximum likelihood estimate of the manifold $V+a$ is obtained assuming that $\Sigma$ is known. Instead, in Theorem 1 we derive the simultaneous maximum likelihood estimates of $V+a$ and $\Sigma$.

The following theorem establishes the relation between the maximum likelihood estimate of $V_0^*$ given in Theorem 1 and the estimate of the CMS given by the SIR algorithm.

Theorem 2. The estimate of the CMS obtained with the SIR algorithm coincides with $\hat\Sigma^{-1}\hat V^*$, where $\hat V^*$ is the maximum likelihood estimate of $V_0^*$ given in Theorem 1. Moreover, $\hat V^*$ coincides with the subspace spanned by the $K$ eigenvectors corresponding to the $K$ largest eigenvalues of $BW^{-1}$.

Acknowledgement

We thank the referee for his valuable suggestions, which considerably improved the paper.
Appendix

Proof of Theorem 1. Proof of (b): Let $N=\sum_{h=1}^H n_h$. Then the logarithm of the likelihood is
$$
\ln L(\mu_1,\dots,\mu_H,\Sigma)=c-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\mu_h)'\Sigma^{-1}(x_{hj}-\mu_h), \tag{7}
$$
where $c=-(1/2)pN\ln 2\pi$. We start by estimating $\mu_1,\dots,\mu_H$ assuming $\Sigma$ known. Put $\mu_h=a_h+a$, where $a_1,\dots,a_H$ are in a $K$-dimensional subspace $V^*\subset\mathbb{R}^p$ and $a\in\mathbb{R}^p$. Let us find first the $a$ minimizing
$$
\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-a_h-a)'\Sigma^{-1}(x_{hj}-a_h-a) \tag{8}
$$
for $a_h$, $1\le h\le H$, given. Put
$$
\bar a=\frac{1}{N}\sum_{h=1}^H n_h a_h.
$$
Then it is easy to verify that (8) equals
$$
\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-a_h-(\bar x_{\cdot\cdot}-\bar a))'\Sigma^{-1}(x_{hj}-a_h-(\bar x_{\cdot\cdot}-\bar a))+N(a-(\bar x_{\cdot\cdot}-\bar a))'\Sigma^{-1}(a-(\bar x_{\cdot\cdot}-\bar a)),
$$
and therefore $\hat a=\bar x_{\cdot\cdot}-\bar a$. Since $\bar a\in V^*$ and $\mu_h=a_h-\bar a+\bar x_{\cdot\cdot}$, redefining $a_h$ as $a_h-\bar a$ we have $\mu_h=a_h+\bar x_{\cdot\cdot}$, where the $a_h$'s are in $V^*$ too. Let $a_h^{V^*}$, $1\le h\le H$, be the orthogonal projection of $\bar x_{h\cdot}-\bar x_{\cdot\cdot}$ on $V^*$ when $\mathbb{R}^p$ is endowed with the metric induced by the norm $\|x\|^2=x'\Sigma^{-1}x$. Then, since the cross product terms are zero, we can write
$$
\begin{aligned}
\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\mu_h)'\Sigma^{-1}(x_{hj}-\mu_h)
&=\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\bar x_{\cdot\cdot}-a_h)'\Sigma^{-1}(x_{hj}-\bar x_{\cdot\cdot}-a_h)\\
&=\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\bar x_{h\cdot})'\Sigma^{-1}(x_{hj}-\bar x_{h\cdot})
+\sum_{h=1}^H n_h(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*})'\Sigma^{-1}(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*})\\
&\quad+\sum_{h=1}^H n_h(a_h-a_h^{V^*})'\Sigma^{-1}(a_h-a_h^{V^*}). \tag{9}
\end{aligned}
$$
Then, for fixed $\Sigma$, the maximum of (7) is obtained by taking
$$
\hat\mu_h=a_h^{V^*}+\bar x_{\cdot\cdot} \tag{10}
$$
and $V^*$ the $K$-dimensional subspace minimizing the second term on the right-hand side of (9), given by
$$
\sum_{h=1}^H n_h(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*})'\Sigma^{-1}(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*}). \tag{11}
$$
Consider the transformation $v_h=n_h^{1/2}\Sigma^{-1/2}(\bar x_{h\cdot}-\bar x_{\cdot\cdot})$, $V^{**}=\Sigma^{-1/2}V^*$. Then $c_h^{V^{**}}=n_h^{1/2}\Sigma^{-1/2}a_h^{V^*}$ is the orthogonal projection of $v_h$ on $V^{**}$ with the metric induced by the identity matrix. Then finding $V^*$ minimizing (11) is equivalent to finding $V^{**}$ minimizing
$$
\sum_{h=1}^H\|v_h-c_h^{V^{**}}\|^2. \tag{12}
$$
Then $V^{**}$ is the solution of the principal components problem for the sample $v_h$, $1\le h\le H$. Note that the sample covariance matrix of the $v_h$'s is $(N/H)\Sigma^{-1/2}B\Sigma^{-1/2}$, where $B$ is given in (4). Then, according to Theorem 5.5 of Seber (1986), (12) is minimized as follows. Let $t_i$, $1\le i\le p$, be orthogonal eigenvectors of $\Sigma^{-1/2}B\Sigma^{-1/2}$ with norm one corresponding to the eigenvalues $\delta_1\ge\delta_2\ge\cdots\ge\delta_p$, and let $T_K=[t_1\cdots t_K]$. Then $V^{**}$ is the subspace spanned by $t_i$, $1\le i\le K$, and
$$
\min_{V^{**}}\sum_{h=1}^H\|v_h-c_h^{V^{**}}\|^2
=\min_{V^*}\sum_{h=1}^H n_h(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*})'\Sigma^{-1}(\bar x_{h\cdot}-\bar x_{\cdot\cdot}-a_h^{V^*})
=N\sum_{i=K+1}^p\delta_i. \tag{13}
$$
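A quick numerical sanity check of (12)–(13) (an illustrative sketch with arbitrary simulated quantities, not code from the paper): the residual sum of squares attained by projecting the $v_h$ on the span of the top-$K$ eigenvectors equals $N\sum_{i=K+1}^p\delta_i$, and no other $K$-dimensional subspace does better.

```python
import numpy as np

rng = np.random.default_rng(2)
p, K, H = 5, 2, 8
A0 = rng.normal(size=(p, p)); Sigma = A0 @ A0.T + p * np.eye(p)   # an arbitrary "known" Sigma
xbar_h = rng.normal(size=(H, p)); n_h = rng.integers(20, 50, size=H)
N = n_h.sum()
xbar = (n_h[:, None] * xbar_h).sum(axis=0) / N
B = sum(n_h[h] * np.outer(xbar_h[h] - xbar, xbar_h[h] - xbar) for h in range(H)) / N

w, U = np.linalg.eigh(Sigma)
S_inv_half = U @ np.diag(w ** -0.5) @ U.T
v = np.sqrt(n_h)[:, None] * (xbar_h - xbar) @ S_inv_half          # rows v_h

delta, t = np.linalg.eigh(S_inv_half @ B @ S_inv_half)
order = np.argsort(delta)[::-1]; delta, t = delta[order], t[:, order]
P = t[:, :K] @ t[:, :K].T                                         # projector on the top-K eigenvectors
best = ((v - v @ P) ** 2).sum()
print(np.isclose(best, N * delta[K:].sum()))                      # True: equation (13)

Q = np.linalg.qr(rng.normal(size=(p, K)))[0]                      # a random K-dimensional subspace
print(((v - v @ (Q @ Q.T)) ** 2).sum() >= best)                   # True: it cannot do better
```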
Besides,
$$
c_h^{V^{**}}=T_KT_K'v_h.
$$
Since $V^*=\Sigma^{1/2}V^{**}$ and $a_h^{V^*}=n_h^{-1/2}\Sigma^{1/2}c_h^{V^{**}}$, we obtain that $V^*$ is the subspace spanned by $\Sigma^{1/2}t_i$, $1\le i\le K$, and
$$
a_h^{V^*}=\Sigma^{1/2}T_KT_K'\Sigma^{-1/2}(\bar x_{h\cdot}-\bar x_{\cdot\cdot}).
$$
This proves (b).

Proof of (a): Now we will find the maximum likelihood estimate of $\Sigma$. Given a $p\times p$ matrix $A$, we denote by $\lambda_1(A)\ge\cdots\ge\lambda_p(A)$ the eigenvalues of $A$ when they are all real. From (7), (9), (10) and (13) we have
$$
\begin{aligned}
\max_{\mu_1\in V^*+a,\dots,\mu_H\in V^*+a}\ln L(\mu_1,\dots,\mu_H,\Sigma)
&=c-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\bar x_{h\cdot})'\Sigma^{-1}(x_{hj}-\bar x_{h\cdot})-\frac{N}{2}\sum_{i=K+1}^p\lambda_i(\Sigma^{-1}B)\\
&=c-\frac{N}{2}\left[\ln|\Sigma|+\operatorname{trace}\{\Sigma^{-1}W\}+\sum_{i=K+1}^p\lambda_i(\Sigma^{-1}B)\right],
\end{aligned}
$$
where $W$ is given in (5). Then $\hat\Sigma$ should minimize
$$
f(\Sigma)=\ln|\Sigma|+\operatorname{trace}(\Sigma^{-1}W)+\sum_{i=K+1}^p\lambda_i(\Sigma^{-1}B).
$$
Put
$$
\Sigma^*=B^{-1/2}\Sigma B^{-1/2} \tag{14}
$$
and $A=B^{-1/2}WB^{-1/2}$. Then it is straightforward to show that
$$
f(B^{1/2}\Sigma^*B^{1/2})=\ln|B|+\ln|\Sigma^*|+\operatorname{trace}\{\Sigma^{*-1}(B^{-1/2}WB^{-1/2})\}+\sum_{i=K+1}^p\lambda_i(\Sigma^{*-1})=\ln|B|+g(\Sigma^*),
$$
where
$$
g(\Sigma^*)=\ln|\Sigma^*|+\operatorname{trace}\{\Sigma^{*-1}A\}+\sum_{i=K+1}^p\lambda_i(\Sigma^{*-1}).
$$
Then we have to find $\Sigma^*$ minimizing $g(\Sigma^*)$. We can write
$$
\Sigma^*=D\Phi D' \tag{15}
$$
with $\Phi=\operatorname{diag}(\phi_1,\dots,\phi_p)$, $\phi_1\ge\phi_2\ge\cdots\ge\phi_p\ge 0$, the eigenvalues of $\Sigma^*$, and $D=[d_1\cdots d_p]$ the orthogonal matrix where $d_i$ is an eigenvector corresponding to $\phi_i$. Then
$$
\ln|\Sigma^*|=\ln\left(\prod_{i=1}^p\phi_i\right)=\sum_{i=1}^p\ln(\phi_i),
$$
$$
\operatorname{trace}\{\Sigma^{*-1}A\}=\operatorname{trace}(D\Phi^{-1}D'A)=\operatorname{trace}(\Phi^{-1}D'AD)=\sum_{i=1}^p\frac{d_i'Ad_i}{\phi_i}
$$
and
$$
\sum_{i=K+1}^p\lambda_i(\Sigma^{*-1})=\frac{1}{\phi_1}+\cdots+\frac{1}{\phi_{p-K}}.
$$
Then we have to find the values of $(D,\Phi)$ minimizing
$$
g^*(D,\Phi)=g(D\Phi D')=\sum_{i=1}^p\ln(\phi_i)+\sum_{i=1}^p\frac{d_i'Ad_i}{\phi_i}+\sum_{i=1}^{p-K}\frac{1}{\phi_i}
$$
subject to the constraints that $D$ is an orthogonal matrix and $\Phi$ is a diagonal matrix with positive diagonal elements. Observe that the function $r(a)=\ln(a)+b/a$ is minimized by taking $a=b$. Then we have
$$
\phi_i=d_i'Ad_i+1,\qquad 1\le i\le p-K, \tag{16}
$$
$$
\phi_i=d_i'Ad_i,\qquad p-K+1\le i\le p, \tag{17}
$$
and $D$ is an orthogonal matrix minimizing
$$
m(D)=\sum_{i=1}^{p-K}\ln(d_i'Ad_i+1)+\sum_{i=p-K+1}^p\ln(d_i'Ad_i).
$$
Consider the spectral decomposition $A=C\Lambda C'$, where $C$ is orthogonal and $\Lambda$ diagonal with diagonal elements $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$. Put
$$
E=C'D, \tag{18}
$$
then $E=[e_1\cdots e_p]$ is an orthogonal matrix minimizing
$$
m^*(E)=\sum_{i=1}^{p-K}\ln\left(\sum_{j=1}^p\lambda_je_{ji}^2+1\right)+\sum_{i=p-K+1}^p\ln\left(\sum_{j=1}^p\lambda_je_{ji}^2\right).
$$
Since $\ln(x+1)$ and $\ln(x)$ are concave, we have
$$
m^*(E)\ge\sum_{i=1}^{p-K}\sum_{j=1}^p e_{ji}^2\ln(\lambda_j+1)+\sum_{i=p-K+1}^p\sum_{j=1}^p e_{ji}^2\ln(\lambda_j)
=\sum_{j=1}^p f_j\ln(\lambda_j+1)+\sum_{j=1}^p(1-f_j)\ln(\lambda_j), \tag{19}
$$
where $f_j=\sum_{i=1}^{p-K}e_{ji}^2$, $0\le f_j\le 1$. Clearly
$$
\sum_{j=1}^p f_j=p-K. \tag{20}
$$
We are going to show that the minimum of
$$
b(f_1,\dots,f_p)=\sum_{j=1}^p f_j\ln(\lambda_j+1)+\sum_{j=1}^p(1-f_j)\ln(\lambda_j)
$$
subject to (20) occurs when $f_j=1$ for $1\le j\le p-K$ and $f_j=0$ for $p-K+1\le j\le p$. To prove this, it is enough to show that if $f_{i_0}<1$ for some $1\le i_0\le p-K$, we can still decrease $b(f_1,\dots,f_p)$. In fact, in that case there exists $j_0>p-K$ such that $f_{j_0}>0$. Take $\varepsilon>0$ such that $f_{i_0}+\varepsilon<1$ and $f_{j_0}-\varepsilon>0$, and put $f_{i_0}^*=f_{i_0}+\varepsilon$, $f_{j_0}^*=f_{j_0}-\varepsilon$ and $f_i^*=f_i$ for $i\ne i_0,j_0$. Then
$$
b(f_1,\dots,f_p)-b(f_1^*,\dots,f_p^*)
=\varepsilon\left[-\ln(\lambda_{i_0}+1)+\ln(\lambda_{i_0})+\ln(\lambda_{j_0}+1)-\ln(\lambda_{j_0})\right]
=\varepsilon\ln\left(\frac{(\lambda_{j_0}+1)\lambda_{i_0}}{(\lambda_{i_0}+1)\lambda_{j_0}}\right)
=\varepsilon\ln\left(\frac{1+1/\lambda_{j_0}}{1+1/\lambda_{i_0}}\right)\ge 0.
$$
Then by (19) we get
$$
m^*(E)\ge\sum_{j=1}^{p-K}\ln(\lambda_j+1)+\sum_{j=p-K+1}^p\ln(\lambda_j).
$$
On the other hand,
$$
m^*(I_p)=\sum_{j=1}^{p-K}\ln(\lambda_j+1)+\sum_{j=p-K+1}^p\ln(\lambda_j).
$$
Then the identity matrix $E=I_p$ attains this lower bound and is therefore an orthogonal matrix which minimizes $m^*(E)$; hence by (18) the orthogonal matrix $D=C$ minimizes $m(D)$. Then by (15)–(17) we get
$$
\hat\Sigma^*=C\Lambda C'+C_{p-K}C_{p-K}'=B^{-1/2}WB^{-1/2}+C_{p-K}C_{p-K}',
$$
where $C_h$ denotes the matrix with the first $h$ columns of $C$. Using (14) we get
$$
\hat\Sigma=W+B^{1/2}C_{p-K}C_{p-K}'B^{1/2},
$$
and this proves (a).

Proof of (c): Take
$$
z_i=\hat\Sigma^{-1/2}B^{1/2}c_i,
$$
where $c_i$ is the eigenvector of $A=B^{-1/2}WB^{-1/2}$ corresponding to the eigenvalue $\lambda_i$. We start by proving that $z_i$ is an eigenvector of $\hat\Sigma^{-1/2}B\hat\Sigma^{-1/2}$ corresponding to the eigenvalue $1/\gamma_i$, where $\gamma_i$ is given by
$$
\gamma_i=\begin{cases}\lambda_i+1&\text{if }1\le i\le p-K,\\ \lambda_i&\text{if }p-K+1\le i\le p.\end{cases} \tag{21}
$$
We have to prove that
$$
\hat\Sigma^{-1/2}B\hat\Sigma^{-1/2}z_i=\frac{1}{\gamma_i}z_i,
$$
or equivalently
$$
B\hat\Sigma^{-1}B^{1/2}c_i=\frac{1}{\gamma_i}B^{1/2}c_i.
$$
This is the same as
$$
B^{-1/2}\hat\Sigma B^{-1/2}c_i=\gamma_ic_i,
$$
and by part (a) we only need to prove
$$
B^{-1/2}(W+B^{1/2}C_{p-K}C_{p-K}'B^{1/2})B^{-1/2}c_i=\gamma_ic_i,
$$
which is the same as
$$
(A+C_{p-K}C_{p-K}')c_i=\gamma_ic_i. \tag{22}
$$
Since $Ac_i=\lambda_ic_i$ and $C_{p-K}C_{p-K}'c_i=c_i$ if $1\le i\le p-K$ and $0$ if $p-K+1\le i\le p$, (22) holds.

Since we have shown that $z_i$ and $\hat t_{p-i+1}$ are both eigenvectors of the same matrix, associated with the same eigenvalue, to prove (c) it is enough to show that $v_i=\hat\Sigma^{1/2}z_i$ is an eigenvector of $BW^{-1}$ corresponding to the eigenvalue $1/\lambda_i$. Since $v_i=\hat\Sigma^{1/2}z_i=B^{1/2}c_i$, we have to prove that
$$
BW^{-1}B^{1/2}c_i=\frac{1}{\lambda_i}B^{1/2}c_i,
$$
and this is equivalent to
$$
B^{1/2}W^{-1}B^{1/2}c_i=\frac{1}{\lambda_i}c_i,
$$
which is true.
Proof of (d): Note that, given $\hat\mu_h$, $1\le h\le H$, the maximum likelihood estimate of $\Sigma$ is the maximum likelihood estimate of $\Sigma$ when we want to fit a normal $N_p(0,\Sigma)$ distribution to the observations $x_{hj}-\hat\mu_h$, $1\le j\le n_h$, $1\le h\le H$, which is $(1/N)\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\hat\mu_h)(x_{hj}-\hat\mu_h)'$. Finally, to prove (6) note that
$$
\begin{aligned}
p&=\operatorname{trace}(I_p)=\operatorname{trace}(\hat\Sigma^{-1}\hat\Sigma)\\
&=\operatorname{trace}\left(\hat\Sigma^{-1}\left(\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\hat\mu_h)(x_{hj}-\hat\mu_h)'\right)\right)\\
&=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}\operatorname{trace}\left((x_{hj}-\hat\mu_h)(x_{hj}-\hat\mu_h)'\hat\Sigma^{-1}\right)\\
&=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}\operatorname{trace}\left((x_{hj}-\hat\mu_h)'\hat\Sigma^{-1}(x_{hj}-\hat\mu_h)\right)\\
&=\frac{1}{N}\sum_{h=1}^H\sum_{j=1}^{n_h}(x_{hj}-\hat\mu_h)'\hat\Sigma^{-1}(x_{hj}-\hat\mu_h).
\end{aligned}
$$
Proof of Theorem 2. The vectors $\hat g_i$, $1\le i\le K$, computed in step (4) of the SIR procedure are the eigenvectors corresponding to the $K$ largest eigenvalues of $\hat\Sigma^{-1/2}B\hat\Sigma^{-1/2}$ and satisfy
$$
\hat\Sigma^{-1/2}B\hat\Sigma^{-1/2}\hat g_i=\hat\nu_i\hat g_i, \tag{23}
$$
where $1>\hat\nu_1\ge\cdots\ge\hat\nu_p>0$. The estimate of the CMS obtained using the SIR algorithm is spanned by $\hat b_i=\hat\Sigma^{-1/2}\hat g_i$, $1\le i\le K$. Eq. (23) is equivalent to $B\hat b_i=\hat\nu_i\hat\Sigma\hat b_i$, and using that $\hat\Sigma=B+W$ we get
$$
B\hat b_i(1-\hat\nu_i)=\hat\nu_iW\hat b_i.
$$
Then we have
$$
\hat\Sigma\hat b_i=\frac{1}{\hat\nu_i}B\hat b_i=\frac{1}{1-\hat\nu_i}W\hat b_i, \tag{24}
$$
and then, by (24), we have
$$
BW^{-1}\hat\Sigma\hat b_i=BW^{-1}\frac{1}{1-\hat\nu_i}W\hat b_i=\frac{1}{1-\hat\nu_i}B\hat b_i=\frac{\hat\nu_i}{1-\hat\nu_i}\hat\Sigma\hat b_i.
$$
Then $\hat\Sigma\hat b_i$ is an eigenvector of $BW^{-1}$ corresponding to the eigenvalue $\hat\nu_i/(1-\hat\nu_i)$, which is the $i$-th largest eigenvalue of $BW^{-1}$. By part (c) of Theorem 1, it turns out that $\hat\Sigma^{1/2}\hat t_i$ is an eigenvector of $BW^{-1}$ corresponding to its $i$-th largest eigenvalue. Then the subspace generated by $\hat b_i$, $1\le i\le K$, is the same as the subspace generated by $\hat\Sigma^{-1}\hat\Sigma^{1/2}\hat t_i$, $1\le i\le K$, which is equal to the subspace $\hat\Sigma^{-1}\hat V^*$, where $\hat V^*$ is as in part (b) of Theorem 1.
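The equivalence can be verified numerically (an illustrative sketch with simulated data written for this edition, not the authors' code; all names are assumptions): the SIR directions $\hat b_i$ span $\hat\Sigma^{-1}\hat V^*$, with $\hat V^*$ spanned by the $K$ leading eigenvectors of $BW^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, K, H = 6, 2, 10
n_h = rng.integers(40, 80, size=H); N = n_h.sum()
A0 = rng.normal(size=(p, p)); Sigma = A0 @ A0.T + p * np.eye(p)
mu = rng.normal(size=p) + rng.normal(size=(H, K)) @ rng.normal(size=(K, p))
x = [rng.multivariate_normal(mu[h], Sigma, size=n_h[h]) for h in range(H)]

xbar = np.concatenate(x).mean(axis=0)
xbar_h = np.array([xh.mean(axis=0) for xh in x])
B = sum(n_h[h] * np.outer(xbar_h[h] - xbar, xbar_h[h] - xbar) for h in range(H)) / N
W = sum((xh - xbar_h[h]).T @ (xh - xbar_h[h]) for h, xh in enumerate(x)) / N
Sigma_hat = B + W                                    # sample covariance of the pooled data

# SIR directions b_hat_i = Sigma_hat^{-1/2} g_i, as in eq. (3)
w, U = np.linalg.eigh(Sigma_hat)
S_inv_half = U @ np.diag(w ** -0.5) @ U.T
nu, g = np.linalg.eigh(S_inv_half @ B @ S_inv_half)
b_hat = S_inv_half @ g[:, np.argsort(nu)[::-1][:K]]

# K leading eigenvectors of B W^{-1}; they span V_hat_star
ev, vec = np.linalg.eig(B @ np.linalg.inv(W))
u = np.real(vec[:, np.argsort(np.real(ev))[::-1][:K]])

def angles(U1, U2):                                  # principal angles in degrees; 0 = same span
    Q1, Q2 = np.linalg.qr(U1)[0], np.linalg.qr(U2)[0]
    s = np.clip(np.linalg.svd(Q1.T @ Q2, compute_uv=False), -1.0, 1.0)
    return np.degrees(np.arccos(s))

print(angles(Sigma_hat @ b_hat, u))                  # ~ [0, 0]: Sigma_hat b_hat_i span V_hat_star
print(angles(b_hat, np.linalg.solve(Sigma_hat, u)))  # ~ [0, 0]: SIR subspace = Sigma_hat^{-1} V_hat_star
```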
References

Cook, R.D., 1994. On the interpretation of regression plots. Journal of the American Statistical Association 89, 177–189.
Cook, R.D., 1996. Graphics for regressions with a binary response. Journal of the American Statistical Association 91, 983–992.
Cook, R.D., 1998. Principal Hessian directions revisited. Journal of the American Statistical Association 93, 84–94.
Cook, R.D., 2007. Fisher lecture: dimension reduction in regression. Statistical Science 22, 1–26.
Cook, R.D., Li, B., 2002. Dimension reduction for conditional mean in regression. Annals of Statistics 30, 455–474.
Cook, R.D., Ni, L., 2005. Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. Journal of the American Statistical Association 100, 410–428.
Cook, R.D., Weisberg, S., 1991. Discussion of sliced inverse regression for dimension reduction, by K.C. Li. Journal of the American Statistical Association 86, 328–332.
Cook, R.D., Wetzel, N., 1993. Exploring regression structure with graphics (with discussion). Test 2, 33–100.
Ferré, L., 1998. Determining the dimension in sliced inverse regression and related methods. Journal of the American Statistical Association 93, 132–140.
Li, K.-C., 1991. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86, 316–327.
Li, B., Zha, H., Chiaromonte, F., 2005. Contour regression: a general approach to dimension reduction. The Annals of Statistics 33, 1580–1616.
Li, K.-C., 1992. On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. Journal of the American Statistical Association 87, 1025–1039.
Schott, J.R., 1994. Determining the dimensionality in sliced inverse regression. Journal of the American Statistical Association 89, 141–148.
Seber, G.A.F., 1986. Multivariate Observations. Wiley, New York.
Zhu, Y., Zeng, P., 2006. Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association 101, 1638–1651.