Journal of Multivariate Analysis 122 (2013) 148–161
Direction estimation in single-index models via distance covariance

Wenhui Sheng, Xiangrong Yin ∗
Department of Statistics, 204 Statistics Building, University of Georgia, Athens, GA 30602, United States
Article history: Received 17 August 2012; available online 6 August 2013.
AMS 2010 subject classifications: 62G08, 62G20, 62H20.
Keywords: Brownian distance covariance; Central subspace; Distance covariance; Single-index model; Sufficient dimension reduction.
Abstract

We introduce a new method for estimating the direction in single-index models via distance covariance. Our method keeps the model-free advantage of a dimension reduction approach. In addition, no smoothing technique is needed, which enables our method to work efficiently when many predictors are discrete or categorical. Under regularity conditions, we show that our estimator is root-n consistent and asymptotically normal. We compare the performance of our method with several dimension reduction methods and a single-index estimation method by simulations, and show that our method is very competitive and robust across a number of models. Finally, we analyze the UCI Adult Data Set to demonstrate the efficacy of our method.
1. Introduction

Dimension reduction without loss of information is an important statistical tool for exploring a data set without assuming a specific parametric model. It is becoming more and more important due to the increasing data volume and dimension in many scientific fields. Let X be a p × 1 predictor vector, and let Y be a univariate response. The aim of dimension reduction is to seek one or the fewest linear combinations of X, say β^T X, where β is a vector or a p × d matrix, such that Y depends on X only through β^T X. That is, Y ⫫ X | β^T X, where the notation "⫫" means "independent of". The column space of β is called a dimension reduction subspace [19,3,4]. When the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, it is called the central subspace, denoted by S_{Y|X} [3,4]. The dimension of S_{Y|X} is denoted by dim(S_{Y|X}). Under mild conditions [4,6,34], the central subspace exists and is unique. Throughout this article, we assume that the central subspace exists and is unique, and we consider the special case in which the central subspace is one dimensional, that is, dim(S_{Y|X}) = 1. The regression model corresponding to this special central subspace is a general single-index model, written as Y = g(η^T X, ϵ), where η is a p × 1 vector, ϵ is an unknown random error independent of X, and g is an unknown link function. We propose a new method to estimate a basis of the central subspace S_{Y|X} = Span(η).

2. Background and motivation

The single-index model, as a special case of sufficient dimension reduction, is a practically useful generalization of the classical linear regression model. Since it is increasingly popular in many scientific fields, the estimation of the single-index coefficients, or the single-index direction, is an important objective in statistical analysis. The classical single-index model assumes that the conditional mean of a univariate response variable, E(Y|X), depends on the p × 1 predictor vector X only through a single linear combination of X; that is, Y = g(η^T X) + ϵ, where η is a p × 1 vector, g is unknown and E(ϵ|X) = 0. A
∗ Corresponding author. E-mail addresses: [email protected] (W. Sheng), [email protected], [email protected] (X. Yin).
more general single-index model of the form Y = g(η^T X, ϵ) was considered by Li and Duan [21], where ϵ is an unknown random error independent of X. This model means that the single index may exist beyond the conditional mean function, and that Y is independent of X given η^T X. In such a case, the central subspace S_{Y|X} = Span(η) has dimension one, which is the setting we consider in this article.

Many proposed dimension reduction methods can be applied to estimate the single-index direction in single-index models under the framework of the central subspace. Generally, we may classify these methods into three classes, which we refer to as inverse methods, forward methods and correlation methods. For inverse methods, the most representative ones are sliced inverse regression (SIR) [19], sliced average variance estimation (SAVE) [10] and directional regression (DR) [22]. These methods are based on the idea of inverse regression. Inverse methods require certain conditions on X, such as the linearity condition and the constant covariance condition. In practice, these conditions are hard to verify, which may affect the reliability of these methods. Forward methods, for instance dMAVE [31], rMAVE [32] and SR [30], relax those conditions on X. However, they need some smoothing technique such as kernel estimation, and because of this they face the problem of data sparseness. Due to the use of kernel smoothing, they require at least one of the predictors to be continuous [18,17]. For correlation methods, Zhu and Zeng [37] and Zeng and Zhu [36] proposed a Fourier transformation method and a general integral approach, respectively. However, their methods require X to be multivariate normal, or need to estimate marginal densities nonparametrically, which involves kernel smoothing. Yin and Cook [33] and Yin, Li and Cook [34] proposed methods to recover the central subspace by minimizing a Kullback–Leibler distance. Again, their methods need to estimate density functions nonparametrically, which involves kernel smoothing. Cook and Forzani [7] proposed a likelihood method for dimension reduction, which requires that the conditional distribution of X|Y is normal. Recently, Ma and Zhu [26] provided a different but nice view of sufficient dimension reduction methods. However, their estimation procedures again involve nonparametric estimation of conditional means or related forms.

When the central subspace is one dimensional and the single index is within the conditional mean function, many nonparametric or semiparametric methods have also been proposed to estimate the single-index coefficient. Powell, Stock and Stoker [27] and Härdle and Stoker [16] proposed an average derivative method (ADE). Horowitz and Härdle [17] generalized the ADE method to the case where some components of X are discrete. Härdle, Hall and Ichimura [15] introduced a simultaneous minimization method to estimate the single-index coefficient and the univariate link function. These methods again require at least one predictor to be continuous and the link function to be smooth. Cui, Härdle and Zhu [11] considered a rather general form of single-index model where the single index can appear in both the regression mean function and the variance function. They proposed the estimating function method (EFM), which avoids the data sparsity problem by using a fixed-point iterative algorithm; however, their method still requires the link function to be smooth.
In summary, while the existence of the single-index model as a special case of a one-dimensional central subspace requires only very weak conditions [34], in order to estimate the single-index direction one has to assume either certain distributional conditions on X or X|Y, or require at least one of the predictors to be continuous, or the link function to be smooth, regardless of whether one uses dimension reduction approaches or specifically designed single-index estimation approaches. In this article, we propose a new method, based on distance covariance [29,28], that estimates the direction η in the general single-index model without the aforementioned assumptions. Compared with dimension reduction methods, our approach keeps the same advantages, such as being model free. Compared with specifically designed single-index estimation approaches, our method does not require parametric or semi-parametric assumptions, at least one of the predictors to be continuous, or the link function to be smooth. No smoothing is needed in our estimation procedure, which is especially useful when many predictors are discrete or categorical. To our knowledge, only Cook and Li [8] have proposed a dimension reduction method for regressions that can deal with many categorical predictors, but they assumed that the distribution of X|Y follows a one-parameter exponential family. In addition, we show that under regularity conditions, our estimator of the single-index direction is root-n consistent and asymptotically normal.

The article is organized as follows. Section 3 describes the methodology behind our method and states the main results. Section 4 discusses the numerical studies, which include simulation studies and a real data analysis, followed by a short discussion in Section 5. All proofs are given in Appendix A, with a long proof arranged in a supplementary file (see Appendix B).

3. Methodology

We first give some notation which will be useful in the following sections. Let B be a p × d matrix and let Span(B) stand for the subspace of R^p spanned by the columns of B. Let Σ_X be the covariance matrix of X, which is assumed to be nonsingular. Let P_{B(Σ_X)} denote the projection operator onto Span(B) relative to the inner product ⟨a, b⟩ = a^T Σ_X b; that is, P_{B(Σ_X)} = B(B^T Σ_X B)^{-1} B^T Σ_X. Let Q_{B(Σ_X)} = I − P_{B(Σ_X)}, where I is the identity matrix.

3.1. The model

The model that we consider, written in a traditional way, is

    Y = g(η^T X, ϵ),    (1)

where Y is a univariate random variable representing the response, X is a p × 1 random vector representing the predictor, g is an unknown function, η is a p × 1 vector, and ϵ is an unknown error independent of X. Under the dimension reduction
set-up, model (1) becomes Y ⫫ X | η^T X, and the central subspace S_{Y|X} = Span(η), which is our goal to estimate.

3.2. The method

Let β ∈ R^p. We will show that, under a mild condition, solving (2) with respect to β yields a basis of S_{Y|X}, or in other words, the single-index direction:

    max_{β^T Σ_X β = 1}  V^2(β^T X, Y).    (2)
In formula (2), V(β^T X, Y) is the distance covariance (DCOV) between β^T X and Y. Indeed, the DCOV between Y and β^T X is the non-negative number V(β^T X, Y), whose square is defined [29] as

    V^2(β^T X, Y) = ∫_{R^{1+1}} |f_{β^T X, Y}(t, s) − f_{β^T X}(t) f_Y(s)|^2 w(t, s) dt ds,    (3)

where f(·) denotes the characteristic function and w(t, s) is a specially chosen positive weight function. Székely and Rizzo [28] then gave a simpler but equivalent form of (3) as follows:

    V^2(β^T X, Y) = E|β^T X − β^T X′||Y − Y′| + E|β^T X − β^T X′| E|Y − Y′|
                    − E|β^T X − β^T X′||Y − Y″| − E|β^T X − β^T X″||Y − Y′|,    (4)

where (X′, Y′) and (X″, Y″) are i.i.d. copies of (X, Y). The definition of DCOV requires that E|X| < ∞ and E|Y| < ∞, which guarantees that the DCOV is finite. Thus, throughout the paper we assume E|X| < ∞ and E|Y| < ∞. DCOV is a new distance measure of dependence, analogous to product-moment covariance. In fact, DCOV has a nice distance correlation interpretation [29] as R^2 = V^2(β^T X, Y) / √(V^2(β^T X, β^T X) V^2(Y, Y)), which is analogous to the product-moment correlation.
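To make (4) concrete, the following is a minimal Monte Carlo sketch (in Python, not the authors' code) that approximates the four expectations in (4) on a toy single-index model; the model, the sample size and the candidate directions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 10
eta = np.zeros(p); eta[:2] = [1.0, -1.0]            # toy direction (illustrative)
X = rng.standard_normal((n, p))
Y = np.exp(X @ eta) + rng.standard_normal(n)        # toy single-index model

def dcov2_mc(beta, X, Y, rng):
    """Monte Carlo approximation of V^2(beta'X, Y) via the four expectations in (4)."""
    idx1, idx2 = (rng.integers(0, len(Y), size=len(Y)) for _ in range(2))
    u, y = X @ beta, Y
    u1, y1 = u[idx1], y[idx1]                        # resampled copy playing (X', Y')
    u2, y2 = u[idx2], y[idx2]                        # resampled copy playing (X'', Y'')
    t1 = np.mean(np.abs(u - u1) * np.abs(y - y1))
    t2 = np.mean(np.abs(u - u1)) * np.mean(np.abs(y - y1))
    t3 = np.mean(np.abs(u - u1) * np.abs(y - y2))
    t4 = np.mean(np.abs(u - u2) * np.abs(y - y1))
    return t1 + t2 - t3 - t4

b_rand = rng.standard_normal(p); b_rand /= np.linalg.norm(b_rand)
print(dcov2_mc(eta / np.sqrt(2), X, Y, rng))        # near the true (unit-length) direction
print(dcov2_mc(b_rand, X, Y, rng))                  # a random unit-length direction
```

On such toy data the value at (a scalar multiple of) the true direction is noticeably larger than at a random direction, which is the population phenomenon exploited below.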
However, for the purpose of estimating the single-index direction, we only need to maximize the squared DCOV instead of R^2 with respect to β. Note that for any constant c, V^2(c β^T X, Y) = |c| V^2(β^T X, Y) [28, Theorem 3]; thus putting a constraint on the length of β is enough to make the maximization procedure work. Hence, for estimation purposes and convenience we set the constraint β^T Σ_X β = 1. One may use other constraints, such as V^2(β^T X, β^T X) = 1, which essentially leads to using R^2, or β^T β = 1. The next proposition indicates that if we maximize V^2(β^T X, Y) with respect to β under the constraint β^T Σ_X β = 1, then the solution is a basis of the central subspace.

Proposition 1. Let η be a basis of the central subspace S_{Y|X} with η^T Σ_X η = 1. If P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X, then V^2(η^T X, Y) ≥ V^2(β^T X, Y) for any β ∈ R^p with β^T Σ_X β = 1. The equality holds if and only if Span(β) = Span(η), or in other words, β = η or β = −η.

Proposition 1 indicates that under the condition P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X and the length constraint, solving (2) yields a basis of the central subspace. Solving (2) does not have a unique solution in terms of η: it has two solutions, η and −η, because V^2(−η^T X, Y) = V^2(η^T X, Y). However, η and −η span the same space; thus, as long as the central subspace exists, which is what we assume and are interested in, the problem considered in this article is identifiable. According to the properties of distance covariance, the length constraint in Proposition 1 is needed to make the maximization procedure work.

In Proposition 1, when X is normal, the condition P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X is satisfied, because Cov(P_{η(Σ_X)}^T X, Q_{η(Σ_X)}^T X) = 0. However, normality is not necessary. The independence condition in Proposition 1 will be discussed further in Section 3.5.
3.3. Estimation of η

Let (X, Y) = {(X_i, Y_i) : i = 1, . . . , n} be a random sample from (X, Y). As Székely, Rizzo and Bakirov [29] discussed, the sample version of V^2(β^T X, Y), denoted by V_n^2(β^T X, Y), has the following simple form:

    V_n^2(β^T X, Y) = (1/n^2) Σ_{k,l=1}^{n} A_{kl}(β) B_{kl},    (5)

where, for k, l = 1, . . . , n,

    A_{kl}(β) = a_{kl}(β) − ā_{k·}(β) − ā_{·l}(β) + ā_{··}(β),
    a_{kl}(β) = |β^T X_k − β^T X_l|,
    ā_{k·}(β) = (1/n) Σ_{l=1}^{n} a_{kl}(β),
    ā_{·l}(β) = (1/n) Σ_{k=1}^{n} a_{kl}(β),
    ā_{··}(β) = (1/n^2) Σ_{k,l=1}^{n} a_{kl}(β).
Similarly, define b_{kl} = |Y_k − Y_l| and B_{kl} = b_{kl} − b̄_{k·} − b̄_{·l} + b̄_{··}. Here | · | is the Euclidean norm in the respective dimension, which in the current context is the usual absolute distance. The simple form of V_n^2(β^T X, Y) in (5) benefits the computation. Therefore, as our goal is to maximize V_n^2(β^T X, Y) with respect to β under β^T Σ̂_X β = 1, where Σ̂_X is the sample version of Σ_X, the estimator of a basis of the central subspace, denoted by η_n, is η_n = arg max_{β^T Σ̂_X β = 1} V_n^2(β^T X, Y).
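For concreteness, the following is a minimal sketch of the sample statistic V_n^2(β^T X, Y) in (5) via double-centered distance matrices. It is not the authors' MATLAB implementation, and the toy data in the usage lines are illustrative assumptions.

```python
import numpy as np

def dcov2_n(beta, X, Y):
    """Sample squared distance covariance V_n^2(beta'X, Y) as in (5)."""
    u = X @ beta                                    # projected predictor beta'X
    a = np.abs(u[:, None] - u[None, :])             # a_kl = |beta'X_k - beta'X_l|
    b = np.abs(Y[:, None] - Y[None, :])             # b_kl = |Y_k - Y_l|
    # double centering: A_kl = a_kl - a_k. - a_.l + a..
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()
    return (A * B).mean()                           # (1/n^2) sum_{k,l} A_kl B_kl

# illustrative usage with the constraint beta' Sigma_hat beta = 1
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Y = np.exp(X[:, 0] - X[:, 1]) + rng.standard_normal(200)
beta = np.r_[1.0, -1.0, np.zeros(8)]
beta /= np.sqrt(beta @ np.cov(X, rowvar=False) @ beta)
print(dcov2_n(beta, X, Y))
```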
An algorithm that incorporates the constraint β^T Σ̂_X β = 1 to maximize V_n^2(β^T X, Y) is the Sequential Quadratic Programming procedure (SQP; [13, Ch. 6]). The idea of SQP is to model this problem at each iterate β^(k) by a Quadratic Programming (QP) subproblem and to use the solution of this subproblem to find the new iterate β^(k+1). This procedure iterates until convergence. In the SQP procedure, the Lagrangian function plays a key role [2, pp. 7–10]. In our case, it is L(β, λ) = V_n^2(β^T X, Y) + λ(β^T Σ̂_X β − 1). The construction of the QP subproblem is based on this Lagrangian function. The outline of the algorithm is as follows (a code sketch with an off-the-shelf solver is given after the algorithm):

1. Supply an initial value: we randomly generate 1000 standard normal vectors and normalize them to unit length. We then select the best one (the one giving the largest DCOV) as the initial value to start the SQP algorithm.
2. Construct and solve the QP subproblem to obtain the search direction d_β: given a current iterate (β^(k), λ^(k)), the QP subproblem is constructed as

       minimize_{d_β}   −∇L(β^(k), λ^(k))^T d_β + (1/2) d_β^T H^(k) d_β
       subject to       ∇h(β^(k))^T d_β + h(β^(k)) = 0,

   where d_β = β − β^(k), h(β) = β^T Σ_X β − 1, ∇h is the gradient of the constraint function h at β^(k), ∇L is the gradient of the Lagrangian function at β^(k), and H^(k) is the quasi-Newton approximation of the Hessian of the Lagrangian function at β^(k). The solution of the QP subproblem, d_β, is then used as the search direction to form a new iterate.∗
3. Choose the step length α: the step length parameter α is determined in order to produce a sufficient decrease in a merit function. Once the step length α is estimated, the new iterate is β^(k+1) = β^(k) + α d_β. λ^(k) needs to be updated too. One obvious way is to use the optimal multipliers of the QP subproblem: let the optimal multiplier of the QP be λ_QP and set d_λ = λ_QP − λ^(k); then λ^(k+1) = λ^(k) + α d_λ.⋆
4. Repeat step 2 and step 3 until the difference between β^(k) and β^(k+1) is sufficiently small (say, less than 10^{-6}), which means the algorithm converges.
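As mentioned above, the following is a rough sketch of this constrained maximization using a generic off-the-shelf SQP-type solver (SciPy's SLSQP) in place of the authors' MATLAB implementation; the multi-start initialization and the convergence tolerance mirror the outline, and dcov2_n refers to the sample statistic sketched in Section 3.3.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_direction(X, Y, n_starts=1000, seed=0):
    """Maximize V_n^2(beta'X, Y) subject to beta' Sigma_hat beta = 1 (a sketch)."""
    rng = np.random.default_rng(seed)
    Sigma_hat = np.cov(X, rowvar=False)
    constraint = {"type": "eq", "fun": lambda b: b @ Sigma_hat @ b - 1.0}
    objective = lambda b: -dcov2_n(b, X, Y)          # minimize the negative DCOV

    # multi-start: pick the best of n_starts random unit-length initial vectors
    starts = rng.standard_normal((n_starts, X.shape[1]))
    starts /= np.linalg.norm(starts, axis=1, keepdims=True)
    b0 = starts[np.argmin([objective(b) for b in starts])]

    res = minimize(objective, b0, method="SLSQP", constraints=[constraint],
                   options={"ftol": 1e-6, "maxiter": 500})
    return res.x

# eta_n = estimate_direction(X, Y)   # X: n x p array, Y: length-n array
```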
∗ Gill, Murray and Wright [13, pp. 237–247] give details of how to solve the QP subproblem and related properties and issues of the QP subproblem.
⋆ For the details of the algorithm for computing the step length, please see Gill, Murray and Wright [13, Section 4.3.2.1, pp. 100–102]. Discussions of different merit functions and the purpose of incorporating a merit function in the SQP algorithm can be found in [2, pp. 27–37].

This scheme seems to work well in our simulations and real data analysis. The MATLAB code for our algorithm is available from the authors. In the next section, we show that the estimator η_n is consistent and √n asymptotically normal.

3.4. Asymptotic properties of η_n

Proposition 2. Let η ∈ R^p be a basis of the central subspace with η^T Σ_X η = 1, and η_n = arg max_{β^T Σ̂_X β = 1} V_n^2(β^T X, Y). Assume P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X and that the support of X ∈ R^p, say S, is a compact set. Then there exists a constant c = 1 or c = −1 such that η_n →^P c η as n → ∞.

Similar to the population level, V_n^2(β^T X, Y) = V_n^2(−β^T X, Y); thus maximizing V_n^2(β^T X, Y) with respect to β under the constraint has two solutions, η_n and −η_n, which span the same subspace. The purpose of using the constant c = 1 or c = −1 is to make sure that the first nonzero components of η_n and c η have the same sign. In general, the support of X does not have to be compact. However, Yin, Li and Cook [34, Proposition 11] showed that as long as the compact set S is large enough, S_{Y|X_S} = S_{Y|X}, where X_S is X restricted to S. Hence we can restrict our discussion to a compact set S to simplify the proof. Under such a condition, E|X| < ∞ holds, which together with E|Y| < ∞ satisfies the definition of distance covariance. The proof of Proposition 2 is given in Appendix A. Indeed, we can further prove the √n-consistency and asymptotic normality of the estimator, as stated below; the proof of Proposition 3 is again deferred to Appendix A.

Proposition 3. Let η ∈ R^p be a basis of the central subspace with η^T Σ_X η = 1, and η_n = arg max_{β^T Σ̂_X β = 1} V_n^2(β^T X, Y). Under the same conditions as in Proposition 2 and also the regularity conditions in the supplementary file (see Appendix B), there exists a constant c = 1 or c = −1 such that √n(η_n − c η) → N(0, V_11), where V_11 is a covariance matrix defined in the supplementary file (see Appendix B).
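As an aside, Proposition 3 justifies Wald-type confidence intervals of the form η_{n,j} ± z_{α/2} SE(η_{n,j}), which is how the intervals reported later in Table 6 can be read. A minimal sketch, assuming the standard errors come from the estimated asymptotic variance:

```python
import numpy as np
from scipy.stats import norm

def wald_ci(eta_n, se, level=0.95):
    """Componentwise Wald intervals eta_n[j] +/- z_{alpha/2} * se[j]."""
    z = norm.ppf(0.5 + level / 2.0)
    return np.column_stack((eta_n - z * se, eta_n + z * se))

# example: wald_ci(np.array([-0.6783]), np.array([0.1120]))  # approx (-0.898, -0.459)
```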
3.5. Discussion about the condition in Proposition 1

In Section 3.2, Proposition 1 requires the condition P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X. In this section, we establish an approximate result for arbitrary predictors, so that this condition is not as strong as it seems to be. This is based on the fundamental result that, when the dimension p is reasonably large, low dimensional projections of the predictor are approximately multivariate normal [12,14]. To be more specific, we adopt what was developed by Li and Yin [24]: re-write quantities such as X, β, η as X_p, β_p, η_p. Let S^p denote the unit sphere in R^p, {u ∈ R^p : |u| = 1}, and let Unif(S^p) denote the uniform distribution on S^p. The result of Diaconis and Freedman [12] states that, if β_p ∼ Unif(S^p), then under mild conditions, the conditional distribution of β_p^T X_p | β_p converges weakly in probability (abbreviated as w.i.p.) to normal as p → ∞. That is, the sequence of conditional characteristic functions E(e^{it β_p^T X_p} | β_p) converges (pointwise) in probability to a normal characteristic function. As Li and Yin [24] commented, intuitively this means that when p is large, the distribution of β_p^T X_p is nearly normal for most of the β_p. Here, the parameter β_p is treated as random to facilitate the notion of "most of the β_p". We adopt this assumption, under which it makes sense to assume β_p ⫫ (Y, X_p), where the data (Y, X_p) are treated as fixed. With β_p being random, the dimension reduction relation should be stated as Y ⫫ X_p | (β_p^T X_p, β_p).

Definition 1 ([24, Definition 6.1]). Let β_p be a p × d dimensional random matrix whose columns are i.i.d. Unif(S^p), and β_p ⫫ (U_p, Y). We say that Y and U_p are asymptotically conditionally independent given (β_p^T U_p, β_p), in terms of weak convergence in probability, if for any random vector ζ_p satisfying ζ_p ∼ Unif(S^p) and ζ_p ⫫ (β_p, U_p, Y), the sequence (Y, β_p^T U_p, ζ_p^T U_p) | (β_p, ζ_p) converges w.i.p. to (Y*, U*, V*) in which Y* ⫫ V* | U*. If this holds, we write Y ⫫ U_p | (β_p^T U_p, β_p) w.i.p. as p → ∞.
Definition 2 ([24, Definition 6.2]). We say that a sequence of p × p matrices {Σ_p : p = 1, 2, . . .} is regular if p^{-1} trace(Σ_p) → σ^2 and |Σ_p|_F^2 = o(p^2).

In Definition 2, | · |_F denotes the Frobenius norm. Li and Yin [24] had a nice discussion of these two definitions and the reasons for using them. We are now ready to state our result, whose proof is in Appendix A.

Proposition 4. Suppose that η_p ∼ Unif(S^p). Suppose, furthermore, that
1. {Σ_{X_p}} is a regular sequence with σ_X^2 being the limit of its traces as p → ∞;
2. η_p ⫫ (X_p, Y);
3. p^{-1} X_p^T X_p = E(p^{-1} X_p^T X_p) + o_P(1);
4. E(|Y|) < ∞ and E(|X_p|) < ∞.
If Y ⫫ X_p | (η_p^T X_p, η_p) and the conditional distribution of Y | (η_p^T X_p = c, η_p) converges w.i.p. for each c, then as p → ∞, for any β_p ∼ Unif(S^p), V^2(η_p^T X, Y) ≥ V^2(β_p^T X, Y). The equality holds if and only if β_p = η_p or β_p = −η_p.

The constraint β_p ∈ Unif(S^p) means that we use β_p^T β_p = 1, which is not the same as the constraint we used previously for the purpose of identifiability, but it greatly simplifies the proof by directly using existing results as p → ∞. However, we can easily scale back to satisfy the previous constraint via β_p / √(β_p^T Σ_X β_p). The condition that "the conditional distribution of Y | (η_p^T X_p = c, η_p) converges w.i.p. for each c" means that the regression relation stabilizes as p → ∞; see Li and Yin [24] for more details. Conditions 1 and 3 correspond to Assumptions 1.2 and 1.1 in [12], respectively. More detailed comments on the two conditions can be found in [24, Lemma 6.2 and Theorem 6.1]. Condition 4 comes from the definition of distance covariance [29], which requires that the first moments of X_p and Y are finite.

Proposition 4 indicates that when p is large, the condition P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X should hold approximately, and so should Proposition 1. Our simulations in Section 4.1 indicate that p = 10 already seems large enough for Proposition 4 to be effective. As commented by Diaconis and Freedman [12], the results in Proposition 4 hold for most distributions. However, they also pointed out that with long-tailed data, such as the Cauchy distribution, asymptotic normality may still fail. In Section 4.1, we give a simulation example where the data are standard Cauchy, and the results show that our method is still quite robust in this case and has a much better performance than the other methods considered in the simulation study.

4. Numerical studies

In Section 4.1, we compare the performance of our method with four dimension reduction methods, SIR [19], SAVE [10], PHD [20,5] and LAD [7], and one recently developed nonparametric single-index estimation method, EFM [11]. For LAD and EFM, we use Forzani's code and Cui's code, respectively.
Table 1
Average estimation accuracy (Δ̄m) and its standard error (SE) for Model A. Entries are Δ̄m with SE in parentheses.

n     Method   Part (1)           Part (2)           Part (3)
100   SIR      0.9361 (0.0896)    0.8179 (0.1189)    0.1393 (0.0421)
      PHD      0.3652 (0.1107)    0.6930 (0.1524)    0.7397 (0.1572)
      SAVE     0.2744 (0.0700)    0.5772 (0.1789)    0.5756 (0.3376)
      LAD      0.2183 (0.0930)    0.3114 (0.1679)    *
      EFM      0.0364 (0.0010)    0.0157 (0.0045)    0.0124 (0.0036)
      DCOV     0.1205 (0.0283)    0.2062 (0.0710)    0.0306 (0.0402)
200   SIR      0.9391 (0.888)     0.7220 (0.1415)    0.0957 (0.0263)
      PHD      0.2598 (0.0738)    0.6224 (0.1829)    0.6265 (0.1618)
      SAVE     0.1889 (0.0481)    0.4332 (0.1441)    0.1471 (0.1125)
      LAD      0.1231 (0.0319)    0.1809 (0.0884)    *
      EFM      0.0226 (0.0062)    0.0100 (0.0024)    0.0085 (0.0020)
      DCOV     0.0785 (0.0224)    0.1425 (0.0631)    0.0011 (0.0021)
400   SIR      0.9420 (0.0913)    0.6348 (0.1129)    0.0661 (0.0170)
      PHD      0.1762 (0.0460)    0.5782 (0.2073)    0.5206 (0.1534)
      SAVE     0.1252 (0.0325)    0.3216 (0.1252)    0.0672 (0.0246)
      LAD      0.0822 (0.0218)    0.1151 (0.0709)    0.0379 (0.0101)
      EFM      0.0147 (0.0038)    0.0065 (0.0020)    0.0059 (0.0013)
      DCOV     0.0514 (0.0142)    0.1026 (0.0253)    0.0006 (0.0002)

* LAD does not work sometimes.
We choose four dimension reduction methods SIR, SAVE, PHD and LAD, because SIR, SAVE and PHD are classic dimension reduction methods, while LAD is similar to our approach in that both methods maximize objective functions. Their method directly maximizes the likelihood, whereas ours maximizes a distance covariance function. There are also many methods designed for estimating the single index specifically. We choose to compare with EFM, because it is the most recently developed method for the purpose of estimating the single index and has advantages over other methods as [11] showed. In Section 4.2, we will analyze the UCI Adult Data Set and compare our result with results from other existing methods. For comparison, we use the following distance [25] to measure the accuracy of the estimates:
    Δm(S1, S2) = ‖P_{S1} − P_{S2}‖,

where ‖ · ‖ is the maximum singular value of a matrix, S1 and S2 are two d-dimensional subspaces of R^p, and P_{S1} and P_{S2} are the usual orthogonal projections onto S1 and S2, respectively. A smaller Δm means a more accurate estimate.

4.1. Simulation studies

We simulated five different models to illustrate the efficacy of our method. Let β_1 = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)^T, β_2 = (1, −1, 0, 0, 0, 0, 0, 0, 0, 0)^T, β_3 = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)^T and β_4 = (2, 1, 0, 0, 0, 0, 0, 0, 0, 0)^T / √5. We report the results using p = 10. We run sample sizes n = 100, 200 and 400 for each model. For each n, we report the mean Δ̄m and standard error (SE) of the Δm's over 100 data replicates. With error ϵ ∼ N(0, 1), the five models are

(A) Y = (β_1^T X)^2 + ϵ,
(B) Y = exp(β_2^T X) + ϵ,
(C) Y = 0.2 (β_1^T X)^2 ϵ,
(D) Y = 1 if β_3^T X > 2, and Y = 0 otherwise,
(E) P(Y = 1 | X) = exp(g(β_4^T X)) / [1 + exp(g(β_4^T X))], where g(β_4^T X) = exp(5 β_4^T X − 2) / {1 + exp(5 β_4^T X − 3)} − 1.5.

Predictors are generated according to the following schemes: part (1), standard normal predictors X ∼ N(0, I_10); part (2), non-normal predictors; part (3), discrete predictors. The purpose of generating three different types of predictors is to compare the performance of our method with the five other methods under different scenarios.

Model A. Model A is very similar to Example 2 in [33]: a nonlinear structure in the regression mean function. Predictors for part (2) and part (3) are generated as follows: part (2), x_1 ∼ N(−8, 4), x_2 ∼ F(4, 10), x_3 ∼ χ^2(5), x_4 ∼ t(15), x_5 ∼ t(3), x_i ∼ N(0, 1), i = 6, . . . , 10; part (3), x_i ∼ Poisson(1), i = 1, . . . , 5, x_i ∼ N(0, 1), i = 6, . . . , 10. Simulation results are listed in Table 1. The mean function is symmetric about 0 in part (1) and semi-symmetric in part (2); as expected, SIR fails. But SIR performs very well in part (3), because the predictors in part (3) are discrete and positive, so the model is no longer symmetric and has a strong linear trend. PHD and SAVE perform relatively well for part (1), as a nonlinear
Table 2
Average estimation accuracy (Δ̄m) and its standard error (SE) for Model B. Entries are Δ̄m with SE in parentheses.

n     Method   Part (1)           Part (2)           Part (3)
100   SIR      0.2626 (0.0593)    0.3880 (0.0977)    0.3154 (0.0774)
      PHD      0.5969 (0.1279)    0.7075 (0.0655)    0.7699 (0.0999)
      SAVE     0.8433 (0.2020)    0.7242 (0.1606)    0.7975 (0.2083)
      LAD      0.2739 (0.0824)    0.5311 (0.1071)    *
      EFM      0.2764 (0.2859)    0.3753 (0.2947)    0.4063 (0.2559)
      DCOV     0.1784 (0.0462)    0.4830 (0.1108)    0.2018 (0.1018)
200   SIR      0.1843 (0.0503)    0.3451 (0.0760)    0.2176 (0.0577)
      PHD      0.5107 (0.1098)    0.6700 (0.0615)    0.6560 (0.1145)
      SAVE     0.3238 (0.1630)    0.4791 (0.1650)    0.3169 (0.1144)
      LAD      0.1686 (0.0448)    0.5151 (0.0785)    0.1883 (0.0567)
      EFM      0.1822 (0.2293)    0.3618 (0.2700)    0.3180 (0.2538)
      DCOV     0.1214 (0.0287)    0.4240 (0.0922)    0.0864 (0.0795)
400   SIR      0.1248 (0.0310)    0.3118 (0.0544)    0.1581 (0.0391)
      PHD      0.3945 (0.1060)    0.6668 (0.0538)    0.5821 (0.1136)
      SAVE     0.1574 (0.0414)    0.4061 (0.1371)    0.1765 (0.0449)
      LAD      0.1192 (0.0296)    0.5087 (0.0466)    0.1313 (0.0339)
      EFM      0.1617 (0.2025)    0.3939 (0.2738)    0.2851 (0.2224)
      DCOV     0.0821 (0.0201)    0.4252 (0.0588)    0.0232 (0.0393)

* LAD does not work sometimes.
Table 3
Average estimation accuracy (Δ̄m) and its standard error (SE) for Model C. Entries are Δ̄m with SE in parentheses.

n     Method   Part (1)           Part (2)           Part (3)
100   SIR      0.9351 (0.0970)    0.7715 (0.1302)    0.4119 (0.1114)
      PHD      0.5555 (0.1475)    0.9389 (0.0734)    0.8361 (0.1308)
      SAVE     0.4706 (0.1716)    0.9407 (0.0670)    0.9331 (0.1076)
      LAD      0.3511 (0.2408)    0.8503 (0.1255)    0.5403 (0.1584)
      EFM      0.7189 (0.1663)    0.9367 (0.0730)    0.8431 (0.1164)
      DCOV     0.4867 (0.3451)    0.8185 (0.1343)    0.4014 (0.0927)
200   SIR      0.9478 (0.0734)    0.6482 (0.1437)    0.2910 (0.0706)
      PHD      0.4240 (0.1132)    0.9227 (0.0815)    0.7834 (0.1414)
      SAVE     0.2897 (0.0734)    0.9086 (0.0799)    0.8450 (0.1851)
      LAD      0.1713 (0.0634)    0.7572 (0.1366)    0.3248 (0.0824)
      EFM      0.7199 (0.1425)    0.9413 (0.0630)    0.8649 (0.1136)
      DCOV     0.2177 (0.2473)    0.6479 (0.1345)    0.2772 (0.0713)
400   SIR      0.9469 (0.0746)    0.5163 (0.1189)    0.2175 (0.0510)
      PHD      0.3156 (0.0830)    0.9034 (0.0903)    0.7107 (0.1558)
      SAVE     0.1975 (0.0463)    0.8842 (0.0992)    0.6396 (0.2659)
      LAD      0.1068 (0.0293)    0.5766 (0.1293)    0.2020 (0.0511)
      EFM      0.7101 (0.1589)    0.9361 (0.0685)    0.8316 (0.1245)
      DCOV     0.0927 (0.0243)    0.4739 (0.1105)    0.1704 (0.0508)
structure exists in the model, while they perform worse in parts (2) and (3), where the structure tends to be more linear. LAD sometimes does not work for discrete predictors, especially when the sample size is small, which may be due to singular matrices arising in the estimation procedure, which LAD cannot handle. DCOV outperforms the other dimension reduction methods consistently, except that EFM is better than DCOV under Model A most of the time. This may not be a surprise, as Model A is favored by the EFM method.

Model B. Model B is similar to the example in Cook and Nachtsheim [9, Example 2.5]. Predictors for part (2) and part (3) are generated as follows: part (2), x_1, x_2 ∼ exp(1), x_i ∼ N(0, 1), i = 3, . . . , 10; part (3), x_1, x_2 ∼ Binomial(10, 0.3), x_i ∼ N(0, 1), i = 3, . . . , 10. The mean function in this model is monotonic in the single index, so it favors SIR. Simulation results are listed in Table 2. As expected, SIR does a nice job, as do LAD and EFM. Again, LAD seems to have trouble with discrete predictors sometimes, for the same reason as mentioned above. DCOV has a better performance overall.

Model C. Model C is very similar to Example 3 in [33]: a nonlinear structure in the regression variance function. Predictors for part (2) and part (3) are generated as follows: part (2), x_1 ∼ N(0, 1), x_2 ∼ t(5), x_3 ∼ Ga(9, 0.5), x_4 ∼ F(5, 12), x_5 ∼ χ^2(13), x_i ∼ N(0, 1), i = 6, . . . , 10; part (3), x_i ∼ Poisson(1), i = 1, . . . , 5, x_i ∼ N(0, 1), i = 6, . . . , 10. Simulation results are listed in Table 3. In such a case, SAVE and LAD should perform well, as expected, when the predictors are normal. SIR also performs well when the predictors are discrete (part (3)), in which case the linear trend is again strong. EFM's performance seems
Table 4
Average estimation accuracy (Δ̄m) and its standard error (SE) for Model D. Entries are Δ̄m with SE in parentheses.

n     Method   Part (1)           Part (2)           Part (3)
100   SIR      0.4540 (0.1235)    0.6911 (0.1684)    0.3931 (0.1030)
      PHD      0.5628 (0.1201)    0.8819 (0.1318)    0.7587 (0.1750)
      SAVE     0.8790 (0.2006)    0.8920 (0.1169)    0.6719 (0.2860)
      LAD      *                  *                  *
      EFM      *                  *                  *
      DCOV     0.3920 (0.1218)    0.5873 (0.2416)    0.2799 (0.1000)
200   SIR      0.3073 (0.0750)    0.4777 (0.2275)    0.2986 (0.0708)
      PHD      0.4182 (0.1030)    0.6556 (0.2644)    0.5003 (0.1578)
      SAVE     0.7334 (0.3036)    0.8199 (0.2414)    0.5385 (0.3322)
      LAD      *                  *                  *
      EFM      *                  *                  *
      DCOV     0.2655 (0.0732)    0.5170 (0.2479)    0.1455 (0.0933)
400   SIR      0.2249 (0.0507)    0.3221 (0.2028)    0.2244 (0.0523)
      PHD      0.2959 (0.0722)    0.4374 (0.2622)    0.3293 (0.1079)
      SAVE     0.3442 (0.2646)    0.4823 (0.3325)    0.3012 (0.2420)
      LAD      *                  *                  *
      EFM      *                  *                  *
      DCOV     0.1725 (0.0487)    0.4873 (0.2491)    0.0401 (0.0584)

* LAD and EFM do not work sometimes.
Table 5
Average estimation accuracy (Δ̄m) and its standard error (SE) for Model E. Entries are Δ̄m with SE in parentheses.

n     Method   Part (1)           Part (2)           Part (3)
100   SIR      0.5791 (0.1399)    0.5262 (0.1256)    0.5437 (0.1209)
      PHD      0.8872 (0.1211)    0.9128 (0.1135)    0.9564 (0.0554)
      SAVE     0.8987 (0.1227)    0.8881 (0.1263)    0.8865 (0.1337)
      LAD      0.6519 (0.1835)    0.5940 (0.1729)    0.5929 (0.1479)
      EFM      0.7306 (0.1839)    0.6830 (0.1908)    0.6652 (0.1758)
      DCOV     0.5680 (0.1607)    0.4921 (0.1409)    0.5043 (0.1282)
200   SIR      0.4256 (0.1129)    0.4196 (0.0994)    0.4288 (0.1071)
      PHD      0.8329 (0.1730)    0.8732 (0.1495)    0.9548 (0.0615)
      SAVE     0.8567 (0.1406)    0.8007 (0.1817)    0.8092 (0.1679)
      LAD      0.4471 (0.1003)    0.4248 (0.1122)    0.4606 (0.0996)
      EFM      0.5107 (0.2101)    0.4294 (0.1757)    0.4970 (0.1429)
      DCOV     0.3997 (0.1073)    0.3470 (0.1133)    0.4280 (0.0513)
400   SIR      0.3373 (0.0888)    0.2965 (0.0702)    0.3220 (0.0726)
      PHD      0.6981 (0.2187)    0.7961 (0.1892)    0.9378 (0.0736)
      SAVE     0.7210 (0.1951)    0.6018 (0.2097)    0.6503 (0.2190)
      LAD      0.3233 (0.0852)    0.2842 (0.0721)    0.3386 (0.0683)
      EFM      0.2521 (0.0730)    0.2453 (0.0670)    0.3977 (0.0808)
      DCOV     0.2745 (0.0709)    0.2341 (0.0563)    0.4296 (0.0406)
consistently not good, which may be because the available information is in the variance function, while the mean function is mostly noise. Again, across sample sizes and different predictors, DCOV consistently has a better performance.

Model D. Model D is constructed so that the response is discrete. Predictors for part (2) and part (3) are generated as follows: part (2), x_i ∼ standard Cauchy, i = 1, . . . , 10; part (3), x_1 ∼ Poisson(1), x_2 ∼ Binomial(10, 0.3), x_i ∼ Poisson(1), i = 3, . . . , 10. The purpose of part (2) is to test the robustness of our method when asymptotic normality for large p may not hold. Simulation results are listed in Table 4. LAD fails sometimes for the same reason as mentioned before, and because of the discrete response: the discrete response leaves few observations in one of the classes, which causes a singular conditional matrix in the algorithm and prevents a proper estimate. EFM does not work in many cases under Model D, which may be because the link function is not smooth. DCOV has a better and more stable performance across the sample sizes and different predictors.

Model E. Model E is the same as the model in Example 2 simulated by Cui, Härdle and Zhu [11]. Predictors for part (2) and part (3) are generated as follows: part (2), x_i ∼ U(−2, 2), i = 1, . . . , 10; part (3), x_i ∼ Bernoulli(0.5), i = 1, . . . , 10. Part (2) is exactly the same as design (C) in [11]. Simulation results are listed in Table 5. Overall, DCOV still has top performance.

A reviewer pointed out that when the predictors are all discrete or categorical, the central subspace (CS) never exists; for details see Bierens and Hartog [1]. Nevertheless, we simulated models with many or all discrete predictors (part (3) of Models A–E), where the CS does not exist. However, we find that our method does estimate the directions that generated the models accurately. We do
Table 6
Accuracy of the variance formula for Model B.

n = 100
            η̂_1                 η̂_2                η̂_3                η̂_4                η̂_5
SE          0.0061              0.0071             0.0071             0.0081             0.0112
η_n         −0.6783             0.7138             0.0348             0.1479             −0.0434
SE(η_n)     0.1120              0.1188             0.1369             0.1384             0.1408
95% C.I.    (−0.8979, −0.4587)  (0.4809, 0.9466)   (−0.2336, 0.3032)  (−0.1235, 0.4129)  (−0.3194, 0.2325)

            η̂_6                 η̂_7                η̂_8                η̂_9                η̂_10
SE          0.0071              0.0086             0.0075             0.0073             0.0084
η_n         0.0252              0.0224             −0.0155            −0.0636            −0.0058
SE(η_n)     0.1548              0.1283             0.1080             0.1494             0.1364
95% C.I.    (−0.2782, 0.3286)   (−0.2291, 0.2739)  (−0.2273, 0.1962)  (−0.3565, 0.2292)  (−0.2731, 0.2615)

n = 200
            η̂_1                 η̂_2                η̂_3                η̂_4                η̂_5
SE          0.0016              0.0017             0.0027             0.0028             0.0029
η_n         −0.7124             0.6780             −0.0919            0.0207             0.0624
SE(η_n)     0.0873              0.0667             0.0948             0.0761             0.1023
95% C.I.    (−0.8835, −0.5413)  (0.5472, 0.8088)   (−0.2777, 0.0939)  (−0.1284, 0.1698)  (−0.1381, 0.2630)

            η̂_6                 η̂_7                η̂_8                η̂_9                η̂_10
SE          0.0024              0.0025             0.0025             0.0023             0.0029
η_n         −0.0206             −0.0093            −0.0273            0.1079             −0.0842
SE(η_n)     0.0925              0.0864             0.0987             0.0968             0.0892
95% C.I.    (−0.2020, 0.1607)   (−0.1786, 0.1601)  (−0.2207, 0.1660)  (−0.0818, 0.2975)  (−0.2591, 0.0906)

SE: the standard error of the variances for each estimated parameter. For one randomly generated data set, η_n is the estimated direction, SE(η_n) is its standard error, and 95% C.I. is the respective confidence interval.
not know why, but we believe that it may be because the distance covariance measure itself is a natural measure of dependence for categorical variables, and our algorithm does not involve any complicated smoothing techniques.

In summary, based on these simulations, we believe that DCOV has good and consistent performance, regardless of the models and the distributions of the predictors. This is perhaps because the DCOV method has two additional advantages: it does not require any specific model structure, and it does not place any strict conditions on the predictors. Our simulation results for part (2) with p = 10 indicate that the requirement of large p in Proposition 4 does not seem restrictive. Further simulations (not reported here) indicate that the results are still very reasonable when p is as small as 5. The simulation results in part (2) of Model D indicate that long-tailed predictors may also work for DCOV, demonstrating its robustness. Another striking feature of these simulations is that when the predictors are discretely distributed (part (3) of all models), DCOV seems to have consistently superb performance.

Accuracy of the variance formula. To check the asymptotic variance formula derived in Section 3.4, we report the results for Model B under normal predictors (part (1)) in Table 6, with sample sizes n = 100 (top part) and 200 (bottom part). We run 100 data replicates and calculate the sample variance for each data set with our variance formula, then calculate the standard error (SE) of these variances for each estimated parameter to assess the stability of our estimates. We also report, for one randomly generated data set, the estimated direction (η_n) of β_2 = (1, −1, 0, . . . , 0)^T, its standard errors (SE(η_n)), and the respective 95% confidence intervals. Our variance estimates appear quite stable (as shown in the first row) for both sample sizes. The 95% confidence intervals for one randomly generated data set with n = 100 already seem very reasonable (the last row), and they improve as the sample size increases from 100 to 200. We also checked the results for the other four models (not reported), which are very similar to those of Model B reported here.

4.2. The UCI Adult Data Set

In this section, we analyze the UCI Adult Data Set, which was originally used to explore the relationship between income and its possible determinants. The data were obtained from the Census Bureau database and are available at http://archive.ics.uci.edu/ml/datasets/Adult. We extract the first 3000 subjects and, after excluding a few observations with missing data, the data set has 2755 subjects with 15 variables. The variable "income exceeds $50,000/year" is the response variable, which is coded by the two values "0" and "1". The explanatory variables are:
• Sex (categorical): 1 = Male, 0 = Female.
• Native-country (categorical): 1 = United-States, 0 = others.
• Work-class (categorical): 1 = Federal-gov, 2 = Local-gov, 3 = Private, 4 = Self-emp-inc (self-employed, incorporated), 5 = Self-emp-not-inc (self-employed, not incorporated), 6 = State-gov.
• Marital-status (categorical): 1 = Divorced, 2 = Married-AF-spouse (married, armed forces spouse present), 3 = Married-civ-spouse (married, civilian spouse present), 4 = Married-spouse-absent [married, spouse absent (exc. separated)], 5 = Never-married, 6 = Separated, 7 = Widowed.
• Occupation (categorical): 1 = Adm-clerical (administrative support and clerical), 2 = Armed-Forces, 3 = Craft-repair, 4 = Exec-managerial (executive managerial), 5 = Farming–fishing, 6 = Handlers–cleaners, 7 = Machine-op-inspct
Table 7
Distances (Δm) between the DCOV estimate and other estimates for the Adult data, and their p-values.

             Distance (Δm)    p̂
DCOV–SIR     0.7288           0
DCOV–SAVE    0.9958           0.5718
DCOV–PHD     0.9827           0.2412
DCOV–LAD     *                *
DCOV–EFM     0.5777           0
DCOV–PS      0.6941           0

* LAD does not work for this data. The p-value equals the proportion of the 10,000 distances in the benchmark calculation that are less than or equal to the corresponding distance (Δm) in the table.
  (machine operator inspection), 8 = Other-service, 9 = Priv-house-serv (private household services), 10 = Prof-specialty (professional specialty), 11 = Protective-serv, 12 = Sales, 13 = Tech-support, 14 = Transport-moving.
• Relationship (categorical): 1 = Husband, 2 = Not-in-family, 3 = Other-relative, 4 = Own-child, 5 = Unmarried, 6 = Wife.
• Race (categorical): 1 = Amer-Indian-Eskimo, 2 = Asian-Pac-Islander, 3 = Black, 4 = Other, 5 = White.
• Age (integer): number of years of age, greater than or equal to 17.
• Fnlwgt (continuous): the final sampling weights on the CPS files, controlled to independent estimates of the civilian noninstitutional population of the United States.
• Education (ordinal): 1 = Preschool (less than first grade), 2 = 1st–4th, 3 = 5th–6th, 4 = 7th–8th, 5 = 9th, 6 = 10th, 7 = 11th, 8 = 12th (12th grade, no diploma), 9 = HS-grad (high school graduate, diploma or equiv.), 10 = Some-college (some college but no degree), 11 = Assoc-voc (associate degree, occupational/vocational), 12 = Assoc-acdm (associate degree, academic program), 13 = Bachelors, 14 = Masters, 15 = Prof-school (professional school), 16 = Doctorate.
• Education-num (continuous): number of years of education.
• Capital-gain (continuous): a profit that results from investments in a capital asset.
• Capital-loss (continuous): a loss that results from investments in a capital asset.
• Hours-per-week (continuous): usual number of hours worked per week.
Note that all the explanatory variables are categorical except the last seven. We use dummy variables for the categorical predictors. By doing so, we have 41 predictors, where the first 35 are class variables and the remaining ones are continuous. We follow Cui, Härdle and Zhu [11] in preparing the data set before analyzing it: first, we apply a logarithm transformation to the predictors X_37 = "fnlwgt", X_39 = "capital-gain" and X_40 = "capital-loss" to obtain log("fnlwgt"), log(1 + "capital-gain") and log(1 + "capital-loss"); second, we standardize each predictor individually to have mean 0 and variance 1; third, we drop the predictor "education". Cui, Härdle and Zhu [11] gave the reasons for these changes.

We apply our method DCOV, along with SIR [19], SAVE [10], PHD [20,5], LAD [7], EFM [11] and Yu and Ruppert's method [35], to analyze these data. We denote Yu and Ruppert's [35] method by "PS", because they used "PS" to denote the fitted single-index model in their paper. To compare different estimates, we adopt a benchmark criterion proposed by Li, Wen and Zhu [23]. The benchmark distance is the average of 10,000 values of Δm = ‖P_{β_1} − P_{β_2}‖, calculated by simulating 10,000 pairs of standard normal random vectors β_1 and β_2, where both β_1 and β_2 are in R^41 and β_1 ⫫ β_2. We then calculate the proportion (p̂) of the 10,000 distances in the benchmark calculation that are less than or equal to the distance between the two compared estimates (a short code sketch of this computation is given at the end of this subsection). Note that the benchmark distance, the average of the 10,000 values of Δm, is 0.9877. A smaller p̂ indicates that the two estimates agree with each other. These results are listed in Table 7.

The p-values of the distances between DCOV and SIR, DCOV and EFM, and DCOV and PS are 0, which indicates that the estimates of DCOV, SIR, EFM and PS are essentially very close, with the distance between DCOV and EFM being the smallest. However, the p-values of the distances between DCOV and SAVE and between DCOV and PHD are large, which means that the estimate of DCOV is not close to the estimates of SAVE and PHD. We understand that conditions such as linearity and constant variance, required for SIR, SAVE and PHD, may fail for these data. Nevertheless, regardless of conditions, SIR finds a linear trend while SAVE and PHD tend to recover nonlinear structure. Thus, it may not be strange to see SIR succeed in finding the direction while SAVE and PHD fail. Also, LAD does not work for this real data set, perhaps because there are too many categorical predictors and the response is discrete. With six continuous variables, EFM and PS seem to work well.

Having obtained the DCOV estimate η_n from the data, we fit a logistic regression model: P("income" > 50,000 | X) = exp{g(η_n^T X)} / [1 + exp{g(η_n^T X)}]. Fig. 1 shows the plot of g(η_n^T X) versus the predicted response probability, where g(z) = −1.15 + 0.32z − 0.35z^2 + 0.33z^3 − 0.05z^4 and z = η_n^T X is the estimated single index, with η_n shown in Table 8.
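As referenced above, here is a minimal sketch (not the authors' code) of the accuracy measure Δm and the benchmark criterion of Li, Wen and Zhu [23]; the dimension p = 41 and the 10,000 simulated pairs follow the description in the text.

```python
import numpy as np

def delta_m(B1, B2):
    """Distance between Span(B1) and Span(B2): largest singular value of P1 - P2."""
    P1 = B1 @ np.linalg.pinv(B1)                    # orthogonal projection onto Span(B1)
    P2 = B2 @ np.linalg.pinv(B2)
    return np.linalg.norm(P1 - P2, ord=2)           # spectral norm = max singular value

def benchmark(b_hat1, b_hat2, p=41, n_pairs=10000, seed=0):
    """Return the benchmark distance and p-hat for two estimated directions."""
    rng = np.random.default_rng(seed)
    observed = delta_m(b_hat1.reshape(-1, 1), b_hat2.reshape(-1, 1))
    dists = np.array([delta_m(rng.standard_normal((p, 1)),
                              rng.standard_normal((p, 1))) for _ in range(n_pairs)])
    return dists.mean(), np.mean(dists <= observed)  # benchmark distance, p-hat
```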
Table 8
Estimated direction (η_n) based on DCOV.

Variable                   η_n of DCOV
Sex                         0.1143
Native-country              0.0229
Work-class
  Federal-gov               0.0668
  Local-gov                −0.0299
  Private                  −0.0483
  Self-emp-inc              0.0980
  Self-emp-not-inc         −0.0116
Marital-status
  Divorced                 −0.1839
  Married-AF-spouse        −0.0212
  Married-civ-spouse        0.4982
  Married-spouse-absent    −0.0493
  Never-married            −0.3055
  Separated                −0.0841
Occupation
  Adm-clerical             −0.0775
  Armed-Forces             −0.0180
  Craft-repair             −0.0376
  Exec-managerial           0.1817
  Farming–fishing          −0.0799
  Handlers–cleaners        −0.0891
  Machine-op-inspct        −0.0596
  Other-service            −0.1072
  Priv-house-serv          −0.0282
  Prof-specialty            0.1441
  Protective-serv           0.0062
  Sales                     0.0363
  Tech-support              0.0284
Relationship
  Husband                   0.3670
  Not-in-family            −0.2591
  Other-relative           −0.0690
  Own-child                −0.1975
  Unmarried                −0.1660
Race
  Amer-Indian-Eskimo       −0.0411
  Asian-Pac-Islander        0.0268
  Black                    −0.0784
  Other                    −0.0247
Age                         0.1820
Fnlwgt                      0.0189
Education-num               0.2772
Capital-gain                0.2303
Capital-loss                0.1082
Hours-per-week              0.1803
5. Discussion

In this article, we propose a new method to estimate the direction in single-index models based on distance covariance. To our knowledge, the method requires the weakest conditions so far among methods for estimating the single index. It requires very mild conditions on X, and no particular assumption on Y|X or X|Y except the existence of the marginal moments. In terms of the estimation procedure, it requires no likelihood and no complicated smoothing techniques, and it has a very simple algorithm. Székely, Rizzo and Bakirov [29] empirically demonstrated that, for the power of detecting dependence, DCOV is quite close to the likelihood ratio test in the multivariate normal case, while DCOV is more powerful than the likelihood ratio test under non-normal and nonlinear structures. This characteristic of DCOV may make our method the most stable and consistent competitor among dimension reduction methods and single-index estimation methods. The reason is that many methods depend heavily on a particular structure of the model or on the correctness of the model specification, while DCOV does not rely much on these aspects. Indeed, although SIR, LAD or EFM occasionally perform slightly better than our approach (for example, SIR performs better in Model B part (2) and LAD performs better in Model C part (1)), our estimation procedure performs much better overall than SIR, LAD or EFM. We believe that this method is very useful because of its consistently superb performance in different set-ups. Although this paper focuses on single-index models, the idea can be extended to multiple-index models, which is currently under investigation.

Acknowledgments

The authors would like to thank the Editor, an Associate Editor, and two referees for their valuable comments, which greatly improved the paper. The authors also would like to thank Liliana Forzani for providing the code of the LAD method,
Fig. 1. Adult Data Set: plot of predicted response probability based on the single-index model.
Yan Yu for providing the code of the PS method, and Lixing Zhu for providing the code of the EFM method. Yin's research is supported in part by NSF grants DMS-0806120 and DMS-1205546.

Appendix A

Proof of Proposition 1. Let η_0 be the projection of β onto η, that is, η_0 = P_{η(Σ_X)} β = c η, where c is a scalar. Let η_0^⊥ = β − η_0, where the orthogonality "⊥" is with respect to the inner product induced by Σ_X. Then 1 = β^T Σ_X β = c^2 + η_0^{⊥,T} Σ_X η_0^⊥ ≥ c^2. Hence, |c| ≤ 1. Now, by (3),

    V^2(β^T X, Y) = ∫ |E e^{i⟨t, β^T X⟩ + i⟨s, Y⟩} − E e^{i⟨t, β^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = ∫ |E{e^{i⟨t, β^T X⟩} [E(e^{i⟨s, Y⟩} | X)]} − E e^{i⟨t, β^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = ∫ |E{e^{i⟨t, (η_0 + η_0^⊥)^T X⟩} [E(e^{i⟨s, Y⟩} | η_0^T X)]} − E e^{i⟨t, (η_0 + η_0^⊥)^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = ∫ |E e^{i⟨t, η_0^{⊥,T} X⟩} {E e^{i⟨t, η_0^T X⟩ + i⟨s, Y⟩} − E e^{i⟨t, η_0^T X⟩} E e^{i⟨s, Y⟩}}|^2 dw
      = ∫ |E e^{i⟨t, η_0^{⊥,T} X⟩}|^2 |E e^{i⟨t, η_0^T X⟩ + i⟨s, Y⟩} − E e^{i⟨t, η_0^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      ≤ ∫ |E e^{i⟨t, η_0^T X⟩ + i⟨s, Y⟩} − E e^{i⟨t, η_0^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = V^2(η_0^T X, Y) ≤ V^2(η^T X, Y).

The third equality follows from the assumption Y ⫫ X | η^T X and η_0 = c η, where |c| ≤ 1. The fourth equality follows from the assumption P_{η(Σ_X)}^T X ⫫ Q_{η(Σ_X)}^T X. The last inequality follows from the second property of Theorem 3 in [28] and the fact that |c| ≤ 1. The maximum is achieved by setting |c| = 1, which indicates Span(β) = Span(η), or in other words, β = η or β = −η.

Proof of Proposition 2. Let G = (I_p, 0); then η_n = G θ_n and η = G θ. The conclusion follows from Lemma 2 in the supplementary file (see Appendix B).
Proof of Proposition 3. Let G = (I_p, 0) be a p × (p + 1) matrix, where I_p is a p × p identity matrix. Then η_n = G θ_n and η = G θ. By Lemma 3 in the supplementary file (see Appendix B), we have √n(η_n − η) = √n G(θ_n − θ) →^D N(0, V_11), where V_11 is G V G^T.
Proof of Proposition 4. For β_p ∈ Unif(S^p), let β_p = a η_p + b η_p^⊥, where η_p, η_p^⊥ ∈ Unif(S^p), and a, b are scalars such that a^2 η_p^T Σ_{X_p} η_p + b^2 η_p^{⊥,T} Σ_{X_p} η_p^⊥ = β_p^T Σ_{X_p} β_p, where the orthogonality "⊥" is induced by the inner product with respect to Σ_X as before. Because β_p, η_p, and η_p^⊥ ∈ Unif(S^p), and η_p^T Σ_{X_p} η_p, η_p^{⊥,T} Σ_{X_p} η_p^⊥, and β_p^T Σ_{X_p} β_p ≥ 0, the maximum of a is achieved when b = 0. That is, β_p = a η_p and thus |a| = 1, because β_p and η_p ∈ Unif(S^p).

Under the conditions of the Proposition, exactly following the proof of Theorem 6.1 in [24, at the bottom of page 2158], we have that η_p^T X_p ⫫ η_p^{⊥,T} X_p | (η_p, η_p^⊥) w.i.p. as p → ∞. That is, (Y, η_p^T X, η_p^{⊥,T} X) | (η_p, η_p^⊥) converges w.i.p. to (Y*, U*, V*) in which Y* ⫫ V* | U*. By Definition 1 in Section 3.5 and Proposition 4.6 of [6], we have that V* ⫫ (Y*, U*). For convenience, for any function f we simply write E[f(β_p^T X_p, Y, η_p^T X_p | β_p, η_p)] as E[f(β_p^T X_p, Y, η_p^T X_p)]. Then

    V^2(β_p^T X, Y) = ∫ |E e^{i⟨t, β_p^T X⟩ + i⟨s, Y⟩} − E e^{i⟨t, β_p^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = ∫ |E{e^{i⟨t, β_p^T X⟩} [E(e^{i⟨s, Y⟩} | X)]} − E e^{i⟨t, β_p^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      = ∫ |E{e^{i⟨t, (a η_p + b η_p^⊥)^T X⟩} [E(e^{i⟨s, Y⟩} | η_p^T X)]} − E e^{i⟨t, (a η_p + b η_p^⊥)^T X⟩} E e^{i⟨s, Y⟩}|^2 dw
      → (w.i.p. as p → ∞) ∫ |E{e^{i⟨t, a U* + b V*⟩} [E(e^{i⟨s, Y*⟩} | U*)]} − E e^{i⟨t, a U* + b V*⟩} E e^{i⟨s, Y*⟩}|^2 dw
      = ∫ |E{e^{i⟨t, b V*⟩} [E(e^{i⟨t, a U*⟩ + i⟨s, Y*⟩} | U*)]} − E e^{i⟨t, b V*⟩} E e^{i⟨t, a U*⟩} E e^{i⟨s, Y*⟩}|^2 dw
      = ∫ |E e^{i⟨t, b V*⟩} {E e^{i⟨t, a U*⟩ + i⟨s, Y*⟩} − E e^{i⟨t, a U*⟩} E e^{i⟨s, Y*⟩}}|^2 dw
      = ∫ |E e^{i⟨t, b V*⟩}|^2 |E e^{i⟨t, a U*⟩ + i⟨s, Y*⟩} − E e^{i⟨t, a U*⟩} E e^{i⟨s, Y*⟩}|^2 dw
      ≤ ∫ |E e^{i⟨t, a U*⟩ + i⟨s, Y*⟩} − E e^{i⟨t, a U*⟩} E e^{i⟨s, Y*⟩}|^2 dw
      = V^2(a U*, Y*) = |a| V^2(U*, Y*)
      ← (w.i.p. as p → ∞) |a| V^2(η_p^T X, Y).

Therefore, w.i.p. as p → ∞, V^2(β_p^T X, Y) ≤ V^2(η_p^T X, Y), and the maximum is achieved when |a| = 1, that is, Span(β_p) = Span(η_p).

Appendix B. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.jmva.2013.07.003.

References

[1] H.J. Bierens, J. Hartog, Non-linear regression with discrete explanatory variables, with an application to the earnings function, Journal of Econometrics 38 (1988) 269–299.
[2] P. Boggs, J. Tolle, Sequential quadratic programming, Acta Numerica 4 (1995) 1–51.
[3] R.D. Cook, Using dimension-reduction subspaces to identify important inputs in models of physical systems, Proceedings in Physics, Engineering and Science Section (1994) 18–25.
[4] R.D. Cook, Graphics for regressions with a binary response, Journal of the American Statistical Association 91 (435) (1996) 983–992.
[5] R.D. Cook, Principal Hessian directions revisited, Journal of the American Statistical Association 93 (441) (1998) 84–94.
[6] R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics, Wiley, New York, 1998.
[7] R.D. Cook, L. Forzani, Likelihood-based sufficient dimension reduction, Journal of the American Statistical Association 104 (485) (2009) 197–208.
[8] R.D. Cook, L. Li, Dimension reduction in regressions with exponential family predictors, Journal of Computational and Graphical Statistics 18 (3) (2009) 774–791.
[9] R.D. Cook, C. Nachtsheim, Reweighting to achieve elliptically contoured covariates in regression, Journal of the American Statistical Association 89 (426) (1994) 592–599.
[10] R.D. Cook, S. Weisberg, Sliced inverse regression for dimension reduction: comment, Journal of the American Statistical Association 86 (414) (1991) 328–332.
[11] X. Cui, W. Härdle, L. Zhu, The EFM approach for single-index models, The Annals of Statistics 39 (3) (2011) 1658–1688.
[12] P. Diaconis, D. Freedman, Asymptotics of graphical projection pursuit, The Annals of Statistics 12 (3) (1984) 793–815.
[13] P.E. Gill, W. Murray, M.H. Wright, Practical Optimization, Academic Press, New York, 1981.
[14] P. Hall, K.-C. Li, On almost linearity of low dimensional projections from high dimensional data, The Annals of Statistics 21 (2) (1993) 867–889.
[15] W. Härdle, P. Hall, H. Ichimura, Optimal smoothing in single-index models, The Annals of Statistics 21 (1) (1993) 157–178.
[16] W. Härdle, T.M. Stoker, Investigating smooth multiple regression by the method of average derivatives, Journal of the American Statistical Association 84 (408) (1989) 986–995.
[17] J.L. Horowitz, W. Härdle, Direct semiparametric estimation of single-index models with discrete covariates, Journal of the American Statistical Association 91 (436) (1996) 1632–1640.
[18] H. Ichimura, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models, Journal of Econometrics 58 (1–2) (1993) 71–120.
[19] K.-C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association 86 (414) (1991) 316–327.
[20] K.-C. Li, On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma, Journal of the American Statistical Association 87 (420) (1992) 1025–1039.
[21] K.-C. Li, N. Duan, Regression analysis under link violation, The Annals of Statistics 17 (1989) 1009–1052.
[22] B. Li, S. Wang, On directional regression for dimension reduction, Journal of the American Statistical Association 102 (479) (2007) 997–1008.
[23] B. Li, S. Wen, L. Zhu, On a projective resampling method for dimension reduction with multivariate responses, Journal of the American Statistical Association 103 (483) (2008) 1177–1186.
[24] B. Li, X. Yin, On surrogate dimension reduction for measurement error regression: an invariance law, The Annals of Statistics 35 (5) (2007) 2143–2172.
[25] B. Li, H. Zha, F. Chiaromonte, Contour regression: a general approach to dimension reduction, The Annals of Statistics 33 (4) (2005) 1580–1616.
[26] Y. Ma, L. Zhu, A semiparametric approach to dimension reduction, Journal of the American Statistical Association 107 (497) (2012) 168–179.
[27] J.L. Powell, J.H. Stock, T.M. Stoker, Semiparametric estimation of index coefficients, Econometrica 57 (1989) 1403–1430.
[28] G.J. Székely, M.L. Rizzo, Brownian distance covariance, The Annals of Applied Statistics 3 (6) (2009) 1236–1265.
[29] G.J. Székely, M.L. Rizzo, N.K. Bakirov, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (6) (2007) 2769–2794.
[30] H. Wang, Y. Xia, Sliced regression for dimension reduction, Journal of the American Statistical Association 103 (482) (2008) 811–821.
[31] Y. Xia, A constructive approach to the estimation of dimension reduction directions, The Annals of Statistics 35 (2007) 2654–2690.
[32] Y. Xia, H. Tong, W.K. Li, L.-X. Zhu, An adaptive estimation of dimension reduction space, Journal of the Royal Statistical Society, Series B (Statistical Methodology) 64 (3) (2002) 363–410.
[33] X. Yin, R.D. Cook, Direction estimation in single-index regressions, Biometrika 92 (2) (2005) 371–384.
[34] X. Yin, B. Li, R.D. Cook, Successive direction extraction for estimating the central subspace in a multiple-index regression, Journal of Multivariate Analysis 99 (2008) 1733–1757.
[35] Y. Yu, D. Ruppert, Penalized spline estimation for partially linear single-index models, Journal of the American Statistical Association 97 (460) (2002) 1042–1054.
[36] P. Zeng, Y. Zhu, An integral transform method for estimating the central mean and central subspace, Journal of Multivariate Analysis 101 (2010) 271–290.
[37] Y. Zhu, P. Zeng, Fourier methods for estimating the central subspace and the central mean subspace in regression, Journal of the American Statistical Association 101 (476) (2006) 1638–1651.