Robust Feature Screening for Elliptical Copula Regression Model

Yong He^a, Liang Zhang^a, Jiadong Ji^{a,*}, Xinsheng Zhang^b

^a School of Statistics, Shandong University of Finance and Economics, Jinan, 250014, China
^b School of Management, Fudan University, Shanghai, 200433, China

^* Corresponding author. Email addresses: [email protected] (Yong He), [email protected] (Jiadong Ji)
Abstract

In this paper, we propose a flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which covers a large class of linear and nonlinear regression models, including the additive regression model and the linear transformation model. In addition, the ECR model can capture heavy tails and tail dependence between variables, so it is widely applicable in areas such as econometrics and finance. We focus here on the feature screening problem for the ECR model in the ultra-high dimensional setting. We propose a robust feature screening procedure for the ECR model in which two types of correlation coefficient are involved: Kendall's τ correlation and canonical correlation. Theoretical analysis shows that the procedure enjoys the sure screening property: with probability tending to 1, it selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size. Thorough numerical studies are conducted to illustrate its advantage over existing feature screening methods. Finally, the proposed procedure is applied to a gene-expression data set to show its empirical usefulness.

Keywords: Canonical correlation, Elliptical copula, Feature screening, Kendall's τ
2010 MSC: Primary 62G08, Secondary 62G35
1. Introduction

In the last decades, data sets of large dimensionality have arisen in various areas such as finance and chemistry, owing to the great development of computer storage capacity and processing power, and feature selection with such big data is of fundamental importance to many contemporary applications. The sparsity assumption is common in the high-dimensional feature selection literature: only a few variables are critical for model fitting and for forecasting a response of interest. Specifically, in the linear regression setting, statisticians care about how to select the important variables out of thousands or even millions of candidates. A huge amount of literature has sprung up since the appearance of the Lasso estimator [27]. To name a few: the SCAD [8], the adaptive Lasso [32], the MCP [30], the Dantzig selector [4], and the group Lasso [29]. This research area is very active, so the list of references here is illustrative rather than comprehensive.

The aforementioned feature selection methods perform well when the dimensionality is not "too" large, theoretically in the sense that it is of polynomial order of the sample size. However, when the dimensionality is of exponential order of the sample size, which is typically referred to as ultra-high dimension, these methods may encounter both theoretical and computational issues. For example, the Uniform Uncertainty Principle (UUP) condition guaranteeing the oracle property of the Dantzig selector may be difficult to satisfy, and the computational cost of the required linear programs increases dramatically. [9] first proposed Sure Independence Screening (SIS), together with its refinement Iterative Sure Independence Screening (ISIS), to alleviate the computational burden in the ultra-high dimensional setting. The basic idea is as follows. In the first step, one reduces the dimensionality to a moderate size relative to the sample size by sorting the marginal Pearson correlations between the covariates and the response, removing the covariates whose marginal correlation falls below a certain threshold. In the second stage, one applies the Lasso, SCAD, etc. to the variables that survive the first step. SIS (and ISIS) has the sure screening property under certain conditions: with probability tending to 1, the procedure selects all important variables. The last decade has witnessed plenty of variants of
SIS to handle ultra-high dimensionality in more general regression models. [7] proposed a feature screening procedure for ultra-high dimensional additive models. [10] proposed one for ultra-high dimensional varying coefficient models. [25] proposed censored rank independence screening for high-dimensional survival data, which is robust to predictors containing outliers and works well for a general class of survival models. [31] and [6] proposed model-free feature screening. [17] proposed to screen Kendall's τ correlations while [19] proposed to screen distance correlations, both of which are robust to heavy-tailed data. [18] also proposed a robust rank SIS (RSIS) which is robust to outliers. [16] proposed to screen the canonical correlations between the response and all possible sets of k variables, which performs particularly well at selecting variables that are jointly important with other variables but marginally insignificant. This list of references on screening methods is again illustrative rather than comprehensive; for the development of screening methods over the last decade, we refer to the review paper [22].

The main contribution of the paper is two-fold. On the one hand, we propose a very flexible semi-parametric regression model called the Elliptical Copula Regression (ECR) model, which can capture the heavy tails of variables and the tail dependence between variables. Specifically, the ECR model has the representation

    f0(Y) = Σ_{j=1}^p βj fj(Xj) + ε,    (1)
where Y is the response variable, X1, ..., Xp are predictors, ε is the error term, and the fj(·) are univariate monotone functions. We say that (Y, X^⊤)^⊤ = (Y, X1, ..., Xp)^⊤ satisfies an elliptical copula regression model if the marginally transformed random vector Z̃ = (Ỹ, X̃^⊤)^⊤ = (f0(Y), f1(X1), ..., fp(Xp))^⊤ follows an elliptical distribution. From representation (1), the ECR model covers a large class of linear and nonlinear regression models, such as the additive regression model and the linear transformation model, which makes it applicable in many areas such as econometrics, finance and bioinformatics.

On the other hand, we propose a robust dimension reduction procedure for the ECR model in the ultra-high dimensional setting. The robustness is achieved by combining two types of correlation: Kendall's τ correlation and canonical correlation. The canonical correlation is employed to capture the joint information within a set of covariates and the joint relationship between the response and this set of covariates. Note that for the ECR model in (1), only (Y, X^⊤)^⊤ is observable, not the transformed variables. Thus Kendall's τ is exploited to estimate the canonical correlations, owing to its invariance under strictly monotone marginal transformations. The dimension reduction procedure sorts the estimated canonical correlations and places into the active set every variable that contributes a relatively high canonical correlation at least once. The proposed screening procedure enjoys the sure screening property and reduces the dimensionality substantially to a moderate size under mild conditions. Numerical results show that the proposed approach performs much better than several existing feature screening methods.

We now introduce some notation used throughout the paper. For any vector μ = (μ1, ..., μd)^⊤ ∈ R^d, let μ_{−i} denote the (d−1) × 1 vector obtained by removing the i-th entry of μ; |μ|_0 = Σ_{i=1}^d I{μi ≠ 0}, |μ|_1 = Σ_{i=1}^d |μi|, |μ|_2 = (Σ_{i=1}^d μi²)^{1/2} and |μ|_∞ = max_i |μi|. For a matrix A = [a_{ij}] ∈ R^{d×d}, ||A||_{L1} = max_{1≤j≤d} Σ_{i=1}^d |a_{ij}|, ||A||_∞ = max_{i,j} |a_{ij}| and ||A||_1 = Σ_{i=1}^d Σ_{j=1}^d |a_{ij}|. For a matrix B = [b_{ij}] ∈ R^{q×d}, B_{i,−j} denotes the i-th row of B with its j-th entry removed, B_{−i,j} denotes the j-th column of B with its i-th entry removed, and B_{−i,−j} denotes the (q−1) × (d−1) matrix obtained by removing the i-th row and the j-th column of B. We use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of A, respectively. For a set H, |H| or Card(H) denotes the cardinality of H. For a real number x, ⌊x⌋ denotes the largest integer smaller than or equal to x. For two sequences of real numbers {a_n} and {b_n}, we write a_n = O(b_n) if there exists a constant C such that |a_n| ≤ C|b_n| for all n, a_n = o(b_n) if lim_{n→∞} a_n/b_n = 0, and a_n ≍ b_n if there exist constants c and C such that c ≤ a_n/b_n ≤ C for all n. In addition, a k-combination of a set S is a subset of k distinct elements of S.

The rest of the paper is organized as follows. In Section 2, we introduce the elliptical copula regression model and present the proposed dimension reduction procedure based on ranking estimated canonical correlations. In Section 3, we present the theoretical properties of the procedure, with detailed proofs collected in the Appendix. In Section 4, we conduct thorough numerical simulations to investigate its empirical performance.
In Section 5, a real gene-expression data example is given to illustrate its empirical usefulness. In the last section, we briefly discuss possible future research directions.
2. Methodology

2.1. Elliptical Copula Regression Model

To present the elliptical copula regression model, we first need to introduce the elliptical distribution. The elliptical distribution generalizes the multivariate normal distribution and includes symmetric distributions with heavy tails, such as the multivariate t-distribution. Elliptical distributions are commonly used in robust statistics to evaluate multivariate statistical procedures. Specifically, the definition is as follows.

Definition 2.1. (Elliptical distribution) Let μ ∈ R^p and Σ ∈ R^{p×p} with rank(Σ) = q ≤ p. A p-dimensional random vector Z = (Z1, ..., Zp)^⊤ is elliptically distributed, denoted by Z ∼ ED_p(μ, Σ, ζ), if it has the stochastic representation

    Z =_d μ + ζAU,

where U is a random vector uniformly distributed on the unit sphere S^{q−1} in R^q, ζ ≥ 0 is a scalar random variable independent of U, and A ∈ R^{p×q} is a deterministic matrix satisfying AA^⊤ = Σ, with Σ called the scatter matrix.
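To make the stochastic representation concrete, here is a minimal sketch in Python (our own illustration, not code from the paper; the function name sample_elliptical and the chi-distributed radius example are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_elliptical(n, mu, Sigma, zeta_sampler):
    """Draw n samples of Z = mu + zeta * A * U (Definition 2.1),
    where U is uniform on the unit sphere and A A^T = Sigma."""
    p = len(mu)
    A = np.linalg.cholesky(Sigma)        # any A with A A^T = Sigma works
    g = rng.standard_normal((n, p))
    U = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on S^{p-1}
    zeta = zeta_sampler(n)               # scalar radius, independent of U
    return mu + zeta[:, None] * U @ A.T

# Example: a chi_p radius (so E zeta^2 = p) recovers N(mu, Sigma);
# heavier-tailed radii give, e.g., multivariate t distributions.
p = 3
Sigma = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)
Z = sample_elliptical(10000, np.zeros(p), Sigma,
                      lambda n: np.sqrt(rng.chisquare(p, size=n)))
print(np.cov(Z, rowvar=False).round(2))  # approximately Sigma
```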
The representation Z =_d μ + ζAU is not identifiable, since ζ and A can be rescaled. We require Eζ² = q to make the model identifiable, which makes the covariance matrix of Z equal to Σ. In addition, we assume Σ is non-singular. In this paper we only consider continuous elliptical distributions with Pr(ζ = 0) = 0. Another equivalent definition of the elliptical distribution is through its characteristic function, which has the form exp(it^⊤μ)ψ(t^⊤Σt), where ψ(·) is a properly defined characteristic function and i = √−1; in fact, ζ and ψ mutually determine each other. It is well known that the marginal and conditional distributions of an elliptical distribution are elliptical as well. Besides, for Z ∼ ED_p(μ, Σ, ζ), the conditional expectation of Z1 given the remaining variables Z_{−1} is E(Z1 | Z_{−1}) = μ1 + Σ_{1,−1}(Σ_{−1,−1})^{−1}(Z_{−1} − μ_{−1}), a linear function of Z_{−1}. This property makes traditional linear regression interpretable in the elliptical context, where (Σ_{−1,−1})^{−1}Σ_{−1,1} can be viewed as the vector of regression coefficients.

Given the definition of the elliptical distribution, we are ready to introduce the Elliptical Copula Regression (ECR) model.

Definition 2.2. (Elliptical copula regression model) Let f = {f0, f1, ..., fp} be a set of monotone univariate functions and let Σ be a positive-definite correlation matrix with diag(Σ) = I. We say that a d-dimensional random vector Z = (Y, X1, ..., Xp)^⊤ satisfies the elliptical copula regression model if and only if Z̃ = (Ỹ, X̃^⊤)^⊤ = (f0(Y), f1(X1), ..., fp(Xp))^⊤ ∼ ED_d(0, Σ, ζ) with Eζ² = d and

    f0(Y) = Ỹ = X̃^⊤β + ε,  or equivalently,  f0(Y) = Σ_{j=1}^p βj fj(Xj) + ε,
where Y is the response, X = (X1, ..., Xp)^⊤ are covariates, ε is the error term and d = p + 1.

We require diag(Σ) = I in Definition 2.2 for identifiability, because shifting and scaling are absorbed into the marginal functions f. For ease of presentation, we write Z = (Y, X1, ..., Xp)^⊤ ∼ ECR(Σ, ζ, f) in the following sections. The ECR model allows the data to come from heavy-tailed distributions and is thus flexible enough to model many modern data sets, including financial data, genomics data and functional magnetic resonance imaging (fMRI) data.

Remark 2.1. As far as we know, little literature focuses on elliptical regression, especially in the high-dimensional setting. As [28] pointed out, the property that makes elliptical distributions attractive in the context of linear regression analysis is the linearity of conditional expectations. [11] and [3] considered the high-dimensional Gaussian copula regression model, which is more restrictive than the ECR model. In essence, the ECR model assumes that the response and covariates (Y, X^⊤)^⊤ jointly follow a meta-elliptical distribution [12, 13]; the flexibility of the ECR model is inherited from the meta-elliptical family to some extent.

Notice that the transformed variable Ỹ and the transformed covariates X̃ follow a linear regression model; however, the transformed variables are unobservable, and only Y and X are observed. One naive approach is to first estimate the marginal transformations fj and then apply existing screening methods to the transformed data; [13] provided a way to estimate the fj's. However, this naive method would accumulate the estimation errors of the fj's and is time-consuming due to the high dimensionality.
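As an illustration of Definition 2.2 (a hypothetical sketch under our own choice of marginals, f0 = log and fj(x) = x³; not the authors' code), one can generate latent elliptical data and expose only the back-transformed observables:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, nu = 200, 5, 5
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])

# Latent layer: (X_tilde, eps) jointly multivariate t_nu with identity
# scatter (a single chi-square denominator makes the joint law elliptical).
W = rng.standard_normal((n, p + 1)) / np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)
X_tilde, eps = W[:, :p], W[:, p]
Y_tilde = X_tilde @ beta + eps      # f0(Y) = sum_j beta_j f_j(X_j) + eps

# Observed layer: only Y = f0^{-1}(Y_tilde) and X_j = f_j^{-1}(X_tilde_j)
# are available; here f0 = log and f_j(x) = x^3 (both strictly monotone).
Y = np.exp(Y_tilde)
X = np.cbrt(X_tilde)
```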
In the following, by virtue of canonical correlation and Kendall's τ correlation, we present an adaptive screening procedure that avoids estimating the marginal transformation functions f while still capturing the joint information of a set of covariates and the joint relationship between the response and that set. Although the marginal transformations f are unknown, the proposed screening procedure is adaptive to these nuisance functions, which gives it the flexibility to handle real applications.

2.2. Adaptive Robust Screening for the ECR Model

In this section we present the adaptive robust screening procedure for the ECR model. We first introduce the Kendall's τ-based estimator of the correlation matrix in Section 2.2.1 and then the Kendall's τ-based estimator of the canonical correlation in Section 2.2.2, both of which are fundamental to the detailed screening procedure introduced in Section 2.2.3.

2.2.1. Kendall's τ-Based Estimator of the Correlation Matrix

Let Z1, ..., Zn be n independent observations with Zi = (Yi, Xi1, ..., Xip)^⊤. The sample Kendall's τ correlation of the j-th and k-th components is defined by

    τ̂_{j,k} = (2 / (n(n−1))) Σ_{1≤i<i′≤n} sign( (Z_{ij} − Z_{i′j})(Z_{ik} − Z_{i′k}) ).
Let Z̃i = (f0(Yi), f1(Xi1), ..., fp(Xip))^⊤ for i ∈ {1, ..., n}; the Z̃i can be viewed as latent observations from the elliptical distribution ED(0, Σ, ζ). We can estimate Σ_{j,k} (the (j,k)-th element of Σ) by Ŝ_{j,k}, where

    Ŝ_{j,k} = sin( π τ̂_{j,k} / 2 ).    (2)
This is because the Kendall’s τ correlation is invariant under strictly monotone marginal transformations and the b = [Sb j,k ]d×d with Sb j,k defined in fact that Σ j,k = sin(πτ j,k /2) holds for an Elliptical distribution [20]. Define by S b (2). We call S the rank-based estimator of the correlation matrix.
2.2.2. Kendall’s τ Based Estimator of Canonical Correlations Canonical correlations could capture the pairwise correlations within a subset of covariates and the joint regression relationship between the response and the subset of covariates. In this section, we present the Kendall’s τ based estimator of canonical correlation between the transformed response f0 (Y) and k (a fixed number) transformed covariates fm1 (Xm1 ), . . . , fmk (Xmk ) , where (m1 , . . . , mk ) is a k-combination of the set {1, . . . , p}. In this way, we bypass estimating the marginal transformation functions. Recall that for ECR model, ( f0 (Y), f1 (X1 ), . . . , f p (X p ))> ∼ ED(0, Σ, ζ) and its corresponding correlation matrix em1 , . . . , X emk } is exactly Σ = (Σ s,t ). Denote I = {1} and J = {m1 , . . . , mk }, the canonical correlation between Ye and {X is defined as a> ΣI×J b p ρc = sup √ > , a ΣI×I a b> ΣJ×J b a,b
where we define ΣI×J = (Σ s,t ) s∈I,t∈J . For ρc and J, we suppress their dependence on the parameter k for notational simplicity. It can be shown that > (ρc )2 = ΣI×J Σ−1 J×J ΣI×J .
b Thus the canonical correlation In Section 2.2.1 we present the Kendall’s τ based estimator of Σ and denote it by S. c ρ can be naturally estimated by: q bI×J S b−1 S b> . bc = S ρ J×J I×J
If Ŝ_{J×J} is not positive definite (hence not invertible), we first project Ŝ_{J×J} onto the cone of positive semidefinite matrices. In particular, we propose to solve the convex optimization problem

    S̃_{J×J} = arg min_{S ⪰ 0} ||Ŝ_{J×J} − S||_∞.

The matrix element-wise infinity norm ||·||_∞ is adopted for the sake of the subsequent technical developments. Empirically, we use a projection that computes a spectral decomposition of Ŝ_{J×J} and truncates all negative eigenvalues to zero. The numerical study in Section 4 shows that this procedure works well.
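The estimator ρ̂^c and the empirical projection step admit a compact sketch (the eigenvalue truncation stands in for the exact ||·||_∞ projection, and the small ridge term is our own numerical safeguard, not part of the paper's method):

```python
import numpy as np

def psd_project(S):
    """Empirical surrogate for the projection above: spectral decomposition
    with negative eigenvalues truncated to zero."""
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, 0.0)) @ V.T

def canonical_corr_hat(S_hat, J, ridge=1e-10):
    """rho_hat_c = sqrt(S_IJ S_JJ^{-1} S_IJ^T) for response index I = {0}
    and covariate index set J, built from the rank-based estimate S_hat."""
    S_IJ = S_hat[np.ix_([0], J)]                 # 1 x k row vector
    S_JJ = psd_project(S_hat[np.ix_(J, J)])
    S_JJ += ridge * np.eye(len(J))               # keep the solve well posed
    val = S_IJ @ np.linalg.solve(S_JJ, S_IJ.T)
    return float(np.sqrt(max(val[0, 0], 0.0)))
```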
2.2.3. Screening Procedure

In this section, we present the screening procedure based on sorting the canonical correlations estimated via Kendall's τ. The procedure goes as follows. First collect all sets of k transformed covariates obtained by picking k out of the p transformed covariates; there are C_p^k (the combinatorial number) such sets {X̃_{ℓ,m1}, ..., X̃_{ℓ,mk}}, ℓ = 1, ..., C_p^k. For each k-variable set {X̃_{ℓ,m1}, ..., X̃_{ℓ,mk}}, denote its canonical correlation with f0(Y) by ρ_ℓ^c and estimate it by

    ρ̂_ℓ^c = ( Ŝ_{I×J} Ŝ_{J×J}^{−1} Ŝ_{I×J}^⊤ )^{1/2},

where Ŝ is the rank-based estimator of the correlation matrix introduced in Section 2.2.1. We then sort the estimated canonical correlations {ρ̂_ℓ^c, ℓ = 1, ..., C_p^k} and place into the active set every variable that contributes a relatively large canonical correlation at least once. Specifically, let M∗ = {1 ≤ i ≤ p : βi ≠ 0} be the true model of size s = |M∗| and define, for i ∈ {1, ..., p},

    I_i^n = { ℓ : (Xi, X_{i1}, ..., X_{ik−1}) with max_{1≤m≤k−1} |i_m − i| ≤ k_n is used in calculating ρ̂_ℓ^c },
where k_n is a parameter determining a neighborhood within which variables are combined with Xi to calculate canonical correlations with the response. Finally we estimate the active set as

    M̂_{t_n} = { 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̂_ℓ^c > t_n },

where t_n is a threshold parameter that controls the size of the estimated active set. If we set k_n = p, all k-variable sets including Xi are considered in I_i^n. However, if there is a natural index for the covariates such that only neighboring covariates are related, which is often the case in portfolio tracking in finance, it is more appropriate to take k_n much smaller than p. As for the parameter k, a relatively large k may bring more accurate results but increases the computational burden; our simulations show that k = 2 already performs well and substantially better than k = 1, which amounts to sorting marginal correlations. A sketch of the resulting procedure is given below.
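The sketch reuses rank_correlation_matrix and canonical_corr_hat from the earlier snippets; the neighborhood rule and the top-⌊n/log n⌋ cutoff follow the text, while the helper names are ours:

```python
import numpy as np
from itertools import combinations
from math import floor, log

def cch_screen(Y, X, k=2, kn=2):
    """Score each covariate by max_{l in I_i^n} rho_hat_c_l and keep the
    top floor(n / log n) covariates, as in Section 2.2.3."""
    n, p = X.shape
    S_hat = rank_correlation_matrix(np.column_stack([Y, X]))
    score = np.zeros(p)
    for i in range(p):
        # I_i^n: k-variable sets containing X_i whose other members lie
        # within the index neighborhood |i_m - i| <= kn
        nbrs = [j for j in range(max(0, i - kn), min(p, i + kn + 1)) if j != i]
        for rest in combinations(nbrs, k - 1):
            J = [m + 1 for m in (i,) + rest]   # +1: the response sits first
            score[i] = max(score[i], canonical_corr_hat(S_hat, J))
    keep = floor(n / log(n))
    return np.argsort(score)[::-1][:keep], score
```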
3. Theoretical Properties

In this section, we present the theoretical properties of the proposed approach. In a screening problem, what we care about most is whether the true non-zero index set M∗ is contained in the estimated active set M̂_{t_n} with high probability for a properly chosen threshold t_n, i.e., whether the procedure has the sure screening property. To this end, we assume the following three assumptions hold.

Assumption 1. p > n and log p = O(n^ξ) for some ξ ∈ (0, 1−2κ), where κ is a constant satisfying 0 ≤ κ ≤ 1/2.

Assumption 2. For all ℓ ∈ {1, ..., C_p^k}, λ_max( (Σ_{J^ℓ×J^ℓ})^{−1} ) ≤ c0 for some constant c0, where J^ℓ = {m_1^ℓ, ..., m_k^ℓ} is the index set of the variables in the ℓ-th k-variable set.

Assumption 3. For some 0 ≤ κ ≤ 1/2, min_{i∈M∗} max_{ℓ∈I_i^n} ρ_ℓ^c ≥ c1 n^{−κ}.

Assumption 1 specifies the scaling between the dimensionality p and the sample size n. Assumption 2 requires that the minimum eigenvalue of the correlation matrix of any k covariates be bounded away from zero. Assumption 3 is the fundamental basis for the sure screening property; it means that every important variable is correlated with the response jointly with some other variables. Technically, Assumption 3 guarantees that an important variable is not masked by the statistical approximation error of the estimated canonical correlations.
Theorem 3.1. Assume that Assumptions 1–3 hold. Then for some positive constants c1∗ and C, as n goes to infinity, we have

    Pr( min_{i∈M∗} max_{ℓ∈I_i^n} ρ̂_ℓ^c ≥ c1∗ n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

and

    Pr( M∗ ⊂ M̂_{c1∗ n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).
Theorem 3.1 shows that, with a threshold of order c1∗ n^{−κ}, all important variables are selected with probability tending to 1. However, the constant c1∗ remains unknown. To refine the theoretical result, we further assume the following.

Assumption 4. For some 0 ≤ κ ≤ 1/2, max_{i∉M∗} max_{ℓ∈I_i^n} ρ_ℓ^c = o(n^{−κ}).

Assumption 4 requires that the canonical correlations contributed by unimportant variables be of strictly smaller order than those of the important ones, so that the two groups can be separated by a threshold of order n^{−κ}.

Theorem 3.2. Assume that Assumptions 1–4 hold. Then for some positive constants c1∗ and C, as n goes to infinity, we have

    Pr( |M̂_{c1∗ n^{−κ}}| = s ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

where s is the size of M∗.
Theorem 3.2 guarantees the exact sure screening property without any condition on k_n. Although the theorem guarantees the existence of c1∗ and C, it remains unknown how to select the constant c1∗ in practice. If we know in advance that s < n/log n, we can select a constant c∗ such that the size of M̂_{c∗ n^{−κ}} is approximately n; obviously, M̂_{c1∗ n^{−κ}} ⊂ M̂_{c∗ n^{−κ}} with probability tending to 1. The following theorem, which is particularly useful in practice, summarizes this discussion.

Theorem 3.3. Assume that Assumptions 1–4 hold. If s = |M∗| ≤ n/log n, then for any constant γ > 0,

    Pr( M∗ ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

where M_γ = { 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̂_ℓ^c is among the largest ⌊γn⌋ of max_{ℓ∈I_1^n} ρ̂_ℓ^c, ..., max_{ℓ∈I_p^n} ρ̂_ℓ^c }.
The above theorem guarantees that one can reduce the dimensionality to a moderate size relative to n while retaining the sure screening property, which in turn justifies applying a more sophisticated and computationally efficient variable selection method afterwards. Theorem 3.3 relies heavily on Assumption 4. If there is a natural order among the variables, and any important variable contributes to the response jointly with only its adjacent variables, then Assumption 4 can be removed entirely at the cost of a constraint on the parameter k_n. The following theorem summarizes this discussion.
Theorem 3.4. Assume Assumptions 1–3 hold, λ_max(Σ) ≤ c2 n^τ for some τ ≥ 0 and c2 > 0, and further assume k_n = c3 n^{τ∗} for some constants c3 > 0 and τ∗ ≥ 0. If 2κ + τ + τ∗ < 1, then there exists some θ ∈ [0, 1−2κ−τ−τ∗) such that for γ = c4 n^{−θ} with c4 > 0, we have for some constant C > 0,

    Pr( M∗ ⊂ M_γ ) ≥ 1 − O( exp(−C n^{1−2κ}) ).
The assumption k_n = c3 n^{τ∗} is reasonable in many fields such as biology and finance. An intuitive example arises in genomic association studies, where millions of genes tend to cluster and function together with adjacent genes. The procedure of ranking the estimated canonical correlations and reducing the dimension in one step from a large p to ⌊n/log n⌋ is a crude, greedy algorithm and may admit many spurious covariates because of the strong
correlations among them. Motivated by the ISIS method of [9], we propose a similar iterative procedure which achieves the sure screening property in multiple steps. The iterative procedure works as follows. Let the shrinking factor δ → 0 be chosen such that δ n^{1−2κ−τ−τ∗} → ∞ as n → ∞, and successively perform dimension reduction until the number of remaining variables drops below the sample size n. Specifically, define the subset

    M_1(δ) = { 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̂_ℓ^c is among the largest ⌊δp⌋ of all }.    (3)
In the first step we select the subset M_1(δ) of ⌊δp⌋ variables by Equation (3). In the next step, we start from the variables indexed by M_1(δ), apply a procedure similar to (3), and obtain a sub-model M_2(δ) ⊂ M_1(δ) of size ⌊δ²p⌋. Iterating these steps finally yields a sub-model M_k(δ) with size ⌊δ^k p⌋ < n; a sketch of this loop is given after the next theorem. For the iterative procedure, we have the following theoretical result.
Theorem 3.5. Assume that the conditions of Theorem 3.4 hold and let δ → 0 satisfy δ n^{1−2κ−τ−τ∗} → ∞ as n → ∞. Then

    Pr( M∗ ⊂ M_k(δ) ) ≥ 1 − O( exp(−C n^{1−2κ}) ).
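To make the iteration concrete, here is a minimal sketch (in practice the scores are recomputed on each surviving set; score_fn is a hypothetical callable, e.g. returning the cch_screen scores restricted to the survivors):

```python
import numpy as np

def iterative_screen(score_fn, p, n, delta):
    """Iterative reduction behind Theorem 3.5: repeatedly keep the top
    floor(delta * |active|) variables until fewer than n remain."""
    active = np.arange(p)
    while len(active) >= n:
        scores = score_fn(active)             # scores of current survivors
        m = int(np.floor(delta * len(active)))
        active = active[np.argsort(scores)[::-1][:m]]
    return active
```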
Theorem 3.5 guarantees the sure screening property of the iterative procedure, and the step size δ can be chosen in the same way as for ISIS in [9].

4. Simulation Study

In this section we conduct thorough numerical simulations to illustrate the empirical performance of the proposed robust screening procedure (denoted CCH). We compare it with three methods: the method proposed by [16] (denoted CCK), the rank correlation screening approach of [17] (denoted RRCS), and the original SIS procedure of [9]. To illustrate the robustness of the proposed procedure, we consider the following five models, which include linear regression with heavy-tailed covariates and error term, a linear transformation model with a heavy-tailed error term, an additive model, and a more general regression model. The settings of Models 1–3 fit the elliptical copula regression model exactly while, strictly speaking, those of Models 4 and 5 do not; we design Models 4 and 5 to show the robustness of the proposed procedure against violations of ellipticity.

Model 1, linear model setting: Yi = 0.9 + β1 Xi1 + ··· + βp Xip + εi, where β = (1, −0.5, 0, ..., 0)^⊤ with the last p−2 components being 0. The covariates and the error term (X, ε) are jointly sampled from the multivariate normal N(0, Σ) or the multivariate t with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ, where Σ = (Σ_{i,j}) ∈ R^{(p+1)×(p+1)}. The diagonal entries of Σ are 1, the off-diagonal entries are Σ_{i,j} = ρ for 1 ≤ i ≠ j ≤ p, and Σ_{i,(p+1)} = Σ_{(p+1),i} = 0 for 1 ≤ i ≤ p.
Model 2, another linear model setting: Yi = β1 Xi1 + ··· + βp Xip + εi, where β = (5, 5, 5, 0, ..., 0)^⊤ with the last p−3 components being 0. The covariates and the error term (X, ε) are jointly sampled from the multivariate normal N(0, Σ) or the multivariate t with 1 degree of freedom, noncentrality parameter 0 and scale matrix Σ, with Σ as in Model 1.

Model 3, linear transformation model setting: Let
    H(Y) = X^⊤β + ε.

We set H(Y) = log(Y), which corresponds to a special case of the Box–Cox transformation. The regression coefficients are β = (3, 1.5, 2, 0, ..., 0)^⊤ with the last p−3 components being 0. The covariates and the error term (X, ε) are jointly sampled from the multivariate normal N(0, Σ) or the multivariate t with 3 degrees of freedom, noncentrality parameter 0 and scale matrix Σ, with Σ as in Model 1.

Model 4, additive model from [23]: Let
    Yi = 5 f1(X1i) + 3 f2(X2i) + 4 f3(X3i) + 6 f4(X4i) + εi,
Table 1. The proportions of the active sets selected by different feature screening procedures that contain all true variables in Model 1, based on 500 replications. The left part presents the results for (X, ε) from the multivariate normal and the right part those for (X, ε) from the multivariate t with 1 degree of freedom.

                    Multivariate Normal                Multivariate t (1 df)
(p, n)     Method   ρ=0     ρ=0.1   ρ=0.5   ρ=0.9     ρ=0     ρ=0.1   ρ=0.5   ρ=0.9
(100,20)   RRCS     0.004   0.002   0       0.014     0.002   0.004   0.012   0.006
           SIS      0.004   0.004   0.006   0.008     0.006   0.006   0.006   0.004
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.724   0.672   0.430   0.070     0.334   0.298   0.224   0.042
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.688   0.648   0.386   0.092     0.342   0.346   0.218   0.062
(500,20)   RRCS     0       0.002   0       0         0.002   0.002   0       0
           SIS      0       0       0       0         0.002   0.002   0.002   0
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.448   0.424   0.248   0.026     0.192   0.196   0.106   0.022
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.412   0.362   0.188   0.026     0.194   0.180   0.104   0.016
(100,50)   RRCS     0       0       0.002   0.010     0.004   0.004   0.010   0.020
           SIS      0       0       0.002   0.014     0.020   0.018   0.034   0.028
           CCH1     1       1       1       1         1       1       1       1
           CCK1     1       1       0.934   0.374     0.518   0.488   0.372   0.136
           CCH2     1       1       1       1         1       1       1       1
           CCK2     1       0.998   0.934   0.348     0.566   0.536   0.410   0.136
(500,50)   RRCS     0       0       0       0         0       0       0       0
           SIS      0       0       0       0         0       0.002   0.002   0.002
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.988   0.984   0.842   0.190     0.358   0.320   0.218   0.046
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.984   0.972   0.826   0.166     0.392   0.380   0.236   0.052
with f1(x) = x, f2(x) = (2x − 1)², f3(x) = sin(2πx) / (2 − sin(2πx)), and

    f4(x) = 0.1 sin(2πx) + 0.2 cos(2πx) + 0.3 sin²(2πx) + 0.4 cos³(2πx) + 0.5 sin³(2πx).

The covariates X = (X1, ..., Xp)^⊤ are generated, for j ∈ {1, ..., p}, by

    Xj = (Wj + tU) / (1 + t),

where W1, ..., Wp and U are i.i.d. Uniform[0,1]. For t = 0 this is the independent uniform case, while t = 1 corresponds to a design with correlation 0.5 between all covariates. The error terms are sampled from N(0, 1.74). The regression coefficients in this model are obviously β = (5, 3, 4, 6, 0, ..., 0)^⊤ with the last p−4 components being 0.

Model 5, a model generated by combining Model 3 and Model 4: Let

    H(Yi) = 5 f1(X1i) + 3 f2(X2i) + 4 f3(X3i) + 6 f4(X4i) + εi,
Table 2. The proportions of the active sets selected by different feature screening procedures that contain all true variables in Model 2, based on 500 replications. The left part presents the results for (X, ε) from the multivariate normal and the right part those for (X, ε) from the multivariate t with 1 degree of freedom.

                    Multivariate Normal                Multivariate t (1 df)
(p, n)     Method   ρ=0     ρ=0.1   ρ=0.5   ρ=0.9     ρ=0     ρ=0.1   ρ=0.5   ρ=0.9
(100,20)   RRCS     0.504   0.388   0.154   0.074     0.176   0.128   0.048   0.030
           SIS      0.584   0.490   0.280   0.228     0.014   0.018   0.016   0.026
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.994   0.988   0.986   0.976     0.432   0.492   0.506   0.444
           CCH2     1       1       1       1         1       1       1       1
           CCK2     1       1       1       1         0.998   1       0.994   0.918
(500,20)   RRCS     0.186   0.100   0.038   0.012     0.040   0.018   0.006   0.004
           SIS      0.242   0.150   0.066   0.056     0.004   0.004   0.004   0.002
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.978   0.986   0.974   0.926     0.260   0.342   0.344   0.270
           CCH2     1       1       1       1         1       1       1       1
           CCK2     1       1       1       1         0.992   0.994   0.992   0.868
(100,50)   RRCS     0.994   0.978   0.874   0.758     0.914   0.840   0.478   0.266
           SIS      0.998   0.988   0.958   0.932     0.104   0.096   0.082   0.090
           CCH1     1       1       1       1         1       1       1       1
           CCK1     1       1       1       1         0.682   0.708   0.716   0.686
           CCH2     1       1       1       1         1       1       1       1
           CCK2     1       1       1       1         0.996   1       0.996   0.962
(500,50)   RRCS     0.948   0.886   0.630   0.376     0.660   0.472   0.162   0.070
           SIS      0.968   0.944   0.792   0.678     0.022   0.022   0.018   0.020
           CCH1     1       1       1       1         1       1       1       1
           CCK1     1       1       1       1         0.492   0.556   0.528   0.440
           CCH2     1       1       1       1         1       1       1       1
           CCK2     1       1       1       1         0.998   0.998   0.996   0.942
where H(Y) is the same as in Model 3 and the functions {f1, f2, f3, f4} are the same as in Model 4. The covariates X = (X1, ..., Xp)^⊤ are generated in the same way as in Model 4, and the error terms are sampled from N(0, 1.74).
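For reproducibility, here is a sketch of the Model 4 data-generating process (variable names are ours; N(0, 1.74) is read as variance 1.74, an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(3)

def model4_data(n, p, t):
    """Covariates X_j = (W_j + t*U) / (1 + t) with W_j, U iid Uniform[0,1]
    (t=0: independent; t=1: pairwise correlation 0.5), and the additive
    Model 4 response with N(0, 1.74) noise."""
    W = rng.uniform(size=(n, p))
    U = rng.uniform(size=(n, 1))
    X = (W + t * U) / (1 + t)
    f1 = X[:, 0]
    f2 = (2 * X[:, 1] - 1) ** 2
    f3 = np.sin(2 * np.pi * X[:, 2]) / (2 - np.sin(2 * np.pi * X[:, 2]))
    s, c = np.sin(2 * np.pi * X[:, 3]), np.cos(2 * np.pi * X[:, 3])
    f4 = 0.1 * s + 0.2 * c + 0.3 * s**2 + 0.4 * c**3 + 0.5 * s**3
    Y = 5 * f1 + 3 * f2 + 4 * f3 + 6 * f4 + rng.normal(0, np.sqrt(1.74), n)
    return X, Y
```

Model 5 data follow by applying H^{-1} to the right-hand side before returning Y.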
For the models involving ρ, we take ρ ∈ {0, 0.1, 0.5, 0.9}. For all models, we consider four combinations of (n, p): (20, 100), (50, 100), (20, 500) and (50, 500). All simulation results are based on 500 replications. We evaluate the performance of the different screening procedures by the proportion of the 500 replications in which the true model is included in the selected active set. To guarantee a fair comparison, every procedure retains the variables whose scores rank among the largest ⌊n/log n⌋ values. For our method CCH and the method CCK of [16], two parameters k_n and k are involved. The simulations show that when k_n is small, the performances for different combinations of (k_n, k) are quite similar, so we only present the results for (k_n, k) = (2, 2) and (2, 3), denoted CCH1 and CCH2 for our method and CCK1 and CCK2 for the method of [16].

From the simulation results, we can see that the proposed CCH method detects the true model much more accurately than the SIS, RRCS and CCK methods in almost all cases. Specifically, for the motivating Model 1 of [16], Table 1 shows that when the correlations among the covariates become large, the SIS, RRCS and CCK methods all deteriorate (the proportion containing the true model drops sharply), whereas the proposed CCH procedure is robust to these correlations and detects the true model in every replication. Besides, with the heavy-tailed error term following t(1), the SIS, RRCS and CCK methods perform very poorly while CCH still works very well. For Model 2, Table 2 shows that when the covariates and the error term are multivariate normal, all methods work well once the sample size is relatively large, with CCK and CCH requiring smaller sample sizes than RRCS and
Table 3. The proportions of the active sets selected by different feature screening procedures that contain all true variables in Model 3, based on 500 replications. The left part presents the results for (X, ε) from the multivariate normal and the right part those for (X, ε) from the multivariate t with 3 degrees of freedom.

                    Multivariate Normal                Multivariate t (3 df)
(p, n)     Method   ρ=0     ρ=0.1   ρ=0.5   ρ=0.9     ρ=0     ρ=0.1   ρ=0.5   ρ=0.9
(100,20)   RRCS     0.284   0.208   0.122   0.054     0.210   0.174   0.082   0.040
           SIS      0.018   0.004   0.006   0         0.012   0.012   0       0
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.152   0.172   0.072   0.018     0.062   0.058   0.042   0.008
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.282   0.254   0.098   0.054     0.106   0.090   0.058   0.022
(500,20)   RRCS     0.068   0.032   0.014   0.006     0.034   0.024   0.010   0.002
           SIS      0       0       0       0         0       0       0       0
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.038   0.038   0.010   0.004     0.012   0.010   0.010   0.002
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.100   0.056   0.020   0.016     0.016   0.026   0.016   0.010
(100,50)   RRCS     0.878   0.852   0.656   0.454     0.812   0.746   0.526   0.340
           SIS      0.266   0.172   0.040   0.002     0.064   0.036   0.010   0.004
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.696   0.596   0.272   0.066     0.214   0.186   0.088   0.020
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.856   0.740   0.324   0.088     0.302   0.248   0.110   0.040
(500,50)   RRCS     0.646   0.558   0.344   0.138     0.504   0.426   0.188   0.070
           SIS      0.020   0.014   0.004   0         0.004   0.002   0       0
           CCH1     1       1       1       1         1       1       1       1
           CCK1     0.324   0.276   0.062   0.008     0.050   0.046   0.026   0
           CCH2     1       1       1       1         1       1       1       1
           CCK2     0.548   0.436   0.074   0.010     0.086   0.072   0.034   0.010
SIS. If the error term follows t(1), the SIS, RRCS and CCK methods perform poorly, especially when the ratio p/n is large; in contrast, the CCH approach still performs very well. We note that RRCS also shows a certain robustness, and that CCK2 is slightly better than CCK1 because the important covariates are indexed by three consecutive integers.

CCH's advantage over CCK is mainly illustrated by the results for Models 3–5. Model 3 is an example of the linear transformation model, Model 4 an example of the additive model, and Model 5 an example of a more complex nonlinear regression model. The CCK approach relies heavily on the linear regression assumption, while CCH is more widely applicable. For the linear transformation model, Table 3 shows that CCK performs badly, especially when the ratio p/n is large. The RRCS approach ranks Kendall's τ correlations, which are invariant to monotone transformations, so it exhibits robustness for Model 3, but it still performs much worse than CCH. For the additive model and Model 5, Table 4 supports conclusions similar to those drawn for Model 3.

It is worth mentioning that although Definition 2.2 requires the marginal transformation functions to be monotone for the theoretical analysis, the simulations show that the proposed screening procedure is not sensitive to this requirement and performs well even when the transformations are not monotone; indeed, the marginal transformations f2, f3, f4 in Models 4 and 5 are all non-monotone. Thus the proposed CCH procedure performs very well not only for heavy-tailed error terms but also for various unknown transformation functions, which demonstrates its robustness. In practice, CCH can therefore be used as a safe replacement for CCK, RRCS or SIS.
Table 4. The proportions of the active sets selected by different feature screening procedures that contain all true variables in Models 4 and 5, based on 500 replications. The left part presents the results for Model 4 with t ∈ {0, 0.5, 1} and the right part those for Model 5 with t ∈ {0, 0.5, 1}.

                    Model 4                   Model 5
(p, n)     Method   t=0     t=0.5   t=1       t=0     t=0.5   t=1
(100,20)   RRCS     0.038   0.046   0.132     0.038   0.046   0.132
           SIS      0.048   0.056   0.142     0.002   0.012   0.032
           CCH1     0.938   0.968   0.938     0.938   0.968   0.938
           CCK1     0.380   0.356   0.584     0.142   0.146   0.300
           CCH2     0.978   0.956   0.938     0.978   0.956   0.938
           CCK2     0.506   0.458   0.682     0.166   0.144   0.302
(100,50)   RRCS     0.420   0.352   0.504     0.420   0.352   0.504
           SIS      0.512   0.406   0.496     0.100   0.110   0.182
           CCH1     1       1       1         1       1       1
           CCK1     0.922   0.938   0.990     0.690   0.614   0.836
           CCH2     1       1       1         1       1       1
           CCK2     0.988   0.990   0.996     0.674   0.588   0.810
(500,20)   RRCS     0.002   0.004   0.020     0.002   0.004   0.020
           SIS      0.002   0.008   0.026     0       0       0.008
           CCH1     0.804   0.884   0.812     0.804   0.884   0.812
           CCK1     0.168   0.176   0.354     0.038   0.068   0.168
           CCH2     0.858   0.890   0.842     0.858   0.890   0.842
           CCK2     0.222   0.224   0.454     0.052   0.086   0.178
(500,50)   RRCS     0.098   0.092   0.228     0.098   0.092   0.228
           SIS      0.158   0.140   0.222     0.002   0.026   0.036
           CCH1     1       1       1         1       1       1
           CCK1     0.794   0.856   0.984     0.310   0.344   0.642
           CCH2     1       1       1         1       1       1
           CCK2     0.930   0.956   0.998     0.350   0.344   0.622
5. Real Example

In this section we apply the proposed procedure to a gene expression data set from an eQTL experiment in rat eye reported in [24]. The data set has been analyzed by [7, 14, 26] and can be downloaded from the Gene Expression Omnibus at accession number GSE5680. For this data set, 120 twelve-week-old male rats were selected for harvesting of eye tissue and subsequent microarray analysis. The microarrays used to analyze the RNA from the rats' eyes contain 31,042 different probe sets (Affymetrix GeneChip Rat Genome 230 2.0 Array). The intensity values were normalized using the robust multi-chip averaging method [2, 15] to obtain summary expression values for each probe set, and gene expression levels were analyzed on a logarithmic scale.

We are interested in finding the genes correlated with the gene TRIM32, which was found to cause Bardet–Biedl syndrome [5], a genetically heterogeneous disease of multiple organ systems, including the retina. Of the more than 31,000 gene probes, including more than 28,000 rat genes represented on the Affymetrix expression microarray, only 18,976 exhibited sufficient signal for reliable analysis and at least 2-fold variation in expression among the 120 male rats generated from an SR/JrHsd × SHRSP intercross. The probe for TRIM32 is 1389163_at, one of the 18,976 probes that are sufficiently expressed and variable. The sample size is thus n = 120 and the number of candidate probes is 18,975. It is expected that only a few genes are related to TRIM32, so this is a sparse high-dimensional regression problem. Direct application of the proposed approach to the whole data set is slow, so we select the 500 probes with the largest variances among the 18,975 probes. [14] proposed a nonparametric additive model to capture the relationship between the expression of TRIM32 and candidate genes and found that most of the plots of the estimated
Fig. 1. Boxplots of the ranks of the first 20 genes ordered by r̂_j^U. The top panels present the results of CCH1, CCH2 and RRCS, respectively; the bottom panels present the results of CCK1, CCK2 and SIS, respectively. The probes printed in blue in the top left and top middle panels are the three most influential probes identified by the CCH procedures. The probe 1398594_at, in red, is detected as influential by the CCH, RRCS and SIS procedures.
additive components are highly nonlinear, confirming the need to take nonlinearity into account. The elliptical copula regression (ECR) model can also capture nonlinear relationships, so it is reasonable to apply the proposed robust dimension reduction procedure to this data set.

For the real data example, we compare the genes selected by the procedures introduced in the simulation study: SIS [9], RRCS [17], CCK [16] and the proposed CCH. To detect influential genes, we adopt a bootstrap procedure similar to [16, 17]. We denote the correlation coefficients calculated using SIS, RRCS, CCK and CCH by ρ̃^sis, ρ̃^rrcs, ρ̃^cck and ρ̃^cch, respectively. The detailed algorithm is presented in Algorithm 1.

The boxplots of the ranks of influential genes are shown in Fig. 1, from which we can see that the proposed CCH procedure selects three very influential genes, 1373349_at, 1368887_at and 1382291_at (shown in blue in Fig. 1), which are not detected as influential by the other feature screening methods. The reason
Algorithm 1. A bootstrap procedure to obtain influential genes.

Input: D = {(Xi, Yi), i ∈ {1, ..., n}}.
Output: Indices of influential genes.

1: From the data set {(Xi, Yi), i ∈ {1, ..., n}}, calculate the correlation coefficients ρ̃_i^sis, ρ̃_i^rrcs, ρ̃_i^cck and ρ̃_i^cch and order them as ρ̃_(ĵ1) ≥ ρ̃_(ĵ2) ≥ ··· ≥ ρ̃_(ĵp), where ρ̃ stands for ρ̃^sis, ρ̃^rrcs, ρ̃^cck or ρ̃^cch, so the set {ĵ1, ..., ĵp} varies with the screening procedure. We write ĵ1 ≻ ··· ≻ ĵp for the resulting empirical ranking of the component indices of X based on their contributions to the response: s ≻ t indicates ρ̃_(s) ≥ ρ̃_(t), informally interpreted as "the s-th component of X has at least as much influence on the response as the t-th component". The rank r̂_j of the j-th component is defined as the value r such that ĵ_r = j.
2: For each 1 ≤ i ≤ p, employ the SIS, RRCS, CCK and CCH procedures to calculate ρ̃_i^b from the b-th bootstrap sample, b = 1, ..., 200.
3: Order ρ̃_1^b, ..., ρ̃_p^b as ĵ_1^b ≻ ··· ≻ ĵ_p^b and calculate the corresponding rank r̂_j^b of the j-th component of X.
4: Given α = 0.05, compute the (1−α)-level, two-sided, equally tailed interval for the rank of the j-th component, i.e., an interval [r̂_j^L, r̂_j^U] with Pr(r̂_j^b ≤ r̂_j^L | D) ≈ Pr(r̂_j^b ≥ r̂_j^U | D) ≈ α/2.
5: Treat a variable as influential if r̂_j^U is ranked among the top 20 positions.
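Steps 2–5 of Algorithm 1 can be sketched as follows (score_fn is a placeholder for any of the four screening scores; the quantile rule implements the equally tailed interval):

```python
import numpy as np

def bootstrap_ranks(X, Y, score_fn, B=200, alpha=0.05, top=20, seed=0):
    """Algorithm 1 as a sketch: score_fn(X, Y) returns a p-vector of
    screening scores; a variable is flagged when the upper quantile of
    its bootstrap rank falls within the top positions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ranks = np.empty((B, p))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        order = np.argsort(score_fn(X[idx], Y[idx]))[::-1]
        ranks[b, order] = np.arange(1, p + 1)     # rank of each variable
    r_upper = np.quantile(ranks, 1 - alpha / 2, axis=0)   # r_hat_j^U
    return np.where(r_upper <= top)[0]
```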
Fig. 2. Three-dimensional plots of generalized additive model (GAM) fits. The left panel presents the result of treating TRIM32 as response with 1373349_at, 1368887_at as covariates; the middle panel treats TRIM32 as response with 1373349_at, 1382291_at as covariates; the right panel treats TRIM32 as response with 1382291_at, 1368887_at as covariates.
we select these three influential genes is that there is a strongly nonlinear relationship between the response and the combination of the three covariate genes; Fig. 2 illustrates these findings. Besides, gene 1398594_at is detected as influential by both the CCH and RRCS procedures and is highlighted in red in Fig. 1. A scatter plot shows that the nonlinearity between gene 1398594_at and the TRIM32 gene is obvious, and both the CCH and RRCS procedures can capture this nonlinear relationship. In fact, the gene symbol of 1373349_at is LOC683313, whose gene ontology molecular function is "structural molecule activity". The gene symbol of 1368887_at is Cdh22, whose gene ontology molecular functions are "calcium ion binding" and "metal ion binding", which may play a role in the growth of neuroendocrine organs. The above findings are based on a statistical analysis, which may help reveal underlying disease mechanisms; they still need to be validated by laboratory experiments. The screening procedure helps by narrowing the number of research targets down to a few top-ranked genes from the 500 candidates.
6. Discussion

We propose a very flexible semi-parametric ECR model and consider the variable selection problem for the ECR model in the ultra-high dimensional setting. We propose a robust feature screening procedure for the ECR model, and theoretical analysis shows that it enjoys the sure screening property: with probability tending to 1, the procedure selects all important variables and substantially reduces the dimensionality to a moderate size relative to the sample size. We set k_n to a small value, which performs well as long as there is a natural index for the covariates under which neighboring covariates are correlated. If no natural index is available a priori, one can cluster the variables statistically before screening; the performance of the screening procedure would then rely heavily on the clustering, which we leave as a future research topic. One referee also pointed out that allowing k to diverge with n is an interesting problem; for the theoretical analysis, tight bounds on the operator norm of Ŝ, such as those provided in [1], could be useful. We also leave this problem as future work, since much remains to be done.

Appendix

In this appendix, we present the detailed proofs of the main theorems. We first give a useful lemma which establishes an elementwise estimation error bound for the rank-based estimator Ŝ and is used repeatedly in the proofs.

Lemma 1. Let Ŝ_{s,t} = sin(π τ̂_{s,t} / 2). For any c > 0 and some positive constant C > 0,

    Pr( |Ŝ_{s,t} − Σ_{s,t}| ≥ c n^{−κ} ) = O( exp(−C n^{1−2κ}) ).

Proof: By an argument similar to the proof of Theorem 4.2 in [21], we obtain

    Pr( |τ̂_{s,t} − τ_{s,t}| ≥ (2/π) t ) ≤ 2 exp( −(2/π²) n t² ).

Taking t = c n^{−κ}, we then have

    Pr( |Ŝ_{s,t} − Σ_{s,t}| ≥ c n^{−κ} ) ≤ 2 exp( −(2c²/π²) n^{1−2κ} ),

which concludes the lemma.
Proof of Theorem 3.1. First, by the definition of the canonical correlation, (ρ_ℓ^c)² = Σ_{I×J^ℓ} Σ_{J^ℓ×J^ℓ}^{−1} Σ_{I×J^ℓ}^⊤. By Assumption 2, λ_max( (Σ_{J^ℓ×J^ℓ})^{−1} ) ≤ c0, and since Σ_{I×J^ℓ} = (Σ_{1,m_1^ℓ}, ..., Σ_{1,m_k^ℓ}) is a row vector, we have

    (ρ_ℓ^c)² ≤ c0 Σ_{t=1}^k (Σ_{1,m_t^ℓ})².    (4)

By Assumption 3, if i ∈ M∗, then there exists ℓ_i ∈ I_i^n such that ρ_{ℓ_i}^c ≥ c1 n^{−κ}. Without loss of generality, assume that Σ_{1,m_1^{ℓ_i}} is the element with the largest absolute value, i.e., |Σ_{1,m_1^{ℓ_i}}| = max_{1≤t≤k} |Σ_{1,m_t^{ℓ_i}}|. By Equation (4), |Σ_{1,m_1^{ℓ_i}}| ≥ c1∗ n^{−κ} for some c1∗ > 0. Denoting the corresponding Kendall's τ-based estimator of Σ_{1,m_1^{ℓ_i}} by Ŝ_{1,m_1^{ℓ_i}} ∈ Ŝ_{I×J^{ℓ_i}}, the following holds:

    Pr( min_{i∈M∗} |Ŝ_{1,m_1^{ℓ_i}}| ≥ (c1∗/2) n^{−κ} ) ≥ Pr( max_{i∈M∗} |Ŝ_{1,m_1^{ℓ_i}} − Σ_{1,m_1^{ℓ_i}}| ≤ (c1∗/2) n^{−κ} ).    (5)

Furthermore, by an argument similar to the proof of Theorem 4.2 in [21], we easily get

    Pr( max_{i∈M∗} |Ŝ_{1,m_1^{ℓ_i}} − Σ_{1,m_1^{ℓ_i}}| ≥ (c1∗/2) n^{−κ} ) ≤ p · Pr( |τ̂_{1,m_1^{ℓ_i}} − τ_{1,m_1^{ℓ_i}}| ≥ (c1∗/π) n^{−κ} ).

By Lemma 1, it further holds that

    Pr( |τ̂_{1,m_1^{ℓ_i}} − τ_{1,m_1^{ℓ_i}}| ≥ (c1∗/π) n^{−κ} ) ≤ 2 exp( −(c1∗²/(2π²)) n^{1−2κ} ).

By Assumption 1, log p = O(n^ξ) with ξ < 1−2κ, so for some constant C,

    Pr( max_{i∈M∗} |Ŝ_{1,m_1^{ℓ_i}} − Σ_{1,m_1^{ℓ_i}}| ≥ (c1∗/2) n^{−κ} ) ≤ 2 exp( −C n^{1−2κ} ).

Combining this with Equation (5), we have Pr( min_{i∈M∗} |Ŝ_{1,m_1^{ℓ_i}}| ≥ (c1∗/2) n^{−κ} ) ≥ 1 − 2 exp(−C n^{1−2κ}). Besides, it is easy to show that (ρ̂_ℓ^c)² ≥ max_{1≤t≤k} (Ŝ_{1,m_t^ℓ})², and hence, absorbing the factor 1/2 into c1∗,

    Pr( min_{i∈M∗} max_{ℓ∈I_i^n} ρ̂_ℓ^c ≥ c1∗ n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

The above result further implies Pr( M∗ ⊂ M̂_{c1∗ n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ), which concludes the theorem.

Proof of Theorem 3.2.
The proof of Theorem 3.2 is split into two steps for readability.

Step I. We first prove that

    Pr( max_{1≤ℓ≤C_p^k} |(ρ̂_ℓ^c)² − (ρ̃_ℓ^c)²| > c n^{−κ} ) = O( exp(−C n^{1−2κ}) ),

where (ρ̃_ℓ^c)² = Ŝ_{I×J^ℓ} Σ_{J^ℓ×J^ℓ}^{−1} Ŝ_{I×J^ℓ}^⊤. Note that the determinants of the matrices Σ_{J^ℓ×J^ℓ} and Ŝ_{J^ℓ×J^ℓ} are polynomials of finite order in their entries, so the following inequalities hold:

    Pr( | |Ŝ_{J^ℓ×J^ℓ}| − |Σ_{J^ℓ×J^ℓ}| | > c n^{−κ} ) ≤ Pr( max_{1≤s,t≤k} |Ŝ_{s,t} − Σ_{s,t}| > c n^{−κ} )
        ≤ k² · Pr( |Ŝ_{s,t} − Σ_{s,t}| > c n^{−κ} )
        = k² · Pr( |sin(π τ̂_{s,t}/2) − sin(π τ_{s,t}/2)| ≥ c n^{−κ} )
        ≤ k² · Pr( |τ̂_{s,t} − τ_{s,t}| ≥ (2c/π) n^{−κ} ).

According to Lemma 1, we have

    Pr( |τ̂_{s,t} − τ_{s,t}| ≥ (2c/π) n^{−κ} ) ≤ 2 exp( −(2c²/π²) n^{1−2κ} ).

Thus for some positive constant C∗,

    Pr( | |Ŝ_{J^ℓ×J^ℓ}| − |Σ_{J^ℓ×J^ℓ}| | > c n^{−κ} ) ≤ exp( −C∗ n^{1−2κ} ).

By Assumption 1, log p = O(n^ξ) with ξ ∈ (0, 1−2κ), so for some positive constant C,

    Pr( max_{1≤ℓ≤C_p^k} | |Ŝ_{J^ℓ×J^ℓ}| − |Σ_{J^ℓ×J^ℓ}| | > c n^{−κ} ) ≤ exp( −C n^{1−2κ} ).

Note that k is finite; by the adjugate-matrix expansion of an inverse matrix and an analysis similar to the above, we have, for any positive c,

    Pr( max_{1≤ℓ≤C_p^k} ||(Ŝ_{J^ℓ×J^ℓ})^{−1} − (Σ_{J^ℓ×J^ℓ})^{−1}||_∞ > c n^{−κ} ) ≤ exp( −C n^{1−2κ} ).

Note that

    |(ρ̂_ℓ^c)² − (ρ̃_ℓ^c)²| = | Ŝ_{I×J^ℓ} ( (Ŝ_{J^ℓ×J^ℓ})^{−1} − (Σ_{J^ℓ×J^ℓ})^{−1} ) Ŝ_{I×J^ℓ}^⊤ | ≤ k² ||(Ŝ_{J^ℓ×J^ℓ})^{−1} − (Σ_{J^ℓ×J^ℓ})^{−1}||_∞,

thus we have Pr( max_{1≤ℓ≤C_p^k} |(ρ̂_ℓ^c)² − (ρ̃_ℓ^c)²| > c n^{−κ} ) = O( exp(−C n^{1−2κ}) ).

Step II. We first prove that for any c > 0, Pr( max_{1≤ℓ≤C_p^k} |ρ̃_ℓ^c − ρ_ℓ^c| ≥ c n^{−κ} ) = O( exp(−C n^{1−2κ}) ). By Lemma 1, Pr( |Ŝ_{s,t} − Σ_{s,t}| ≥ c n^{−κ} ) = O( exp(−C n^{1−2κ}) ). By Assumption 1, log p = O(n^ξ) with ξ ∈ (0, 1−2κ), so Pr( max_{s,t} |Ŝ_{s,t} − Σ_{s,t}| ≥ c n^{−κ} ) = O( exp(−C n^{1−2κ}) ). Recalling that (ρ̃_ℓ^c)² = Ŝ_{I×J^ℓ} Σ_{J^ℓ×J^ℓ}^{−1} Ŝ_{I×J^ℓ}^⊤, the properties of Σ_{J^ℓ×J^ℓ} give, for any c > 0,

    Pr( max_{1≤ℓ≤C_p^k} |ρ̃_ℓ^c − ρ_ℓ^c| ≥ c n^{−κ} ) = O( exp(−C n^{1−2κ}) ).

Further, under Assumption 4 the canonical correlations ρ_ℓ^c with ℓ ∈ I_i^n and i ∉ M∗ are of order o(n^{−κ}), so that

    Pr( max_{i∉M∗} max_{ℓ∈I_i^n} ρ̃_ℓ^c < (c1∗/2) n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

As a result of Step I, Pr( max_{1≤ℓ≤C_p^k} |(ρ̂_ℓ^c)² − (ρ̃_ℓ^c)²| > c n^{−κ} ) = O( exp(−C n^{1−2κ}) ). Thus we further have

    Pr( max_{i∉M∗} max_{ℓ∈I_i^n} ρ̂_ℓ^c < c1∗ n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).    (6)

In Theorem 3.1, we obtained the inequality

    Pr( min_{i∈M∗} max_{ℓ∈I_i^n} ρ̂_ℓ^c ≥ c1∗ n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

Finally, combining this with Equation (6), we have Pr( M∗ = M̂_{c1∗ n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ), which concludes the theorem.

Proof of Theorem 3.4.
Let δ → 0 satisfy δ n^{1−2κ−τ−τ∗} → ∞ as n → ∞ and define

    M(δ) = { 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̂_ℓ^c is among the largest ⌊δp⌋ of all },
    M̃(δ) = { 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̃_ℓ^c is among the largest ⌊δp⌋ of all },

where (ρ̃_ℓ^c)² = Ŝ_{I×J^ℓ} Σ_{J^ℓ×J^ℓ}^{−1} Ŝ_{I×J^ℓ}^⊤ and (ρ̂_ℓ^c)² = Ŝ_{I×J^ℓ} Ŝ_{J^ℓ×J^ℓ}^{−1} Ŝ_{I×J^ℓ}^⊤. We will first show that

    Pr( M∗ ⊂ M(δ) ) ≥ 1 − O( exp(−C n^{1−2κ}) ).    (7)

According to Theorem 3.1, it is equivalent to show that Pr( M∗ ⊂ M(δ) ∩ M̂_{c1∗ n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ). By Step I in the proof of Theorem 3.2, it is also equivalent to show that Pr( M∗ ⊂ M̃(δ) ∩ M̂_{c1∗ n^{−κ}} ) ≥ 1 − O( exp(−C n^{1−2κ}) ). Finally, by Theorem 3.1 again, proving Equation (7) is equivalent to proving that

    Pr( M∗ ⊂ M̃(δ) ) ≥ 1 − O( exp(−C n^{1−2κ}) ).    (8)

Recall that in the proof of Theorem 3.1 we obtained

    Pr( min_{i∈M∗} max_{ℓ∈I_i^n} ρ̃_ℓ^c ≥ c1∗ n^{−κ} ) ≥ 1 − O( exp(−C n^{1−2κ}) ).

If we can prove that

    Pr( Σ_{i=1}^p ( max_{ℓ∈I_i^n} ρ̃_ℓ^c )² ≤ c n^{−1+τ∗+τ} p ) ≥ 1 − O( exp(−C n^{1−2κ}) ),

then we will get, with probability larger than 1 − O( exp(−C n^{1−2κ}) ),

    Card{ 1 ≤ i ≤ p : max_{ℓ∈I_i^n} ρ̃_ℓ^c ≥ min_{i∈M∗} max_{ℓ∈I_i^n} ρ̃_ℓ^c } ≤ c p / n^{1−2κ−τ−τ∗},    (9)

which further indicates that the result in (8) holds since δ n^{1−2κ−τ−τ∗} → ∞. So to end the whole proof, we just need to show that (9) holds.

For each 1 ≤ i ≤ p, let ρ̃_{i0}^c = max_{ℓ∈I_i^n} ρ̃_ℓ^c. Note that (ρ̃_{i0}^c)² = Ŝ_{I×J^{i0}} Σ_{J^{i0}×J^{i0}}^{−1} Ŝ_{I×J^{i0}}^⊤ with Ŝ_{I×J^{i0}} = (Ŝ_{1,m_1^{i0}}, ..., Ŝ_{1,m_k^{i0}}). Under Assumption 2, we have

    (ρ̃_{i0}^c)² ≤ c0 Σ_{t=1}^k (Ŝ_{1,m_t^{i0}})² = c0 ||Ŝ_{I×J^{i0}}||₂²,

which further indicates

    Σ_{i=1}^p (ρ̃_{i0}^c)² ≤ c0 k k_n ||Ŝ_{I×T}||₂²,  with T = {2, ..., d}.

Note that Pr( |Ŝ_{s,t} − Σ_{s,t}| ≥ c n^{−κ} ) ≤ 2 exp( −2c² n^{1−2κ}/π² ); thus, arguing as in [16], we easily obtain

    Pr( Σ_{i=1}^p ( max_{ℓ∈I_i^n} ρ̃_ℓ^c )² ≤ c n^{−1+τ∗+τ} p ) ≥ 1 − O( exp(−C n^{1−2κ}) ).
Finally, following the same idea of iterative screening as in the proof of Theorem 1 of [9], we finish the proof of the theorem.

Acknowledgements

Yong He's work is supported by the grant of National Science Foundation of China (NSFC 11801316), the Natural Science Foundation of Shandong Province (ZR2019QA002, ZR2019MA035) and the National Statistical Scientific Research Project (2018LY63). Xinsheng Zhang's research is partially supported by the grant of National Science Foundation of China (NSFC 11571080). Jiadong Ji's work is supported by the grant of National Science Foundation of China (NSFC 81803336) and the Natural Science Foundation of Shandong Province (ZR2018BH033). The authors would like to thank the Editor Dietrich von Rosen, the Associate Editor and the anonymous reviewers for their constructive comments that led to a major improvement of this article. All remaining errors are our own.

References

[1] R. F. Barber, M. Kolar, ROCKET: Robust confidence intervals via Kendall's tau for transelliptical graphical models, Ann. Statist. 46 (2018) 3422–3450.
[2] B. Boldstad, R. Irizarry, M. Astrand, T. Speed, A comparison of normalization methods for high density oligonucleotide array data based on bias and variance, Bioinformatics 19 (2003) 185–193.
[3] T. T. Cai, L. Zhang, High-dimensional Gaussian copula regression: Adaptive estimation and statistical inference, Statist. Sinica 28 (2018) 963–993.
[4] E. Candes, T. Tao, The Dantzig selector: Statistical estimation when p is much larger than n, Ann. Statist. 35 (2007) 2313–2351.
[5] A. P. Chiang, J. S. Beck, H.-J. Yen, M. K. Tayeh, T. E. Scheetz, R. E. Swiderski, D. Y. Nishimura, T. A. Braun, K.-Y. A. Kim, J. Huang, et al., Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11), Proc. Nat. Acad. Sci. 103 (2006) 6287–6292.
[6] H. Cui, R. Li, W. Zhong, Model-free feature screening for ultrahigh dimensional discriminant analysis, J. Amer. Statist. Assoc. 110 (2015) 630–641.
[7] J. Fan, Y. Feng, R. Song, Nonparametric independence screening in sparse ultra-high dimensional additive models, J. Amer. Statist. Assoc. 106 (2011) 544–557.
[8] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
[9] J. Fan, J. Lv, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 (2008) 849–911.
[10] J. Fan, Y. Ma, W. Dai, Nonparametric independence screening in sparse ultra-high dimensional varying coefficient models, J. Amer. Statist. Assoc. 109 (2014) 1270.
[11] J. Fan, L. Xue, H. Zou, Multitask quantile regression under the transnormal model, J. Amer. Statist. Assoc. 111 (2016) 1726–1735.
[12] H. B. Fang, K. T. Fang, S. Kotz, The meta-elliptical distributions with given marginals, J. Multivariate Anal. 82 (2002) 1–16.
[13] F. Han, H. Liu, Scale-invariant sparse PCA on high-dimensional meta-elliptical data, J. Amer. Statist. Assoc. 109 (2014) 275–287.
[14] J. Huang, J. L. Horowitz, F. Wei, Variable selection in nonparametric additive models, Ann. Statist. 38 (2010) 2282–2313.
[15] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, T. P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003) 249–264.
[16] X. B. Kong, Z. Liu, Y. Yao, W. Zhou, Sure screening by ranking the canonical correlations, Test 26 (2017) 1–25.
[17] G. Li, H. Peng, J. Zhang, L. Zhu, Robust rank correlation based screening, Ann. Statist. 40 (2012) 1846–1877.
[18] G. Li, H. Peng, L. Zhu, Nonconcave penalized M-estimation with a diverging number of parameters, Statist. Sinica 21 (2011) 391–419.
[19] R. Li, W. Zhong, L. Zhu, Feature screening via distance correlation learning, J. Amer. Statist. Assoc. 107 (2012) 1129–1139.
[20] F. Lindskog, A. McNeil, U. Schmock, Kendall's tau for elliptical distributions, Credit Risk (2003) 149–156.
[21] H. Liu, F. Han, M. Yuan, J. Lafferty, L. Wasserman, High-dimensional semiparametric Gaussian copula graphical models, Ann. Statist. 40 (2012).
[22] J. Y. Liu, W. Zhong, R. Li, A selective overview of feature screening for ultrahigh-dimensional data, Sci. China Math. 58 (2015) 1–22.
[23] L. Meier, S. van de Geer, P. Bühlmann, High-dimensional additive modeling, Ann. Statist. 37 (2009) 3779–3821.
[24] T. E. Scheetz, K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson, A. M. Dorrance, G. F. DiBona, J. Huang, T. L. Casavant, et al., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc. Nat. Acad. Sci. 103 (2006) 14429–14434.
[25] R. Song, W. Lu, S. Ma, X. Jessie Jeng, Censored rank independence screening for high-dimensional survival data, Biometrika 101 (2014) 799–814.
[26] T. Sun, C.-H. Zhang, Scaled sparse linear regression, Biometrika 99 (2012) 879–898.
[27] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 (1996) 267–288.
[28] A. M. Wesselman, B. M. S. van Praag, Elliptical regression operationalized, Econom. Lett. 23 (1987) 269–274.
[29] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 (2006) 49–67.
[30] C.-H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010) 894–942.
[31] L. Zhu, L. Li, R. Li, L. Zhu, Model-free feature screening for ultrahigh dimensional data, J. Amer. Statist. Assoc. 106 (2011) 1464–1475.
[32] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc. 101 (2006) 1418–1429.