Nonparametric significance testing and group variable selection


Journal of Multivariate Analysis 133 (2015) 51–60


Adriano Zanin Zambom a,∗, Michael G. Akritas b

a Department of Statistics, State University of Campinas (UNICAMP), Brazil
b Department of Statistics, The Pennsylvania State University, USA

Article history: Received 20 July 2013

AMS subject classification: 62Gxx

Keywords: Backward elimination; Dimension reduction; Heteroscedasticity; Lack-of-fit tests; Local polynomial regression; Nonparametric regression

Abstract

In the context of a heteroscedastic nonparametric regression model, we develop a test for the null hypothesis that a subset of the predictors has no influence on the regression function. The test uses residuals obtained from local polynomial fitting of the null model and is based on a test statistic inspired by high-dimensional analysis of variance. Using p-values from this test, and multiple testing ideas, a group variable selection method is proposed, which can consistently select even groups of variables with diminishing predictive significance. A backward elimination version of this procedure, called GBEAMS for Group Backward Elimination Anova-type Model Selection, is recommended for practical applications. Simulation studies suggest that the proposed test procedure outperforms the generalized likelihood ratio test when the alternative is non-additive or there is heteroscedasticity. Additional simulation studies reveal that the proposed group variable selection procedure performs competitively against other variable selection methods, and outperforms them in selecting groups having nonlinear or dependent effects. The proposed group variable selection procedure is illustrated on a real data set.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Advances in data collection technologies and data storage devices have enabled the collection of large data sets, involving a large number of observations on many variables, in several disciplines. When the objective is to build a predictive model, the challenges presented by high dimensional data sets have opened new frontiers for statistical research. Including insignificant predictors results in complicated models with less predictive power and reduced ability to discern and interpret the influence of the predictors. The underlying principles of modern model building are parsimony and sparseness. As such, variable selection is a fundamental component of model building, and has received extensive attention over the last 20 years.

Due to readily available software, variable selection is often performed by modeling the expected response at covariate value x as m(x) = xβ. Classical approaches to variable selection, such as stepwise selection or elimination procedures, and best subset variable selection, can be computationally intensive or ignore stochastic errors. A new class of methodologies addresses variable selection through minimization of a constrained or penalized objective function, such as Tibshirani's [26] Lasso, Fan and Li's [15] SCAD, Efron, Hastie, Johnstone, and Tibshirani's [13] least angle regression, Zou's [36] adaptive Lasso and Candes and Tao's [10] Dantzig selector. A different approach exploits the conceptual connection between model testing and variable selection: dropping variable j from the model is equivalent to not rejecting the null hypothesis H_0^j : β_j = 0.



Corresponding author. E-mail addresses: [email protected], [email protected] (A.Z. Zambom).

http://dx.doi.org/10.1016/j.jmva.2014.08.014 0047-259X/© 2014 Elsevier Inc. All rights reserved.


Abramovich, Benjamini, Donoho and Johnstone [1] showed that application of the false discovery rate (FDR) controlling procedure of Benjamini and Hochberg [6] on p-values resulting from testing each H_0^j can be translated into minimizing a model selection criterion similar to that used in [27] and others. Working with orthogonal designs, they also showed that their method is asymptotically minimax for ℓ_r loss, 0 < r ≤ 2, simultaneously throughout a range of sparsity classes, provided the level q for the false discovery rate is set to q < 0.5. Generalizations of this methodology to non-orthogonal designs differ mainly in the generation of the p-values for testing H_0^j : β_j = 0, and the false discovery rate method employed. Bunea, Wegkamp and Auguste [9] use p-values generated from the standardized regression coefficients resulting from fitting the full model and employ the Benjamini and Yekutieli [8] method for controlling the false discovery rate under dependency, while Benjamini and Gavrilov [5] use p-values from a forward selection procedure where the ith stage p-to-enter is the ith stage constant in the multiple-stage false discovery rate procedure in [7].

Model checking and variable selection procedures based on the assumption of a linear model may fail to discern the relevance of covariates whose effect on m(x) is nonlinear. Because of this, procedures for both model checking and variable selection have been developed under more general/flexible models. See, for example, [20,29,17,25], and references therein. However, the methodological approaches in this literature have been distinct from those of model checking. Zambom [33] and Zambom and Akritas [34] develop a consistent variable selection procedure using p-values from testing the residual significance of each variable in a heteroscedastic nonparametric model.

In many applications, covariates come in groups (see Section 4 for an example), and the issue of selecting the groups of variables with predictive significance arises. The most common group selection procedures are the group Lasso [32] and the adaptive group Lasso [29]. Alternative methods with concave penalties have also been used, as for example the group SCAD. Zhu and Li [35] develop an alternative method based on nonlinear dimension reduction which is shown to perform well in nonparametric additive models. See also [23] who, using averages of the genes within each group, perform selection based on a procedure combining hierarchical clustering and Lasso.

To our knowledge, test-based variable selection procedures have not been extended to group selection. In principle, this extension is straightforward: the p-values for the significance of each individual covariate are replaced by the p-values for testing the significance of each group of variables. For the Bunea, Wegkamp and Auguste [9] procedure, such an extension is automatic since p-values for groups of variables are readily available in the context of a linear model. Moreover, if the groups of variables are orthogonal, or nearly orthogonal, such an extension of the Bunea et al. [9] procedure is expected to satisfy optimality properties similar to those of Abramovich, Benjamini, Donoho and Johnstone [1] (under sparseness of the groups of variables; see Section 3).

In this paper we consider a heteroscedastic nonparametric model, and extend the test-based variable selection method of Zambom and Akritas [34]. Specifically, the present paper achieves two objectives.
First, it develops a procedure for testing the residual predictive significance of a group of variables in a nonparametric heteroscedastic model. The term "predictive significance" is used to highlight the fact that the test procedure is designed to detect the effect of the variables on the regression function, while maintaining its level if the variables only affect other aspects of the conditional distribution of the response variable, such as the variance function. As with the test in [34], the present test is modeled after the high-dimensional one-way ANOVA test of Akritas and Papadatos [2]. The main innovation in the present test lies in the construction of the "augmented cells", which is the most critical part for the construction of a successful test. Simulations suggest that the proposed test achieves competitive power, and accurate type I error rates under heteroscedasticity, in accordance with its asymptotic properties. Second, the paper introduces GBEAMS (Group Backward Elimination Anova-type Model Selection), a backward elimination procedure for group variable selection using the Benjamini and Yekutieli [8] method applied on the p-values resulting from testing the predictive significance of each group. A group selection version of the consistency result in [34] can be established along the same lines. Additional simulations suggest that the proposed group variable selection procedure performs competitively, and outperforms group Lasso and group SCAD in selecting groups having nonlinear effects.

The paper is organized as follows. Section 2 describes the proposed methodology for testing the predictive significance of a group of variables, derives the asymptotic distribution of the test statistic under the null hypothesis and local alternatives, and presents results of simulation studies comparing its performance to that of the generalized likelihood ratio test of Fan and Jiang [14]. Section 3 describes the test-based group variable selection procedure, and presents results of simulation studies comparing its performance to that of group Lasso and the procedure introduced in [35]. The analysis of a real data set involving gene expression levels of healthy and cancerous colon tissues is presented in Section 4.

2. Nonparametric model checking

2.1. The hypothesis and the test statistic

Assume we have n observations, (Y_i, U_i), i = 1, ..., n, of the response variable Y and covariates U = (X, Z), where X and Z have dimensions r and s respectively (r + s = d), which remain fixed as n → ∞. Let m(x, z) = E(Y | X = x, Z = z) denote the regression function. The heteroscedastic nonparametric regression model is

Y = m(X, Z) + σ(X, Z)ϵ,    (1)

where ϵ has zero mean and constant variance and is uncorrelated with (X, Z). The goal is to test the null hypothesis that Z does not contribute to the regression function, i.e.

H_0 : m(x, z) = m_1(x).    (2)


The idea for testing this hypothesis is to treat the covariate values Z_i, i = 1, ..., n, as the levels of a high-dimensional one-way analysis of variance design, with the null hypothesis residual ξ̂_i = Y_i − m̂_1(x_i) being the observation from factor level Z_i, and construct an analysis of variance based test statistic. Because the asymptotic theory for high-dimensional analysis of variance requires more than one observation per factor level [2], we will employ smoothness conditions, which will be stated below, and augment each factor level by including residuals from nearby covariate values. With a univariate covariate, such factor level augmentation was carried out in [28,34] by including in each factor level the residuals corresponding to neighboring covariate values. With a multivariate covariate, the challenge is to define the neighboring covariate values in a way that results in a test statistic with good power properties.

There are several meaningful ways of doing so. For example, one can use either the first principal component, or the first supervised principal component [4], or the first principal direction in projection pursuit regression (cf. [16]), and define neighboring covariate values in terms of any of these univariate quantities. Let P_n = Z^T C_n denote the sample version of any of these quantities, and let P = Z^T C denote its limiting version; for example, in projection pursuit C will be the first projective direction and C_n its estimate. In the hypothesis test simulations reported in Section 2.3, we used an adaptation of the supervision idea of Bair et al. [4] combined with the first projective direction, described below.

Having a univariate surrogate of Z, we augment each cell P_{n,i} = Z_i^T C_n by including p − 1 additional residuals ξ̂_ℓ, for p odd, which correspond to the p − 1 nearest neighbors P_{n,ℓ} of P_{n,i}. To be specific, we consider the (ξ̂_i, P_{n,i}), i = 1, ..., n, arranged so that P_{n,i_1} < P_{n,i_2} whenever i_1 < i_2, and for each P_{n,i}, (p − 1)/2 < i ≤ n − (p − 1)/2, define the nearest neighbor window W_i as

W_i(C_n) = { j : |F̂_P(P_{n,j}) − F̂_P(P_{n,i})| ≤ (p − 1)/(2n) },    (3)

where F̂_P is the empirical distribution function of P_{n,1}, ..., P_{n,n}. W_i(C_n) defines the augmented cell corresponding to P_{n,i}. Note that the augmented cells are defined as sets of indices rather than as sets of ξ̂_i values. The vector of (n − p + 1)p constructed "observations" in the augmented one-way analysis of variance design is

ξ̂_{C_n} = (ξ̂_j, j ∈ W_{(p−1)/2+1}(C_n), ..., ξ̂_j, j ∈ W_{n−(p−1)/2}(C_n))^T.    (4)

Let MST and MSE denote the balanced one-way analysis of variance mean squares due to treatment and error, respectively, computed on the data ξ̂_{C_n}. More specifically, the proposed test statistic is based on

MST − MSE = (p/(n − 1)) Σ_{i=1}^{n} (ξ̂_{i·} − ξ̂_{··})² − (1/(np − n)) Σ_{i=1}^{n} Σ_{j∈W_i(C_n)} (ξ̂_j − ξ̂_{i·})²,    (5)

where ξ̂_{i·} = (1/p) Σ_{j∈W_i(C_n)} ξ̂_j and ξ̂_{··} = (1/np) Σ_{i=1}^{n} Σ_{j∈W_i(C_n)} ξ̂_j.

In this paper the residuals ξ̂_i = Y_i − m̂_1(x_i), i = 1, ..., n, will be formed using the local polynomial regression estimator of order q defined by

m̂_1(X_i) = e_1^T (X_{X_i}^T W_{X_i} X_{X_i})^{−1} X_{X_i}^T W_{X_i} Y = Σ_{j=1}^{n} w̃(X_i, X_j) Y_j,    i = 1, ..., n,    (6)

where W_x = diag{K_{H_n}(X_1 − x), ..., K_{H_n}(X_n − x)}, with K_{H_n}(x) = |H_n|^{−1/2} K(H_n^{−1/2} x) for K(·) a bounded, non-negative r-variate kernel function of bounded variation and with bounded support, H_n^{1/2} is a symmetric positive definite r × r bandwidth matrix, and

X_x = [ 1   (X_1 − x)^T   vech^T((X_1 − x)(X_1 − x)^T)   ···
        ⋮        ⋮                     ⋮
        1   (X_n − x)^T   vech^T((X_n − x)(X_n − x)^T)   ··· ],

with vech denoting the half-vectorization operator; see [22] for further details.

We finish this section with a description of the construction of the component P_n = Z^T C_n. First, we adapt the idea of supervision of Bair et al. [4] in the following way. Let p_j, j = 1, ..., s, denote the p-values obtained by applying the test of Zambom and Akritas [34] for testing the hypothesis H_0^j which specifies that Z_j, the jth coordinate of Z, has no effect on the regression function of the model with response variable Y and covariate vector (X, Z_j). For a threshold parameter θ define the index set

J = {j : p_j < θ}    (7)

and let Z_J be the vector formed from the J coordinates of Z. Depending on the cardinality of Z_J we use either its first principal component or the first projective direction from regressing ξ̂ on Z_J. A more detailed description is given in the simulation section. Simulation results not reported here suggest that the proposed procedure is robust to the choice of θ; the value θ = 0.2 will be used for the remainder of the paper.
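For concreteness, the following is a minimal R sketch of the statistic built from (3)–(5), assuming the residuals xi_hat and the univariate surrogate Pn are already available; the function name anova_type_stat, the default window size p, and the simplified boundary handling are illustrative choices, not the authors' implementation.

```r
## Minimal sketch of (3)-(5): residuals are sorted by the surrogate Pn, each interior
## index gets an augmented cell of its p nearest neighbors, and MST - MSE is the
## balanced one-way ANOVA contrast computed on the constructed cells.  (The
## normalizing constants use the number of constructed cells rather than n, a
## simplification relative to the display in (5).)
anova_type_stat <- function(xi_hat, Pn, p = 5) {
  stopifnot(p %% 2 == 1, length(xi_hat) == length(Pn), length(xi_hat) > p)
  xi   <- xi_hat[order(Pn)]                 # arrange residuals by increasing Pn, as in (3)
  n    <- length(xi)
  half <- (p - 1) / 2
  cells <- lapply((half + 1):(n - half), function(i) xi[(i - half):(i + half)])
  m_i  <- vapply(cells, mean, numeric(1))   # cell means, the xi-bar_{i.}
  m_gr <- mean(unlist(cells))               # grand mean, the xi-bar_{..}
  MST  <- p * sum((m_i - m_gr)^2) / (length(cells) - 1)
  MSE  <- sum(vapply(cells, function(v) sum((v - mean(v))^2), numeric(1))) /
          (length(cells) * (p - 1))
  MST - MSE
}
```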


2.2. Asymptotic results

Consider the following conditions:

(a) E|Y|^ρ < ∞ for some ρ > 2.
(b) The marginal densities f_X and f_P of X and P, respectively, are bounded away from zero.
(c) f_X is uniformly continuous.
(d) The q + 1 derivatives of m_1(x) exist and are Lipschitz uniformly continuous and bounded.
(e) σ²(·, t) := E(ξ² | Z^T C = t) is Lipschitz continuous, sup_u σ²(u) < ∞, and sup_i E(ϵ_i^4) < ∞.

Also, assume that the eigenvalues λ_i, i = 1, ..., d − 1, of the bandwidth matrix H_n^{1/2} defined in (6) converge to zero at the same rate and satisfy:

(1) n λ_i^{4(q+1)} → 0, i = 1, ..., d − 1.
(2) n λ_i^{2(d−1)} / (log n)² → ∞, i = 1, ..., d − 1.
(3) n^{1−2/ρ} λ_i^{d−1} / (ln n [ln n (ln ln n)^{1+δ}]^{2/ρ}) → ∞, i = 1, ..., d − 1.

Theorem 2.1. Assume conditions (a)–(e) and the conditions on the bandwidth (1)–(3) hold. Then, under H_0 in (2), the asymptotic distribution of the test statistic in (5) is given by

n^{1/2} (MST − MSE) →_d N( 0, (2p(2p − 1)/(3(p − 1))) τ² ),

where τ² = ∫ ( ∫ σ²(x, t) f_{X|Z^T C = t}(x) dx )² f_{Z^T C}(t) dt.

An estimate of τ² can be obtained by modifying Rice's [24] estimator as follows:

τ̂² = (1/(4(n − 3))) Σ_{j=2}^{n−2} (ξ̂_j − ξ̂_{j−1})² (ξ̂_{j+2} − ξ̂_{j+1})².    (8)
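As an illustration of how Theorem 2.1 and the estimator (8) combine into a usable test, here is a hedged R sketch; rice_tau2 and anova_type_pvalue are illustrative names, and anova_type_stat refers to the sketch given at the end of Section 2.1.

```r
## Variance estimate (8) computed from residuals sorted by Pn, followed by the
## standardized statistic and one-sided normal p-value suggested by Theorem 2.1.
rice_tau2 <- function(xi_sorted) {
  n <- length(xi_sorted)
  j <- 2:(n - 2)
  sum((xi_sorted[j] - xi_sorted[j - 1])^2 *
      (xi_sorted[j + 2] - xi_sorted[j + 1])^2) / (4 * (n - 3))
}

anova_type_pvalue <- function(xi_hat, Pn, p = 5) {
  stat <- anova_type_stat(xi_hat, Pn, p)        # MST - MSE on the augmented cells
  tau2 <- rice_tau2(xi_hat[order(Pn)])          # estimate of tau^2 as in (8)
  z    <- sqrt(length(xi_hat)) * stat /
          sqrt(2 * p * (2 * p - 1) / (3 * (p - 1)) * tau2)
  1 - pnorm(z)                                  # large positive values reject H0
}
```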

Next, we present the asymptotic theory of the test statistic under additive local alternatives and under general local alternatives, which are defined, respectively, as

H_{1A}: m(x_1, z) = m_1(x_1) + n^{−1/4} m̃_2(z),    (9)

H_{1G}: m(x_1, z) = m_1(x_1) + n^{−1/4} m̃_2(z) + n^{−1/4} m̃_12(x_1, z),    (10)

for functions m̃_2, m̃_12 which satisfy E(m̃_2(Z)) = 0 = E(m̃_12(X_1, Z)). For simplicity in the derivations, the stated results will also require some of the following conditions. Let Z_1 and Z_2 be independent copies of Z.

Condition 1: E(m̃_2²(Z) | C^T Z = t) is uniformly continuous in t for all C.
Condition 2: E(m̃_2(Z_1) m̃_2(Z_2) | C^T Z_1 = t_1, C^T Z_2 = t_2) is uniformly continuous in t_1 and t_2 for all C.
Condition 3: E(m̃_12(X, Z) m̃_2(Z) | C^T Z = t) is uniformly continuous in t for all C.
Condition 4: E(m̃_12(X, Z_1) m̃_2(Z_2) | C^T Z_1 = t_1, C^T Z_2 = t_2) is uniformly continuous in t_1 and t_2 for all C.

Theorem 2.2. Suppose that the assumptions of Theorem 2.1 hold. Then:

1. Under H_{1A} in (9), if Conditions 1 and 2 hold, as n → ∞,

n^{1/2} (MST − MSE) →_d N( μ_A, (2p(2p − 1)/(3(p − 1))) τ² ).

2. Under H_{1G} in (10), if Conditions 3 and 4 hold, as n → ∞,

n^{1/2} (MST − MSE) →_d N( μ_G, (2p(2p − 1)/(3(p − 1))) τ² ),

where

μ_A = p E[ m̃_2(Z_1) m̃_2(Z_2) I(P_1 = P_2) ],
μ_G = p E{ [m̃_2(Z_1) m̃_2(Z_2) + m̃_12(X_1, Z_1) m̃_12(X_1, Z_2) + 2 m̃_12(X_1, Z_1) m̃_2(Z_2)] I(P_1 = P_2) },

and, with an abuse of notation, P_i denotes the population version of the supervised principal component.


Table 1
Rejection rates for the homoscedastic additive model.

|Z|   Method                 κ = 0    κ = 0.2   κ = 0.4   κ = 0.6   κ = 0.8
3     ANOVA-type-SPC + EDR   0.061    0.46      0.92      0.99      1
3     GRLT                   0.048    0.88      1         1         1
5     ANOVA-type-SPC + EDR   0.063    0.36      0.93      0.99      1
5     GRLT                   0.046    0.77      1         1         1
10    ANOVA-type-SPC + EDR   0.068    0.29      0.81      0.99      1
10    GRLT                   0.058    0.55      1         1         1

2.3. Simulations: Model checking procedures

We compare the proposed analysis of variance-type (ANOVA-type) hypothesis test for groups with the generalized likelihood ratio test of Fan and Jiang [14]. The data are generated under three situations: a homoscedastic additive model, a homoscedastic non-additive model, and a heteroscedastic non-additive model. All covariates, in all models, are independent standard normal. The homoscedastic additive model is

Y = X_1 + κ(Z_1 + Z_2 + Z_3) + ϵ,    where ϵ ∼ N(0, 1),    (11)

the homoscedastic non-additive model is

Y = X_1^{X_2} (1 + κ(Z_1 + Z_2)) + X_2^{κ(Z_1 + Z_2)} + ϵ,    where ϵ ∼ N(0, 0.1²),    (12)

and the heteroscedastic non-additive model is

Y = X_1 + κ(sin(Z_1 Z_2) + Z_3 + Z_4) + Z_1 Z_2 ϵ,    where ϵ ∼ N(0, 0.5²).    (13)

In each situation we simulate 2000 data sets of size n = 200. All simulations were performed in R. Tables 1, 2, and 3 show the simulation results for models (11), (12), and (13), respectively. We report the rejection rates for the hypothesis test H_0 as in (2), where Z = (Z_1, Z_2, Z_3) for model (11), Z = (Z_1, Z_2) for model (12) and Z = (Z_1, Z_2, Z_3, Z_4) for model (13). We also report the results of the test for more realistic situations where irrelevant predictors are included in the group to be tested; in particular, we consider the cardinality of Z as either |Z| = 5 or 10.

The construction of the component P_n = Z^T C_n is as follows. When all variables in the group being tested are nonsignificant, the expected number of rejections when testing for the significance of the |Z| variables at level α is |Z|α. Hence, if the index set J defined in (7) is of size ⌈|Z|α⌉ or less, where ⌈·⌉ denotes the ceiling function, we take P_n to be the first principal component of Z_J. If the size of J is larger than ⌈|Z|α⌉, then we use the first projective direction from regressing ξ̂ on Z_J. Extensive simulations suggest that this strategy maintains the level of the procedure accurately while achieving high power for detecting relevant groups.

It is seen that the Generalized Likelihood Ratio test, which is designed for homoscedastic additive models, achieves better power under model (11), but is extremely liberal under heteroscedasticity; see Table 3. Table 2 suggests that the GRLT has low power against non-additive alternatives even in the homoscedastic case.

Remark. If X is irrelevant for predicting Y but is correlated with Z, the power of the proposed test decreases. In simulations not reported here, we found that regressing X on Z (coordinatewise, e.g. by a single index model) and replacing X by the resulting residuals gives much better power while maintaining the level. We view this as a form of nonparametric orthogonalization.

3. Nonparametric group variable selection

3.1. The GBEAMS procedure

In this section we present a test-based group variable selection procedure. For this purpose we make a slight change in notation by letting X denote the entire vector of covariates. Thus, we consider the nonparametric regression model

Y_i = m(X_i) + σ(X_i) ε_i,    i = 1, ..., n,

where the errors ε_i are independent with zero mean and constant variance. Suppose that the covariates are classified in d groups identified by the indices J_ℓ = {j : X_j belongs to group ℓ}, ℓ = 1, ..., d, and let s_ℓ denote the size of group J_ℓ. Moreover, we assume sparseness in the sense that only the variables in a subset I_0 = {J_1, ..., J_{d_0}} ⊂ {J_1, ..., J_d} of the groups influence the regression function. Define the hypotheses

H_0^ℓ : m(x) = m_1(x^{(−J_ℓ)}),    ℓ = 1, ..., d,

Table 2
Rejection rates for the homoscedastic non-additive model.

|Z|   Method                 κ = 0    κ = 0.02   κ = 0.04   κ = 0.06   κ = 0.08
2     ANOVA-type-SPC + EDR   0.54     0.30       0.65       0.79       0.80
2     GRLT                   0.074    0.09       0.16       0.23       0.38
5     ANOVA-type-SPC + EDR   0.064    0.16       0.52       0.67       0.72
5     GRLT                   0.058    0.06       0.08       0.16       0.27
10    ANOVA-type-SPC + EDR   0.058    0.16       0.38       0.52       0.55
10    GRLT                   0.048    0.04       0.08       0.11       0.16

Table 3
Rejection rates for the heteroscedastic non-additive model with 4 covariates in Z.

|Z|   Method                 κ = 0    κ = 0.1   κ = 0.2   κ = 0.3   κ = 0.4
4     ANOVA-type-SPC + EDR   0.065    0.31      0.66      0.87      0.95
4     GRLT                   0.512    0.89      1         1         1
5     ANOVA-type-SPC + EDR   0.058    0.25      0.61      0.89      0.96
5     GRLT                   0.430    0.82      1         1         1
10    ANOVA-type-SPC + EDR   0.066    0.20      0.45      0.80      0.92
10    GRLT                   0.378    0.72      0.98      1         1

where x^{(−J_ℓ)} is the set of all covariates except those whose indices are in J_ℓ, and let π_ℓ denote the p-value resulting from testing this hypothesis with the procedure of Section 2, using the first projective direction to form the windows W_i. Let H_0^{(ℓ)} denote the null hypothesis corresponding to the p-value π_{(ℓ)}, ℓ = 1, ..., d, where π_{(1)} ≤ ··· ≤ π_{(d)} denote the ordered p-values. For a choice of the level α, compute

k = max{ ℓ : π_{(ℓ)} ≤ ℓα / (d Σ_{j=1}^{d} j^{−1}) }    (14)

and reject the hypotheses H_0^{(ℓ)}, ℓ = 1, ..., k. If no such k exists, no hypotheses are rejected. The basic group selection procedure selects the groups of variables with indices corresponding to the k rejected null hypotheses. Using arguments similar to those in Zambom and Akritas [34], this group selection procedure can be shown to consistently select even groups of variables with diminishing predictive significance.

Finally, we will assume the dimension reduction model of Li [18], i.e.

m(x) = g(Bx),    where B is a K × (Σ_i s_i) matrix.    (15)

Under the dimension reduction model (15), this hypothesis can be written equivalently as

H_0^ℓ : g(Bx) = g(B^{(−J_ℓ)} x^{(−J_ℓ)}),    ℓ = 1, ..., d,    (16)

where B^{(−J_ℓ)} is the K × (d − s_ℓ) matrix obtained by omitting the columns of B with indices in J_ℓ. Let B̂ denote the Sliced Inverse Regression (SIR) estimator of B, and B̂^{(−J_ℓ)} be the corresponding submatrix. With this notation, let

z_ℓ = n^{1/2} (MST_ℓ − MSE_ℓ) / ( (2p(2p − 1)/(3(p − 1))) τ̂_ℓ² )^{1/2}

be the test statistic for testing the hypothesis (16), with B̂^{(−J_ℓ)} X^{(−J_ℓ)} playing the role of X in Theorem 2.1 and B̂^{(J_ℓ)} X^{(J_ℓ)} playing the role of Z, where X^{(J_ℓ)} is the set of all covariates whose indices are in J_ℓ and B̂^{(J_ℓ)} is the corresponding submatrix of B̂.

The proposed test-based variable selection procedure GBEAMS consists of the following steps:

1. Compute the p-value for H_0^ℓ as π_ℓ = 1 − Φ(z_ℓ), ℓ = 1, ..., d.
2. Compute k as in (14), for a choice of level α, where π_{(1)}, ..., π_{(d)} are the ordered p-values. If k = d, stop and retain all groups. If k < d,
   (a) update x by eliminating the covariates of the group corresponding to π_{(d)},
   (b) update d to d − 1,
   (c) update B̂ by eliminating the columns corresponding to the deleted variables,
   (d) update the test statistic z_ℓ, ℓ = 1, ..., d.
3. Repeat steps 1 and 2 with the updated z_ℓ, ℓ = 1, ..., d.
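To make the elimination loop concrete, the following is a minimal R sketch, assuming a user-supplied function pvalue_fun(groups, l) that returns the p-value of the Section 2 test for group l given the currently retained groups (after the SIR step); the name gbeams and this interface are illustrative assumptions, not the authors' code.

```r
## Hedged sketch of the GBEAMS backward-elimination loop.  `groups` is a list of
## column-index vectors; the threshold follows the Benjamini-Yekutieli-type cutoff
## in (14); the group with the largest p-value is dropped until all remaining groups
## are retained (k = d) or no group is left.
gbeams <- function(pvalue_fun, groups, alpha = 0.05) {
  while (length(groups) > 0) {
    d   <- length(groups)
    pv  <- vapply(seq_len(d), function(l) pvalue_fun(groups, l), numeric(1))
    ord <- order(pv)
    thr <- seq_len(d) * alpha / (d * sum(1 / seq_len(d)))   # cutoffs as in (14)
    k   <- if (any(pv[ord] <= thr)) max(which(pv[ord] <= thr)) else 0
    if (k == d) return(groups)            # all remaining groups are retained
    groups <- groups[-ord[d]]             # eliminate the group with the largest p-value
  }
  groups                                  # empty list: no group selected
}
```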


Table 4
Results for Model 1.

                     Group ranking                Group selection
Method               G1     G2     G3     G4      G1     G2     G3     G4     Grest
GBEAMS(α = 0.03)     0.91   0.77   1      0.93    0.93   0.77   0.97   0.93   0.18
GBEAMS(α = 0.04)     0.88   0.74   0.99   0.98    0.92   0.77   1      0.94   0.16
GBEAMS(α = 0.05)     0.87   0.80   0.99   0.94    0.96   0.80   0.99   0.94   0.20
GBEAMS(α = 0.08)     0.88   0.79   0.99   0.95    0.94   0.78   0.99   0.91   0.25
GBEAMS(α = 0.10)     0.90   0.78   0.98   0.94    0.93   0.86   0.99   0.97   0.51
Group Lasso          0.99   0.27   0.33   0.97    0.99   0.59   0.52   0.98   0.50
Group SCAD           0.99   0.29   0.31   0.97    0.93   0.15   0.23   0.92   0.13
KSIR-GAM             0.96   0.92   0.92   0.96    0.98   0.90   0.94   0.95   0.12

3.2. Variations of the GBEAMS procedure

Another approach for constructing a group variable selection procedure is to use a single application of the Benjamini and Yekutieli [8] method for controlling the false discovery rate. This is similar to one of the two procedures proposed in [9]. However, this did not perform well in simulations and is not recommended. A backward elimination approach was used in [19], but without incorporating multiple testing ideas. The dimension reduction step in GBEAMS can be implemented with another linear dimension reduction method (see for example [30]), or by KSIR, which is a nonlinear dimension reduction method (see [31]).

3.3. Simulations: variable selection procedure

We compare the proposed procedure GBEAMS to the procedure in [35] (KSIR-GAM in the tables), Group Lasso (available in the R package grplasso) and Group SCAD (available in the R package grpreg) in two different setups. In both cases the groups consist of 5 covariates each. The variables within group i are denoted by X_{i1}, ..., X_{i5}.

In the first setup we use the additive model in Zhu and Li, which we call Model 1. In this model, the X_{ij} follow a uniform distribution between 0 and 1, variables within each of the 10 groups admit a compound symmetric correlation structure with correlation 0.2, while variables in different groups are independent. The response Y is related to 4 groups through the relation

Y = 9 G_1 + 1.5 exp(3 G_2) + 75 (G_3 − 0.5)² + 3.75 sin(2π(G_4 − 1)/3) + ε,

where ε ∼ N(0, 1) and

G_1 = X_{11} − X_{12}
G_2 = X_{21} + X_{22} + X_{23} − X_{21} × X_{22} − X_{22} × X_{23} − X_{21} × X_{23}
G_3 = X_{31}
G_4 = X_{41} + X_{42} + X_{43} + X_{44} + X_{45}.

The model in the second setup, called Model 2, uses standard normal variables with dependence structure both within and between groups. In particular, the variables in groups 1, 3, 4 and 5 are generated from a 20-dimensional Gaussian distribution with mean 0, variance 1 and covariance 0.8; the variables in groups 2, 6, 7 and 8 are generated similarly. The response variable is related to 2 groups through the relation Y = 9 G_1 + 1.5 exp(3 G_2) + ε, where ε ∼ N(0, 1) and G_1 and G_2 are as given above.

The simulation results for Models 1 and 2 are reported in Tables 4 and 5, respectively. Under "Group ranking" we report the proportion of times that each of the truly relevant groups is ranked above all the irrelevant groups. Note that, because GBEAMS is a backward elimination algorithm, it provides a natural way of ranking groups, in the sense that the first group to be eliminated is ranked last and so forth. Under "Group selection" we report the proportion of times each of the relevant groups is selected by each method. The column Grest reports the average selected proportion over all the irrelevant groups. All proportions are based on 100 simulation runs and each row represents an independent simulation of 100 runs. Note that the α level for GBEAMS plays no role for group ranking, and thus the different proportions shown in the tables represent simulation variability.

From both Tables 4 and 5 it is seen that the rate with which group Lasso selects the irrelevant groups is much higher than the selection rates of some of the relevant groups; in Table 4 the relevant groups with low selection rates are those having highly nonlinear effects, while in Table 5 the relevant group with low selection rate has a linear effect but is highly correlated with irrelevant groups. This suggests that both nonlinearity of effects and dependence between groups adversely affect group Lasso's ability to select relevant groups. Moreover, the high selection rate of irrelevant groups suggests that group Lasso tends to select larger models.

Table 5
Results for Model 2.

                     Group ranking    Group selection
Method               G1     G2        G1     G2     Grest
GBEAMS(α = 0.03)     0.70   0.87      0.72   0.80   0.18
GBEAMS(α = 0.04)     0.70   0.89      0.75   0.85   0.16
GBEAMS(α = 0.05)     0.75   0.91      0.72   0.82   0.19
GBEAMS(α = 0.08)     0.73   0.88      0.76   0.81   0.25
GBEAMS(α = 0.10)     0.75   0.90      0.77   0.82   0.22
Group Lasso          0.04   0.84      0.58   0.93   0.53
Group SCAD           0.59   0.88      0.20   0.79   0.11
KSIR-GAM             0.14   0.96      0.36   0.99   0.20

The results of group SCAD for ranking resemble those of group Lasso, and while group SCAD selects irrelevant predictors at a lower rate, its ability to select relevant groups with nonlinear effects is even worse than that of group Lasso. In Table 4 it is seen that for Model 1 the proposed procedure obtains results similar to those of KSIR-GAM for groups 3 and 4, while it achieves slightly lower rates for groups 1 and 2. In Table 5 it is seen that for Model 2 the ranking and selection rates GBEAMS achieves for group 2 are slightly lower than those achieved by KSIR-GAM, but for group 1 GBEAMS achieves considerably better rates than KSIR-GAM. This suggests that high dependence between groups, with relevant groups correlated with irrelevant ones as in the present simulation setting, can adversely affect the ability of KSIR-GAM to distinguish between relevant and irrelevant groups. Finally, it is seen that GBEAMS is fairly robust to the choice of the α level; only for the choice of α = 0.1 in Model 1 did GBEAMS seem to select larger models.

In addition, we evaluated the proposed procedure under two simulation scenarios resembling more realistic microarray data applications. In the first scenario we generated 50 groups with 5 covariates each under Model 1. The rates at which the proposed procedure ranks G_1, ..., G_4 among the top 4 are 0.69, 0.50, 0.91 and 0.79, respectively. Similar to the outcomes in Table 4, these results are close to those of KSIR for G_3 and G_4, which have rates 0.90 and 0.81, while they are lower than those of KSIR for G_1 and G_2, where KSIR obtains rates 0.78 and 0.77. In the second scenario we generated 10 groups with 150 covariates each, where the response depends only on the first group through Y = exp(G_1) + ε. The group G_1 in this case is generated as

G_1 = X_{11} + ··· + X_{16} − X_{17} − ··· − X_{110} + X_{1i} + X_{2i} + X_{3i} + X_{4i},

where the last 4 terms are interactions randomly selected, without replacement, from all two-way interactions of the first 10 predictors. The proposed procedure ranked G_1 as the top group 83% of the time, a slightly higher rate than that of KSIR-GAM, which achieves 81%.

4. Real data example

The proposed procedure will be illustrated with an analysis of the colon cancer data set of Alon, Barkai, Notterdam, Gish, Ybarra, Mack and Levine [3]. The data set was obtained with the Affymetrix technology and shows expression levels of 6500 human genes in 40 tumor and 22 normal colon tissues. A selection of 2000 genes with highest minimal intensity across the samples was made by Alon, Barkai, Notterdam, Gish, Ybarra, Mack and Levine [3] and is publicly available at http://microarray.princeton.edu/oncology. Different clustering methods have been applied to this data set in several previous studies, including Dettling and Bühlmann [11], Dettling and Bühlmann [12], and Ma, Song and Huang [21].

To illustrate the proposed analysis of variance-type group variable selection procedure we first apply a clustering method to form the groups. We chose the supervised clustering procedure Wilma proposed by Dettling and Bühlmann [11], which is available in the package supclust in R. Wilma requires as input the number of clusters to be formed, and we specified 60, 55, 50 and 45 clusters. The next step of the proposed procedure requires dimension reduction through SIR. However, because the number of genes (2000) is much larger than the sample size (62), it is not possible to use SIR directly. Therefore, to estimate B, we ran SIR on the set of predictors composed of the first supervised principal component of each cluster.

We also ran the group Lasso procedure for binary responses using the package grplasso in R on the same clusters/groups returned by Wilma. However, a corresponding modification is needed in the calculation of the degrees of freedom required for the application of the Cp criterion; see [32]. This calculation requires the estimator β̂ from fitting all individual covariates. Since the number of covariates is much larger than the sample size, we obtain an approximation of the required estimator by first obtaining the estimator β̂_P from fitting the first PC from each cluster. Since the PCs are linear combinations of the covariates in each cluster, having a coefficient for a cluster's PC translates into coefficients for the covariates in that cluster.

Table 6 shows the groups selected by GBEAMS and the group Lasso procedure for the different specified numbers of clusters returned by Wilma. We note that there is significant overlap in the genes included in the clusters selected by each method across the different numbers of total clusters specified. Thus, the number of total clusters specified is not critical for selecting the important genes. The proposed method selects more clusters, which is contrary to the simulation results. This is probably due to the linear link function for the logit used in grplasso. For example, when the logit of the fitted probability of cancerous tissue is plotted against the first PC of cluster 4, which is selected by the proposed method but not by group Lasso, it shows a nonlinear effect (top panel of Fig. 1), whereas the non-linearity is much less pronounced when plotted against the first PC of cluster 2, which is selected by both the proposed method and group Lasso (bottom panel of Fig. 1).
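As a rough illustration of the dimension-reduction step just described, the sketch below summarizes each cluster by a first principal component before SIR is applied; the function name cluster_pcs, the use of prcomp, and the plain (unsupervised) principal component are simplifying assumptions for illustration rather than the exact computation used in the paper.

```r
## Given a gene-expression matrix X (samples x genes) and an integer cluster label
## for every gene (e.g. produced by the Wilma procedure), build one summary column
## per cluster from the first principal component of that cluster's genes.  SIR (or
## another dimension-reduction step) can then be run on the resulting
## low-dimensional predictor matrix.
cluster_pcs <- function(X, cluster) {
  sapply(sort(unique(cluster)), function(k) {
    genes <- X[, cluster == k, drop = FALSE]
    prcomp(genes, scale. = TRUE)$x[, 1]   # scores of the first principal component
  })
}
```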


Table 6
Results for colon data set.

No. initial clusters   Procedure     Clusters selected
60                     GBEAMS        1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 18, 23, 24, 25, 28, 31, 33, 34, 39, 42, 46
60                     Group Lasso   1, 2, 3, 8, 9, 16, 17, 37
55                     GBEAMS        1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 15, 16, 17, 22, 23, 24, 25, 26, 33, 38, 42, 45
55                     Group Lasso   2, 3, 7, 9, 13, 18
50                     GBEAMS        1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 14, 15, 16, 18, 21, 23, 24, 25, 26, 36, 40, 45, 47, 48, 49
50                     Group Lasso   2, 3, 7, 9, 27, 29
45                     GBEAMS        1, 2, 3, 4, 5, 6, 8, 10, 11, 12, 15, 16, 18, 21, 22, 23, 24, 25, 31, 34, 37, 40, 42, 44, 45
45                     Group Lasso   2, 3, 7, 9, 10, 16

Fig. 1. Logit of the fitted probability of cancerous tissue plotted against the first PC of cluster 4 (top panel) and against the first PC of cluster 2 (bottom panel).

Acknowledgments

The first author was supported by CAPES/Fulbright grant 15087657 and Faepex 573/13. The second author was supported by National Science Foundation grant DMS-0805598.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.jmva.2014.08.014.

References

[1] F. Abramovich, Y. Benjamini, D.L. Donoho, I.M. Johnstone, Adapting to unknown sparsity by controlling the false discovery rate, Ann. Statist. 34 (2006) 584–653.
[2] M.G. Akritas, N. Papadatos, Heteroscedastic one-way ANOVA and lack-of-fit tests, J. Amer. Statist. Assoc. 99 (2004) 368–382.
[3] U. Alon, N. Barkai, D. Notterdam, K. Gish, S. Ybarra, D. Mack, A. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. 96 (1999) 6745–6750.
[4] E. Bair, T. Hastie, D. Paul, R. Tibshirani, Prediction by supervised principal components, J. Amer. Statist. Assoc. 101 (2006) 119–137.
[5] Y. Benjamini, Y. Gavrilov, A simple forward selection procedure based on false discovery rate control, Ann. Appl. Stat. 3 (2009) 179–198.
[6] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B 57 (1995) 289–300.
[7] Y. Benjamini, A.M. Krieger, D. Yekutieli, Adaptive linear step-up procedures that control the false discovery rate, Biometrika 93 (2006) 491–507.
[8] Y. Benjamini, D. Yekutieli, The control of the false discovery rate in multiple testing under dependency, Ann. Statist. 29 (2001) 1165–1188.
[9] F. Bunea, M. Wegkamp, A. Auguste, Consistent variable selection in high dimensional regression via multiple testing, J. Statist. Plann. Inference 136 (2006) 4349–4364.
[10] E. Candes, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. 35 (2007) 2313–2351.
[11] M. Dettling, P. Bühlmann, Supervised clustering of genes, Genome Biol. 3 (12) (2002) research0069.1–0069.15.
[12] M. Dettling, P. Bühlmann, Finding predictive gene groups from microarray data, J. Multivariate Anal. 90 (2004) 106–131.
[13] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Statist. 32 (2004) 407–499.
[14] J. Fan, J. Jiang, Nonparametric inferences for additive models, J. Amer. Statist. Assoc. 100 (2005) 890–907.
[15] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
[16] P. Hall, On projection pursuit regression, Ann. Statist. 17 (1989) 573–588.
[17] J. Huang, J.L. Horowitz, Variable selection in nonparametric additive models, 2010. Available at http://faculty.wcas.northwestern.edu/~jlh951/papers/HHW-npam.pdf.
[18] K.C. Li, Sliced inverse regression for dimension reduction, J. Amer. Statist. Assoc. 86 (1991) 316–327.
[19] L. Li, R.D. Cook, C. Nachtsheim, Model-free variable selection, J. R. Stat. Soc. Ser. B 67 (2005) 285–299.
[20] R. Li, H. Liang, Variable selection in semiparametric regression modeling, Ann. Statist. 36 (2008) 261–286.
[21] S. Ma, X. Song, J. Huang, Supervised group Lasso with applications to microarray data analysis, BMC Bioinform. 8 (2007) 60.
[22] Masry, Multivariate local polynomial regression for time series: uniform strong consistency rates, J. Time Ser. Anal. 17 (1996) 571–599.
[23] M.Y. Park, T. Hastie, R. Tibshirani, Averaged gene expressions for regression, Biostatistics 8 (2007) 212–227.
[24] J. Rice, Bandwidth choice for nonparametric regression, Ann. Statist. 12 (1984) 1215–1230.
[25] C.B. Storlie, H.D. Bondell, B.J. Reich, H.H. Zhang, Surface estimation, variable selection, and the nonparametric oracle property, Statist. Sinica 21 (2011) 679–705.
[26] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B 58 (1996) 267–288.
[27] R. Tibshirani, K. Knight, The covariance inflation criterion for adaptive model selection, J. R. Stat. Soc. Ser. B 61 (1999) 529–546.
[28] L. Wang, M.G. Akritas, I.V. Keilegom, An ANOVA-type nonparametric diagnostic test for heteroscedastic regression models, J. Nonparametr. Stat. 20 (2008) 365–382.
[29] H. Wang, Y. Xia, Shrinkage estimation of the varying coefficient model, J. Amer. Statist. Assoc. 104 (2008) 747–757.
[30] Y. Xia, A multiple-index model and dimension reduction, J. Amer. Statist. Assoc. 103 (2008) 1631–1640.
[31] Y.-R. Yeh, S.-U. Huang, Y.-J. Lee, Nonlinear dimension reduction with kernel sliced inverse regression, IEEE Trans. Knowl. Data Eng. 21 (2009) 1590–1603.
[32] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B 68 (2006) 49–67.
[33] A.Z. Zambom, Hypothesis testing and variable selection in nonparametric regression, Doctoral dissertation, Department of Statistics, Penn State University, 2012.
[34] A.Z. Zambom, M. Akritas, Nonparametric lack-of-fit testing and consistent variable selection, Statist. Sinica 24 (2014) 1837–1858.
[35] H. Zhu, L. Li, Biological pathway selection through nonlinear dimension reduction, Biostatistics 12 (2011) 429–444.
[36] H. Zou, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc. 101 (2006) 1418–1429.