Journal of Statistical Planning and Inference 141 (2011) 2839–2848
Shrinkage tuning parameter selection in precision matrices estimation

Heng Lian

Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
Article history: Received 27 March 2009; received in revised form 19 October 2010; accepted 7 March 2011; available online 12 March 2011.

Abstract
Recent literature provides many computational and modeling approaches for covariance matrix estimation in penalized Gaussian graphical models, but relatively little work has been devoted to the choice of the tuning parameter. This paper aims to fill this gap by focusing on the problem of shrinkage parameter selection when estimating sparse precision matrices via the penalized likelihood approach. Previous approaches typically used K-fold cross-validation for this purpose. In this paper, we first derive the generalized approximate cross-validation for tuning parameter selection, which is not only a more computationally efficient alternative but also achieves a smaller error rate for model fitting than leave-one-out cross-validation. For consistency in the selection of nonzero entries in the precision matrix, we employ a Bayesian information criterion which provably identifies the nonzero conditional correlations in the Gaussian model. Our simulations demonstrate the general superiority of the two proposed selectors over leave-one-out cross-validation, 10-fold cross-validation and the Akaike information criterion.
Keywords: Adaptive lasso; BIC; Generalized approximate cross-validation; Precision matrix; SCAD penalty
1. Introduction

Undirected graphical models provide an easily understood way of describing and visualizing the relationships between multiple random variables. Usually the goal is to seek the simplest model that adequately explains the observations. In the Gaussian case, a zero entry in the inverse of the covariance matrix, or the precision matrix, corresponds to independence of the two corresponding random variables conditional on all others. Efficient estimation of a covariance matrix or its inverse is an important statistical problem. In network modeling, for example, correct identification of the nonzero entries provides an understanding of the relationships between the expression levels of different genes. There is also an abundance of methods in multivariate analysis that require the estimation of the covariance matrix or its inverse, such as principal component analysis (PCA) and linear discriminant analysis (LDA).

Due to the recent surge of interest in covariance matrix estimation, prompted by the large amounts of data generated by modern experimental technologies such as microarrays, a large number of different estimators have been proposed. Motivated by the fact that the sample covariance matrix is typically a noisy estimator when the sample size is not large, and that the resulting dense matrix is difficult to interpret, almost all modern estimators regularize the matrix by making it sparse. The notion of sparsity in the context of estimating covariance matrices was noticed as early as Dempster (1972), who simplified the matrix structure by setting some entries to zero.
Traditionally, for the identification of zero elements, forward selection or backward selection is performed with each element tested at each step. However, this is computationally infeasible even for data with a moderate number of variables. Li and Gui (2006) proposed using threshold gradient descent for estimating the sparse precision matrix in Gaussian graphical models. Bickel and Levina (2008) developed asymptotic theory on banding methods for both covariance and precision matrix estimation based on direct thresholding of the elements. Another way to estimate the graphical model is to perform a regression of each variable on all the remaining ones. For example, Dobra and West (2004) used a Bayesian approach that employs a stochastic algorithm which can deal with tens of thousands of variables.

Recently, there has been much interest in the estimation of sparse covariance or precision matrices using penalized likelihood methods. Meinshausen and Buhlmann (2006) estimate the conditional independence structure separately for each random variable using the Lasso penalty. Note that setting up separate regressions for each node does not result in a valid likelihood for the covariance matrix, and thus, in order to obtain a positive-definite estimator, some post-processing is typically performed as a last step. Yuan and Lin (2007) used semi-definite programming algorithms to achieve sparsity by penalizing the normal likelihood with a Lasso penalty on the elements of the precision matrix. The Lasso penalty forces some of the coefficients to be exactly zero. Compared to traditional model selection methods using information criteria, the Lasso is continuous and thus more stable. However, several authors (Meinshausen and Buhlmann, 2006; Zhao and Yu, 2006) have noted that the Lasso is in general not consistent for model selection unless some nontrivial conditions on the covariates are satisfied. Even when those conditions are satisfied, the efficiency of the estimator is compromised when one insists on variable selection consistency, since the coefficients are over-shrunk. To address these shortcomings of the Lasso, Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty, which takes into account several desired properties of the estimator, namely sparsity, continuity and asymptotic unbiasedness. They also showed that the resulting estimator possesses the oracle property, i.e., it is consistent for variable selection and behaves the same as if the zero coefficients were known in advance. These results were extended to the case of a diverging number of covariates in Fan and Peng (2004). Zou (2006) proposed the adaptive Lasso, which uses a weighted L1 penalty with weights determined by an initial estimator, and a similar oracle property follows. The idea behind the adaptive Lasso is to assign a larger penalty to zero coefficients and a smaller penalty to large coefficients. In the context of Gaussian graphical models, Lam and Fan (2007) studied rates of convergence for sparse covariance/precision matrix estimation with a general penalty function, which includes the SCAD and adaptive Lasso penalties as special cases, and showed that these penalty functions attenuate the bias problem associated with the Lasso penalty. In Fan et al. (2009), through a local linear approximation to the SCAD penalty function, efficient computation of the penalized likelihood is achieved using the graphical Lasso algorithm of Friedman et al. (2008).
Oracle properties as defined in Fan and Li (2001) were shown for the precision matrix estimator in Fan et al. (2009). The attractive oracle property of the penalized estimator depends critically on an appropriate choice of the tuning parameter. For the penalized likelihood method, most of the studies mentioned above employed cross-validation (CV) for the selection of the tuning parameter. Cross-validation requires fitting the model on different subsets of the observations multiple times, which increases the computational burden of this approach. As an approximation to leave-one-out cross-validation (LOOCV), Craven and Wahba (1979) proposed generalized cross-validation (GCV) for smoothing splines, and generalized approximate cross-validation (GACV) was later derived for non-Gaussian data (Dong and Wahba, 1996). We follow a similar strategy and derive a GACV for the Gaussian graphical model that can be computed efficiently without multiple fits of the model. However, for linear regression, Wang et al. (2007) showed that generalized cross-validation does not select the tuning parameter satisfactorily. In particular, tuning parameters chosen by GCV tend to overestimate the model size. Because of the asymptotic equivalence of generalized cross-validation, leave-one-out cross-validation and the Akaike information criterion (AIC) in this simple model, those authors proposed to use the Bayesian information criterion (BIC) for consistent model selection. We will demonstrate that similar conclusions hold for our problem of precision matrix estimation.

In this paper, we estimate precision matrices using the same computational approach as in Fan et al. (2009), but we focus on the problem of tuning parameter selection in the penalized Gaussian graphical model. In the next section, two selectors, GACV and BIC, are proposed in this context. Simulation studies are carried out in Section 3, which demonstrate the superior performance of the proposed selectors. A real microarray dataset is also analyzed for illustration. Finally, some remarks conclude the article in Section 4.
2. Penalized estimation and tuning parameter selection

Suppose $x_1,\ldots,x_n$ are i.i.d. observations generated from a $p$-dimensional multivariate Gaussian distribution with mean 0 and unknown covariance matrix $\Sigma_0$, where $x_i=(x_{i1},\ldots,x_{ip})^T$. Denote the sample covariance matrix by $S=\sum_{i=1}^{n}x_i x_i^T/n$. The inverse of $S$ is a natural estimator of the precision matrix $\Omega_0=\Sigma_0^{-1}$: it is the estimator that maximizes the Gaussian log-likelihood
$$\max_{\Omega}\ \log|\Omega|-\mathrm{Tr}(S\Omega),$$
where $|\Omega|$ is the determinant of the matrix $\Omega$.
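As a concrete illustration of these quantities, here is a minimal numpy sketch (the function names are ours, not the paper's) that forms the sample covariance $S$ and evaluates the Gaussian log-likelihood $\log|\Omega|-\mathrm{Tr}(S\Omega)$; when $S$ is invertible, $S^{-1}$ is its unpenalized maximizer.

```python
import numpy as np

def sample_cov(X):
    """Sample covariance S = sum_i x_i x_i^T / n for mean-zero data X of shape (n, p)."""
    n = X.shape[0]
    return X.T @ X / n

def gauss_loglik(omega, S):
    """Gaussian log-likelihood log|Omega| - Tr(S Omega) (constants dropped)."""
    _, logdet = np.linalg.slogdet(omega)
    return logdet - np.trace(S @ omega)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))       # n = 200 draws from N(0, I_5)
S = sample_cov(X)
omega_mle = np.linalg.inv(S)            # unpenalized maximizer when S is invertible
print(gauss_loglik(omega_mle, S) >= gauss_loglik(np.eye(5), S))   # True
```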
However, if the true precision matrix is known to be sparse, or the dimension of the random vector is large, the performance of the estimator can typically be improved by maximizing the penalized log-likelihood instead:
$$\hat{\Omega}(\lambda)=\arg\max_{\Omega}\ \log|\Omega|-\mathrm{Tr}(S\Omega)-\sum_{i=1}^{p}\sum_{j=1}^{p}p_{\lambda_{ij}}(|\omega_{ij}|),$$
where $\omega_{ij}$, $i=1,\ldots,p$, $j=1,\ldots,p$, are the entries of $\Omega$ and $\lambda_{ij}$ are the tuning parameters, which for now are left unspecified to allow for very general penalty functions. Note that we also penalize the diagonal elements, as in Fan et al. (2009).

In the above, using $\lambda_{ij}=\lambda$ and $p_{\lambda}(|x|)=\lambda|x|$ produces the Lasso estimator, while using $\lambda_{ij}=\lambda/|\tilde{\omega}_{ij}|^{\gamma}$, where $\gamma>0$ and $\tilde{\omega}_{ij}$ is any consistent estimator of $\omega_{ij}$, paired with the same $p_{\lambda}(|x|)=\lambda|x|$ produces the adaptive Lasso estimator. Another commonly used penalty function, proposed by Fan and Li (2001), is the SCAD penalty, defined through its derivative
$$p_{\lambda}'(|x|)=\lambda I(|x|\le\lambda)+\frac{(a\lambda-|x|)_{+}}{a-1}I(|x|>\lambda),$$
with $a=3.7$ according to the suggestion made in Fan and Li (2001).
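To fix ideas, the following sketch writes the three penalties out in code (our own illustration; the function names and the small guard constant protecting against division by zero are ours, and the default $\gamma=0.5$ matches the value used later in the simulations): the Lasso penalty, the entry-wise adaptive Lasso tuning parameters $\lambda_{ij}=\lambda/|\tilde{\omega}_{ij}|^{\gamma}$, and the SCAD derivative with $a=3.7$.

```python
import numpy as np

def lasso_penalty(x, lam):
    """Lasso penalty p_lambda(|x|) = lambda * |x|."""
    return lam * np.abs(x)

def adaptive_lasso_weights(omega_tilde, lam, gamma=0.5):
    """Entry-wise tuning parameters lambda_ij = lambda / |omega_tilde_ij|^gamma,
    given an initial consistent estimate omega_tilde; 1e-10 guards exact zeros (our addition)."""
    return lam / (np.abs(omega_tilde) ** gamma + 1e-10)

def scad_derivative(x, lam, a=3.7):
    """SCAD derivative: lam*I(|x| <= lam) + (a*lam - |x|)_+ / (a - 1) * I(|x| > lam)."""
    ax = np.abs(x)
    return lam * (ax <= lam) + np.maximum(a * lam - ax, 0.0) / (a - 1) * (ax > lam)

x = np.linspace(-3, 3, 7)
print(lasso_penalty(x, 0.5))
print(scad_derivative(x, 0.5))
```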
Unlike the Lasso estimator, which produces substantial biases for large elements, the adaptive Lasso as well as the SCAD estimator achieves the oracle property as defined in Fan and Li (2001); that is, it estimates the precision matrix as well as in the ideal situation where the zero elements are known. Efficient maximization with either the Lasso or the adaptive Lasso penalty can be performed directly using the graphical Lasso algorithm (Friedman et al., 2008). To take advantage of the graphical Lasso algorithm, Fan et al. (2009) suggested using a local linear approximation to recast the SCAD penalized likelihood as a weighted Lasso problem in each step. It was pointed out that a one-step algorithm can perform as well as the fully iterative local linear approximation algorithm. The reader is referred to Fan et al. (2009) for further details. Here we focus on tuning parameter selection, which involves a single parameter $\lambda$ for all three penalty functions mentioned above.

In Fan et al. (2009), it was shown that for both the adaptive Lasso penalty and the SCAD penalty, when $\lambda\to 0$ and $\sqrt{n}\,\lambda\to\infty$, the resulting estimators attain the oracle property. Hence, the choice of $\lambda$ is critical. Fan et al. (2009) proposed to use K-fold cross-validation (KCV); $K=10$ is probably the most popular choice in the literature. In K-fold cross-validation, one partitions the data into $K$ disjoint subsets and chooses $\lambda$ that maximizes the score
$$\mathrm{KCV}(\lambda)=\sum_{k=1}^{K}n_k\Bigl(\log|\hat{\Omega}^{(-k)}(\lambda)|-\mathrm{Tr}\bigl(S^{(k)}\hat{\Omega}^{(-k)}(\lambda)\bigr)\Bigr),$$
where $n_k$ is the sample size of the $k$-th partition, $S^{(k)}$ is the sample covariance matrix based on the data in the $k$-th partition, and $\hat{\Omega}^{(-k)}(\lambda)$ is the estimate of the precision matrix based on the data excluding the $k$-th partition, with $\lambda$ as the tuning parameter. The usual leave-one-out cross-validation (LOOCV) is just the special case of the above with $K=n$, so that each partition consists of a single observation. Computation of KCV entails $K$ maximization problems, fitted with each partition deleted in turn for each fixed $\lambda$, as well as a final maximization using the optimal tuning parameter. Thus the computation of KCV is infeasible for large datasets, especially when $K$ is large.

In the smoothing spline literature, the most popular approach for tuning parameter selection is generalized cross-validation (GCV) for Gaussian data. For non-Gaussian data, Dong and Wahba (1996) proposed the generalized approximate cross-validation (GACV), which is obtained by constructing an approximation to LOOCV based on the log-likelihood. The formula presented there does not directly apply to our problem, since their derivation is based on regression problems. We therefore derive a GACV score based on an approximation to LOOCV for our problem, which turns out to be
$$\mathrm{GACV}(\lambda)=n\bigl(\log|\hat{\Omega}(\lambda)|-\mathrm{Tr}(S\hat{\Omega}(\lambda))\bigr)+\sum_{i=1}^{n}\mathrm{vec}\bigl(\hat{\Omega}(\lambda)^{-1}-x_i x_i^T\bigr)^{T}\mathrm{vec}\bigl(\hat{\Omega}(\lambda)(S^{(-i)}-S)\hat{\Omega}(\lambda)\bigr),$$
where $S^{(-i)}=\sum_{j\ne i}x_j x_j^T/(n-1)$ and $\mathrm{vec}(\cdot)$ is the vectorization operator that concatenates the columns of a matrix.
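To make the computation concrete, the following is a minimal numpy sketch of the GACV score as displayed above (the function and variable names are ours; the penalized fit of $\hat{\Omega}(\lambda)$ itself is not shown and would come from a separate solver such as scikit-learn's GraphicalLasso for the Lasso penalty, though that pairing is our assumption rather than the paper's implementation).

```python
import numpy as np

def gacv_score(omega_hat, X):
    """GACV(lambda) for a fitted precision matrix omega_hat on mean-zero data X (n x p),
    following the displayed formula: n*(log|Omega| - Tr(S Omega)) plus, for each i,
    vec(Omega^{-1} - x_i x_i^T)^T vec(Omega (S^{(-i)} - S) Omega)."""
    n, p = X.shape
    S = X.T @ X / n
    _, logdet = np.linalg.slogdet(omega_hat)
    score = n * (logdet - np.trace(S @ omega_hat))
    omega_inv = np.linalg.inv(omega_hat)
    for i in range(n):
        xi = X[i][:, None]
        S_minus_i = (n * S - xi @ xi.T) / (n - 1)   # S^{(-i)} = sum_{j != i} x_j x_j^T / (n-1)
        correction = omega_hat @ (S_minus_i - S) @ omega_hat
        score += np.sum((omega_inv - xi @ xi.T) * correction)  # vec(A)^T vec(B) = sum(A*B)
    return score
```

In practice one would evaluate this score on a grid of $\lambda$ values, each paired with its own fitted $\hat{\Omega}(\lambda)$, and select the maximizer, in the same way the KCV score above is used.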
The derivation of GACV is complicated by the multivariate nature of each left-out observation; the details are deferred to Appendix A. Perhaps surprisingly, even though GACV is motivated as an efficient approximation to LOOCV, it almost always performs better than LOOCV, and sometimes better than 10-fold CV, as demonstrated by the simulation studies in the next section.

In the classical model selection literature, it is well understood that CV, GCV and AIC share similar asymptotic properties. All these criteria are loss efficient but inconsistent for model selection (Shao, 1997; Yang, 2005). For penalized linear regression with the SCAD penalty, Wang et al. (2007) verified that GCV is not able to identify the true model consistently when it is used for tuning parameter selection. To address this problem, Wang et al. (2007) employed a variable selection criterion known to be consistent in the classical setting, namely BIC, as the tuning parameter selector and proved that the resulting tuning parameter identifies the true model consistently. A similar conclusion has been drawn for the adaptive Lasso (Wang and Leng, 2007). In Wang et al. (2009), the theory is further extended to linear regression problems
with a diverging number of parameters. In our context, with BIC, we select the optimal $\lambda$ by minimizing
$$\mathrm{BIC}(\lambda)=-\log|\hat{\Omega}(\lambda)|+\mathrm{Tr}(S\hat{\Omega}(\lambda))+k\,\frac{\log n}{n},$$
where $k$ is the number of nonzero entries in the upper diagonal portion of the estimated precision matrix $\hat{\Omega}(\lambda)$. We conjecture that the GACV proposed above is also not appropriate for model selection, although a formal proof seems elusive due to the complicated form of the GACV score. Nevertheless, we can extend the consistency proof for BIC using empirical process theory to show that the tuning parameter selected by BIC results in consistent model identification. The result is stated below, while the proof is relegated to Appendix B.

Theorem 1. Denote the optimal tuning parameter selected using BIC by $\hat{\lambda}_{BIC}$. If Conditions 1-3 in Appendix B hold, then the penalized likelihood estimator correctly identifies all the zero elements in the true precision matrix. That is,
$$\mathrm{pr}(\mathcal{S}_{\hat{\lambda}_{BIC}}=\mathcal{S}_T)\to 1,$$
where $\mathcal{S}_{\lambda}=\{(i,j):i\le j,\ \hat{\omega}_{ij}(\lambda)\ne 0\}$ is the set of nonzero entries above and including the diagonal in the estimated precision matrix and $\mathcal{S}_T$ is similarly defined as the set of nonzero entries in the true precision matrix $\Omega_0$.

The conditions imposed on the penalty function can be verified for both the SCAD penalty and the adaptive Lasso penalty. Thus for these two penalty functions, BIC identifies the correct model for the precision matrix.

3. Numerical results

3.1. Simulations

In this section we compare the performance of different tuning parameter selectors for the Lasso, adaptive Lasso and SCAD penalty estimators in Gaussian graphical models. For the adaptive Lasso penalty with $\lambda_{ij}=\lambda/|\tilde{\omega}_{ij}|^{\gamma}$, we use the Lasso estimator as the initial consistent estimator and set $\gamma=0.5$, following Fan et al. (2009). Besides CV (i.e., LOOCV), KCV (10-fold), GACV and BIC, we also use AIC as a tuning parameter selector, defined by
$$\mathrm{AIC}(\lambda)=-\log|\hat{\Omega}(\lambda)|+\mathrm{Tr}(S\hat{\Omega}(\lambda))+2k/n,$$
where $k$ is the number of nonzero entries in the upper diagonal part of $\hat{\Omega}(\lambda)$.

We use three examples with different covariance structures in our simulation. The first one has a tridiagonal precision matrix:
$$\Omega_0:\quad \omega^0_{ii}=1,\qquad \omega^0_{i,i-1}=\omega^0_{i,i+1}=0.5.$$
The second one has a dense precision matrix with exponential decay:
$$\Omega_0:\quad \omega^0_{ij}=0.5^{|i-j|},$$
where no entries are exactly zero but many are so small that penalization is expected to reduce the variability of the estimators. The third one has a more general sparse precision matrix. For each off-diagonal element $\omega^0_{ij}$, $i<j$, of $\Omega_0$, with probability 0.8 we set it to zero; otherwise we sample a value for $\omega^0_{ij}$ from a uniform distribution on $[-1,-0.5]\cup[0.5,1]$. Each diagonal entry is set to twice the sum of the absolute values of the corresponding row elements, excluding the diagonal entry itself.

For each example, we use $n$ i.i.d. random vectors generated from a multivariate Gaussian distribution $N(0,\Omega_0^{-1})$ with dimension $p=20$. We consider both $n=50$ and $n=100$. The errors are calculated over 500 simulated datasets in each example. To compare the performance of different selectors, we use the Frobenius norm $\|\Omega_0-\hat{\Omega}\|_F$ as well as the entropy loss (Yuan and Lin, 2007), defined by
$$\mathrm{Entropy}(\Omega_0,\hat{\Omega})=-\log|\Omega_0^{-1}\hat{\Omega}|+\mathrm{Tr}(\Omega_0^{-1}\hat{\Omega})-p.$$
For the performance in terms of sparsity of the matrix, we use false positives (the number of zero entries in the true precision matrix identified as nonzero) and false negatives (the number of nonzero entries in the true precision matrix identified as zero).

The results for the Gaussian models are summarized in Tables 1-3 for the Lasso penalty, the adaptive Lasso penalty and the SCAD penalty, respectively, which give the average losses and the average numbers of false positives and false negatives for each case, together with the corresponding standard errors. For sparse precision matrices, the BIC approach outperforms LOOCV, KCV and AIC in terms of the two loss measures as well as the sum of false positives and false negatives. For dense matrices, although the number of false negatives is generally larger than for the other selectors, the performance of BIC in terms of loss is still superior. That the BIC-selected tuning parameter results in a larger number of false negatives is expected, since BIC will in general select a larger tuning parameter and thus more zeros in the estimated precision matrix (Shao, 1997; Yang, 2005). Based on our simulations, the tuning parameter selected by AIC generally performs the worst. Finally, perhaps surprisingly, the performance of GACV is almost always better than LOOCV, and in many cases also better than KCV. The reader should note that GACV can be computed much faster than either LOOCV or 10-fold CV. As an illustration, for our first example using the Lasso penalty, with $n=100$ and 500 simulated datasets, the computation time for GACV is 43 s, BIC 18 s, AIC 18 s, KCV 376 s, and LOOCV more than 1 h.
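The following hedged sketch reproduces the three simulation designs and the evaluation criteria in numpy (our own reading of the formulas above; the penalized fit of $\hat{\Omega}(\lambda)$ for each candidate $\lambda$ is again not shown), including the BIC and AIC scores used as selectors.

```python
import numpy as np

def tridiagonal_precision(p):
    """Example 1: omega_ii = 1, omega_{i,i-1} = omega_{i,i+1} = 0.5."""
    omega = np.eye(p)
    idx = np.arange(p - 1)
    omega[idx, idx + 1] = omega[idx + 1, idx] = 0.5
    return omega

def exp_decay_precision(p):
    """Example 2: omega_ij = 0.5^{|i-j|} (dense, no exact zeros)."""
    i, j = np.indices((p, p))
    return 0.5 ** np.abs(i - j)

def random_sparse_precision(p, rng):
    """Example 3: off-diagonals zero w.p. 0.8, otherwise uniform on [-1,-0.5] U [0.5,1];
    diagonal entries set to twice the sum of absolute off-diagonal row entries."""
    omega = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() > 0.8:
                omega[i, j] = omega[j, i] = rng.uniform(0.5, 1.0) * rng.choice([-1, 1])
    d = 2 * np.abs(omega).sum(axis=1)
    d[d == 0] = 1.0              # guard for isolated nodes so the matrix stays invertible (our addition)
    np.fill_diagonal(omega, d)
    return omega

def frobenius_loss(omega0, omega_hat):
    return np.linalg.norm(omega0 - omega_hat, "fro")

def entropy_loss(omega0, omega_hat):
    """Entropy loss: -log|Omega0^{-1} Omega_hat| + Tr(Omega0^{-1} Omega_hat) - p."""
    p = omega0.shape[0]
    A = np.linalg.solve(omega0, omega_hat)
    _, logdet = np.linalg.slogdet(A)
    return -logdet + np.trace(A) - p

def bic_score(omega_hat, S, n):
    """BIC(lambda) = -log|Omega| + Tr(S Omega) + k log(n)/n, k = nonzeros on/above the diagonal."""
    _, logdet = np.linalg.slogdet(omega_hat)
    k = np.count_nonzero(np.triu(omega_hat))
    return -logdet + np.trace(S @ omega_hat) + k * np.log(n) / n

def aic_score(omega_hat, S, n):
    """AIC(lambda) = -log|Omega| + Tr(S Omega) + 2k/n."""
    _, logdet = np.linalg.slogdet(omega_hat)
    k = np.count_nonzero(np.triu(omega_hat))
    return -logdet + np.trace(S @ omega_hat) + 2 * k / n

# Draw n = 100 observations from N(0, Omega0^{-1}) with p = 20, as in the simulations.
rng = np.random.default_rng(1)
omega0 = tridiagonal_precision(20)
X = rng.multivariate_normal(np.zeros(20), np.linalg.inv(omega0), size=100)
S = X.T @ X / 100
```

Given a fitted $\hat{\Omega}(\lambda)$ for each $\lambda$ on a grid (not shown), the selectors simply pick the $\lambda$ minimizing bic_score or aic_score, or maximizing the CV-type scores.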
Table 1. Simulation results for Lasso estimators using five different tuning parameter selectors. The reported average errors are based on 500 simulated datasets; the numbers in brackets are the corresponding standard errors.

                        BIC              AIC               CV                KCV               GACV
Tridiagonal, n = 100
  Frobenius             7.02 (1.44)      17.40 (7.60)      7.11 (1.24)       7.21 (1.38)       6.37 (1.12)
  Entropy               0.80 (0.17)      1.85 (0.51)       0.88 (0.14)       0.89 (0.15)       0.90 (0.15)
  FP                    98.36 (35.46)    205.97 (32.49)    164.93 (16.80)    153.17 (19.09)    149.61 (32.08)
  FN                    2.63 (2.60)      0.16 (0.55)       2.90 (2.60)       3.06 (2.67)       0.98 (1.34)
Tridiagonal, n = 50
  Frobenius             11.39 (2.93)     70.03 (35.30)     12.36 (2.02)      12.35 (1.99)      12.03 (5.68)
  Entropy               1.38 (0.35)      7.29 (2.72)       1.41 (0.19)       1.47 (0.17)       1.45 (0.49)
  FP                    109.51 (51.95)   212.61 (25.04)    190.37 (33.86)    170.42 (35.81)    171.36 (35.87)
  FN                    7.84 (5.55)      0.77 (1.56)       6.15 (2.52)       6.21 (3.00)       4.53 (3.47)
Dense, n = 100
  Frobenius             2.08 (0.50)      5.31 (2.36)       2.94 (0.38)       2.08 (0.54)       1.84 (0.24)
  Entropy               0.93 (0.17)      2.05 (0.5)        0.97 (0.12)       0.94 (0.17)       0.94 (0.16)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    258.24 (51.63)   54.38 (30.45)     243.64 (37.94)    257.50 (42.15)    190.36 (39.54)
Dense, n = 50
  Frobenius             3.32 (1.03)      27.97 (13.34)     3.57 (0.75)       4.21 (0.23)       3.68 (1.38)
  Entropy               1.48 (0.38)      6.58 (1.79)       1.56 (0.26)       1.75 (0.15)       1.55 (0.54)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    256.57 (76.06)   34.42 (32.89)     224.57 (42.33)    256.85 (9.63)     170.43 (44.12)
Random sparse, n = 100
  Frobenius             25.52 (8.84)     52.63 (29.36)     27.39 (7.24)      26.19 (8.32)      22.69 (6.42)
  Entropy               0.87 (0.1)       1.89 (0.53)       0.91 (0.11)       0.89 (0.12)       0.87 (0.13)
  FP                    60.56 (31.16)    156.34 (27.77)    91.53 (27.11)     83.16 (23.16)     131.10 (37.72)
  FN                    47.35 (13.74)    8.40 (6.23)       44.59 (11.55)     45.34 (12.61)     30.81 (12.21)
Random sparse, n = 50
  Frobenius             33.42 (17.80)    93.59 (46.31)     34.41 (9.68)      34.69 (10.24)     34.74 (17.53)
  Entropy               1.36 (0.49)      5.38 (2.27)       1.52 (0.17)       1.51 (0.17)       1.55 (0.56)
  FP                    73.19 (48.83)    277.11 (27.08)    95.40 (18.37)     94.89 (19.13)     139.02 (50.53)
  FN                    58.17 (14.66)    6.68 (5.85)       54.89 (11.00)     55.36 (11.19)     35.83 (13.43)
With $p=20$, the number of parameters is $20\times 21/2=210$, already much larger than the sample size. We also ran some simulations with $p=50$. With this larger $p$, the simulation is much slower than with $p=20$, so we only performed simulations using KCV, GACV and BIC, with the adaptive Lasso penalty, and only used the first example. The results are reported in Table 4. For $p=50$ and $n=100$, for example, BIC takes only 80 s and GACV 192 s, while 10-fold CV takes about half an hour.
Table 2. Simulation results for adaptive Lasso estimators using five different tuning parameter selectors.

                        BIC              AIC               CV                KCV               GACV
Tridiagonal, n = 100
  Frobenius             5.74 (1.64)      22.49 (10.01)     7.24 (2.38)       7.11 (2.21)       6.31 (2.40)
  Entropy               0.78 (0.18)      1.88 (0.57)       0.91 (0.23)       0.90 (0.21)       0.82 (0.24)
  FP                    58.08 (22.60)    197.77 (55.73)    94.63 (15.33)     92.63 (14.26)     73.01 (23.06)
  FN                    2.54 (2.79)      0.41 (0.88)       0.82 (1.19)       0.87 (1.25)       1.45 (1.77)
Tridiagonal, n = 50
  Frobenius             13.45 (6.81)     95.56 (47.25)     20.21 (9.08)      17.97 (7.11)      13.94 (6.06)
  Entropy               1.65 (0.49)      5.42 (1.64)       2.13 (0.58)       1.99 (0.47)       1.69 (0.44)
  FP                    65.21 (26.98)    277.63 (57.21)    104.52 (20.21)    99.21 (19.15)     71.52 (26.92)
  FN                    7.15 (5.33)      2.21 (3.07)       4.42 (3.39)       4.68 (3.58)       6.73 (5.18)
Dense, n = 100
  Frobenius             1.63 (0.64)      6.25 (2.76)       2.14 (0.62)       1.91 (0.52)       1.75 (0.44)
  Entropy               0.79 (0.17)      1.98 (0.57)       0.99 (0.20)       0.92 (0.18)       0.85 (0.17)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    285.72 (29.37)   133.80 (54.91)    233.08 (20.58)    243.84 (19.12)    266.06 (28.14)
Dense, n = 50
  Frobenius             4.29 (3.00)      29.63 (14.67)     5.34 (1.56)       5.27 (1.68)       4.04 (1.85)
  Entropy               1.84 (0.78)      5.94 (1.54)       2.26 (0.37)       2.23 (0.42)       1.82 (0.54)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    267.14 (41.44)   106.85 (49.48)    226.00 (16.11)    227.57 (19.02)    264.71 (40.81)
General sparse, n = 100
  Frobenius             19.35 (5.37)     55.67 (26.73)     25.06 (9.55)      23.87 (8.56)      20.13 (6.29)
  Entropy               0.77 (0.13)      1.88 (0.45)       1.03 (0.21)       0.99 (0.20)       0.82 (0.17)
  FP                    29.38 (19.88)    215.02 (45.96)    85.22 (18.72)     79.18 (21.08)     44.22 (24.29)
  FN                    54.65 (11.53)    17.20 (11.42)     44.26 (8.62)      45.43 (9.28)      51.36 (10.96)
General sparse, n = 50
  Frobenius             34.35 (19.80)    105.55 (59.75)    50.31 (14.94)     46.18 (19.53)     39.06 (31.00)
  Entropy               1.40 (0.48)      4.61 (1.33)       1.97 (0.36)       1.81 (0.46)       1.54 (0.69)
  FP                    49.11 (28.73)    212.26 (47.45)    94.42 (15.51)     83.44 (25.35)     58.26 (38.58)
  FN                    54.34 (11.19)    21.53 (10.21)     45.11 (8.82)      47.91 (11.06)     52.72 (11.86)
3.2. A real dataset

Here we apply some of the tuning parameter selection methods to the ALL dataset (Chiaretti et al., 2004) as an illustration. The data represent 79 samples from patients with B-cell acute lymphoblastic leukemia, hybridized on HG-U95Av2 Affymetrix GeneChip arrays with a total of 12,625 probe sets. We use 37 BCR/ABL samples and 42 normal samples in our analysis.
Table 3. Simulation results for SCAD estimators using five different tuning parameter selectors.

                        BIC              AIC               CV                KCV               GACV
Tridiagonal, n = 100
  Frobenius             5.84 (4.06)      22.80 (7.20)      10.29 (3.29)      8.02 (2.83)       6.07 (5.13)
  Entropy               0.90 (0.26)      2.20 (0.44)       1.25 (0.32)       1.06 (0.31)       1.10 (0.41)
  FP                    64.08 (20.58)    158.79 (26.89)    99.45 (22.97)     61.23 (20.59)     84.93 (26.51)
  FN                    3.78 (2.93)      1.32 (1.38)       3.11 (2.94)       3.28 (1.58)       3.31 (2.54)
Tridiagonal, n = 50
  Frobenius             12.39 (8.76)     62.83 (36.70)     18.95 (11.29)     15.38 (4.24)      18.39 (11.22)
  Entropy               1.71 (1.41)      6.55 (1.71)       2.18 (0.76)       1.94 (0.38)       2.17 (1.74)
  FP                    65.48 (23.53)    180.72 (27.83)    97.79 (23.24)     81.38 (15.75)     89.23 (25.80)
  FN                    8.15 (4.45)      4.76 (2.85)       4.85 (5.66)       7.69 (7.16)       7.84 (4.21)
Dense, n = 100
  Frobenius             1.68 (0.78)      6.75 (2.32)       1.94 (0.23)       1.92 (0.23)       1.72 (1.25)
  Entropy               0.82 (0.20)      2.37 (0.54)       0.91 (0.12)       0.89 (0.12)       0.89 (0.23)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    268.78 (51.99)   112.72 (30.07)    203.94 (31.79)    202.06 (21.59)    212.92 (46.41)
Dense, n = 50
  Frobenius             2.64 (1.87)      29.53 (10.51)     3.04 (0.52)       3.34 (0.57)       3.11 (2.30)
  Entropy               1.23 (0.83)      7.03 (1.29)       1.48 (0.19)       1.57 (0.19)       1.37 (1.32)
  FP                    0 (0)            0 (0)             0 (0)             0 (0)             0 (0)
  FN                    244.80 (81.30)   103.87 (12.99)    219.14 (32.97)    237.14 (32.97)    221.14 (65.51)
General sparse, n = 100
  Frobenius             19.99 (6.69)     72.61 (26.93)     30.87 (17.37)     21.63 (6.90)      20.68 (8.05)
  Entropy               0.78 (0.31)      2.41 (0.39)       1.26 (0.38)       0.92 (0.24)       0.95 (0.27)
  FP                    73.42 (27.41)    155.89 (24.76)    109.65 (21.94)    81.25 (17.53)     84.29 (23.37)
  FN                    58.10 (12.25)    26.51 (9.12)      49.84 (11.46)     50.07 (12.54)     41.29 (11.64)
General sparse, n = 50
  Frobenius             34.58 (25.36)    133.92 (86.84)    45.50 (15.84)     39.37 (14.19)     39.85 (22.57)
  Entropy               1.06 (0.45)      5.43 (1.65)       2.04 (0.68)       1.24 (0.52)       1.31 (10.41)
  FP                    67.75 (26.64)    167.04 (36.23)    102.08 (23.48)    83.32 (15.30)     79.14 (25.94)
  FN                    53.85 (10.29)    29.95 (10.65)     38.51 (11.05)     44.34 (9.37)      44.00 (9.25)
We first filter out probe sets with expression measurements below 100 in at least 60 of the samples, or with an interquartile range across the samples below 0.5 on the log base 2 scale, leaving 2391 probe sets for subsequent testing. P-values for each probe set are then calculated using mt.maxT in the R package multtest. We use the 10 probe sets with the smallest p-values together with another 10 selected at random, so that $n=79$ and $p=20$. The graphical structures for the 20 probe sets, estimated using the adaptive Lasso penalty with the tuning parameter selected by GACV and BIC respectively, are shown in Fig. 1. The result obtained from KCV is visually very similar to that obtained by GACV and is thus omitted here.
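For concreteness, the filtering rule can be expressed as the following sketch (the original analysis was carried out in R, with p-values from mt.maxT in the multtest package; the Python code and the expression matrix expr below are purely illustrative assumptions).

```python
import numpy as np

# Hypothetical expression matrix: rows = probe sets, columns = samples (raw intensity scale).
# Filtering rule: drop probe sets with expression below 100 in at least 60 of the 79 samples,
# or with an interquartile range below 0.5 on the log2 scale.
def filter_probe_sets(expr, low=100, low_count=60, iqr_min=0.5):
    below = (expr < low).sum(axis=1)
    log_expr = np.log2(expr)
    iqr = np.percentile(log_expr, 75, axis=1) - np.percentile(log_expr, 25, axis=1)
    keep = (below < low_count) & (iqr >= iqr_min)
    return expr[keep], keep

rng = np.random.default_rng(2)
expr = rng.lognormal(mean=6, sigma=1, size=(12625, 79))   # stand-in for the ALL arrays
filtered, keep = filter_probe_sets(expr)
print(filtered.shape)
```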
Table 4. Simulation results for adaptive Lasso estimators using three different tuning parameter selectors with p = 50.

                        BIC              KCV               GACV
Tridiagonal, n = 100
  Frobenius             9.24 (2.93)      13.11 (3.18)      13.25 (3.02)
  Entropy               1.26 (0.34)      1.41 (0.38)       1.33 (0.38)
  FP                    59.76 (25.96)    114.64 (31.06)    112.37 (25.79)
  FN                    3.28 (3.22)      2.18 (3.17)       2.20 (2.85)
Tridiagonal, n = 50
  Frobenius             32.86 (16.92)    50.33 (20.78)     44.92 (17.32)
  Entropy               1.98 (0.82)      2.35 (1.16)       2.36 (1.00)
  FP                    82.45 (34.47)    102.57 (49.51)    93.48 (38.68)
  FN                    12.33 (7.42)     8.74 (5.26)       9.78 (5.39)
[Fig. 1 comprises two network diagrams over the 20 selected probe sets; the node labels (genes represented by the probe sets) are ABL1 (several probe sets), PSEN1, RGS19, CSNK2B, KLF9, RFTN1, AHNAK, FDPS, FYN, ZNF467, WTAP, TUBA4A, CSNK1A1, ECOP, CASP8, MAN2A2, USP7 and MX1.]

Fig. 1. Estimated network structure using Gaussian graphical models with the adaptive Lasso penalty. (a) Tuning parameter chosen by GACV. (b) Tuning parameter chosen by BIC.
The genes represented by each probe set are also shown (one gene can be represented by multiple probe sets on the array). From the figure, we see that the network estimated by BIC is much sparser than that estimated by GACV (due to the larger tuning parameter selected). In the figure, the 10 probe sets with the smallest p-values are those from ABL1 to MX1, moving clockwise. Based on the network constructed using BIC, it is clear that these 10 probe sets in general have more connections with other genes, while such structure is not easy to discern from the denser network estimated by GACV. As a final note, for such applications it can make sense to estimate the network at a desired, a priori determined level of sparsity, even though BIC provides an automatic choice of the tuning parameter.

4. Concluding remarks

In this paper we compared several approaches for tuning parameter selection in penalized Gaussian graphical models. As an approximation to leave-one-out cross-validation, we derived the generalized approximate cross-validation in the current context, which is much faster to compute. Simulations show that GACV even outperforms the leave-one-out version. For model identification, we employ BIC for tuning parameter selection and prove its consistency. In our simulations with sparse matrices, or dense matrices with many small entries, the tuning parameter selected by BIC clearly performs better than all other approaches.
Acknowledgement

The author thanks the anonymous reviewer whose comments and suggestions led to an improvement of the paper; in particular, the real data analysis was added at the reviewer's suggestion. This research is supported by Singapore Ministry of Education Tier 1 Grant 36/09.

Appendix A. Derivation of GACV

Denote the log-likelihood function by
$$L(S,\Omega)=\log|\Omega|-\mathrm{Tr}(S\Omega).$$
In this section only, to simplify notation, the shrinkage estimator $\hat{\Omega}(\lambda)$ is simply denoted by $\Omega$ and similarly $\hat{\Omega}^{(-i)}(\lambda)$ is denoted by $\Omega^{(-i)}$; it is implicitly understood that the estimators depend on a fixed $\lambda$. Let $X_i=x_i x_i^T$ be the covariance matrix based on a single observation, so that $S=\sum_{i=1}^{n}X_i/n$. The LOOCV score is defined by
$$\mathrm{CV}(\lambda)=\sum_{i=1}^{n}L(X_i,\Omega^{(-i)})=\sum_{i=1}^{n}\Bigl(\log|\Omega^{(-i)}|-\mathrm{Tr}(X_i\Omega^{(-i)})\Bigr)=nL(S,\Omega)+\sum_{i=1}^{n}\bigl(L(X_i,\Omega^{(-i)})-L(X_i,\Omega)\bigr)\approx nL(S,\Omega)+\sum_{i=1}^{n}\Bigl(\frac{dL(X_i,\Omega)}{d\Omega}\Bigr)^{T}d\Omega,$$
where we interpret $dL(X_i,\Omega)/d\Omega=dL(X_i,\Omega)/d\mathrm{vec}(\Omega)$ as a $p^2$-dimensional column vector of partial derivatives, and similarly $d\Omega=\mathrm{vec}(\Omega^{(-i)}-\Omega)$. Here, as well as in the following, as in the definition of generalized degrees of freedom in Fan and Li (2001) and Wang et al. (2007), the partial derivatives corresponding to the zero elements in $\Omega$ are ignored. Using matrix calculus, as presented for instance in Bishop (2006), we have
$$\frac{dL(X_i,\Omega)}{d\Omega}=\mathrm{vec}(\Omega^{-1}-X_i),$$
so we only need to deal with the term $d\Omega=\mathrm{vec}(\Omega^{(-i)}-\Omega)$.

Denote the penalized log-likelihood based on the sufficient statistic $S$ by $\bar{L}(S,\Omega)=L(S,\Omega)-\sum_{i,j}p_{\lambda_{ij}}(|\omega_{ij}|)$. A Taylor expansion gives
$$0=\frac{d\bar{L}(S^{(-i)},\Omega^{(-i)})}{d\Omega}\approx\frac{d\bar{L}(S,\Omega)}{d\Omega}+\frac{d^{2}\bar{L}(S,\Omega)}{d\Omega^{2}}\,d\Omega+\frac{d^{2}\bar{L}(S,\Omega)}{d\Omega\,dS}\,dS,$$
where $d^{2}\bar{L}(S,\Omega)/d\Omega^{2}=d\bigl(d\bar{L}(S,\Omega)/d\mathrm{vec}(\Omega)\bigr)/d\mathrm{vec}(\Omega)$ is the $p^{2}\times p^{2}$ Hessian matrix, and $d^{2}\bar{L}(S,\Omega)/d\Omega\,dS$ is defined similarly. As before, $d\Omega=\mathrm{vec}(\Omega^{(-i)}-\Omega)$ and $dS=\mathrm{vec}(S^{(-i)}-S)$ denote the vectorized differences. Since $d\bar{L}(S,\Omega)/d\Omega=0$, it immediately follows that
$$d\Omega=-\Bigl(\frac{d^{2}\bar{L}(S,\Omega)}{d\Omega^{2}}\Bigr)^{-1}\frac{d^{2}\bar{L}(S,\Omega)}{d\Omega\,dS}\,dS.$$
From matrix calculus we have $d^{2}\bar{L}(S,\Omega)/d\Omega\,dS=-I_{p^{2}\times p^{2}}$ and $d^{2}\bar{L}(S,\Omega)/d\Omega^{2}=-(\Omega^{-1}\otimes\Omega^{-1}+D)$, where $D$ is a diagonal matrix with diagonal elements $p''_{\lambda_{ij}}(|\omega_{ij}|)$. Thus we have the approximation $d\Omega\approx[d^{2}\bar{L}(S,\Omega)/d\Omega^{2}]^{-1}\,dS$. Even for moderate $p$, inversion of this $p^{2}\times p^{2}$ matrix is computationally infeasible. However, note that typically we consider only the situation with $\lambda=o(1)$ and $p''_{\lambda_{ij}}(|\omega_{ij}|)=o(1)$ (for example, the second derivative of the SCAD penalty function is exactly zero for nonzero elements). Thus we can approximate $(\Omega^{-1}\otimes\Omega^{-1}+D)^{-1}$ by $(\Omega^{-1}\otimes\Omega^{-1})^{-1}=\Omega\otimes\Omega$, and
$$d\Omega\approx-(\Omega\otimes\Omega)\,dS=-\mathrm{vec}\bigl(\Omega(S^{(-i)}-S)\Omega\bigr),$$
which involves only $p\times p$ matrices, and no inversion of matrices is required.
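The computational shortcut in this final step, replacing the $p^{2}\times p^{2}$ inversion by $p\times p$ operations, rests on the Kronecker identities $(\Omega^{-1}\otimes\Omega^{-1})^{-1}=\Omega\otimes\Omega$ and $(\Omega\otimes\Omega)\mathrm{vec}(A)=\mathrm{vec}(\Omega A\Omega)$ for symmetric $\Omega$, with vec stacking columns. A small numerical check of these identities (our own illustration, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
B = rng.standard_normal((p, p))
omega = B @ B.T + p * np.eye(p)     # a symmetric positive-definite "precision" matrix
A = rng.standard_normal((p, p))
A = (A + A.T) / 2                   # symmetric perturbation, playing the role of S^(-i) - S

omega_inv = np.linalg.inv(omega)
lhs = np.linalg.inv(np.kron(omega_inv, omega_inv))   # (Omega^{-1} (x) Omega^{-1})^{-1}
rhs = np.kron(omega, omega)                          # Omega (x) Omega
print(np.allclose(lhs, rhs))                         # True

# (Omega (x) Omega) vec(A) = vec(Omega A Omega), with vec stacking columns (Fortran order)
vec = lambda M: M.flatten(order="F")
print(np.allclose(np.kron(omega, omega) @ vec(A), vec(omega @ A @ omega)))   # True
```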
Appendix B. Proof of Theorem 1

We only need to assume general conditions on the penalty function that guarantee the oracle property of the estimator with an appropriately chosen tuning parameter. In particular, we assume:

Condition 1. $\max\{|p''_{\lambda_n}(|\omega^0_{ij}|)|:\omega^0_{ij}\ne 0\}\to 0$.

Condition 2. $\liminf_{n\to\infty}\liminf_{x\to 0+}p'_{\lambda_n}(x)/\lambda_n>0$.

Condition 3. The (theoretically) optimal tuning parameter satisfies $\lambda_n\to 0$ and $\sqrt{n}\,\lambda_n\to\infty$.

For an arbitrary model $\mathcal{S}$, specified by the constraint that only some of the elements in the precision matrix can be nonzero, i.e. $\mathcal{S}\subseteq\{(i,j):i\le j\}$ is the set of elements not constrained to be zero, denote by $L_{\mathcal{S}}$ the value of the constrained maximized likelihood with infinite data: $L_{\mathcal{S}}=\max_{\Omega}E\bigl(\log|\Omega|-\mathrm{Tr}(X\Omega)\bigr)$, where the maximization is performed over $\Omega$ with zero entries for all $(i,j)\notin\mathcal{S}$. We partition $\Lambda=[0,\infty)$ into three parts: $\Lambda_0=\{\lambda\in\Lambda:\mathcal{S}_\lambda=\mathcal{S}_T\}$, $\Lambda_-=\{\lambda\in\Lambda:\mathcal{S}_\lambda\nsupseteq\mathcal{S}_T\}$ and $\Lambda_+=\{\lambda\in\Lambda:\mathcal{S}_\lambda\supsetneq\mathcal{S}_T\}$, where $\mathcal{S}_\lambda$ is the model identified by the estimator when $\lambda$ is used as the tuning parameter, and $\mathcal{S}_T=\{(i,j):i\le j,\ \omega^0_{ij}\ne 0\}$ is the true model. We will prove separately that the under-fitting probability and the over-fitting probability are both negligible.

Bounding the under-fitting probability. If $\mathcal{S}_\lambda\nsupseteq\mathcal{S}_T$, we have
$$\mathrm{BIC}(\lambda)=-L(S,\hat{\Omega}(\lambda))+|\mathcal{S}_\lambda|\frac{\log n}{n}\ \ge\ -L(S,\hat{\Omega}_{\mathcal{S}_\lambda})+|\mathcal{S}_\lambda|\frac{\log n}{n}\ \stackrel{P}{\to}\ -L_{\mathcal{S}_\lambda},$$
where $\hat{\Omega}_{\mathcal{S}}$ is the unpenalized maximum likelihood estimator based on model $\mathcal{S}$. The convergence above follows from the uniform convergence of the empirical distribution, since the class of log-likelihood functions is Glivenko-Cantelli. Similarly, with the optimal choice $\lambda_n$ satisfying Condition 3,
$$\mathrm{BIC}(\lambda_n)=-L(S,\hat{\Omega}(\lambda_n))+|\mathcal{S}_{\lambda_n}|\frac{\log n}{n}\ \stackrel{P}{\to}\ -L_{\mathcal{S}_T}.$$
Thus $\mathrm{pr}\{\inf_{\lambda\in\Lambda_-}\mathrm{BIC}(\lambda)>\mathrm{BIC}(\lambda_n)\}\to 1$, since $-L_{\mathcal{S}_T}<-L_{\mathcal{S}}$ when $\mathcal{S}\nsupseteq\mathcal{S}_T$.

Bounding the over-fitting probability. Now suppose $\lambda\in\Lambda_+$. Since
$$\mathrm{BIC}(\lambda_n)=-L(S,\hat{\Omega}(\lambda_n))+|\mathcal{S}_{\lambda_n}|\frac{\log n}{n}\qquad\text{and}\qquad\mathrm{BIC}(\lambda)=-L(S,\hat{\Omega}(\lambda))+|\mathcal{S}_\lambda|\frac{\log n}{n},$$
with $|\mathcal{S}_\lambda|>|\mathcal{S}_{\lambda_n}|=|\mathcal{S}_T|$ (with probability tending to one), we will get $\mathrm{pr}(\inf_{\lambda\in\Lambda_+}\mathrm{BIC}(\lambda)>\mathrm{BIC}(\lambda_n))\to 1$ if it can be shown that $L(S,\hat{\Omega}(\lambda))-L(S,\hat{\Omega}(\lambda_n))=O_P(1/n)$.

We use the usual notation for the sample mean, $P_n L(X,\Omega)=\sum_{i=1}^{n}L(X_i,\Omega)/n=L(S,\Omega)$, and use $PL(X,\Omega)$ to denote the corresponding population mean for a fixed precision matrix $\Omega$. Let $\mathbb{G}_n=\sqrt{n}(P_n-P)$ be the empirical process. We write
$$L(S,\hat{\Omega}(\lambda))-L(S,\hat{\Omega}(\lambda_n))\ \le\ P_n L(X,\hat{\Omega}_{\mathcal{S}_\lambda})-P_n L(X,\hat{\Omega}(\lambda_n))
= P_n L(X,\hat{\Omega}_{\mathcal{S}_\lambda})-PL(X,\hat{\Omega}_{\mathcal{S}_\lambda})-\bigl(P_n L(X,\hat{\Omega}(\lambda_n))-PL(X,\hat{\Omega}(\lambda_n))\bigr)+PL(X,\hat{\Omega}_{\mathcal{S}_\lambda})-PL(X,\Omega_0)-\bigl(PL(X,\hat{\Omega}(\lambda_n))-PL(X,\Omega_0)\bigr),$$
where $\Omega_0$ is the true precision matrix. Define $M_1(X)=\sqrt{n}\bigl(L(X,\hat{\Omega}_{\mathcal{S}_\lambda})-L(X,\Omega_0)\bigr)$ and $M_2(X)=\sqrt{n}\bigl(L(X,\hat{\Omega}(\lambda_n))-L(X,\Omega_0)\bigr)$; then the previous display can be written as
$$L(S,\hat{\Omega}(\lambda))-L(S,\hat{\Omega}(\lambda_n))\ \le\ \frac{1}{n}\mathbb{G}_n M_1(X)-\frac{1}{n}\mathbb{G}_n M_2(X)+\bigl(PL(X,\hat{\Omega}_{\mathcal{S}_\lambda})-PL(X,\Omega_0)\bigr)-\bigl(PL(X,\hat{\Omega}(\lambda_n))-PL(X,\Omega_0)\bigr)=:(\mathrm{I})+(\mathrm{II})+(\mathrm{III})+(\mathrm{IV}).$$

We first bound (I) from above. Classical maximum likelihood theory tells us that $H_n:=\sqrt{n}(\hat{\Omega}_{\mathcal{S}_\lambda}-\Omega_0)$ is asymptotically normal (for the non-constrained elements of the matrix). Applying Lemma 19.31 in van der Vaart (1998), we have $\mathbb{G}_n\bigl(M_1(X)-H_n^{T}\,dL(X,\Omega_0)/d\Omega\bigr)\stackrel{P}{\to}0$, and then it is easily seen that $\mathbb{G}_n(M_1(X))=O_P(1)$. Similarly, given that Fan et al. (2009) have shown that $\hat{\Omega}(\lambda_n)$ is also asymptotically normal, we have $\mathbb{G}_n M_2(X)=O_P(1)$. For (III) and (IV), a simple Taylor expansion around $\Omega_0$ yields that both terms are of order $O_P(1/n)$.

References

Bickel, P.J., Levina, E., 2008. Covariance regularization by thresholding. Annals of Statistics 36, 2577-2604.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer Science+Business Media, New York, NY.
Chiaretti, S., Li, X., Gentleman, R., Vitale, A., Vignetti, M., Mandelli, F., Ritz, J., Foa, R., 2004. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103, 2771-2778.
Craven, P., Wahba, G., 1979. Smoothing noisy data with spline functions. Numerische Mathematik 31, 377-403.
Dempster, A.P., 1972. Covariance selection. Biometrics 28, 157-175.
Dobra, A., West, M., 2004. Bayesian covariance selection. Working Paper, ISDS, Duke University.
Dong, X.A., Wahba, G., 1996. A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statistica Sinica 6, 675-692.
Fan, J.Q., Feng, Y., Wu, Y., 2009. Network exploration via the adaptive lasso and SCAD penalties. Annals of Applied Statistics 3, 521-541.
Fan, J.Q., Li, R.Z., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
Fan, J.Q., Peng, H., 2004. Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics 32, 928-961.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432-441.
Lam, C., Fan, J.Q., 2007. Sparsistency and rates of convergence in large covariance matrices estimation. arXiv preprint, available at http://www.citebase.org/abstract?id=oai:arXiv.org:0711.3933.
Li, H.Z., Gui, J., 2006. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7, 302-317.
Meinshausen, N., Buhlmann, P., 2006. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics 34, 1436-1462.
Shao, J., 1997. An asymptotic theory for linear model selection. Statistica Sinica 7, 221-242.
van der Vaart, A.W., 1998. Asymptotic Statistics. Cambridge University Press, Cambridge, UK.
Wang, H., Li, B., Leng, C.L., 2009. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B 71, 671-683.
Wang, H., Li, R., Tsai, C.L., 2007. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 94, 553-568.
Wang, H.S., Leng, C.L., 2007. Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102, 1039-1048.
Yang, Y.H., 2005. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92, 937-950.
Yuan, M., Lin, Y., 2007. Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19-35.
Zhao, P., Yu, B., 2006. On model selection consistency of Lasso. Journal of Machine Learning Research 7, 2541-2563.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.