Journal of Statistical Planning and Inference 139 (2009) 385 -- 395
Regularization through variable selection and conditional MLE with application to classification in high dimensions

Eitan Greenshtein^a, Junyong Park^b,∗, Guy Lebanon^c

^a Department of Statistical Science, Duke University, NC 27708-0251
^b Department of Mathematics and Statistics, University of Maryland, Baltimore County, MD 21250, USA
^c College of Computing, Georgia Institute of Technology, GA 30332, USA
ARTICLE INFO

Article history:
Received 30 January 2007
Received in revised form 15 February 2008
Accepted 10 April 2008
Available online 10 May 2008

Keywords:
Classification
High dimensions
Conditional MLE
Stein's unbiased estimator

ABSTRACT
It is often the case that high-dimensional data consist of only a few informative components. Standard statistical modeling and estimation in such a situation is prone to inaccuracies due to overfitting, unless regularization methods are practiced. In the context of classification, we propose a class of regularization methods through shrinkage estimators. The shrinkage is based on variable selection coupled with conditional maximum likelihood. Using Stein's unbiased estimator of the risk, we derive an estimator for the optimal shrinkage method within a certain class. A comparison is given between the optimal shrinkage method in the classification context and the optimal shrinkage method for estimating a mean vector under a squared loss. The latter problem has been studied extensively, but it seems that the results of those studies are not completely relevant for classification. We demonstrate and examine our method on simulated data and compare it to the feature annealed independence rule (FAIR) and to Fisher's rule.
1. Introduction

In this paper we consider the problem of finding a classifier for a response variable $Y \in \{-1, 1\}$ based on a high-dimensional vector $(X_1, \ldots, X_m) \in \mathbb{R}^m$ of explanatory variables. Here, by high dimensionality we mean $m \gg n$, where $n$ is the size of the training set. In particular we focus our attention on linear predictors for $Y$ of the form

\[
\hat{Y} = \operatorname{sign}\Bigl(\sum_{j=1}^{m} a_j X_j + a_0\Bigr), \qquad (1)
\]

where $a_0, a_1, \ldots, a_m$ are constants. Searching for a good linear classifier in high dimensions requires special care, as estimators are prone to overfitting. To avoid overfitting, we suggest a regularization that involves (a) variable/model selection, and (b) correction of the selection bias through conditional maximum likelihood. Specifically, when a model is selected at stage (a), its parameters are estimated at stage (b) by a conditional MLE, which takes the selection into account. The likelihood is based on assuming a multivariate normal distribution of the vector $(X_1, \ldots, X_m)$ conditioned on the value of $Y$, i.e., we assume $(X_j \mid Y = -1) \sim N(\mu_j, \sigma^2)$ and $(X_j \mid Y = 1) \sim N(\nu_j, \sigma^2)$ independently, $j = 1, \ldots, m$. We will assume that the variance $\sigma^2$ is known. Denote $\mu = (\mu_1, \ldots, \mu_m)$ and $\nu = (\nu_1, \ldots, \nu_m)$.
∗ Corresponding author. Tel.: +1 410 455 2407; fax: +1 410 455 1066. E-mail address: [email protected] (J. Park).
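Before continuing, a minimal Python sketch (ours, not part of the paper) makes the setup concrete: it simulates the two-class Gaussian model above and applies a linear rule of the form (1). The dimensions, means and weight vector below are placeholder choices; the difference-of-means direction and the midpoint intercept anticipate the optimal choice (2) derived below and are computed from the true parameters purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, sigma = 1000, 25, 1.0        # dimension, per-class sample size, known sd
mu = np.zeros(m)                   # class Y = -1 coordinate means
nu = np.zeros(m)                   # class Y = +1 coordinate means
nu[:50] = 1.5                      # only a few informative coordinates

X_neg = mu + sigma * rng.standard_normal((n, m))   # training sample, Y = -1
X_pos = nu + sigma * rng.standard_normal((n, m))   # training sample, Y = +1

# A linear rule of the form (1): Y_hat = sign(sum_j a_j X_j + a_0).
a = (nu - mu) / np.linalg.norm(nu - mu)            # a unit-norm direction
a0 = -0.5 * (a @ mu + a @ nu)                      # oracle midpoint intercept

def predict(X):
    return np.sign(X @ a + a0)

err = 0.5 * (np.mean(predict(X_pos) != 1) + np.mean(predict(X_neg) != -1))
print(f"empirical misclassification rate: {err:.3f}")
```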
In considering linear classifiers, when both m and n are large it is reasonable to assume normality by the central limit theorem. Due to Lindeberg's CLT, large m implies that $\sum a_j X_j$ will be close to normal when the $a_j$ are comparable in size, even if the individual $X_j$ are not normal. Large n implies that averages of independent $X_j^i$, $i = 1, \ldots, n$ (defined below as $Z_j$), are close to normal.

When searching for the values $a_1, \ldots, a_m$ underlying the linear classifier, we assume w.l.o.g. that $\|a\|^2 = \sum_{j=1}^m a_j^2 = 1$. In this case the optimal choice for $(a_1, \ldots, a_m)$ is the vector that maximizes $|\sum a_j \mu_j - \sum a_j \nu_j|$. Note that the optimal choice of $a_1, \ldots, a_m$ is the same regardless of the misclassification loss (the value of $a_0$ does depend on the loss). In order to see it, observe that $\sum a_j X_j \sim N(\sum a_j \nu_j, \sigma^2) \equiv N(\eta_2, \sigma^2)$ conditional on $Y = 1$, and $\sum a_j X_j \sim N(\sum a_j \mu_j, \sigma^2) \equiv N(\eta_1, \sigma^2)$ conditional on $Y = -1$; here $\eta_i$, $i = 1, 2$, are implicitly defined. Hence, an optimal choice of $a_1, \ldots, a_m$ is such that $V \equiv |\sum_j a_j \mu_j - \sum_j a_j \nu_j| = |\eta_1 - \eta_2|$ is maximized. Under a 0--1 loss, given any choice of $(a_1, \ldots, a_m)$, the corresponding optimal choice of $a_0$ is

\[
a_0 = -\frac{\eta_1 + \eta_2}{2},
\]

assuming w.l.o.g. that $\eta_1 < \eta_2$. A formal argument showing that the optimal $a_1, \ldots, a_m$ is the same regardless of the misclassification loss may be obtained using the theory of comparison of experiments, which implies that the experiment consisting of the distributions $N(\eta_1, \sigma^2)$ and $N(\eta_2, \sigma^2)$ dominates the experiment consisting of the distributions $N(\eta_1', \sigma^2)$ and $N(\eta_2', \sigma^2)$ if and only if $|\eta_1 - \eta_2| \geq |\eta_1' - \eta_2'|$. See Lehmann (1986, p. 86) for some basic theory on comparison of experiments and some additional references.

If $\mu$ and $\nu$ are known, the optimal choice for $a_1, \ldots, a_m$ is

\[
a_j^{\mathrm{opt}} = \frac{\delta_j}{\sqrt{D}}, \qquad (2)
\]

where $\delta = \mu - \nu$ and $D = \sum_j \delta_j^2$. In practice, however, $\mu$, $\nu$ as well as $D$ are unknown, which prevents us from using (2) to obtain the classifier. A natural (naive) alternative is to estimate $a_1, \ldots, a_m$ through maximum likelihood estimation. We now demonstrate the potential inefficiency of the naive MLE approach. For simplicity, we assume that the estimation is performed based on a training set of n positive and n negative examples. Subtracting the empirical means of the two n-size samples, we obtain a vector $Z = (Z_1, \ldots, Z_m)$ with
\[
Z_j \sim N\Bigl(\delta_j, \frac{2\sigma^2}{n}\Bigr) \equiv N(\delta_j, \tau^2), \qquad j = 1, \ldots, m. \qquad (3)
\]

Here

\[
Z_j = \frac{\sum_i X_{j,-1}^i - \sum_i X_{j,1}^i}{n},
\]

where $X_{j,k}^i$, $k = -1, 1$, denotes training observations from the classes labeled $-1$ and $1$, correspondingly. Note that $Z_j$ is the MLE for $\delta_j$. As a result, the MLE for $a_j^{\mathrm{opt}}$ is

\[
\hat{a}_j^{\mathrm{mle}} = \frac{Z_j}{\|Z\|}.
\]
To demonstrate the inefficiency of $\hat{a}^{\mathrm{mle}}$ we consider asymptotics where $n \to \infty$, $m = m(n)$ and $m \gg n$. Recalling that $D = \sum_j \delta_j^2$ and noticing that $Z_j = \delta_j + \epsilon_j$, where the $\epsilon_j \sim N(0, 2\sigma^2/n)$ are independent, the quantity

\[
V \equiv \sum_j \hat{a}_j^{\mathrm{mle}} \times \delta_j \equiv \sum_j \frac{Z_j \delta_j}{\|Z\|}
\]

becomes

\[
V = \frac{D + O_p(\sqrt{D/n})}{\sqrt{D + O_p(\sqrt{D/n}) + 2m/n + O_p(\sqrt{m}/n)}}.
\]

In the following asymptotics, let the variance $\sigma^2$ be fixed. For any fixed D, or more generally when $D = o(\sqrt{m/n})$, we have $V = o_p(1)$, which implies that asymptotically the corresponding classifier is equivalent to random guessing. This is because our classification is equivalent to testing $H_0: \sum a_j X_j \sim N(\eta_1, \sigma^2)$ versus $H_1: \sum a_j X_j \sim N(\eta_2, \sigma^2)$, when $|\eta_1 - \eta_2| = o(1)$. Furthermore, the value of V corresponding to the optimal choice $a^{\mathrm{opt}}$ is $\sqrt{D}$, which may be large.
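A short simulation (ours, not from the paper) illustrating this collapse: with $m \gg n$ and a fixed sparse signal, the V attained by the naive MLE is far below the oracle value $\sqrt{D}$. All constants are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n, sigma = 100_000, 25, 1.0
tau2 = 2 * sigma**2 / n            # variance of each Z_j

delta = np.zeros(m)
delta[:100] = 1.0                  # sparse signal: 100 nonzero mean differences
D = np.sum(delta**2)

Z = delta + np.sqrt(tau2) * rng.standard_normal(m)   # Z_j ~ N(delta_j, tau^2)

a_mle = Z / np.linalg.norm(Z)      # naive MLE direction
V_mle = a_mle @ delta              # V achieved by the naive MLE
V_oracle = np.sqrt(D)              # V achieved by the oracle a_j = delta_j / sqrt(D)

print(f"V (naive MLE): {V_mle:.2f},  V (oracle): {V_oracle:.2f}")
# With m = 1e5 and n = 25, ||Z||^2 is dominated by the noise term of order 2*sigma^2*m/n,
# so V_mle comes out far smaller than sqrt(D).
```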
The fact that one (i.e., an 'oracle') may potentially obtain a classifier with corresponding value $V = \sqrt{D}$, while the MLE yields a classifier with a corresponding $V = o_p(1)$, demonstrates the inefficiency of the naive MLE approach.

Note that $\hat{a}_j^{\mathrm{mle}}$, $j = 1, \ldots, m$, define a random classifier once $\hat{a}_0$ is determined. Consider the special choice for $\hat{a}_0$, denoted $A_0$, which minimizes the maximum among the two misclassification probabilities (i.e., $A_0$ is an equalizer). Then $A_0$ is the following function of $\hat{a}_j$, $j = 1, \ldots, m$, and of $\mu$ and $\nu$:

\[
A_0 = -\frac{\sum_j \hat{a}_j^{\mathrm{mle}} (\mu_j + \nu_j)}{2}.
\]
We summarize the above in the following proposition.

Proposition 1. Consider a sequence of problems with $m, n \to \infty$. Suppose $D = o(\sqrt{m/n})$. Consider the sequence of random classifiers of the form $\operatorname{sign}(\sum_{j=1}^{m} \hat{a}_j^{\mathrm{mle}} X_j + A_0)$. Then the corresponding sequence of misclassification probabilities converges in probability to 0.5.

We emphasize that the above choice of $A_0$ depends on the unknown parameters. However, any other choice would make the sequence of the maximum among the two misclassification probabilities even larger.

The above asymptotic consideration, where m increases and D is fixed, or more generally where D is small relative to $\sqrt{m/n}$, reflects a situation of sparsity in which most $Z_j$ have a mean close to zero. As a result most explanatory variables are nearly irrelevant for our classification task. We will elaborate on this point in the next section, but the sparse setup is the setup that motivates us. More sophisticated estimators of $a_j^{\mathrm{opt}}$ based on careful regularization will yield better classifiers, i.e., classifiers with larger values of V. Our result concerning the asymptotic equivalence between random guessing and the classifier obtained by estimating $a_j^{\mathrm{opt}}$ with the MLE is very much related to Theorem 1 of Fan and Fan (to appear), but was obtained independently.

Remark 1. In practice we may have sample sizes $n_1, n_2$, with $n_1 \neq n_2$, from the two groups. However, the procedure of choosing $(a_1, a_2, \ldots, a_m)$ depends only on the vector $Z = (Z_1, \ldots, Z_m)$. Our results are readily applied when replacing $2\sigma^2/n$ by $\sigma^2(1/n_1 + 1/n_2)$ in the variance of $Z_j$, i.e., $Z_j \sim N(\delta_j, \sigma^2(1/n_1 + 1/n_2))$. Furthermore, in many applications it is formally more accurate to consider $n_1$ and $n_2$ as random.

A popular method of regularization is model or variable selection. When no prior information is available, a natural model selection procedure will select the variables whose corresponding $|Z_j|$ are large. For example, such methods may include all the variables corresponding to indices j for which $|Z_j| > C$, where C is an appropriate threshold. Once a model is selected, a standard routine is to perform MLE based on the selected model. In the context of estimating a sparse high-dimensional vector of means such a method is also termed 'hard thresholding'. The pioneering research on the magnitude of the threshold in such a context, when estimating a vector of means, is Donoho and Johnstone (1994, 1995) and Foster and George (1994). When rescaling so that the variance of $Z_j$ is 1, a 'universal threshold' suggested in those papers is $C = \sqrt{2 \log n}$. Further results may be found in Johnstone and Silverman (2005). Donoho and Johnstone (1994) suggested replacing hard thresholding by soft thresholding, where the mean of $Z_j$ is estimated by $\operatorname{sign}(Z_j)(|Z_j| - C)_+$ (see the short sketch at the end of this section). One advantage of soft thresholding is that its smoothness enables evaluating its performance through Stein's unbiased estimator of the risk.

We develop similar concepts in the context of classification. The analog of hard thresholding is the (unconditional) MLE. An analog of soft thresholding is our conditional MLE, defined in the next section, which also defines a smooth shrinkage estimator. Shrinkage by conditional MLE is also appealing in non-classification problems, for example estimation of means. We choose to emphasize the classification problem because it is seldom studied in terms of thresholding and shrinkage estimation; see, however, the recent paper by Fan and Fan (to appear). There are various advantages to the conditional MLE relative to the unconditional MLE, as shown in the next section.

As in Donoho and Johnstone (1995), we obtain in Section 3 an unbiased estimator of the risk, using Stein's method. Such estimation is useful in examining various choices of thresholds, and it is a useful alternative to cross validation. An important message is that the optimal threshold in classification may behave dramatically differently from the optimal threshold in the context of estimating the mean vector under a squared loss. In contrast with the problem of mean estimation, in classification many more variables $Z_j$ may be selected. Hence the extensive research on the former problem is not completely relevant for classification. The difference between the optimal thresholds in estimation versus classification in various simulations may be seen in Table 2 of Section 4.

Finally, we mention a recent work on high-dimensional classification by Bickel and Levina (2004). Their setup is also of multivariate normal explanatory variables, but under their formulation the major problem becomes the estimation of the unknown covariance matrix, rather than inference concerning the vector of means as in our formulation.
They show that the naive Bayes assumption is often not very harmful and the resulting procedures are not much worse than the optimal procedure that does take dependence into account. Their result suggests that studying a model with independent Xj , as we do, is of interest.
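As a concrete reference point for the thresholding rules discussed above, here is a minimal sketch (ours) of hard and soft thresholding; the threshold and input values are only illustrative.

```python
import numpy as np

def hard_threshold(z, C):
    """Keep Z_j as is when |Z_j| > C, set it to 0 otherwise."""
    return np.where(np.abs(z) > C, z, 0.0)

def soft_threshold(z, C):
    """sign(Z_j) * (|Z_j| - C)_+  (Donoho and Johnstone, 1994)."""
    return np.sign(z) * np.maximum(np.abs(z) - C, 0.0)

z = np.array([-3.2, -0.4, 0.1, 1.8, 4.5])
C = 2.0                       # e.g. a universal-type threshold after rescaling
print(hard_threshold(z, C))   # [-3.2  0.   0.   0.   4.5]
print(soft_threshold(z, C))   # [-1.2  0.   0.   0.   2.5]
```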
2. Regularization through subset selection and conditional MLE

The discussion in the introduction suggests that a reasonable approach is to choose $a_j$ which are estimates of $a_j^0 = \delta_j/\sqrt{D}$, $j = 1, \ldots, m$, under some regularization procedure. The regularization method suggested next has two components. First, selecting only a size-k subset of the explanatory variables $X_1, \ldots, X_m$, thereby decreasing the variability due to estimating the corresponding $\delta_j$. Second, compensating for the selection bias introduced by the above selection.

When there is no prior information about the relevance and importance of the explanatory variables, a most natural subset selection is of the explanatory variables that correspond to indices j for which $|Z_j| > C$. Assuming, w.l.o.g., $|Z_1| \geq \cdots \geq |Z_k| \geq C \geq \cdots \geq |Z_m|$, we select $X_1, \ldots, X_k$ as the explanatory variables. Here C is a tuning parameter that defines a collection of regularization methods.

A natural estimator of $\delta_j$ is the MLE conditional on $|Z_j| > C$. However, in this paper we will modify this natural estimator by considering the MLE conditional on $Z_j > C$ or conditional on $Z_j < -C$, depending on the sign of $Z_j$. We will elaborate on the reasons for this modification in the sequel. One reason is that the modification defines a smooth estimator, while the non-modified conditional MLE has discontinuities at C and $-C$; the other reason has to do with simpler computation.

For given values $\tau$ and C, we now give the explicit expressions for the 'variable selection conditional MLE' estimator, denoted $\hat{\delta}_j^{\mathrm{CMLE}} \equiv \hat{\delta}_j^{\mathrm{CMLE}}(z_j, C, \tau)$, and its modification, denoted $\hat{\delta}_j \equiv \hat{\delta}_j(z_j, C, \tau)$:

\[
\hat{\delta}_j^{\mathrm{CMLE}} =
\begin{cases}
\displaystyle \operatorname*{argmax}_{\delta_j} \frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{\Phi((-C - \delta_j)/\tau) + 1 - \Phi((C - \delta_j)/\tau)} & \text{if } |z_j| > C, \\[2mm]
0 & \text{otherwise},
\end{cases}
\]

\[
\hat{\delta}_j =
\begin{cases}
\displaystyle \operatorname*{argmax}_{\delta_j} \frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{1 - \Phi((C - \delta_j)/\tau)} & \text{if } z_j > C, \\[2mm]
0 & \text{if } |z_j| \leq C, \\[2mm]
\displaystyle \operatorname*{argmax}_{\delta_j} \frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{\Phi((-C - \delta_j)/\tau)} & \text{if } z_j < -C.
\end{cases}
\]
Recall that $Z_j \sim N(\delta_j, \tau^2)$, where $\tau^2 = 2\sigma^2/n$. The above expressions are obtained from the following densities (4)--(6). The density of $Z_j$ conditional on $Z_j > C$ is

\[
\frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{1 - \Phi((C - \delta_j)/\tau)}\, I(z_j > C), \qquad (4)
\]

while the density conditional on $Z_j < -C$ is

\[
\frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{\Phi((-C - \delta_j)/\tau)}\, I(z_j < -C). \qquad (5)
\]

The density of $Z_j$ conditional on $|Z_j| > C$ is

\[
\frac{(1/\tau)\,\phi((z_j - \delta_j)/\tau)}{\Phi((-C - \delta_j)/\tau) + 1 - \Phi((C - \delta_j)/\tau)}\, I(|z_j| > C), \qquad (6)
\]
where $\phi$ and $\Phi$ denote the density function and cumulative distribution function of the standard normal, respectively.

Notice that for a sparse setup, as we have in mind, C will typically be large. Thus $\hat{\delta}_j^{\mathrm{CMLE}}$ and $\hat{\delta}_j$ are nearly the same. See Fig. 2 in the sequel, which compares a minor further modification (to be defined) of $\hat{\delta}_j(z, C, 1)$ with $\hat{\delta}_j^{\mathrm{CMLE}}(z, C, 1)$ as functions of z for various values of C; e.g., for C = 3 the graphs are very close. Thus, the two conditioning and selection methods yield practically the same results in the sparse setup that we have in mind. One should be more skeptical about our procedure and its motivation when the optimal C is small.

The modified conditional MLE has a few advantages. It is smooth, while the original conditional MLE is discontinuous at the points C and $-C$. The smoothness of the modified conditional MLE enables us to use Stein's method of estimation of the risk, as we do in the next section; that method helps us in deciding on the appropriate value of C. Our Proposition 2 below, which implies an invariance of the modified conditional MLE, is not true for the original conditional MLE; Proposition 2 simplifies the computations. In the sequel we will consider the above modified conditional MLE, and we will modify it further.

Given a value C, let $\{G_{\delta_j}\}$ be the family of distributions parameterized by $\delta_j$, where $G_{\delta_j}$ is defined in (4). Then $\{G_{\delta_j}\}$ is an exponential family. Consider the conditional MLE for $\delta_j$ conditional on $\{Z_j > C\}$.
Fig. 1. The graph of $\hat{\delta}_j$ versus $Z_j$ when $C = 0$ and $\tau = 1$.
By symmetry, we consider in the following only the case $Z_j > C$. Given an observation $Z_j > C$, distributed according to $G_{\delta_j}$, it is easy to check that the conditional MLE $\hat{\delta}_j$ for $\delta_j$ is the solution of

\[
Z_j = \hat{\delta}_j + \tau \cdot h\bigl((C - \hat{\delta}_j)/\tau\bigr), \qquad (7)
\]

where $h(a) = \phi(a)/(1 - \Phi(a))$, and $\phi$ and $\Phi$ denote the density function and cumulative distribution function of a standard normal distribution, respectively.

For any combination of C and $\tau$, there is a corresponding MLE $\hat{\delta}_j(Z_j, C, \tau)$, given an observation $Z_j$. The following proposition gives the relation between the modified conditional MLEs corresponding to various values of C and $\tau$.

Proposition 2. $\hat{\delta}_j(Z_j, C, \tau) = C + \tau\,\hat{\delta}_j((Z_j - C)/\tau, 0, 1)$.

Proof. $\hat{\delta}_j(Z_j, C, \tau)$ is a solution of $(Z_j - \hat{\delta}_j(Z_j, C, \tau))/\tau - h((C - \hat{\delta}_j(Z_j, C, \tau))/\tau) = 0$. After a slight rearrangement, we have

\[
\frac{Z_j - C}{\tau} - \frac{\hat{\delta}_j(Z_j, C, \tau) - C}{\tau} - h\Bigl(-\frac{\hat{\delta}_j(Z_j, C, \tau) - C}{\tau}\Bigr) = 0.
\]

Denote $Z_j^* = (Z_j - C)/\tau$ and $\delta_j^* = (\hat{\delta}_j(Z_j, C, \tau) - C)/\tau$; then the above reads $Z_j^* - \delta_j^* - h(-\delta_j^*) = 0$, which means that $\delta_j^*$ is the MLE given $Z_j^*$ when C = 0 and $\tau = 1$. By strict concavity of the conditional log-likelihood, the MLE is unique, i.e., $\hat{\delta}_j$ is a one-to-one function of $Z_j$. From this uniqueness, $\delta_j^* = \hat{\delta}_j(Z_j^*, 0, 1) = \hat{\delta}_j((Z_j - C)/\tau, 0, 1)$, and therefore $\hat{\delta}_j(Z_j, C, \tau) = C + \tau\,\hat{\delta}_j((Z_j - C)/\tau, 0, 1)$.  □
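A sketch (ours, with hypothetical function names) of how the modified conditional MLE can be computed numerically. By Proposition 2 it suffices to solve (7) with C = 0 and $\tau = 1$, i.e. to find the root $\delta$ of $z = \delta + h(-\delta)$, and then shift and rescale. The log-scale evaluation of the Mills ratio and the bracketing interval for the root finder are implementation choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import log_ndtr
from scipy.stats import norm

def mills(a):
    """h(a) = phi(a) / (1 - Phi(a)), computed stably as exp(log phi(a) - log Phi(-a))."""
    return np.exp(norm.logpdf(a) - log_ndtr(-a))

def cmle_standard(z):
    """Root delta of  z = delta + h(-delta)  (equation (7) with C = 0, tau = 1), z > 0."""
    g = lambda d: d + mills(-d) - z
    lo, hi = -1.0 / z - 1.0, z        # Mills-ratio bounds imply the root lies in [-1/z, z]
    return brentq(g, lo, hi)

def cmle(z, C, tau):
    """Modified conditional MLE delta_hat(z, C, tau) via Proposition 2;
    the lower tail is handled by symmetry, and |z| <= C returns 0."""
    if abs(z) <= C:
        return 0.0
    if z > C:
        return C + tau * cmle_standard((z - C) / tau)
    return -(C + tau * cmle_standard((-z - C) / tau))
```

For z only slightly above C the root lies far below zero; the further modification introduced below truncates such values to 0, so extreme numerical accuracy in that region is not needed.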
The function $\hat{\delta}_j(Z_j, 0, 1)$ is illustrated in Fig. 1.

We now modify our estimator once more. Let

\[
\bar{\delta}_j =
\begin{cases}
\max(\hat{\delta}_j, 0) & \text{if } Z_j > 0, \\
\min(\hat{\delta}_j, 0) & \text{otherwise}.
\end{cases}
\]

The last modification makes sense in a sparse situation where most of the values of $\delta_j$, $j = 1, \ldots, m$, are assumed to be nearly 0. In such a case one should not shrink to a value beyond 0. Finally, our suggested conditional MLE for $a_j^0$, denoted $\hat{a}_j^0$, $j = 1, \ldots, k$, is

\[
\hat{a}_j^0 = \frac{\bar{\delta}_j}{\|\bar{\delta}\|},
\]

where $\bar{\delta} = (\bar{\delta}_1, \ldots, \bar{\delta}_m)$. Similarly, one obtains the unconditional MLE as

\[
\hat{\hat{a}}_j^0 = \frac{\bar{Z}_j}{\|\bar{Z}\|}, \qquad (8)
\]
Fig. 2. The solid line represents $\bar{\delta}_j = \max(\hat{\delta}_j, 0)$, while the dotted line represents $\hat{\delta}_j^{\mathrm{CMLE}}$, both as functions of z, $z \geq 0$. The graphs for $z \leq 0$ are symmetric. The upper left and right panels are for C = 0 and 1, the lower left and right panels are for C = 2 and 3; in all cases $\tau = 1$.
where $\bar{Z}_j = Z_j \cdot I(|Z_j| > C)$ for an indicator function I. Note that the notations above suppress the dependence of $\hat{\hat{a}}_j^0$ and $\hat{a}_j^0$ on C. We further introduce the notations

\[
V(C) = \sum_j \hat{a}_j^0 \delta_j \quad \text{and} \quad \tilde{V}(C) = \sum_j \hat{\hat{a}}_j^0 \delta_j.
\]

Recall that V(C) and $\tilde{V}(C)$ are considered as performance criteria of classification: the larger they are, the better is the corresponding classifier. In the following graphs we demonstrate the performance of the conditional MLE and the unconditional MLE in a simulation study, through the pair of functions V(C) and $\tilde{V}(C)$. We study various combinations of l and $\Delta$, where l variables are normal with mean $\Delta$ and variance 1, and $m - l$ variables have mean 0 and variance 1, i.e., $EZ_j = \Delta$, $j = 1, \ldots, l$, and $EZ_j = 0$, $j = l + 1, \ldots, m$, where $m = 10^5$. Fig. 3 shows the simulation results. Each curve in each graph is based on an average of 100 simulations.

Discussion: Fig. 3 indicates that typically $\tilde{V}_0 = \max_C \tilde{V}(C)$ is larger than $V_0 = \max_C V(C)$. Yet it is larger by 10% or less, and sometimes it is comparable or even (very) slightly smaller. In practice, however, the optimal C is not given. The graphs indicate that errors in approximating the optimal C will have a more severe effect in the case of the unconditional MLE, since typically $V(C) > \tilde{V}(C)$ over most of the range of C. Moreover, for the conditional MLE we have a good procedure for approximating the optimal C, as demonstrated in the following section. Those facts make the conditional MLE further advantageous in comparison with the unconditional one.

In order to fully define a classifier, we should still define the parameter $\hat{a}_0$, given $\hat{a}_1, \ldots, \hat{a}_m$. An obvious way is the following. Let $\hat{\eta}_1 = (1/n) \sum_{i: Y_i = -1} \sum_{j=1}^m \hat{a}_j X_{ij}$, where the outer summation is over the n examples $(Y_i, X_{i1}, \ldots, X_{im})$ for which $Y_i = -1$. Similarly define $\hat{\eta}_2$. Let

\[
\hat{a}_0 = -\frac{\hat{\eta}_2 + \hat{\eta}_1}{2}, \qquad (9)
\]

where we assume w.l.o.g. that $\hat{\eta}_1 < \hat{\eta}_2$.
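To make the construction concrete, here is a sketch (ours, with placeholder function names) of the resulting classifier for a given threshold C: compute the modified conditional MLE for each selected coordinate, apply the extra truncation at 0, normalize to obtain the weights $\hat a_j^0$, and set the intercept from the training data as in (9). It assumes the hypothetical `cmle` helper sketched after Proposition 2; the explicit label-orientation step is a small convenience not spelled out in the text.

```python
import numpy as np

def fit_direction(Z, C, tau):
    """Weights proportional to the truncated modified conditional MLE delta_bar_j."""
    delta_hat = np.array([cmle(z, C, tau) for z in Z])   # cmle: sketch above
    # further modification: never shrink past 0
    delta_bar = np.where(Z > 0, np.maximum(delta_hat, 0.0),
                                np.minimum(delta_hat, 0.0))
    nrm = np.linalg.norm(delta_bar)
    return delta_bar / nrm if nrm > 0 else delta_bar

def fit_intercept(a, X_neg, X_pos):
    """a_hat_0 = -(eta_hat_1 + eta_hat_2)/2 as in (9), plus a label orientation."""
    eta1, eta2 = (X_neg @ a).mean(), (X_pos @ a).mean()
    return -0.5 * (eta1 + eta2), np.sign(eta2 - eta1)

def predict(X, a, a0, orient):
    """Y_hat = sign(sum_j a_j X_j + a_0), oriented so that +1 corresponds to the
    class with the larger projected training mean."""
    return orient * np.sign(X @ a + a0)
```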
Fig. 3. The graphs represent V(C) (dashed) and $\tilde{V}(C)$ (dotted) versus C, for $m = 10^5$ and $\tau = 1$. Graphs in the first row are for $(\Delta, l) = (1, 2000)$ and (1.5, 1500), the second row for (2, 1000) and (2.5, 300), the third row for (3, 250) and (3.5, 200), and the fourth row for (4, 150) and (4.5, 100).
3. Choosing the appropriate regularization parameter C

In this section we consider the problem of choosing the regularization parameter C. For every C there are corresponding conditional maximum likelihood estimators $\bar{\delta}_j = \bar{\delta}_j(Z_j, C, \tau)$, $j = 1, \ldots, m$, associated with a given data set $Z_1, \ldots, Z_m$ and $\tau$. The optimal C,

\[
\operatorname*{argmax}_C \frac{\sum_j \bar{\delta}_j \cdot \delta_j}{\|\bar{\delta}\|} \;\equiv\; \operatorname*{argmax}_C V(C),
\]

cannot be found, since the values $\delta_1, \ldots, \delta_m$ are unknown. Note that a naive approach of plugging in $\bar{\delta}_j$ for $\delta_j$ trivially yields C = 0 as the optimizer. The choice C = 0 could be very poor due to overfitting. Experimental evidence as in Fig. 3 makes this clear, as does the formal explanation given in Section 1.
We consider the use of Stein's unbiased estimator of the risk to obtain a good estimate of V(C) for the purpose of approximating the optimal C. Note that V(C) is a random variable; estimating the value of a random variable (rather than the value of a parameter) for a particular realization is not a standard statistical practice. However, Stein's method of estimation of the risk is helpful for it. An important advantage of Stein's unbiased estimation method over the more straightforward cross validation is that it does not use a validation set, thereby enabling the use of a larger training set.

Stein's unbiased estimator of the risk: In the expression for V = V(C), the denominator is given, and the unknown quantity $\sum_j \bar{\delta}_j \cdot \delta_j$ should be estimated. The method suggested is based on Stein's unbiased estimation of the risk, applied to the exponential family $\{G_{\delta_j}\}$; see Brown (1986, p. 99). The idea is to introduce a function of the data, denoted U, such that $E_{\delta_j} U(Z_j) \equiv E_{\delta_j} U_j = E_{\delta_j} \bar{\delta}_j \cdot \delta_j$. The obvious advantage of the left-hand side over the right-hand side is that the expression U involves only the data and does not involve the unknown parameter $\delta_j$; hence $\sum_j U_j$ is an 'unbiased estimator' of the (random) quantity $\sum_j \bar{\delta}_j \cdot \delta_j$. We denote, for fixed C and $\tau$,

\[
\gamma_j = \frac{d}{dZ_j}\, \bar{\delta}_j(Z_j, C, \tau).
\]

Note that for Z > 0 such that $\hat{\delta}(Z) \geq 0$ we have $\bar{\delta}(Z) = \hat{\delta}(Z)$, while $\bar{\delta}(Z) = 0$ for Z > 0 such that $\hat{\delta}(Z) < 0$; similarly for Z < 0. From (7) we get, for Z > 0,

\[
\gamma(Z) = \frac{1}{1 - h'((C - \hat{\delta})/\tau)}.
\]

In the same way, for Z < 0, we derive

\[
\gamma(Z) = \frac{1}{1 - h'((C + \hat{\delta})/\tau)}.
\]

Lemma 1. $E_{\delta_j} U_j = E_{\delta_j} \bar{\delta}_j \delta_j$, where $U_j = \bar{\delta}_j Z_j - \tau^2 \gamma_j$.

Proof. The proof is straightforward and is based on the principle of Stein's unbiased estimator of the risk; see Brown (1986, p. 99). We apply the technique to the exponential family of distributions parameterized by $\delta_j$, obtained when conditioning that a $N(\delta_j, 1)$ variable is greater than C, see (4).  □

Note that in order to apply Lemma 1 in an actual computation, we numerically evaluate $\bar{\delta} = \bar{\delta}(Z, C, \tau)$ on a grid of points. By Proposition 2, it is enough to evaluate it for C = 0 and $\tau = 1$.

Lemma 1 motivates us to estimate the quantity $V(C) = \sum_j \bar{\delta}_j \delta_j / \|\bar{\delta}\|$ by the following $\hat{V}(C)$:

\[
\hat{V}(C) = \frac{\sum_{\{j:\, |Z_j| > C\}} U_j}{\|\bar{\delta}\|}. \qquad (10)
\]
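A sketch (ours, with hypothetical helper names) of how $\hat{V}(C)$ in (10) can be computed and maximized over a grid of thresholds. It uses $h'(a) = h(a)(h(a) - a)$, which follows by differentiating $h(a) = \phi(a)/(1 - \Phi(a))$, sets $\gamma_j = 0$ on coordinates where $\bar{\delta}_j = 0$, and reuses the `cmle` sketch from Section 2; the grid of candidate C values is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import log_ndtr

def h(a):
    """h(a) = phi(a) / (1 - Phi(a)), computed on the log scale."""
    return np.exp(norm.logpdf(a) - log_ndtr(-a))

def h_prime(a):
    """h'(a) = h(a) * (h(a) - a)."""
    ha = h(a)
    return ha * (ha - a)

def stein_V_hat(Z, C, tau):
    """Stein estimate V_hat(C) of V(C), equation (10)."""
    U = np.zeros(len(Z))
    delta_bar = np.zeros(len(Z))
    for j, z in enumerate(Z):
        if abs(z) <= C:
            continue
        d_hat = cmle(z, C, tau)                    # cmle: sketch in Section 2
        d_bar = max(d_hat, 0.0) if z > 0 else min(d_hat, 0.0)
        delta_bar[j] = d_bar
        if d_bar != 0.0:                           # gamma_j = 0 where delta_bar_j = 0
            arg = (C - d_hat) / tau if z > C else (C + d_hat) / tau
            gamma = 1.0 / (1.0 - h_prime(arg))
            U[j] = d_bar * z - tau**2 * gamma      # U_j = delta_bar_j Z_j - tau^2 gamma_j
    nrm = np.linalg.norm(delta_bar)
    return U.sum() / nrm if nrm > 0 else 0.0

def choose_C(Z, tau, grid=np.linspace(0.0, 5.0, 51)):
    """C_hat = argmax_C V_hat(C) over a grid of candidate thresholds."""
    return max(grid, key=lambda C: stein_V_hat(Z, C, tau))
```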
The relationship between V(C) and $\hat{V}(C)$ is illustrated in Fig. 4 for various parameter configurations. Each graph in Fig. 4 represents a single realization. Since the agreement seems to be good, at least in sparse situations where $\operatorname{argmax}_C V(C)$ is achieved at a large C, it is good practice to select

\[
\hat{C} = \operatorname*{argmax}_C \hat{V}(C)
\]

as the regularization parameter. As mentioned before, in non-sparse cases, when the optimal C is not large, one should be skeptical about the solution provided by our method.

4. Simulations

Our heuristics imply that one should expect our procedure to produce good classifiers in sparse situations, where C is large. In those cases, when $z_j > 0$, the MLE conditional on $Z_j > C$ is practically the same as the MLE conditional on $|Z_j| > C$, and similarly when $z_j < 0$. Furthermore, in those situations the Stein unbiased estimator of V(C) is reliable. Indeed, Table 1 demonstrates that our classification procedure performs well relative to the other procedures as the setup becomes more sparse and hence the optimal C becomes larger.
Fig. 4. The graphs illustrate V(C) (dashed) versus Stein's unbiased estimator $\hat{V}(C)$ (solid). Graphs in the first row are for $(\Delta, l) = (1, 2000)$ and (1.5, 1500), the second row for (2, 1000) and (2.5, 300), the third row for (3, 250) and (3.5, 200), and the fourth row for (4, 150) and (4.5, 100).
Table 1
Misclassification error rates by conditional MLE, FAIR (Fan and Fan, to appear) and Fisher's rule (i.e., without variable selection)

                 m = 10^4                            m = 10^5
(Delta, l)       Cond. MLE   FAIR      Fisher        Cond. MLE   FAIR      Fisher
(1, 2000)        0.0250      0.0069    *0.0053       0.3220      0.1913    *0.1903
(1.5, 1500)      0.0045      *0.0000   0.0000        0.1408      *0.0707   0.0716
(2, 1000)        0.0000      *0.0000   0.0000        0.0711      *0.0400   0.0415
(2.5, 300)       0.0436      *0.0053   0.0078        *0.1339     0.2029    0.2061
(3, 250)         0.0303      *0.0012   0.0022        *0.0751     0.1587    0.1622
(3, 125)         0.0827      *0.0560   0.0680        *0.1511     0.3069    0.3096
(3, 60)          *0.1588     0.2156    0.2307        *0.2592     0.4042    0.4056
(3.5, 200)       0.0272      *0.0005   0.0011        *0.0509     0.1375    0.1412
(3.5, 100)       0.0724      *0.0412   0.0523        *0.1086     0.2918    0.2946
(3.5, 50)        *0.1287     0.1863    0.2023        *0.1845     0.3918    0.3934
(4, 150)         0.0321      *0.0006   0.0012        *0.0481     0.1429    0.1466
(4, 75)          0.0762      *0.0444   0.0562        *0.0972     0.2956    0.2984
(4, 40)          *0.1185     0.1761    0.1922        *0.1472     0.3873    0.3889
(4, 20)          *0.1780     0.3182    0.3295        *0.2220     0.4426    0.4435
(4.5, 100)       0.0482      *0.0027   0.0048        *0.0587     0.1846    0.1881
(4.5, 50)        0.0928      *0.0742   0.0887        *0.1049     0.3242    0.3267
(4.5, 25)        *0.1390     0.2293    0.2449        *0.1539     0.4102    0.4115
(4.5, 15)        *0.1771     0.3273    0.3383        *0.1996     0.4460    0.4468

An error rate marked with * is the minimum error rate in the row for the given m.
We present simulations for the cases $m = 10^5$ and $10^4$; $\Delta$ and l are the same as in Fig. 3. We consider the case where $\tau^2 = 1$ and n = 25, i.e., the variance of $\sum a_j X_j$ is $\sigma^2 = 25/2$ when $\sum a_j^2 = 1$. Notice that when $m = 10^5$, the performance of FAIR is very similar to the performance of Fisher's rule. This is because in our simulation the number of variables selected by FAIR is very large: typically more than 60,000 variables out of the $10^5$ variables are selected. The conditional MLE procedure selects only dozens or hundreds of variables, depending on the configuration.
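For reference, under the Gaussian model the misclassification rate of a fitted linear rule with the midpoint (equalizer) intercept can be evaluated in closed form as $\Phi\bigl(-|a \cdot (\mu - \nu)|/(2\sigma \|a\|)\bigr)$; this is one convenient way to compute error rates in simulations of this kind, though not necessarily how the table above was produced. A small helper (ours):

```python
import numpy as np
from scipy.stats import norm

def exact_error_rate(a, mu, nu, sigma):
    """Error probability of sign(a.x + a0) with the midpoint intercept, under
    X | Y=-1 ~ N(mu, sigma^2 I) and X | Y=+1 ~ N(nu, sigma^2 I)."""
    sep = abs(a @ (mu - nu))              # distance between projected class means
    return norm.cdf(-sep / (2.0 * sigma * np.linalg.norm(a)))
```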
Table 2
Average of 100 optimal C under V(C) versus the optimal C under a squared loss

(Delta, l)       C under classification   C under squared loss
(1, 2000)        0.00 (0.00)              4.12 (0.28)
(1.5, 1500)      0.97 (0.19)              4.39 (0.30)
(2, 1000)        1.92 (0.15)              4.15 (0.72)
(2.5, 300)       2.59 (0.19)              3.55 (0.32)
(3, 250)         2.79 (0.10)              3.08 (0.12)
(3.5, 200)       2.92 (0.09)              2.98 (0.09)
(4, 150)         3.07 (0.08)              3.01 (0.08)
(4.5, 100)       3.25 (0.09)              3.16 (0.10)

The numbers in parentheses indicate standard deviations.
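Numbers of this kind can be reproduced approximately by the following sketch (ours, not the authors' code), anticipating the estimation-versus-classification comparison discussed after Fig. 5: for simulated Z with known $\delta$, scan a grid of C and record the maximizer of V(C) and the minimizer of the squared error $\sum_j (\bar{\delta}_j - \delta_j)^2$. The grid is illustrative, and the hypothetical `cmle` helper from Section 2 is assumed.

```python
import numpy as np

def delta_bar_vec(Z, C, tau):
    """Truncated modified conditional MLE for every coordinate (cmle: sketch above)."""
    d = np.array([cmle(z, C, tau) for z in Z])
    return np.where(Z > 0, np.maximum(d, 0.0), np.minimum(d, 0.0))

def optimal_C_both(Z, delta, tau, grid=np.linspace(0.0, 6.0, 61)):
    """Return (argmax_C V(C), argmin_C sum_j (delta_bar_j - delta_j)^2)."""
    V, sq = [], []
    for C in grid:
        db = delta_bar_vec(Z, C, tau)
        nrm = np.linalg.norm(db)
        V.append(db @ delta / nrm if nrm > 0 else 0.0)
        sq.append(np.sum((db - delta) ** 2))
    return grid[int(np.argmax(V))], grid[int(np.argmin(sq))]
```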
Fig. 5. The left and right panels in the first row represent the graphs of C versus $\Delta$ for l = 200 and 500, respectively; in the second row, l = 1000 and 2000. Solid lines are the optimal C for estimation and dotted lines for classification.
Estimation versus classification: Table 2 shows the optimal C for different combinations of $(\Delta, l)$ under classification, in comparison with the optimal C for estimation under a squared loss. Those optimal values are obtained through simulations in various combinations. In both the classification and the estimation cases we examine the class of conditional MLE procedures parameterized by C, and the optimal value of C in each case is reported in Table 2. In each realization we compute the value of C that maximizes V(C) and the value of C that minimizes $\sum_{j=1}^m (\bar{\delta}_j - \delta_j)^2$. Averages over 100 realizations are reported for each configuration of l and $\Delta$.

It should be emphasized that the corresponding optimal values of C in classification versus estimation may be very different, e.g., when $(\Delta, l) = (1, 2000)$ and (1.5, 1500). Under squared loss, when the signals are weak, the optimal C is large, so that most of the estimates of $\delta_j$ are 0; but in the corresponding classification problem the optimal C is small (C = 0 in the case $(\Delta, l) = (1, 2000)$), so that many (even all) of the variables are selected.

Fig. 5 presents graphs of C versus $\Delta$ for various values of l. Regardless of the value of l, when $\Delta$ is small the optimal C's may be rather different in classification versus estimation; the optimal C in classification is typically smaller. However, the optimal C values become similar as $\Delta$ increases.

5. Concluding remarks

Consider the case of correlated data with a known covariance matrix. In this case, if we scale all the variables to have the same variance, we should (i) select all the variables with corresponding $Z_j > C$ or $Z_j < -C$ and estimate their means using the conditional MLE,
(ii) apply the linear transformation of the selected variables under which the new variables are independent, and (iii) estimate the means of the transformed variables based on the estimates in (i), and proceed as before. The problem of an unknown covariance matrix is beyond the scope of this paper.

Some previous studies concerning regularization in discriminant analysis are Campbell (1980), Friedman (1989) and Hastie et al. (1995). The regularization in those papers is based on the inverse of the covariance matrix, but in high-dimensional problems with, say, $10^5$ variables this approach is numerically infeasible. Another regularization direction is to act as if the features are independent, namely naive Bayes, as in Bickel and Levina (2004). By doing so one avoids the noise from attempting to estimate many correlation coefficients. Acting as if the features are independent, together with variable selection through conditional MLE, brings us to our main work on regularization in high-dimensional problems.

Our framework of shrinkage by conditional MLE is less arbitrary than soft shrinkage and some other shrinkage methods. The framework described in this paper is based on specific modeling assumptions. Other modeling assumptions (for example, non-normal data) would lead to analogous but different shrinkage procedures. In this manner, we consider our proposal to be a general gateway to other model-dependent shrinkage methods.

Acknowledgments

We are grateful to the referees and an Associate Editor for helpful comments. Crucial discussions with Anirban DasGupta are gratefully acknowledged.

References

Bickel, P., Levina, E., 2004. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli 10 (6), 989--1010.
Brown, L.D., 1986. Fundamentals of Statistical Exponential Families, with Applications in Statistical Decision Theory. IMS, Hayward, CA.
Campbell, N.A., 1980. Shrunken estimation in discriminant and canonical variable analysis. Appl. Statist. 29 (1), 5--14.
Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425--455.
Donoho, D.L., Johnstone, I.M., 1995. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200--1224.
Fan, J., Fan, Y., to appear. High dimensional classification using features annealed independence rules. Ann. Statist.
Foster, D.P., George, E.I., 1994. The risk inflation criterion for multiple regression. Ann. Statist. 22, 1947--1975.
Friedman, J., 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc. 84, 165--175.
Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized discriminant analysis. Ann. Statist. 23, 73--102.
Johnstone, I.M., Silverman, B.W., 2005. Empirical Bayes selection of wavelet thresholds. Ann. Statist. 33 (4), 1700--1752.
Lehmann, E.L., 1986. Testing Statistical Hypotheses, second ed. Wiley, New York.