Expert Systems with Applications 41 (2014) 1463–1475
Feature selection for high-dimensional multi-category data using PLS-based local recursive feature elimination
Wenjie You a,b, Zijiang Yang b, Guoli Ji a,*
a Department of Automation, Xiamen University, 361005 Xiamen, China
b School of Information Technology, York University, Toronto M3J 1P3, Canada
Keywords: High-dimensional multi-category problem; Partial least squares; Recursive feature elimination; Feature selection
Abstract
This paper focuses on high-dimensional and ultrahigh-dimensional multi-category problems and presents a feature selection framework based on local recursive feature elimination (Local-RFE). Using this analytical framework, we propose a new feature selection algorithm, PLS-based Local-RFE (LRFE-PLS). In order to compare the effectiveness of the proposed methodology, we also present PLS-based Global-RFE, which takes all categories into consideration simultaneously. The advantage of the proposed algorithms lies in the fact that PLS-based feature ranking can quickly delete irrelevant features and RFE can concurrently remove some redundant features. As a result, the selected feature subset is more compact. In this paper the proposed algorithms are compared to some state-of-the-art methods using multiple datasets. Experimental results show that the proposed algorithms are competitive and work effectively for high-dimensional multi-category data. Statistical tests of significance show that the LRFE-PLS algorithm has better performance. The proposed algorithms can be effectively applied not only to microarray data analysis but also to image recognition and financial data analysis.
1. Introduction
Analyzing small samples of high-dimensional data is a thorny problem in machine learning and pattern recognition. The essence of the problem with high-dimensional small-sample data is that the data contain redundant information as well as high noise and strong correlation among features. Effective modeling of high-dimensional small-sample datasets involves mining the complete potential information hidden in the data: one has to retain the maximum amount of useful information while simultaneously removing redundancy and noise. In recent years, a large number of high-dimensional and ultrahigh-dimensional datasets (Dudoit, Fridlyand, & Speed, 2002; Fan & Fan, 2008) have appeared in areas such as microarray analysis, image recognition and financial data analysis. A high-dimensional feature space affects both the accuracy and the efficiency of learning methods, and many data mining classification algorithms lack efficiency or even fail on such high-dimensional classification problems (Jain, Duin, & Mao, 2000). In order to deal with such high-dimensional (or ultrahigh-dimensional) datasets, feature selection is one of the most important and popular methods.
Feature selection includes two aspects: a selection strategy and evaluation criteria. Depending on the task at hand, feature selection defines different metrics in order to choose the most discriminative features from the original feature space. The key is to
establish evaluation criteria that can distinguish a feature subset which helps classification and has low redundancy. According to the relationship between the evaluation function and the classifier, feature selection can be divided into three types (Guyon & Elisseeff, 2003): Filter, Wrapper and Embedded approaches. Filter methods select features by class separability criteria and are independent of the back-end classifier used. As their name indicates, they use a separability index as an evaluation criterion to rank features and select the top features to form the feature subset (e.g., the F-test (Le Cao, Bonnet, & Gada, 2009) and ReliefF (Kononenko, 1994)). The advantage of filter methods is that they are simple and fast. Wrapper methods select features depending on the feedback of a classifier, so a learning algorithm is included in the feature selection process: a wrapper method uses learning algorithms to evaluate the prediction accuracy of a potential feature subset (e.g., SFS (Kudo & Sklansky, 2000) and GA (Jain et al., 2000)). Although this approach is time-consuming, the results are the best for the specific learning technique since the selection is driven by that learning algorithm, so wrapper methods have better performance. However, as the feature dimension increases, the number of iterations of the learning algorithm also increases substantially, which is one of the biggest challenges when wrapper methods are applied to high-dimensional datasets. Embedded methods evaluate feature subsets with some characteristic of the classifier: rather than using classifiers directly, the feature selection process is part of the machine learning algorithm. During training, the algorithm decides which features will be selected and which
features will be ignored (e.g., Lasso (Tibshirani, 1996)). The computational complexity and the extension to multiclass problems are major issues when the number of features becomes excessively large (Sun, Todorovic, & Goodison, 2010). In practical applications, some state-of-the-art methods cannot deal with high-dimensional problems because their computational cost is intolerable. In this paper, we focus on the feature ranking branch of Filter methods, which does not evaluate all possible feature subsets but only ranks all features according to a defined feature measure. More importantly, it is computationally efficient, which is particularly important for high-dimensional or ultrahigh-dimensional datasets (Zhang, Lu, & Zhang, 2006).
Most traditional feature selection methods define some metric to evaluate an individual feature (Arauzo-Azofra, Aznarte, & Benitez, 2011). Such univariate methods have a defect: they ignore the correlations among features. A more accurate approach needs to consider the joint probability distribution among features. Compared to univariate methods, multivariate methods take the relationships among features into account and can detect features with a relatively small main effect but a strong interaction effect (Ji, Yang, & You, 2011). Relief algorithms (Kira and Rendell, 1992), based on an iteratively adjusted margin, are among the best methods for multi-feature selection; ReliefF (Kononenko, 1994), an improvement of the Relief algorithm, can handle multi-category classification and regression. The SVM-RFE feature selection method proposed by Guyon, Weston, Barnhill, and Vapnik (2002) is one of the best approaches: it is a sequential backward elimination method which makes use of the maximum-margin principle of SVM to assess and eliminate irrelevant and redundant features (genes in biological datasets), and it has been widely used in many high-dimensional data analysis problems (Youn, Koenig, Jeong, & Baek, 2010). Feature importance ranking based on Random Forest (RF) (Breiman, 2001) uses a random forest to evaluate and select each feature; it indirectly examines the effect of the interactions between features without any assumptions on the data distribution (Boulesteix, Porzelius, & Daumer, 2008; Diaz-Uriarte & Alvarez de Andres, 2006). All these state-of-the-art methods consider the correlation among features. The PLS-based local-RFE and global-RFE feature selection algorithms proposed in this paper also consider the correlation among features.
In the fields of data mining and machine learning, the analysis of multi-category problems is another challenge (Allwein, Schapire, & Singer, 2000). Many traditional learning algorithms apply only to two-category problems, for which good solutions already exist; in contrast, there are still no uniform standards and specifications for multi-category problems (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011), which are solved mainly by converting them into multiple two-category problems. Similarly, feature selection for multi-category problems is also a challenge in machine learning, and a complete solution is yet to be created. Recently, Wang, Zhou, and Chu (2008) combined one-versus-all (OVA) with some two-category feature selection methods, such as Relief and mRMR, and obtained feature subsets using defined class-dependent criteria.
The authors found that selecting an independent feature subset for each OVA classifier produces better results on some standard benchmark datasets; their experiments showed that class-dependent classifiers based on OVA decomposition are superior to directly applying multi-category methods. Support vector machine (SVM) was originally designed only for two-category problems, so when dealing with multi-category problems SVM recursive feature elimination (SVM-RFE) needs to be extended. Based on SVM-RFE, Zhou and Tuck (2007) derived four MSVM-RFE algorithms suitable for multi-category problems. These algorithms use the idea of OVA decomposition to convert the multi-objective optimization problem
into a single-objective optimization problem by using a weighted average; the new feature (gene) ranking criteria are obtained through the analysis of the single-objective optimization problem. Granitto and Burgos (2009) combined OVA with the RFE method; through experiments on high-dimensional multi-category datasets, they found that the OVA-RFE method with RF and SVM had better recognition performance, and their experiments also show that this method can significantly improve the robustness of the algorithm.
Motivated by the above literature, we present an analytical framework of local recursive feature elimination. PLS-based feature ranking (Ji et al., 2011) is introduced into the framework, and a new Partial Least Squares (PLS) based local recursive feature elimination algorithm is proposed to achieve feature selection efficiently for high-dimensional multi-category problems. In order to compare the effectiveness of the proposed methodology, we also present PLS-based Global-RFE, which takes all categories into consideration simultaneously. The advantage of the proposed algorithms lies in the fact that PLS-based feature ranking can quickly delete irrelevant features and RFE can concurrently remove some redundant features. In this paper, our proposed algorithms are compared with some state-of-the-art methods (ReliefF, SVM and RF) using multiple microarray datasets. Experimental results show that our algorithms obtain good performance in terms of accuracy, the Kappa measure and the number of used features; furthermore, they have good computational efficiency. Statistical tests of significance also show that our proposed local recursive feature elimination algorithm has better performance than the other algorithms. LRFE-PLS has the following characteristics. First, it is computationally efficient, especially for high-dimensional datasets. Second, it makes the result more robust, so that good recognition accuracy can be obtained using default parameters. Third, due to the introduced RFE, it can usually obtain a smaller feature subset without degrading the recognition rate. Our algorithms can be effectively applied not only to microarray data analysis but also to other high-dimensional multi-category problems such as image recognition and financial data analysis.
The rest of the paper is organized as follows. In Section 2, definitions of relevant concepts and the general analytical framework of local recursive feature elimination are presented. Section 3 develops the new PLS-based local recursive feature elimination algorithm based on this analytical framework to achieve feature selection on high-dimensional multi-category data. Feature selection on multiple datasets, as well as the comparison and evaluation of the related algorithms, is provided in Section 4. Finally, we present conclusions and discuss future work in Section 5.

2. Methodology
Let p be the number of features and n be the number of observations. Our data matrix is an n × p matrix X. $X_i$ denotes the i-th sample observation, i.e., $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ (i = 1, 2, ..., n); $X^j$ denotes the j-th feature (dimension) (j = 1, 2, ..., p). When n ≪ p, X demonstrates the characteristics of a small sample in a high-dimensional space. Let $X = \{X^1, X^2, \ldots, X^p\}$ be the original feature space.
We assume that there exists an optimal feature subset $X^{(U)} \subset X$, a redundant feature subset $X^{(R)} \subset X$ which includes the features positively or negatively correlated with the features in $X^{(U)}$, and an irrelevant feature subset $X^{(I)} \subset X$ including the features orthogonal to the features in $X^{(U)}$ (Kohavi & John, 1997):
$$ \text{Feature Space } X : \begin{cases} \text{Relevant Feature Subset (Optimal Feature Subset): } X^{(U)} \\ \text{Redundant Feature Subset: } X^{(R)} \\ \text{Irrelevant Feature Subset: } X^{(I)} \end{cases} $$
The goal of feature selection is to eliminate the irrelevant features $X^{(I)}$ and the redundant features $X^{(R)}$ in order to represent the original features using as few features as possible so that optimal classification
learning can be achieved. Good feature selection methods improve the generalization ability of the learning model, as well as enhance the understandability of the model. 2.1. Related concepts
Definition 1 (Feature Measure: FM). Given a feature set X and category variable y, the evaluation score of the j-th feature $X^j$ (j = 1, ..., p) is defined as:
$$ \mathrm{Score}(X^j) = f(X^{(k)}, y), \qquad X^{(k)} \subseteq X \tag{1} $$
where f is a mapping $f: R^{k+1} \rightarrow R$ (1 ≤ k ≤ p), and $X^{(k)}$ is any subset containing $X^j$ whose size is equal to k. According to whether the examined feature measure method considers the correlation among features or not, this paper divides feature measure methods into two categories: single-feature measure (SFM) methods and multi-feature measure (MFM) methods.
If k = 1 in (1), the score function is related to only one feature variable; that is, the score of $X^j$ is determined only by the individual feature $X^j$ itself and is not related to any other feature $X^i$ ($\forall i \neq j$). This implies an assumption of independence among features. This type of score calculation is called SFM. Some ranking algorithms in Filter methods, such as the F-test, are normally SFM methods, which do not consider the possible interactions among features.
If k > 1 in (1), the score function is an explicit or implicit function of multiple feature variables; that is, the score of feature $X^j$ is determined by multiple features $X^{(k)}$ (1 < k ≤ p). Many features, or even all of them, have a potential impact on the score of feature $X^j$. This type of score calculation is called MFM. Some ranking algorithms in Filter methods, such as ReliefF, are typical MFM methods. Compared to SFM methods, MFM methods consider the correlation among features. Therefore, we can introduce a recursive feature elimination strategy on top of MFM-based feature ranking to eliminate redundant features to some extent.
The Recursive Feature Elimination (RFE) strategy is a recursive cycle that proceeds as follows: pre-defined feature measure rules for the current dataset are used to calculate ranking scores for all features, where the ranking score represents a feature's classification ability. The feature with the smallest score is removed. Then the ranking scores of the remaining features are recalculated, and the feature with the current smallest score is removed. This process is repeated until only one feature remains in the feature set. RFE returns a list of feature serial numbers which represents the order of importance. RFE indirectly uses feature ranking heuristic criteria to ensure good performance of the algorithm. An RFE process does not depend on a specific classifier. As the baseline of feature ranking algorithms, MFM-based feature ranking can be applied in an RFE process, whereas SFM-based feature ranking cannot: under SFM, removing a feature does not change the scores of the remaining features, so the recursion adds nothing to a plain ranking. After the RFE strategy is introduced into MFM-based feature ranking, the feature ranking process can quickly delete irrelevant features while eliminating some redundant features.

2.2. Frameworks of local-RFE
Motivated by the literature (Granitto & Burgos, 2009; Wang et al., 2008; Zhou & Tuck, 2007), a general framework based on local-RFE suitable for multi-category problems is presented. It consists of three key components: (1) Localization process: a complete multi-category problem is localized into a series of two-category sub-problems; OVA, one-versus-one (OVO), and error-correcting output codes (ECOC) are the most popular strategies (Allwein et al., 2000). (2) Feature Measure: an appropriate MFM method is selected, and the feature score vectors are calculated for each two-category sub-problem. Different feature measure methods can be used simultaneously in practical applications,
Fig. 1. Local-RFE: a simple framework flowchart.
including homogenous and heterogeneous feature measures and their fusion. (3) RFE process: all score vectors are averaged with weights, and feature ranking is performed with the RFE strategy. The generated feature ranking must be based on MFM, and the RFE strategy can only work effectively under MFM. In practice, more than one feature can be removed in a single step; even the features ranked in the lowest α% (e.g., α = 10) (Guyon et al., 2002) can be eliminated in each iteration. A simple framework flowchart is shown in Fig. 1. This paper first decomposes the g-class problem into g two-category sub-problems using OVA; next, we calculate the indicators based on a homogenous feature measure independently in each of the two-category sub-problems and obtain the corresponding feature scores; finally, MFM-based feature ranking is implemented on the simple averages of the g score lists, and an RFE strategy is introduced. We denote this framework as Local-RFE (LRFE). In addition, this paper denotes as Global-RFE (GRFE) the feature selection methods which can be applied directly to multi-category problems and into which RFE can be introduced. GRFE, such as ReliefF- and RF-based feature selection with RFE (Granitto, Furlanello, Biasioli, & Gasperi, 2006; Ruan, Li, Li, Gong, & Wang, 2006), takes all categories into consideration simultaneously.

3. Proposed PLS-based local-RFE and global-RFE algorithms
For multi-category feature selection, in this section we develop novel feature selection algorithms, PLS-based Local-RFE and PLS-based Global-RFE. We first introduce the basic principles of PLS, then present the PLS-based local and global solution models, and finally give the corresponding algorithms (LRFE-PLS and GRFE-PLS).

3.1. Basic principles of PLS
Partial Least Squares (PLS) is a non-parametric method based on the idea of high-dimensional projection. PLS aims to find uncorrelated linear transformations of the original input variables which have high covariance with the response variables. Using these latent variables, PLS can predict response variables, perform regression, and reconstruct the original data matrix simultaneously. X denotes the n × p matrix that contains p predictor variables for n observations, and y denotes the corresponding dependent variable vector of size n × 1; both are standardized. PLS components $t_i$ (i = 1, ..., nfac) are constructed to maximize the objective function based on the sample covariance between y and t = Xw; that is, PLS searches the weight vectors w sequentially to satisfy the criterion
$$ w_i = \underset{w^T w = 1}{\operatorname{arg\,max}}\ \mathrm{cov}^2(Xw, y) \tag{2} $$
subject to the orthogonality constraint $w_i^T (X^T X) w_j = 0$ (1 ≤ j < i). In the case of multi-response variables, Y is a matrix with q columns and the criterion can be simplified as maximizing $\mathrm{var}(Xw)\,\mathrm{corr}^2(Xw, Y)$ (Boulesteix & Strimmer, 2006; Wold, Johansson, & Cocchi,
1993). To derive the components t1, t2, . . ., tnfac, PLS decomposes X and y to produce a bilinear representation of the data (Martens & Naes, 1989):
$$ X = t_1 p_1^T + \cdots + t_{nfac} p_{nfac}^T + E, \qquad y = t_1 q_1 + \cdots + t_{nfac} q_{nfac} + f \tag{3} $$
where nfac is the number of latent variables, $t_i$ (n × 1) is a score vector, $p_i$ (p × 1) is a loading vector of X, and $q_i$ is a scalar. E and f are the residuals of X and y, respectively. Eq. (3) can be viewed as a set of ordinary least squares problems. In fact, the idea of PLS is to estimate loadings and scores by regression, that is, by a sequence of bilinear models fitted by least squares. At every step (i = 1, ..., nfac) the vector $w_i$ is estimated to obtain the PLS component that has maximal sample covariance with y, and each component $t_i$ is uncorrelated with the previously extracted components. That is, the first component $t_1$ is extracted on the basis of the covariance between X and y; the i-th component $t_i$ (i = 2, ..., nfac) is extracted using the residuals of X and y from the previous step. Basic PLS algorithms mainly include Non-linear Iterative Partial Least Squares (NIPALS) (Wold, 1975) and Statistically Inspired Modification of PLS (SIMPLS) (de Jong, 1993). SIMPLS differs from NIPALS in two ways: first, successive $t_i$ are calculated explicitly as linear combinations of X; second, SIMPLS directly finds the weight vectors (referred to as $r_i$) which are applied to the original (not deflated) matrix X. Therefore, the SIMPLS algorithm is slightly superior to NIPALS since SIMPLS is actually derived to solve a specific objective function, i.e., to maximize covariance. Our analysis and calculation in this paper are based on the SIMPLS algorithm. The SIMPLS procedure (with multi-response variables Y) can be summarized as follows:
    S = X^T Y
    For i = 1 to nfac do
        If i = 1, [u, s, v] = SVD(S)
        If i > 1, [u, s, v] = SVD(S − P_{i−1}(P_{i−1}^T P_{i−1})^{−1} P_{i−1}^T S)
        r_i = u(:, 1),  t_i = X r_i,  p_i = X^T t_i / (t_i^T t_i)
        Store r_i, t_i, p_i into R, T and P respectively
    End

where nfac is the number of factors to be retained in the formation of the model, R is the weight matrix, P is the loading matrix, and T is the score matrix. The weight vector $w_i$ in NIPALS and the weight vector $r_i$ in SIMPLS are different when i is larger than 1; that is, NIPALS and SIMPLS use different methods to compute the weight vector. More details about SIMPLS can be found in de Jong (1993).
In the PLS procedure, the extracted component t = Xw represents as much variation of X as possible. At the same time it should be associated with Y (or y) as much as possible in order to explain the information in Y. To analyze how the variation of X explains Y, we introduce variable importance in projection (VIP) (Wold et al., 1993) to quantitatively denote the impact of each $X^j$ on Y.

Definition 2 (Variable Importance in Projection: VIP). Let $r(X^i, X^j)$ be the correlation coefficient between two variables $X^i$ and $X^j$. Given the explained variation of component $t_h$ with respect to Y, $Rd(Y; t_h) = \frac{1}{q}\sum_{k=1}^{q} r^2(y_k, t_h)$, and the accumulated explained variation of $t_1, t_2, \ldots, t_{nfac}$ with respect to Y, $Rd(Y; t_1, t_2, \ldots, t_{nfac}) = \sum_{h=1}^{nfac} Rd(Y; t_h)$, we define
$$ \mathrm{VIP}(X^j; nfac) = \sqrt{ p\, \frac{\sum_{h=1}^{nfac} Rd(Y; t_h)\, w_{jh}^2}{Rd(Y; t_1, \ldots, t_{nfac})} } \tag{4} $$
as the variable importance in projection of $X^j$ to Y, where nfac is the number of latent variables (factors) and $w_{jh}$ is the j-th weight of axis $w_h$, indicating the marginal contribution of $X^j$ to the construction of component $t_h$. A larger $\mathrm{VIP}(X^j; nfac)$ indicates more importance of $X^j$ in the interpretation of Y. In fact, the interpretation of $X^j$ to Y is passed through $t_h$: if the explanatory power of $t_h$ to Y is strong and $X^j$ plays an important role in constructing $t_h$, then the explanatory power of $X^j$ to Y should be regarded as significant. That is, for components $t_h$ with a large value of $Rd(Y; t_h)$, if the values $w_{jh}$ are also high, then the interpretation of $X^j$ to Y is strong. Therefore, the VIP indicator can be used to measure features, which is realized in Algorithm 1: PLS-based Feature Measure (PLS_FM). For further details on PLS-based feature selection please refer to our previous work (Ji et al., 2011; Yang, You, & Ji, 2011).

3.2. Models
For g-class multi-category problems, we now propose two different solution methods: the first makes use of OVA to decompose the problem into a set of two-category problems, and the second takes all categories into consideration simultaneously. The response y in (2) consists of continuous response variables; this is a typical quantitative prediction setting, which fits continuous numerical data. However, in the current multi-category classification context we have a qualitative response variable y consisting of the categories $c_1, c_2, \ldots, c_g$. The response information indicating the category label y should be encoded into a response vector $\tilde{y}$ (or matrix $\widetilde{Y}$) in order to achieve category label independence (Nguyen & Rocke, 2002).

3.2.1. Model 1: local solution model based on OVA decomposition
Using an OVA decomposition method, the g-class problem is decomposed into g two-category sub-problems. We also code the class label of each two-category sub-problem as a binary response (Nguyen & Rocke, 2002). For example, $\tilde{y}^r$ is the n × 1 vector of sample class labels in which the samples of the r-th class have positive labels and all other samples have negative labels (please see the details in Algorithm 2). Thus, g PLS models with a single response variable are established. Therefore, we use the SIMPLS algorithm to solve the following optimization problem:
$$ \begin{cases} \max\ J_r(w^r) = \mathrm{cov}^2(Xw_i^r, \tilde{y}^r) \\ \text{s.t.}\ \ \|w_i^r\| = 1, \\ \qquad (w_i^r)^T (X^T X) w_j^r = 0, \\ \qquad j = 1, \ldots, i-1,\ \ r = 1, \ldots, g \end{cases} \tag{5} $$
A group of solutions $w_1^r, w_2^r, \ldots, w_{nfac}^r \in R^p$ is obtained, and PLS-FMs can be calculated using the VIP measure. We call this the PLS-based Local Feature Measure (PLS-LFM): the weighted average $\mathrm{score}(X^j) = \sum_{r=1}^{g} c_r\, \mathrm{LFM}_r^j$ ($c_r$ is a weighting coefficient) of the obtained g local feature measure sequences $\{\mathrm{LFM}_r^j;\ j = 1, \ldots, p\}$ (r = 1, ..., g) is calculated and ranked.

Theorem 1. If $\{(X_i, y_i)\mid X_i \in R^p,\ y_i \in Y_C,\ i = 1, \ldots, n\}$ is the standardized dataset with the given categories and $Y_C = \{c_1, c_2, \ldots, c_g\}$ (g > 2) is the class set, then PLS-LFM with nfac > 1 is a multi-feature measure (MFM) method.

Proof. We only discuss the case of the r-th two-category sub-problem; the others are similar.
$$ \mathrm{LFM}_r^j = \sqrt{ p\, \frac{\sum_{h=1}^{nfac} Rd(\tilde{y}^r; t_h^r)\, (w_{jh}^r)^2}{Rd(\tilde{y}^r; t_1^r, \ldots, t_{nfac}^r)} } \tag{6} $$
From the SIMPLS algorithm, when nfac is larger than 1, the weight vector is $r_h = k_h S_{h-1} = k_h (I_p - P_{h-1}(P_{h-1}^T P_{h-1})^{-1} P_{h-1}^T) X^T y_0$ (1 < h ≤ nfac), where $k_h$ is a scalar, $I_p$ is the p × p identity matrix, $P_{h-1} = [p_1, \ldots, p_{h-1}]$, and $y_0$ is the centered $\tilde{y}^r$. If we set $C = (c_{ji})_{p\times p} = I_p - P_{h-1}(P_{h-1}^T P_{h-1})^{-1} P_{h-1}^T$, then $r_h = k_h C X^T y_0$. Thus, we have $r_{jh} = k_h \sum_{i=1}^{p} c_{ji} (X^i)^T y_0$ (j = 1, ..., p). Therefore, the value of $r_{jh}$ is relevant to multiple features $X^i$ when the $c_{ji}$ (i ≠ j) are not all zeros. Obviously, this is an MFM. Therefore, when nfac is larger than 1, $\mathrm{LFM}_r^j$ in Eq. (6) is an MFM. □

In particular, when nfac is equal to 1, we know from the SIMPLS algorithm that $r_1 = k_1 S_0$, where $k_1$ is a scalar and $S_0 = X^T y_0$. Thus, $r_{j1} = k_1 (X^j)^T y_0$ for j = 1, ..., p. Therefore, the value of $r_{j1}$ is relevant only to $X^j$, not to $X^i$ (i ≠ j), so in this case $\mathrm{LFM}_r^j$ in Eq. (6) is actually an SFM.
Note: Boulesteix (2004) proved that the list of features ordered by the squared weights $w_1^2$ of the first PLS component has the same order as the list obtained from the F-statistic.

3.2.2. Model 2: global solution model based on class coding extension
Let $\{(X_i, y_i)\mid X_i \in R^p,\ y_i \in Y_C,\ i = 1, \ldots, n\}$ denote a given dataset with known categories, where $Y_C = \{c_1, c_2, \ldots, c_g\}$ is the class label set and g denotes the number of categories. The original class labels $(y)_{n\times 1}$ are encoded. The dependent variable of our PLS model is defined as $\widetilde{Y} = (y_{ij})_{n\times g} \in \{0,1\}^{n\times g}$ (n observed samples, g categories),
$$ y_{ij} = I(y_i = c_j) = \begin{cases} 1 & y_i = c_j \\ 0 & \text{otherwise} \end{cases}, \qquad i = 1, \ldots, n;\ j = 1, \ldots, g \tag{7} $$
Therefore, the dependent variable matrix is encoded as $\widetilde{Y} = (y_{ij})_{n\times g}$, where each column of $\widetilde{Y}$ contains information about the class membership of the samples: the j-th entry of the i-th row of $\widetilde{Y}$ equals 1 if sample i belongs to class $c_j$ (i.e., $y_{ij} = 1$) and 0 otherwise. Using the above method, the original class labels $(y)_{n\times 1}$ are extended into the multiple-response variable $\widetilde{Y} = (y_{ij})_{n\times g}$, and the multiple-response PLS model is established. The SIMPLS algorithm achieves a unified treatment of single-response and multiple-response variables, so we also use the SIMPLS algorithm (de Jong, 1993) to solve the multiple-response PLS model. SIMPLS calculates $w_1, w_2, \ldots, w_{nfac} \in R^p$ by solving the following optimization problem:
$$ \begin{cases} \max\ J(w) = \mathrm{cov}^2(Xw_i, \widetilde{Y}) \\ \text{s.t.}\ \ \|w_i\| = 1, \\ \qquad w_i^T (X^T X) w_j = 0, \\ \qquad j = 1, \ldots, i-1 \end{cases} \tag{8} $$
PLS-based Feature Measure (PLS-FM) scores can then be calculated using VIP as the score indicator, and the obtained PLS-FM scores of all features are ranked.

Theorem 2. If $\{(X_i, y_i)\mid X_i \in R^p,\ y_i \in Y_C,\ i = 1, \ldots, n\}$ is the standardized dataset with the given categories and $Y_C = \{c_1, c_2, \ldots, c_g\}$ (g > 2) is the class set, then PLS-FM with nfac > 1 is a multi-feature measure (MFM) method.

Proof. Similar to the proof of Theorem 1; omitted. □

3.3. Algorithms
Now we present two new feature selection algorithms, PLS-based LRFE (LRFE_PLS) and PLS-based GRFE (GRFE_PLS). From Theorem 1, the feature measure (PLS-LFM) of the local solution model based on OVA decomposition is a multi-feature measure (MFM). Similarly, from Theorem 2, the feature measure (PLS-FM) of the global solution model based on class coding extension is an MFM. From the definition of RFE we know that any feature ranking which meets the MFM definition can be applied in an RFE process.

Algorithm 1. PLS_FM
Input:  TrnX (n × p)  // A training dataset with p features
        ClsY (n × 1)  // A vector of classes
        nfac          // Number of components
Output: Score         // The score of each feature
01: Encode the class label ClsY (n × 1) and generate ClassY (n × g) using Eq. (7)
02: Obtain T and W (or R) by calling SIMPLS(TrnX, ClassY, nfac), and calculate Rd
03: For i = 1 to p do
04:     Calculate vip for each feature in terms of Eq. (4)
05:     Score(i) = vip
06: End
07: Return Score

LRFE_PLS first uses the OVA decomposition method to obtain g two-category sub-problems, and a single-response PLS model is established for each of the two-category sub-problems. LRFE_PLS then calculates the LFM indicators based on the group of projection vectors t and projection direction vectors w generated in each of the single-response PLS models. The features with the smallest scores are removed in each iteration, and eventually a descending order of all features is given (i.e., features are ranked from highest to lowest importance). In each recursive loop, LRFE_PLS removes the features with the smallest score, and then the PLS models are rebuilt with the remaining features using OVA decomposition to obtain new scoring coefficients. LRFE_PLS implements this process recursively, and finally the feature ranking list is obtained. The detailed Algorithm 2 (LRFE_PLS: PLS-based Local-RFE) is provided below.

Algorithm 2. LRFE_PLS
Input:  TrnX (n × p)  // A training dataset with p features
        ClsY (n × 1)  // A vector of classes
        nfac          // Number of components (default value is 2)
Output: idx           // The index of each feature
01: Initialization: Set Score = [0, 0, ..., 0], g = number of categories, idx = [ ] and S = [1, 2, ..., p]
02: Repeat
03:     Update TrnX so that its features only include the features in S
04:     For r = 1 to g do
05:         Update ClsY so that the r-th class has positive labels and all other classes have negative labels
06:         Calculate VIP in terms of Eq. (4) by calling SIMPLS(TrnX, ClsY, nfac)
07:         Score = Score + VIP
08:     End
09:     Sort Score in descending order and record the sorted Score and the index Rank
10:     Find the index e in Rank so that Score(e) has the minimum magnitude among Score
11:     Update idx by adding the feature e on top of idx
12:     Update S by removing feature e from S
13: Until |S| < nfac
14: Update idx by adding S on top of idx
15: Return idx
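For readers who want to experiment with the approach, the following is a minimal Python sketch of the LRFE_PLS idea. It is not the authors' implementation: it uses scikit-learn's PLSRegression (a NIPALS-based routine) as a stand-in for SIMPLS, the standard VIP formula as an approximation of Eq. (4), and removes a single feature per iteration; the names vip and lrfe_pls are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip(pls):
    """Standard VIP scores of a fitted PLSRegression model (one score per input feature)."""
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    p = w.shape[0]
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)   # per-component explained SS of y
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(p * ((w_norm ** 2) @ ss) / ss.sum())

def lrfe_pls(X, y, nfac=2):
    """PLS-based Local-RFE (cf. Algorithm 2): OVA decomposition, summed VIP scores,
    recursive elimination of the currently least important feature.
    Returns feature indices ordered from most to least important."""
    classes = np.unique(y)
    remaining = list(range(X.shape[1]))
    eliminated = []                                        # removed features, least important last
    while len(remaining) >= nfac:
        Xs = X[:, remaining]
        score = np.zeros(len(remaining))
        for c in classes:                                  # one single-response PLS model per OVA sub-problem
            yc = (y == c).astype(float)
            pls = PLSRegression(n_components=nfac, scale=False).fit(Xs, yc)
            score += vip(pls)
        worst = int(np.argmin(score))                      # smallest summed VIP = least important
        eliminated.insert(0, remaining.pop(worst))
    return remaining + eliminated                          # residual features head the ranking
```

In practice a batch of low-ranked features (e.g., the lowest 50%, as described in Section 4.1) would be removed per iteration to keep the cost manageable on datasets with more than ten thousand features.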
Similarly, GRFE_PLS calculates the feature measure indicators based on the projection vectors t and projection direction vectors w generated in the PLS modeling process. The features with the smallest scores are removed in each iteration, and eventually a descending order of all
features is given. In each recursive loop, GRFE_PLS removes the features with the smallest scoring coefficients and then the PLS model is rebuilt using the remaining features to obtain new ranking coefficients. GRFE_PLS implements this process recursively, and finally the feature ranking list is obtained (earlier elimination means lower rank). The detailed Algorithm 3 (GRFE_PLS: PLS-based Global-RFE) is provided below. It should be noted that, before Algorithm 2 and Algorithm 3 return idx, the residual features in S are moved to the head of idx, since the dimension of the current data in the iterative process must be greater than or equal to nfac.
Algorithm 3. GRFE_PLS
Input:  TrnX (n × p)  // A training dataset with p features
        ClsY (n × 1)  // A vector of classes
        nfac          // Number of components
Output: idx           // The index of each feature
01: Initialization: Set idx = [ ] and S = [1, 2, ..., p]
02: Repeat
03:     Update TrnX so that its features only include the features in S
04:     Obtain Score by calling PLS_FM(TrnX, ClsY, nfac)
05:     Sort Score in descending order and record the sorted Score and the index Rank
06:     Find the index e in Rank so that Score(e) has the minimum magnitude among Score
07:     Update idx by adding the feature e on top of idx
08:     Update S by removing feature e from S
09: Until |S| < nfac
10: Update idx by adding S on top of idx
11: Return idx
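Analogously, the following is a hedged sketch of the global feature measure of Algorithm 1 (PLS_FM), again with scikit-learn's PLSRegression standing in for SIMPLS and the standard VIP formula approximating Eq. (4); the one-hot class matrix follows Eq. (7), and the name pls_fm is illustrative. Wrapping this scoring function in the same elimination loop as the local sketch above gives a GRFE_PLS analogue.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_fm(X, y, nfac=None):
    """Global PLS feature measure (cf. Algorithm 1): fit one multi-response PLS model
    on the class indicator matrix of Eq. (7) and score every feature by VIP."""
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)     # n x g one-hot matrix, Eq. (7)
    nfac = len(classes) if nfac is None else nfac          # empirical default: nfac = g (Section 4.2)
    pls = PLSRegression(n_components=nfac, scale=False).fit(X, Y)
    t, w, q = pls.x_scores_, pls.x_weights_, pls.y_loadings_
    ss = np.sum(t ** 2, axis=0) * np.sum(q ** 2, axis=0)   # per-component explained SS of Y
    w_norm = w / np.linalg.norm(w, axis=0)
    return np.sqrt(X.shape[1] * ((w_norm ** 2) @ ss) / ss.sum())
```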
3.4. Complexity analysis of the proposed algorithms
Consider an n × p matrix X, an n × q matrix Y, g categories and nfac latent variables. The time complexity of SIMPLS (we analyze the pseudo-code provided in the Appendix of de Jong (1993)) is $T_{SIMPLS} = O(nfac\, q^2 p + nfac\, np + qnp)$ (q < n ≪ p). The high-dimensional small-sample problems examined in this paper satisfy the condition q ≤ g < n ≪ p. In order to avoid the computational cost caused by cross-validation and to be easy to use, this paper also provides an empirical parameter setting method (see Section 4.2): for high-dimensional multi-category problems, the number of factors (components) nfac is less than or equal to g, the number of categories. For multi-category problems (g ≥ 3), our implementation meets the condition q = g ≥ nfac (Eq. (7)); thus, the time complexity of SIMPLS can be expressed as $O(g(n + g^2)p)$. For two-category problems (g = 2), our implementation meets the conditions q = 1 and g = nfac; hence, the time complexity of SIMPLS is $O(gnp)$. Our algorithms can then be analyzed as follows: the complexity of Algorithm 1 is $O(g(n + g^2)p)$. For Algorithm 2, the outer loop runs p − 2 iterations (k from p down to 3) and the time complexity of the inner loop is $O(g^2 nk)$; therefore, the complexity of Algorithm 2 is $O((g^2 n + \log p)p^2)$. The complexity of Algorithm 3 is $O((g^3 + gn + \log p)p^2)$.

4. Experiments
4.1. Experimental design
In order to verify the validity and efficiency of our proposed algorithms and make the experimental results more objective and credible, our experimental setup consists of the following three parts:
1. Experimental Datasets and Preprocessing: In the experiments, we chose a series of challenging real datasets from bioinformatics which have the following characteristics: (1) All the examined datasets are typical high-dimensional multi-category data; six datasets have over 10,000 dimensions, two datasets have over 10 categories, and some datasets have an unbalanced class distribution. (2) Some datasets have been studied by many researchers; for example, the GCM microarray dataset is recognized in the literature as a dataset that is difficult to analyze. For more detailed data sources and background descriptions, please refer to the Kent Ridge bio-medical dataset repository¹ and the Arizona State University (ASU) feature selection repository². In addition, data normalization is done for HoldOut CV, and for k-fold CV it is performed for each partition: the expression values of the training sets are transformed to zero mean and unit standard deviation across all samples, and the testing sets are transformed according to the means and standard deviations of their corresponding training sets. Table 1 summarizes the datasets.
2. Comparison of Algorithms and Parameter Selection: We make comprehensive comparisons with several feature selection methods provided by the ASU feature selection package³. Three state-of-the-art feature selection methods are used in our experiments: ReliefF, RF-based feature selection⁴ and SVM-based feature selection⁵. These algorithms have been widely used in various experiments. ReliefF and RF-based feature selection can be applied to multi-category problems directly (Boulesteix et al., 2008; Granitto & Burgos, 2009; Le Cao et al., 2009). For SVM-based feature selection, we implemented the OVA-based MSVM-RFE of Zhou and Tuck (2007); for high-dimensional multi-category problems, we found that with the setting of Duan, Rajapakse, Jagath, and Nguyen (2007), MSVM-RFE has better recognition results, and in our proposed framework we call it LRFE-SVM. All of these algorithms have good recognition performance and computing power (Wang et al., 2008). The parameters of these algorithms in our experiments are as follows. For ReliefF, the parameter K represents the number of neighbors; following Kononenko's suggestion, we set K to 10 (Kononenko, 1994; Sun, 2007). For RF-based feature selection, we set the parameter mtry (the number of input variables to "try" at each split) to √p (p is the number of features) and ntree (the number of trees to grow for each forest) to 500, and use default values for all other parameters, as used by Diaz-Uriarte and Alvarez de Andres (2006). For our LRFE-PLS, we use the default parameter (nfac equal to 2). For our GRFE-PLS, we set the parameter nfac to the number of categories.
3. Indicators of Performance Evaluation: In order to evaluate the multi-category problems, we use Cohen's Kappa measure (Galar et al., 2011) in addition to classifier accuracy. Cohen's Kappa measure ranges from −1 to 1: a value of −1 means total disagreement, 0 means random classification, and 1 means perfect agreement. When comparing accuracy, Cohen's Kappa measure is more objective and credible because it scores the successes independently for each class and then integrates them. A minimal sketch of the Kappa computation is given below. In addition, the experiments also compare the size of the feature subset (the number of features at maximum accuracy) for the different algorithms. We use SVM as the baseline classifier (kernel function: linear, default parameters) in the experiment section.
¹ http://datam.i2r.a-star.edu.sg/datasets/krbd/index.html
² http://www.featureselection.asu.edu
³ http://featureselection.asu.edu/downloadpackage.php
⁴ http://www.stat.berkeley.edu/users/breiman/RandomForests/
⁵ http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html
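As a concrete illustration of the Kappa measure used in item 3 above, the following is a minimal sketch (the helper name kappa_from_predictions is illustrative); it agrees with scikit-learn's cohen_kappa_score.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def kappa_from_predictions(y_true, y_pred):
    """Cohen's Kappa from the confusion matrix: (p_o - p_e) / (1 - p_e)."""
    cm = confusion_matrix(y_true, y_pred).astype(float)
    n = cm.sum()
    p_o = np.trace(cm) / n                                    # observed agreement (accuracy)
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2    # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
assert np.isclose(kappa_from_predictions(y_true, y_pred),
                  cohen_kappa_score(y_true, y_pred))          # both give 0.5 for this toy example
```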
Table 1. Datasets summary.

Type       | Dataset      | #Instances (Train:Test) | #Features | #Classes | #Instances per class (Train)
-----------|--------------|-------------------------|-----------|----------|------------------------------
Microarray | MLL          | 72 (57:15)              | 12582     | 3        | 20/17/20
           | SRBCT        | 83 (63:20)              | 2308      | 4        | 23/20/8/12
           | Stjude       | 327 (215:112)           | 12558     | 7        | 9/18/42/14/28/52/52
           | GCM          | 198 (144:54)            | 16063     | 14       | 8(×11)/16(×2)/24
           | CLL-SUB-111  | 111 (–:–)               | 11340     | 3        | 11/49/51
           | Breast       | 95 (–:–)                | 4869      | 3        | 18/44/33
           | Lung         | 203 (–:–)               | 12600     | 5        | 139/17/21/20/6
           | Tumors-11    | 174 (–:–)               | 12533     | 11       | 27/8/26/23/12/11/7/26/6/14/14
All experiments are implemented in MATLAB (2010a) on a desktop with an Intel Core i3 CPU (2.4 GHz) and 2 GB of RAM, and all comparison methods are implemented on the same training and testing sets, which are randomly generated by 5-fold CV. Because all the datasets have high dimensions and the RFE-based process is time-consuming, we adopt the following strategy: the features ranked in the lowest 50% are eliminated in each iteration (Guyon et al., 2002); thus, the complexity of Algorithms 2 and 3 is approximately O(p log p). Once the number of features is less than 100, one feature is eliminated at a time in each iteration. In the experiments, we apply the constraint that the number of features in the selected feature subset does not exceed 100. A sketch of this elimination schedule is given below. In addition, our experiments also examine the following questions: (1) Is the LRFE-PLS algorithm superior to the GRFE-PLS algorithm? (2) Is the feature selection algorithm LRFE-PLS superior to LRFE-SVM and GRFE-RF?
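The elimination schedule just described (drop the lowest-ranked 50% per iteration while more than 100 features remain, then one at a time) can be sketched as follows; elimination_schedule is an illustrative helper, not part of the original code.

```python
def elimination_schedule(n_features, floor=100, frac=0.5, min_keep=2):
    """Number of features to drop at each RFE iteration under the strategy of Section 4.1."""
    steps, remaining = [], n_features
    while remaining > min_keep:
        drop = max(int(remaining * frac), 1) if remaining > floor else 1
        drop = min(drop, remaining - min_keep)     # never drop below the minimum kept
        steps.append(drop)
        remaining -= drop
    return steps

# e.g. for the 12582 features of MLL: a few halving steps, then one-by-one removals
print(elimination_schedule(12582)[:10])
```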
4.2. An empirical analysis of parameter settings
The usual practice to determine the optimal parameter nfac for different datasets (Boulesteix, 2004) is to use a k-fold CV method. Obviously, this increases the amount of additional computation for high-dimensional data. However, the state-of-the-art algorithms used for comparison in this paper also have a parameter selection problem, and for them we use the parameter settings recommended in the literature (see Section 4.1). Therefore, for a fair comparison, we also give an empirical parameter setting: for the LRFE-PLS algorithm, the default value of the parameter nfac is 2; for the GRFE-PLS algorithm, the default value of the parameter is g, the number of categories. In this paper, we focus on the effectiveness of the proposed method, as well as quantitative comparison of our algorithms with other state-of-the-art methods; thus, we use the default parameters in subsequent experiments. To validate this parameter setting, the following experimental analysis is given. On four multi-category microarray datasets, we conducted 20 random runs of 5-fold CV in which the number of factors (nfac) varies from 2 to 10 (on GCM, from 2 to 30) to compare the recognition rates on the top 100 informative features selected by LRFE-PLS and GRFE-PLS. Figs. 2 and 3 show the relationship between the number of features and recognition accuracy using LRFE-PLS and GRFE-PLS with different values of the parameter nfac. Fig. 2 shows that with the default parameter (nfac = 2), the LRFE-PLS algorithm maintains a good recognition rate; the difference in recognition rate for different nfac values is less than 3.5%, which indicates that LRFE-PLS is not sensitive to the parameter nfac. Similarly, Fig. 3 shows that GRFE-PLS with too small an nfac performs worse, the worst rate occurring when nfac equals 2 on the SRBCT, Stjude and GCM datasets, while GRFE-PLS with too large an nfac does not significantly improve the recognition accuracy. Therefore, GRFE-PLS can also maintain a good recognition rate using the default nfac (= g). To some extent, the introduction of RFE (LRFE
and GRFE) can reduce the sensitivity of algorithm parameters, and thus enhance both robustness and accuracy of the algorithms. Therefore, for high-dimensional multi-category problems our empirical parameter setting is reasonable.
4.3. Analysis of the results based on HoldOut CV
The HoldOut CV (HOCV) method is the simplest kind of cross-validation: the dataset is separated into two sets, called the training set and the testing set. We use HOCV to assess the generalization ability of a model on four microarray datasets for which the training set and testing set were separated in the original literature. The detailed process is as follows: feature selection methods are applied to the high-dimensional training sets and the classifiers are trained on the selected feature subsets; then the feature selection and classification models are verified on the testing sets; finally, the recognition accuracy and Kappa measures of the different methods are used to evaluate the performance of the feature selection algorithms.
Table 2 provides the detailed comparison on the four tumor microarray datasets for the following twelve feature selection algorithms: three SFM-based ranking methods, F-test (Le Cao et al., 2009), Kruskal-Wallis (p-value) (Wei, 1981) and Information Gain (Uguz, 2011); three MFM-based ranking methods, PLS (Algorithm 1), RF (Diaz-Uriarte & Alvarez de Andres, 2006; Genuer, Poggi, & Tuleau, 2010) and ReliefF (Kononenko, 1994); three GRFE-with-MFM methods, PLS (Algorithm 3), RF (Granitto and Burgos, 2009) and ReliefF (Ruan et al., 2006); and three LRFE methods, PLS (Algorithm 2), ReliefF and SVM (Duan et al., 2007; Zhou and Tuck, 2007). When the classifier (SVM) reaches the maximum recognition accuracy and maximum Kappa on the test dataset, the smallest number of selected features, the number of Support Vectors (SVs) and the CPU time of the SVM classifier are reported.
Table 2 shows that the result based on LRFE-SVM is superior to the other feature selection methods on the MLL dataset: LRFE-SVM reaches 100% accuracy using only 2 features. Our proposed LRFE-PLS also achieves this accuracy using 5 features, and its SVM classifier uses the fewest SVs, which validates that the classifier has better generalization ability after feature selection based on LRFE-PLS is performed. On the SRBCT dataset, feature selection based on LRFE-PLS and LRFE-ReliefF can also achieve 100% accuracy using only 12 features; note that the SVM classifier uses fewer SVs after feature selection based on our proposed LRFE-PLS is applied. On the Stjude dataset, feature selection based on LRFE-SVM has the best performance: it uses 79 features and reaches 97.32% accuracy. However, the LRFE-PLS algorithm uses only 9 features and reaches an accuracy of 93.75%, and the corresponding SVM classifier uses the least number of SVs.
Fig. 2. Relationship between the number of features and recognition accuracy using LRFE-PLS algorithm with different nfac (20 random results of 5-fold CV, base classifier: linear SVM).
Fig. 3. Relationship between the number of features and recognition accuracy using GRFE-PLS algorithm with different nfac (20 random results of 5-fold CV, base classifier: linear SVM).
In the GCM dataset, feature selection based on LRFE-PLS has the best performance: it uses 53 features to achieve the highest accuracy of 74.07%. Furthermore, it can be seen from Table 2 that the results based on GRFE and LRFE are superior to those based on SFM and MFM, and the results based on LRFE-PLS are generally superior to those based on GRFE-PLS. From Table 2, we can also observe that our proposed algorithms have the fastest computing speed.
Three state-of-the-art MFM-based ranking methods in the LRFE and GRFE frameworks are compared here. The relationship between the number of selected features and the recognition accuracy on the testing sets is shown in Fig. 4. We clearly observe that our LRFE-PLS and GRFE-PLS also have good performance on the four biological datasets. LRFE-ReliefF is completely ineffective on the Stjude dataset (see Figs. 4 and 5), which is consistent with the corresponding
Kappa measures from Table 2: the Kappa measure is almost zero, which indicates that the result of this algorithm is similar to random guessing. This may be because Stjude is a sparse dataset and this sparsity has a serious impact on distance-based measures; this is consistent with the results reported in the literature (Le Cao et al., 2009).
Redundancy and relevancy are also important criteria for evaluating feature selection algorithms. Good feature selection algorithms are able to reduce the correlation among selected features; that is, they reduce the redundancy of the selected feature subset in order to improve the accuracy of subsequent classification. Fig. 5 shows the redundancy and correlation analysis of the selected feature subsets for the three MFM-
based algorithms with LRFE and GRFE. Fig. 5 contains the grayscale images of the correlation matrices (absolute values) for the top 30 informative features selected from the four biological datasets. The lighter the color, the stronger the (positive or negative) correlation among the selected features; the darkness of the color can, to some extent, measure the degree of redundancy among the selected features. It can be observed from Fig. 5 that GRFE (second column) and LRFE (third column) usually reduce the redundancy and relevancy of the selected feature subset when compared to the three MFM-based algorithms (first column). The correlation grayscales of the top 30 selected features based on LRFE-PLS and GRFE-PLS are more consistent across all four datasets.
Table 2 Performance comparison of feature selection based on SFM, MFM, GRFE and LRFE (HOCV).
Note: We use the following rules to evaluate the feature selection methods. If a feature selection method can produce the highest accuracy (or Kappa), its performance is best. If two or more feature selection methods obtain the same accuracy, the one using fewer features is better. If two or more methods achieve the same accuracy rate and select the same number of features, the one using fewest SVs and minimum running time is better. RF denotes Random Forest. The row with the shaded method produces the best result among all the involved twelve methods.
Fig. 4. Relationship between the number of features and recognition accuracy using different feature selection algorithms under MFM (top), LRFE (middle) and GRFE framework (bottom) (HOCV, Classifier: SVM).
Fig. 5. Correlation grayscales based on top 30 selected features using different feature selection algorithms with MFM, LRFE and GRFE. The numbers shown in the above figure represent the average absolute value of the correlation coefficient (HOCV).
It is easy to conclude that the LRFE-PLS and GRFE-PLS algorithms are competitive compared to the other state-of-the-art methods: our algorithms have good recognition rates (Fig. 4), and the selected feature subsets are also less redundant (Fig. 5). In the Stjude dataset, there are abnormal null values in the top 30 features selected by the LRFE-ReliefF algorithm.
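The redundancy statistic underlying Fig. 5 (the average absolute pairwise correlation among the top 30 selected features) can be computed along the following lines; the helper name mean_abs_correlation is illustrative.

```python
import numpy as np

def mean_abs_correlation(X, ranked_features, top=30):
    """Average absolute pairwise correlation among the top-ranked selected features."""
    sub = X[:, list(ranked_features)[:top]]
    corr = np.corrcoef(sub, rowvar=False)                     # feature-by-feature correlations
    off_diag = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
    return float(off_diag.mean())                             # lower means a less redundant subset
```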
4.4. Analysis of the results based on k-fold CV
In this section, we focus on LRFE- and GRFE-based feature selection compared with some state-of-the-art algorithms. In order to avoid "selection bias", stratified 5-fold CV is adopted in the following experiments on all multi-category datasets. All samples of the
datasets are randomly divided into 5 subsets, 4 of which are treated as the training set and the remaining one as the testing set. We repeat this process five times, so that each subset is used as a test set once. At the same time, we select the features from the training sets, train the classifier (SVM) on the selected feature subset and make predictions on the testing set; the recognition accuracy over the 5-fold CV is recorded. For a fair comparison, all of the feature selection methods are implemented on the same training and testing sets, which are randomly generated by 5-fold CV. Fig. 6 shows the average accuracy (top) and standard deviation (bottom) over 20 random runs of 5-fold CV for four different feature selection methods (GRFE-PLS, LRFE-PLS, LRFE-SVM and GRFE-RF).
Fig. 6 shows that for the MLL, Breast and Lung datasets LRFE-PLS has good recognition accuracy; its average accuracy is usually higher than that of the other algorithms. For the SRBCT and CLL-SUB-111 datasets, we observe that the average accuracy of LRFE-PLS is better than that of the other algorithms when the number of selected features is small. In summary, compared to the state-of-the-art LRFE-SVM and GRFE-RF, LRFE-PLS- and GRFE-PLS-based feature selection is valid. In addition, LRFE-PLS-based feature selection also usually has a smaller standard deviation, which indicates that our proposed LRFE-PLS has a lower dependence on the training set, i.e., better robustness, for high-dimensional multi-category datasets.
Feature selection aims at obtaining a feature subset with strong recognition ability and a small number of features. If a feature selection algorithm can obtain a higher accuracy while using fewer features, we consider that it has a stronger ability to remove redundant features. In order to compare the size of the subsets selected by different algorithms, we again adopt stratified 5-fold CV; that is, we select the features from the training sets, train the classifier (SVM) on the selected feature subset and make predictions on the testing set, and we record the maximum recognition rate and the corresponding size of the selected feature subset. As in our previous tests, all the feature selection methods are applied on the same training and testing sets, which are randomly generated by 5-fold CV, in order to obtain a fair comparison. Fig. 7 shows box plots describing the 20 random results of 5-fold CV with GRFE-PLS, LRFE-PLS, LRFE-SVM and GRFE-RF, with the recognition accuracy of each method (top) and the corresponding size of the feature subset (bottom). The following facts can be observed from Fig. 7: (1) Compared to the state-of-the-art LRFE-SVM and GRFE-RF, the PLS-based ranking method with LRFE and GRFE can also achieve a high recognition rate, while the size of the selected feature subset is relatively smaller. (2) Compared to GRFE-PLS, on five datasets (MLL, SRBCT, CLL-SUB-111, Lung and Tumors-11) the accuracy of LRFE-PLS is not lower than that of GRFE-PLS, while the size of the feature subset selected by LRFE-PLS is significantly smaller than that of GRFE-PLS. To some extent, we can conclude that LRFE-PLS's ability to remove redundant features is stronger and the selected feature subset is more compact.
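The evaluation protocol described above, in which feature selection is run only inside each training fold, the test fold is standardized with the training-fold statistics, and a linear SVM serves as the baseline classifier, can be sketched as follows (assuming a ranking function such as the lrfe_pls sketch from Section 3; names are illustrative).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_accuracy(X, y, rank_features, n_keep=100, seed=0):
    """Stratified 5-fold CV with feature selection performed only on the training folds."""
    accs = []
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=seed).split(X, y):
        mu, sd = X[tr].mean(axis=0), X[tr].std(axis=0) + 1e-12
        Xtr, Xte = (X[tr] - mu) / sd, (X[te] - mu) / sd       # test scaled by training statistics
        idx = list(rank_features(Xtr, y[tr]))[:n_keep]        # keep only the top-ranked features
        clf = SVC(kernel="linear").fit(Xtr[:, idx], y[tr])
        accs.append(clf.score(Xte[:, idx], y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```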
Fig. 6. Relationship between the number of features and the accuracy rate (average accuracy and standard deviation) (GRFE.PLS = GRFE-PLS, LRFE.PLS = LRFE-PLS, LRFE.SVM = LRFE-SVM and GRFE.RF = GRFE-RF) (20 random results of 5-fold CV).
Fig. 7. The relationship between recognition accuracy and the size of the selected subset (20 random results of 5-fold CV).

Table 3
Parametric and non-parametric statistical significance tests.
In the table, (+) indicates that the first algorithm is statistically better than the one it is compared with, (−) indicates the contrary, and (=) means that the two compared algorithms show no significant difference. The p-value associated with each comparison is also given.
4.5. Statistical tests on algorithm differences

We use the previous 20 random results of 5-fold CV to perform tests of statistical significance. In order to make the results more credible, we use both parametric and non-parametric paired tests to verify the statistical differences in prediction accuracy among the algorithms. Since the comparison is carried out on the same training and testing sets randomly generated by 5-fold CV, any difference detected by the paired tests (paired t-test and paired signed-rank test) is caused by the difference between algorithms rather than by the random composition of the sample sets. Table 3 gives the comparison results. For LRFE-PLS and GRFE-PLS, the p-value of the paired test is much less than the 0.01 significance level on all datasets except Tumors-11, which indicates that a statistically significant difference exists between the compared algorithms. On four datasets the recognition accuracy of LRFE-PLS is significantly better than that of GRFE-PLS; on two datasets (GCM and Tumors-11) the two algorithms show no significant difference; and GRFE-PLS is significantly better than LRFE-PLS on the remaining two datasets. Therefore, feature selection based on LRFE-PLS is significantly better than feature selection based on GRFE-PLS in our experiments. These results are consistent with the conclusions of Wang et al. (2008) and the discussion of Granitto and Burgos (2009). For LRFE-PLS and LRFE-SVM, the recognition accuracy of LRFE-PLS is significantly better than that of LRFE-SVM on six datasets; the two algorithms show no significant difference (paired t-test) on one dataset (Stjude); and LRFE-SVM is significantly better than LRFE-PLS on one dataset (CLL-SUB-111). Therefore, LRFE-PLS is significantly better than LRFE-SVM in our experiments. For LRFE-PLS and GRFE-RF, the recognition accuracy of LRFE-PLS is significantly better than that of GRFE-RF on six datasets, while GRFE-RF is significantly better than LRFE-PLS on two datasets (Stjude and GCM). Therefore, LRFE-PLS is significantly better than GRFE-RF in our experiments.
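Such paired comparisons can be computed with standard routines, for example SciPy's paired t-test and Wilcoxon signed-rank test. The sketch below assumes two equal-length arrays of accuracies obtained from the same 20 random 5-fold CV partitions; the (+)/(=)/(−) coding mimics the convention of Table 3, and the exact decision rule (here: significant only if both tests agree) is an assumption, not a rule stated in the paper.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon


def compare_algorithms(acc_a, acc_b, alpha=0.01):
    """Paired parametric (t-test) and non-parametric (signed-rank) tests on
    matched CV accuracies; acc_a and acc_b must come from the same partitions."""
    acc_a, acc_b = np.asarray(acc_a), np.asarray(acc_b)
    _, p_t = ttest_rel(acc_a, acc_b)
    _, p_w = wilcoxon(acc_a, acc_b)   # raises if all paired differences are zero
    if max(p_t, p_w) > alpha:             # not significant under both tests
        verdict = '='
    elif acc_a.mean() > acc_b.mean():     # significant, first algorithm better
        verdict = '+'
    else:
        verdict = '-'
    return {'paired_t_p': float(p_t), 'signed_rank_p': float(p_w),
            'verdict': verdict}
```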
5. Conclusions

This paper discusses feature ranking based on single-feature and multi-feature measures and provides an analytical framework for local recursive feature elimination. Within this framework, we propose a new PLS-based local recursive feature elimination algorithm which can efficiently perform feature selection for high-dimensional multi-category problems. In order to compare the effectiveness of the proposed methodology, we also present PLS-based Global-RFE, which takes all categories into consideration simultaneously. The comparison was carried out against ReliefF and against SVM-based and RF-based feature selection with LRFE and GRFE, all of which are strong methods. Our experimental results show that the proposed methods are effective: they achieve good recognition accuracy and Kappa measure, while the subset selected by LRFE-PLS is relatively smaller. In addition to biological datasets, our algorithms can be applied to image recognition, financial data analysis, and other domains. The fusion of heterogeneous feature measures in the LRFE framework, and the comparison between feature ranking and projection methods as well as their combined use, will be discussed in future work.

Acknowledgments

This project was funded by the Natural Sciences and Engineering Research Council of Canada, the National Natural Science Foundation of China (Nos. 61174161, 61201358 and 61203176), the Natural Science Foundation of Fujian Province of China (No. 2012J01154), the Specialized Research Fund for the Doctoral Program of Higher Education of China (Nos. 20100121120022 and 20120121120038), the Key Research Project of Xiamen City of China (No. 3502Z20123014), and the Fundamental Research Funds for the Central Universities in China (Xiamen University: Nos. 2011121047, 2013121025 and CBX2013015).

References

Allwein, E., Schapire, R. E., & Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Arauzo-Azofra, A., Aznarte, J. L., & Benitez, J. M. (2011). Empirical study of feature selection methods based on individual feature evaluation for classification problems. Expert Systems with Applications, 38, 8170–8177.
Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 33.
Boulesteix, A.-L., Porzelius, C., & Daumer, M. (2008). Microarray-based classification and clinical predictors: On combined classifiers and additional predictive value. Bioinformatics, 24(15), 1698–1706.
Boulesteix, A.-L., & Strimmer, K. (2006). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32–44.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3), 251–263.
Diaz-Uriarte, R., & Alvarez de Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(3).
Duan, K., Rajapakse, J. C., & Nguyen, M. N. (2007). One-versus-one and one-versus-all multiclass SVM-RFE for gene selection in cancer classification (Vol. 4447, pp. 47–56). Berlin Heidelberg: Springer-Verlag.
Dudoit, S., Fridlyand, J., & Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77–87.
Fan, J., & Fan, Y. (2008). High-dimensional classification using features annealed independence rules. Annals of Statistics, 36(6), 2605–2637.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition, 44(8), 1761–1776.
Genuer, R., Poggi, J.-M., & Tuleau, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14), 2225–2236.
Granitto, P. M., & Burgos, A. (2009). Feature selection on wide multiclass problems using OVA-RFE. Inteligencia Artificial, 13(44), 27–34.
Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83, 83–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37.
Ji, G., Yang, Z., & You, W. (2011). PLS-based gene selection and identification of tumor-specific genes. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41(6), 830–841.
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the ninth national conference on artificial intelligence.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European conference on machine learning. Springer-Verlag.
Kudo, M., & Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1), 25–41.
Le Cao, K. A., Bonnet, A., & Gada, S. (2009). Multiclass classification and gene selection with a stochastic algorithm. Computational Statistics and Data Analysis, 53(10), 3601–3615.
Martens, H., & Naes, T. (1989). Multivariate calibration. London: Wiley.
Nguyen, D. V., & Rocke, D. M. (2002). Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics, 18(9), 1216–1226.
Ruan, X. G., Li, Y. X., Li, J. G., Gong, D. X., & Wang, J. L. (2006). Tumor-specific gene expression patterns with gene expression profiles. Science in China, Series C, 49(3), 293–304.
Sun, Y. (2007). Iterative RELIEF for feature weighting: Algorithms, theories, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1035–1051.
Sun, Y., Todorovic, S., & Goodison, S. (2010). Local learning based feature selection for high dimensional data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1610–1626.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1), 267–288.
Uguz, H. (2011). A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 24, 1024–1032.
Wang, L., Zhou, N., & Chu, F. (2008). A general wrapper approach to selection of class-dependent features. IEEE Transactions on Neural Networks, 19(7), 1267–1278.
Wei, L. J. (1981). Asymptotic conservativeness and efficiency of Kruskal–Wallis test for k dependent samples. Journal of the American Statistical Association, 76(376), 1006–1009.
Wold, H. (1975). Path models with latent variables: The NIPALS approach. In H. M. Blalock et al. (Eds.), Quantitative sociology: International perspectives on mathematical and statistical model building (pp. 307–357). Academic Press.
Wold, S., Johansson, W., & Cocchi, M. (1993). PLS—partial least-squares projections to latent structures. In 3D QSAR in drug design: Theory, methods and applications. Springer-Verlag.
Yang, Z., You, W., & Ji, G. (2011). Using partial least squares and support vector machines for bankruptcy prediction. Expert Systems with Applications, 38(7), 8336–8342.
Youn, E., Koenig, L., Jeong, M., & Baek, S. (2010). Support vector based feature selection using Fisher's linear discriminant and support vector machine. Expert Systems with Applications, 37, 6148–6156.
Zhang, C., Lu, X., & Zhang, X. (2006). Significance of gene ranking for classification of microarray samples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(3), 312–320.
Zhou, X., & Tuck, D. P. (2007). MSVM-RFE: Extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics, 23(9), 1106–1114.