A Hybrid Feature Selection Method Based on Instance Learning and Cooperative Subset Search

Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.10.005

Research Highlights

• A hybrid feature selection method is proposed for classification in small sample size data sets.
• The filter step is based on instance learning, taking advantage of the small sample size of data.
• Few candidate feature subsets are generated since their number corresponds to the number of instances.
• Cooperative feature subset search is proposed with a classifier algorithm for the wrapper step.


• The proposed method improves classification accuracy and stability of feature selection.

Afef Ben Brahim (a,∗∗), Mohamed Limam (a,b)

(a) LARODEC, ISG, University of Tunis, Tunisia
(b) Dhofar University, Sultanate of Oman

∗∗ Corresponding author. Tel.: +21653824337; e-mail: [email protected] (Afef Ben Brahim)

ABSTRACT


The problem of selecting the most useful features from thousands of candidates in a low sample size data set arises in many areas of modern science. Feature subset selection is a key problem in such data mining classification tasks. In practice, it is very common to use filter methods. However, they ignore the correlations between genes, which are prevalent in gene expression data. On the other hand, standard wrapper algorithms cannot be applied because of their complexity. Additionally, existing methods are not specially conceived to handle the small sample size of the data, which is one of the main causes of feature selection instability. In order to deal with these issues, we propose a new hybrid, filter-wrapper, approach based on instance learning. Its main idea is to convert the problem of the small sample size into a tool that allows choosing only a few subsets of features in a filter step. A cooperative subset search, CSS, is then proposed with a classifier algorithm to serve as the evaluation system of the wrapper. Our method is experimentally tested and compared with state-of-the-art algorithms on several high-dimensional, low sample size cancer datasets. Results show that our proposed approach outperforms other methods in terms of accuracy and stability of the selected subset.

MSC: 41A05; 41A10; 65D05; 65D17

Keywords: Feature selection; hybrid; small sample size; classification; stability

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction


Today, high-throughput biotechnologies such as microarray and sequencing methods can easily measure the expression levels of thousands of genes. This is the case in cancer classification problems, where a predictive model is built on training data consisting of patients belonging to healthy or cancerous categories. The classification algorithm finds the relationship between the features, which are gene expression profiles, and the two class labels (Okun (2011)). As there are thousands of gene expressions and only a few samples in a typical gene expression data set, serious problems occur with the application of many traditional statistical methods. Overfitting of the classifier is one of these problems. It leads to very good and often perfect classification performance on the training data, but this perfect performance does not translate to new unlabeled data, resulting in very limited classifier generalization.

Kohane et al. (2003) explained that the small sample size in genomic applications may be due to the high cost of the microarrays. Each sample involves the measurement of tens of thousands of variables corresponding to the expression of tens of thousands of genes measurable with microarray technology. The result is a large number of features compared to the number of samples. Kohane et al. (2003) described this as a highly underdetermined system and explained that, given the relatively small number of observations, there is a large number of solutions in which the genes being measured could interact. Thus, due to the underdetermined nature of these systems, standard machine learning techniques do not hold up well, because they were developed under the assumption that the number of samples, m, is much larger than the feature dimensionality, d. Since it is difficult to increase m for the reasons explained above, dimension reduction is a solution to this problem. Reducing the number of genes will reduce the algorithm's variance, so machine learning, and more specifically feature selection methods, are useful for dealing with high dimensional data sets.

In practice, for such high dimensional data, it is very common to use filter methods that measure the strength of the relationship between each gene and the class label. However, Tolosi and Lengauer (2011) demonstrated that filters ignore the correlations between genes, which are prevalent in gene expression data due to gene co-regulation. The consequence is that many redundant differentiated genes are included, while useful but weakly differentiated genes may be omitted. On the other hand, Kohavi and John (1997) showed that standard wrapper algorithms cannot be applied because of their high computational complexity, due to the need to train a large number of classifiers. With tens of thousands of features, which is the case in gene expression microarray data analysis, a hybrid approach can be adopted. It should follow a filter model in the search step, selecting a small number of candidate subsets of features. A wrapper method is then applied to the reduced subsets to achieve the best possible performance with a particular learning algorithm. Accordingly, the hybrid model is expected to be more effective than a filter and less expensive than a wrapper.

Furthermore, a great variety of feature selection methods have been developed with a focus on improving the predictive accuracy of learning models while reducing dimensionality, but most existing methods do not take the small sample size problem into account in their design. Nevertheless, learning in the small sample case is of practical interest, one reason being the difficulty of collecting data for each object. Yet this data specificity causes problems not only for the predictive performance of learning algorithms, but also for the stability of feature selection results.

To deal with all these issues, we propose a hybrid feature selection approach which is specially designed for this type of data, in order to benefit from its small sample size. Our proposed method uses instance based candidate feature subset selection in a filter step. The key idea is to decompose an arbitrarily complex problem into a set of local ones through local learning of feature relevance, and then to find relevant features globally. Each instance proposes a candidate subset of the features most relevant for this instance. The small sample size makes this process feasible with acceptable running time. Thus the high dimensionality of the data is reduced to a few subsets of features whose number corresponds to the data sample size, and this is where the small sample size benefits the feature selection process. The candidate feature subsets are then integrated in a search procedure for the optimal feature subset, where CSS is used with a classifier algorithm as the evaluation system of the wrapper. The main goal of our proposed method is to reduce data set dimensionality while obtaining good performance in terms of accuracy, stability of feature selection and size of the obtained subset.

The remainder of the paper is organized as follows. Section 2 discusses the basic concepts of feature selection and their representative methods. In Section 3, we present our proposed hybrid feature selection approach. In Section 4 we evaluate the performance of our method against seven feature selection techniques on seven high dimensional data sets and one large scale data set. We finally conclude this paper in Section 5.

2. Basic concepts of feature selection

In most common feature selection techniques, an evaluation function is used to assign scores to subsets of features and a search algorithm is used to search for a subset with a high score. The evaluation function can be based on some general relevance measure of the features to the prediction (the filter model) or on the performance of a specific predictor (the wrapper model). A third category of algorithms that fuse filters and wrappers are known as hybrid methods.

2.1. Filters

Filters (Guyon and Elisseeff (2003)) do not depend on a specific type of predictive model; they only take characteristics of the data into consideration to select a best feature subset, or to obtain a feature ranking by assigning a score to each feature. This is done before the learning process begins. Filter methods are very fast and thus very useful for selecting features in high dimensional data sets. Several filter methods have been proposed in the literature and have shown their effectiveness in selecting the most relevant features and improving predictive performance. Some of the most popular filter methods are described in the following.

t-test filter: The statistical t-test is commonly used for feature selection. It is used in the form that defines the score of a feature as the ratio of the difference between its mean values for each of the two classes to the standard deviation. The weight of each feature is thus given by its computed absolute score.
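For concreteness, here is a minimal sketch of such a t-test scorer. The code is ours, not the authors'; the function name and the Welch-style standard error of the mean difference are our own assumptions.

```python
import numpy as np

def t_test_scores(X, y):
    """Absolute two-sample t-statistic per feature (higher = more relevant).

    X: (m, d) data matrix; y: binary labels in {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    # Standard error of the difference of class means, per feature.
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) +
                 X1.var(axis=0, ddof=1) / len(X1))
    return np.abs(diff) / (se + 1e-12)
```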

Minimum-Redundancy-Maximum-Relevance: The minimum Redundancy Maximum Relevance (mRMR) method proposed by Peng et al. (2005) is a mutual information based method. It selects features according to the maximal statistical dependency criterion. The mRMR method selects a feature subset that has the highest relevance with the target class, subject to the constraint that the selected features are mutually as dissimilar to each other as possible.
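To illustrate the criterion just described, the following is a rough sketch of the greedy mRMR search in its difference form. The histogram mutual information estimator, the bin count and the function names are our simplifications and are not taken from Peng et al. (2005).

```python
import numpy as np

def mutual_information(a, b, bins=10):
    """Histogram estimate of I(a; b) in nats for two 1-D arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)  # marginal of a
    p_b = p_ab.sum(axis=0, keepdims=True)  # marginal of b
    nz = p_ab > 0
    return float((p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])).sum())

def mrmr(X, y, k):
    """Greedy mRMR: maximize relevance to y minus mean redundancy."""
    d = X.shape[1]
    relevance = np.array([mutual_information(X[:, j], y) for j in range(d)])
    selected = [int(np.argmax(relevance))]
    candidates = set(range(d)) - set(selected)
    while len(selected) < k and candidates:
        def score(j):
            redundancy = np.mean([mutual_information(X[:, j], X[:, s])
                                  for s in selected])
            return relevance[j] - redundancy  # "MID" difference criterion
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```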

Relief: This method was proposed by Kira and Rendell (1992). Relief is based on instance learning; it selects features that separate instances from different classes. Robnik and Kononenko (2003) related the relevance evaluation criterion of Relief to the hypothesis of margin maximization, which explains why the algorithm provides superior performance in many applications.

Filters can be used as a preprocessing step to reduce space dimensionality and overcome overfitting (Kohavi and John (1997); Guyon and Elisseeff (2003)). When the number of features becomes very large, the filter model is usually chosen as it is computationally efficient, fast and independent of the classification algorithm. However, taking into account the predictive performance of a learning algorithm while selecting features can be of great interest, since enhancing this performance is one of the main objectives of feature selection. Filters ignore this aspect, and this is their major shortcoming.

2.2. Wrappers

It is of high interest that the search for the optimal feature subset takes into account the specific biases and performance of the predictive algorithm. Based on this, wrapper models use a specific classifier to evaluate the quality of the selected features (Kohavi and John (1997)). The performance measure of a learning algorithm, along with a statistical re-sampling technique such as cross validation (Kohavi (1995)), is used to select the best feature subset. Given a predefined classifier, a typical wrapper model iteratively produces a set of features based on a searching procedure, then evaluates the features using the performance of the classifier, until a feature set with the desired quality is reached. A wide range of search strategies can be used; some are described in the following.

Sequential feature selection: This is one of the most widely used wrapper techniques (Kohavi and John (1997); Aha and Bankert (1996)). It selects a subset of features by forward or backward search, which consists in adding or removing features, respectively, until certain stopping conditions are satisfied.
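A minimal forward-search sketch follows. The choice of scikit-learn's KNN classifier, the neighbor count, the cross-validated accuracy as evaluation function and the stopping rule are our assumptions, not details from the cited works.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features=20, cv=10):
    """Greedy forward wrapper search scored by cross-validated accuracy."""
    clf = KNeighborsClassifier(n_neighbors=3)
    selected, best_acc = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature and keep the best one.
        acc, j = max((cross_val_score(clf, X[:, selected + [j]], y,
                                      cv=cv).mean(), j)
                     for j in remaining)
        if acc <= best_acc:   # stop when no candidate improves the score
            break
        selected.append(j)
        remaining.remove(j)
        best_acc = acc
    return selected
```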

Randomized feature selection: Randomized wrapper algorithms search the next feature subset at random (Skalak (1994)). Single features or several features can be added at once, removed, or replaced in the previous feature set based on the effect on predictive performance. With these updates, the current set moves to the subset with the highest accuracy. The search procedure terminates when no subset improves over the current set.
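In the spirit of Skalak's random mutation hill climbing, a sketch of such a randomized search is given below; the initial subset density, iteration budget, classifier and acceptance rule are our assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def random_hill_climb(X, y, n_iter=200, cv=5, seed=0):
    """Random-mutation hill climbing over feature masks (Skalak-style)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    clf = KNeighborsClassifier(n_neighbors=3)

    def accuracy(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(clf, X[:, mask], y, cv=cv).mean()

    mask = rng.random(d) < 0.1         # sparse random starting subset
    best = accuracy(mask)
    for _ in range(n_iter):
        cand = mask.copy()
        cand[rng.integers(d)] ^= True  # flip one feature in or out
        acc = accuracy(cand)
        if acc >= best:                # accept non-worsening moves
            mask, best = cand, acc
    return np.flatnonzero(mask)
```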

Support vector machines and recursive feature elimination: While the search and evaluation procedures are separated in the previous wrapper methods, there exist methods that use an embedded model, where the search for an optimal subset of features is built into the classifier construction using its internal parameters. Guyon et al. (2002) introduced a feature selection method using the weight vectors of a Support Vector Machine (SVM) (Vapnik (1995)) in combination with recursive feature elimination (RFE) to form SVM-RFE. The ranking criterion is computed for all features based on their corresponding weights. This process is iterated, and the features with the smallest rankings, i.e. weights, are removed. The remaining features are selected. This iterative procedure is a backward feature elimination (Kohavi and John (1997)). The algorithm can be accelerated by removing more than one feature at a time.

Wrappers usually provide the best performing feature set for a particular type of model and have the ability to take feature dependencies into account, as they consider groups of features jointly. However, the lack of generality of wrappers is a drawback: different learning algorithms can lead to different feature selection results. Additionally, wrappers repeatedly build learning models on each candidate subset. Thus they are time consuming, and this is their major problem, especially if building the learning algorithm has a high computational cost, as reported by Saeys et al. (2007).
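A compact sketch of SVM-RFE built on a linear SVM is shown below, removing a fraction of the remaining features at each iteration as suggested by the acceleration remark above. The solver settings and the elimination fraction are our choices.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_select=50, step=0.1):
    """Backward elimination driven by linear-SVM weight magnitudes."""
    active = np.arange(X.shape[1])
    while active.size > n_select:
        svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, active], y)
        ranking = np.argsort(np.abs(svm.coef_.ravel()))  # smallest |w| first
        # Drop a fraction of the remaining features, but never overshoot.
        n_drop = min(max(1, int(step * active.size)), active.size - n_select)
        active = active[ranking[n_drop:]]
    return active
```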

2.3. Hybrid methods

Whereas the computational cost associated with the wrapper model makes it unfeasible when the number of features is high, the performance of the filter model is less satisfying. The hybrid model is a combination of the two approaches that overcomes these problems. Hybrid methods use a filter to generate a ranked list of features. On the basis of the order thus defined, nested subsets of features are generated and evaluated by a learning machine following a wrapper approach. Huang et al. (2007) proposed a hybrid genetic algorithm with two stages of optimization, where the mutual information between the predicted labels and the true classes serves as the fitness function for the genetic algorithm, and where an improved estimation of the conditional mutual information acts in a filter manner. Bermejo et al. (2011) developed a stochastic algorithm based on the GRASP meta-heuristic; it is a multi-start constructive method which constructs a solution in its first stage and then runs an improving stage over that solution. In the context of cancer diagnosis in high dimensional and small sample size data, we propose a hybrid method which converts the small sample size of microarray data sets into a tool for guiding the feature subset search. This approach is described in the following section.

3. Hybrid Instance Based Feature Selection algorithm

Our hybrid method uses a filter technique inspired by the Relief based methods (Kira and Rendell (1992)), which suffer from instability since their feature selection is based on instances picked at random. The feature weights may fluctuate with the instances (Robnik and Kononenko (2003)), making the selection sensitive to the data sampling, especially in the presence of noise and high-dimensional outliers. Our approach, in contrast, uses all instances to weight features. The proposed method uses instance based learning in a first step to generate candidate feature subsets (CFS) by local learning of feature relevance, and then finds relevant features globally using a combination technique. The objective of this feature selection is not to find a good feature subset for a certain instance, but for most instances, thanks to the combination of multiple candidate subsets. The introduction of a wrapper method in a second step, to evaluate the CFSs and guide the combination process, aims to search for the optimal feature subset and achieve the best possible performance with a particular learning algorithm. For this step, we propose a CSS algorithm where the search for the best feature subset is carried out with ten-fold cross validation. The hybrid instance based cooperative subset search (HIB-CSS) approach is illustrated in Figure 1 and its two step process is detailed in the following subsections.

Fig. 1. Hybrid Instance Based Feature Subset Search

3.1. Filter step: Instance based candidate feature subset selection

Let X be a matrix containing m training instances x_i = (x_i1, ..., x_id) ∈ R^d, where d is the number of features, and let y = (y_1, ..., y_m) be the vector of class labels for the m instances. Let A = {a_1, ..., a_d} be the set of features, where d ≫ m. In a preprocessing step of the optimal feature subset selection, the feature space is reduced to m candidate subsets. Each instance of the training data is an expert which proposes a candidate feature subset based on an instance feature weighting technique. Given a distance function, we find two nearest neighbors of each sample x_i for each feature a_j, one from the same class (called the nearest hit, NH) and the other from the different class (called the nearest miss, NM). The margin of x_ij is then computed as

$$M(x_{ij}) = d(x_{ij}, NM(x_{ij})) - d(x_{ij}, NH(x_{ij})) \qquad (1)$$

using a distance function. For this paper, we use the Manhattan distance to define a sample's margin and nearest neighbors, although other standard definitions may also be used. This weight definition is used in the well-known Relief algorithm (with the Euclidean distance) for the feature selection purpose (Kira and Rendell (1992)). It has been argued that there is no significant difference between the estimates of Relief algorithms using the two metrics, Euclidean or Manhattan (Robnik and Kononenko (2003)). We rely on this conclusion from the literature because the distance function used in our method is exactly the same as in the Relief algorithm, so we believe that the performance of our algorithm is not affected by the choice of the distance metric.

These scores are then normalized to unit length, so that the complete vector of scores has length one. This is done by dividing each score by the Euclidean length of the vector of scores. Thus, we obtain a weighted feature space for each instance x_i. The weight projected on each feature a_j is w_{a_j} = (w_{1,j}, ..., w_{m,j}). W, the matrix of normalized feature weights, is shown in Table 1.

Table 1. Matrix of feature weights

        a_1       a_2       ...     a_d
x_1     w_{1,1}   w_{1,2}   ...     w_{1,d}
x_2     w_{2,1}   w_{2,2}   ...     w_{2,d}
x_3     w_{3,1}   w_{3,2}   ...     w_{3,d}
...     ...       ...       ...     ...
x_m     w_{m,1}   w_{m,2}   ...     w_{m,d}

Then, the features in the space of each instance are ranked based on their weights. Note that an instance may provide a different feature ranking than another instance; thus a feature a_j may have different ranks depending on the instance considered. After this instance based feature ranking step, a candidate subset of cardinality n is chosen from the top ranked features of each instance. This pre-processing step leads to m candidate feature subsets CFS_All = {CFS_1, CFS_2, ..., CFS_m} of cardinality n. In our experiments, ten CFS cardinalities ranging from 1 to 20 features are tested for the parameter n to search for the optimal performance. It is important to note that the HIB-CSS approach assumes that if a feature is good but is not selected for some instances, it will appear among the top ranked features for a number of other instances. Also, if this feature is good enough, it will affect the classification results corresponding to the CFSs containing it, which will increase the probability of its selection in the final subset.
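The filter step can be sketched as follows, under one reading of the method: nearest hits and misses are found with the Manhattan distance over the full feature space (as in Relief), Eq. (1) is applied per feature, each instance's weight vector is normalized to unit length, and its top-n features form its CFS. All names are ours, and searching for neighbors in the full space rather than per feature is our interpretation; it assumes at least two instances per class.

```python
import numpy as np

def candidate_subsets(X, y, n):
    """Per-instance feature weights (Eq. 1, normalized) and top-n CFSs."""
    m, d = X.shape
    W = np.empty((m, d))
    for i in range(m):
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distances to x_i
        dist[i] = np.inf                     # exclude the instance itself
        same, other = (y == y[i]), (y != y[i])
        nh = np.where(same)[0][np.argmin(dist[same])]    # nearest hit
        nm = np.where(other)[0][np.argmin(dist[other])]  # nearest miss
        margin = np.abs(X[i] - X[nm]) - np.abs(X[i] - X[nh])  # per-feature Eq. (1)
        W[i] = margin / (np.linalg.norm(margin) + 1e-12)      # unit length
    cfs_all = [np.argsort(W[i])[::-1][:n] for i in range(m)]  # top-n per instance
    return W, cfs_all
```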

3.2. Wrapper step: Cooperative subset search

In this second step, the CFSs are integrated in a search process for the optimal feature subset, where the subset search technique and a classifier algorithm constitute the evaluation system of the wrapper. This wrapper approach is based on CSS, i.e. the feature selection decisions of the training instances are combined based on their effect on classification performance, not through an iterative process but in a parallel manner. The feature weights obtained in the filter step are given as inputs to the wrapper approach. The value of w_{i,j} depends on whether the feature a_j appears or not in the candidate subset CFS_i. Thus, the feature weights take the following values:

$$w_{i,j} = \begin{cases} w_{i,j} & \text{if } a_j \in CFS_i \\ 0 & \text{otherwise} \end{cases}$$

Since the CSS process includes a wrapper step to evaluate the CFSs, test data are required. Thus, in each iteration k of the initial cross validation, which is used to evaluate the performance of the overall feature selection method, another cross validation is applied only on the training data of this iteration and is used for validating the wrapper step. This means that the wrapper test data are obtained as a subsample of the training data of iteration k of the initial cross validation. The corresponding initially held out test data are only used to evaluate the whole feature selection model, once it is built for iteration k, so as not to bias the performance analysis. Thus, in the CSS process, a 10-fold cross validation is used once again to train a classifier algorithm on the projection of each CFS_i onto the corresponding training data, and the classification error β_i is calculated on the test data of this cross validation. Since the choice of the classifier affects the performance of the wrapper algorithm, two classification algorithms were tried and compared for this wrapper step: KNN (Cover and Hart (1967)) and SVM (Vapnik (1995)). The classifier algorithm is applied m times and m classification error rates are obtained.

Two thresholds ε_min and ε_max are used to identify the good CFSs and the bad CFSs. These parameters are set after running the classifier in the wrapper step once the CFSs are obtained, and their choice is heuristic, based on the error rates recorded for each candidate subset. Once these values are recorded, we take their minimum and maximum. The value of ε_min is set higher than the minimum error obtained, in order to leave room for a number of good candidates. The value of ε_max is set smaller than the maximum error, such that bad candidates are those whose error lies between this threshold and the maximum error. Based on the fixed thresholds, a CFS_i is considered good if its classification error β_i is less than ε_min, and bad if β_i is higher than ε_max.

The approximate choice of these parameters does not affect the proposed method much, as it is flexible. Suppose that a feature is rejected because it belongs to a bad candidate subset based on the comparison with ε_max, and suppose that this candidate subset would have been considered good if the value of ε_max had been set a little higher. The rejected feature may still be recovered: if it is really a relevant feature, it may appear in good subsets corresponding to smaller error rates.

After this categorization, we obtain FS_Good, the subset containing the features appearing in the good CFSs, and FS_Bad, the subset containing the features appearing in the bad CFSs. The K CFSs corresponding to the K error rates smaller than ε_max are called "Other CFSs". Their component features are gathered in a single feature subset called FS_Union, and their feature weights are updated based on their corresponding error rates. This weight adjustment penalizes more strongly the features of CFSs that result in higher classification error rates within the "Other CFSs" group. Then, the total weight of each feature in FS_Union is calculated as the aggregated sum of its weights over the candidate subsets in "Other CFSs". Based on the calculated weights, the features in FS_Union are ranked and the S best features are selected. This selected subset is updated in two steps. In the first step, the features of FS_Bad are removed from FS_Union. In the second step, the features of FS_Good are added to FS_Union. The resulting feature subset FS_Union is returned as the optimal feature subset. Note that the pre-selection of an S-feature subset can be omitted and replaced by testing several feature subset cardinalities once the final feature subset is obtained; the feature subset size that gives the best classification performance is then chosen, as will be seen in the experimental study of the algorithm in Section 4. The HIB-CSS algorithm is reported in Algorithm 1.

Algorithm 1 HIB-CSS
Input: [X, CFS_All, W, ε_min, ε_max]
  β_i = Apply Classifier(X, CFS_i), for i = 1, ..., m
  Obtain ERR = {β_1, β_2, ..., β_m}
  Find "Good CFSs" = CFSs corresponding to β_i < ε_min
  Obtain FS_Good = {a_j : a_j ∈ "Good CFSs"}
  Find "Bad CFSs" = CFSs corresponding to β_i > ε_max
  Obtain FS_Bad = {a_j : a_j ∈ "Bad CFSs"}
  Find "Other CFSs" = CFSs corresponding to β_i < ε_max; K is their number
  Obtain FS_Union = {a_j : a_j ∈ "Other CFSs"}
  Update the weight of each feature a_j ∈ FS_Union:
      w_{a_j} = Σ_{k=1}^{K} w_{k,j} / β_k
  Rank the features in FS_Union based on w_{a_j}
  Keep the S best ranked features in FS_Union
  Update FS_Union = FS_Union \ FS_Bad
  Update FS_Union = FS_Union ∪ FS_Good
Output: FS_Union
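A Python transcription of Algorithm 1 may help fix the bookkeeping. It is a sketch under our assumptions: the classifier and its parameters are our choices, the inner cross validation is collapsed to a single cross_val_score call for brevity, and ties and edge cases are not handled as the authors may have handled them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hib_css(X, y, W, cfs_all, eps_min, eps_max, S=50, cv=10):
    """Sketch of Algorithm 1: score CFSs, split them by error, and
    aggregate error-penalized weights over the "Other CFSs"."""
    clf = KNeighborsClassifier(n_neighbors=3)
    # Cross-validated error of the classifier on each candidate subset.
    err = np.array([1.0 - cross_val_score(clf, X[:, cfs], y, cv=cv).mean()
                    for cfs in cfs_all])
    fs_good = set().union(*(set(map(int, cfs_all[i]))
                            for i in np.flatnonzero(err < eps_min)))
    fs_bad = set().union(*(set(map(int, cfs_all[i]))
                           for i in np.flatnonzero(err > eps_max)))
    # "Other CFSs": every candidate whose error stays below eps_max.
    weights = {}
    for k in np.flatnonzero(err < eps_max):
        for j in cfs_all[k]:
            weights[int(j)] = weights.get(int(j), 0.0) + W[k, j] / (err[k] + 1e-12)
    fs_union = set(sorted(weights, key=weights.get, reverse=True)[:S])
    return sorted((fs_union - fs_bad) | fs_good)
```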

4. Experimental study

In this section we report the experimental setup and results of our proposed hybrid feature selection method, together with comparison results for seven existing methods. The considered algorithms are applied to several high dimensional data sets.

4.1. Data sets

The experiments are conducted on seven high dimensional, low sample size microarray data sets and one large scale data set. Classification in the seven high dimensional data sets is binary and the task is cancer diagnosis. In the diffuse large B-cell lymphoma (DLBCL) data set presented by Shipp et al. (2002), the classification task is the prediction of tissue types, where genes are used to discriminate DLBCL tissues from follicular lymphomas. The task in the Bladder cancer data set described by Dyrskjot et al. (2003) is the clinical classification of bladder tumors using microarrays. We also consider another Lymphoma data set, whose task is to discriminate between two types of lymphoma based on gene expression measured by microarray technology, as in Alizadeh et al. (2000); this data set contains missing values for numeric attributes, which we replace using the KNN imputation method proposed by Troyanskaya et al. (2001). The Prostate data set described by Singh et al. (2002) contains the expression levels of 12600 genes for 102 samples, including prostate tumors and normal samples. Another data set is the Breast cancer data set used by van 't Veer et al. (2002). The Central Nervous System (CNS) data set by Pomeroy et al. (2002) is also considered; it is concerned with the prediction of central nervous system embryonal tumor outcome based on gene expression. We also analyzed the malignant pleural mesothelioma and lung adenocarcinoma gene expression database (Lung cancer) used by Gordon et al. (2002), whose task is to differentiate between malignant pleural mesothelioma and lung adenocarcinomas. The task in the large scale data set, named Gisette (Guyon et al. (2004)), from the UCI repository and derived from images of handwritten examples, is to discriminate between two confusable handwritten digits, four and nine. With a number of instances larger than the feature set size, this data set differs considerably from the other data sets, which allows evaluating the HIB-CSS method on a large scale data set as well, even though it was designed especially for small sample size data sets. To deal with the high execution time that HIB-CSS would incur if it used all the instances of the Gisette data set, we opt for a random selection of only 100 instances to generate the CFSs in the filter step of our hybrid approach. Table 2 summarizes the characteristics of the data sets used.

Table 2. Datasets characteristics

Dataset      No. of samples   No. of features
DLBCL        77               7029
Bladder      31               3036
Lymphoma     45               4026
Prostate     102              12600
Breast       97               24482
CNS          60               7129
Lung         181              12533
Gisette      7000             5000

4.2. Performance metrics

We use K-fold stratified cross validation with K=10 to estimate the classification performance of the KNN algorithm and the stability of the feature selection of all methods on the eight data sets. The whole feature selection process is built on all subsets except the kth one; the performance of model k is then evaluated on the kth subset, which is unused for that model's estimation. Note that if we perform feature selection on all the data and then cross validate, the test data in each fold of the cross validation procedure is also used to choose the features, and this is what biases the performance analysis.

However, if we adopt the proper procedure and perform feature selection in each fold, there is no longer any information about the held out cases in the choice of features used in that fold. The K independently measured performances are then averaged. We also use this setting to measure the stability of the feature subsets obtained in multiple algorithm runs. Kalousis et al. (2004) argued that if the feature set varies greatly from one fold of the cross validation to another, it is an indication that the feature selection is unstable. We also evaluated the cardinality of the final selected feature subset (SFS), to compare our proposed hybrid method with existing feature selection methods.

Classification performance: The misclassification error (MCE) of a classifier is defined as the proportion of misclassified instances over all classified instances. This metric is important and commonly used to evaluate feature selection algorithms for classification tasks.

Stability: The stability of a feature selection algorithm is the robustness of the feature preferences it produces to differences in training sets drawn from the same generating distribution (Kalousis et al. (2007)). Measuring stability requires a similarity measure for feature preferences that quantifies to which extent K sets of s selected features share common features. We used cross validation to produce those sets, as described previously. Kuncheva (2007) proposed the following stability index:

$$Stab(S_1, \ldots, S_K) = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} \frac{|S_i \cap S_j| - \frac{s^2}{d}}{s - \frac{s^2}{d}}, \qquad (2)$$


where d is the total number of features, and S_i, S_j are two feature sets built from different partitions of the training samples. The ratio s^2/d corrects the bias of selecting common features in both sets by chance. This index satisfies -1 < Stab ≤ 1, and the greater its value, the larger the number of commonly selected features across the various sets. A negative stability index means that the common features shared by the feature sets are mostly due to chance.
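Eq. (2) translates directly into code. The snippet below also illustrates the fold-wise protocol of this subsection, selecting features inside each training fold only; it reuses the t-test scorer sketched in Section 2.1, and the subset size of 50 is an arbitrary choice of ours. It assumes all subsets share a common cardinality s with s < d.

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import StratifiedKFold

def kuncheva_stability(subsets, d):
    """Kuncheva's index, Eq. (2): mean chance-corrected pairwise overlap."""
    K, s = len(subsets), len(subsets[0])
    expected = s ** 2 / d  # overlap expected by chance
    pair_scores = [(len(set(a) & set(b)) - expected) / (s - expected)
                   for a, b in combinations(subsets, 2)]
    return 2.0 / (K * (K - 1)) * sum(pair_scores)

def fold_subsets(X, y, select, K=10, seed=0):
    """Select a subset inside each training fold only, to avoid bias."""
    skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed)
    return [select(X[tr], y[tr]) for tr, _ in skf.split(X, y)]

# Example usage (X, y assumed given):
# subsets = fold_subsets(X, y,
#     lambda Xt, yt: np.argsort(t_test_scores(Xt, yt))[::-1][:50])
# print(kuncheva_stability(subsets, X.shape[1]))
```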

4.3. Performance of proposed algorithm

We experiment with our proposed hybrid approach using ten initial CFS cardinalities ranging from 1 to 15 features to search for the optimal performance. The performance of HIB-CSS is tested with SFS cardinalities of up to 100 features. We record the MCE and the corresponding SFS cardinality for each setting of the proposed hybrid method in its two versions, i.e. with KNN or SVM as the classifier in the wrapper step. Table 3 shows the results of applying our proposed hybrid methods to the used data sets. The minimum MCE (Min MCE) obtained and the corresponding CFS and SFS cardinalities are reported.

Table 3. HIB-CSS results on the used data sets.

            HIB-CSS (KNN)                 HIB-CSS (SVM)
Dataset     # CFS   # SFS   Min MCE       # CFS   # SFS   Min MCE
DLBCL       5       28      0.0500        10      68      0.0125
Bladder     9       17      0.0333        9       50      0.0333
Lymphoma    5       7       0.0000        7       28      0.0000
Prostate    2       25      0.0991        9       69      0.0500
Breast      9       100     0.2889        10      92      0.2682
CNS         1       1       0.2319        8       36      0.2095
Lung        5       129     0.0000        5       25      0.0053
Gisette     5       189     0.0908        14      200     0.0637

For the Bladder and Lymphoma data sets, the two versions of HIB-CSS give exactly the same classification performance. For the other data sets, the predictive performance obtained with HIB-CSS (SVM) is better in most cases. HIB-CSS (KNN) achieved a best MCE of 0% (100% accuracy) on two data sets, Lymphoma and Lung cancer. For this version of the algorithm, SFS cardinalities range from 1 to 28 selected features for five data sets; for the other three, they are equal to or higher than 100. The number of features needed to obtain the optimal performance is higher with SVM in the wrapper step than with KNN, except for some cases where it is similar or much smaller, as for the Lung cancer data set. Nevertheless, if a small SFS is preferred, one can sacrifice a little classification performance. For example, the results give an MCE of 0.053 with only 11 features and a CFS cardinality of 1 for the Lung cancer data set using HIB-CSS (KNN), which is the same performance obtained with 25 features by HIB-CSS (SVM). So in this case, one can choose the SFS cardinality to work with based on the performance priority: optimal MCE or optimal SFS cardinality. It is important to notice that only 1 feature selected by HIB-CSS (KNN) gives the best MCE for the CNS data set, using a minimal CFS cardinality of only 1 feature as the initial setting. We also notice that the highest SFS cardinalities are recorded for the classification of Gisette, the large scale data set. This comparative study between SVM and KNN as classifiers in the wrapper step of HIB-CSS shows that the SVM algorithm yields similar or better classification performance than the KNN algorithm, but requires more features in most cases.

4.4. Comparison with other algorithms

In this section, we report comparison results of our proposed methods and seven feature selection methods: the Relief, t-test and mRMR algorithms, which are filters; Randomized, which is a wrapper; SVM-RFE and the Feature Generating Machine (FGM) algorithm, which are embedded methods; and a hybrid-sequential algorithm. The first five algorithms are described in Section 2. The FGM algorithm (Tan et al. (2010)) iteratively generates and learns a pool of informative and sparse feature subsets with respect to the input features to an SVM, then combines them using a Multiple Kernel Learning algorithm. In their experiments, Tan et al. (2010) showed that FGM is suitable for solving large-scale and very high dimensional problems. The hybrid-sequential algorithm used in our comparisons, in addition to these algorithms, uses a forward sequential feature selection with the KNN algorithm in a wrapper fashion. It finds important features from a reduced set of features obtained using the filter results of a t-test as a pre-processing step. We refer to this algorithm in our experiments as "Hyb-Seq".

The KNN classifier is used with all setups to evaluate classification performance. As with HIB-CSS, and in order not to settle on a local minimum of the 10-fold cross validation MCE, we tested the performance of the algorithms as a function of the number of features and recorded the optimal MCE rate, the corresponding SFS cardinality and the stability of each algorithm.

Comparative results: We report in Table 4 the best classification results, the corresponding feature subset cardinality and the stability. Recall that algorithms that best optimize the MCE-stability couple are considered good feature selection algorithms.

Table 4. MCE rates, SFS cardinalities and stability on the used data sets.

                        HIB-CSS   HIB-CSS   Hyb-Seq   Randomized  SVM-RFE   FGM       t-test    Relief    mRMR
                        (KNN)     (SVM)
DLBCL      Min MCE      0.0500    0.0125    0.1039    0.0519      0.0519    0.0649    0.1429    0.0649    0.0779
           #SFS         28        68        10        40          55        33        55        65        45
           Stability    0.6893    0.7186    0.2078    0.0709      0.5102    0.6326    0.8791    0.6335    0.5657
Bladder    Min MCE      0.0333    0.0333    0.0323    0.0645      0.0323    0.0968    0.0323    0.0645    0.0645
           #SFS         17        50        36        38          15        21        12        60        20
           Stability    0.5925    0.5445    0.7389    0.0667      0.4372    0.4992    0.6877    0.4650    0.4039
Lymphoma   Min MCE      0.0000    0.0000    0.0000    0.0667      0.0222    0.1556    0.0222    0.0222    0.0000
           #SFS         7         28        27        30          20        10        15        30        12
           Stability    0.2336    0.5125    0.7158    0.0686      0.4350    0.7304    0.7234    0.4291    0.5951
Prostate   Min MCE      0.0991    0.0500    0.0686    0.0882      0.1275    0.4804    0.1078    0.2157    0.0784
           #SFS         25        69        45        87          10        7         4         38        25
           Stability    0.6553    0.6279    0.3423    0.0216      0.5863    0.9111    0.7944    0.6275    0.7096
Breast     Min MCE      0.2889    0.2682    0.5258    0.2680      0.2990    0.5052    0.5258    0.4433    0.2784
           #SFS         100       92        1         75          65        10        2         6         30
           Stability    0.4718    0.4884    1.0000    0.0168      0.3806    1.0000    1.0000    0.3628    0.3355
CNS        Min MCE      0.2319    0.2095    0.3167    0.3333      0.3000    0.5000    0.3667    0.3667    0.3000
           #SFS         1         36        4         17          8         10        25        90        50
           Stability    0.6222    0.2586    0.1217    0.0448      0.4911    0.9644    0.3970    0.4974    0.3792
Lung       Min MCE      0.0000    0.0053    0.0276    0.0552      0.0055    0.0718    0.0166    0.0110    0.0055
           #SFS         129       25        8         80          65        10        15        20        4
           Stability    0.8804    0.4959    0.1217    0.0112      0.4831    1.0000    0.8368    0.8409    0.6610
Gisette    Min MCE      0.0908    0.0637    0.0967    0.1330      0.0799    0.0476    0.0963    0.4711    0.0458
(large-    #SFS         189       200       92        87          80        100       100       100       100
scale)     Stability    0.1010    0.2333    0.1568    0.0202      0.4032    0.5063    0.7671    0.5891    0.9512

Results show that HIB-CSS, whether using SVM or KNN, is among the best algorithms in terms of optimal MCE. For six out of eight data sets, HIB-CSS (SVM) is in first place. The stability of the two versions of HIB-CSS is similar for four data sets. However, this is not the case for the other data sets, where HIB-CSS (KNN) can be considered stable for the CNS and Lung data sets while HIB-CSS (SVM) cannot. Thus HIB-CSS (KNN) is preferred here if we want to optimize classification and stability at the same time. This algorithm gives the best MCE-stability performance for four data sets (DLBCL, Breast, CNS and Lung cancer). For the Gisette large scale data set, the HIB-CSS algorithms give good classification performance compared to the best ones, especially HIB-CSS (SVM) with 6.37%. However, their stability on this data set is poor. This can be explained by our choice to randomly select a small number of instances to generate the CFSs. Using a higher number of instances for this purpose would enhance stability and also classification results, but would consume a high execution time. We should highlight once again that, in spite of this, the obtained classification performance is good.

The mRMR filter shows a perfect classification performance on the Lymphoma data set, equal to the HIB-CSS and Hyb-Seq algorithms, and the best MCE-stability performance for the Prostate cancer data set. mRMR's classification and stability performance is especially good on the Gisette large scale data set, where it achieves the best classification performance (4.58%) and a stability of 95%, with a large margin over the other methods. For the other data sets, the mRMR and SVM-RFE methods give competitive classification results: SVM-RFE gives better results for some data sets, while mRMR outperforms it on others. For the Bladder and Lymphoma data sets, the Hyb-Seq algorithm gives the best MCE-stability couple. It gives the minimum MCE for the Prostate cancer data set coupled with poor stability (34%), and a very bad MCE (52%) coupled with a perfect stability (100%) for the Breast cancer data set; thus the MCE-stability trade-off is not satisfied in these cases. The best stability is often achieved by the FGM or t-test methods. The t-test gives a good classification-stability trade-off for the Bladder, Lymphoma and Lung cancer data sets. However, it gives the worst MCE results for other data sets (14%, 52% and 36% MCE for the DLBCL, Breast and CNS cancer data sets, respectively). The same behaviour is observed with FGM, which is based on the SVM algorithm, where the high stability achieved is often coupled with poor classification performance. This affects the reliability of FGM and the t-test in terms of classification performance. Note that the good classification performance of FGM on large scale data is also confirmed in our experiments, as FGM follows mRMR with a competitive performance (4.76%) on the Gisette data set. The Randomized algorithm shows very poor stability on all data sets, and it achieves the optimal MCE for only one data set (Breast cancer). Finally, the Relief filter performs well on the DLBCL and Lung cancer data sets but is not especially good on the others.

To summarize, the HIB-CSS methods are the winners of the MCE-stability optimization challenge. Except for the Randomized algorithm, which can be considered unstable, the performance of all other algorithms varies depending on the data set considered; they generally show good classification or good stability performance, but not both at the same time, making them less reliable than the HIB-CSS algorithm.

5. Conclusion

To avoid overfitting the data, feature selection is required when the number of features is large with respect to the sample size. Filter feature selection methods are well suited to such applications as they are fast. However, they ignore the correlations between features and their interaction with the learning algorithm, and thus may have modest classification performance. Wrappers, on the other hand, use the bias of the induction algorithm to select features and generally perform better. However, the computational burden of wrapper methods is prohibitive on large data sets.

In this paper, we propose a new hybrid approach, HIB-CSS, which is based on cooperative subset search. The proposed approach uses instance learning in its filter step. The main goal is to speed up the feature subset selection process by reducing the number of wrapper evaluations, while maintaining good performance in terms of accuracy, stability and size of the obtained subset. The main idea of this approach is that it converts the problem of the small sample size into a tool that allows choosing only a few subsets of variables to be analyzed, since the number of CFSs is the number of instances. Therefore, the number of wrapper evaluations decreases significantly.

Our method is experimentally tested and compared with existing feature selection algorithms on seven high-dimensional, low sample size data sets and one large scale data set. Results show that HIB-CSS is the winner of the MCE-stability optimization challenge, outperforming the compared feature selection methods. In future work, we intend to investigate using the small sample size to create an efficient filter that deals with the feature redundancy problem, as filters remain very attractive methods for high dimensional data sets due to their low computational cost.

References

Aha, D.W., Bankert, R.L., 1996. A comparative evaluation of sequential feature selection algorithms, in: Learning from Data: Artificial Intelligence and Statistics V. Lecture Notes in Statistics, volume 112. Springer.

Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., Powell, J.I., Yang, L., Marti, G.E., Moore, T., Hudson, J.J., Lu, L., Lewis, D.B., Tibshirani, R., Sherlock, G., Chan, W.C., Greiner, T.C., Weisenburger, D.D., Armitage, J.O., Warnke, R., Levy, R., Wilson, W., Grever, M.R., Byrd, J.C., Botstein, D., Brown, P.O., Staudt, L.M., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511.

Bermejo, P., Gámez, J.A., Puerta, J.M., 2011. A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets. Pattern Recognition Letters 32, 701–711.

Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 21–27.

Dyrskjot, L., Thykjaer, T., Kruhoffer, M., Jensen, J.L., Marcussen, N., Hamilton-Dutoit, S., Wolf, H., Orntoft, T.F., 2003. Identifying distinct classes of bladder carcinoma using microarrays. Nature Genetics 33, 90–96.

Gordon, G., Jensen, R., Hsiao, L., Gullans, S., Blumenstock, J., Ramaswamy, S., Richards, W., Sugarbaker, D., Bueno, R., 2002. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 62, 4963–4967.

Guyon, I., Ben-Hur, A., Gunn, S., Dror, G., 2004. Result analysis of the NIPS 2003 feature selection challenge. Advances in Neural Information Processing Systems 17, 545–552.

Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V., 2002. Gene selection for cancer classification using support vector machines. Machine Learning 46, 389–422.

Huang, J., Cai, Y., Xu, X., 2007. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters 28, 1825–1844.

Kalousis, A., Prados, J., Hilario, M., 2007. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems 12, 95–116.

Kalousis, A., Prados, J., Sanchez, J.C., Allard, L., Hilario, M., 2004. Distilling classification models from cross validation runs: an application to mass spectrometry, in: ICTAI, IEEE Computer Society, pp. 113–119.

Kira, K., Rendell, L., 1992. A practical approach to feature selection, in: Sleeman, D., Edwards, P. (Eds.), International Conference on Machine Learning, pp. 368–377.

Kohane, I.S., Kho, A.T., Butte, A.J., 2003. Microarrays for an Integrative Genomics. MIT Press, Cambridge, MA.

Kohavi, R., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2, Morgan Kaufmann Publishers Inc., pp. 1137–1143.

Kohavi, R., John, G.H., 1997. Wrappers for feature subset selection. Artificial Intelligence 97, 273–324.

Kuncheva, L., 2007. A stability index for feature selection, in: Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, Innsbruck, Austria, pp. 390–395.

Okun, O., 2011. Feature Selection and Ensemble Methods for Bioinformatics: Algorithmic Classification and Implementations. IGI Global.

Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1226–1238.

Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y.H., Goumnerova, L.C., Black, P.M., Lau, C., Allen, J.C., Zagzag, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D.N., Mesirov, J.P., Lander, E.S., Golub, T.R., 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442.

Robnik, S.M., Kononenko, I., 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53, 23–69.

Saeys, Y., Inza, I., Larranaga, P., 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517.

Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., Ray, T.S., Koval, M.A., Last, K.W., Norton, A., Lister, T.A., Mesirov, J., Neuberg, D.S., 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 9, 68–74.

Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.

Skalak, D.B., 1994. Prototype and feature selection by sampling and random mutation hill climbing algorithms, in: Machine Learning: Proceedings of the Eleventh International Conference, Morgan Kaufmann, pp. 293–301.

Tan, M., Wang, L., Tsang, I.W., 2010. Learning sparse SVM for feature selection on very high dimensional datasets, in: ICML, Omnipress, pp. 1047–1054.

Tolosi, L., Lengauer, T., 2011. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27, 1986–1994.

Troyanskaya, O.G., Cantor, M., Sherlock, G., Brown, P.O., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525.

Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.

van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., Friend, S.H., 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.