Feature selection considering two types of feature relevancy and feature interdependency


Liang Hu, Wanfu Gao, Kuo Zhao, Ping Zhang, Feng Wang

Expert Systems With Applications (2017). DOI: 10.1016/j.eswa.2017.10.016
Received 8 May 2017; revised 5 October 2017; accepted 6 October 2017.


Highlights

• A novel feature selection method is proposed based on information theory.
• Our method divides feature relevancy into two categories.
• We performed experiments on 12 public data sets.
• Our method outperforms five competing methods in terms of accuracy.
• Our method selects fewer features when it achieves the highest accuracy.


Feature selection considering two types of feature relevancy and feature interdependency

Liang Hu (a), Wanfu Gao (a), Kuo Zhao (a), Ping Zhang (b), Feng Wang (a,*)

(a) College of Computer Science and Technology, Jilin University, Changchun 130012, China
(b) College of Software, Jilin University, Changchun 130012, China

Abstract


Feature selection based on information theory, which selects a group of the most informative features, has extensive application fields such as machine learning, data mining and natural language processing. However, numerous previous methods suffer from two common defects. (1) Feature relevancy is defined without distinguishing candidate feature relevancy from selected feature relevancy. (2) Some interdependent features may be misinterpreted as redundant features. In this study, we propose a feature selection method named Dynamic Relevance and Joint Mutual Information Maximization (DRJMIM) to address these two defects. DRJMIM includes four stages. First, relevancy is divided into two categories: candidate feature relevancy and selected feature relevancy. Second, according to candidate feature relevancy, which is measured by joint mutual information, some redundant features are selected. Third, these redundant features are combined with a dynamic weight to reduce the selection possibility of true redundant features while increasing that of false redundant features. Finally, the most informative and interdependent features are selected and true redundant features are eliminated simultaneously. Furthermore, our method is compared with five competitive feature selection methods on 12 publicly available data sets. The classification results show that DRJMIM performs better than the other five methods, and its statistical significance is verified by a paired two-tailed t-test. Meanwhile, DRJMIM selects fewer features when it achieves the highest classification accuracy.

∗ Corresponding author. Email addresses: [email protected] (Liang Hu), [email protected] (Wanfu Gao), [email protected] (Kuo Zhao), [email protected] (Ping Zhang), [email protected] (Feng Wang)


Keywords: Feature selection, Information theory, Feature relevancy, Feature interdependency

1. Introduction


With the arrival of the era of big data, high dimensional data has become prevalent. These high dimensional data are produced in different fields such as social media, bioinformatics, image processing and natural language processing. In general, high dimensional data can improve classification performance because more features are available. Nonetheless, some irrelevant and redundant features in high dimensional data are not conducive to classification performance and significantly increase the computational complexity and memory storage requirements; this is even worse for data with thousands of features but only a few dozen samples. A typical example is microarray data. Such data sets make data processing challenging. Therefore, dimensionality reduction is necessary. Dimensionality reduction has two basic technologies: feature selection and feature extraction. Traditional feature extraction methods include Principal Component Analysis (PCA) (Jolliffe et al., 2002), Linear Discriminant Analysis (LDA) (Mika et al., 1999) and Canonical Correlation Analysis (CCA) (Hardoon et al., 2004). In this study, feature selection is our concern rather than feature extraction because feature selection retains the characteristics of the original features. Feature selection has three main categories: wrapper, embedded and filter methods (Bolón-Canedo et al., 2014).

• Wrapper methods rely on the predictive performance of a predefined classifier to evaluate the quality of selected features.

• Embedded methods generally use machine learning models for classification, and an optimal subset of features is then built by the classifier.

• Filter methods have lower computational cost than wrappers. The disadvantage of filter methods is that they have no interaction with the classifier.


As a result, wrapper and embedded methods are inefficient and prone to over-fitting, whereas filter methods are simpler and faster, especially for high dimensional data. Our method belongs to the filter category. Information theory is widely employed in filter methods; one important reason is that mutual information and conditional mutual information can measure nonlinear correlation (Zhao et al., 2016; Dionisio et al., 2004). Although new measures such as the maximal information coefficient (Reshef et al., 2011) have been proposed, the fairness of the maximal information coefficient is still controversial. Traditional filter methods based on information theory consist of two parts: a feature redundancy term and a feature relevancy term. Minimizing feature redundancy while maximizing feature relevancy is the common criterion, regardless of whether the method takes a linear or nonlinear form. However, previous methods (Peng et al., 2005; Lewis, 1992; Battiti, 1994; Fleuret, 2004) have two common defects. (1) The first is that previous methods treat candidate feature relevancy as equal to selected feature relevancy. We hold that candidate feature relevancy should consider the class and every selected feature, whereas selected feature relevancy should only concern the class. There are two main reasons why selected feature relevancy does not consider other already selected features. First, selected features were regarded as candidate features while they were waiting to be chosen, so they have already been evaluated against the other selected features. Second, the time complexity becomes prohibitive if every selected feature must also consider all other selected features: letting n denote the number of selected features, the time complexity is O(n²) in that case, whereas it is only O(n) when candidate features consider every selected feature. (2) The second is that the majority of methods ignore interdependent features when they eliminate redundant features. Redundant features include true redundant features and false redundant features. False redundant features either do not bring information about the class or acquire information similar to that of selected features; however, they are strongly relevant to the class when they work together. An intuitive example is the XOR problem in Table 1: f1 and f2 cannot distinguish the class Y alone, but f1 and f2 can discriminate the class Y when they are both employed. The majority of methods consider f1 and f2 as redundant features; however, f1 and f2 are false redundant features, that is, interdependent features (a small numerical sketch follows Table 1).


Table 1: XOR problem

f1  f2  Y
1   0   1
1   1   0
0   1   1
0   0   0
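To make the interdependency concrete, the short Python sketch below (our own illustration, not code from the paper) estimates I(f1; Y), I(f2; Y) and the joint mutual information I(f1, f2; Y) on the XOR data of Table 1: individually each feature carries zero information about Y, while jointly they determine it.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Empirical Shannon entropy (base 2) of a sequence of discrete symbols.
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    # I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from empirical frequencies.
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# XOR data from Table 1
f1 = [1, 1, 0, 0]
f2 = [0, 1, 1, 0]
Y  = [1, 0, 1, 0]

print(mutual_information(f1, Y))                 # 0.0 bits: f1 alone says nothing about Y
print(mutual_information(f2, Y))                 # 0.0 bits: f2 alone says nothing about Y
print(mutual_information(list(zip(f1, f2)), Y))  # 1.0 bit: together they determine Y
```

Any criterion that only penalizes pairwise redundancy would score f1 and f2 as useless once one of them is picked, which is exactly the failure mode discussed above.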


To address the first issue, we divide feature relevancy into two categories: candidate feature relevancy and selected feature relevancy. Candidate feature relevancy is calculated with the selected features taken into account, whereas selected feature relevancy is calculated between the class and the selected features only. With respect to the second issue, our method introduces a dynamic weight that measures the relationship between a candidate feature and every selected feature; a similar weight is used in (Sun et al., 2013a,b). We combine the dynamic weight with the redundant features. In this way, the selection possibility of true redundant features is reduced while that of interdependent features is promoted. The details are presented in Section 4.3. To evaluate the classification performance of our method, we perform experiments on an artificial example and 12 real-world data sets. Our method is compared with three traditional methods and two relevant methods. To account for the independence between filter methods and classifiers, we employ the average accuracy of three classifiers: Support Vector Machine with RBF kernel (SVM), Naïve Bayes (NB) and 3-Nearest Neighbors (3NN). The experimental results are shown in Figures 1-6. To elaborate further, we report the highest accuracies of the six methods and the number of selected features corresponding to those highest accuracies. The 12 benchmark data sets come from the UCI Repository (Lichman, 2013) and (Li et al., 2016), and include both multi-class and binary classification problems. The experimental results on the 12 data sets show that our method yields better classification performance than the other methods. Meanwhile, our method selects fewer features than the compared methods when it achieves the highest classification accuracy.

The rest of this paper is organized as follows. Some basic concepts of information theory are described in Section 2. Traditional filter methods and state-of-the-art methods based on information theory are reviewed in Section 3. Our feature selection method is detailed in Section 4. We display the experimental results and analysis in Section 5. Finally, we present a brief conclusion and our plans for future work in Section 6.

2. Basic concepts of information theory

In this section, we describe a series of basic concepts of information theory (Cover and Thomas, 2012). Information is an abstract concept that is not straightforward to measure. In 1948, Shannon presented the concept of information entropy to solve the problem of quantifying information (Shannon, 2001). Suppose X, Y and Z are three random variables, with X = {x_1, x_2, ..., x_L}, Y = {y_1, y_2, ..., y_M} and Z = {z_1, z_2, ..., z_N}. Then, the entropy of X is defined as

H(X) = -\sum_{i=1}^{L} p(x_i) \log p(x_i)    (1)

p(x_i) = \frac{\text{number of instances with value } x_i}{\text{total number of instances}}    (2)

where p(x_i) is the probability mass function. The base of the logarithm is 2, so entropy is measured in bits and 0 ≤ H(X) ≤ log_2(L). Joint entropy is defined as the uncertainty of the simultaneous occurrence of two variables:

H(X, Y) = -\sum_{i=1}^{L} \sum_{j=1}^{M} p(x_i, y_j) \log p(x_i, y_j)    (3)

The conditional entropy is defined as

H(X|Y) = -\sum_{i=1}^{L} \sum_{j=1}^{M} p(x_i, y_j) \log p(x_i|y_j)    (4)

Conditional entropy is the amount of uncertainty left in X when the variable Y is known. Joint entropy and conditional entropy are related as follows:

H(X, Y) = H(X) + H(Y|X)    (5)
H(X, Y) = H(Y) + H(X|Y)    (6)

Next, we introduce another important concept, mutual information. Mutual information is the amount of information provided by variable X that reduces the uncertainty of variable Y:

I(X; Y) = \sum_{i=1}^{L} \sum_{j=1}^{M} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)}    (7)

Mutual information can also be expressed in terms of entropy:

I(X; Y) = H(Y) - H(Y|X)    (8)
I(X; Y) = H(X) - H(X|Y)    (9)
I(X; Y) = H(X) + H(Y) - H(X, Y)    (10)

If the two random variables are independent, the mutual information is 0. Mutual information is symmetric and nonnegative:

I(X; Y) = I(Y; X)    (11)

Similar to conditional entropy, conditional mutual information is the mutual information of X and Y when the random variable Z is known. Conditional mutual information is also nonnegative:

I(X; Y|Z) = H(X|Z) - H(X|Y, Z)    (12)

In (Cover and Thomas, 2012), joint mutual information is defined as

I(X; Y, Z) = I(X; Z) + I(X; Y|Z)    (13)
I(X; Y, Z) ≥ 0    (14)

ACCEPTED MANUSCRIPT

3. Related work

PT

ED

M

AN US

CR IP T

Feature selection aims at selecting one feature subset that retains the information of original feature as much as possible. Therefore, it is important to measure the relevancy between classes and features, and the redundancy between features. There are various measurement criteria for feature selection such as similarity based methods, sparse learning based methods and statistical based methods. Some state-of-the-art methods include Sparse Discriminative Feature Score (Yan and Yang, 2015), Fisher Discriminate Analysis and F-score (Song et al., 2017), Deep Sparse SVM (Cong et al., 2015), Fisher Score (Duda et al., 2012), Maximum Relevance Minimum Multicollinearity (Senawi et al., 2017), and Hamming Distance Based Binary Particle swarm optimization (Banka and Dara, 2015). In this section, we focus on the methods based on information theory. As mentioned above, a large number of filter methods based on information theory are designed to increase the relevancy between features and classes while reducing the redundancy between features (Duda et al., 2012). In this study, we denote that Xk represents the candidate feature. Xj represents the selected feature. Y represents the class. S represents the selected feature subset, and its initial value is null. J(.) is a feature selection criterion where, the higher the value of J(Xk ), the more important the feature Xk is. Majority of methods can be summarized in the following framework (Brown et al., 2012): X X J(Xk ) = I(Xk ; Y ) − β I(Xj ; Xk ) + λ I(Xj ; Xk |Y ) (15) Xj ∈S

Xj ∈S

AC

CE

The framework reflects the criterion that minimizes the feature redundancy and maximizes the feature relevancy. β and λ are two nonnegative parameters between zero and one. It is noteworthy that the third term of the framework means the conditional redundancy. The conditional redundancy should be maximized when it exists (Guo and Nixon, 2009; El Akadi et al., 2008). Mutual Information Maximization, which is also named Information Gain (IG) (Lewis, 1992), is a relatively early classical feature selection method based on information theory. IG measures the mutual information between features and classes, then it sorts features by mutual information in descending order. IG adopts the threshold that the user defines to obtain a subset 8

ACCEPTED MANUSCRIPT

from original feature set. The criterion of IG is as follows: J(Xk ) = I(Xk ; Y )

(16)

AN US

Xj ∈S

CR IP T

The criterion (16) considered as a variation of Formula (15) when both β and λ are equal to zero. IG considers the relevancy between features and the class while ignoring the redundancy between features. Battiti (Battiti, 1994) proposes a first-order incremental search method named Mutual information Feature Selection (MIFS). The specific formula is as follows: X J(Xk ) = I(Xk ; Y ) − β I(Xj ; Xk ) (17)

CE

PT

ED

M

Battiti considers that selected features should not only be strongly relevant to the class, but also be irrelevant P with each other. Redundant features are evaluated by the second term β Xj ∈S I(Xj ; Xk ). MIFS is also one variation of Formula (15) when β is between zero and one, and λ is zero. MIFS-U (Kwak and Choi, 2002) is a variation of MIFS. MIFS-U performs better than MIFS by obtaining a better mutual information between input features and output classes. However, MIFS and MIFS-U have the common defect: with the increasing number of selected features, the redundancy term grows faster than the relevancy term. As a result, some irrelevant features are selected. To address this issue, Peng et al. (Peng et al., 2005) proposed a classical feature selection method named minimal-redundancy-maximal-relevance (mRMR). The specific formula is as follows: J(Xk ) = I(Xk ; Y ) −

1 X I(Xj ; Xk ) |S| X ∈S

(18)

j

AC

mRMR criterion sets the value of β the reverse of the number of selected features and λ is zero. Through this way, the impact of redundancy between features is gradually decreased. MIFS-ND (Hoque et al., 2014)is a filter method which incorporates with a greedy algorithm named Non-dominated Sorting Genetic Algorithm-II (Deb et al., 2000). MIFS-ND calculates the domination count and dominated count instead of the input the parameter β.

9

ACCEPTED MANUSCRIPT

Xj ∈S

CR IP T

Joint Mutual Information (JMI) (Yang and Moody, 1999) considers that the candidate feature should be complementary to the selected features given the class. It is denoted as follows: X J(Xk ) = I(Xj , Xk ; Y ) (19)

Although it cannot be directly regarded as a variation of Formula (15), it is reduced to Formula (15) in (Brown et al., 2012) , both β and λ are equal to 1 . It is as follows: |S| 1 X 1 X I(Xj ; Xk ) + I(Xj ; Xk |Y ) |S| X ∈S |S| X ∈S j

j

(20)

AN US

J(Xk ) = I(Xk ; Y ) −

Double Input Symmetrical Relevance (DISR) (Meyer et al., 2008) normalizes joint mutual information and performs well in microarray. It is expressed as follows: X I(Xj , Xk ; Y ) (21) J(Xk ) = H(X j , Xk , Y ) X ∈S

M

j

ED

DISR is one type of nonlinear methods that is based on information theory. It is similar to the criterion of JMI in (19). Conditional Mutual Information Maximization (CMIM) (Fleuret, 2004) is different from previous methods, it cannot be reduced to the Formula (15). The criterion is as follows:

PT

J(Xk ) = argmaxXk ∈F −S (minXj ∈S (I(Xk ; Y |Xj )))

(22)

AC

CE

CMIM selects the minimum conditional mutual information between the candidate feature and the class when each selected feature is given. F is the total number of the original features. If I(Xk ; Y |Xj ) is low, it means that the candidate is redundant with the selected feature or the candidate feature is irrelevant with the class. Then, the candidate feature with the maximum minXj ∈S (I(Xk ; Y |Xj )) is selected. By this means, the most informative feature is chosen. The basic idea of CMIM is also to choose the feature that is relevant to the class and less redundant with selected features. Joint Mutual Information Maximization (JMIM) (Bennasar et al., 2015) is proposed in 2015. It is expressed as follows: J(Xk ) = argmaxXk ∈F −S (minXj ∈S (I(Xk , Xj ; Y ))) 10

(23)

ACCEPTED MANUSCRIPT

AN US

CR IP T

JMIM employs both the ’maximum of the minimum’ criterion and the joint mutual information between candidate features, selected features and the class. JMIM addresses the problem of overestimation the significance of some features, which happens when cumulative summation is employed. In (Bennasar et al., 2015), JMIM is mainly compared to CMIM and JMI. In some data sets, JMIM achieves better classification performance. Gene Selection via Dynamic Relevance (DRGS) (Sun et al., 2013a) is proposed for gene selection in 2013. DRGS designs a kind of weight that retains useful intrinsic groups of interdependent genes. The weight is not only suitable for gene data sets but also other data sets in different fields (Sun et al., 2013b; Zeng et al., 2015). Therefore, we introduce this weight to our work. The methods mentioned above are summarized in Table 2. Table 2: Traditional feature selection methods based on information theory

Candidate feature relevancy

Selected feature relevancy

Whether considering interdependent features

IG MIFS mRMR JMI DISR CMIM JMIM DRGS

I(Xk ; Y ) I(Xk ; Y ) I(Xk ; Y ) I(Xk , Xj ; Y ) I(Xk , Xj ; Y ) I(Xk ; Y |Xj ) I(Xk , Xj ; Y ) I(Xk ; Y )

I(Xk ; Y ) I(Xk ; Y ) I(Xk ; Y ) I(Xk , Xj ; Y ) I(Xk , Xj ; Y ) I(Xk ; Y |Xj ) I(Xk , Xj ; Y ) I(Xk ; Y )

No No No No No No No Yes

PT

ED

M

Methods

AC

CE

As can be seen from Table 2, traditional methods consider that candidate feature relevancy and selected feature relevancy are consistent. Meanwhile, the majority of previous studies ignore the existence of interdependent features. In our method, a dynamic weight is introduced that calculates the relationship between a candidate feature and every selected feature. Through combining the dynamic weight with redundant features, the selection possibility of true redundant features is reduced, in contrast, the selection possibility of interdependent features is promoted. Our solution is detailed in Section 4. 11

ACCEPTED MANUSCRIPT

4. The proposed method for feature selection

CR IP T

This section introduces our feature selection method named Dynamic Relevance and Joint Mutual Information Maximization (DRJMIM) that is a basically supervised. The details of DRJMIM is threefold. First, feature relevancy, feature redundancy and feature interdependency are refined. They are similar to the literatures (Sun et al., 2013a; Bennasar et al., 2015). The main differences are that we divide the feature relevance into two categories: candidate feature relevancy and selected feature relevancy. Second, the theory evidence of DRJMIM is analyzed. Third, we present the specific steps of the method, and provide the pseudo code.

AN US

4.1. Definitions of feature relevancy, feature redundancy and feature interdependency Definition 4.1 (Candidate feature relevancy). Feature Xk is more relevant to the class label Y than feature Xk0 in the context of the already selected feature Xj if

M

I(Xk , Xj ; Y ) > I(Xk0 , Xj ; Y )

(24)

ED

Candidate feature relevancy is used when new features are waiting to be chosen.

CE

PT

Definition 4.2 (Minimum joint mutual information). Let F denote the entire set of features, and S represents the subset of already selected features. Let Xk ∈ F −S, and Xj ∈ S. Minimum joint mutual information is expressed as follows: min(I(Xk , Xj ; Y ))

(25)

AC

The minimum value of joint mutual information means that the candidate feature Xk shares with the class label Y when it is joined with every selected feature individually.

Definition 4.3 (Selected feature relevancy). An already selected feature Xj is more relevant to the class label Y than another already selected feature Xj0 if I(Xj ; Y ) > I(Xj0 ; Y ) 12

(26)

ACCEPTED MANUSCRIPT

Selected feature relevancy is used when a candidate feature updates the correlation with selected features. Accoring to Eq. (13), we can obtain: (27)

CR IP T

I(Y ; Xk , Xj ) = I(Y ; Xj ) + I(Xk ; Y |Xj )

It indicates that the value of candidate feature relevancy is different from selected feature relevancy except I(Xk ; Y |Xj ) = 0, i.e., Xk is independent of Xj .

AN US

Definition 4.4 (Feature redundancy and feature interdependency). The correlation between a candidate feature Xk and the class Y can be reduced or unchanged by the knowledge of an already selected Xj , if the feature Xk is redundant or irrelevant with the feature Xj ,that is, I(Xk ; Y |Xj ) ≤ I(Xk ; Y )

(28)

The correlation between a candidate feature Xk and the class can be increased by the knowledge of an already selected feature Xj , if the feature Xk is interdependent to the feature Xj , that is,

M

I(Xk ; Y |Xj ) > I(Xk ; Y )

(29)

AC

CE

PT

ED

4.2. Dynamic Relevance and Joint Mutual Information Maximization (DRJMIM) Feature selection aims to select the feature that can obtain the maximum value of I(Xk , S; Y ). I(Xk , S; Y ) is replaced by I(Xk , Xj ; Y ) in our work due to complexity of already selected feature subset S. According to the definition 1, a candidate feature Xk is good only if I(Xk , Xj ; Y ) is large for every already selected feature Xj . More specifically, I(Xk , Xj ; Y ) is low if either Xk brings similar information with at least one of selected features about the class Y or Xk does not bring new information about Y . By taking the candidate feature with the maximum min(I(Xk , Xj ; Y )), it ensures that the candidate feature Xk is both informative and different from the preceding ones in term of predicting the class Y . However, some features that have lower I(Xk , Xj ; Y ) may be interdependent rather than redundant with selected features. Therefore, we introduce a weight that is employed in (Sun et al., 2013a). Features are weighted according to their correlation with the selected features. And the weight of features will be dynamically updated

13

ACCEPTED MANUSCRIPT

after each candidate feature has been selected. According to the definitions above, the weight is described as follows: I(Xk ; Y |Xj ) − I(Xk ; Y ) H(Xk ) + H(Y )

(30)

CR IP T

C Ratio(k, j) = 2

We combine the updated weight C Ratio with min(I(Xk , Xj ; Y )) to obtain a value named DR jmi(Xk ). The criterion of DRJMIM is as follows: DR jmi(Xi ) = minXj ∈S (I(Xi , Xj ; Y )) ∗ DR(Xi )

AN US

DR(Xi ) is an intermediate variable, i.e.,

DR(Xi ) = DR(Xi ) + C Ratio(Xi , Xj ) ∗ I(Xj ; Y )

(31)

(32)

AC

CE

PT

ED

M

Therefore, we can find that DR jmi(Xk ) consider two types of feature relevancy and C Ratio. Based on analysis of C Ratio. DR jmi(Xk ) will be increased when the candidate feature is interdependent with already selected features. In contrast, DR jmi(Xk ) will be decreased or unchanged when the candidate feature is redundant or irrelevant with already selected features. By selecting the feature Xk with the maximum of DR jmi(Xk ), our method obtains the feature Xk that is most informative for classification. The details of DRJMIM are presented in Algorithm1.

14

ACCEPTED MANUSCRIPT

Algorithm 1 DRJMIM

ED

M

AN US

CR IP T

Input: A training sample D with a entire feature set F = {X1 , X2 , , Xn }and the class Y ;User-specified threshold K. Output: The selected feature subset S. 1: S ← Φ; 2: k ← 0; 3: for i = 1 to n do 4: Calculate the mutual information I(Xi ; Y ); 5: end for 6: Initial the DR jmi(Xi ) = DR(Xi ) = I(Xi ; Y ) for each feature; 7: while k < K do 8: Select the feature Xj with the largest DR jmi(Xi ); 9: S = S ∪ {Xj }; 10: F = F − {Xj }; 11: for each candidate feature Xi ∈ F do 12: Calculate C Ratio(Xi , Xj ); 13: Update DR(Xi ) = DR(Xi ) + C Ratio(Xi , Xj ) ∗ I(Xj ; Y ); 14: Update DR jmi(Xi ) = minXj ∈S (I(Xi , Xj ; Y )) ∗ DR(Xi ); 15: end for 16: k = k + 1; 17: end while

AC

CE

PT

The proposed method includes three stages. In the first stage (lines1-6), DRJMIM initiates the selected feature subset S, the threshold K, DR jmi and DR. It also calculates the mutual information between the class and each feature. In the second stage (lines 7-10), the method selects the most informative feature and updates the selected feature subset S and entire feature set F . In the last stage (lines 11-17), DRJMIM calculates C Ratio, and updates DR and DR jmi. 4.3. Comparison with other relevant work DRJMIM differs from the relevant methods in following manners. 1. Differ from JMIM (Bennasar et al., 2015), Although DRJMIM and JMIM both employ joint mutual information and the ’maximum of the 15

ACCEPTED MANUSCRIPT

4.

CR IP T

ED

5.

AN US

3.

M

2.

minimum’ criterion, our method introduces a dynamic weight. We combine the weight with min(I(Xk , Xj ; Y )) because a candidate feature Xk with low I(Xk , Xj ; Y ) may be interdependent with some already selected features rather than redundant. Differ from DRGS (Sun et al., 2013a), DRGS defines the feature relevancy as I(Xj ; Y ). Our method divides feature relevancy into two categories: candidate feature relevancy and selected feature relevancy. We hold that selected features should be taken into account when candidate feature relevancy is calculated. Differ from mRMR (Peng et al., 2005), mRMR is a linear feature selection method that employs both feature relevancy and feature redundancy. Our method employs the ’maximum of the minimum’ criterion, it is a nonlinear method that also considers about interdependent features. In addition, we refine the feature relevancy. Differ from DISR (Meyer et al., 2008), DISR normalizes the joint mutual information and it is a linear feature selection method. Although our method also employs the joint mutual information, we do not normalize it. DRJMIM is a nonlinear feature selection method that employs the ’maximum of the minimum’ criterion. Differ from Interaction Weight based Feature Selection (IWFS) (Zeng et al., 2015), IWFS defines the feature relevancy as a normalized measure of mutual information between features and classes. The feature relevancy represents both candidate feature relevancy and selected relevancy that is different from our method.

AC

CE

PT

4.4. Complexity analysis Suppose the number of features to be selected is k, the number of samples in data set is M , and the total number of features is N . The time complexity of mutual information, conditional mutual information and joint mutual information is O(M ) because all samples need to be visited for probability estimation. As a result, the time complexity of JMIM, DRGS, MRMR, IG and DISR is O(kM N ). Differ from DRGS, the minimum joint mutual information of each round of DRJMIM is needed, thus the time complexity of DRJMIM is O(k 2 M N ). DRJMIM is a generally more computationally intensive since more information is taken into account.

16

ACCEPTED MANUSCRIPT

5. Experimental results and analysis

CR IP T

In this section, we show the experimental results of an artificial example and 12 real-world data sets. All experiments are executed on an Intel Core i7 with a 3.40GHZ processing speed and 8GB main memory.

AN US

5.1. Case study on an artificial example In order to illustrate the effectiveness of the proposed method more clearly, an artificial example D = (O, F, Y ) is given as shown in Table 3. O represents the instance, F represents the feature and Y represents the class, where O = {o1 , o2 , ..., o10 }, F = {X1 , X2 , ..., X8 }. We execute DRJMIM, DRGS, JMIM, MRMR, DISR and IG on the artificial example. Table 3: An artificial example

X1

X2

X3

o1 o2 o3 o4 o5 o6 o7 o8 o9 o10

1 1 1 0 0 0 1 1 1 1

0 0 1 1 1 1 0 0 0 0

1 0 0 0 1 0 0 0 1 1

ED

PT

X4

X5

X6

X7

X8

Y

1 1 0 1 0 1 0 0 0 0

0 1 0 1 0 0 1 1 1 1

1 1 1 1 1 1 0 0 0 0

1 0 0 1 0 1 0 1 1 1

0 1 0 0 1 0 1 0 0 1

1 0 1 1 0 1 1 0 1 1

M

objects

CE

By DRJMIM, we have

AC

step1: The mutual information between Xi and Y is calculated. I(X1 ; Y ) = 0.0016, I(X2 ; Y ) = 0.0058, I(X3 ; Y ) = 0.0058, I(X4 ; Y ) = 0.0058, I(X5 ; Y ) = 0.0058, I(X6 ; Y ) = 0.0058, I(X7 ; Y ) = 0.0913, I(X8 ; Y ) = 0.0913; C Ratio(X1 ) = C Ratio(X2 ) = C Ratio(X3 ) = C Ratio(X4 ) = 1, C Ratio(X5 ) = C Ratio(X6 ) = C Ratio(X7 ) = C Ratio(X8 ) = 1; DR jmi(X1 ) = 0.0016, DR jmi(X2 ) = 0.0058, DR jmi(X3 ) = 0.0058, DR jmi(X4 ) = 0.0058, DR jmi(X5 ) = 0.0058, DR jmi(X6 ) = 0.0058, 17

ACCEPTED MANUSCRIPT

CR IP T

DR jmi(X7 ) = 0.0913, DR jmi(X8 ) = 0.0913. The maximum value DR jmi(Xi ) of every Xi is compared, and X7 is selected. Then, the candidate feature set is {X1 , X2 , X3 , X4 , X5 , X6 , X8 }. step2: The updated C Ratio(Xi ) , M inI(Xi ) and DR jmi(Xi ) are calculated.

AC

CE

PT

ED

M

AN US

(1) When k = 2, C Ratio(X1 ) = 0.2138, C Ratio(X2 ) = 0.0645, C Ratio(X3 ) = 0.2518, C Ratio(X4 ) = 0.2518, C Ratio(X5 ) = 0.0645, C Ratio(X6 ) = 0.2518, C Ratio(X8 ) = 0.0673; M inI(X1 ) = 0.2813, M inI(X2 ) = 0.1568, M inI(X3 ) = 0.3303, M inI(X4 ) = 0.3303, M inI(X5 ) = 0.1568, M inI(X6 ) = 0.3303, M inI(X8 ) = 0.2448; DR jmi(X1 ) = 0.0059, DR jmi(X2 ) = 0.0018, DR jmi(X3 ) = 0.0095, DR jmi(X4 ) = 0.0095, DR jmi(X5 ) = 0.0018, DR jmi(X6 ) = 0.0095, DR jmi(X8 ) = 0.0239. The maximum value DR jmi(Xi ) of every Xi is compared, and X8 is selected. Then, the candidate feature set is {X1 , X2 , X3 , X4 , X5 , X6 }. (2) When k = 3, C Ratio(X1 ) = 0.2138, C Ratio(X2 ) = 0.2518, C Ratio(X3 ) = 0.0645, C Ratio(X4 ) = 0.2518, C Ratio(X5 ) = 0.2518, C Ratio(X6 ) = 0.6308; M inI(X1 ) = 0.2813, M inI(X2 ) = 0.1568, M inI(X3 ) = 0.1568, M inI(X4 ) = 0.3303, M inI(X5 ) = 0.1568, M inI(X6 ) = 0.3303; DR jmi(X1 ) = 0.0072, DR jmi(X2 ) = 0.0039, DR jmi(X3 ) = 0.0024, DR jmi(X4 ) = 0.0107, DR jmi(X5 ) = 0.0039, DR jmi(X6 ) = 0.0222. The maximum value DR jmi(Xi ) of every Xi is compared, and X6 is selected. Then, the candidate feature set is {X1 , X2 , X3 , X4 , X5 }. (3) When k = 4, C Ratio(X1 ) = −0.0018, C Ratio(X2 ) = 0.0223, C Ratio(X3 ) = 0.1568, C Ratio(X4 ) = 0.0223, C Ratio(X5 ) = 0.0223; M inI(X1 ) = 0.0058, M inI(X2 ) = 0.0323, M inI(X3 ) = 0.1568, M inI(X4 ) = 0.0322, M inI(X5 ) = 0.0322; DR jmi(X1 ) = 4.1485 × 10−5 , DR jmi(X2 ) = 1.2973 × 10−3 , 18

ACCEPTED MANUSCRIPT

CR IP T

DR jmi(X3 ) = 5.2096 × 10−3 , DR jmi(X4 ) = 3.5055 × 10−3 , DR jmi(X5 ) = 1.2973 × 10−3 . The maximum value DR jmi(Xi ) of every Xi is compared, and X3 is selected. Then, the selected feature subset is {X7 , X8 , X6 , X3 }.

AN US

For the artificial example, DRJMIM selects the subset {X7 , X8 , X6 , X3 }, which is the optimal feature subset. The optimal subset can classify all the class labels correctly. However, JMIM and MRMR select {X7 , X3 , X1 , X8 }, DRGS selects {X7 , X8 , X6 , X4 }, IG selects {X7 , X8 , X2 , X3 } and DISR selects {X7 , X3 , X2 , X1 }. These feature selection methods do not select the optimal feature subset. From the above example, we discover that X6 is dependent with X7 and X8 . Meanwhile, X3 has both high value of the weight and minimum joint mutual information except X6 , X7 , X8 . Therefore, X3 is chosen followed by X6 but X3 was the last chosen variable. According to DRJMIM, the value of the weight and minimum joint mutual information are both important.

AC

CE

PT

ED

M

5.2. Experimental results on the real-world data sets The classification performance of our method is compared with DRGS, JMIM, mRMR, DISR and IG on 12 real-world data sets. These data sets come from different fields such as biology, face image and handwritten image. They are common benchmarks that include both binary and multiclass problems. The instance of data sets varies from 50 to 9298, and the number of features varies from 13 to 10000. Data sets are described in Table 4.

19

ACCEPTED MANUSCRIPT

Table 4: Experimental data sets description

Datasets

Instances

Features

Classes

Types

1 2 3 4 5 6 7 8 9 10 11 12

Wine Vehicle Musk1 Semeion ORL Yale USPS GLIOMA Lung discrete Arcene Gisette Madelon

178 846 476 1593 400 165 9298 50 73 200 7000 2600

13 18 166 256 1024 1024 256 4434 325 10000 5000 500

3 4 2 10 40 15 10 4 7 2 2 2

continuous continuous continuous discrete continuous continuous continuous continuous discrete continuous continuous continuous

AN US

CR IP T

No.

AC

CE

PT

ED

M

These data sets include features with continuous values and discrete values. With respect to continuous features, they are discretized into equal number of values following Ding and Pengs suggestion in (Ding and Peng, 2003). Ten rounds 10-fold cross validations are employed if the data set contains more than 100 instances (or leave-one-out validations are used if the data set contains fewer than 100 instances). We compare the classification accuracies against the number of features. The classification accuracies are the average accuracies of three different classifiers (SVM, 3NN and NB) to reduce the bias of a specific classifier. The results are shown in Figures. 1-6. The number K on X-axis represents the first K features with a selected order by different classifiers. The Y-axis represents the average accuracies of three classifiers at the first K features. We set the number of selected feature K from 1 to 30. Different colors and shapes represent different feature selection methods.

20

CR IP T

ACCEPTED MANUSCRIPT

ED

M

AN US

Figure 1: Average classification accuracy achieved with Lung discrete and Arcene

AC

CE

PT

Figure 2: Average classification accuracy achieved with Gisette and GLIOMA

Figure 3: Average classification accuracy achieved with Madelon and Musk1

21

CR IP T

ACCEPTED MANUSCRIPT

ED

M

AN US

Figure 4: Average classification accuracy achieved with ORL and Semeion

AC

CE

PT

Figure 5: Average classification accuracy achieved with USPS and Vehicle

Figure 6: Average classification accuracy achieved with Wine and Yale

22

ACCEPTED MANUSCRIPT

AN US

CR IP T

A good feature selection method can find out informative features accurately. Here, the meaning of accuracy contains two aspects. On one hand, the feature selection method should select the features that obtain the better classification performance. On the other hand, it should select the number of features as small as possible. In Table 5, according to the number of selected features K from 1 to 30, we list the highest value of the average accuracies attained by three classifiers on each data set using the features selected by each feature selection method. Table 5 shows that the highest accuracies of DRJMIM are 83.56%, 86.67%, 78.86%, 47.19%, 58.00%, 74.69%, 85.95%, 92.03%, 97.49%, 72.43%, 81.62%, and 58.94% on each data set, respectively. Our method achieves the highest classification accuracy. Additionally, we can discover that JMIM outperforms DRGS in terms of the highest accuracies on six out of twelve date sets, JMIM and DRGS are similar methods to DRJMIM and they perform equally on 12 data sets. Table 5: The highest average accuracies(%) of three different classifiers (SVM, NB and 3NN) with six methods

1 2 3 4 5 6 7 8 9 10 11 12

Lung discrete GLIOMA Arcene Yale ORL Madelon USPS Gisette Wine Semeion Musk1 Vehicle

JMIM

DRGS

M

data sets

82.19 79.00 85.33 83.33 78.00(-) 77.95(-) 44.67(-) 45.67(-) 56.84(-) 55.93(-) 73.16(-) 72.84(-) 83.86(-) 85.13 (-) 90.30(-) 91.44(-) 96.08(-) 96.85(-) 67.24(-) 72.44(=) 78.50(-) 81.15(-) 58.21(-) 58.04(-)

CE

PT

ED

No.

mRMR

83.11 85.33 77.22(-) 45.88(-) 53.23(-) 59.31(-) 84.98(-) 91.13(-) 96.08(-) 65.34(-) 78.77(-) 58.12(-)

IG

DISR

77.63 82.65 74.67 86.00 71.91(-) 76.79(-) 41.92(-) 46.31(=) 49.30(-) 53.90(-) 71.79(-) 74.74(=) 70.20(-) 82.96(-) 89.45(-) 90.71(-) 96.11(-) 96.04(-) 63.55(-) 64.13(-) 78.38(-) 77.16(-) 57.23(-) 58.3(-)5

AC

Note: The bold font means the maximal value of the row. The symbols (-) and (=), respectively identify statistically significant (at 0.05 level) losses or ties our method.

In further, we employ a paired two-tailed t-test to test the highest accuracy of the significance on 10 data sets (except Lung discrete and GLIOMA) where ten rounds 10-fold cross validations are executed. When the P-value 23

DRJMIM 83.56 86.67 78.86 47.19 58.00 74.69 85.95 92.03 97.49 72.43 81.62 58.94

ACCEPTED MANUSCRIPT

M

AN US

CR IP T

is more than 0.05, the improvement in classification accuracy is likely to happen by chance. Five methods are compared with our method, ’-’ means that the compared method performs worse than our method. ’=’ means that the compared method performs equally to DRJMIM. The bold value means that it is the largest one among these six feature selection methods. It can be observed from the experimental results that DRJMIM outperforms other five feature selection methods of all the data sets. Although DRJMIM obtains 74.69% accuracy in madelon, DISR obtains 74.74%, the P-value is more than 0.05. Therefore, they are equal in terms of the highest accuracy. The same situation for semeion, DRJMIM obtains 72.43% while DRGS obtains 72.44%. The purpose of feature selection method is not only to improve the classification accuracy but also to minimize the number of selected features. Table 6 shows the associated number of selected features with the highest accuracy on the 12 real-world data sets. Table 6 shows that DRJMIM selects the fewest number of selected features on Arcene, USPS, Gisette, and Wine. Additionally, we can discover that JMIM outperforms DRGS in terms of the associated number of selected features with the highest accuracy on four out of twelve date sets, DRGS outperforms JMIM on five out of twelve data sets, therefore, they perform equally on 12 data sets.

ED

Table 6: Number of the selected features

Data sets

Total number

JMIM

DRGS

mRMR

IG

DISR

DRJMIM

1 2 3 4 5 6 7 8 9 10 11 12

Lung discrete GLIOMA Arcene Yale ORL Madelon USPS Gisette Wine Semeion Musk1 Vehicle

325 4434 10000 1024 1024 500 256 5000 13 256 166 18

29 9 5 12 16 11 30 29 10 30 30 8

8 5 30 7 10 14 30 27 11 30 30 16

28 9 10 7 20 10 30 23 10 25 29 13

23 28 22 24 22 12 30 15 12 30 22 18

27 20 11 9 16 17 30 28 11 30 29 16

29 11 5 11 12 16 30 11 10 30 24 12

AC

CE

PT

No.

In order to understand intuitively, we show the highest average classifi24

ACCEPTED MANUSCRIPT

AN US

CR IP T

cation accuracies of three classifiers (SVM, NB and 3NN)and the associated number of selected features in Figures. 7-12. Overall, our method outperforms other compared methods.

ED

M

Figure 7: The highest average classification accuracies of three classifiers and the associated number of selected features on Lung discrete and Arcene.

AC

CE

PT

Figure 8: The highest average classification accuracies of three classifiers and the associated number of selected features on Gisette and GLIOMA.

Figure 9: The highest average classification accuracies of three classifiers and the associated number of selected features on Madelon and Musk1.

25

CR IP T

ACCEPTED MANUSCRIPT

M

AN US

Figure 10: The highest average classification accuracies of three classifiers and the associated number of selected features on ORL and Semeion.

AC

CE

PT

ED

Figure 11: The highest average classification accuracies of three classifiers and the associated number of selected features on USPS and Vehicle.

Figure 12: The highest average classification accuracies of three classifiers and the associated number of selected features on Wine and Yale.

26

ACCEPTED MANUSCRIPT

6. Conclusion and future work

CE

PT

ED

M

AN US

CR IP T

With the growing number of features, the relationship among features becomes increasingly complicated. It is more difficult for the classification task. Previous methods exist two common defects. One is that the feature relevancy is defined without distinguishing candidate feature relevancy and selected feature relevancy. The other one is that some interdependent features may be misinterpreted as redundant features. In this study, we propose a novel feature selection method named DRJMIM. Our method refines two types of feature relevancy: candidate feature relevancy and selected feature relevancy. Based on candidate feature relevancy, we select redundant features that include true redundant features and false redundant features. We combine the redundant features with a dynamic weight to eliminate the true redundant features and retain false redundant feature that are interdependent features. Based on the means mentioned above, our method selects the most informative features that include interdependent features. To verify the effectiveness of our method, we conducted a comparative experiment on an artificial example and 12 real-world data sets. The advantages of our method can be seen clearly in the artificial example. On the 12 real-world data sets, the highest accuracy and the associated number of the selected features are compared with three traditional methods and two relevant methods. Depending on the experimental results, our method not only obtains the best classification performance but also selects the fewest number of features. The highest accuracies are further verified by a paired two-tailed t-test. It is confirmed that the results are not happened by chance. In the future work, we plan to propose more effective feature selection method that has lower time complexity. To address this issue, we pay more attention on linear feature selection method because linear feature selection methods generally has higher computational efficiency.

AC

7. Acknowledgments This work is funded by National Key R&D Plan of China under Grant No. 2017YFA0604500, and by National Sci-Tech Support Plan of China under Grant No. 2014BAH02F00, and by National Natural Science Foundation of China under Grant No. 61701190, and by Youth Science Foundation of Jilin Province of China under Grant No. 20160520011JH, and by Youth Sci-Tech 27

ACCEPTED MANUSCRIPT

CR IP T

innovation leader and team project of Jilin Province of China under Grant No. 20170519017JH,and by the Key Technology Innovation Cooperation Project of Government and University for the whole Industry Demonstration under Grant No. SXGJSF2017-4. References

I. T. Jolliffe, M. Uddin, S. K. Vines, Simplified EOFs - Three alternatives to rotation 20 (3) (2002) 271–279.

AN US

S. Mika, G. Ratsch, J. Weston, B. Scholkopf, K. R. Mullers, Fisher discriminant analysis with kernels, in: Neural Networks for Signal Processing Ix, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, 41–48, 1999. D. R. Hardoon, S. R. Szedmak, J. R. Shawe-Taylor, Canonical Correlation Analysis: An Overview with Application to Learning Methods, Neural Computation 16 (12) (2004) 2639.

M

V. Bol´on-Canedo, N. S´anchez-Marono, A. Alonso-Betanzos, J. M. Ben´ıtez, F. Herrera, A review of microarray datasets and applied feature selection methods, Information Sciences 282 (2014) 111–135.

PT

ED

J. Zhao, Y. Zhou, X. Zhang, L. Chen, Part mutual information for quantifying direct associations in networks, Proceedings of the National Academy of Sciences of the United States of America 113 (18) (2016) 5130.

CE

A. Dionisio, R. Menezes, D. A. Mendes, Mutual information: a measure of dependency for nonlinear time series, Physica A: Statistical Mechanics and its Applications 344 (1) (2004) 326–329.

AC

D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. Mcvean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, P. C. Sabeti, Detecting Novel Associations in Large Data Sets, Science 334 (6062) (2011) 1518–24. H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on pattern analysis and machine intelligence 27 (8) (2005) 1226–1238.

28

ACCEPTED MANUSCRIPT

D. D. Lewis, Feature selection and feature extraction for text categorization, in: Proceedings of the workshop on Speech and Natural Language, Association for Computational Linguistics, 212–217, 1992.

CR IP T

R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans Neural Netw 5 (4) (1994) 537–550.

F. Fleuret, Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5 (Nov) (2004) 1531–1555.

AN US

X. Sun, Y. Liu, D. Wei, M. Xu, H. Chen, J. Han, Selection of interdependent genes via dynamic relevance analysis for cancer diagnosis., Journal of Biomedical Informatics 46 (2) (2013a) 252–258. X. Sun, Y. Liu, M. Xu, H. Chen, J. Han, K. Wang, Feature selection using dynamic weights for classification, Knowledge-Based Systems 37 (2) (2013b) 541–549. Repository,

URL

M

M. Lichman, UCI Machine Learning http://archive.ics.uci.edu/ml, 2013.

ED

J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, arXiv preprint arXiv:1601.07996 . T. M. Cover, J. A. Thomas, Elements of information theory, John Wiley & Sons, 2012.

PT

C. E. Shannon, A mathematical theory of communication, ACM SIGMOBILE Mobile Computing and Communications Review 5 (1) (2001) 3–55.

CE

H. Yan, J. Yang, Sparse discriminative feature selection, Pattern Recognition 48 (5) (2015) 1827–1835.

AC

Q. Song, H. Jiang, J. Liu, Feature selection based on FDA and F-score for multi-class classification, Expert Systems with Applications 81 (2017) 22– 27. Y. Cong, S. Wang, J. Liu, J. Cao, Y. Yang, J. Luo, Deep sparse feature selection for computer aided endoscopy diagnosis, Pattern Recognition 48 (3) (2015) 907–917.

29

ACCEPTED MANUSCRIPT

R. O. Duda, P. E. Hart, D. G. Stork, Pattern classification, John Wiley & Sons, 2012.

CR IP T

A. Senawi, H.-L. Wei, S. A. Billings, A new maximum relevance-minimum multicollinearity (MRmMC) method for feature selection and ranking, Pattern Recognition 67 (2017) 47–61. H. Banka, S. Dara, A Hamming distance based binary particle swarm optimization (HDBPSO) algorithm for high dimensional feature selection, classification and validation, Pattern Recognition Letters 52 (2015) 94– 100.

AN US

G. Brown, A. Pocock, M. J. Zhao, M. Lujn, Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection, Journal of Machine Learning Research 13 (1) (2012) 27–66. B. Guo, M. S. Nixon, Gait feature subset selection by mutual information, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 39 (1) (2009) 36–46.

M

A. El Akadi, A. El Ouardighi, D. Aboutajdine, A powerful feature selection approach based on mutual information, International Journal of Computer Science and Network Security 8 (4) (2008) 116.

ED

N. Kwak, C. H. Choi, Input feature selection for classification problems, IEEE Transactions on Neural Networks 13 (1) (2002) 143–159.

PT

N. Hoque, D. Bhattacharyya, J. K. Kalita, MIFS-ND: a mutual informationbased feature selection method, Expert Systems with Applications 41 (14) (2014) 6371–6385.

AC

CE

K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in: International Conference on Parallel Problem Solving From Nature, Springer, 849–858, 2000. H. H. Yang, J. E. Moody, Data Visualization and Feature Selection: New Algorithms for Nongaussian Data., in: NIPS, vol. 12, 1999. P. E. Meyer, C. Schretter, G. Bontempi, Information-theoretic feature selection in microarray data using variable complementarity, IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 261–274. 30

ACCEPTED MANUSCRIPT

M. Bennasar, Y. Hicks, R. Setchi, Feature selection using Joint Mutual Information Maximisation, Expert Systems with Applications An International Journal 42 (22) (2015) 8520–8532.

CR IP T

Z. Zeng, H. Zhang, R. Zhang, C. Yin, A novel feature selection method considering feature interaction, Pattern Recognition 48 (8) (2015) 2656– 2666.

AC

CE

PT

ED

M

AN US

C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, in: Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE, IEEE, 523–528, 2003.

31