High-dimensional supervised feature selection via optimized kernel mutual information




Expert Systems With Applications 108 (2018) 81–95


Ning Bi a, Jun Tan a,*, Jian-Huang Lai b, Ching Y. Suen c

a School of Mathematics, Sun Yat-sen University, Guangzhou 510275, China
b School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510275, China
c Centre for Pattern Recognition and Machine Intelligence, Concordia University, Montreal, QC H3G 1M8, Canada
* Corresponding author. E-mail address: [email protected] (J. Tan). URL: http://www.math.sysu.edu.cn (J. Tan)

Article info
Article history: Received 16 July 2017; Revised 29 January 2018; Accepted 27 April 2018; Available online 2 May 2018.
Keywords: Feature selection; Kernel method; Mutual information; Classification; Optimize function; Machine learning.
https://doi.org/10.1016/j.eswa.2018.04.037

Abstract
Feature selection is very important for pattern recognition to reduce the dimensions of data and to improve the efficiency of learning algorithms. Recent research on new approaches has focused mostly on improving accuracy and reducing computing time. This paper presents a flexible feature-selection method based on an optimized kernel mutual information (OKMI) approach. Mutual information (MI) has been applied successfully in decision trees to rank variables; its aim is to connect class labels with the distribution of experimental data. The use of MI removes irrelevant features and decreases redundant features. However, MI is usually less robust when the data distribution is not centralized. To overcome this problem, we propose the OKMI approach, which combines MI with a kernel function. This approach may be used for feature selection with nonlinear models by defining kernels for feature vectors and class-label vectors. By optimizing the objective functions, we develop a new feature-selection algorithm that combines MI and kernel learning, and we discuss the relationship among various kernel-selection methods. Experiments were conducted to compare the new technique with other methods on various data sets, and in each case the OKMI approach performs better than the other methods in terms of feature-classification accuracy and computing time. The OKMI method avoids the computational complexity of estimating probability distributions and finds the optimal features at very low computational cost. As a result, the OKMI method with the proposed algorithm is effective and robust over a wide range of real applications on expert systems. © 2018 Published by Elsevier Ltd.

1. Introduction

Feature selection is one of the important issues in expert and intelligent system technology; it uses both the input and output variables to predict the relationships between the features and class labels. Prediction models have been used for various expert and intelligent systems applications including multi-agent systems, knowledge management, neural networks, knowledge discovery, data and text mining, multimedia mining, and genetic algorithms. The models generally involve a number of features. However, not all of these features are equally important for a specific task. Some of them may be redundant or even irrelevant. Better performance may be achieved by discarding some features. Many applications in pattern recognition require the identification of the most characteristic features of a given data set D





that contains N samples and M features, with X = {x_i, i = 1, ..., M}. The data set D usually includes a large amount of irrelevant, redundant, and unnecessary information, which degrades recognition performance. The feature-selection method (Cheriet, Kharma, Liu, & Suen, 2007) selects a subset G of features from the M features (|G| < M) such that D is optimized based on G according to a criterion J. The goal is to maximize the predictive accuracy of the data within D and minimize the cost of extracting the features within G. Feature selection focuses on finding the best subspace, where the total number of subspaces in the original data set D is 2^M. Given a number k (k < M), the number of subspaces with dimension less than k is \sum_{i=1}^{k} \binom{M}{i}. Thus, D is high dimensional for a large feature number M, so thoroughly searching the subspace of features is difficult. To address this issue, sequential-search-based methods to select features have been proposed. Blum and Langley (1997) grouped feature-selection methods into three types: filter, wrapper, and embedded. Filter methods (Almuallim & Dietterich, 1994; Kira & Rendell, 1992) provide quick estimates of the value of features and filter the irrelevant or redundant features


before they are fed into the classifier. In contrast, wrapper methods (Kohavi & John, 1997) usually interact with a classifier, so the classifier performance directly affects the quality of the feature subsets. Finally, in embedded methods (Lal, Chapelle, Weston, & Elisseeff, 2006), feature selection is embedded into the classifier, so the two are not independent and execute simultaneously. Additionally, feature-selection methods can be categorized as unsupervised, supervised, or semi-supervised. Unsupervised methods were developed without using class labels and include joint embedding learning and sparse regression (JELSR) (Hou, Nie, Li, Yi, & Wu, 2014), matrix factorization (MF) (Wang, Pedrycz, Zhu, & Zhu, 2015), k-nearest-neighbor (Chan & Kim, 2015), feature similarity feature selection (FSFS) (Mitra, Murthy, & Pal, 2002), Laplacian score (LS) (He, Cai, & Niyogi, 2005), and regularized self-representation (Zhu, Zuo, Zhang, Hu, & Shiu, 2015), all of which offer efficient algorithms for unsupervised feature selection. Supervised feature-selection methods search for features of an input vector by predicting the class label; existing methods include ReliefF (Kira & Rendell, 1992), Fisher score, correlation, kernel optimization (Kopt) (Xiong, Swamy, & Ahmad, 2005), kernel class separability (KCS) (Wang, 2008), generalized multiple kernel learning (GMKL) (Varma & Babu, 2009), scaled class separability selection (SCSS) (Ramona, Richard, & David, 2012), spectral feature selection with minimum redundancy (MRSF) (Zhao, Wang, & Liu, 2010), the Hilbert–Schmidt independence criterion (HSIC) (Gretton, Bousquet, Smola, & Schölkopf, 2005), the HSIC-based greedy feature selection criterion (Song, Smola, Gretton, Bedo, & Borgwardt, 2012), sparse additive models (SpAM) (Ravikumar, Lafferty, Liu, & Wasserman, 2009), Hilbert–Schmidt feature selection (HSFS) (Masaeli, Fung, & Dy, 2010), centered kernel target alignment (cKTA) (Cortes, Mohri, & Rostamizadeh, 2014), and the feature-wise kernelized Lasso (HSIC Lasso) (Yamada, Jitkrittum, Sigal, Xing, & Sugiyama, 2014). The method proposed herein and the methods we compare it with are all supervised methods. The popular MI method (Eriksson, Kim, Kang, & Lee, 2005) constructs decision trees to rank variables and also serves as a metric for feature selection. An MI method based on Shannon information uses information-theoretic ranking, with the dependency between two variables serving as the metric, and uses entropy to represent relationships between an observed variable x and an output result y. The MI of x and y is defined by their probability density functions p(x) and p(y), respectively, and their joint probability density function p(x, y) (Battiti, 1994). Regarding feature ranking, some works report that the union of individually good features does not necessarily lead to good recognition performance; in other words, "the m best features are not the best m features." Although MI can decrease the redundancy with respect to the original features and select the best features with minimal redundancy, the joint probability of features and the target class increases, so the redundancy among features may decrease (Pudil, Novovičová, & Kittler, 1994). To select good features by using the statistical dependency distribution, Peng, Long, and Ding (2005) proposed the minimal-redundancy-maximal-relevance criterion (mRMR), a feature-selection method based on MI. The method provides maximal dependency, maximal relevance, and minimal redundancy.
The selected features have the maximal joint dependency on the target class, which is called "maximal dependency," but this is hard to implement, so maximal relevance approximates it by using the MI between each feature and the target class. Minimal redundancy reduces the redundancy resulting from maximum relevance, so the redundancy metric is computed from the MI among the selected features. Experiments with mRMR improved the classification accuracy for some data sets. For more details about research into mRMR, see Ding and Peng (2005) and Zhao et al. (2010). The definition of MI is based on the feature entropy and the class label; however, it favors features with many values. Some features

can be very simple, so the feature value is an integer with a small range. However, some feature values are floating-point numbers with very wide ranges, which requires more computation to obtain a ratio that reflects the correlation between features and class. Another problem is inconsistency (Dash & Liu, 2003): consider n samples with the same range of feature values, where m_1 of these samples belong to class 1 and the remaining m_i samples belong to class i. If the largest count is m_1, the inconsistency is n − m_1, and the inconsistency rate is the sum of the inconsistencies divided by the size N of the set. Dash and Liu (2003) show that the time complexity of computing the inconsistency rate is close to O(N); the rate is also monotonic and tolerant to noise. However, it is only available for discrete values, so continuous features must be discretized, which seriously affects the computational complexity and consumes more memory resources. Occasionally, the computation is interrupted when the feature number is too large for the memory. This problem is discussed in detail below. To resolve these drawbacks of the MI method, kernel-based methods (Gretton, Herbrich, & Smola, 2003; Lin, Ying, Chen, & Lee, 2008; Sakai & Sugiyama, 2014) have been introduced to enhance MI; the Hilbert–Schmidt independence criterion (HSIC), which uses kernel-based independence measures, is introduced in Gretton et al. (2005) and Song et al. (2012). These approaches are popular for mapping the data to a nonlinear high-dimensional space (Alzate & Suykens, 2012; Schölkopf, Smola, & Müller, 1998). Multi-kernel learning (Wang, Bensmail, & Gao, 2014) has been applied to feature selection; with a sparse representation on the manifold, it can handle noisy features and nonlinear data. A kernel-based feature-selection method integrates a linear combination of features with the criterion. Real applications of the kernel depend on the type of kernel and its parameters, so while cross-validation may optimize the kernel, it consumes more time and easily over-fits. Traditional feature-selection methods (Kira & Rendell, 1992) are based on the assumption of linear dependency between input features and output values, so they cannot capture non-linear dependency. KCS (Wang, 2008) and cKTA (Cortes et al., 2014) are not necessarily positive definite, and thus their objective functions can be non-convex. Furthermore, for the kernel-based methods (Gretton et al., 2003; Varma & Babu, 2009; Xiong et al., 2005), the output y should be transformed by the non-linear kernel function φ(·), which highly limits the flexibility of capturing non-linear dependency; an advantage of such a formulation is that the global optimal solution can be computed efficiently, so it is scalable to high-dimensional feature-selection problems. Finally, the output y should be a real number in SpAM (Ravikumar et al., 2009), meaning that SpAM cannot deal with structured outputs such as multi-label and graph data. Greedy search strategies such as forward selection/backward elimination are used in mRMR (Peng et al., 2005) and HSIC (Gretton et al., 2005); however, greedy approaches tend to produce a locally optimal feature set. To the best of our knowledge, a convex feature-selection method is able to deal with high-dimensional non-linearly related features. In addition, the output Gram matrix L is used to select features in HSIC Lasso (Yamada et al., 2014), which can naturally incorporate structured outputs via kernels. All feature-selection methods are summarized in Table 1.
To address this problem, we propose herein an approach that combines the goodness of the kernel function and the MI method to obtain a high-dimensional supervised feature-selection framework called optimized kernel mutual information (OKMI), with joint kernel learning, maximum relevance, and minimum redundancy. Instead of using MI to characterize high-dimensional data by the feature and class probability, we embed a kernel function into the MI to form a new framework. Widely used types of kernel functions, including polynomial, Gaussian, exponential, and sigmoid, can be seen as special cases in the OKMI framework.


Table 1. Summary of feature selection methods.

Method       Optimization   Dependency            Scalability       Structured output
Correlation  Convex         Linear                Highly scalable   Not available
Kopt         Greedy         Non-linear            Scalable          Available
mRMR         Greedy         Non-linear            Scalable          Available
KCS          Non-convex     Non-linear            Not scalable      Available
SpAM         Convex         Additive non-linear   Scalable          Available
GMKL         Convex         Non-linear            Scalable          Not available
MRSF         Convex         Non-linear            Highly scalable   Not available
SCSS         Convex         Linear                Highly scalable   Not available
cKTA         Non-convex     Non-linear            Not scalable      Available
BAHSIC       Greedy         Non-linear            Scalable          Available
HSIC Lasso   Convex         Non-linear            Highly scalable   Available

After repeated comparisons and analyses, we chose a suitable kernel to satisfy the requirements of any given problem. By comparing experimental results with related high-dimensional supervised feature-selection methods, OKMI facilitates the integration of other methods to rapidly find a compact subset of features. Experimental results reveal that OKMI improves classification accuracy over a wide range of data sets. The motivation of this work can be summarized as follows:
(1) Theoretical analysis. The OKMI framework integrates kernel learning and MI. Although both have been used for feature selection, no advanced theoretical analysis of their combination explains why they select optimal features. Therefore, the first step in this paper is to theoretically analyze OKMI by using an optimization method.
(2) Computational complexity. Traditional MI methods compute probability distributions to implement feature selection. The range of feature values seriously affects the probability estimates, and occasionally the size of the feature space and the discretization of the data are too costly, which causes programs with high computational complexity to crash. The OKMI method avoids this problem by finding the optimal features at very low cost.
(3) Classification accuracy. Any theoretical analysis must be verified experimentally. We therefore implement various experiments to prove that the OKMI method is an efficient method for supervised feature selection. We compare it with other methods through comprehensive experiments that involve different kernels, classifiers, and different types of data sets. The results show that the OKMI method with the proposed algorithm is effective and robust for a wide range of real applications of feature selection.
The remainder of this paper is organized as follows: Section 2 introduces some background and previous works related to supervised feature selection, and theoretically analyzes the kernel and MI. Section 3 presents the OKMI method and its feature-selection algorithm, which consists of schemes to select the optimal squeezed features. Section 4 discusses implementation issues involving several kernels and classifiers. The results of experiments on various types of data sets are described in Section 5, including genes, handwritten digits, images, and microarray real-world data sets. Discussions and conclusions are presented in Section 6. The theoretical analysis and experiments described herein focus on supervised feature selection.

2. Using optimized kernel mutual information to select features

The MI method is an important way to learn the mapping of a large number of input features to output class labels. In this section, we review the related methods and present a criterion

of OKMI for feature selection. We analyze the OKMI method theoretically and explain why it is suitable.

2.1. Description of the feature-selection problem

If X ∈ R^{m×n} is a set of input feature vectors and Y ∈ R^m is the vector of output-class labels, there exist m independently distributed paired samples u_i and their classes y_i, described as

\{(u_i, y_i) \mid u_i \in X,\ y_i \in Y,\ i = 1, \ldots, m\},    (1)

where m is the number of samples and the dimension of the input feature vector, n is the number of features in set X (the relation between m and n will affect the classification performance), xi is an m-dimensional feature vector, and the original data are denoted by

X = [u_1, \ldots, u_m]^\top = [x_1, \ldots, x_n] \in R^{m \times n},    (2)

Y = [y_1, \ldots, y_m]^\top \in R^m,    (3)

where ⊤ denotes the vector transpose. The joint probability density function p_{x,y}(x, y) can be derived from here. In pattern recognition and machine learning, finding a predictor function f(·) is vital because it maps the feature vector x to the output class y. The predictor f(·) is learned from the training set, which includes the sample pairs (x, y), by the machine-learning method. If f(·) is a member function of a predictor family parameterized by ω, we can specify the particular member function f_ω(·). The objective is to find the ω that minimizes the objective function as follows:

\min_\omega \Phi(\omega) = \min_\omega \iint L(f_\omega(x), y)\, p(x, y)\, dx\, dy,    (4)

where L(f_ω(x), y) is a loss function that penalizes the difference between the prediction f_ω(x) and the true value y, and p(x, y) is the joint probability density function of x and y. The loss function L is defined through the misclassification rate for classification problems, where the ordinary criterion takes the form of

L(f_\omega(x), y) = \frac{1}{2}\, \| y - f_\omega(x)\alpha \|_2^2 + \lambda \|\alpha\|_1,    (5)

where α = [α_1, ..., α_n]^⊤ is a coefficient vector for regressing the features, λ > 0 is a regularization parameter, and ‖·‖_1, ‖·‖_2 are the ℓ_1- and ℓ_2-norms. Usually we do not know the probability density function p(x, y) in Eq. (4), so we cannot optimize the loss equation with respect to ω. Alternatively, we often handle the empirical formula computed with the training data and replace the integral in Eq. (4) with the sum \sum_{i=1}^{m} over all sample data. Optimizing the empirical equation is known to often lead to overfitting, so many common methods minimize the empirical equation by adding a penalty term Ω(ω) to the solution. Thus, we

replace Eq. (4) by a minimum loss function with a new formulation:

\min_\omega \Phi(\omega) = \min_\omega \frac{1}{m}\sum_{i=1}^{m} L(f_\omega(x_i), y_i) + \Omega(\omega),    (6)

where Ω(ω) is defined as a regularization function related to f(·) and may take many types of parameters; see Perkins, Lacker, and Theiler (2003).

2.2. Mutual information for max-relevance and min-redundancy

Shannon information theory forms the base of modern communications research. In terms of information theory, feature selection may be viewed as a problem of information-code distortion and recovery. The MI method combines a Bayes error estimate and information entropy, such that the loss function L( f ω (x ), y ) in Eq. (4) can be rewritten as

L(f_\omega(x), y) = E_{x,y}[\log_2 \hat{p}(y|x; f_\omega)] = \iint p(x, y)\, \log_2 \hat{p}(y|x; f_\omega)\, dx\, dy = -H(Y|X) - E_x\big[D_{KL}\big(P(y|x)\, \|\, \hat{p}(y|x; f_\omega)\big)\big],    (7)

where p(x, y) is the joint probability density function and \hat{p}(y|x; f_ω) denotes the generative parametric model. Thus, the equation is optimized by minimizing the sum of the conditional entropy and the distance between the model and the true data. In practice, Eq. (7) is not available in a high-dimensional feature space, making it difficult to estimate the MI from the data in high dimensions; Eq. (7) must resort to simple parametric density estimates, from which MI can be redefined. The goal of MI is to find a single variable and test its value, which allows the best classes to be identified. To achieve this goal by reducing the class-label entropy, comparing the class distribution of the training data with the average entropies of the new partitions, the MI I(x, y) of x and y is defined by their individual and joint probability density functions p(x), p(y), and p(x, y):

I(x, y) = \iint p(x, y)\, \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy.    (8)

Mutual information is used in feature selection for two reasons: the first is to remove irrelevant variables (i.e., Max-Relevance), and the second is to remove redundant variables to the extent possible (i.e., Min-Redundancy). For Max-Relevance, features are selected via a maximal relevance criterion between the input feature subset S (x_i ∈ S) and the output class y. It is denoted by

I(S, y) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, y).    (9)

It is clear that relevance requires evaluating a joint metric of redundancy because, in the feature-selection process, Max-Relevance is usually richly redundant. Min-Redundancy considers that, when two features depend strongly on each other, the respective classification performance would not decrease if one of them were removed, so the following minimal redundancy criterion can be added to choose suitable features in the candidate feature set S:

R(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i, x_j).    (10)

The criterion obtained by combining Eqs. (9) and (10) is called "minimal redundancy maximal relevance" (mRMR) (Ding & Peng, 2005), with which the optimization problem (7) transforms into a new criterion by MI. From this, we obtain the following form for simultaneously optimizing I(S, y) and R(S):

\min L(f_\omega(x), y) = \min(-I(S, y) + R(S)) = \max(I(S, y) - R(S)) = \max\Big[\frac{1}{|S|}\sum_{x_i \in S} I(x_i, y) - \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i, x_j)\Big],    (11)

where I(x_i, x_j) is not a monotonic function; the optimal set S* is approximated by a greedy algorithm (Guestrin, Krause, & Singh, 2005). For some small ε > 0, the mutual information I(x_i, x_j) then has the following property:

\sum_{x_i, x_j \in S} I(x_i, x_j) \geq (1 - 1/e) \max_{|S| = k} \sum_{x_i, x_j \in S^*} I(x_i, x_j) - k\varepsilon.    (12)

In fact, only a low-dimensional space may provide sufficient statistics for the MI method to work; sufficient statistics are not available in high dimensions. Therefore, evaluating the MI could make the function-optimization problem hard to solve in the general multi-class high-dimensional feature space.
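For illustration, the sketch below applies the max-relevance/min-redundancy rule of Eqs. (9)–(11) with a greedy forward search, using scikit-learn's mutual information estimators in place of explicit density estimation. This is only a minimal sketch under those assumptions; the function and variable names (mrmr_forward, n_select) are ours and are not taken from the paper.

```python
# Minimal mRMR-style forward selection sketch for Eqs. (9)-(11).
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_forward(X, y, n_select):
    """Greedily add the feature with the largest (relevance - redundancy) score."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y)            # estimates I(x_i, y) for every feature
    selected = [int(np.argmax(relevance))]           # start with the most relevant feature
    remaining = [j for j in range(n_features) if j != selected[0]]
    while len(selected) < n_select and remaining:
        scores = []
        for j in remaining:
            # redundancy: mean MI between candidate j and the already selected features
            red = np.mean([mutual_info_regression(X[:, [s]], X[:, j])[0] for s in selected])
            scores.append(relevance[j] - red)        # relevance minus redundancy trade-off
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

As the text above notes, this greedy search only approximates the optimum of Eq. (11), and the MI estimates become unreliable in high dimensions, which motivates the kernel formulation introduced next.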

2.3. Integrate kernel function with mutual information

The kernel function maps the original data to a high-dimensional feature space. With kernel transforms, optimizing the redundancy and finding the relevance of features become easy. Widely used types of kernel functions include the polynomial, Gaussian (RBF), exponential, sigmoid, and delta functions. Expressions for these are given below.
Polynomial kernel:

k(x, y) = (x^\top y + c)^d;    (13)

Gaussian kernel:

k(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big);    (14)

Exponential kernel:

k(x, y) = \exp\Big(-\frac{\|x - y\|}{2\sigma^2}\Big);    (15)

Sigmoid kernel:

k(x, y) = \tanh(c\, x^\top y + d);    (16)

where c, d, and σ are kernel parameters whose values are dictated by the practical problem. The value of σ is usually set to the number of samples. The delta kernel is a special kernel that is used for multi-class classification (Song et al., 2012):

k(x, y) = \begin{cases} 1/n_y, & \text{if } x = y,\\ 0, & \text{otherwise,} \end{cases}    (17)

where n_y is the number of samples in the data set of class y. We consider the mapping function for the feature vector x, φ(·): x → φ(x), and transform the output vector y by using the mapping φ(·): y → φ(y), where φ(·) is a nonlinear mapping function and the kernel function is denoted by k(x, y) = φ(x)^⊤ φ(y). Therefore, the optimization equation for feature selection from Eq. (5) using the kernel method is represented by

\min L(f_\omega(x), y) = \min\{\|\varphi(y) - \alpha\varphi(x)\|_F^2 + \lambda\|\alpha\|_1\}.    (18)
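The kernels of Eqs. (13)–(17) are straightforward to implement; the short sketch below writes them in NumPy. The function names and default parameter values are illustrative assumptions, not code from the paper.

```python
# Sketch of the kernels in Eqs. (13)-(17); x and y are 1-D NumPy arrays of equal length.
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    return (np.dot(x, y) + c) ** d                                  # Eq. (13)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))         # Eq. (14)

def exponential_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))        # Eq. (15)

def sigmoid_kernel(x, y, c=1.0, d=0.0):
    return np.tanh(c * np.dot(x, y) + d)                            # Eq. (16)

def delta_kernel(y_i, y_j, class_counts):
    """Eq. (17): 1/n_y when the two class labels agree, 0 otherwise."""
    return 1.0 / class_counts[y_i] if y_i == y_j else 0.0
```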

In this case, we see that

L(f_\omega(x), y) = \|\varphi(y) - \alpha\varphi(x)\|_F^2 + \lambda\|\alpha\|_1 = \mathrm{Tr}\big((\varphi(y) - \alpha\varphi(x))^\top(\varphi(y) - \alpha\varphi(x))\big) + \lambda\|\alpha\|_1 = \mathrm{Tr}(\varphi(y)^\top\varphi(y)) - 2\,\mathrm{Tr}(\alpha^\top\varphi(y)^\top\varphi(x)) + \mathrm{Tr}(\alpha^\top\varphi(x)^\top\varphi(x)\alpha) + \lambda\|\alpha\|_1.    (19)

To optimize the equation above, we also assume that the loss function L for the given input data is differentiable with respect to the coefficient vector α, i.e., that we may compute ∂L(x_i, y)/∂α for an


arbitrary feature vector x_i and arbitrary coefficient α_i. This gives

\frac{\partial L}{\partial \alpha} = 0, \quad \forall \alpha.    (20)

Thus, the gradient of the criterion with respect to the arbitrary coefficient α is

\frac{\partial L}{\partial \alpha} = \frac{\partial[\mathrm{Tr}(\varphi(y)^\top\varphi(y))]}{\partial\alpha} - \frac{\partial[2\,\mathrm{Tr}(\alpha^\top\varphi(y)^\top\varphi(x))]}{\partial\alpha} + \frac{\partial[\mathrm{Tr}(\alpha^\top\varphi(x)^\top\varphi(x)\alpha)]}{\partial\alpha} + \frac{\partial[\lambda\|\alpha\|_1]}{\partial\alpha} = -\varphi(x)^\top\varphi(y) + |\alpha|\,\varphi(x)^\top\varphi(x) + \lambda = -k(x, y) + |\alpha|\,k(x, x) + \lambda,    (21)

where we consider the case λ = 0. Compared with Eq. (11), the criterion above uses the kernel to replace the MI. The kernel method is used to test the dependence between two random variables, so, instead of defining MI with I(x, y) (more details are discussed in Gretton et al. (2003)), we use the kernel function I_OKMI(x, y), which is called the optimized kernel mutual information (OKMI) function I_OKMI(·). This gives the new criterion

\min L(f_\omega(x), y) = \max(I_{OKMI}(S, y) - R_{OKMI}(S)) = \max\Big[\sum_{x_i \in S} I_{OKMI}(x_i, y) - \frac{1}{|S|}\sum_{x_i \in S;\ x_j \in S} I_{OKMI}(x_i, x_j)\Big],    (22)

where R_OKMI is the redundancy function redefined by the kernel from Eq. (10), and S is a feature set that contains k selected features. The purpose of the method is to find the optimal (k + 1)st feature from the set {X\S}; the respective optimization to search for the (k + 1)st feature can be denoted

\max_{x_{k+1} \in X\setminus S;\ x_k \in S}\Big[I_{OKMI}(x_{k+1}, y) - \frac{1}{k}\sum_{x_k \in S} I_{OKMI}(x_{k+1}, x_k)\Big],    (23)

where the parameter |S| has been replaced by k, the number of selected features. This criterion is the basis for the feature-selection algorithm presented below.

2.4. Relationship to HSIC

The Hilbert–Schmidt independence criterion (HSIC) (Gretton et al., 2005; Song et al., 2012) is a feature-selection method based on dependence maximization between the selected features and the class labels. Consider the mutual information in Eq. (8): as discussed previously, the distribution p(x, y) is difficult to compute. Denote the covariance matrix C_{xy} = E_{xy}[XY^⊤] − E_x[X]E_y[Y^⊤], which contains all second-order dependence. HSIC defines a map Φ: X → F from X to a feature space, and Ψ: Y → G is an analogous map; the inner products may be written as kernel functions K_{ij} = K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ and L_{ij} = L(y_i, y_j) = ⟨Ψ(y_i), Ψ(y_j)⟩, so HSIC is defined as (Gretton et al., 2005)

HSIC_0(F, G, Z) = (m - 1)^{-2}\, \mathrm{tr}(KHLH), \quad Z = (X, Y),    (24)

where H = I - m^{-1}\mathbf{1}\mathbf{1}^\top \in R^{m \times m} is a centering matrix; HSIC_0 is well concentrated but has bias O(m^{-1}). An unbiased estimator HSIC_1 is defined in Song et al. (2012):

HSIC_1 = \frac{1}{m(m-3)}\Big[\mathrm{tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^\top\tilde{K}\mathbf{1}\,\mathbf{1}^\top\tilde{L}\mathbf{1}}{(m-1)(m-2)} - \frac{2}{m-2}\,\mathbf{1}^\top\tilde{K}\tilde{L}\mathbf{1}\Big],    (25)

where \tilde{K}_{ij} = (1 - \delta_{ij})K_{ij} and \tilde{L}_{ij} = (1 - \delta_{ij})L_{ij}, with

\delta_{ij} = \begin{cases} 1, & \text{if } i = j,\\ 0, & \text{if } i \neq j, \end{cases}    (26)

so that \tilde{K} and \tilde{L} are the kernel matrices with their diagonals set to 0, and HSIC_1 is constructed as an unbiased estimator. There are three types of expectation: the joint expectation E_{xy}E_{x'y'}, the partially decoupled expectation E_{xy}E_{x'}E_{y'}, and all four expectations taken independently, E_x E_y E_{x'} E_{y'}; they are defined as

E_{xy}E_{x'y'}[k(x, x')l(y, y')] = (m)_2^{-1} E_Z\Big[\sum \tilde{k}_{ij}\tilde{l}_{ij}\Big] = (m)_2^{-1} E_Z(\mathrm{tr}\,\tilde{K}\tilde{L}),    (27)

E_{xy}E_{x'}E_{y'}[k(x, x')l(y, y')] = (m)_3^{-1} E_Z\Big[\sum \tilde{k}_{ij}\tilde{l}_{iq}\Big] = (m)_3^{-1} E_Z(\mathbf{1}^\top\tilde{K}\tilde{L}\mathbf{1} - \mathrm{tr}\,\tilde{K}\tilde{L}),    (28)

E_x E_y E_{x'} E_{y'}[k(x, x')l(y, y')] = (m)_4^{-1} E_Z\Big[\sum \tilde{k}_{ij}\tilde{l}_{rq}\Big] = (m)_4^{-1} E_Z(\mathbf{1}^\top\tilde{K}\mathbf{1}\,\mathbf{1}^\top\tilde{L}\mathbf{1} - 4\,\mathbf{1}^\top\tilde{K}\tilde{L}\mathbf{1} + 2\,\mathrm{tr}\,\tilde{K}\tilde{L}),    (29)

where the symbol (m)_k = \frac{m!}{(m-k)!}. HSIC tends to select non-redundant features x that depend strongly on y, whereas mRMR (Peng et al., 2005) can detect redundant features; this is a preferable property in feature selection, and it has therefore become a popular method in the machine-learning and artificial-intelligence fields. Eq. (18) can be redefined using HSIC; its first term can be rewritten as

\|\varphi(y) - \alpha\varphi(x)\|_F^2 = HSIC(y, y) - 2\sum_{x_i \in S}\alpha_i\, HSIC(x_i, y) + \sum_{x_i, x_j \in S}\alpha_i\alpha_j^\top\, HSIC(x_i, x_j),    (30)

where HSIC(y, y) is a constant that can be ignored, and HSIC(x_i, y) = tr(φ(x)^⊤φ(y)) is the dependence between the i-th feature x_i and the output y. When HSIC(x_i, y) takes a large value, α_i also takes a large value so that Eq. (18) is minimized; by contrast, HSIC(x_i, y) = 0 when x_i is independent of y, and thus α_i can be dropped altogether. HSIC therefore shows that relevant features x_i have a strong dependence on y. On the other hand, having redundant features implies that features x_i and x_j are strongly dependent; to minimize Eq. (18) when HSIC(x_i, x_j) takes a large value, α_i and α_j tend to become 0. This means the redundant features can be discarded, and as a result OKMI can be regarded as an idealized model of HSIC for finding redundant features. Furthermore, HSIC always takes non-negative values and is 0 only when the two random variables are independent; non-negative matrix factorization (NMF) has an analogous property, which can be used to select features. Gretton et al. (2005) proved that the empirical HSIC converges to the true HSIC at rate O(1/\sqrt{n}). OKMI may also be considered a centered model of the centered kernel target alignment (cKTA) (Cortes et al., 2014), which is used to tune the parameters of standard kernels. Let K denote a kernel Gram matrix and let K* = yy^⊤ be the ideal target matrix, where y ∈ {−1, +1}^n is the vector consisting of the labels of n samples; the measure is defined as

A(K, yy^\top) = \frac{\langle K, K^* \rangle_F}{\sqrt{\langle K, K \rangle\,\langle K^*, K^* \rangle}} = \frac{\langle K, K^* \rangle_F}{n\,\|K\|_F},    (31)

where cKTA is used here only as a measure to clarify their relationship and difference; it is not designed for feature selection.
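To make the HSIC comparison concrete, the sketch below computes the biased estimator HSIC_0 of Eq. (24) from two precomputed Gram matrices. It is a minimal sketch assuming an RBF Gram matrix for both variables; the helper names (rbf_gram, hsic0) are ours, not the authors'.

```python
# Sketch of the biased HSIC estimator in Eq. (24).
import numpy as np

def rbf_gram(X, gamma=1.0):
    """RBF Gram matrix for rows of X (samples x dimensions)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def hsic0(K, L):
    """HSIC_0 = (m-1)^(-2) tr(K H L H), with H the centering matrix."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m        # H = I - (1/m) 11^T
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2
```

A value near zero indicates empirical independence between the two variables, while large values indicate strong dependence, which is exactly the property the relevance and redundancy terms of Eq. (30) exploit.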


2.5. Kernel optimization

Assume k_0(x, y) is a basic kernel; the data-dependent kernel function (Wang, 2008) may be defined as

k(x, y) = q(x)\,q(y)\,k_0(x, y),    (32)

where q(·) is the factor function

q(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i\, k_i(x, \alpha_i),    (33)

in which k_i(x, α_i) is also a Gaussian kernel,

k_i(x, \alpha_i) = \exp(-\|x - \alpha_i\|^2), \quad \alpha_i \in R^d.    (34)

The class separability in kernel space is defined as

J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)},    (35)

where S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, and tr(·) is the trace of a square matrix. Let C denote the number of classes, let n_i be the number of samples in the i-th class, and let the total be n = \sum_{i=1}^{C} n_i. Denote by m_i the mean vector of the i-th class and by m the mean vector of all vectors; then S_w and S_b are defined as

S_w = \sum_{i=1}^{C}\sum_{j=1}^{n_i} (x_{ij} - m_i)(x_{ij} - m_i)^\top,    (36)

S_b = \sum_{i=1}^{C} n_i\, (m_i - m)(m_i - m)^\top.    (37)

S_w and S_b can be developed in the kernel space K and rewritten through the kernel mapping φ(·). Let the superscript φ denote variables in K mapped from R^d, denote by D_i the set of samples of the i-th class, and define D = \bigcup_{i=1}^{C} D_i; then tr(S_w^φ) and tr(S_b^φ) can be rewritten as

\mathrm{tr}(S_w^{\varphi}) = \mathrm{tr}\Big(\sum_{i=1}^{C}\sum_{j=1}^{n_i} (\varphi(x_{ij}) - m_i^{\varphi})(\varphi(x_{ij}) - m_i^{\varphi})^\top\Big) = \mathrm{tr}(K_{D,D}) - \sum_{i=1}^{C}\frac{\mathrm{Sum}(K_{D_i,D_i})}{n_i},    (38)

\mathrm{tr}(S_b^{\varphi}) = \mathrm{tr}\Big(\sum_{i=1}^{C} n_i\big[(m_i^{\varphi} - m^{\varphi})(m_i^{\varphi} - m^{\varphi})^\top\big]\Big) = \sum_{i=1}^{C}\frac{\mathrm{Sum}(K_{D_i,D_i})}{n_i} - \frac{\mathrm{Sum}(K_{D,D})}{n},    (39)

where K is the kernel matrix related to k(x, y) and k_0(x, y) by Eq. (32),

K_{A,B} = [q(x_i)\,q(x_j)\,K_0(x_i, x_j)]_{m \times m} = Q K_0 Q,    (40)

in which K_{A,B} has the constraint x_i ∈ A and x_j ∈ B, and Q is a diagonal matrix whose diagonal entries are [q(x_1), q(x_2), ..., q(x_m)]. From Eq. (33), tr(S_b^φ) and tr(S_w^φ) can be regarded as functions of α, J_1(α) = tr(S_w^φ)(α) and J_2(α) = tr(S_b^φ)(α), and a gradient-based optimization method is used to find the optimal α. If the dimension of the feature space is large, feature selection is essentially converted to a kernel-parameter-optimization (KPO) problem, and the optimal kernel parameter set α* is

\alpha^* = \arg\max_{\alpha \in R^d,\ \alpha > 0}\big[J^{\varphi}(\alpha)\big].    (41)

Note that a feature x_i is more important if its related α_i is larger.
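The kernel-space class separability of Eqs. (35), (38), and (39) can be evaluated directly from a Gram matrix, as in the sketch below. This is a minimal sketch assuming the Gram matrix K has already been built (e.g., with the data-dependent kernel of Eq. (32)); the function name class_separability is ours.

```python
# Sketch of J = tr(S_b^phi) / tr(S_w^phi) from Eqs. (35), (38) and (39);
# K is the full m x m kernel Gram matrix and labels holds each sample's class.
import numpy as np

def class_separability(K, labels):
    n = K.shape[0]
    within = np.trace(K)                       # tr(K_{D,D}) term of Eq. (38)
    between = -K.sum() / n                     # -Sum(K_{D,D})/n term of Eq. (39)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        block_sum = K[np.ix_(idx, idx)].sum()  # Sum(K_{D_i,D_i})
        within -= block_sum / len(idx)         # Eq. (38)
        between += block_sum / len(idx)        # Eq. (39)
    return between / within                    # Eq. (35) in kernel space
```

In the kernel-parameter-optimization setting described above, this quantity (or its regularized form in Eq. (42)) would be the objective evaluated at each candidate α.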

In addition, when the feature space is high dimensional but the number of training samples is small, simply maximizing J^φ(α) easily leads to over-fitting; therefore, a regularization term is added to J^φ(α):

J_{opt}^{\varphi}(\alpha) = \lambda J^{\varphi}(\alpha) + (1 - \lambda)\,\|\alpha - \alpha_0\|^2, \quad (0 \leq \lambda \leq 1),    (42)

where λ is a regularization parameter; a larger λ means that the number of features is larger or the number of training samples is smaller, and the value of λ is mainly set empirically. However, J_{opt}^φ(α) is not a convex function, so a gradient-based algorithm will find a local optimum; in this paper, J_{opt}^φ(α) is optimized by the quasi-Newton method. Other methods have tackled the KPO problem: difference-of-convex (DC) programming can be used to find the global optimum, and more complicated methods to search for α* have been proposed in Chapelle, Vapnik, Bousquet, and Mukherjee (2002).

3. Implementation of the optimized kernel mutual information feature-selection algorithm

This section describes the design of an iterative update algorithm based on the above criterion to select an optimal subset of features. In the previous section, we proposed a theoretical OKMI method that left many issues unresolved, including how to determine the candidate subset from the feature set and how to choose a suitable kernel to enhance the performance. Moreover, according to the OKMI method, redundant features sometimes need to be removed from the selected-feature subset, a process that has not been considered by the traditional incremental method. Refining the incremental method will therefore be a challenge.

3.1. Choosing the candidate set

Some methods search for the optimal feature subsets from within the candidate set to determine the best candidate features for a data set. We use a wrapper (Kohavi & John, 1997) selector to search the feature set. The feature-selector wrapper consists of a classifier (e.g., a support vector machine or a naive Bayes classifier), as shown in Fig. 1. The wrapper approach may lead to high classification performance by minimizing the classification error with a specific classification algorithm, but its computing cost is usually high. To avoid this problem, instead of optimizing the classification algorithm directly, we use a "filter" approach, such as OKMI, which selects features by testing the consistency between the candidate features and the target class. Filter methods are usually less computationally complex than wrappers. Moreover, filter methods that incorporate wrappers often yield comparable classification performance with various classifiers. The wrapper method has two selection modes, backward and forward selection; in this section, we consider the forward-selection wrapper in the OKMI framework. As with other search algorithms, the process of searching for the candidate feature set consists of four parts:
(1) the first feature vector when the search starts;
(2) the search strategy used to control the fashion of the exploration;
(3) the evaluation criteria used to provide feedback;
(4) the termination criterion.

The first selected feature vector x_1 depends on the largest error reduction max k(x_1, y), and the search starts from the entire feature set X to find the single feature that meets this requirement. The subset S of selected features is assumed to contain k features. The original feature set is X, the total number of target features is K (k ≤ K), and the wrapper selects the feature x_{k+1} from the subset {X − S}. We calculate the cross-validation error for a large set of features and search the result for a subset with a relatively small error.


Fig. 1. Wrapper approach for feature selection.

Table 2. Algorithm 1: Candidate feature search method for Eq. (23).

Input: feature data set X = {x_1, ..., x_N}; class-label vector y; empty feature candidate set S; temporary set F; number of target features K; evaluation function J(x, y); kernel function kernel(x, y).
Output: feature candidate set S with K feature vectors; classification accuracy acl.
01. Begin: S is empty, F = X, and k = 0.
    Set x_1 = arg max_{x_i ∈ X} kernel(y, x_i); set S ← x_1; set F ← X\x_1.
02. While k from 2 → K:
    For all pairs (k, j) with x_k ∈ S and x_j ∈ F, calculate kernel(x_k, x_j) and save the value.
    Calculate x_{k+1}, determined by the evaluation function J(x, y):
        x_{k+1} = arg max_{x_j ∈ F} [kernel(y, x_j) − (1/k) Σ_{x_i ∈ S} kernel(x_j, x_i)].
    Set S ← S ∪ x_{k+1}; set F ← F\x_{k+1}; k = k + 1; repeat until k = K.
03. End while. This yields the target subset S_end. Divide S_end into a training set Tr and a test set Te; the class-label vector set Y is allocated accordingly. Choose a classifier, set its parameters to predict the classification performance, and obtain the accuracy acl. Return the accuracy acl.
04. End.
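A compact Python rendering of Algorithm 1 is sketched below, assuming an RBF kernel plays the role of kernel(·,·) and scikit-learn's SVC provides the final accuracy check. The identifiers (okmi_select, evaluate, the gamma value) and the stratified split are our own illustrative choices, not the authors' implementation; in particular, the paper splits samples by parity rather than randomly.

```python
# Sketch of Algorithm 1 (candidate feature search for Eq. (23)).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def rbf(a, b, gamma=1e-3):
    """RBF dependence between two length-m vectors (feature column vs. label vector)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def okmi_select(X, y, K):
    n_features = X.shape[1]
    relevance = np.array([rbf(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]            # step 01: most label-dependent feature
    remaining = [j for j in range(n_features) if j not in selected]
    while len(selected) < K:                          # step 02: greedy growth
        scores = [relevance[j] -
                  np.mean([rbf(X[:, j], X[:, s]) for s in selected])
                  for j in remaining]                 # Eq. (23)-style relevance minus redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

def evaluate(X, y, K):
    subset = okmi_select(X, y, K)
    Xtr, Xte, ytr, yte = train_test_split(X[:, subset], y, test_size=0.5, stratify=y)
    clf = SVC(kernel="rbf").fit(Xtr, ytr)             # step 03: accuracy on the held-out set
    return clf.score(Xte, yte)
```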

To obtain the best error reduction, the wrapper strategy repeats until the classification error begins to increase under criterion (23), i.e., error e_{k+1} > e_k. The search continues when e_{k+1} = e_k, which makes the search space as large as possible. Once the candidate-feature condition is satisfied, the feature x_{k+1} joins the subset S, such that S = S ∪ {x_{k+1}}, and we start a new iteration by incrementing k to k + 1. The search stops if all available subsets have been evaluated or the maximum number of iterations has been reached. In the present work, we predict a maximum number K of candidate features and, when k = K, the search process ends and yields the target subset S_end.

3.2. Optimized kernel mutual information algorithm

According to the above criterion, we use the algorithm in Table 2, which implements the basic schemes of the OKMI method. The candidate subset S is a subset of the original data set X obtained by a sequential search, which gives a new feature vector

from the criterion by iteration. The iteration ends when all termination conditions are satisfied or when the maximum number of iterations is reached. The iteration gives the best optimal feature subset S_end subject to the number of features. To compare the performance of classifiers on various feature subsets, we use a classifier (i.e., a support vector machine, or SVM) to classify data on a candidate subset. The data are divided into two subsets, a training data set and a test data set. To confirm balanced allocation and widely distributed data, we allocate the sample data based on number parity: when the sample number is even (odd), the data are assigned to the training (testing) data set. According to the result of the classification, we can continue to improve the schemes.

4. Discussion on kernel and classifier

In this section, we discuss the choice of kernel and classifier for use in the experiments. The kernel is chosen to meet the requirements, and the experiments are analyzed to determine how best to use various classifiers.

4.1. Kernel estimation

The kernel metric is the bridge that connects the original data to the learning methods, and the kernel methods play a key role in OKMI. Conversely, they sometimes also act as information bottlenecks. The learning methods cannot directly access the original data: if the data are not reflected in the kernel matrix, they cannot be recovered. As a result, the estimate of the kernel is decisive for achieving a higher classification accuracy. Theoretically, kernels should be adapted to the requirements of the real application in all cases. The ideal kernel function φ is denoted k(x, y) = φ(x)^⊤φ(y), but in most real cases we do not know the form of the function φ(·) to learn. Therefore, we start from simple kernels and combine them into more complex kernels. Kernels from different kernel families are chosen for use in many applications, so we analyze the properties of various kernels from different kernel families. Table 3 lists several kernel functions, giving for each the function name, its parameters, and a reference function. We need to select the kernel function for performing OKMI and decide on the kernel parameters. Linear kernels have the advantage that there are no adjustable parameters, so a linear kernel is the simplest of all kernels. However, when the training data are nonseparable, a linear kernel performs less well in applications. Therefore, nonlinear kernels have at least one parameter to tune. Three widely used nonlinear kernel functions are available (see Table 3): Gaussian (RBF), polynomial, and sigmoid. The polynomial kernel often takes a long time to train and sometimes performs worse than the other kernels. Moreover, at high degrees the polynomial kernel has more hyper-parameters than

the RBF kernel, which may lead to infinity or zero. The sigmoid kernel is like the RBF kernel for some parameters, but the sigmoid kernel is invalid for other parameters that give normal performance for the RBF kernel. Unlike a linear kernel, the RBF kernel can map data to a high-dimensional nonlinear space, so it can handle the nonlinear relationship between feature vectors and class-label vectors. Therefore, we choose the RBF kernel by default in this work.

Table 3. Comparison of different kernel functions.

Function name                  Abbr.   k(x, y) expression                       Parameter                   Ref. function
Linear                         Linear  x^T y                                    —                           Linear(x) = x
Polynomial                     PF      (x^T y + c)^d                            c > 0; d > 0, default 1     Poly(x) = (x + 1)^d
Radial Basis or Gaussian       RBF     exp(−g‖x − y‖^2)                         g > 0, default 3            RBF(x) = exp(−gx^2)
Symmetric Triangle             STF     max(1 − g‖x − y‖, 0)                     g > 0, default 3            STF(x) = max(1 − g|x|, 0)
Cauchy                         CF      (1 + g‖x − y‖^2)^−1                      g > 0, default 3            CF(x) = (1 + gx^2)^−1
Laplace or Exponential         LF      exp(−g‖x − y‖)                           g > 0, default 3            LF(x) = exp(−g|x|)
Hyperbolic Secant              HSF     2/(exp(−g‖x − y‖) + exp(g‖x − y‖))       g > 0, default 3            HSF(x) = 2/(e^{−g|x|} + e^{g|x|})
Squared Sinc                   SSCF    sin^2(g‖x − y‖)/(g‖x − y‖)^2             g > 0, default 3            SSCF(x) = sin^2(gx)/(gx)^2
Sigmoid or Hyperbolic Tangent  HTF     tanh(g x^T y + c)                        c < 0; g > 0, default 3     HTF(x) = tanh(−gx + c)

4.2. Classifier choice

In the OKMI framework, the classifier plays a crucial role in the scheme, so we try to choose a classifier that selects features with good performance. To attain this goal, we consider three popular classifiers: the SVM, the naive Bayes (NB), and the k-nearest-neighbor (KNN) classifier. The SVM is a modern, well-developed classifier (Hsu & Lin, 2002) that also uses kernels to construct boundaries for linear classification in high-dimensional vector spaces. Its basic idea is to map data to a high-dimensional vector space and to search for a separating hyperplane with the maximal margin. Given feature vectors x_k and class labels y_k, the SVM solves the following quadratic optimization problem:

\min_{w, b, \xi}\ \frac{1}{2} w^\top w + C\sum_{k=1}^{m}\xi_k \quad \text{subject to}\ \ y_k(w^\top\varphi(x_k) + b) \geq 1 - \xi_k,\ \ \xi_k \geq 0,\ k = 1, \ldots, m,    (43)

where φ is a mapping function that maps the data to high dimensions and C is a penalty parameter on the training error. In the experiments, we use the Gaussian kernel to train and test the SVM after trying various types of kernels. The library for support vector machines (LIBSVM) package (Hsu & Lin, 2002) is a well-known software package for the SVM algorithm that can handle multi-class problems. The NB classifier is the oldest classifier based on the Bayes decision rule (Keogh & Pazzani, 1999); it greatly simplifies learning by assuming that features are independent given their class. Although independence is usually a poor assumption, the NB classifier competes well with other classifiers in practice. Bayes classifiers assign the class label to the most likely class given a sample defined by its features; consider a sample vector s = {x_1, ..., x_n} with n features. Assuming the features are independent, the probability of the sample belonging to class c_i is

p(c_i \mid s) = p(c_i)\prod_{k=1}^{n} p(x_k \mid c_i),    (44)

where p(x_k|c_i) is the conditional probability density and p(c_i) is the prior probability of class c_i. Despite the assumption that all features are independent not holding in the real world, the Bayes classifier has performed well with many real data sets. To estimate

p(xk |ci ) for continuous features, the Parzen window could always be used. The KNN rule is also popular in the pattern-recognition field. The KNN rule assigns an unlabeled sample to the class expressed by its k nearest neighbors in the data set. It is statistically justified (Cover & Hart, 1967) because, when the number N of samples and k, both approach to infinity as k/N → 0, the error rate of the KNN approximates the error rate of the optimal Bayes method, so KNN performs well in real applications. The drawback of the KNN rule is that it is not guaranteed to be the optimal path in the finite sample space. In theory, KNN implicitly assumes k nearest neighbors of a data point over a very small region, so it delivers good resolution in estimating the various conditional densities. However, the distance between two samples is not always negligible in practice and can even become very large outside the regions of high density. 5. Experiments In this section, we test our proposed OKMI method to select features from eight public data sets. The data sets have different features and samples, and the number of features in some data sets is large. However, few features are in the some sets, and the number of samples is large, so the ranges of the feature values differ greatly. Thus, various data sets present a greater challenge to the proposed method: they require a highly accurate and robust OKMI method under complex experimental conditions. Moreover, we compare the proposed OKMI methods with the existing methods for feature selection from the data sets above, and analyze their performance. 5.1. Data sets Eight real-world public data sets were used in our experiments, including four face image sets, one handwritten digits extracted from image set, and three microarray data sets. The face image sets include AR10P,1 PIX10P,1 PIE10P,1 and ORL10P.1 The handwritten digit set is the USPS set.2 The microarray sets include TOX-171,1 CLL-SUB-111,1 and the Lung Cancer set.3 We select various numbers of features and samples for these experiments to validate the algorithm, and data sets from different fields serve as good platforms for evaluating the performance of various feature-selection approaches. The data sets are described below (see also Table 4). (1) The first set consists of face images. The database samples a number m that ranges from 100 to 210. Their features of dimension n range from 2400 to 10,304, and all data sets have 10 classes. They are considered to be m  n type. For AR10P, the feature values range from 6 to 255; the range for

1 http://featureselection.asu.edu/datasets.php
2 http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html
3 http://penglab.janelia.org/proj/mRMR/#data


Table 4. Summary of real-world data sets.

Type                Data set name   # of samples   # of features   # of classes
Face image          AR10P           130            2400            10
Face image          PIE10P          210            2400            10
Face image          PIX10P          100            10,000          10
Face image          ORL10P          100            10,304          10
Handwritten digits  USPS            9298           256             10
MicroArray          TOX-171         171            5748            4
MicroArray          CLL-SUB-111     111            11,340          3
MicroArray          Lung cancer     73             325             7

PIE10P is similar to that for AR10P, but the minimum feature value is zero. The PIX10P feature values range from 0 to 62, and the ORL10P feature values range from 0 to 245. All values are integers. Thus, every feature vector has a dimension of about 100, and its density function is relatively simple because the data type is integer and the value range is not large. The only challenge comes from the size of the feature-vector space: the maximum number is 10,304, so the computation takes a long time when we use the wrapper method to select features, because every iteration requires a lot of computing resources. Thus, computing complexity is an important metric for comparing all methods.
(2) The second type consists of handwritten digits. The USPS set is a widely used handwritten-digit database with m = 9298 samples and n = 256 features. In contrast with the face data sets, USPS is of type m ≫ n, so the dimension of each feature vector is high and the number of feature vectors is low. Selecting features is done rapidly by the wrapper method. For USPS, the feature values are of double type and range from −1 to 1, so directly computing the probability density for USPS is difficult, and the data values need to be discretized beforehand.
(3) The last type comprises the microarray data sets. Their features come from biomedical images. The relationships between features and samples differ from those of the face-image or USPS databases, and the number n of features within TOX-171 or CLL-SUB-111 is greater than the number m of samples. The size of the CLL-SUB-111 feature space is 11,340; its feature values range from 0.88 to 7.7011 × 10^5, and the TOX-171 feature values range from 0.02 to 3392.3. The feature values are of double type, so calculating the probability density is difficult even when the data are discretized. In the experiments, some feature-selection methods consume too many resources to finish processing these databases. Conversely, for the lung cancer database, both the number of features and the number of samples are small, so the proposed approach works well for the lung cancer data set.
All data stored in the above feature databases are feature values extracted from the original images. Some sample images appear in Fig. 2. Besides the feature values, the data sets store the class labels, with the number of classes ranging from 3 to 10; thus, in this work, feature selection is a multi-class problem as well as a high-dimensional problem.

5.2. Evaluation metrics

To evaluate the performance of various feature-selection methods, three metrics, accuracy (ACC), normalized mutual information (NMI), and redundancy rate (RED), are used to measure the performance of the supervised methods.
ACC. Given samples x_i, i ∈ {1, ..., n}, let r_i and s_i denote the clustering class label and the true class label, respectively; the ACC is defined as follows:

ACC = \frac{\sum_{i=1}^{n}\delta(s_i, \mathrm{map}(r_i))}{n},    (45)

where δ(x, y) is the delta function [δ(x, y) = 1 if x = y; δ(x, y) = 0 otherwise], and map(r_i) is the mapping function that permutes the label r_i to match the true label s_i using the Kuhn–Munkres algorithm.

5.2.1. Normalized mutual information

NMI is a widely used metric for supervised feature selection. The proposed OKMI method is based on MI as defined in Eq. (8). In this experiment, we use the NMI metric to reflect the consistency between clustering labels and ground-truth labels. The NMI metric is also based on MI: given two random variables x and y, the NMI of x and y is defined as

NMI(x, y) = \frac{I(x, y)}{\max(H(x), H(y))},    (46)

where I(x, y) is the mutual information defined in Eq. (8), and H(x) and H(y) are the entropies of x and y, respectively. The NMI values range from 0 to 1 and are easy to obtain from the equation above; NMI = 1 means the two variables are identical, so NMI delivers a higher-level view of the clustering performance.

5.2.2. Redundancy rate

Given a selected feature set X with features x_i and x_j both in X, the redundancy rate of X is defined as

RED = \frac{1}{m(m-1)}\sum_{x_i, x_j \in X,\ i > j}\rho_{i,j},    (47)

where ρ_{i,j} is the correlation between features x_i and x_j. A large RED score indicates that many selected features are strongly correlated, which means that many redundant features have been selected. Thus, a small redundancy rate is preferable for our proposed feature-selection method.
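The NMI and RED metrics of Eqs. (46) and (47) can be computed as in the sketch below, assuming scikit-learn's normalized_mutual_info_score with max-entropy normalization matches Eq. (46); the function names are ours. ACC additionally requires the Kuhn–Munkres label matching described above, which is omitted here for brevity.

```python
# Sketch of the evaluation metrics in Eqs. (46) and (47).
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi(labels_pred, labels_true):
    # average_method="max" normalizes by max(H(x), H(y)) as in Eq. (46)
    return normalized_mutual_info_score(labels_true, labels_pred, average_method="max")

def redundancy_rate(X_selected):
    """Eq. (47): average pairwise correlation of the m selected feature columns."""
    m = X_selected.shape[1]
    corr = np.corrcoef(X_selected, rowvar=False)   # pairwise feature correlations rho_{i,j}
    iu = np.triu_indices(m, k=1)                   # each unordered pair counted once (i > j)
    return np.sum(corr[iu]) / (m * (m - 1))
```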

Fig. 2. Some original images: (a) ORL10P and (b) USPS.

Fig. 3. Comparison of performance using different kernels.

5.3. Comparison of kernels

In Section 4.1, we discussed the different kernel functions theoretically. We now discuss the real performance of kernels in feature selection, based on experiments that test the kernels on real-world data sets. Because choosing the kernel is part of the work of OKMI, we do not run the overall experiment but only compare the RBF kernel with three popular kernels in a simple test. Fig. 3 shows the mean classification accuracy of OKMI using various kernels. The RBF kernel performs best on all data sets, and the highest accuracy is obtained on the PIX10P data set (the average accuracy is 99.2%). The difference between the RBF kernel and the other kernels is greatest for the PIE10P data set: the RBF kernel gives 95.78% average accuracy, whereas the accuracies of the other kernels range from 68% to 72%. The accuracies of all kernels are similar and lowest for the CLLSUB data set, with the RBF kernel giving a maximum accuracy of only 56.8%. Although the accuracy

of the RBF kernel is much better than those of the other kernels, the running time of the other kernels, especially the linear kernel, is shorter than that of the RBF kernel. For example, the running time for the linear kernel is 45.9 s when 530 features are selected from PIE10P, whereas the running time of the RBF kernel is 102.4 s. Therefore, to meet our goal, we choose the RBF kernel as the OKMI kernel.

5.4. Comparison of classifiers

The OKMI framework does not specify a classifier, so we choose the best classifier for the OKMI method to obtain good performance from our feature-selection method. As already introduced, we mainly consider three classifiers: NB, SVM, and KNN. In the experiments, OKMI is applied to all data sets using the three classifiers. Fig. 4 shows the mean classification accuracy with NB, SVM, and KNN individually. OKMI with SVM gives a higher accuracy than with NB or KNN. The mean accuracy with KNN is near that with SVM: the highest accuracy with KNN is 95.4%, obtained with PIX10P, and the lowest is 53.8% with CLLSUB, and the corresponding accuracies with SVM are 99.2% and 56.8%, respectively.

Fig. 4. Comparison of performance when using different classifiers.

The NB classifier performs poorly in these experiments. It needs to calculate densities and probabilities, so the feature values must be positive; thus, the NB classifier sometimes cannot be run normally. The mean classification accuracy with NB is less than that of KNN or SVM: the highest and lowest accuracies with NB are 92.1% and 49.0%, respectively. Therefore, in the following experiments, OKMI selects features by using the SVM classifier, unless otherwise noted.

5.5. Comparison of methods

For the following experiments, we chose some representative supervised feature-selection methods to compare with the OKMI method: ReliefF (Kira & Rendell, 1992), Fisher score, correlation, mRMR (Peng et al., 2005), kernel optimization (Kopt) (Xiong et al., 2005), kernel class separability (KCS) (Wang, 2008), sparse additive models (SpAM) (Ravikumar et al., 2009),4 generalized multiple kernel learning (GMKL) (Varma & Babu, 2009), scaled class separability selection (SCSS) (Ramona et al., 2012), centered kernel target


alignment (cKTA) (Cortes et al., 2014),4 backward elimination using the Hilbert–Schmidt independence criterion (BAHSIC) (Song et al., 2012),4 the feature-wise kernelized Lasso (HSIC Lasso) (Yamada et al., 2014), and spectral feature selection with minimum redundancy (MRSF) (Zhao et al., 2010). Note that the first two are existing spectral feature-selection methods, that correlation has the simplest criterion, using covariance and variance, that HSIC Lasso is a kernel-based method for capturing the nonlinear dependence of input on output, and that mRMR and MRSF are state-of-the-art feature-selection methods for removing redundant features. According to the codes provided by the authors, BAHSIC (Song et al., 2012) is written in the Python language, the latest version of OpenKernel (Cortes et al., 2014) is known to work under Linux and MacOS X using g++, and the SpAM (Ravikumar et al., 2009) code is written in the R language. We ran these three programs on a GPU server with a Core i7-7700K CPU, an NVIDIA Titan X GPU, and 16 GB of memory, running Linux. The other experiments were implemented with Matlab ver. 7.0 and v2012 on an Intel Core i5-4250 CPU with 4 GB of memory running Windows 7.0. For each data set, because it is sorted by class label and there are more than two classes, randomly choosing samples from the data set would lead to some class labels appearing in the training data but not in the test data set (and vice versa), especially for sets with 10 or more classes. Therefore we used the measure C∗F/O (Chan & Kim, 2015), which represents the complexity of a data set, where C is the number of classes, F is the number of features, and O is the number of samples. Since O is small and C is large for some data sets, to confirm the data balance we used both hold-out and k-fold cross-validation to calculate the classification accuracy. According to the complexity of the various data sets, the value of k is set to 3–5 in k-fold cross-validation. With the hold-out method, we use the sorted data and choose samples one by one from the original features of the data set: one sample is chosen for the training data, and the next sample is chosen for the testing data. This process ensures a uniform distribution between the training and testing sets, with both sets containing every class label. In other reported experiments, the quantities of top selected features are similar across all data sets, e.g., the top m = 10, 20, ..., 200 features selected in Zhao et al. (2010), and the top m = 10, 20, ..., 50 features selected by each algorithm in Yamada et al. (2014). In the experiments reported herein, the number of samples and the number of features follow a complicated and changeable relationship, and the overall performance of the algorithm cannot be verified with a small number of selected features. The feature spaces of PIX10P, ORL10P, and CLL-SUB are greater than 10,000, and a method that only verifies less than 2% of the features is insufficient. In addition, some data sets have only a small number of features; e.g., the total numbers of features for USPS and lung cancer are only 256 and 325, respectively. Therefore, we do not select the same maximum number of features from all data sets; instead, the maximum number of features is selected according to the size of each data set's feature space. For instance, we select the top 10, 30, 50, 100, 150, 250, 300 features from the lung cancer data set, and the top 10, 80, 150, 200, 400, 800, 1100, 2400, 4000 features from TOX and CLLSUB.
The PIX10P feature space has a size of 10,000 but, as seen in Fig. 5(f), all the methods achieve high classification accuracy on PIX10P, and OKMI reaches 100% accuracy when m ≥ 80 features are selected; thus, we select the top 10, 80, 150, 300, 800 features. By varying the number of top features selected, the experiments test conditions with both small and large feature subsets.
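To make the data-preparation protocol above concrete, the following Python sketch implements the C*F/O complexity measure and the alternating hold-out split; the function names and the toy data are ours and are not taken from any released code.

```python
import numpy as np

def complexity_measure(X, y):
    """C*F/O complexity of a data set (Chan & Kim, 2015):
    C = number of classes, F = number of features, O = number of samples."""
    C = len(np.unique(y))
    O, F = X.shape
    return C * F / O

def alternating_holdout_split(X, y):
    """Hold-out split on class-sorted data: samples are assigned to the
    training and test sets alternately, so every class label appears
    in both sets even when there are many classes."""
    idx = np.arange(len(y))
    train_idx, test_idx = idx[0::2], idx[1::2]   # alternate assignment
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy usage with a small, class-sorted data set (hypothetical values).
X = np.random.rand(20, 5)
y = np.repeat(np.arange(4), 5)            # 4 classes, data sorted by label
print(complexity_measure(X, y))           # C*F/O = 4*5/20 = 1.0
X_tr, y_tr, X_te, y_te = alternating_holdout_split(X, y)
assert set(y_tr) == set(y_te)             # both splits contain every class
```

For the k-fold runs, any standard stratified k-fold routine with k between 3 and 5 serves the same balancing purpose.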



The results of all these experiments confirm the objectivity of the comparison among the methods. Fig. 5(a)–(h) shows the classification accuracy for different numbers of selected features on the various data sets. Fig. 5(a) and (c)–(e) show that the overall performance of OKMI compares favorably with the other approaches for USPS, PIE10P, lung cancer, and ORL in terms of classification accuracy, and OKMI significantly outperforms the other methods for AR10P, PIX10P, TOX, and CLLSUB, as shown in Fig. 5(b) and (f)–(h). For the TOX and CLLSUB data sets, mRMR terminates because of insufficient memory: the mRMR algorithm must estimate probability densities, and the range of feature values in these two data sets is too large, so the approach requires too much memory. Thus, no mRMR data appear in Fig. 5(g) and (h) or in Table 5. Fig. 5(a), (c), and (d) show that OKMI is slightly less accurate when the number of features is relatively small (i.e., between 10 and 80); however, for more than 100 features, OKMI feature selection is more accurate than the other methods. Fig. 5(b), (e), and (h) also suggest that more features do not necessarily yield higher accuracy: performance peaks at around 100 features.

Table 5 shows the average classification accuracy for various numbers m of top features selected by the individual methods. All methods are highly accurate for PIX10P and PIE10P, but the classification accuracies for TOX and CLLSUB are rather poor, indicating that the features of PIX10P and PIE10P are better than those of TOX and CLLSUB. For PIX10P, the highest classification accuracy of OKMI is 100% and the lowest is 96%, with a mean accuracy of 99%. For CLLSUB, however, the highest accuracy of OKMI is 65% and the lowest is 38% (for 10 features), and the mean accuracy of OKMI is 56.8%, which is better than the other methods. The results in Table 5 show that, for most data sets, the mean classification accuracy of OKMI is significantly better than that of the other methods. Note that mRMR is more accurate than OKMI for USPS and HSIC Lasso is the most accurate for ORL, but in both cases the difference from OKMI is only slightly more than 1%.

Table 6 shows the average redundancy (RED) rates of the top n features selected by the various approaches, where n depends on the data set. A higher RED value indicates that more unnecessary features are present, i.e., features that can be expressed as linear combinations of the remaining feature vectors. The lowest value in each column of Table 6 marks the least-redundant selection for that data set. As can be seen, OKMI attains the lowest RED value for all data sets except PIE10P and TOX. Overall, the RED values of OKMI are smaller, which means that OKMI selects less-redundant features, so the minimal-redundancy component of OKMI is effective.
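As a point of reference for Table 6, a common way to compute such a redundancy rate is the average absolute pairwise Pearson correlation among the m selected feature columns; this is our assumption about the exact formula, offered only to make the metric concrete. A minimal sketch:

```python
import numpy as np

def redundancy_rate(X_selected):
    """Average absolute pairwise Pearson correlation of the selected columns.
    X_selected: (n_samples, m) matrix restricted to the m chosen features (m >= 2)."""
    m = X_selected.shape[1]
    R = np.corrcoef(X_selected, rowvar=False)   # m x m correlation matrix
    iu = np.triu_indices(m, k=1)                # upper triangle, i < j
    return np.abs(R[iu]).mean()                 # in [0, 1]; lower is better
```

A lower value means the selected columns carry less overlapping information, which matches the behaviour of OKMI in Table 6.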
5.6. Computation complexity

The OKMI, mRMR, and correlation methods all approximate the MI selection scheme, so we compare only these three in terms of computation complexity, i.e., their running time on real data. Note that mRMR cannot be applied to the TOX and CLLSUB data sets, so we compare the methods on the six remaining data sets. We measure computation complexity as the average time needed to select a single feature: each method selects the top m features from a data set, the total time T is recorded, and the average running time is T/m, in seconds per feature. Table 7 shows the time required by the three methods to select features from the data sets. For the lung cancer set, all methods finish quickly because of the small number of samples and features (see Table 4): the OKMI method takes about 0.0025 s to select a single feature, whereas mRMR and correlation require 0.0113 and 0.0125 s, respectively. However, the running times for the ORL and PIX sets are long because the total number of features is large (≥ 10,000); thus, the time cost is high when the experiment selects the top 4000 features from these sets.
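The seconds-per-feature figures in Table 7 follow directly from this protocol; the sketch below shows the timing harness we have in mind, where select_top_m is a placeholder for any of the compared selectors, not a function from the released codes.

```python
import time

def seconds_per_feature(select_top_m, X, y, m):
    """Average wall-clock cost of one selected feature: run the selector once
    to pick the top m features and return T/m."""
    t0 = time.perf_counter()
    select_top_m(X, y, m)                 # e.g. an OKMI, mRMR, or correlation ranker
    T = time.perf_counter() - t0
    return T / m
```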

Fig. 5. Accuracy of OKMI classification and different methods for real-world data.

We also tested the computation complexity for the TOX and CLLSUB data sets, which are too demanding for mRMR. For TOX, the correlation method takes 2.30 s per feature and OKMI takes 1.83 s; for CLLSUB, the correlation method requires an average of 4.21 s to find a feature, whereas OKMI requires 1.26 s. The TOX and CLLSUB data sets not only contain a large number of features, but their feature values are also more complex than those of the other data sets. As a result, the time cost for TOX and CLLSUB is higher than for the other data sets.


Table 5
Mean classification accuracy (%) with standard deviation in parentheses.

Algorithm      USPS         AR10P        PIE10P       PIX10P       LUNG         ORL          TOX          CLLSUB
ReliefF        86.6(8.88)   81.6(8.77)   91.3(5.98)   93.2(4.65)   79.8(12.5)   83.8(13.3)   76.9(18.0)   51.5(5.55)
Fisher score   85.7(8.32)   80.3(6.76)   92.0(5.18)   93.0(7.31)   78.7(11.9)   83.4(12.8)   72.1(19.4)   49.7(7.01)
Correlation    90.5(8.22)   89.2(5.75)   93.2(5.34)   95.2(3.03)   82.6(11.0)   88.2(5.02)   75.4(20.4)   54.7(7.30)
Kopt           85.2(6.77)   79.8(8.15)   74.8(10.25)  90.2(6.71)   71.3(20.4)   78.4(18.2)   68.2(21.0)   50.2(14.5)
mRMR           91.6(5.02)   90.7(5.47)   96.1(2.74)   98.0(4.47)   85.0(8.12)   87.2(10.3)   – (–)        – (–)
KCS            82.3(9.45)   83.5(6.77)   90.6(4.56)   89.3(9.6)    69.1(23.5)   79.5(6.5)    56.5(26.3)   47.6(18.3)
SpAM           87.5(5.33)   85.9(10.2)   95.5(6.2)    95.7(14.6)   82.1(4.3)    84.3(10.2)   80.2(6.8)    52.4(11.2)
GMKL           84.8(10.2)   81.4(9.76)   89.2(7.93)   91.4(3.78)   80.3(5.9)    82.4(9.6)    78.3(15.5)   51.3(9.42)
MRSF           90.9(5.09)   89.4(3.88)   93.9(4.52)   94.8(4.54)   80.3(10.3)   84.4(10.0)   78.0(20.2)   54.1(7.59)
SCSS           81.7(6.98)   84.2(10.1)   91.5(4.29)   95.7(10.3)   86.2(14.1)   81.2(11.8)   76.3(22.3)   49.2(9.8)
cKTA           89.3(7.28)   70.9(20.7)   88.4(2.18)   92.0(9.0)    88.1(2.3)    84.2(15.9)   80.4(11.0)   55.8(10.4)
BAHSIC         87.2(6.23)   77.3(12.2)   90.8(9.1)    92.0(11.3)   80.6(12.6)   85.6(3.5)    71.5(8.7)    50.3(10.6)
HSIC Lasso     85.1(8.45)   85.6(6.19)   93.4(3.18)   95.0(4.42)   81.9(10.3)   88.8(9.47)   79.3(19.0)   51.0(7.28)
OKMI           90.3(7.51)   96.1(2.66)   95.7(4.10)   99.2(1.78)   89.5(2.46)   87.6(12.1)   82.5(18.9)   56.8(8.17)

Table 6
Mean redundancy rate (%) with standard deviation in parentheses.

Algorithm      USPS         AR10P        PIE10P       PIX10P       LUNG         ORL          TOX          CLLSUB
ReliefF        46.6(10.2)   79.6(8.77)   36.3(10.2)   77.2(4.65)   39.8(9.25)   33.8(13.5)   36.9(10.3)   58.5(5.55)
Fisher score   45.7(10.8)   60.3(6.23)   37.0(5.18)   83.0(7.31)   38.7(11.9)   73.4(12.8)   52.1(8.34)   69.7(9.78)
Kopt           45.2(7.24)   59.8(8.30)   34.8(8.35)   80.2(6.42)   41.3(9.05)   78.4(18.2)   58.2(13.0)   50.2(14.5)
Correlation    40.5(11.8)   59.2(8.57)   33.2(8.13)   85.2(3.03)   42.6(15.3)   78.2(13.2)   55.4(10.4)   34.7(7.30)
mRMR           31.6(11.2)   26.9(3.47)   22.5(3.74)   21.0(6.47)   35.0(8.12)   27.2(9.30)   – (–)        – (–)
KCS            42.3(8.5)    53.5(10.2)   40.6(4.8)    29.2(10.1)   38.4(5.7)    34.5(6.7)    45.7(5.5)    42.1(10.9)
SpAM           37.5(6.4)    65.2(8.8)    25.5(7.8)    83.5(10.7)   42.5(5.4)    34.8(10.2)   40.2(9.8)    42.5(10.3)
GMKL           34.8(9.4)    41.4(9.02)   29.4(8.7)    41.4(14.5)   41.2(6.91)   37.3(10.1)   46.5(19.2)   42.9(10.13)
MRSF           30.9(5.09)   25.4(3.88)   23.9(4.52)   26.8(4.54)   30.3(10.3)   24.4(10.0)   18.0(2.02)   34.1(7.59)
SCSS           32.6(5.7)    34.2(12.8)   31.5(4.67)   45.5(12.3)   36.2(13.8)   45.4(13.4)   34.5(18.5)   38.7(7.6)
cKTA           39.3(7.45)   31.8(4.67)   27.8(3.27)   42.0(7.6)    48.1(14.3)   44.3(6.5)    40.6(12.5)   45.8(10.2)
BAHSIC         37.6(7.41)   34.3(7.2)    30.2(7.65)   33.6(8.5)    40.4(8.6)    37.1(9.5)    37.5(11.2)   41.9(7.8)
HSIC Lasso     25.1(8.45)   18.5(2.19)   13.4(1.48)   17.7(2.42)   31.9(10.3)   19.8(2.67)   38.3(27.0)   34.4(3.48)
OKMI           20.3(7.51)   16.1(3.66)   15.7(4.10)   14.2(1.78)   21.5(4.76)   17.6(10.8)   32.5(8.92)   26.8(4.15)

Table 7
Mean running time (seconds per feature).

Algorithm                  USPS     LUNG     AR10P    PIE10P   ORL      PIX10P
mRMR (Peng et al., 2005)   0.2162   0.0113   2.0542   1.1204   3.5690   1.2829
Correlation                0.5851   0.0125   0.8071   1.0256   3.4422   3.2683
OKMI                       0.1193   0.0025   0.0322   0.7580   0.6967   0.6963

The data analysis and our experiments with six data sets show that OKMI has a lower average running time than mRMR and correlation. These results demonstrate the effectiveness of the OKMI method for complex computations.

5.7. Analysis of results

With the proposed method, we focus on feature selection over various data sets. The OKMI framework consists of a kernel function and a classifier and uses a combined wrapper-filter process to select features. This approach leads to high accuracy and minimal redundancy. The OKMI method optimizes the MI through the kernel function to yield a candidate feature subset, so the features in the subset satisfy new metrics of max-relevance and min-redundancy. The OKMI method is especially appropriate for multi-class, high-dimensional problems with complex feature values. Compared with a traditional MI algorithm, OKMI does not calculate the multivariate joint probability and density functions, which are very difficult to compute and can easily lead to mistakes, as demonstrated by the analysis and experiments discussed in the previous section. In this work, we used a kernel-based method to compute the MI, which overcomes the difficulty of calculating the multivariate joint probability and density functions.

This kernel-based method improves the accuracy and reduces the redundancy. In the experiments reported herein, OKMI feature selection usually gives higher classification accuracy than the other methods; however, the classification accuracy and the redundancy rate occasionally fluctuate. For instance, Tables 5 and 6 show that OKMI does not achieve the best performance on one or two data sets. Several conditions may cause this variation: some degraded features may introduce noise and thereby affect the classification result; cross validation may cause fluctuations in the classification; and a third reason is the difference between relevance and redundancy [see Eq. (22)]. A possible remedy is to impose a larger penalty λ in Eq. (5) on a redundant feature with large relevance, which would otherwise surely be selected as a top feature. Traditional feature-selection methods rely on an incremental search. In these experiments we tested forward and backward selection, and not all of the methods can guarantee a globally optimal search path. The main difficulty is the impracticality of exhaustively searching the entire feature space; moreover, global optimization can lead to overfitting and high time costs. Thus, OKMI delivers good classification accuracy at low computational cost.
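To illustrate the kind of greedy, kernel-based max-relevance/min-redundancy search described above, the sketch below uses an HSIC-style dependence between Gaussian Gram matrices as a stand-in for the optimized kernel MI. It is not the authors' OKMI objective from Eqs. (5) and (22); the bandwidth gamma, the penalty lam, and all function names are our own illustrative choices.

```python
import numpy as np

def rbf_gram(x, gamma=1.0):
    """Gaussian (RBF) Gram matrix of a 1-D vector (feature column or labels)."""
    d = x.reshape(-1, 1) - x.reshape(1, -1)
    return np.exp(-gamma * d ** 2)

def kernel_dependence(Ka, Kb):
    """HSIC-style dependence between two Gram matrices (centred here)."""
    n = Ka.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    return np.trace(H @ Ka @ H @ Kb) / (n - 1) ** 2

def greedy_kernel_selection(X, y, m, lam=0.5, gamma=1.0):
    """Pick m features: maximise kernel dependence on the labels (relevance)
    minus lam times the mean dependence on already-selected features (redundancy)."""
    n, d = X.shape
    Ky = rbf_gram(y.astype(float), gamma)
    K = [rbf_gram(X[:, j], gamma) for j in range(d)]          # per-feature Gram matrices
    relevance = np.array([kernel_dependence(K[j], Ky) for j in range(d)])
    selected = []
    for _ in range(m):
        best_j, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            redundancy = (np.mean([kernel_dependence(K[j], K[s]) for s in selected])
                          if selected else 0.0)
            score = relevance[j] - lam * redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

Because the per-feature Gram matrices are computed once and reused at every step, such a search stays cheap relative to estimating multivariate joint densities, which is the property the discussion above emphasises.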


In these experiments, the goal is to compare various kernels, classifiers, and other feature-selection methods. Under most complexity conditions, OKMI increases the classification accuracy and decreases the time cost. It gives the highest classification accuracy for PIX10P. For CLLSUB the absolute performance is unsatisfactory, but OKMI still achieves the highest accuracy among the compared methods and, unlike some of them, can at least be applied to this data set. The results show that OKMI is a highly accurate and effective method for selecting features and can be considered an open framework. These experiments indicate that the nonlinear kernel performed best in our feature-selection evaluation. Our method has two implications: (1) apply a Gaussian kernel when nonlinear effects are present, such as the multi-class or high-dimensional effects of different data sets; and (2) our approach also provides global optimization, which can be used to bound the quality of other heuristic approaches, such as local search, that may perform better than the greedy algorithm on specific problems. These experiments also imply a desirable property of OKMI as a whole: it correlates well with the observed outcomes.

6. Conclusions

In this paper, we propose an OKMI approach to select features by using a kernel function and mutual information. The OKMI method improves the selection and avoids the complex computation of the joint probability density. We report the results of experiments on eight data sets, some with more than 10,000 features. The average rate of correct classification is greater than those produced by the compared methods. The OKMI method integrates a kernel function, a classifier, and mutual information, and the experimental results demonstrate that OKMI rapidly searches candidate features among the remaining feature subsets with high performance. The major advantages of the OKMI method are:

(1) It is robust when applied to a complex feature set.
(2) It can handle multi-class problems.
(3) It works fairly well for high-dimensional data.
(4) It provides higher accuracy at low computation complexity.

In the future, more work is needed on both the theoretical and practical aspects of this method. Specifically, future research should at least deal with:

(1) Advancing the theoretical analysis; we want to find a compressed representation of the data itself in the hope that it is useful for subsequent learning tasks.
(2) Automatically finding the optimal feature subset without an assigned number; it would be appealing if the optimal number of features to be selected could be determined automatically.
(3) Applying OKMI to real problems, such as handwriting or face recognition, with particular attention to improving the robustness and performance of the OKMI method.
(4) Extending the formulation in this paper to unsupervised transformation methods. Because of space limitations, we focused this paper on the supervised case; we will explore unsupervised dimensionality reduction methods as a future extension of this work.

Acknowledgments

This work was supported by the Guangdong Provincial Government of China through the "Computational Science Innovative Research Team" program and the Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University, the Technology Program of Guangdong (Grant no. 2012B091100334), the National Natural Science Foundation of China (Grant no. 11471012), and the China Scholarship Council (Grant no. 201506385010).

References

Almuallim, H., & Dietterich, T. G. (1994). Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1), 279–305.
Alzate, C., & Suykens, J. A. (2012). Hierarchical kernel spectral clustering. Neural Networks, 35, 21–30.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1), 245–271.
Chan, H. P., & Kim, S. B. (2015). Sequential random k-nearest neighbor feature selection for high-dimensional data. Expert Systems with Applications, 42(5), 2336–2342.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1), 131–159.
Cheriet, M., Kharma, N., Liu, C.-L., & Suen, C. (2007). Character recognition systems: A guide for students and practitioners. Hoboken, New Jersey: John Wiley & Sons.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2014). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(2), 795–828.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1), 155–176.
Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(2), 185–205.
Eriksson, T., Kim, S., Kang, H.-G., & Lee, C. (2005). An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Processing Letters, 12(7), 500–503.
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the international conference on algorithmic learning theory (pp. 63–77).
Gretton, A., Herbrich, R., & Smola, A. J. (2003). The kernel mutual information. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (pp. 880–883).
Guestrin, C., Krause, A., & Singh, A. P. (2005). Near-optimal sensor placements in Gaussian processes. In Proceedings of the international conference on machine learning (pp. 265–272).
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. In Proceedings of advances in neural information processing systems (pp. 507–514).
Hou, C., Nie, F., Li, X., Yi, D., & Wu, Y. (2014). Joint embedding learning and sparse regression: A framework for unsupervised feature selection. IEEE Transactions on Cybernetics, 44(6), 793–804.
Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.
Keogh, E., & Pazzani, M. (1999). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In Proceedings of the 7th international workshop on artificial intelligence and statistics (pp. 225–230).
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the AAAI conference on artificial intelligence: 2 (pp. 129–134).
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1), 273–324.
Lal, T. N., Chapelle, O., Weston, J., & Elisseeff, A. (2006). Embedded methods. In Feature extraction (pp. 137–165). Springer.
Lin, S. W., Ying, K. C., Chen, S. C., & Lee, Z. J. (2008). Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4), 1817–1824.
Masaeli, M., Fung, G., & Dy, J. G. (2010). From transformation-based dimensionality reduction to feature selection. In Proceedings of the international conference on machine learning (pp. 751–758).
Mitra, P., Murthy, C., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312.
Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238.
Perkins, S., Lacker, K., & Theiler, J. (2003). Grafting: Fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, 3, 1333–1356.
Pudil, P., Novovičová, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11), 1119–1125.
Ramona, M., Richard, G., & David, B. (2012). Multiclass feature selection with kernel Gram-matrix-based criteria. IEEE Transactions on Neural Networks and Learning Systems, 23(10), 1611–1623.
Ravikumar, P., Lafferty, J., Liu, H., & Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society, 71(5), 1009–1030.
Sakai, T., & Sugiyama, M. (2014). Computationally efficient estimation of squared-loss mutual information with multiplicative kernel models. IEICE Transactions on Information and Systems, 97(4), 968–971.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299–1319.

Song, L., Smola, A., Gretton, A., Bedo, J., & Borgwardt, K. (2012). Feature selection via dependence maximization. The Journal of Machine Learning Research, 13(1), 1393–1434.
Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning. In Proceedings of the international conference on machine learning (pp. 1065–1072).
Wang, J. J.-Y., Bensmail, H., & Gao, X. (2014). Feature selection and multi-kernel learning for sparse representation on a manifold. Neural Networks, 51, 9–16.
Wang, L. (2008). Feature selection with kernel class separability. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9), 1534–1546.
Wang, S., Pedrycz, W., Zhu, Q., & Zhu, W. (2015). Subspace learning for unsupervised feature selection via matrix factorization. Pattern Recognition, 48(1), 10–19.


Xiong, H., Swamy, M. N., & Ahmad, M. O. (2005). Optimizing the kernel in the empirical feature space. IEEE Transactions on Neural Networks, 16(2), 460–474.
Yamada, M., Jitkrittum, W., Sigal, L., Xing, E. P., & Sugiyama, M. (2014). High-dimensional feature selection by feature-wise kernelized lasso. Neural Computation, 26(1), 185–207.
Zhao, Z., Wang, L., & Liu, H. (2010). Efficient spectral feature selection with minimum redundancy. In Proceedings of the AAAI conference on artificial intelligence.
Zhu, P., Zuo, W., Zhang, L., Hu, Q., & Shiu, S. C. (2015). Unsupervised feature selection by regularized self-representation. Pattern Recognition, 48(2), 438–446.