Multiple indefinite kernel learning for feature selection

Hui Xue, Yu Song, Hai-Ming Xu
School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China
MOE Key Laboratory of Computer Network and Information Integration (Southeast University), China
Article history: Received 17 July 2019; Received in revised form 22 November 2019; Accepted 24 November 2019.
Keywords: Feature selection; Multiple indefinite kernel learning; Indefinite kernel; DC programming.
Abstract: Multiple kernel learning for feature selection (MKL-FS) utilizes kernels to explore complex properties of features and performs better among embedded methods. However, the kernels in MKL-FS are generally limited to be positive definite. In fact, indefinite kernels often emerge in actual applications and can achieve better empirical performance. But due to the non-convexity of indefinite kernels, existing MKL-FS methods are usually inapplicable and the corresponding research is relatively scarce. In this paper, we propose a novel multiple indefinite kernel feature selection method (MIK-FS) based on the primal framework of the indefinite kernel support vector machine (IKSVM), which applies an indefinite base kernel for each feature and then exerts an l1-norm constraint on the kernel combination coefficients to select features automatically. A two-stage algorithm is further presented to optimize the coefficients of IKSVM and of the kernel combination alternately. In the algorithm, we reformulate the non-convex optimization problem of the primal IKSVM as a difference of convex functions (DC) programming and transform the non-convex problem into a convex one with the affine minorization approximation. We further utilize a leverage score sampling method to select landmark points for solving large-scale problems. Moreover, we extend MIK-FS to multi-class feature selection scenarios. Experiments on real-world datasets demonstrate that MIK-FS is superior to some related state-of-the-art methods in both feature selection and classification performance.
1. Introduction

The kernel method is one of the most popular techniques in machine learning. It works by mapping the input data into a high-dimensional feature space to solve nonlinear problems effectively, where the mapping is represented implicitly by a kernel. However, selecting an appropriate kernel function and tuning its parameters are both difficult. Multiple kernel learning (MKL) uses a combination of base kernels instead of a single kernel, and skillfully converts the problem of kernel selection into the learning of multiple kernel combination coefficients, which has been regarded as a promising strategy to improve the performance of kernel methods [1–3].

In the past few years, MKL has also been generalized to the feature selection field and shown to be among the most effective methods for nonlinear feature selection, where the predictor is a nonlinear function of the input features. Multiple kernel learning
for feature selection (MKL-FS) utilizes a base kernel for each feature and automatically selects the features corresponding to the non-zero kernel combination coefficients. As a result, MKL-FS can not only uncover complicated properties of the features effectively, but also convert the selection of the features into the learning of a sparse combination of multiple kernels [4,5]. Chen et al. treated the feature selection problem of gene expression data as an ordinary multiple parameter learning problem based on a multiple kernel support vector machine [6]. Dileep and Sekhar further applied this method in image categorization tasks and showed its superiority over principal component analysis (PCA) [7]. Varma and Babu proposed a more generalized MKL scheme for feature selection where the combination of base kernels can be nonlinear [5]. Xu et al. presented a non-monotonic feature selection method to select a specific number of features, and the corresponding combinatorial optimization problem was approximated by an MKL problem [4]. Tan et al. focused on sparse support vector machines (SVM) with the l0-norm, whose convex relaxation can be further formulated as an MKL problem where each kernel corresponds to a sparse feature subset [8]. The authors further proposed a modified accelerated proximal gradient method to speed up the training and extended sparse SVM to group feature selection and nonlinear feature selection [9]. Yamada et al. proposed an alternative feature-wise kernelized Lasso to capture nonlinear input–output dependency
and select informative features according to the kernel coefficients [10]. Maldonado et al. proposed a feature selection approach designed to deal with class-imbalance problems, which can be extended to model complex nonlinear patterns by kernel methods [11]. Moghaddam and Hamidzadeh proposed a new kernel function derived from Hermite orthogonal polynomials for SVM, which can reduce the number of support vectors and increase classification speed [12].

However, MKL-FS methods usually require that the base kernels be positive definite (PD) and satisfy Mercer's condition in order to obtain a bi-convex optimization problem. In fact, standard PD kernels are inapplicable in many practical situations. For example, kernels obtained from similarity measures often violate Mercer's condition and are not positive definite [13,14]. On the contrary, indefinite kernels have played an increasingly important role in machine learning and shown much better performance in some scenarios than PD kernels. Kowalski et al. focused on using mixed norm regularization to reach better sparsity [15]. Liwicki et al. applied an indefinite robust gradient-based kernel in an incremental kernel principal component analysis (KPCA) algorithm for visual tracking and achieved more efficient results [16]. Xue et al. integrated additional problem-specific prior knowledge into the construction of indefinite kernels and presented its superiority in supervised and semi-supervised classification [17]. Xu et al. further solved the single indefinite kernel SVM with difference of convex functions (DC) programming [18]. Recently, indefinite kernels have been widely studied in dimensionality reduction. Liu utilized an indefinite fractional power polynomial kernel in KPCA for face recognition, which can achieve higher recognition accuracies than KPCA using PD polynomial kernels [19]. Haasdonk and Pekalska proposed indefinite kernel discriminant analysis (IKDA) to extend traditional KDA to indefinite kernel scenarios [20]. Huang et al. addressed PCA with indefinite kernels (IKPCA) in the framework of the least squares support vector machine and then gave IKPCA a feature space interpretation [21]. However, applying indefinite kernels to feature selection has received relatively little research attention. Moreover, due to the non-convexity of indefinite kernels, almost all existing MKL-FS methods cannot be used.

Furthermore, traditional MKL-FS methods were designed only for binary problems. Since multi-class problems are very common in reality, it is necessary to extend the binary model to multi-class scenarios. Although traditional MKL methods can be extended to multi-class scenarios via one-vs-one or one-vs-all strategies, such strategies simply split the multi-class problem into several independent binary problems and are not suitable for feature selection. In fact, they can only select features independently, and the features selected from different binary problems are usually different. That is to say, features selected from one binary problem may not be suitable for other binary problems, and they also cannot be ranked uniformly according to their weights across all the binary problems. As a result, traditional MKL-FS methods are unlikely to select a fixed set of features directly for multi-class problems when using multi-class strategies.
In contrast, some works based on linear classifiers [22–24] have been developed to deal with multi-class feature selection problems, where the linear SVM is the most commonly used classifier. Xu and Zhang extended recursive SVM to multi-class classification and also implemented the multi-class SVM recursive feature elimination method in the workflow of R-SVM [22]. Cai et al. proposed an l2,1-norm multi-class SVM, which uses a structured regularization term over all the classes to naturally select features for multi-class problems [23]. Xu et al. introduced a scaling factor with a flexible parameter to adjust the distribution of feature weights and select the most discriminative features [24]. However, these works are mainly
based on linear SVMs and are unable to perform nonlinear feature selection.

In this paper, we propose a novel multiple indefinite kernel feature selection method (MIK-FS) based on the primal framework of the indefinite kernel support vector machine (IKSVM) in embedded method scenarios. MIK-FS uses an indefinite base kernel to represent each feature, and an l1-norm constraint is then enforced on the kernel combination coefficients in order to select features naturally. A two-stage algorithm is further presented to optimize the coefficients of IKSVM and of the kernel combination alternately. Concretely, when the kernel combination coefficients are fixed, the non-convex primal problem of IKSVM is reformulated as a difference of convex functions (DC) programming and then converted into a convex one by the affine minorization approximation. Once the coefficients of IKSVM have been solved, the kernel combination coefficients can be optimized by projected gradient descent. We further prove that the value of the objective function in MIK-FS is strictly monotonically decreasing in each iteration of the algorithm. For large-scale problems, a leverage score sampling method is further adopted to select landmark points in the training process. Finally, we extend MIK-FS to multi-class scenarios. Experimental results on real-world datasets show that MIK-FS outperforms some related methods in terms of feature selection and classification. The conference version of this paper was a first attempt at simple binary feature selection problems. The differences between the conference version and this paper are as follows:
• We extend binary feature selection problems to multi-class feature selection scenarios, which can perform nonlinear feature selection effectively and select features uniformly for all classes.
• We utilize a leverage score sampling method to select landmark points for solving large-scale problems and validate the superiority of our method under the condition of sampling.
• More detailed experiments are conducted to validate that our methods are superior to other state-of-the-art feature selection algorithms.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related works on feature selection. In Section 3, we present the proposed MIK-FS method. Section 4 describes the optimization algorithm for solving our model. Section 5 presents the sampling method to select landmark points of large-scale datasets. In Section 6, we extend MIK-FS to solve multi-class problems. Section 7 provides the experimental results. Section 8 concludes this paper.

2. Related work

Given a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^M is a training sample and y_i ∈ {−1, +1} is the corresponding class label, kernel methods are powerful tools to capture the nonlinear relationship between the output labels and the input samples. A kernel function is defined as follows.

Definition 1 (Kernel Function). Let χ be the input space and {x_1, . . . , x_n} ∈ χ be the sample set. A function k : χ × χ −→ R is a kernel function if there exist an inner product space H and a map ϕ : χ −→ H satisfying

    k(x_i, x_j) = ⟨ϕ(x_i), ϕ(x_j)⟩_H,   ∀ x_i, x_j ∈ χ

where ⟨·, ·⟩_H is the inner product of the space H.
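As a concrete illustration of Definition 1 (our own example, not part of the original paper), the homogeneous degree-2 polynomial kernel k(x, z) = (x^T z)^2 admits the explicit feature map ϕ(x) = (x_i x_j)_{i,j}; the following NumPy sketch verifies the inner-product identity numerically.

```python
import numpy as np

def phi(x):
    # Explicit feature map of the degree-2 polynomial kernel: all pairwise products x_i * x_j.
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

k_direct  = (x @ z) ** 2        # k(x, z) = (x^T z)^2
k_via_map = phi(x) @ phi(z)     # <phi(x), phi(z)>_H with H = R^25
print(np.isclose(k_direct, k_via_map))   # True: k satisfies Definition 1
```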
A kernel function is said to be positive definite if ∀f ∈ H, ⟨f, f⟩_H ≥ 0, and negative definite if ∀f ∈ H, ⟨f, f⟩_H < 0. Otherwise it is called indefinite.

To perform nonlinear feature selection, MKL-FS first applies a base kernel k_m to each feature of the samples and then combines these kernels in one of two ways: an additive or a multiplicative combination. Most MKL-FS methods integrate the kernels into an additive combination [7]

    k(x_i, x_j) = ∑_{m=1}^{M} d_m k_m(x_{i,m}, x_{j,m}),   d_m ≥ 0        (1)

where x_{i,m} denotes the mth feature of x_i and d_m represents the coefficient of the kernel k_m. Varma and Babu further extended the linear combination of Eq. (1) into a nonlinear product form [5]

    k(x_i, x_j) = ∏_{m=1}^{M} d_m k_m(x_{i,m}, x_{j,m}),   d_m ≥ 0        (2)

MKL-FS aims to learn a sparse combination of the kernels so that the features can be selected naturally. Consequently, SVM-based MKL-FS methods further embed the kernel combination into the dual problem of SVM, which can learn the coefficients of SVM and of the combination simultaneously

    min_{α,d}  (1/2) ∑_{i,j=1}^{n} α_i α_j y_i y_j k(x_i, x_j) − ∑_{i=1}^{n} α_i + λ∥d∥_1        (3)
    s.t.  ∑_{i=1}^{n} α_i y_i = 0
          0 ≤ α_i ≤ C,  i = 1, . . . , n

where d is the vector of kernel combination coefficients. An l1-norm regularizer is used on d to obtain a sparse solution. When the kernels are PD, MKL-FS can select the informative features while training a good SVM classifier. However, when the kernels become indefinite, directly embedding the combination into the dual problem of the indefinite kernel SVM (IKSVM) is more likely to lead to bad classification performance. In the case of indefinite kernels, the primal and dual problems of IKSVM are both non-convex. Consequently, there is a dual gap between the two problems and their solutions are not equivalent [18]. As a result, the selected features and the classifier's coefficients learned from the dual IKSVM are more likely not beneficial for the subsequent classification. In other words, it is more reasonable that the indefinite kernel combination should be embedded into the primal IKSVM. However, existing SVM-based MKL-FS methods mostly focus on the dual problem and will fail in solving the non-convex indefinite kernel problems.
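Before moving on, the following minimal NumPy sketch (illustrative only; the parameter values and helper names are our own) builds the per-feature base kernels and the additive combination of Eq. (1) with the indefinite sigmoid kernel that is also used later in the experiments; the eigenvalue check at the end shows why such a combined kernel matrix generally fails to be positive definite.

```python
import numpy as np

def sigmoid_base_kernels(X, a=1.0, r=1.0):
    # One base kernel matrix per feature: K_m[i, j] = tanh(a * x_{i,m} * x_{j,m} - r).
    # The sigmoid kernel generally violates Mercer's condition, so K_m need not be PSD.
    n, M = X.shape
    Ks = np.empty((M, n, n))
    for m in range(M):
        f = X[:, m][:, None]                  # the m-th feature as an n x 1 column
        Ks[m] = np.tanh(a * (f @ f.T) - r)
    return Ks

def combine_kernels(Ks, d):
    # Additive combination of Eq. (1): K = sum_m d_m * K_m with d_m >= 0.
    return np.tensordot(d, Ks, axes=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                  # 30 samples, 5 features (toy data)
Ks = sigmoid_base_kernels(X)
K = combine_kernels(Ks, np.ones(5))           # uniform, non-sparse coefficients

eigvals = np.linalg.eigvalsh((K + K.T) / 2)   # symmetrize before the eigenvalue check
print(eigvals.min(), eigvals.max())           # a negative smallest eigenvalue means K is indefinite
```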
3. MIK-FS

In this section, we present our multiple indefinite kernel feature selection method (MIK-FS). In MIK-FS, each feature of the samples is characterized by an indefinite kernel, and the global optimization problem thus becomes non-convex. In order to avoid suffering from the dual gap, we focus on the primal problem of IKSVM, whose objective function can be formulated as an unconstrained optimization problem

    min_{f∈K,b}  λ⟨f, f⟩_K + ∑_{i=1}^{n} L(y_i, f(x_i) + b)        (4)

where K represents a Reproducing Kernel Kreĭn Space (RKKS) and L is a loss function. According to [25], the Representer Theorem still holds in an RKKS and the solution to Eq. (4) can be expressed as

    f* = ∑_{i=1}^{n} β_i k(x_i, ·)        (5)

where β_i ∈ R and k(·, ·) is a kernel function in the RKKS whose corresponding kernel matrix can be indefinite. Combining Eqs. (4) and (5), the kernelized primal form of IKSVM can be formulated as

    min_{β,b}  λ ∑_{i,j=1}^{n} β_i β_j k(x_i, x_j) + ∑_{i=1}^{n} L(y_i, ∑_{j=1}^{n} β_j k(x_i, x_j) + b)        (6)

We further embed the indefinite kernel combination of Eq. (1) into Eq. (6) to obtain our MIK-FS. An l1-norm regularizer is added to get sparse kernel combination coefficients. The corresponding model of MIK-FS is

    min_{β,b,d}  λ1 β^T K β + λ2 ∥d∥_1 + ∑_{i=1}^{n} L(y_i, K_i β + b)        (7)
    s.t.  d_m ≥ 0,  m = 1, . . . , M

where K = ∑_{m=1}^{M} d_m K_m is an indefinite kernel matrix derived from the associated kernel function, K_{ij} = k(x_i, x_j) as described in Eq. (1), and K_m is derived from the kernel function k_m. K_i represents the ith row of K. Furthermore, in order to ensure that the MIK-FS model is continuously differentiable, we select the smooth quadratic hinge loss as L(·), and the optimization problem becomes

    min_{β,b,d}  λ1 β^T K β + λ2 ∥d∥_1 + ∑_{i=1}^{n} (max(0, 1 − y_i(K_i β + b)))^2        (8)
    s.t.  d_m ≥ 0,  m = 1, . . . , M

In Eq. (8), the first term is the regularizer related to the hypothesis f. In particular, it is worth noting that β is unconstrained, which is different from the Lagrange multiplier α in Eq. (3). The second term is the l1-norm regularizer related to d. If a coefficient d_m equals zero, the corresponding feature has no effect on the classification and can be discarded.

Compared to traditional MKL-FS methods, the proposed MIK-FS has two advantages. Firstly, the kernels used in MIK-FS can involve both PD and indefinite kernels; consequently, MIK-FS is a broader method that exploits more generalized kernels. Secondly, we construct MIK-FS on the primal form of IKSVM and thus avoid the dual gap in the non-convex optimization problem effectively.
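As a small illustration of how the model is evaluated — a NumPy sketch under our own conventions, not the authors' code — the function below computes the value of the objective in Eq. (8) for a given (β, b, d); the commented last line applies the thresholding rule used later in the experiments, where features with d_m below 10^{-5} are discarded.

```python
import numpy as np

def mikfs_objective(beta, b, d, Ks, y, lam1, lam2):
    # Eq. (8): lam1 * beta^T K beta + lam2 * ||d||_1 + squared hinge loss,
    # with K = sum_m d_m K_m built from the per-feature base kernel matrices Ks (M x n x n).
    K = np.tensordot(d, Ks, axes=1)
    margins = y * (K @ beta + b)                          # y_i (K_i beta + b)
    loss = np.sum(np.maximum(0.0, 1.0 - margins) ** 2)    # smooth quadratic hinge
    return lam1 * beta @ K @ beta + lam2 * np.sum(np.abs(d)) + loss

# After training, the sparsity of d performs the feature selection, e.g.
# selected = np.flatnonzero(d > 1e-5)
```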
4. Optimization algorithm

We adopt a two-stage algorithm to optimize the coefficients (β, b) of IKSVM and the coefficients d of the kernel combination alternately. The complete algorithm is described in Algorithm 1. We will introduce the two stages in detail in the following subsections.

Algorithm 1 MIK-FS
Inputs:
    T: the maximum number of iterations
    λ1, λ2: the regularization parameters
    ϵ: the tolerance value for convergence
Process:
    1: set t = 0, initialize d^t, obj^t;
    2: while (t < T) do
    3:    set t = t + 1;
    4:    fix d^{t−1} and solve Eq. (8) to obtain (β^t, b^t);
    5:    fix (β^t, b^t) and solve Eq. (8) to obtain a solution d^t;
    6:    calculate the value of the objective function obj^t;
    7:    if |obj^t − obj^{t−1}| ≤ ϵ then
    8:        MIK-FS converges and break;
    9:    end if
   10: end while
   11: return β^t, b^t, d^t;

4.1. DC programming

DC programming and DC algorithms (DCA) are powerful optimization tools for smooth/nonsmooth nonconvex programming [26]. A standard DC programming is of the form

    inf{ f(ω) := g(ω) − h(ω) : ω ∈ R^n }        (9)

where Eq. (9) is a non-convex problem and the objective f can be decomposed into the subtraction of two convex lower semicontinuous functions g and h, which take values in R ∪ {+∞}.
The equivalent conjugate dual problem of Eq. (9) can be formulated as

    inf{ f*(ω̄) = h*(ω̄) − g*(ω̄) : ω̄ ∈ R^n }        (10)

where g*(ω̄) = sup{⟨ω, ω̄⟩ − g(ω) : ω ∈ R^n} is the conjugate function of g and h* is the conjugate function of h. The relationship between the variables ω and ω̄ is

    ω̄ ∈ ∂h(ω),   ω ∈ ∂g*(ω̄)        (11)

where ∂h and ∂g* are the subgradients of the functions h and g*, respectively. DCA for solving DC programming is based on local optimality conditions and the duality in DC programming. The main idea is to construct two sequences {ω^k} and {ω̄^k}, which are candidates for the primal and dual solutions respectively. According to Eq. (11), DCA approximates the functions h and g* by their affine minorizations at the points ω^k and ω̄^k respectively

    h(ω) ≈ h(ω^k) + ⟨ω − ω^k, ω̄^k⟩
    g*(ω̄) ≈ g*(ω̄^k) + ⟨ω̄ − ω̄^k, ω^{k+1}⟩        (12)

where ω̄^k ∈ ∂h(ω^k), ω^{k+1} ∈ ∂g*(ω̄^k) and

    ω̄^k ∈ ∂h(ω^k) = arg min{ h*(ω̄) − [g*(ω̄^{k−1}) + ⟨ω̄ − ω̄^{k−1}, ω^k⟩] : ω̄ ∈ R^n }
    ω^{k+1} ∈ ∂g*(ω̄^k) = arg min{ g(ω) − [h(ω^k) + ⟨ω − ω^k, ω̄^k⟩] : ω ∈ R^n }        (13)

In Eq. (13), the construction of {ω^k} and that of {ω̄^k} proceed alternately until the corresponding sequence converges. Furthermore, in order to simplify the solving process [26], {ω̄} can be computed directly as ω̄^k ∈ ∂h(ω^k). As a result, the two sequences {ω} and {ω̄} can be constructed as follows

    ω̄^k ∈ ∂h(ω^k)
    ω^{k+1} = arg min{ g(ω) − ⟨ω, ω̄^k⟩ }        (14)
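To make the update scheme of Eq. (14) concrete, the following toy sketch (our own illustration, not the paper's model) applies DCA to the simple DC function f(ω) = ∥ω∥^2 − ∥ω∥_1 with g(ω) = ∥ω∥^2 and h(ω) = ∥ω∥_1, for which both DCA steps have closed forms.

```python
import numpy as np

# Toy DC program: f(w) = ||w||^2 - ||w||_1, with g(w) = ||w||^2 and h(w) = ||w||_1 both convex.
g = lambda w: w @ w
h = lambda w: np.sum(np.abs(w))

w = np.array([3.0, -2.0, 0.7])
for k in range(20):
    w_bar = np.sign(w)                # w_bar^k lies in the subdifferential of h at w^k
    w_new = w_bar / 2.0               # argmin_w g(w) - <w, w_bar^k> has a closed form here
    if np.linalg.norm(w_new - w) < 1e-8:
        break
    w = w_new

print(w, g(w) - h(w))                 # coordinates converge to +-1/2, a local minimum of f
```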
4.2. Solving (β, b)

Firstly, we denote the objective function of MIK-FS as

    F(β, b, d) = λ1 β^T K β + λ2 ∥d∥_1 + ∑_{i=1}^{n} (max(0, 1 − y_i(K_i β + b)))^2        (15)

When the coefficients d are fixed, Eq. (15) degenerates into a non-convex problem of IKSVM with a single indefinite kernel. That is

    min_{β,b}  f(β, b) = λ1 β^T K β + L + Constant_d        (16)

where L = ∑_{i=1}^{n} (max(0, 1 − y_i(K_i β + b)))^2 and Constant_d = λ2 ∥d∥_1 is a constant. According to [18], the non-convex problem of Eq. (16) can be reformulated as a DC programming [26,27] equivalently due to the favorable property of the spectra of indefinite kernel matrices. Concretely, the objective function can be decomposed as

    f(β, b) = g(β, b) − h(β, b)  with
    g(β, b) = λ1 β^T(ρI)β + L + Constant_d
    h(β, b) = λ1 β^T(ρI − K)β        (17)

where the positive number ρ satisfies the condition ρ ≥ η and the number η is the maximum eigenvalue of the kernel matrix K. As a result, the functions g(β, b) and h(β, b) can both be guaranteed to be convex. The detailed steps for solving (β, b) in Eq. (16) are described in Algorithm 2.

Algorithm 2 Primal IKSVM
Inputs:
    T1: the maximum number of iterations
    ϵ: the tolerance value for convergence
    d: the kernel coefficients
    K_1, . . . , K_M: kernel matrices obtained from the features
    δ: the offset for the DC decomposition
Process:
    1: calculate K = ∑_{m=1}^{M} d_m K_m;
    2: calculate η, the maximum eigenvalue of K;
    3: set ρ = |η| + δ;
    4: set k = 0, initialize β^k;
    5: while (k < T1) do
    6:    calculate β̄^k = λ1(ρI − K)β^k;
    7:    calculate (β^{k+1}, b^{k+1}) = arg min{ g(β, b) − ⟨β, β̄^k⟩ };
    8:    set α^{k+1} = (β^{k+1}, b^{k+1}) and α^k = (β^k, b^k);
    9:    if ∥α^{k+1} − α^k∥_2^2 ≤ ϵ then
   10:        the algorithm converges and break the loop;
   11:    end if
   12:    set k = k + 1;
   13: end while
   14: return β^k, b^k;

Algorithm 2 firstly integrates the kernel matrices K_m into an additive combination with coefficients d and performs eigenvalue decomposition on K to find the maximum eigenvalue for the DC decomposition (Steps 1–3). Within the loop, Algorithm 2 calculates the solution of the conjugate dual problem (Step 6) and then solves the primal problem by approximating h(β, b) with its affine minorization (Step 7). The difference between the two points obtained in adjacent iterations is calculated for measuring convergence (Steps 8–11).
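A compact NumPy/SciPy sketch of the DC iteration of Algorithm 2 is given below. It mirrors the listed steps under our own simplifications: Algorithm 2 leaves the inner convex solver of Step 7 unspecified, so a generic quasi-Newton routine is used here, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def solve_beta_b(Ks, d, y, lam1, delta=1e-3, T1=50, tol=1e-6):
    # Sketch of Algorithm 2: DC iteration for the primal IKSVM with a fixed d.
    K = np.tensordot(d, Ks, axes=1)                      # Step 1: K = sum_m d_m K_m
    eta = np.linalg.eigvalsh((K + K.T) / 2).max()        # Step 2: maximum eigenvalue of K
    rho = abs(eta) + delta                               # Step 3: rho = |eta| + delta
    n = K.shape[0]
    beta, b = np.zeros(n), 0.0                           # Step 4: initialization

    def surrogate(z, beta_bar):
        # g(beta, b) - <beta, beta_bar>: the convex problem solved in Step 7.
        beta_, b_ = z[:n], z[n]
        hinge = np.maximum(0.0, 1.0 - y * (K @ beta_ + b_))
        return lam1 * rho * (beta_ @ beta_) + np.sum(hinge ** 2) - beta_ @ beta_bar

    for _ in range(T1):                                  # Step 5
        beta_bar = lam1 * (rho * np.eye(n) - K) @ beta   # Step 6
        res = minimize(surrogate, np.concatenate([beta, [b]]),
                       args=(beta_bar,), method="L-BFGS-B")   # Step 7
        beta_new, b_new = res.x[:n], res.x[n]
        converged = np.sum((beta_new - beta) ** 2) + (b_new - b) ** 2 <= tol  # Steps 8-11
        beta, b = beta_new, b_new
        if converged:
            break
    return beta, b                                       # Step 14
```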
4.3. Solving d

When (β, b) is fixed, Eq. (8) can be reformulated as

    min_{d}  d^T γ + ∑_{i=1}^{n} (max(0, 1 − y_i(θ_i d + b)))^2        (18)
    s.t.  d_m ≥ 0,  m = 1, . . . , M
where γ = [λ1 β^T K_1 β + λ2, . . . , λ1 β^T K_M β + λ2]^T, θ = [K_1 β, . . . , K_M β] and θ_i represents the ith row of θ. Eq. (18) can be solved by projected gradient descent (PGD). The gradient of d at d_m is

    ∇_{d_m} = γ_m + 2 ∑_{i∈SV} (1 − y_i(K_i β + b)) (−y_i K_m^i β)        (19)

where SV = {x_i ∈ X | y_i(K_i β + b) ≤ 1} is the set of support vectors in the current iteration. The PGD method for solving d is summarized in Algorithm 3.

Algorithm 3 PGD
Inputs:
    T2: the maximum number of iterations
    ϵ: the tolerance value for convergence
Process:
    1: set k = 0, initialize d^k;
    2: while (k < T2 and ∥∇d^k∥_2^2 ≥ ϵ) do
    3:    calculate ∇d^k according to Eq. (19);
    4:    calculate the step size α by the Armijo rule;
    5:    set d^{k+1} = d^k − α · ∇d^k;
    6:    set d^{k+1} = max(d^{k+1}, 0);
    7:    set k = k + 1;
    8: end while
    9: return d^k;

Algorithm 3 calculates the gradient of Eq. (18) and the step size α based on the Armijo rule (Steps 3–4). After that, d is updated and projected to the feasible set (Steps 5–6).
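The following NumPy sketch illustrates Algorithm 3 under our own simplifications: a plain backtracking line search stands in for the Armijo rule, and the initialization and tolerances are arbitrary.

```python
import numpy as np

def solve_d(Ks, beta, b, y, lam1, lam2, T2=100, tol=1e-6):
    # Sketch of Algorithm 3: projected gradient descent for d in Eq. (18).
    M = Ks.shape[0]
    theta = np.stack([Km @ beta for Km in Ks], axis=1)        # theta[i, m] = K_m^i beta
    gamma = np.array([lam1 * beta @ Km @ beta + lam2 for Km in Ks])

    def objective(d):
        hinge = np.maximum(0.0, 1.0 - y * (theta @ d + b))
        return d @ gamma + np.sum(hinge ** 2)

    def gradient(d):                                          # Eq. (19)
        slack = 1.0 - y * (theta @ d + b)
        sv = slack > 0.0                                      # support vectors of this iteration
        return gamma - 2.0 * theta[sv].T @ (slack[sv] * y[sv])

    d = np.ones(M)
    for _ in range(T2):
        grad = gradient(d)
        if grad @ grad < tol:
            break
        alpha, f0 = 1.0, objective(d)
        d_new = np.maximum(d - alpha * grad, 0.0)             # gradient step + projection onto d >= 0
        while objective(d_new) > f0 and alpha > 1e-10:        # shrink the step until no increase
            alpha *= 0.5
            d_new = np.maximum(d - alpha * grad, 0.0)
        d = d_new
    return d
```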
4.4. Convergence analysis

We further present a theoretical analysis of the convergence of MIK-FS. As mentioned above, MIK-FS is a two-stage algorithm, and thus we analyze the convergence of the two stages respectively.

First of all, when the kernel combination coefficients d are fixed, MIK-FS can be formulated as a primal IKSVM problem. We solve it by DC programming, which can guarantee that the objective function of IKSVM is monotonically decreasing [18].

Proposition 1. For the sequence {α^k = (β^k, b^k)}, we have

    (g − h)(β^k, b^k) − (g − h)(β^{k+1}, b^{k+1}) ≥ τ ∥d(α)∥^2,

where the equality holds if and only if τ ∥d(α)∥^2 = 0 and τ is a positive parameter that makes the functions g and h strongly convex.

Furthermore, the local minimum would be obtained when ∥d(α)∥^2 = 0 is satisfied.

Then, when the coefficients (β, b) of IKSVM are fixed, MIK-FS is transformed into a convex problem for solving the kernel combination coefficients. Thus, we can obtain the following proposition for the whole MIK-FS algorithm.

Proposition 2. For the sequence {(β^k, b^k, d^k)}, we have

    F(β^{k+1}, b^{k+1}, d^{k+1}) ≤ F(β^k, b^k, d^k),

that is, the objective function F(β^k, b^k, d^k) is strictly monotonically decreasing along the solution sequence.

When F(β^k, b^k, d^k) = F(β^{k+1}, b^{k+1}, d^{k+1}) holds, the MIK-FS algorithm converges to a stationary point.

As stated in Algorithm 1, MIK-FS is solved iteratively. Calculating the SVM coefficients (β^t, b^t) in each iteration (Step 4) takes O(n^2 + T1·n^2), where n denotes the number of samples and T1 denotes the number of iterations in Algorithm 2. Similarly, calculating the kernel combination coefficients d^t (Step 5) takes O(T2·n^2·m), where m denotes the number of features and T2 denotes the number of iterations in Algorithm 3. In summary, the total complexity is O(T·(T1·n^2 + T2·n^2·m)), where T is the maximum number of iterations of Algorithm 1. Since max(T, T1, T2) ≪ m and n, the time complexity of the MIK-FS algorithm is O(n^2·m).
5. Leverage score sampling

Generally, kernel methods are time-consuming when the number of samples becomes large. Sampling-based methods provide a powerful alternative for reducing the size of the kernel matrix while preserving most of the information. According to [28], the leverage scores of a matrix X ∈ R^{m×n} are defined as follows.

Definition 2 (Leverage Scores). Let V ∈ R^{n×k} contain the top k right singular vectors of an m × n matrix X. Then, the leverage score of the ith column of X is defined as

    l_i = ∥V^i∥_2^2,   i = 1, 2, . . . , n        (20)

where V^i denotes the ith row of V.

The columns of the matrix X can then be sorted and selected according to their leverage scores. The detailed sampling steps are summarized in Algorithm 4. Step 1 obtains the top-k right singular vectors of the input data matrix by computing the singular value decomposition (SVD) of X. Step 2 calculates the leverage scores of the columns of X. Step 3 selects the columns that have the largest c leverage scores.

Algorithm 4 LeverageSampling
Inputs:
    X ∈ R^{M×n}: input data matrix, every column is a sample
    k: the number of singular vectors to be selected
    c: the number of samples to be selected
Process:
    1: compute V, the top-k right singular vectors of X;
    2: calculate the leverage scores l according to Eq. (20);
    3: obtain the c columns of X that correspond to the largest c leverage scores;
    4: return X_c ∈ R^{M×c};

Let {(x'_i, y'_i)}_{i=1}^{c} be the new training set selected from the original training set {(x_i, y_i)}_{i=1}^{n}. Then the reduced kernel matrix K' ∈ R^{c×c} is constructed as follows

    K'_{ij} = k(x'_i, x'_j) = ∑_{m=1}^{M} d_m k_m(x'_{i,m}, x'_{j,m}),   d_m ≥ 0

where x'_{i,m} denotes the mth feature of x'_i and d_m represents the coefficient of the kernel k_m. Therefore, the corresponding optimization problem becomes

    min_{β,b,d}  λ1 β^T K' β + λ2 ∥d∥_1 + ∑_{i=1}^{c} (max(0, 1 − y'_i(K'_i β + b)))^2        (21)
    s.t.  d_m ≥ 0,  m = 1, . . . , M

Then we can solve Eq. (21) by Algorithm 1 as before.
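A minimal NumPy sketch of Algorithm 4 is shown below (our own illustration; the toy matrix sizes are arbitrary, while k = 10 and c = 200 mirror the values used in the large-scale experiments of Section 7.3).

```python
import numpy as np

def leverage_sampling(X, k, c):
    # Sketch of Algorithm 4: pick c columns (samples) of X by their leverage scores.
    # X is M x n with one sample per column, as in the paper.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T                                   # n x k: top-k right singular vectors
    scores = np.sum(V ** 2, axis=1)                # Eq. (20): l_i = ||V^i||_2^2
    idx = np.argsort(scores)[::-1][:c]             # columns with the largest c scores
    return X[:, idx], idx

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))                   # 500 features, 1000 samples
X_c, idx = leverage_sampling(X, k=10, c=200)
print(X_c.shape)                                   # (500, 200)
```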
6. Multi-class MIK-FS (MMIK-FS)

In this section, we extend the binary MIK-FS model to solve multi-class problems. The proposed MMIK-FS can perform nonlinear feature selection effectively and select features uniformly for all classes.
Table 1
Small-scale binary datasets.

Datasets (#Class)              Abbreviation   #Sample   #Feature
ALLAML (2)                     ALL.           72        7129
Colon (2)                      Col.           62        2000
Gli_85 (2)                     Gli.           85        22283
Prostate_GE (2)                Pro.           102       5966
Central_Nervous_System (2)     CNS            60        7129
Lung_Cancer (2)                Lun.           181       12533
Dbworld (2)                    Dbw.           64        4702
Isolet (2)                     Iso.           120       617
Glioma (2)                     Gli.           21        4434
Carcinom (2)                   Car.           34        9182

Table 2
Large-scale binary datasets.

Datasets (#Class)   Abbreviation   #Sample   #Feature
Ada (2)             Ada            4147      48
A5a (2)             A5a            6414      123
Mushrooms (2)       Mus.           8124      112
Ijcnn1 (2)          Ijc.           49990     22
German (2)          Ger.           1000      24
Phishing (2)        Phi.           11055     68
Satimage (2)        Sat.           4435      36
Sensitvehicle (2)   Sen.           78823     100
Given a training set {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^M and y_i ∈ {1, 2, . . . , c}, the solution to Eq. (4) for the multi-class problem can be expressed as

    f* = ∑_{j=1}^{c} ∑_{i=1}^{n} B_{ji} k(x_i, ·)        (22)

where B ∈ R^{c×n} and c is the number of classes. Furthermore, the multi-class loss function in Eq. (4) can be constructed as follows

    L(·) = ∑_{i=1}^{n} ∑_{j≠c_i} max(0, B^j K^i − B^{c_i} K^i + 1)        (23)

where B^j and B^{c_i} represent the jth row and the c_i th row of the matrix B, respectively. As a result, MMIK-FS can be obtained by combining Eqs. (22) and (23) into Eq. (7), and we have

    min_{B,d}  λ1 tr(BKB^T) + λ2 ∥d∥_1 + ∑_{i=1}^{n} ∑_{j≠c_i} max(0, B^j K^i − B^{c_i} K^i + 1)        (24)
    s.t.  d_m ≥ 0,  m = 1, . . . , M
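A minimal NumPy sketch of the multi-class loss of Eq. (23) and the MMIK-FS objective of Eq. (24) is given below; it is our own illustration (labels are assumed to be 0-based integers, and all names are hypothetical), not the authors' implementation.

```python
import numpy as np

def mmikfs_objective(B, d, Ks, y, lam1, lam2):
    # Eq. (24): lam1 * tr(B K B^T) + lam2 * ||d||_1 + multi-class hinge loss of Eq. (23).
    # B is c x n, y holds 0-based integer class labels, Ks stacks the M base kernel matrices.
    K = np.tensordot(d, Ks, axes=1)                 # K = sum_m d_m K_m
    n = K.shape[0]
    scores = B @ K                                  # scores[j, i] = B^j K^i
    true_scores = scores[y, np.arange(n)]           # B^{c_i} K^i for every sample i
    margins = scores - true_scores + 1.0            # B^j K^i - B^{c_i} K^i + 1
    margins[y, np.arange(n)] = 0.0                  # exclude the j = c_i terms
    loss = np.sum(np.maximum(0.0, margins))
    return lam1 * np.trace(B @ K @ B.T) + lam2 * np.sum(np.abs(d)) + loss
```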
In order to solve this non-convex problem, we also adopt a two-stage algorithm to optimize the coefficients B of the multi-class IKSVM and the coefficients d of the kernel combination alternately. Specifically, when the coefficients d are fixed, Eq. (24) degenerates into a multi-class IKSVM with a single indefinite kernel, that is

    min_{B}  f(B) = λ1 tr(BKB^T) + L + Constant_d        (25)

where L = ∑_{i=1}^{n} ∑_{j≠c_i} max(0, B^j K^i − B^{c_i} K^i + 1) and Constant_d = λ2 ∥d∥_1 is a constant. Then Eq. (25) can be solved by DC programming with the following DC decomposition

    f(B) = g(B) − h(B)  with
    g(B) = λ1 tr(B(ρI)B^T) + L + Constant_d
    h(B) = λ1 tr(B(ρI − K)B^T)        (26)

where the positive number ρ satisfies the condition ρ ≥ η and the number η is the maximum eigenvalue of the kernel matrix K. As a result, the functions g(B) and h(B) can both be guaranteed to be convex.

When the coefficients B are fixed, Eq. (24) degenerates into a convex problem in the variable d and can be solved with off-the-shelf convex optimization methods. Similar to MIK-FS, a feature is considered to have no effect on the learning task and can be discarded if the corresponding coefficient d_m equals zero.

Table 3
Multi-class datasets.

Datasets (#Class)    Abbreviation   #Sample   #Feature
Yale (15)            Yal.           165       1024
CLL_SUB_111 (3)      CLL.           111       11340
orlraws10P (10)      Orl.           100       10304
pixraw10P (10)       Pix.           100       10000
TOX_171 (4)          TOX.           171       5748
warpAR10P (10)       WaA.           130       2400
lung (5)             Lun.           203       3312
lung_discrete (7)    Ldi.           73        325
lymphoma (9)         Lym.           96        4026
nci9 (9)             Nci.           60        9712
warpPIE10P (10)      WaP.           210       2420
7. Experiments

In this section, we experimentally evaluate the performance of the proposed algorithms MIK-FS and MMIK-FS by comparing them to some related state-of-the-art algorithms. We first introduce the datasets used in our experiments, then present the experimental design, and finally report the experimental results.

7.1. Datasets

In the experiments, we select twenty-nine datasets from five different repositories: (a) eighteen datasets from a feature selection repository,1 namely ALLAML, Colon, Gli_85, Prostate_GE, Isolet, Glioma, Carcinom, orlraws10P, pixraw10P, warpAR10P, warpPIE10P, Yale, CLL_SUB_111, lung, lung_discrete, lymphoma, nci9 and TOX_171; (b) two datasets from an online repository2 of high-dimensional biomedical datasets, namely Central_Nervous_System and Lung_Cancer; (c) one dataset, Dbworld, from the UCI Machine Learning Repository; (d) one dataset, German, from the WCCI 2006 Workshop on Model Selection; (e) seven datasets from the LIBSVM repository,3 namely Ada, A5a, Mushrooms, Ijcnn1, Phishing, Satimage and Sensitvehicle. Tables 1–3 list a brief description of these datasets, including the number of samples, the number of classes in each dataset and the number of features in each sample. In Tables 1 and 3, the numbers of features in the datasets are all very large, so there are likely to be high redundancies among these features. Table 1 is used for feature selection on binary problems and Table 3 is used for feature selection on multi-class problems. The datasets in Table 2 have very large numbers of samples and are used for feature selection on large-scale binary problems.

We randomly divide the samples into two non-overlapping training and testing sets, each containing almost half of the samples in each class. The process is repeated ten times to generate ten independent runs for each dataset, and the average results are reported. Since the three datasets Isolet, Glioma and Carcinom are designed for multi-class problems, we choose the first two classes. For the datasets Satimage and Sensitvehicle, we choose half of the classes as positive samples and the remaining ones as negative samples.

1 http://featureselection.asu.edu/datasets.php.
2 http://datam.i2r.a-star.edu.sg/datasets/krbd/.
3 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Fig. 1. Classification accuracies versus the number of selected features of five embedded feature selection algorithms.
7.2. Feature selection for small-scale binary problems

We compare the proposed MIK-FS with related state-of-the-art algorithms, including l1-SVM [29], ElasticNet-SVM [30], RFMKL [7], GMKL [5], IKPCA [21], Sparse SVM (SSVM) [8] and SSVM with polynomial kernel feature mapping (SSVM-P) [9]. In our experiments, we choose the indefinite sigmoid kernel k = tanh(a · x_i^T x_j − r) as the base feature kernel in MIK-FS. The Gaussian kernel is used for RFMKL and GMKL. For all the datasets, the regularization parameters λ1, λ2 and the kernel parameters a, r are chosen from the set {10^{−2}, 10^{−1}, 1, 10}. A feature is discarded if the corresponding kernel combination coefficient d_i is less than a small threshold, which is 10^{−5} in our experiments.
Table 4 lists the average classification accuracies and the corresponding numbers of selected features of each compared algorithm on the ten datasets. The best results are highlighted in bold. As shown in Table 4, l1-SVM is simple and fast, but it performs poorly compared to the other algorithms. ElasticNet-SVM outperforms l1-SVM in terms of classification accuracy but tends to select too many features. GMKL achieves results similar to RFMKL on most datasets, except that it performs worse than RFMKL on Gli_85, CNS and Lung_Cancer. MIK-FS outperforms GMKL and RFMKL on all datasets. The reduced dimensions in IKPCA are limited by the number of samples, and it performs even worse than l1-SVM on most datasets. SSVM only performs better than MIK-FS on Gli_85 and Dbworld, and SSVM-P tends to perform
Fig. 2. Classification accuracies versus parameters λ1 and λ2 .
poorly. Both SSVM and SSVM-P select more features than MIK-FS on most of the datasets. Overall, our proposed MIK-FS achieves higher accuracies than the other algorithms on most of the datasets while selecting a smaller number of features.

Fig. 1 shows the classification accuracies corresponding to the specified numbers of features for the five embedded feature selection algorithms on the ten datasets. The maximum number of selected features is set to 60, which is enough for most algorithms to reach their highest accuracies on these datasets. As shown in Fig. 1, our algorithm clearly outperforms the other algorithms in terms of the highest classification accuracy, exceeding the others by more than 7% on the datasets Carcinom and Glioma. With the increasing number of selected features, the classification accuracies of MIK-FS rise more quickly than those of the other algorithms on most datasets.

Fig. 2 shows the classification accuracies corresponding to different values of λ1 and λ2 on the ten datasets. As we can see from Fig. 2, MIK-FS is relatively stable on most of the datasets.

Experiments on the convergence of MIK-FS are conducted on the ten datasets. We plot the value of {F(β^k, b^k, d^k) − F(β^{k+1}, b^{k+1}, d^{k+1})} during the iterations of MIK-FS. Fig. 3 shows how this difference in the objective function of MIK-FS changes with the iterations on the ten datasets. Obviously, MIK-FS converges rapidly within 10 iterations on all the datasets except Gli_85.
Table 4
Classification accuracies (mean ± std. deviation) and the number of selected features (mean (#dimension)) of each compared algorithm on ten real-world datasets.

Dataset   l1-SVM             ElasticNet-SVM       RFMKL              GMKL                IKPCA              SSVM                SSVM-P              MIK-FS
ALL.      89.43 ± 3.39 (18)  95.14 ± 4.43 (558)   95.71 ± 1.14 (14)  95.14 ± 3.14 (34)   87.71 ± 4.05 (13)  96.57 ± 2.69 (123)  67.43 ± 2.44 (15)   97.14 ± 1.11 (18)
Col.      82.58 ± 3.87 (13)  88.07 ± 3.54 (87)    85.81 ± 2.60 (11)  85.80 ± 3.66 (20)   77.74 ± 4.33 (16)  78.70 ± 5.67 (58)   69.36 ± 4.32 (56)   87.09 ± 2.51 (17)
Gli.      76.67 ± 6.68 (5)   75.24 ± 10.44 (8)    75.00 ± 4.01 (28)  73.10 ± 2.61 (4)    69.52 ± 0.42 (38)  84.05 ± 3.23 (230)  76.19 ± 2.42 (66)   78.09 ± 3.01 (14)
Pro.      95.49 ± 1.76 (15)  95.10 ± 1.58 (71)    95.88 ± 1.62 (14)  95.88 ± 1.84 (29)   50.98 ± 0.71 (45)  95.10 ± 1.97 (114)  86.86 ± 2.23 (108)  95.88 ± 4.09 (8)
CNS       71.04 ± 6.01 (9)   68.62 ± 8.22 (19)    67.93 ± 4.09 (10)  66.20 ± 1.38 (3)    65.52 ± 0.00 (27)  68.96 ± 2.75 (67)   67.24 ± 4.23 (71)   75.52 ± 4.09 (17)
Lun.      98.34 ± 0.89 (12)  98.89 ± 0.99 (15)    97.78 ± 1.21 (33)  91.44 ± 3.81 (428)  83.33 ± 0.14 (9)   98.45 ± 1.66 (37)   95.33 ± 2.32 (116)  99.89 ± 0.50 (46)
Dbw.      85.16 ± 2.58 (14)  90.00 ± 5.48 (232)   88.71 ± 2.57 (18)  87.74 ± 2.25 (154)  89.68 ± 3.36 (6)   90.96 ± 2.49 (135)  85.48 ± 3.16 (67)   90.00 ± 4.09 (11)
Iso.      98.67 ± 1.24 (11)  99.34 ± 1.33 (55)    99.34 ± 1.33 (12)  99.34 ± 1.33 (15)   99.83 ± 0.50 (19)  99.33 ± 1.33 (75)   90.67 ± 1.33 (74)   100.00 ± 0.00 (9)
Gli.      80.00 ± 5.49 (5)   81.00 ± 4.86 (104)   81.00 ± 6.47 (5)   81.00 ± 6.57 (6)    70.00 ± 3.50 (9)   80.00 ± 5.43 (141)  78.00 ± 3.43 (37)   90.00 ± 2.64 (3)
Car.      79.41 ± 6.34 (4)   80.59 ± 7.43 (488)   81.76 ± 4.54 (7)   84.70 ± 3.43 (16)   76.47 ± 7.43 (15)  81.18 ± 8.65 (39)   81.18 ± 5.32 (23)   91.70 ± 3.21 (11)
Table 5
Classification accuracies (mean ± std. deviation) and the number of selected features (mean (#dimension)) of each compared algorithm on eight real-world datasets.

Dataset   l1-SVM              ElasticNet-SVM     RFMKL               GMKL               MIK-FS
Ada       80.17 ± 1.17 (12)   80.47 ± 1.85 (27)  78.98 ± 2.23 (6)    77.34 ± 2.03 (11)  82.09 ± 1.03 (17)
A5a       76.88 ± 2.02 (3)    77.32 ± 1.45 (23)  77.15 ± 0.90 (11)   77.12 ± 1.34 (22)  80.85 ± 1.45 (24)
Mus.      86.59 ± 1.39 (2)    86.50 ± 1.13 (5)   88.57 ± 0.92 (2)    86.84 ± 1.25 (5)   91.66 ± 4.21 (20)
Ijc.      90.34 ± 0.14 (2)    86.35 ± 5.19 (10)  90.56 ± 0.18 (6)    90.49 ± 0.97 (7)   91.16 ± 0.24 (12)
Ger.      70.66 ± 0.59 (8)    70.90 ± 5.50 (20)  73.32 ± 4.11 (21)   71.32 ± 0.57 (13)  74.78 ± 1.14 (17)
Phi.      87.57 ± 1.79 (2)    90.75 ± 8.31 (19)  90.42 ± 1.11 (9)    90.66 ± 1.38 (18)  91.57 ± 0.40 (20)
Sat.      75.10 ± 2.50 (11)   74.49 ± 1.59 (31)  86.38 ± 2.02 (4)    84.32 ± 3.59 (9)   88.50 ± 1.05 (11)
Sen.      53.78 ± 11.44 (4)   57.39 ± 1.80 (20)  56.82 ± 11.30 (5)   54.84 ± 5.76 (10)  83.73 ± 0.78 (21)
Fig. 3. Convergence of MIK-FS on ten datasets.

7.3. Feature selection for large-scale binary problems

For the experiments on large-scale problems, we compare the proposed MIK-FS with l1-SVM [29], ElasticNet-SVM [30], RFMKL [7] and GMKL [5]. The parameter settings are the same as in the feature selection experiments for small-scale binary problems. Firstly, we conduct sampling on the training datasets to select landmark points by Algorithm 4. After that, we use the selected landmark points for training and the full testing sets for testing. Specifically, we fix the number of right singular vectors k as 10 and the number of sampled points c as 200 in the experiments.

Table 5 lists the average classification accuracies and the corresponding numbers of selected features of each compared algorithm with leverage score sampling. The best results are highlighted in bold. As shown in Table 5, our proposed MIK-FS achieves higher accuracies than the other algorithms on all the datasets, although sampling may lose some information of the original datasets. Furthermore, MIK-FS can select features effectively since it reduces the dimension of the features sharply. Among the four compared algorithms, l1-SVM and ElasticNet-SVM are inferior to GMKL and RFMKL except on Ada. GMKL achieves results similar to RFMKL and is inferior to MIK-FS, with a similar number of features selected on most datasets.

Fig. 4 shows the classification accuracies corresponding to different values of the number of right singular vectors k and the number of sampled points c on several datasets. As we can see from Fig. 4, MIK-FS is stable on most of the datasets. As a result, sampling can be an effective way for MIK-FS to deal with large-scale datasets.

7.4. Feature selection for multi-class problems
When positive definite kernels are used in Eq. (1), Eq. (24) degenerates into a primal-form extension of the MKL-FS method for multi-class problems. This multi-class MKL-FS (MMKL-FS) can be reformulated as a biconvex optimization problem [31], and we compare this method with MMIK-FS. We also compare our MMIK-FS with two other embedded methods for feature selection, MSVM21 [23] and l1-SVM [29]. l1-SVM [29] is designed for binary problems, so we adopt the one-vs-all strategy with all the features for multi-class problems. For the other embedded methods, the number of selected features is restricted to be smaller than 200. Similar to the binary problems, the Gaussian kernel is used for MMKL-FS and the sigmoid kernel is used for MMIK-FS.

To further evaluate the quality of the selected features, we then compare our MMIK-FS to seven state-of-the-art filter methods for feature selection, including cife [32], trace_ratio [33], MSVM [24] and four other methods whose details are given in [34]. We select 200 features of each dataset according to their feature weights and evaluate them with the linear SVM classifier using the one-vs-all strategy. The parameter C in SVM is selected from the set {10^{−6}, 10^{−5}, . . . , 10^{2}}.

Table 6 lists the average classification accuracies and the corresponding numbers of selected features of each embedded algorithm for multi-class classification. The best results are highlighted in bold. As shown in Table 6, the two kernel methods clearly outperform the two linear SVM methods in terms of classification accuracy. Furthermore, our proposed MMIK-FS achieves higher accuracies than MMKL-FS on most of the datasets while selecting a small number of features.

Table 7 lists the average classification accuracies with the top 200 features of each filter algorithm for multi-class classification. The best results are highlighted in bold. As shown in Table 7, our proposed MMIK-FS achieves higher accuracies than the other algorithms on most of the datasets, exceeding the others by nearly 4.5% on the datasets orlraws10P, pixraw10P and warpAR10P. Furthermore, MMIK-FS performs stably on all the datasets, while the filter feature selection methods tend to perform badly on some datasets.
Fig. 4. Classification accuracies versus k and c.

Table 6
Classification accuracies (mean ± std. deviation) and the number of selected features (mean (#dimension)) of each embedded algorithm on real-world datasets.

Dataset   l1-SVM          MSVM21                MMKL-FS               MMIK-FS
Yal.      65.87 ± 4.45    69.33 ± 4.78 (122)    72.40 ± 2.67 (196)    72.53 ± 1.07 (182)
CLL.      64.81 ± 1.75    62.77 ± 1.56 (84)     71.67 ± 2.55 (56)     68.33 ± 3.73 (50)
Lun.      95.90 ± 6.83    94.40 ± 5.48 (45)     96.50 ± 0.92 (61)     96.60 ± 0.32 (108)
Ldi.      85.00 ± 4.66    81.47 ± 4.28 (39)     88.83 ± 3.77 (93)     89.41 ± 2.17 (101)
Lym.      91.91 ± 2.82    86.60 ± 4.38 (79)     93.40 ± 1.70 (108)    94.89 ± 2.10 (186)
Nci.      53.92 ± 5.40    48.57 ± 5.57 (57)     61.07 ± 5.10 (118)    62.14 ± 5.55 (115)
Orl.      89.80 ± 1.26    91.00 ± 1.11 (93)     90.25 ± 5.60 (90)     91.20 ± 3.23 (80)
Pix.      92.20 ± 2.86    95.60 ± 2.40 (79)     96.40 ± 1.80 (75)     97.40 ± 1.32 (110)
TOX.      81.42 ± 4.04    82.02 ± 2.65 (51)     84.76 ± 3.74 (111)    86.93 ± 4.46 (88)
WaA.      88.00 ± 2.47    90.17 ± 3.57 (90)     93.33 ± 4.46 (118)    92.67 ± 4.26 (111)
WaP.      98.50 ± 2.89    98.60 ± 4.02 (85)     96.80 ± 0.42 (158)    99.70 ± 0.67 (113)
Table 7
Classification accuracies (mean ± std. deviation) of each filter algorithm with the top 200 features.

Dataset   cife           trace_ratio    fisher_score   gini_index     mrmr           reliefF        MSVM           MMIK-FS
Yal.      63.60 ± 1.79   73.07 ± 0.60   72.93 ± 0.43   66.13 ± 0.51   68.80 ± 1.14   70.53 ± 1.85   63.87 ± 1.10   73.87 ± 0.04
CLL.      63.89 ± 1.45   60.56 ± 0.79   60.37 ± 1.88   68.33 ± 0.67   64.07 ± 1.62   58.33 ± 1.71   67.41 ± 1.62   66.48 ± 1.45
Lun.      93.90 ± 0.78   93.90 ± 0.23   93.90 ± 0.30   95.20 ± 0.25   93.80 ± 0.22   95.20 ± 0.08   95.70 ± 0.20   96.00 ± 0.44
Ldi.      90.59 ± 0.70   87.94 ± 0.55   87.94 ± 0.52   87.94 ± 1.67   89.12 ± 0.60   88.53 ± 0.83   91.18 ± 0.33   88.82 ± 0.63
Lym.      75.74 ± 1.45   87.23 ± 1.04   87.23 ± 0.64   79.79 ± 1.05   90.85 ± 0.45   90.00 ± 0.49   87.87 ± 0.51   91.57 ± 0.67
Nci.      45.00 ± 1.49   55.36 ± 1.80   61.43 ± 0.88   56.43 ± 2.73   63.93 ± 0.81   54.29 ± 1.41   61.79 ± 0.56   61.07 ± 0.63
Orl.      88.60 ± 0.30   78.80 ± 1.47   77.80 ± 2.16   78.20 ± 0.80   88.00 ± 1.31   69.00 ± 1.62   70.80 ± 2.45   93.00 ± 1.32
Pix.      82.40 ± 0.62   68.60 ± 2.12   73.00 ± 1.84   88.60 ± 0.54   92.40 ± 0.72   82.60 ± 2.10   89.60 ± 0.28   97.00 ± 0.23
TOX.      71.31 ± 1.33   80.24 ± 0.90   80.83 ± 1.31   79.76 ± 1.18   75.60 ± 1.12   86.31 ± 0.57   88.69 ± 0.52   88.21 ± 1.52
WaA.      79.67 ± 1.48   88.33 ± 0.85   88.33 ± 0.69   83.17 ± 1.01   88.17 ± 0.92   82.00 ± 1.40   79.33 ± 1.79   93.17 ± 0.65
WaP.      99.80 ± 0.06   98.10 ± 0.17   98.10 ± 0.19   97.00 ± 0.36   98.90 ± 0.07   98.60 ± 0.23   97.90 ± 0.22   99.80 ± 0.12
Fig. 5. Classification accuracies versus the number of selected features of each filter algorithm.
Fig. 5 shows the classification accuracies corresponding to the specified numbers of features for the multi-class feature selection algorithms on the 11 datasets. The maximum number of selected features is set to 90. As shown in Fig. 5, our algorithm clearly outperforms the other algorithms in terms of classification accuracy as the number of selected features increases on most of the datasets.

8. Conclusion

In this paper, we propose an embedded feature selection method, MIK-FS, based on the primal framework of IKSVM, which applies an indefinite base kernel for each feature and generates sparse kernel combination coefficients via the l1-norm to select features automatically. A two-stage algorithm is accordingly designed to optimize the coefficients of IKSVM and of the kernel
combination alternately. Meanwhile, we adopt a leverage score sampling method to select landmark points for large-scale problems. Furthermore, we extend MIK-FS to MMIK-FS for multi-class problems. Experiments on real-world datasets validate the effectiveness of MIK-FS and MMIK-FS in feature selection and classification. There are several directions for future study:

■ Optimization technique: In this paper, we apply a two-stage algorithm to solve the non-convex optimization problems in the proposed models, which optimizes the coefficients of IKSVM by DCA and the kernel combination by PGD alternately. However, DCA is time-consuming when the number of samples is large, and PGD is computationally expensive when the number of features is large. Therefore, how to accelerate MIK-FS by refining the solving algorithms needs more systematic research.
■ Large-scale problem: In the experiments, we utilize a leverage score sampling method to select landmark points for large-scale problems. However, sampling might lose some information of the original datasets. Consequently, how to extend MIK-FS to solve large-scale problems directly, instead of selecting landmark points, is another interesting topic for future study.
Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876091 and 61375057) and the National Key R&D Program of China (2017YFB1002801). Furthermore, the work was also supported by the Collaborative Innovation Center of Wireless Communications Technology, China.

References

[1] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[2] S. Yan, X. Xu, D. Xu, S. Lin, X. Li, Image classification with densely sampled image windows and generalized adaptive multiple kernel learning, IEEE Trans. Cybern. 45 (3) (2015) 395.
[3] W. Chung, J. Kim, H. Lee, E. Kim, General dimensional multiple-output support vector regressions and their multiple kernel learning, IEEE Trans. Cybern. 45 (11) (2015) 2572.
[4] Z. Xu, R. Jin, J. Ye, M.R. Lyu, I. King, Non-monotonic feature selection, in: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, ACM, 2009, pp. 1145–1152.
[5] M. Varma, B.R. Babu, More generality in efficient multiple kernel learning, in: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, ACM, 2009, pp. 1065–1072.
[6] Z. Chen, J. Li, L. Wei, A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue, Artif. Intell. Med. 41 (2) (2007) 161–175.
[7] A.D. Dileep, C. Sekhar, Representation and feature selection using multiple kernel learning, in: Proceedings of the Twenty-Second International Joint Conference on Neural Networks, IEEE, 2009, pp. 717–722.
[8] M. Tan, L. Wang, I.W. Tsang, Learning sparse SVM for feature selection on very high dimensional datasets, in: Proceedings of the 27th International Conference on Machine Learning, ICML-10, 2010, pp. 1047–1054.
[9] M. Tan, I.W. Tsang, L. Wang, Towards ultrahigh dimensional feature selection for big data, J. Mach. Learn. Res. 15 (1) (2014) 1371–1429.
[10] M. Yamada, W. Jitkrittum, L. Sigal, E.P. Xing, M. Sugiyama, High-dimensional feature selection by feature-wise kernelized lasso, Neural Comput. 26 (1) (2014) 185–207.
[11] S. Maldonado, J. López, Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification, Appl. Soft Comput. 67 (2018) 94–105.
[12] V.H. Moghaddam, J. Hamidzadeh, New Hermite orthogonal polynomial kernel and combined kernels in support vector machine classifier, Pattern Recognit., S0031320316301534.
[13] H. Saigo, J.-P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.
[14] Y. Chen, M.R. Gupta, B. Recht, Learning kernels from indefinite similarities, in: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, ACM, 2009, pp. 145–152.
[15] M. Kowalski, M. Szafranski, L. Ralaivola, Multiple indefinite kernel learning with mixed norm regularization, in: International Conference on Machine Learning, 2009, pp. 545–552.
[16] S. Liwicki, S. Zafeiriou, G. Tzimiropoulos, M. Pantic, Efficient online subspace learning with an indefinite kernel for visual tracking and recognition, IEEE Trans. Neural Netw. Learn. Syst. 23 (10) (2012) 1624–1636.
[17] H. Xue, S. Chen, Discriminality-driven regularization framework for indefinite kernel machine, Neurocomputing 133 (2014) 209–221.
[18] H.-M. Xu, H. Xue, X.-H. Chen, Y.-Y. Wang, Solving indefinite kernel support vector machine with difference of convex functions programming, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] C. Liu, Gabor-based kernel PCA with fractional power polynomial models for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 572–581.
[20] B. Haasdonk, E. Pekalska, Indefinite kernel discriminant analysis, in: Proceedings of the Sixteenth International Conference on Computational Statistics, Springer, 2010, pp. 221–230.
[21] X. Huang, A. Maier, J. Hornegger, J.A. Suykens, Indefinite kernels in least squares support vector machines and principal component analysis, Appl. Comput. Harmon. Anal. (2016).
[22] Q. Xu, X. Zhang, Multiclass feature selection algorithms based on R-SVM, in: Signal and Information Processing (ChinaSIP), 2014 IEEE China Summit & International Conference on, IEEE, 2014, pp. 525–529.
[23] X. Cai, F. Nie, H. Huang, C. Ding, Multi-class l2,1-norm support vector machine, in: Data Mining (ICDM), 2011 IEEE 11th International Conference on, IEEE, 2011, pp. 91–100.
[24] J. Xu, F. Nie, J. Han, Feature selection via scaling factor integrated multi-class support vector machines.
[25] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, 2004, p. 81.
[26] T.P. Dinh, H.A. Le Thi, Recent advances in DC programming and DCA, in: Transactions on Computational Intelligence XIII, Springer, 2014, pp. 1–37.
[27] P.D. Tao, L.T.H. An, Convex analysis approach to D.C. programming: Theory, algorithms and applications, Acta Math. Vietnam. 22 (1) (1997) 289–355.
[28] D. Papailiopoulos, A. Kyrillidis, C. Boutsidis, Provable deterministic leverage score sampling, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 997–1006.
[29] P.S. Bradley, O.L. Mangasarian, Feature selection via concave minimization and support vector machines, in: Proceedings of the Fifteenth International Conference on Machine Learning, vol. 98, 1998, pp. 82–90.
[30] L. Wang, J. Zhu, H. Zou, The doubly regularized support vector machine, Statist. Sinica (2006) 589–615.
[31] J. Gorski, F. Pfeuffer, K. Klamroth, Biconvex sets and optimization with biconvex functions: a survey and extensions, Math. Methods Oper. Res. 66 (3) (2007) 373–407.
[32] D. Lin, X. Tang, Conditional infomax learning: an integrated framework for feature extraction and fusion, in: Computer Vision – ECCV 2006, Springer, 2006, pp. 68–82.
[33] F. Nie, S. Xiang, Y. Jia, C. Zhang, S. Yan, Trace ratio criterion for feature selection, in: AAAI, vol. 2, 2008, pp. 671–676.
[34] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, H. Liu, Advancing feature selection research, ASU Feature Selection Repository (2010) 1–28.