Quasi-newton method for LP multiple kernel learning


Hu Qinghui a,b, Wei Shiwei b,*, Li Zhiyuan b, Liu Xiaogang a

a Guangxi Colleges and Universities Key Laboratory Breeding Base of Robot and Welding Technology, Guilin University of Aerospace Technology, Guilin, China
b School of Computer Science and Engineering, Guilin University of Aerospace Technology, Guilin, China
* Corresponding author. Tel.: +8613077692672. E-mail address: [email protected] (W. Shiwei).

Article history: Received 3 November 2015; Received in revised form 14 January 2016; Accepted 29 January 2016

Abstract

Multiple kernel learning has advantages over single kernel learning in terms of model interpretability and generalization performance. Existing multiple kernel learning methods usually solve the SVM in the dual, which is equivalent to the primal optimization. Research shows that solving in the primal achieves a faster convergence rate than solving in the dual. This paper presents a novel LP-norm (P > 1) constrained non-sparse multiple kernel learning method which optimizes the objective function in the primal. Subgradients and a quasi-Newton approach are used to solve the standard SVM; the quasi-Newton method possesses a superlinear convergence property and acquires the inverse Hessian without computing second derivatives, leading to a preferable convergence speed. An alternating optimization method is used to solve the SVM and to learn the base kernel weights. Experiments show that the proposed algorithm converges rapidly and that its efficiency compares favorably to other multiple kernel learning algorithms.

Keywords: Multiple kernel learning; Quasi-Newton method; Alternating optimization

1. Introduction

The kernel method is an effective way to solve non-linear pattern recognition problems. For any kernel method, the data examples are first mapped to a high-dimensional Hilbert space H through a map φ: X → H, and then a linear decision boundary is found in that space. The map φ is computed implicitly through a kernel function k(x_i, x_j), which measures the similarity between data examples x_i and x_j. In the past several decades kernel methods have been widely used to solve machine learning problems such as classification [1,2], regression [3,4], density estimation [5] and subspace analysis [6]. For these tasks, the performance of the algorithm strongly depends on the data representation, which is implicitly chosen through the kernel function k(·,·). Many kernel methods adopt a single predefined kernel function. However, in many real-world applications a single predefined kernel is often not enough, because real data may come from multiple diverse sources or may be given in terms of different kinds of representations [7–13]. Multiple-kernel-based methods have therefore been studied extensively in the past few years [14–23]. Many applications have shown that using multiple kernels instead of a single one can effectively improve the interpretability and performance of the decision function, and can successfully address challenges in speech recognition [24], anomaly detection [25] and protein–protein interaction extraction [26].


In such cases, we often consider that the kernel k(·,·) is actually a convex combination of basis kernels:

k(x_i, x_j) = \sum_{m=1}^{M} \theta_m k_m(x_i, x_j), \qquad \text{s.t. } \sum_{m=1}^{M} \theta_m = 1, \; \theta_m \ge 0

where M is the total number of basic kernels and θ_m is the combination weight corresponding to kernel k_m(·,·). In such a multiple kernel learning framework, the choice of data representation in the feature space is transformed into the selection of the basic kernels and their weights. Improving training speed and predictive accuracy are the most active research directions in multiple kernel learning (MKL), and many effective methods have been investigated in recent years. The MKL problem was introduced by Lanckriet et al. [14] as a quadratically constrained quadratic programming (QCQP) problem, which rapidly becomes intractable as the number of data examples or basic kernels becomes large. Moreover, the kernel learning problem is nonsmooth, which makes the direct application of gradient descent methods infeasible. Bach et al. [15] considered a smoothed version of the problem so that techniques such as SMO can be applied. Sonnenburg et al. [16] reformulate the MKL problem of [15] as a semi-infinite linear program (SILP) and show that it can be solved by iteratively invoking an existing single-kernel SVM solver; note that SILP may suffer from instability of the MKL solution. However, the approaches above employ mixed-norm regularization, which results in slow convergence. Rakotomamonjy et al. [17] propose an algorithm named SimpleMKL that reformulates the mixed-norm regularization of the MKL problem as a weighted 2-norm regularization, which makes MKL more practical for large-scale learning. Several improved methods have since been proposed to solve this problem,




such as level-based optimization [18] and a second-order Newton method [19]. To avoid overfitting, regularization is imposed on the kernel weights. L1-norm MKL [17], for example, promotes sparse solutions in terms of the kernel weights and thus has preferable interpretability in kernel selection. Nevertheless, sparseness is not always beneficial in practice, and sparse MKL is frequently observed to be outperformed by a regular SVM with an unweighted-sum kernel [20,21]. Recently, Kloft et al. [20,21] proposed the LP-norm MKL method, which extends the regular L1-norm MKL to an arbitrary LP-norm with P > 1. Compared with L1-norm MKL, LP-norm MKL can significantly improve performance on diverse, relevant real-world datasets. Existing methods mainly reformulate the MKL problem as a saddle-point optimization problem and concentrate on solving it in the dual. Primal optimization and dual optimization are two equivalent routes to the same goal, but recent research shows that solving in the primal has better convergence properties than solving in the dual [27–29]. Firstly, we can efficiently solve the primal problem without the computations related to switching variables. Secondly, when it comes to approximate solutions, primal optimization is superior because it focuses directly on minimizing the quantity of interest, namely the primal objective. In this paper we show how to solve LP-norm MKL in the primal with an improved quasi-Newton method [30]. Since the quasi-Newton method possesses a superlinear convergence property and acquires the inverse Hessian without computing second derivatives, the proposed algorithm converges fast. As in other MKL methods, an alternating optimization algorithm is adopted to optimize the classical SVM and the kernel weights, respectively. Finally, we conduct a series of experiments to verify the efficiency and classification performance of our method. The paper is organized as follows. The MKL problem is introduced in Section 2. Section 3 describes the proposed MKL method in detail. Experiments are presented in Section 4, and concluding remarks are given in the last section.

2. Multiple kernel learning problem

The support vector machine (SVM) is one of the most successful applications of kernel-based methods. Consider a binary classification problem with training data D = {(x_i, y_i) | i = 1,...,n, x_i ∈ R^d}, where y_i = ±1 is the label of x_i. The data examples are first mapped to a high-dimensional Hilbert space H through a map φ, and then a linear decision boundary φ(x) is found in that space, maximizing the margin between the two classes. Generally, the decision boundary is constructed by minimizing the following generic objective function:

Q(f, b) = \frac{\lambda}{2} \| f \|_H^2 + C \sum_{i=1}^{n} \ell(y_i, f(x_i) + b)    (1)

where λ (λ ≥ 0) is a tuning parameter that balances the effect of the two terms on the right-hand side. The second term is the empirical risk of the hypothesis f, and ℓ is the hinge loss, commonly used for binary classification, with the following form:

\ell(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))    (2)

The decision boundary is defined as

\varphi(x) = f(x) + b    (3)

To simplify computation, the real scalar b is omitted, as it only affects the position of the boundary. Finally we can determine the class of an example: x_i belongs to the +1 class if f(x_i) > 0, and otherwise to the −1 class.

Consider a given feature map φ: X → H, where H corresponds to a kernel function k such that k(x_i, x_j) = φ(x_i)^T φ(x_j). Let K be the kernel matrix with K_{(ij)} = k(x_i, x_j), and let K_{(i)} be the ith column of K. Based on the representer theorem [31], the decision boundary is as follows:

f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x)    (4)

Here we denote the α_i's as expansion coefficients rather than the Lagrange multipliers α_i of the standard SVM. We can then transform the objective function (1) into the following form:

Q(\alpha) = \frac{\lambda}{2} \alpha^T K \alpha + C \sum_{i=1}^{n} \max(0, 1 - y_i K_i^T \alpha)    (5)
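As a concrete illustration of Eq. (5), the short sketch below evaluates the regularized hinge-loss objective for a precomputed kernel matrix. It is a minimal NumPy sketch, not code from the paper; the name primal_objective and the assumption that K is the full n × n kernel matrix are ours.

import numpy as np

def primal_objective(alpha, K, y, lam, C):
    # Q(alpha) = lam/2 * alpha^T K alpha + C * sum_i max(0, 1 - y_i * K_i^T alpha), cf. Eq. (5)
    margins = y * (K @ alpha)                  # y_i * K_i^T alpha for every example
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss of Eq. (2)
    return 0.5 * lam * alpha @ (K @ alpha) + C * hinge.sum()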

Consider M different feature maps φ_m: X → H_m, m = 1,...,M, corresponding to kernel functions k_m. The aim of MKL is to learn a linear convex combination of the basic kernels

k(x_i, x_j) = \sum_{m=1}^{M} \theta_m k_m(x_i, x_j)    (6)

The boundary f(x) is defined as

f(x) = \sum_{m=1}^{M} f_m(x)    (7)

According to Eqs. (5) and (6), the final optimization objective function can be formulated as

Q(\alpha, \theta) = \frac{\lambda}{2} \alpha^T \left( \sum_{m=1}^{M} \theta_m K_m \right) \alpha + C \sum_{i=1}^{n} \max\left( 0, 1 - y_i \sum_{m=1}^{M} \theta_m K_{m(i)}^T \alpha \right)    (8)

\text{s.t. } \left\{ \sum_{m=1}^{M} (\theta_m)^P \right\}^{1/P} \le 1, \quad \theta_m \ge 0, \quad P > 1

Here an arbitrary LP-norm constraint (P > 1) is imposed on the weights θ to achieve non-sparse solutions.
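To make the combined formulation concrete, the sketch below forms the combined kernel of Eq. (6) from precomputed base kernel matrices and evaluates the objective of Eq. (8), reusing the primal_objective helper sketched above. The names combine_kernels and mkl_objective are illustrative, not from the paper, and the LP-norm constraint is only checked, not enforced.

import numpy as np

def combine_kernels(Ks, theta):
    # K = sum_m theta_m * K_m, cf. Eq. (6); Ks is a list of precomputed n x n base kernel matrices
    return sum(t * Km for t, Km in zip(theta, Ks))

def mkl_objective(alpha, theta, Ks, y, lam, C, P=2):
    # Objective of Eq. (8) under the LP-norm constraint on theta
    theta = np.asarray(theta, dtype=float)
    assert np.all(theta >= 0) and np.sum(theta ** P) ** (1.0 / P) <= 1.0 + 1e-9
    return primal_objective(alpha, combine_kernels(Ks, theta), y, lam, C)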

3. Optimizing the MKL problem in the primal

One general approach for solving the problem Q(α, θ) is the alternating optimization algorithm applied in [17–21]. In the first step, Q(α, θ) is optimized with respect to α with θ fixed. In the second step, the weights θ are updated to decrease the objective function Q(α, θ) with α fixed. The two steps are alternated until a predefined criterion is satisfied. With θ fixed, Eq. (8) is a nonsmooth function with respect to α. We adopt an improved quasi-Newton method, named subLBFGS [30], to solve this nonsmooth optimization problem. We first present some details of this optimization technique.

The quasi-Newton method is an effective method for solving nonlinear optimization problems with a superlinear convergence property. An approximation to the inverse Hessian, built up from information gathered during the descent process, is used in place of the true inverse required in Newton's method. There are several kinds of quasi-Newton methods according to the approximation used for the Hessian matrix, such as BFGS and L-BFGS.

Definition 1 (Subgradient). Considering a convex function Q: R^d → R, a vector g ∈ R^d is a subgradient of Q at the point w if and only if, for all w' ∈ R^d, Q(w') ≥ Q(w) + (w' − w)^T g.

Definition 2 (Subdifferential). The set of all subgradients of the convex function Q at the point w is called the subdifferential of Q at w and is denoted by ∂Q(w).

Based on the above definitions we can conclude that the function Q is subdifferentiable at the point w if the set of subgradients is not empty.



Fig. 1. Nonsmooth points of Eq. (8) and subgradient. (a) Characteristics of hinge loss function ℓ. (b) Geometric interpretation of subgradient.

If it contains only one element, i.e., ∂Q(w) = {∇Q(w)}, then Q is said to be differentiable at w.

The problem with the BFGS (L-BFGS) method is that the objective function is assumed to be smooth and convex. In many machine learning problems, however, the objective function is convex but nonsmooth; Eq. (8) with respect to α is an example, see Fig. 1(a). Such objective functions might not be differentiable everywhere, for example at points where y_i (\sum_{m=1}^{M} \theta_m K_{m(i)}^T \alpha) = 1, while a subgradient always exists. A direction p is a descent direction if and only if p^T g < 0 for all g ∈ ∂Q(w). For a smooth function the negative gradient is always a descent direction, but for a nonsmooth function a negative subgradient may not be, which makes plain BFGS invalid. Yu et al. [30] propose a modified version, which acquires an exact descent direction p through an iterative bundle-search procedure. The detailed algorithm for finding p is described as follows:

Algorithm 1. Finding an exact descent direction p.
Input: maximum iterations k_max, an arbitrary subgradient g_1 ∈ ∂Q(w_t) at the current iterate
Output: exact descent direction p
1: Initialize k = 1, p_1 = −B_t g_1
2: REPEAT
3:   g_{k+1} = argsup_{g ∈ ∂Q(w_t)} p_k^T g
4:   IF p_k^T g_{k+1} < 0 THEN
5:     RETURN p_k
6:   END IF
7:   p_{k+1} = μ p_k + (1 − μ)(−B_t g_{k+1}), 0 ≤ μ ≤ 1
8:   k = k + 1
9: UNTIL k ≥ k_max
10: RETURN NULL
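A minimal Python (NumPy) sketch of Algorithm 1 follows. It assumes an oracle sup_subgradient(p) that returns the "most violating" subgradient for a direction p (line 3); for simplicity the mixing coefficient μ is fixed here rather than searched for, which is a simplification of the procedure in [30].

import numpy as np

def find_descent_direction(B, g1, sup_subgradient, k_max=20, mu=0.5):
    # Algorithm 1: look for p with sup_{g in dQ(w)} p^T g < 0
    p = -B @ g1                               # line 1: p_1 = -B_t g_1
    for _ in range(k_max):
        g = sup_subgradient(p)                # line 3: most violating subgradient for p
        if p @ g < 0:                         # line 4: descent even for the worst-case subgradient
            return p
        p = mu * p + (1.0 - mu) * (-B @ g)    # line 7: convex combination of directions
    return None                               # line 10: no descent direction found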

In line 3 of Algorithm 1, the argsup operation picks out, from the subdifferential set, the subgradient that most violates the descent condition for the current direction; this subgradient is then used in line 7. To this end, Yu et al. [30] write the proposed descent direction at iteration k + 1 as a convex combination of the previous direction p_k and −B_t g_{k+1} (see step 7), where g_{k+1} is the new subgradient picked out in line 3. After finding p, a one-dimensional line search is used to acquire the best step size. The Wolfe conditions guarantee a sufficient decrease in the objective function and

avoid unnecessarily small step sizes. For the nonsmooth case they are modified as follows:

Q(w_{t+1}) \le Q(w_t) + c_1 \sup_{g \in \partial Q(w_t)} g^T p_t
\sup_{g' \in \partial Q(w_{t+1})} (g')^T p_t \ge c_2 \sup_{g \in \partial Q(w_t)} g^T p_t
\text{s.t. } 0 < c_1 < c_2 < 1    (9)

The modified quasi-Newton method for nonsmooth convex optimization is called subLBFGS. We describe how we use it to solve the MKL problem in the next subsection.

3.1. Solving α with subLBFGS

In order to apply the subLBFGS algorithm to the objective function (8) to solve for α, we first let E, R and W index the sets of data examples that are in error, on the margin, and well classified, respectively:

E = \{ i \in \{1, \dots, n\} : 1 - y_i (\sum_{m=1}^{M} \theta_m K_{m(i)}^T \alpha) > 0 \}
R = \{ i \in \{1, \dots, n\} : 1 - y_i (\sum_{m=1}^{M} \theta_m K_{m(i)}^T \alpha) = 0 \}
W = \{ i \in \{1, \dots, n\} : 1 - y_i (\sum_{m=1}^{M} \theta_m K_{m(i)}^T \alpha) < 0 \}    (10)

Differentiating Eq. (8) then yields

\frac{\partial Q}{\partial \alpha} = \lambda \sum_{m=1}^{M} \theta_m K_m \alpha - C \sum_{i=1}^{n} \beta_i y_i \sum_{m=1}^{M} \theta_m K_{m(i)} = \omega - C \sum_{i \in R} \beta_i y_i \sum_{m=1}^{M} \theta_m K_{m(i)}    (11)

where \omega = \lambda \sum_{m=1}^{M} \theta_m K_m \alpha - C \sum_{i \in E} \beta_i y_i \sum_{m=1}^{M} \theta_m K_{m(i)}, and \beta_i = 1 for i ∈ E, \beta_i ∈ [0, 1] for i ∈ R, and \beta_i = 0 for i ∈ W.
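When the base kernel matrices are precomputed, the subgradient of Eqs. (10)-(11) can be written compactly. The sketch below is an illustration under stated assumptions: Ks, theta, y, lam and C mirror the symbols above, beta_R is the value chosen for margin points, and the tolerance used to detect the margin set R is ours, not the paper's.

import numpy as np

def subgradient(alpha, Ks, theta, y, lam, C, beta_R=0.0, tol=1e-8):
    # One element of dQ/d(alpha) following Eqs. (10)-(11)
    K = sum(t * Km for t, Km in zip(theta, Ks))      # combined kernel, Eq. (6)
    margin = 1.0 - y * (K @ alpha)
    E = margin > tol                                 # in error
    R = np.abs(margin) <= tol                        # on the margin
    beta = np.where(E, 1.0, np.where(R, beta_R, 0.0))
    return lam * (K @ alpha) - C * (K @ (beta * y))  # lam*sum_m theta_m K_m alpha - C*sum_i beta_i y_i K_(i)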

In the subLBFGS algorithm, a subgradient is needed for a given direction p; it can be calculated in linear time using only the marginal points at the current iterate:

\sup_{g \in \partial Q(\alpha)} g^T p = \sup \left( \frac{\partial Q}{\partial \alpha} \right)^T p = \omega^T p - \inf \left( C \sum_{i \in R} \beta_i y_i \sum_{m=1}^{M} \theta_m K_{m(i)}^T p \right)    (12)

For a given p, \beta_i ∈ [0, 1] for i ∈ R. Since the first term on the right-hand side of Eq. (12) is a constant, the supremum \sup_{g \in \partial Q(\alpha)} g^T p


is attained with the following strategy:

\beta_i = 0 \text{ if } y_i \sum_{m=1}^{M} \theta_m K_{m(i)}^T p \ge 0, \qquad \beta_i = 1 \text{ if } y_i \sum_{m=1}^{M} \theta_m K_{m(i)}^T p < 0    (13)
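Combining Eqs. (12) and (13) gives the argsup oracle needed in line 3 of Algorithm 1: on margin points, β_i is set to 1 exactly when doing so increases g^T p. A sketch, reusing the same hypothetical arguments as the subgradient sketch above:

import numpy as np

def sup_subgradient_for_direction(p, alpha, Ks, theta, y, lam, C, tol=1e-8):
    # Subgradient attaining sup_{g in dQ(alpha)} g^T p, cf. Eqs. (12)-(13)
    K = sum(t * Km for t, Km in zip(theta, Ks))
    margin = 1.0 - y * (K @ alpha)
    E = margin > tol
    R = np.abs(margin) <= tol
    Kp = K @ p                                        # (sum_m theta_m K_m(i))^T p for every i
    beta = np.where(E, 1.0, 0.0)
    beta = np.where(R & (y * Kp < 0), 1.0, beta)      # Eq. (13): beta_i = 1 only if it raises g^T p
    return lam * (K @ alpha) - C * (K @ (beta * y))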

Once a descent direction p is found, an exact line search is performed for the best step size η, i.e., by exactly solving the one-dimensional minimization problem

\eta = \arg\min_{\eta > 0} \varphi(\eta) = \arg\min_{\eta > 0} Q(\alpha + \eta p)    (14)

The details of solving α are described in Algorithm 2.

Algorithm 2. Solving α with subLBFGS.
Input: initial α
Output: optimized α
1: Initialize: iteration counter t = 0, inverse Hessian approximation B_t = I, α_t = α
2: compute an arbitrary subgradient g_t at the current iterate with Eq. (11)
3: REPEAT
4:   find an exact descent direction p_t using Algorithm 1
5:   IF p_t = NULL THEN
6:     BREAK
7:   END IF
8:   perform the exact line search of Eq. (14) to find the best step size η_t
9:   s_t = η_t p_t, update α_{t+1} = α_t + s_t
10:  compute an arbitrary g_{t+1} ∈ ∂Q(α_{t+1}) such that (g_{t+1} − g_t)^T s_t > 0
11:  update B_{t+1} using the BFGS update equation
12:  t = t + 1
13: UNTIL a predefined criterion is satisfied
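The outer loop of Algorithm 2 can then be sketched as follows. This is a simplified stand-in, not the authors' implementation: it reuses the hypothetical helpers sketched above, replaces the exact line search of Eq. (14) with simple backtracking, and uses a full BFGS inverse-Hessian update instead of the limited-memory (L-BFGS) variant.

import numpy as np

def solve_alpha(alpha, Ks, theta, y, lam, C, max_iter=50, tol=1e-6):
    # Simplified sketch of Algorithm 2 (subLBFGS for alpha)
    n = alpha.size
    B = np.eye(n)                                    # inverse-Hessian approximation B_t
    K = sum(t * Km for t, Km in zip(theta, Ks))
    g = subgradient(alpha, Ks, theta, y, lam, C)
    for _ in range(max_iter):
        oracle = lambda p: sup_subgradient_for_direction(p, alpha, Ks, theta, y, lam, C)
        p = find_descent_direction(B, g, oracle)
        if p is None or np.linalg.norm(p) < tol:
            break
        eta, Q0 = 1.0, primal_objective(alpha, K, y, lam, C)
        while eta > 1e-8 and primal_objective(alpha + eta * p, K, y, lam, C) >= Q0:
            eta *= 0.5                               # backtracking instead of the exact search (14)
        s = eta * p
        alpha_new = alpha + s
        g_new = subgradient(alpha_new, Ks, theta, y, lam, C)
        yk = g_new - g
        if s @ yk > 1e-12:                           # curvature condition before the BFGS update
            rho = 1.0 / (s @ yk)
            I = np.eye(n)
            B = (I - rho * np.outer(s, yk)) @ B @ (I - rho * np.outer(yk, s)) + rho * np.outer(s, s)
        alpha, g = alpha_new, g_new
    return alpha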

3.2. Solving the weight vector θ

From Eq. (7) we can obtain

f(x) = \sum_{m=1}^{M} f_m(x) = \sum_{m=1}^{M} \sum_{i=1}^{n} \alpha_{i(m)} k_m(x_i, x) = \sum_{i=1}^{n} \sum_{m=1}^{M} \alpha_{i(m)} k_m(x_i, x)    (15)

Combined with Eq. (4) we get \alpha_{(m)} = \theta_m \alpha, where \alpha_{i(m)} is the ith expansion coefficient corresponding to the mth kernel. Eq. (8) is thus transformed into the following form:

Q(\alpha_{(m)}, \theta) = \frac{\lambda}{2} \sum_{m=1}^{M} \frac{1}{\theta_m} \alpha_{(m)}^T K_m \alpha_{(m)} + C \sum_{i=1}^{n} \max\left( 0, 1 - y_i \sum_{m=1}^{M} K_{m(i)}^T \alpha_{(m)} \right)    (16)

\text{s.t. } \left\{ \sum_{m=1}^{M} (\theta_m)^P \right\}^{1/P} \le 1, \quad \theta_m \ge 0, \quad P > 1

With \alpha_{(m)}, m = 1,...,M, fixed, optimizing θ is equivalent to the following constrained optimization problem:

Q(\theta) = \frac{\lambda}{2} \sum_{m=1}^{M} \frac{1}{\theta_m} \alpha_{(m)}^T K_m \alpha_{(m)} + C \sum_{i=1}^{n} \max\left( 0, 1 - y_i \sum_{m=1}^{M} K_{m(i)}^T \alpha_{(m)} \right)    (17)

\text{s.t. } \left\{ \sum_{m=1}^{M} (\theta_m)^P \right\}^{1/P} \le 1, \quad \theta_m \ge 0, \quad P > 1

Finally the kernel weights are calculated by

\theta_m = \frac{ \left( \alpha_{(m)}^T K_m \alpha_{(m)} \right)^{1/(P+1)} }{ \left[ \sum_{i=1}^{M} \left( \alpha_{(i)}^T K_i \alpha_{(i)} \right)^{P/(P+1)} \right]^{1/P} }, \quad m = 1, \dots, M    (18)

The overall procedure is summarized in Algorithm 3.

Algorithm 3. QN-MKL.
Input: data examples (x_1, y_1), ..., (x_n, y_n)
Output: α, θ
1: Initialize: t = 0, M = number of base kernels, α^t = 0, θ_m^t = \sqrt{1/M}, λ, C
2: REPEAT
3:   fix θ^t, calculate K = \sum_{m=1}^{M} θ_m^t K_m, optimize α^{t+1} by using Algorithm 2
4:   fix α^{t+1}, set α_{(m)}^{t+1} = θ_m^t α^{t+1}, calculate θ_m^{t+1} with Eq. (18), m = 1, 2, ..., M
5:   t = t + 1
6: UNTIL a predefined criterion is satisfied

Algorithm 3 performs an alternating procedure; the predefined criterion may be a maximum number of loops or convergence of the kernel weights. The complexity of QN-MKL is O(M · T_subLBFGS · T_altStep), where T_subLBFGS is the time taken by subLBFGS to solve α and T_altStep is the number of steps taken by the alternating optimization process.
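Putting the pieces together, a compact sketch of Algorithm 3 follows; solve_alpha is the hypothetical helper from the previous sketch, and the θ update implements the closed form of Eq. (18) with α_(m) = θ_m α.

import numpy as np

def qn_mkl(Ks, y, lam, C, P=2, max_outer=50, tol=1e-4):
    # Sketch of Algorithm 3 (QN-MKL): alternate between alpha (Algorithm 2) and theta (Eq. (18))
    M, n = len(Ks), Ks[0].shape[0]
    theta = np.full(M, (1.0 / M) ** (1.0 / P))       # uniform initialization on the LP ball
    alpha = np.zeros(n)
    for _ in range(max_outer):
        alpha = solve_alpha(alpha, Ks, theta, y, lam, C)               # step 3
        a = np.array([(theta[m] ** 2) * (alpha @ Ks[m] @ alpha) for m in range(M)])
        a = np.maximum(a, 1e-12)                     # guard against zero terms
        theta_new = a ** (1.0 / (P + 1)) / np.sum(a ** (P / (P + 1.0))) ** (1.0 / P)   # Eq. (18)
        converged = np.linalg.norm(theta_new - theta) < tol
        theta = theta_new
        if converged:                                # stopping criterion of Section 4.1
            break
    return alpha, theta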

4. Experiments

In this section, a series of experiments are conducted to evaluate the efficiency and classification performance of the proposed algorithm, QN-MKL, compared with a number of state-of-the-art MKL algorithms that use different optimization techniques. The MKL algorithms in our comparison include:

1) SimpleMKL [17]: an algorithm that reformulates the mixed-norm regularization of the MKL problem as a weighted 2-norm regularization; an L1-norm constraint is imposed on the kernel weights.
2) HessianMKL [19]: an algorithm that uses a second-order optimization approach to solve the MKL problem; an L1-norm constraint is imposed on the kernel weights.
3) primalMKL [28]: an MKL algorithm that reformulates the optimization problem as a biconvex optimization and solves it in the primal; an L1-norm constraint is imposed on the kernel weights.
4) LP-MKL [20]: an MKL algorithm that generalizes the regular L1-norm MKL method to an arbitrary LP-norm (P > 1) MKL method.

Table 1. Datasets (n: number of examples, d: number of features, c: number of classes).

Name            n      d    c
a1a             1605   123  2
a3a             3185   123  2
a4a             4780   123  2
Balance-scale   576    4    2
Diabetes        768    8    2
German.number   1000   24   2
Heart           270    13   2
Ionosphere      351    34   2
Liver           345    5    2
monks2          1033   6    2
pima            768    8    2
svmguide3       1243   21   2
satimage        4435   36   6
segment         2310   19   7
wdbc            569    30   2


Table 2. Classification accuracy (mean ± standard deviation) of the five algorithms.

Dataset         SimpleMKL        HessianMKL       primalMKL        LP-MKL           QN-MKL
a1a             0.7864 ± 0.0007  0.7871 ± 0.0032  0.7846 ± 0.0214  0.8002 ± 0.0045  0.8193 ± 0.0042
a3a             0.7821 ± 0.0002  0.7824 ± 0.0055  0.8061 ± 0.0533  0.8093 ± 0.0018  0.8215 ± 0.0053
a4a             0.7859 ± 0.0004  0.7847 ± 0.0012  0.8148 ± 0.0217  0.8044 ± 0.0009  0.8182 ± 0.0085
Balance-scale   0.9613 ± 0.0095  0.9630 ± 0.0026  0.9673 ± 0.0148  0.9841 ± 0.0026  0.9836 ± 0.0042
Diabetes        0.7745 ± 0.0024  0.7736 ± 0.0011  0.7704 ± 0.0015  0.7854 ± 0.0036  0.7839 ± 0.0005
German.number   0.6912 ± 0.0007  0.6925 ± 0.0008  0.7007 ± 0.0055  0.7082 ± 0.0005  0.7079 ± 0.0054
Heart           0.8479 ± 0.0013  0.8478 ± 0.0023  0.8468 ± 0.0027  0.8579 ± 0.0019  0.8593 ± 0.0022
Ionosphere      0.9314 ± 0.0115  0.9302 ± 0.0046  0.9477 ± 0.0264  0.9445 ± 0.0214  0.9607 ± 0.0263
Liver           0.6917 ± 0.0017  0.6914 ± 0.0011  0.6746 ± 0.0037  0.7288 ± 0.0162  0.7281 ± 0.0217
pima            0.7511 ± 0.0127  0.7542 ± 0.0084  0.7567 ± 0.0100  0.7639 ± 0.0067  0.7646 ± 0.0144
monks2          0.9357 ± 0.0004  0.9402 ± 0.0127  0.9509 ± 0.0222  0.9548 ± 0.0065  0.9524 ± 0.0313
svmguide3       0.8046 ± 0.0013  0.8147 ± 0.0049  0.8078 ± 0.0150  0.8165 ± 0.0032  0.8144 ± 0.0148
satimage        0.8818 ± 0.0037  0.8775 ± 0.0026  0.8916 ± 0.0015  0.8929 ± 0.0068  0.8942 ± 0.0053
segment         0.9374 ± 0.0043  0.9334 ± 0.0043  0.9443 ± 0.0083  0.9582 ± 0.0035  0.9578 ± 0.0212
wdbc            0.9603 ± 0.0024  0.9611 ± 0.0129  0.9607 ± 0.0135  0.9689 ± 0.0048  0.9759 ± 0.0094


Fig. 2. Training time (in seconds) versus the percentage of training data on different datasets, for SimpleMKL, HessianMKL, primalMKL, LP-MKL and QN-MKL. (a) a3a. (b) balance-scale. (c) ionosphere. (d) monks2. (e) pima. (f) wdbc.



Fig. 3. Influence of λ on classification performance. (a) Training error (%) versus λ on the training datasets. (b) Testing error (%) versus λ on the testing datasets.


Fig. 4. The convergence trend of θ_m (for seven Gaussian kernels with σ = 2^{-3}, ..., 2^3) versus the number of iterations of the alternating optimization on different datasets. (a) a3a. (b) balance-scale. (c) ionosphere. (d) monks2. (e) pima. (f) wdbc.


All the benchmark datasets used in our experiments are shown in Table 1. These datasets are widely used to evaluate the classification performance of machine learning algorithms. We adopt a one-against-all approach to train a set of binary classifiers for the multiclass classification tasks. All experiments are run in Matlab 7.10.0.499 on a Windows 8 64-bit operating system with an Intel(R) Core(TM) i5 2.80 GHz CPU and 16 GB RAM.

4.1. Experimental setup

In our experiments, we follow the general approach used for other MKL algorithms in the literature. We create a pool of 27 basic kernels with the following kernel parameter settings: 1) Gaussian kernels with 24 different widths σ = (2^{-5}, 2^{-4}, ..., 2^5, 2^6) and (1.5^{-5}, 1.5^{-4}, ..., 1.5^5, 1.5^6); 2) polynomial kernels of degree 1 to 3 on all features. In QN-MKL we set the parameter C = 150/n, P = 2 for the best trade-off between accuracy and efficiency, and λ = 0.0001. The condition ||θ^{t+1} − θ^t|| < 10^{-4} is used as the stopping criterion of the alternating process. We adopt the SimpleMKL toolbox, which includes the implementation of the SimpleMKL algorithm; it employs the duality gap as the stopping criterion, and the optimization process is stopped when the duality gap falls below a threshold of 0.01 or the number of iterations reaches 500. For LP-MKL, we set P = 2, the same as in our method. For primalMKL, we adopt its default settings used in [28].
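A sketch of the 27-kernel pool described above (24 Gaussian widths plus polynomial degrees 1 to 3). The Gaussian parameterization exp(-||x - z||^2 / (2σ^2)) and the unnormalized polynomial form (x^T z + 1)^d are common choices and are assumptions here; the SimpleMKL toolbox used in the paper may normalize its kernels differently.

import numpy as np

def build_base_kernels(X):
    # 24 Gaussian widths: sigma in {2^-5,...,2^6} and {1.5^-5,...,1.5^6}, plus polynomial degrees 1-3
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)     # pairwise squared distances
    widths = [2.0 ** k for k in range(-5, 7)] + [1.5 ** k for k in range(-5, 7)]
    kernels = [np.exp(-D2 / (2.0 * s ** 2)) for s in widths]
    G = X @ X.T
    kernels += [(G + 1.0) ** d for d in (1, 2, 3)]       # polynomial kernels on all features
    return kernels                                        # 27 base kernel matrices in total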


4.2. Experiment results analysis

4.2.1. Classification accuracy

We first conduct a series of experiments to evaluate the classification accuracy of the five algorithms. In these experiments, we randomly select 50 percent of each dataset as the training set and use the rest as the testing set. Each experiment is repeated 10 times, and the average classification accuracy and standard deviation of each algorithm are reported in Table 2. The best results are highlighted in bold font. From the results we can conclude that the L1-norm MKL algorithms, such as SimpleMKL, HessianMKL and primalMKL, are generally comparable in accuracy, but they perform significantly worse than the LP-norm constrained MKL algorithms (LP-MKL, QN-MKL), since they only perform a kernel selection process and acquire sparse kernel weights. QN-MKL and LP-MKL each achieve better performance on different datasets. Overall, our method outperforms the L1-norm MKL algorithms and is comparable with LP-MKL in most cases.

4.2.2. Training time

In the following experiments we compare the training time of QN-MKL with that of the other algorithms. The training time does not include the time for generating the kernel matrices. For primalMKL, we use the default settings, in which the training time is related to the loop-count parameter Train_times; Train_times equals 5 in our experiments. All algorithms are run 10 times on random splits, and the relationship between training time and training size on six datasets is shown in Fig. 2(a–f).

From Fig. 2 we find that the QN-MKL algorithm enjoys a significant advantage in learning efficiency over the other MKL algorithms.


Fig. 5. Influence of P on θ: converged θ_m for the seven basic kernels (widths 0.125 to 8). (a) P = 2. (b) P = 3. (c) P = 4. (d) P = 6. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Obviously, the time cost of QN-MKL increases more slowly than that of the other algorithms as the number of training examples grows, especially on the ionosphere, monks2 and pima datasets. Fig. 2 also shows that the time cost of the LP-MKL algorithm grows drastically, as it has higher computational complexity and a slower convergence speed than the others. Although both primalMKL and QN-MKL solve the MKL problem in the primal, our method is generally faster than primalMKL; the reason is that an L-BFGS method (used in our method) often outperforms the conjugate gradient method (a Hessian-free approach, used in primalMKL). This confirms the efficiency of our method.

4.3. Performance analysis

4.3.1. Influence of λ on classification accuracy

Six datasets are selected to verify the influence of the regularization parameter λ on classification accuracy. 50% of the data examples are selected as training data and the rest as testing data. We vary λ in the interval [0.0001, 0.1]. The relationship between λ and the classification error on the training and testing datasets is shown in Fig. 3(a) and (b). Fig. 3 shows that the classification accuracy of our method is sensitive to λ: the smaller λ is, the better the classification performance on both the training and testing datasets. This is because the scalar bias b is omitted from the objective function, so the norm of the decision function f should be large in order to classify the training examples as correctly as possible (i.e., so that 1 − y_i f(x_i) is less than zero). A small λ simply reduces the regularization penalty incurred by the large norm of f. We therefore set λ = 0.0001 in all the experiments.

4.3.2. Convergence trend of θ

We check the convergence of our algorithm on six datasets. For convenience of plotting, we set P = 6, and only seven Gaussian kernels are selected, with σ = (2^{-3}, 2^{-2}, ..., 2^2, 2^3). All data examples are used as the training set. The relationship between the number of iterations of the alternating optimization and the convergence trend of θ on different datasets is shown in Fig. 4(a–f). Fig. 4 shows that our method converges after only about 10 iterations, which means that it has a fast convergence speed. It is noted that the parameter P may also affect the convergence speed of the proposed method: the larger P is, the slower the convergence.

4.3.3. Influence of P on θ

Fig. 5(a–d) shows how the weights θ are affected by different values of P, where the horizontal axis corresponds to the seven basic kernels with σ = (2^{-3}, 2^{-2}, ..., 2^2, 2^3) and the vertical axis is the convergent value of θ_m (m = 1,...,7), marked by different color bars. Fig. 5(b–d) shows that when P is greater than or equal to 3, almost all the basic kernels are retained, which may lead to poor performance if the kernel pool contains noise kernels. Only a few useful kernels are retained when P equals 2. In our method, we set P = 2 for the best trade-off between accuracy and efficiency. We could also control the sparseness of θ through the convergence condition of the alternating optimization process.

5. Conclusion

In this paper we proposed a novel method to solve the non-sparse MKL problem in the primal. We adopted an alternating optimization approach to solve the classical SVM and the kernel weights, respectively. Subgradients and an improved quasi-Newton method are used to optimize the nonsmooth objective function with respect to the expansion coefficients α, while the weights are obtained analytically. Our method has a preferable convergence property, benefiting from the superlinear convergence of the quasi-Newton method. We also carried out a series of experiments on several datasets to demonstrate its effectiveness. Compared with some existing MKL methods, our method also exhibits comparable classification performance. In the future we plan to carry out theoretical analysis, such as convergence guarantees and bounds on the generalization error.

Acknowledgment This work is supported by the Project of the National Natural Science Foundation of China (60975050), National Natural Science Foundation of Guangxi (2014GXNSFAA1183105, 2014GXNSFBA118286) and the Higher Education Science Research Project of Guangxi (ZD2014147).

References

[1] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[2] Tobias Reitmaier, Bernhard Sick, The responsibility weighted Mahalanobis kernel for semi-supervised training of support vector machines for classification, Inf. Sci. 323 (2015) 179–198.
[3] Zhen Zhang, Yanning Zhang, Variable kernel density estimation based robust regression and its applications, Neurocomputing 134 (9) (2014) 30–37.
[4] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (3) (2004) 199–222.
[5] Yujie He, Yi Mao, Wenlin Chen, et al., Nonlinear metric learning with kernel density estimation, IEEE Trans. Knowl. Data Eng. 27 (6) (2015) 1602–1614.
[6] B. Schölkopf, S. Mika, A. Smola, G. Ratsch, K.R. Muller, Kernel PCA pattern reconstruction via approximation pre-images, in: Proceedings of the IEEE International Conference on Artificial Neural Networks, Skovde, Sweden, 1998, pp. 147–152.
[7] K.R. Muller, S. Mika, G. Ratsch, et al., An introduction to kernel based learning algorithms, IEEE Trans. Neural Netw. 12 (2) (2001) 181–201.
[8] F.R. Bach, Consistency of the group Lasso and multiple kernel learning, J. Mach. Learn. Res. 9 (6) (2008) 1179–1225.
[9] D.N. Zheng, J.X. Wang, Y.N. Zhao, Non-flat function estimation with a multi-scale support vector regression, Neurocomputing 70 (13) (2006) 420–429.
[10] Tao Guan, Yuesong Wang, Liya Duan, On-device mobile landmark recognition using binarized descriptor with multifeature fusion, ACM Trans. Intell. Syst. Technol. 7 (1) (2015).
[11] Yawei Luo, Tao Guan, Benchang Wei, Hailong Pan, Junqing Yu, Fast terrain mapping from low altitude digital imagery, Neurocomputing 156 (2015) 105–116.
[12] T. Guan, Y. He, J. Gao, J. Yang, J. Yu, On-device mobile visual location recognition by integrating vision and inertial sensors, IEEE Trans. Multimed. 15 (7) (2013) 1688–1699.
[13] Benchang Wei, Tao Guan, Junqing Yu, Projected residual vector quantization for ANN search, IEEE Multimed. 21 (3) (2014) 41–51.
[14] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, et al., Learning the kernel matrix with semi-definite programming, J. Mach. Learn. Res. 5 (1) (2004) 27–72.
[15] F.R. Bach, G.R.G. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proceedings of the 21st International Conference on Machine Learning, ACM, 2004, pp. 41–48.
[16] S. Sonnenburg, G. Ratsch, C. Schafer, et al., Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (7) (2006) 1531–1565.
[17] A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (11) (2008) 2491–2521.
[18] Z. Xu, R. Jin, I. King, M.R. Lyu, An extended level method for efficient multiple kernel learning, in: Proceedings of NIPS, 2008.
[19] O. Chapelle, A. Rakotomamonjy, Second order optimization of kernel parameters, in: Proceedings of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.
[20] M. Kloft, U. Brefeld, S. Sonnenburg, et al., lp-Norm multiple kernel learning, J. Mach. Learn. Res. 12 (5) (2011) 953–997.
[21] M. Kloft, G. Blanchard, On the convergence rate of lp-norm multiple kernel learning, J. Mach. Learn. Res. 13 (8) (2012) 2465–2501.
[22] T. Guan, Y.F. He, L.Y. Duan, J.Q. Yu, Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimed. 21 (2) (2014) 32–41.
[23] Mehmet Gönen, Ethem Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (7) (2011) 2211–2268.
[24] C. Longworth, M.J.F. Gales, Combining derivative and parametric kernels for speaker verification, IEEE Trans. Audio Speech Lang. Process. 17 (4) (2009) 748–757.
[25] S. Das, B.L. Matthews, A.N. Srivastava, et al., Multiple kernel learning for heterogeneous anomaly detection: algorithm and aviation safety case study, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 47–56.
[26] M. Miwa, R. Sætre, Y. Miyao, et al., Protein–protein interaction extraction by leveraging multiple kernels and parsers, Med. Inform. 78 (12) (2009) 34–46.
[27] S. Melacci, M. Belkin, Laplacian support vector machines trained in the primal, J. Mach. Learn. Res. 12 (5) (2011) 1149–1184.
[28] Zhifeng Hao, Ganzhao Yuan, Xiaowei, et al., A primal method for multiple kernel learning, Neural Comput. Appl. 23 (3) (2013) 975–987.
[29] Zhizheng Liang, ShiXiong Xia, Yong Zhou, et al., Training Lp norm multiple kernel learning in the primal, Neural Netw. 46 (5) (2013) 172–182.
[30] J. Yu, S.V.N. Vishwanathan, S. Gunter, et al., A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, J. Mach. Learn. Res. 11 (5) (2010) 1145–1200.
[31] G. Kimeldorf, G. Wahba, Some results on Tchebycheffian spline functions, J. Math. Anal. 33 (1) (1971) 82–95.

Hu Qinghui was born in Chongqing, China, in 1976. He received the Master degree from Southwest Petroleum University in 2005 and the Ph.D. degree from Wuhan University in 2015. He is now an associate professor at Guilin University of Aerospace Technology, China. His research interests include pattern recognition and intelligent computation.

Wei Shiwei was born in Henan, China, in 1979. He received the Bachelor degree from Guilin Institute of Electronic Technology in 2004 and the Master degree from Guilin University of Electronic Technology in 2007. He is now working in the School of Information Engineering at Guilin University of Aerospace Technology. His research interests are in the areas of cloud computing, distributed systems and intelligent computing.

Li Zhiyuan was born in Jiangxi, China, in 1964. He received the Bachelor degree from Shanghai University in 1984. He is now a professor at Guilin University of Aerospace Technology, China. His research interests include mobile ad hoc networks.

Liu Xiaogang was born in Heilongjiang, China, in 1964. He received the Bachelor degree from Dalian University of Technology in 1987, the Master degree from University of Science and Technology Beijing in 1995, and the Ph.D. degree from South China University of Technology in 2009. He is now a professor at Guilin University of Aerospace Technology, China. His research interests include intelligent robot welding.
