Multi-Class Kernel Margin Maximization for Kernel Learning

Yan-Guo Zhao^{a,c}, Miaomiao Li^{b}, Ronald Chung^{d}, Zhan Song^{1,a,d}

^a Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, 518055
^b School of Computer, National University of Defense Technology, Changsha, China, 410073
^c Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, China, 518055
^d Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, China
Abstract

Two-stage multiple kernel learning (MKL) algorithms have been intensively studied due to their high efficiency and effectiveness. Pioneering work in this regard attempts to optimize the combination coefficients by maximizing the multi-class margin of a kernel, but obtains unsatisfying performance. In this paper, we attribute this poor performance to the way the multi-class margin of a kernel is calculated. Specifically, we argue that for each sample only the k-nearest neighbours, rather than all samples with the same label, should be selected for calculating the margin. We further develop a sparse variant which is able to automatically identify the nearest neighbours of each sample and the corresponding weights. Extensive experimental results on ten UCI data sets and six MKL benchmark data sets demonstrate the effectiveness and efficiency of the proposed algorithms.

Keywords: Large Margin, Kernel Methods, Support Vector Machines

^1 Corresponding author ([email protected]). Dr. Zhan Song is with the Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences.

1. Introduction

Kernel learning algorithms have been intensively studied in the machine learning community during the last decade Lanckriet et al. (2004); Rakotomamonjy et al. (2008); Gönen and Alpaydın (2011); Yan et al. (2012); Cortes et al. (2012); Bucak et al. (2014); Zhang et al. (2010); Liu et al. (2015, 2012); Qinghui et al. (2016); Wu et al. (2010). Among them, multiple kernel learning optimally combines multiple pre-specified base kernels to improve the performance of learning tasks. As an active research direction in this regard, two-stage MKL algorithms Kumar et al. (2012); Cortes et al. (2012) provide an efficient and effective approach to learning an optimal kernel from the data. Two-stage MKL algorithms first learn the optimal kernel coefficients according to some criterion, and then apply the learned optimal kernel in kernel-based algorithms to train a kernel model. Compared with one-stage MKL algorithms Zien and Ong (2007); Ye et al. (2008); Xu et al. (2010); Gönen (2012); Cortes et al. (2013a); Liu et al. (2013, 2014), in which the optimal kernel coefficients and the structural parameters of a classifier are jointly learned, two-stage MKL algorithms: (1) are able to achieve comparable or even better performance while taking much less computational cost; and (2) are more flexible, since the learned optimal kernel can be directly applied to different learning tasks, including both classification and regression.

Much recent work on two-stage MKL algorithms has been reported Cortes et al. (2012); Kumar et al. (2012). The seminal work in Cortes et al. (2012)
proposes two-stage techniques for learning kernels based on the criterion of kernel alignment maximization. Moreover, a number of novel theoretical, algorithmic and empirical results are presented for alignment-based techniques. Differently, Kumar et al. (2012) formulates the kernel learning problem as a standard linear classification problem in a new instance space; in this space, the weights of any linear classifier directly correspond to the combination coefficients of the base kernels. These works not only provide a new research direction in kernel learning, but also demonstrate state-of-the-art kernel learning performance.

More recently, Cortes et al. (2013a) presents a new MKL algorithm based on a natural measure of the multi-class margin of a kernel. It is proved that a large value of this quantity guarantees the existence of an accurate multi-class predictor. Though theoretically elegant, it has been experimentally verified that this approach typically leads to poor performance Cortes et al. (2013a). In this paper, we revisit this approach and argue that the poor performance results from the way the multi-class margin of a kernel is calculated. Specifically, only the k-nearest neighbours of each sample should be selected for calculating the margin of a kernel. This differs from Cortes et al. (2013a), in which all samples with the same label are involved. The benefit can be clearly seen from Figure 1, in which we plot the classification accuracy of the proposed two-stage MKL algorithm against the number of nearest neighbours k used in calculating the margin of a kernel. As can be seen, the classification accuracy decreases dramatically as k increases. We attribute the superiority of our algorithm to the top k-nearest neighbours, since they are more discriminative than all
samples with the same label, leading to better kernel combination coefficients and superior classification performance.

[Figure 1: The classification accuracy (on the psortPos data set) varies with the number of k-nearest neighbours, k = 5 to 25, used in calculating the margin of a kernel.]
On the other hand, the hyper-parameter k is usually difficult to tune, and the uniform weights assigned to all k nearest neighbours may not be optimal for different applications. To further improve this situation, we propose to represent each sample by a linear combination of samples from the same class and to optimize the combination coefficients by minimizing the reconstruction error. We conduct intensive experiments to compare the proposed algorithms with state-of-the-art MKL algorithms on ten UCI and six MKL benchmark data sets, and the results verify the superiority of the proposed algorithms.
2. Related Work

2.1. Binary Classification Framework for Two-Stage MKL

We are given n training samples \{(\phi(x_i), y_i)\}_{i=1}^{n}, where \phi(x_i) = [\phi_1(x_i), \cdots, \phi_m(x_i)]^\top, \{\phi_p(\cdot)\}_{p=1}^{m} are the feature mappings corresponding to the m pre-defined base kernels \{\kappa_p(\cdot,\cdot)\}_{p=1}^{m}, and y_i is the class label of x_i. The work in Kumar et al. (2012) first constructs samples \{(z_{ij}, t_{ij})\}_{1 \le i \le j \le n} \subset \mathbb{R}^m \times \{\pm 1\} in a new space, where
\[
z_{ij} = \big(K_1(x_i, x_j), \cdots, K_m(x_i, x_j)\big)^\top, \qquad t_{ij} = 2 \cdot \mathbf{1}\{y_i = y_j\} - 1. \tag{1}
\]
After that, it learns the kernel combination coefficients \mu by solving the following optimization problem,
\[
\min_{\mu \ge 0} \ \frac{\lambda}{2}\,\|\mu\|^2 + \frac{1}{n(n+1)/2} \sum_{1 \le i \le j \le n} \big[\,1 - t_{ij}\,\mu^\top z_{ij}\,\big]_{+}, \tag{2}
\]
where [1 - s]_+ = \max\{0, 1 - s\} is the hinge loss and \lambda is a regularization parameter. Problem (2) can be efficiently solved by stochastic projected sub-gradient descent algorithms Shalev-Shwartz et al. (2007).
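To make this construction concrete, the following minimal sketch (Python/NumPy; not the authors' code) builds the pairwise instances of Eq.(1) from precomputed base kernel matrices and approximately minimizes Eq.(2) with a Pegasos-style projected sub-gradient loop. The function name, step-size schedule and default hyper-parameters are our own assumptions.

```python
import numpy as np

def binary_mkl_weights(kernels, y, lam=1e-2, n_epochs=50, seed=0):
    """Sketch of the two-stage binary-classification MKL framework (Eqs. (1)-(2)).

    kernels : list of m precomputed (n, n) base kernel matrices
    y       : length-n sequence of class labels
    Returns the non-negative kernel combination coefficients mu (length m).
    """
    rng = np.random.default_rng(seed)
    m, n = len(kernels), len(y)
    # Eq. (1): one m-dimensional instance per pair (i, j), i <= j.
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    Z = np.array([[K[i, j] for K in kernels] for (i, j) in pairs])
    t = np.array([1.0 if y[i] == y[j] else -1.0 for (i, j) in pairs])

    mu = np.zeros(m)
    for epoch in range(n_epochs):
        for step, idx in enumerate(rng.permutation(len(pairs)), start=1):
            lr = 1.0 / (lam * (epoch * len(pairs) + step))  # assumed Pegasos-style step size
            grad = lam * mu
            if 1.0 - t[idx] * (mu @ Z[idx]) > 0.0:          # hinge loss is active
                grad -= t[idx] * Z[idx]
            mu = np.maximum(mu - lr * grad, 0.0)            # projection onto mu >= 0
    return mu
```

The learned coefficients μ are then used to form the combined kernel Σ_p μ_p K_p for the second-stage kernel classifier.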
2.2. Centered Kernel Alignment for Two-Stage MKL

The approach in Cortes et al. (2012) proposes a two-stage MKL algorithm via maximizing the centered kernel alignment. Specifically, the kernel combination coefficients are optimized by solving the following problem,
\[
\max_{\mu \in \mathcal{M}} \ \frac{\big\langle \sum_{p=1}^{m} \mu_p \bar{K}_p, \ Y Y^\top \big\rangle_F}{\big\| \sum_{p=1}^{m} \mu_p \bar{K}_p \big\|_F}, \tag{3}
\]
where \mathcal{M} = \{\mu \in \mathbb{R}^m : \mu \ge 0, \|\mu\|_2 = 1\}, \bar{K}_p is the p-th centered base kernel, and Y \in \{0, 1\}^{n \times c} is the label indicator matrix with c the number of classes. Moreover, it is proved in Cortes et al. (2012) that Eq.(3) can be equivalently optimized by solving a quadratic programming (QP) problem of size m, where m is the number of base kernels.
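For illustration, the following sketch (Python with CVXPY; all names are ours) computes alignment-based weights through a small QP over the m base kernels. It follows the standard QP reformulation of this criterion as we read it from Cortes et al. (2012), so it should be taken as a sketch rather than the reference implementation.

```python
import numpy as np
import cvxpy as cp

def center_kernel(K):
    """Center a kernel matrix in feature space: K -> H K H with H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment_weights(kernels, Y):
    """Sketch of alignment-based kernel weights (Eq. (3)).

    kernels : list of m (n, n) base kernel matrices
    Y       : (n, c) 0/1 label indicator matrix
    """
    Kc = [center_kernel(K) for K in kernels]
    T = Y @ Y.T                                   # target "ideal" kernel YY^T
    a = np.array([np.sum(K * T) for K in Kc])     # <K_p, YY^T>_F for each p
    v = cp.Variable(len(kernels), nonneg=True)
    combo = sum(v[p] * Kc[p] for p in range(len(Kc)))   # sum_p v_p * centered K_p
    # ||sum_p v_p K_p||_F^2 - 2 <sum_p v_p K_p, YY^T>_F : a QP of size m.
    cp.Problem(cp.Minimize(cp.sum_squares(combo) - 2 * a @ v)).solve()
    mu = np.maximum(v.value, 0.0)
    return mu / np.linalg.norm(mu)                # normalize so that ||mu||_2 = 1
```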
2.3. Multi-Class Kernel Margin Maximization for Two-Stage MKL

More recently, the work in Cortes et al. (2013a) presents a two-stage MKL algorithm by maximizing the multi-class kernel margin. For a given kernel K, its multi-class margin for a sample pair (x, y) is defined as
\[
\gamma_K(x, y) = \mathbb{E}_{(x', y') \sim D}\big[K(x, x') \,\big|\, y' = y\big] - \max_{\tilde{y} \neq y} \mathbb{E}_{(x', y') \sim D}\big[K(x, x') \,\big|\, y' = \tilde{y}\big], \tag{4}
\]
and the multi-class kernel margin of K is defined as \bar{\gamma}_K = \mathbb{E}_{(x, y) \sim D}[\gamma_K(x, y)]. Based on this definition, Cortes et al. (2013a) proposes to select the kernel combination coefficients \mu by maximizing the empirical multi-class margin of the kernel \sum_{p=1}^{m} \mu_p K_p. This is fulfilled by solving the following optimization problem,
\[
\max_{\mu \in \Delta_q, \, \gamma} \ \sum_{i=1}^{n} \gamma_i \quad \text{s.t.} \quad \forall i, \ \forall y \neq y_i: \ \mu^\top \eta(x_i, y_i, y) \ge \gamma_i, \tag{5}
\]
where \eta(x_i, y_i, y) = [\eta_1(x_i, y_i, y), \cdots, \eta_m(x_i, y_i, y)]^\top with
\[
\eta_p(x_i, y_i, y) = \sum_{x' \in C(y_i)} \frac{K_p(x_i, x')}{|C(y_i)|} - \sum_{x' \in C(y)} \frac{K_p(x_i, x')}{|C(y)|},
\]
C(y) = \{x_i : y_i = y, \ i = 1, \cdots, n\}, and \Delta_q = \{\mu \in \mathbb{R}^m : \mu \ge 0, \|\mu\|_q = 1\}.

On the theoretical side, it is shown that large values of the kernel margin \bar{\gamma}_K guarantee the existence of an accurate multi-class predictor Cortes et al. (2013a).
Though theoretically elegant, it has been experimentally validated that directly optimizing Eq.(5) typically leads to poor performance. In the following, we revisit this approach by redefining the multi-class kernel margin, and propose an efficient and effective two-stage MKL algorithm.

3. Proposed Algorithm

3.1. Multi-Class Kernel Margin with k-Nearest Neighbours (MCKM-kNN)

Let N_k(x, y) denote the k-nearest-neighbour set of x in class y \in \mathcal{Y} and let d(\cdot, \cdot) be a distance function. The margin of (x_i, y_i) for the combined kernel \sum_{p=1}^{m} \mu_p K_p is defined as
\[
\gamma_\mu(x_i, y_i, y) = d\big(x_i, N_k(x_i, y)\big) - d\big(x_i, N_k(x_i, y_i)\big), \tag{6}
\]
where d(x, N_k(x, y)) is calculated in the multi-kernel-induced space (taking d to be the squared distance induced by the combined kernel, with unit-diagonal base kernels) as
\[
d\big(x, N_k(x, y)\big) = \sum_{x_j \in N_k(x, y)} \frac{d(x, x_j)}{k} = 2\Big(\sum_{p=1}^{m} \mu_p - \sum_{p=1}^{m} \mu_p \sum_{x_j \in N_k(x, y)} \frac{K_p(x, x_j)}{k}\Big). \tag{7}
\]
Substituting Eq.(7) into Eq.(6), we obtain (up to a constant factor that does not affect the optimization)
\[
\gamma_\mu(x_i, y_i, y) = \mu^\top \eta(x_i, y_i, y), \tag{8}
\]
where \eta(x_i, y_i, y) = [\eta_1(x_i, y_i, y), \cdots, \eta_m(x_i, y_i, y)]^\top with
\[
\eta_p(x_i, y_i, y) = \sum_{x_j \in N_k(x_i, y_i)} \frac{K_p(x_i, x_j)}{k} - \sum_{x_j \in N_k(x_i, y)} \frac{K_p(x_i, x_j)}{k}. \tag{9}
\]
Based on this definition, we obtain the objective of the proposed two-stage MKL algorithm by maximizing the multi-class kernel margin with k-nearest neighbours, as in Eq.(10),
\[
\max_{\mu \in \Delta_q, \, \gamma} \ \sum_{i=1}^{n} \gamma_i \quad \text{s.t.} \quad \forall i, \ \forall y \neq y_i: \ \mu^\top \eta(x_i, y_i, y) \ge \gamma_i. \tag{10}
\]
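The following sketch (Python with CVXPY; all function and variable names are ours) computes the k-nearest-neighbour statistics of Eq.(9) from precomputed unit-diagonal base kernels and solves Eq.(10) with q = 1, in which case the problem is a linear program over the simplex. Since the neighbour sets in Eq.(9) live in the kernel-induced space, which itself depends on μ, the sketch uses the average of the base kernels as a fixed similarity proxy for selecting neighbours; this choice is our assumption and is not specified in the text.

```python
import numpy as np
import cvxpy as cp

def mckm_knn_weights(kernels, y, k=5):
    """Sketch of MCKM-kNN (Eqs. (9)-(10)) with q = 1, i.e. mu on the simplex.

    kernels : list of m (n, n) base kernel matrices with unit diagonal
    y       : (n,) integer class labels
    """
    y = np.asarray(y)
    m, n = len(kernels), len(y)
    classes = np.unique(y)
    Ks = np.stack(kernels)                         # shape (m, n, n)

    def knn_mean(i, c):
        """Mean base-kernel similarity between x_i and its k nearest neighbours in
        class c; 'nearest' is decided by the average base-kernel similarity (proxy)."""
        idx = np.where(y == c)[0]
        idx = idx[idx != i]                        # exclude the sample itself
        sim = Ks[:, i, idx].mean(axis=0)           # proxy similarity to class-c samples
        nn = idx[np.argsort(-sim)[:k]]             # indices of the k nearest neighbours
        return Ks[:, i, nn].mean(axis=1)           # length-m vector of Eq. (9) terms

    # Stack the constraints mu^T eta(x_i, y_i, y) >= gamma_i for all i and y != y_i.
    etas, owner = [], []
    for i in range(n):
        own = knn_mean(i, y[i])
        for c in classes:
            if c != y[i]:
                etas.append(own - knn_mean(i, c))  # Eq. (9)
                owner.append(i)
    E, owner = np.array(etas), np.array(owner)

    mu, gamma = cp.Variable(m, nonneg=True), cp.Variable(n)
    constraints = [E @ mu >= gamma[owner], cp.sum(mu) == 1]   # Delta_1 constraint
    cp.Problem(cp.Maximize(cp.sum(gamma)), constraints).solve()
    return mu.value
```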
3.2. Multi-Class Kernel Margin with Sparse Representation (MCKM-SR)

For the proposed multi-class kernel margin with k-nearest neighbours, it is usually difficult to determine the number of nearest neighbours, i.e., k. Moreover, this approach assigns each neighbour the uniform weight 1/k, which may not be optimal for different applications. To address these issues, we propose another variant which is able to determine the nearest neighbours and their corresponding weights for each sample automatically. Specifically, we assume that each sample can be represented by a linear combination of samples from the same class, and we optimize the combination coefficients by minimizing the reconstruction error. For each p (1 ≤ p ≤ m) and y (1 ≤ y ≤ c), the nearest neighbours of a sample x_i and their weights s_p^y can be determined by solving the following optimization problem,
\[
\min_{s_p^y} \ \Big\| \phi_p(x_i) - \sum_{x_j \in C(y)} s_{pj}^{y} \, \phi_p(x_j) \Big\|_2^2 \quad \text{s.t.} \quad \sum_{x_j \in C(y)} s_{pj}^{y} = 1, \ \ s_{pj}^{y} \ge 0, \tag{11}
\]
which is a quadratic programming problem with linear constraints and can be efficiently solved by existing optimization packages. After obtaining s_p^y by Eq.(11), \eta_p in Eq.(9) is calculated as
\[
\eta_p(x_i, y_i, y) = \sum_{x_j \in C(y_i)} s_{pj}^{y_i} K_p(x_i, x_j) - \sum_{x_j \in C(y)} s_{pj}^{y} K_p(x_i, x_j). \tag{12}
\]
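As an illustration of Eqs.(11)-(12) (a sketch with our own function names, not the authors' CVX code), note that the reconstruction error needs only kernel values, since ‖φ_p(x_i) − Σ_j s_j φ_p(x_j)‖² = K_p(x_i, x_i) − 2 s^⊤ k_i + s^⊤ G s, where k_i collects the kernel values between x_i and the class samples and G is their Gram matrix. The snippet below solves this QP with CVXPY and assembles η_p as in Eq.(12); whether x_i itself is excluded from its own class is our assumption.

```python
import numpy as np
import cvxpy as cp

def reconstruction_weights(Kp, i, idx):
    """Solve Eq. (11) for one base kernel Kp, one sample i and one class,
    using only kernel values:
      ||phi_p(x_i) - sum_j s_j phi_p(x_j)||^2
        = Kp[i, i] - 2 s^T Kp[idx, i] + s^T Kp[idx, idx] s.
    idx : indices of the candidate samples in class y."""
    G = Kp[np.ix_(idx, idx)]
    w, V = np.linalg.eigh(G)                      # G = V diag(w) V^T
    C = (V * np.sqrt(np.clip(w, 0.0, None))).T    # C^T C = G (negative eigenvalues clipped)
    s = cp.Variable(len(idx), nonneg=True)
    obj = cp.sum_squares(C @ s) - 2 * Kp[idx, i] @ s
    cp.Problem(cp.Minimize(obj), [cp.sum(s) == 1]).solve()
    return s.value                                # non-negative, typically sparse weights

def eta_sr(kernels, y, i, other_class):
    """Assemble eta_p(x_i, y_i, y) of Eq. (12) for every base kernel p."""
    y = np.asarray(y)
    own_idx = np.where(y == y[i])[0]
    own_idx = own_idx[own_idx != i]               # assumption: exclude x_i itself
    oth_idx = np.where(y == other_class)[0]
    eta = np.empty(len(kernels))
    for p, Kp in enumerate(kernels):
        s_own = reconstruction_weights(Kp, i, own_idx)
        s_oth = reconstruction_weights(Kp, i, oth_idx)
        eta[p] = s_own @ Kp[own_idx, i] - s_oth @ Kp[oth_idx, i]
    return eta
```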
After that, the optimal kernel coefficients can be obtained by solving the problem in Eq.(10). It is worth noting that the optimization problem in Eq.(10) is convex, and we implement it via the widely used CVX package CVX Research (2012). More efficient optimization techniques could be used to further improve its computational efficiency.

4. Experiments

4.1. Experimental Settings

In this section, we conduct intensive experiments to compare the two proposed multi-class kernel margin maximization variants, i.e., MCKM-kNN and MCKM-SR, with several state-of-the-art MKL algorithms, including uniformly weighted MKL (UNIF), MCMKL Zien and Ong (2007), M3K Cortes et al. (2013a), ALIGNF Cortes et al. (2012) and BinaryMKL Kumar et al. (2012).

We first evaluate the classification performance of the aforementioned algorithms on ten binary UCI data sets^2, including germannum, heart, ionosphere, musk1, pima, sonar, spambase, liver, wdbc and wpbc. For these data sets, we follow the approach in Cortes et al. (2012) and generate 20 Gaussian kernels as base kernels, whose width parameters are linearly and equally sampled between 2^{-7}σ_0 and 2^{7}σ_0, with σ_0 the mean value of all pairwise distances. For each data set, 60% : 20% : 20% of the samples are randomly selected as the training set, validation set and test set, respectively.

After that, we report the classification results of these algorithms on six MKL benchmark data sets, including the psortPos, psortNeg and plant data sets^3, the protein fold prediction data set^4, the flower17 data set^5 and Caltech101^6. All of them are multi-class classification tasks. The base kernel matrices of these data sets are pre-computed and publicly available from the websites below. For the protein, psortPos, psortNeg, plant and flower17 data sets, 60% : 20% : 20% of the samples are randomly selected as the training set, validation set and test set, respectively. For Caltech101, we report the classification accuracy curves as the number of training samples varies over 5, 10, 15, 20 and 25.

Classification accuracy is used to evaluate the above algorithms. We repeat each experiment 30 times on the UCI data sets and ten times on the MKL benchmark data sets to eliminate the randomness in generating the partition matrices, and report the averaged results and standard deviations. Furthermore, to conduct a rigorous comparison, the paired Student's t-test is performed. The p-value of the pairwise t-test represents the probability that two sets of compared results come from distributions with an equal mean; a p-value smaller than 0.05 is considered statistically significant. In our experiments, each base kernel is centered and then scaled so that its diagonal elements are all ones. The regularization parameter C of each algorithm is chosen by grid search from {2^{-1}, 2^{0}, ..., 2^{7}} according to the performance on the validation sets. All experiments are conducted on a high-performance cluster server, where each node has a 2.3 GHz CPU and 12 GB of memory.
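As a concrete illustration of the base-kernel construction described above (a sketch under our own assumptions: the exact width grid and the Gaussian parameterisation K(x, x') = exp(−‖x − x'‖²/(2σ²)) are not specified beyond the description given), the snippet below generates the Gaussian base kernels and applies the centering and unit-diagonal scaling mentioned in this section.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def make_base_kernels(X, num_kernels=20):
    """Sketch of the UCI base-kernel construction: Gaussian kernels whose widths
    are linearly spaced between 2^-7*sigma0 and 2^7*sigma0, each kernel centered
    and rescaled to have a unit diagonal."""
    D = squareform(pdist(X))                      # pairwise Euclidean distances
    sigma0 = D.mean()                             # mean of all pairwise distances
    widths = np.linspace(2.0**-7 * sigma0, 2.0**7 * sigma0, num_kernels)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    kernels = []
    for sigma in widths:
        K = np.exp(-D**2 / (2.0 * sigma**2))      # assumed Gaussian parameterisation
        K = H @ K @ H                             # center the kernel
        d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        kernels.append(K / np.outer(d, d))        # unit-diagonal rescaling
    return kernels
```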
^2 https://archive.ics.uci.edu/ml/datasets.html
^3 http://raetschlab.org//suppl/protsubloc/
^4 http://mkl.ucsd.edu/dataset/protein-fold-prediction/
^5 http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html/
^6 http://files.is.tue.mpg.de/pgehler/projects/iccv09/
4.2. Results on UCI datasets

The classification accuracy of all aforementioned algorithms on the ten UCI data sets is reported in Table 1. As can be seen, the proposed MCKM-kNN and MCKM-SR usually give overall comparable classification accuracy when compared with the other algorithms. Specifically, the proposed algorithms achieve better performance on the sonar, pima and spambase data sets, while being comparable with the others on the rest of the data sets. These results on the UCI data sets preliminarily validate the effectiveness of our proposed algorithms.

4.3. Results on MKL Benchmark datasets

4.3.1. Results on Protein Subcellular Localization

Besides the ten UCI data sets, we compare the performance of the above-mentioned MKL algorithms on the psortPos, psortNeg and plant data sets, which come from protein subcellular localization and have been widely used in the MKL community Zien and Ong (2007); Cortes et al. (2013a,b). The base kernel matrices for these data sets have been pre-computed and can be publicly downloaded from the websites. Specifically, there are 69 base kernel matrices, including two kernels on phylogenetic trees, three kernels from BLAST E-values, and 64 sequence motif kernels. The numbers of classes of the psortPos, psortNeg and plant data sets are four, five and four, respectively.

The accuracy achieved by the above algorithms on these three data sets is reported in Table 2. As can be observed, the proposed MCKM-kNN usually achieves better performance than the compared baselines. For example, it outperforms the second best one (ALIGNF) by 1.45% on the plant data set.
Table 1: Classification accuracy (%, mean ± standard deviation) comparison among different MKL algorithms on UCI datasets.

Method                          | germ       | heart      | iono       | musk       | pima       | sonar      | spam       | liver      | wdbc       | wpbc
--------------------------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|-----------
UNIF                            | 76.53±1.90 | 81.67±4.14 | 97.14±1.78 | 93.40±3.00 | 80.26±1.71 | 92.44±3.72 | 89.60±1.05 | 70.00±5.42 | 98.57±1.41 | 76.32±2.15
MCMKL (Zien and Ong, 2007)      | 72.56±3.50 | 81.85±4.43 | 97.00±1.57 | 95.96±2.96 | 78.69±2.49 | 93.17±3.60 | 87.75±4.23 | 68.70±5.12 | 98.75±1.21 | 73.68±2.77
M3K (Cortes et al., 2013a)      | 72.56±3.50 | 81.85±4.43 | 97.00±1.57 | 95.96±2.96 | 81.18±2.38 | 93.17±3.60 | 87.65±4.22 | 68.70±5.12 | 98.75±1.21 | 73.68±2.77
ALIGNF (Cortes et al., 2012)    | 75.98±2.10 | 81.30±5.20 | 96.43±1.54 | 95.32±2.62 | 80.26±2.15 | 91.46±4.02 | 89.55±1.23 | 66.96±5.41 | 98.93±1.25 | 73.68±2.77
BinaryMKL (Kumar et al., 2012)  | 75.58±1.81 | 81.30±4.14 | 97.29±1.71 | 93.62±2.56 | 80.85±2.24 | 90.98±4.15 | 89.85±1.16 | 68.84±5.43 | 98.75±1.47 | 75.79±3.23
MCKM-kNN                        | 76.23±1.82 | 80.19±5.02 | 97.00±1.57 | 95.21±3.18 | 78.89±3.26 | 94.15±2.86 | 89.95±1.14 | 73.33±6.08 | 98.57±1.41 | 76.32±3.28
MCKM-SR                         | 75.73±3.16 | 80.37±3.72 | 97.29±1.96 | 94.15±3.59 | 79.41±2.84 | 92.95±3.99 | 90.15±0.97 | 71.59±5.60 | 98.75±1.47 | 75.79±3.68
Furthermore, the proposed MCKM-SR further improves over MCKM-kNN and demonstrates the best performance on all three data sets.

4.3.2. Results on Protein Fold Prediction

We also compare the aforementioned algorithms on the protein fold prediction data set, which is a multi-source and multi-class data set based on a subset of the PDB-40D SCOP collection. It contains 12 different feature spaces, including composition, secondary, hydrophobicity, volume, polarity, polarizability, L1, L4, L14, L30, SWblosum62 and SWpam50. This data set has been widely adopted in the MKL community Damoulas and Girolami (2008); Yan et al. (2012). For the protein fold prediction data set, the input features are available and the kernel matrices are generated as in Damoulas and Girolami (2008), where second-order polynomial kernels are employed for feature sets one to ten and the linear kernel for the remaining two feature sets. This data set is a 27-class classification task, and the one-against-rest strategy is used to solve the multi-class classification problem. As can be seen from Table 2, the proposed MCKM-SR achieves the highest accuracy. It significantly outperforms the second best one (ALIGNF), which is a very competitive baseline, by 1.71%.

4.3.3. Results on Oxford Flower17

We then compare the above MKL algorithms on Oxford Flower17, which has been widely used as an MKL benchmark data set Nilsback and Zisserman (2006). There are seven heterogeneous data channels available for this data set. For each data channel, we apply a Gaussian kernel with three different width parameters, i.e., 2^{-2}σ_0, 2^{0}σ_0 and 2^{2}σ_0, to generate three kernel matrices, where σ_0 denotes the averaged pairwise distance. In this way, we obtain 21 (7 × 3) base kernels and use them for all the MKL algorithms compared in our experiment.
Table 2: Classification accuracy (%, mean ± standard deviation) comparison among different MKL algorithms on the MKL benchmark datasets. The algorithm achieving the highest accuracy and those with no statistical difference from it are marked in bold.

Method                          | psortPos   | plant      | psortNeg   | protein    | flower17
--------------------------------|------------|------------|------------|------------|-----------
UNIF                            | 87.36±3.15 | 89.41±2.46 | 88.71±1.47 | 67.67±3.95 | 87.54±2.09
MCMKL (Zien and Ong, 2007)      | 86.89±2.37 | 89.89±0.94 | 88.82±1.37 | 69.61±3.48 | 85.92±3.09
M3K (Cortes et al., 2013a)      | 87.92±3.17 | 90.91±1.15 | 90.59±1.15 | 69.61±3.48 | 85.62±3.74
ALIGNF (Cortes et al., 2012)    | 88.40±2.36 | 91.18±1.39 | 90.52±3.79 | 71.24±3.79 | 86.95±1.84
BinaryMKL (Kumar et al., 2012)  | 88.77±2.65 | 90.38±3.11 | 88.71±1.44 | 67.60±3.07 | 86.91±1.83
MCKM-kNN                        | 88.58±1.91 | 92.63±0.92 | 91.01±1.12 | 70.93±2.91 | 87.87±2.54
MCKM-SR                         | 90.85±2.18 | 92.80±0.89 | 91.74±1.39 | 72.95±4.14 | 88.20±1.90
Table 2 reports the accuracy of the above algorithms on the Flower17 data set. From this table, we observe that the proposed MCKM-kNN and MCKM-SR obtain superior performance over the others. They even significantly outperform the UNIF algorithm, which is a very competitive baseline on this data set.
4.3.4. Results on Caltech101

Finally, we conduct another experiment on the Caltech101 data set to evaluate the performance of the proposed algorithms. This data set is a group of kernels derived from various visual features computed on the Caltech-101 object recognition task with 102 categories. It has 48 base kernels, which are publicly available online. We do not report the results of MCMKL Zien and Ong (2007) and M3K Cortes et al. (2013a) due to memory limitations. In addition, we incorporate the results of SimpleMKL Rakotomamonjy et al. (2008) and ℓp-MKL (p = 2) Xu et al. (2010), since they are competitive baselines on this data set.

The classification accuracy curves of the above algorithms as the number of training samples varies are plotted in Figure 2. As can be seen, the proposed MCKM-SR is significantly better than the others when the number of training samples is relatively small (fewer than 20). Specifically, MCKM-SR outperforms the second best one by over 5%, 1% and 3% when the number of training samples is 5, 10 and 15, respectively. In addition, all MKL algorithms attain similar classification performance as the number of training samples increases.

From the above experiments on ten UCI data sets and six MKL benchmark data sets, we conclude that: (1) the proposed MCKM-kNN and MCKM-SR achieve superior performance over the compared algorithms; (2) the performance of MCKM-kNN can be further improved by MCKM-SR; and (3) the work in this paper provides an effective way to utilize the multi-class kernel margin in MKL.
[Figure 2: The classification accuracy of different MKL algorithms (UNIF, ALIGNF, SimpleMKL, L2-MKL, BinaryMKL, Proposed) on Caltech101 varies with the number of training samples (from 5 to 25).]
5. Conclusion

In this paper, we propose two simple and easy-to-implement two-stage MKL algorithms that effectively utilize the multi-class kernel margin in MKL. In our approach, only the k-nearest neighbours of each sample are selected to calculate the multi-class margin of a kernel. We then propose another variant that automatically determines the nearest neighbours and their weights by minimizing the reconstruction error. After that, we efficiently implement the algorithms via the widely used CVX optimization package. Comprehensive experiments have demonstrated the effectiveness of our proposed algorithms.
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 61375041), the Shenzhen Science Plan (Nos. JCYJ20120903092425971, JCYJ20130402113127502, JCY20140509174140685, JSGG20150925164740726), and the GuangDong IUR Innovations Platform (No. 2012B090600032).

References

G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, M. I. Jordan, Learning the Kernel Matrix with Semidefinite Programming, JMLR 5 (2004) 27–72.
A. Rakotomamonjy, F. R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, JMLR 9 (2008) 2491–2521.
M. Gönen, E. Alpaydın, Multiple Kernel Learning Algorithms, JMLR 12 (Jul) (2011) 2211–2268.
F. Yan, J. Kittler, K. Mikolajczyk, M. A. Tahir, Non-Sparse Multiple Kernel Fisher Discriminant Analysis, JMLR 13 (2012) 607–642.
C. Cortes, M. Mohri, A. Rostamizadeh, Algorithms for Learning Kernels Based on Centered Alignment, JMLR 13 (2012) 795–828.
S. Bucak, R. Jin, A. Jain, Multiple Kernel Learning for Visual Object Recognition: A Review, IEEE Trans. PAMI 36 (7) (2014) 1354–1369.
C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel PCA, Neurocomputing 73 (4-6) (2010) 959–967.
X. Liu, L. Wang, G. Huang, J. Zhang, J. Yin, Multiple kernel extreme learning machine, Neurocomputing 149 (2015) 253–264.
X. Liu, L. Wang, J. Yin, L. Liu, Incorporation of radius-info can be simple with SimpleMKL, Neurocomputing 89 (2012) 30–38.
H. Qinghui, W. Shiwei, L. Zhiyuan, L. Xiaogang, Quasi-Newton method for Lp multiple kernel learning, Neurocomputing 194 (2016) 218–226.
F. Wu, W. Wang, Y. Yang, Y. Zhuang, F. Nie, Classification by semi-supervised discriminative regularization, Neurocomputing 73 (10-12) (2010) 1641–1651.
A. Kumar, A. Niculescu-Mizil, K. Kavukcuoglu, H. Daumé III, A Binary Classification Framework for Two-Stage Multiple Kernel Learning, in: ICML, 2012.
A. Zien, C. S. Ong, Multiclass multiple kernel learning, in: ICML, 1191–1198, 2007.
J. Ye, S. Ji, J. Chen, Multi-class Discriminant Kernel Learning via Convex Programming, JMLR 9 (2008) 719–758.
Z. Xu, R. Jin, H. Yang, I. King, M. R. Lyu, Simple and Efficient Multiple Kernel Learning by Group Lasso, in: ICML, 1175–1182, 2010.
M. Gönen, Bayesian Efficient Multiple Kernel Learning, in: ICML, 2012.
C. Cortes, M. Mohri, A. Rostamizadeh, Multi-Class Classification with Maximum Margin Multiple Kernel, in: ICML, 46–54, 2013a.
X. Liu, L. Wang, J. Yin, E. Zhu, J. Zhang, An Efficient Approach to Integrating Radius Information into Multiple Kernel Learning, IEEE TCyb 43 (2) (2013) 557–569.
X. Liu, L. Wang, J. Zhang, J. Yin, Sample-Adaptive Multiple Kernel Learning, in: AAAI, 1975–1981, 2014.
S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, in: ICML, 2007.
CVX Research, Inc., CVX: Matlab Software for Disciplined Convex Programming, version 2.0 beta, http://cvxr.com/cvx, 2012.
C. Cortes, M. Kloft, M. Mohri, Learning Kernels Using Local Rademacher Complexity, in: NIPS, 2760–2768, 2013b.
T. Damoulas, M. A. Girolami, Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection, Bioinformatics 24 (10) (2008) 1264–1270.
M.-E. Nilsback, A. Zisserman, A Visual Vocabulary for Flower Classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1447–1454, 2006.