
PII: S0925-2312(16)30525-2
DOI: http://dx.doi.org/10.1016/j.neucom.2016.05.073
Reference: NEUCOM17138

To appear in: Neurocomputing
Received date: 29 April 2015; Revised date: 30 May 2016; Accepted date: 31 May 2016

Cite this article as: Yan-Guo Zhao, Miaomiao Li, Ronald Chung and Zhan Song, Multi-Class Kernel Margin Maximization for Kernel Learning, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2016.05.073

Multi-Class Kernel Margin Maximization for Kernel Learning

Yan-Guo Zhao (a,c), Miaomiao Li (b), Ronald Chung (d), Zhan Song (1,a,d)

(a) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
(b) School of Computer, National University of Defense Technology, Changsha 410073, China
(c) Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China
(d) Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, China

Abstract

Two-stage multiple kernel learning (MKL) algorithms have been intensively studied owing to their high efficiency and effectiveness. The pioneering work in this regard attempts to optimize the combination coefficients by maximizing the multi-class margin of a kernel, but obtains unsatisfactory performance. In this paper, we attribute this poor performance to the way in which the multi-class margin of a kernel is calculated. Specifically, we argue that for each sample only the k-nearest neighbours, rather than all samples with the same label, should be selected for calculating the margin. We further develop a sparse variant which is able to automatically identify the nearest neighbours of each sample and their corresponding weights. Extensive experimental results on ten UCI data sets and six MKL benchmark data sets demonstrate the effectiveness and efficiency of the proposed algorithms.

(1) Corresponding author ([email protected]). Dr. Zhan Song is with the Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences.


Keywords: Large Margin, Kernel Methods, Support Vector Machines

1. Introduction

Kernel learning algorithms have been intensively studied in the machine learning community during the last decade Lanckriet et al. (2004); Rakotomamonjy et al. (2008); Gönen and Alpaydın (2011); Yan et al. (2012); Cortes et al. (2012); Bucak et al. (2014); Zhang et al. (2010); Liu et al. (2015, 2012); Qinghui et al. (2016); Wu et al. (2010). Among them, multiple kernel learning (MKL) optimally combines multiple pre-specified base kernels to improve the performance of learning tasks. As one active research direction in this regard, two-stage MKL algorithms Kumar et al. (2012); Cortes et al. (2012) provide an efficient and effective approach to learning an optimal kernel from the data. Two-stage MKL algorithms first learn the optimal kernel combination coefficients according to some criterion, and then apply the learned optimal kernel to a kernel-based algorithm to train a kernel model. Compared with one-stage MKL algorithms Zien and Ong (2007); Ye et al. (2008); Xu et al. (2010); Gönen (2012); Cortes et al. (2013a); Liu et al. (2013, 2014), in which the optimal kernel coefficients and the structural parameters of a classifier are jointly learned, two-stage MKL algorithms: (1) are able to achieve comparable or even better performance while taking much less computational cost; and (2) are more flexible, since the learned optimal kernel can be directly applied to different learning tasks, including both classification and regression.

Much recent work has been devoted to two-stage MKL algorithms Cortes et al. (2012); Kumar et al. (2012). The seminal work in Cortes et al. (2012) proposes two-stage techniques for learning kernels based on the criterion of kernel alignment maximization. Moreover, a number of novel theoretical, algorithmic and empirical results are presented for alignment-based techniques. Differently, Kumar et al. (2012) formulates the kernel learning problem as a standard linear classification problem in a new instance space, in which the weights of any linear classifier directly correspond to the combination coefficients of the base kernels. These works not only provide a new research direction in kernel learning, but also demonstrate state-of-the-art kernel learning performance.

More recently, Cortes et al. (2013a) presents a new MKL algorithm based on a natural measure of the multi-class margin of a kernel. It is proved that a large value of this quantity guarantees the existence of an accurate multi-class predictor. Though theoretically elegant, this approach has been experimentally shown to typically lead to poor performance Cortes et al. (2013a). In this paper, we revisit this approach and argue that the poor performance results from the way the multi-class margin of a kernel is calculated. Specifically, only the k-nearest neighbours of each sample should be selected for calculating the margin of a kernel, which differs from Cortes et al. (2013a), where all samples with the same label are involved. The benefit can be clearly seen from Figure 1, in which we plot the classification accuracy of the proposed two-stage MKL algorithm against the number of nearest neighbours k used in calculating the margin of a kernel. As can be seen, the classification accuracy decreases dramatically as k increases. We attribute the superiority of our algorithm to the top k-nearest neighbours, since they are more discriminative than all samples with the same label, leading to better kernel combination coefficients and superior classification performance.

[Figure 1: The classification accuracy varies with the number of k-nearest neighbours used in calculating the margin of a kernel (psortPos data set; vertical axis: accuracy, 0.87 to 0.90; horizontal axis: k, 5 to 25).]

On the other hand, the hyper-parameter k is usually difficult to tune, and assigning uniform weights to all k-nearest neighbours may not be optimal for different applications. To further improve this situation, we propose to represent each sample by a linear combination of samples from the same class and to optimize the combination coefficients by minimizing the reconstruction error. We conduct intensive experiments to compare the proposed algorithms with state-of-the-art MKL algorithms on ten UCI and six MKL benchmark data sets, and the results verify the superiority of the proposed algorithms.


2. Related Work

2.1. Binary Classification Framework for Two-Stage MKL

We are given $n$ training samples $\{(\phi(x_i), y_i)\}_{i=1}^{n}$, where $\phi(x_i) = [\phi_1^\top(x_i), \cdots, \phi_m^\top(x_i)]^\top$, $\{\phi_p(\cdot)\}_{p=1}^{m}$ are the feature mappings corresponding to the $m$ pre-defined base kernels $\{\kappa_p(\cdot,\cdot)\}_{p=1}^{m}$, and $y_i$ is the class label of $x_i$. The work in Kumar et al. (2012) first constructs samples $\{(z_{ij}, t_{ij})\}_{1 \le i \le j \le n} \subset \mathbb{R}^m \times \{\pm 1\}$ in a new space, where
$$z_{ij} = \big(K_1(x_i, x_j), \cdots, K_m(x_i, x_j)\big)^\top, \qquad t_{ij} = 2 \cdot \mathbf{1}\{y_i = y_j\} - 1. \tag{1}$$
After that, it learns the kernel combination coefficients $\mu$ by solving the following optimization problem,
$$\min_{\mu \ge 0}\ \frac{\lambda}{2}\|\mu\|^2 + \frac{1}{n(n+1)/2} \sum_{1 \le i \le j \le n} \big[\,1 - t_{ij}\,\mu^\top z_{ij}\,\big]_+ , \tag{2}$$
where $[1-s]_+ = \max\{0, 1-s\}$ is the hinge loss and $\lambda$ is a regularization parameter. The optimization problem (2) can be efficiently solved by stochastic projected sub-gradient descent algorithms Shalev-Shwartz et al. (2007).
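As a concrete illustration of the construction in Eqs. (1)–(2), a minimal NumPy sketch is given below. The function names and the plain full-batch projected sub-gradient loop are ours (the original work uses a stochastic Pegasos-style solver); this is a sketch of the idea rather than the reference implementation.

```python
import numpy as np

def build_pairs(kernels, y):
    """Construct the (z_ij, t_ij) pairs of Eq. (1) from m base kernel matrices.

    kernels: array of shape (m, n, n); y: class labels of shape (n,).
    """
    m, n, _ = kernels.shape
    iu = np.triu_indices(n)                       # all pairs with i <= j
    Z = kernels[:, iu[0], iu[1]].T                # shape (n(n+1)/2, m)
    T = np.where(y[iu[0]] == y[iu[1]], 1.0, -1.0)
    return Z, T

def learn_mu(Z, T, lam=1.0, lr=0.1, iters=500):
    """Projected sub-gradient descent on Eq. (2) under the constraint mu >= 0."""
    N, m = Z.shape
    mu = np.zeros(m)
    for _ in range(iters):
        margins = T * (Z @ mu)
        active = margins < 1.0                    # pairs with non-zero hinge loss
        grad = lam * mu - (Z[active].T @ T[active]) / N
        mu = np.maximum(mu - lr * grad, 0.0)      # project back onto mu >= 0
    return mu
```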

2.2. Centered Kernel Alignment for Two-Stage MKL

The approach in Cortes et al. (2012) proposes a two-stage MKL algorithm via maximizing the centered kernel alignment. Specifically, the kernel combination coefficients are optimized by solving the following problem,
$$\max_{\mu \in \mathcal{M}}\ \frac{\big\langle \sum_{p=1}^{m}\mu_p K_p,\; YY^\top \big\rangle_F}{\big\| \sum_{p=1}^{m}\mu_p K_p \big\|_F}, \tag{3}$$
where $\mathcal{M} = \{\mu \in \mathbb{R}^m : \mu \ge 0,\ \|\mu\|_2 = 1\}$, $K_p$ is the $p$-th centered base kernel, and $Y \in \{0,1\}^{n \times c}$ is the label indicator matrix with $c$ the number of classes. Moreover, it is proved in Cortes et al. (2012) that Eq. (3) can be equivalently optimized by solving a quadratic programming (QP) problem of size $m$, where $m$ is the number of base kernels.
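A small sketch of this two-stage step is given below. It follows the $m$-dimensional QP reduction mentioned above; since that QP objective equals $\|\sum_p v_p\,\mathrm{vec}(K_p) - \mathrm{vec}(YY^\top)\|^2$ up to an additive constant, the sketch solves it as a non-negative least-squares problem. The function names and this particular rewriting are ours, not the authors' code.

```python
import numpy as np
from scipy.optimize import nnls

def center_kernel(K):
    """Center a kernel matrix: K_c = (I - 11^T/n) K (I - 11^T/n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignf(kernels, Y):
    """Kernel weights by centered-alignment maximization, Eq. (3).

    kernels: (m, n, n) base kernels; Y: (n, c) 0/1 label indicator matrix.
    Solves min_{v >= 0} ||sum_p v_p vec(K_p^c) - vec(YY^T)||^2 and normalises,
    which matches the QP reduction of the alignment problem.
    """
    m = kernels.shape[0]
    Kc = np.stack([center_kernel(K) for K in kernels])
    V = Kc.reshape(m, -1).T                 # columns are vectorised centered kernels
    target = (Y @ Y.T).reshape(-1)          # vec(YY^T)
    v, _ = nnls(V, target)
    return v / np.linalg.norm(v)            # project onto ||mu||_2 = 1
```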

2.3. Multi-Class Kernel Margin Maximization for Two-Stage MKL

More recently, the work in Cortes et al. (2013a) presents a two-stage MKL algorithm by maximizing the multi-class kernel margin. For a given kernel $K$, its multi-class margin for a sample pair $(x, y)$ is defined as
$$\gamma_K(x, y) = \mathbb{E}_{(x',y') \sim D}\big[K(x, x') \mid y' = y\big] - \max_{\tilde{y} \ne y}\ \mathbb{E}_{(x',y') \sim D}\big[K(x, x') \mid y' = \tilde{y}\big], \tag{4}$$
and the multi-class kernel margin of $K$ is defined as $\bar{\gamma}_K = \mathbb{E}_{(x,y) \sim D}\big[\gamma_K(x, y)\big]$. Based on this definition, Cortes et al. (2013a) proposes to select the kernel combination coefficients $\mu$ by maximizing the empirical multi-class margin of the kernel $\sum_{p=1}^{m}\mu_p K_p$. This is fulfilled by solving the following optimization problem,
$$\max_{\mu \in \Delta_q,\ \gamma}\ \sum_{i=1}^{n} \gamma_i \quad \text{s.t.} \quad \forall i,\ \forall y \ne y_i:\ \mu^\top \eta(x_i, y_i, y) \ge \gamma_i, \tag{5}$$
where $\eta(x_i, y_i, y) = [\eta_1(x_i, y_i, y), \cdots, \eta_m(x_i, y_i, y)]^\top$,
$$\eta_p(x_i, y_i, y) = \frac{\sum_{x' \in C(y_i)} K_p(x_i, x')}{|C(y_i)|} - \frac{\sum_{x' \in C(y)} K_p(x_i, x')}{|C(y)|},$$
$C(y) = \{x_i : y_i = y,\ i = 1, \cdots, n\}$ and $\Delta_q = \{\mu \in \mathbb{R}^m : \mu \ge 0,\ \|\mu\|_q = 1\}$.

On the theoretical side, it is shown that large values of the kernel margin $\bar{\gamma}_K$ guarantee the existence of an accurate multi-class predictor Cortes et al. (2013a).

Though theoretically elegant, it has been experimentally validated that directly optimizing Eq. (5) typically leads to poor performance. In the following, we revisit this approach by redefining the multi-class kernel margin, and propose an efficient and effective two-stage MKL algorithm.

3. Proposed Algorithm

3.1. Multi-Class Kernel Margin with k-Nearest Neighbours (MCKM-kNN)

Let $N_k(x, y)$ denote the k-nearest neighbour set of $x$ in class $y \in \mathcal{Y}$ and $d(\cdot)$ be a distance function. The margin of $(x_i, y_i)$ for the combination kernel $\sum_{p=1}^{m}\mu_p K_p$ is defined as
$$\gamma_\mu(x_i, y_i, y) = d\big(x_i, N_k(x_i, y)\big) - d\big(x_i, N_k(x_i, y_i)\big), \tag{6}$$
where $d(x, N_k(x, y))$ is calculated in the multi-kernel induced space as
$$d\big(x, N_k(x, y)\big) = \sum_{x_j \in N_k(x,y)} \frac{d(x, x_j)}{k} = 2\left(\sum_{p=1}^{m}\mu_p - \sum_{p=1}^{m}\mu_p \sum_{x_j \in N_k(x,y)} \frac{K_p(x, x_j)}{k}\right). \tag{7}$$
Substituting Eq. (7) into Eq. (6), we obtain
$$\gamma_\mu(x_i, y_i, y) = \mu^\top \eta(x_i, y_i, y), \tag{8}$$
where $\eta(x_i, y_i, y) = [\eta_1(x_i, y_i, y), \cdots, \eta_m(x_i, y_i, y)]^\top$ with
$$\eta_p(x_i, y_i, y) = \sum_{x_j \in N_k(x_i, y_i)} \frac{K_p(x_i, x_j)}{k} - \sum_{x_j \in N_k(x_i, y)} \frac{K_p(x_i, x_j)}{k}. \tag{9}$$
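The construction in Eqs. (8)–(9) can be read off directly from the base kernel matrices; a NumPy sketch (ours, not the authors' implementation) is given below. It assumes unit-diagonal base kernels and picks the nearest neighbours under the uniformly weighted kernel combination, a choice the text leaves open; replacing the k-nearest-neighbour sets by the full classes recovers the class-mean $\eta$ of Eq. (5).

```python
import numpy as np

def eta_knn(kernels, y, k=5):
    """Compute eta(x_i, y_i, y) of Eq. (9) for every sample and every rival class.

    kernels: (m, n, n) base kernel matrices with unit diagonal; y: (n,) labels.
    Returns a dict mapping (i, rival_class) -> length-m eta vector.
    """
    m, n, _ = kernels.shape
    K_unif = kernels.mean(axis=0)                 # uniform kernel combination
    dist = 2.0 - 2.0 * K_unif                     # squared distance for unit-diagonal kernels
    classes = np.unique(y)
    eta = {}
    for i in range(n):
        def knn_mean(c):
            # mean kernel value from x_i to its k nearest neighbours in class c
            idx = np.where((y == c) & (np.arange(n) != i))[0]
            nearest = idx[np.argsort(dist[i, idx])[:k]]
            return kernels[:, i, nearest].mean(axis=1)
        own = knn_mean(y[i])
        for c in classes:
            if c != y[i]:
                eta[(i, c)] = own - knn_mean(c)
    return eta
```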

Based on this definition, we obtain the objective of the proposed two-stage MKL algorithm by maximizing the multi-class kernel margin with k-nearest neighbours, as in Eq. (10),
$$\max_{\mu \in \Delta_q,\ \gamma}\ \sum_{i=1}^{n} \gamma_i \quad \text{s.t.} \quad \forall i,\ \forall y \ne y_i:\ \mu^\top \eta(x_i, y_i, y) \ge \gamma_i. \tag{10}$$
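The paper solves Eq. (10) with the CVX package in MATLAB; a CVXPY analogue (our sketch, with $q = 1$ so that $\Delta_q$ becomes the simplex and the problem is a linear program) looks as follows.

```python
import cvxpy as cp

def solve_mckm(eta, n, m):
    """Solve the margin-maximization problem of Eq. (10) over the simplex Delta_1.

    eta: dict mapping (i, rival_class) -> length-m numpy eta vector,
         e.g. produced by eta_knn above; n: number of samples; m: number of kernels.
    """
    mu = cp.Variable(m, nonneg=True)
    gamma = cp.Variable(n)
    constraints = [cp.sum(mu) == 1]                    # Delta_q with q = 1
    for (i, _), eta_vec in eta.items():
        constraints.append(eta_vec @ mu >= gamma[i])   # margin constraints of Eq. (10)
    cp.Problem(cp.Maximize(cp.sum(gamma)), constraints).solve()
    return mu.value
```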

3.2. Multi-Class Kernel Margin with Sparse Representation (MCKM-SR)

For the proposed multi-class kernel margin with k-nearest neighbours, it is usually difficult to determine the number of nearest neighbours, i.e., k. Moreover, this approach assigns every neighbour the uniform weight $\frac{1}{k}$, which may not be optimal for different applications. To address these issues, we propose another variant which is able to determine the nearest neighbours and their corresponding weights for each sample automatically. Specifically, we assume that each sample can be represented by a linear combination of samples from the same class and optimize the combination coefficients by minimizing the reconstruction error. For each $p$ ($1 \le p \le m$) and $y$ ($1 \le y \le c$), the nearest neighbours and their weights for sample $x_i$ can be determined by solving the following optimization problem,
$$\min_{s_p^y}\ \Big\|\phi_p(x_i) - \sum_{x_j \in C(y)} s_{pj}^{y}\,\phi_p(x_j)\Big\|_2^2 \quad \text{s.t.} \quad \sum_{x_j \in C(y)} s_{pj}^{y} = 1,\quad s_{pj}^{y} \ge 0, \tag{11}$$
which is a quadratic programming problem with linear constraints and can be efficiently solved by existing optimization packages. After obtaining $s_p^y$ from Eq. (11), $\eta_p$ in Eq. (9) is calculated as
$$\eta_p(x_i, y_i, y) = \sum_{x_j \in C(y_i)} s_{pj}^{y_i}\,K_p(x_i, x_j) - \sum_{x_j \in C(y)} s_{pj}^{y}\,K_p(x_i, x_j). \tag{12}$$
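Eq. (11) only involves the feature maps through inner products, so it can be solved entirely in kernel form. The CVXPY sketch below is our illustrative reading of Eqs. (11)–(12), not the authors' CVX code; in particular, excluding $x_i$ from its own class's dictionary (to avoid the trivial self-reconstruction) is our assumption.

```python
import numpy as np
import cvxpy as cp

def sparse_weights(Kp, i, class_idx):
    """Reconstruction QP of Eq. (11) for sample i over one class, in kernel form:
    ||phi(x_i) - sum_j s_j phi(x_j)||^2 = Kp[i,i] - 2 s^T k + s^T G s,
    with k = Kp[class_idx, i] and G = Kp[class_idx][:, class_idx]."""
    G = Kp[np.ix_(class_idx, class_idx)]
    L = np.linalg.cholesky(G + 1e-8 * np.eye(len(class_idx)))   # G ~ L L^T
    s = cp.Variable(len(class_idx), nonneg=True)
    objective = cp.sum_squares(L.T @ s) - 2 * Kp[class_idx, i] @ s
    cp.Problem(cp.Minimize(objective), [cp.sum(s) == 1]).solve()
    return s.value

def eta_sr(kernels, y, i, rival):
    """eta(x_i, y_i, rival) of Eq. (12) built from the sparse reconstruction weights."""
    n = len(y)
    own_idx = np.where((y == y[i]) & (np.arange(n) != i))[0]    # exclude x_i itself
    riv_idx = np.where(y == rival)[0]
    eta = np.empty(len(kernels))
    for p, Kp in enumerate(kernels):
        s_own = sparse_weights(Kp, i, own_idx)
        s_riv = sparse_weights(Kp, i, riv_idx)
        eta[p] = Kp[i, own_idx] @ s_own - Kp[i, riv_idx] @ s_riv
    return eta
```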

After that, the optimal kernel combination coefficients can be obtained by solving the problem in Eq. (10). It is worth noting that the optimization problem in Eq. (10) is convex, and we implement it via the widely used CVX package CVX Research (2012). More efficient optimization techniques could be adopted to further improve its computational efficiency.

4. Experiments

4.1. Experimental Settings

In this section, we conduct intensive experiments to compare the proposed two multi-class kernel margin maximization variants, i.e., MCKM-kNN and MCKM-SR, with several state-of-the-art MKL algorithms, including uniformly-weighted MKL (UNIF), MCMKL Zien and Ong (2007), M3K Cortes et al. (2013a), ALIGNF Cortes et al. (2012) and BinaryMKL Kumar et al. (2012).

We first evaluate the classification performance of the aforementioned algorithms on ten binary UCI data sets (https://archive.ics.uci.edu/ml/datasets.html), including germannum, heart, ionosphere, musk1, pima, sonar, spambase, liver, wdbc and wpbc. For these data sets, we follow the approach in Cortes et al. (2012) to generate 20 Gaussian kernels as base kernels, whose width parameters are linearly equally sampled between $2^{-7}\sigma_0$ and $2^{7}\sigma_0$, with $\sigma_0$ the mean value of all pairwise distances. For each data set, 60% : 20% : 20% of the samples are randomly selected as the training, validation and test sets, respectively.

After that, we report the classification results of these algorithms on six MKL benchmark data sets, including the psortPos, psortNeg and plant data sets (http://raetschlab.org//suppl/protsubloc/), the protein fold prediction data set (http://mkl.ucsd.edu/dataset/protein-fold-prediction/), the flower17 data set (http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html/) and Caltech101 (http://files.is.tue.mpg.de/pgehler/projects/iccv09/). All of them are multi-class classification tasks. The base kernel matrices of these data sets are pre-computed and publicly available from the above websites. For the protein, psortPos, psortNeg, plant and flower17 data sets, 60% : 20% : 20% of the samples are randomly selected as the training, validation and test sets, respectively. For Caltech101, we report the classification accuracy curves as the number of training samples varies over 5, 10, 15, 20 and 25.

Classification accuracy is used to evaluate the goodness of the above algorithms. We repeat each experiment 30 times on the UCI data sets and ten times on the MKL benchmark data sets to eliminate the randomness in generating the partitions, and report the averaged results and standard deviations. Furthermore, to conduct a rigorous comparison, the paired Student's t-test is performed. The p-value of the pairwise t-test represents the probability that two sets of compared results come from distributions with an equal mean. A p-value below 0.05 is considered statistically significant. In our experiments, each base kernel is centered and then scaled so that its diagonal elements are all ones. The regularization parameter C for each algorithm is chosen from $\{2^{-1}, 2^{0}, \cdots, 2^{7}\}$ by grid search according to the performance on the validation sets. All experiments are conducted on a high-performance cluster server, where each node has a 2.3GHz CPU and 12GB memory.
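The UCI base-kernel protocol above can be sketched in a few lines; the snippet below is our illustration (the exact Gaussian parameterization, i.e. whether the width enters as $\exp(-\|x-x'\|^2/(2w^2))$, is our assumption, as are the function names), covering the width grid, centering, and scaling to unit diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def make_base_kernels(X, num_kernels=20):
    """Gaussian base kernels as described in Section 4.1: widths linearly spaced
    between 2^-7 * sigma0 and 2^7 * sigma0 (sigma0 = mean pairwise distance),
    each kernel centered and then scaled to have unit diagonal."""
    D = squareform(pdist(X))                         # pairwise Euclidean distances
    sigma0 = D.mean()
    widths = np.linspace(2.0 ** -7 * sigma0, 2.0 ** 7 * sigma0, num_kernels)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    kernels = []
    for w in widths:
        K = np.exp(-D ** 2 / (2.0 * w ** 2))         # Gaussian (RBF) kernel
        K = H @ K @ H                                 # center the kernel
        d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
        kernels.append(K / np.outer(d, d))            # scale to unit diagonal
    return np.stack(kernels)
```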

4.2. Results on UCI datasets

The classification accuracy results of all aforementioned algorithms on the ten UCI data sets are reported in Table 1. As can be seen, the proposed MCKM-kNN and MCKM-SR usually give overall comparable classification accuracy when compared with the other algorithms. Specifically, the proposed algorithms achieve better performance on the sonar, pima and spambase data sets, while they are comparable with the others on the rest of the data sets. These results on the UCI data sets preliminarily validate the effectiveness of our proposed algorithms.

4.3. Results on MKL Benchmark datasets

4.3.1. Results on Protein Subcellular Localization

Besides the ten UCI data sets, we compare the performance of the above-mentioned MKL algorithms on the psortPos, psortNeg and plant data sets, which come from protein subcellular localization and have been widely used in the MKL community Zien and Ong (2007); Cortes et al. (2013a,b). The base kernel matrices for these data sets have been pre-computed and can be publicly downloaded from the websites. Specifically, there are 69 base kernel matrices, including two kernels on phylogenetic trees, three kernels from BLAST E-values, and 64 sequence motif kernels. The numbers of classes of the psortPos, psortNeg and plant data sets are four, five and four, respectively.

The accuracy achieved by the above algorithms on these three data sets is reported in Table 2. As can be observed, the proposed MCKM-kNN usually achieves better performance than the compared baselines. For example, it outperforms the second best one (ALIGNF) by 1.45% on the plant data set.

Table 1: Classification accuracy comparison among different MKL algorithms on UCI datasets.

Method                        | germ       | heart      | iono       | musk       | pima       | sonar      | spam       | liver      | wdbc       | wpbc
UNIF                          | 76.53±1.90 | 81.67±4.14 | 97.14±1.78 | 93.40±3.00 | 80.26±1.71 | 92.44±3.72 | 89.60±1.05 | 70.00±5.42 | 98.57±1.41 | 76.32±2.15
MCMKL Zien and Ong (2007)     | 72.56±3.50 | 81.85±4.43 | 97.00±1.57 | 95.96±2.96 | 78.69±2.49 | 93.17±3.60 | 87.75±4.23 | 68.70±5.12 | 98.75±1.21 | 73.68±2.77
M3K Cortes et al. (2013a)     | 72.56±3.50 | 81.85±4.43 | 97.00±1.57 | 95.96±2.96 | 81.18±2.38 | 93.17±3.60 | 87.65±4.22 | 68.70±5.12 | 98.75±1.21 | 73.68±2.77
ALIGNF Cortes et al. (2012)   | 75.98±2.10 | 81.30±5.20 | 96.43±1.54 | 95.32±2.62 | 80.26±2.15 | 91.46±4.02 | 89.55±1.23 | 66.96±5.41 | 98.93±1.25 | 73.68±2.77
BinaryMKL Kumar et al. (2012) | 75.58±1.81 | 81.30±4.14 | 97.29±1.71 | 93.62±2.56 | 80.85±2.24 | 90.98±4.15 | 89.85±1.16 | 68.84±5.43 | 98.75±1.47 | 75.79±3.23
MCKM-kNN                      | 76.23±1.82 | 80.19±5.02 | 97.00±1.57 | 95.21±3.18 | 78.89±3.26 | 94.15±2.86 | 89.95±1.14 | 73.33±6.08 | 98.57±1.41 | 76.32±3.28
MCKM-SR                       | 75.73±3.16 | 80.37±3.72 | 97.29±1.96 | 94.15±3.59 | 79.41±2.84 | 92.95±3.99 | 90.15±0.97 | 71.59±5.60 | 98.75±1.47 | 75.79±3.68

Furthermore, the proposed MCKM-SR further improves the performance of MCKM-kNN and demonstrates the best performance on all three data sets.

4.3.2. Results on Protein Fold Prediction

We also compare the aforementioned algorithms on the protein fold prediction data set, which is a multi-source and multi-class data set based on a subset of the PDB-40D SCOP collection. It contains 12 different feature spaces, including composition, secondary, hydrophobicity, volume, polarity, polarizability, L1, L4, L14, L30, SWblosum62 and SWpam50. This data set has been widely adopted in the MKL community Damoulas and Girolami (2008); Yan et al. (2012). For the protein fold prediction data set, the input features are available and the kernel matrices are generated as in Damoulas and Girolami (2008), where second-order polynomial kernels are employed for feature sets one to ten and the linear kernel for the remaining two feature sets. This data set is a 27-class classification task, and the one-against-rest strategy is used to solve the multi-class classification problem. As can be seen from Table 2, the proposed MCKM-SR achieves the highest accuracy. It significantly outperforms the second best one (ALIGNF), which is a very competitive baseline, by 1.71%.

4.3.3. Results on Oxford Flower17

We then compare the above MKL algorithms on Oxford Flower17, which has been widely used as an MKL benchmark data set Nilsback and Zisserman (2006). There are seven heterogeneous data channels available for this data set. For each data channel, we apply a Gaussian kernel with three different width parameters, i.e., $2^{-2}\sigma_0$, $2^{0}\sigma_0$ and $2^{2}\sigma_0$, to generate three kernel matrices, where $\sigma_0$ denotes the averaged pairwise distances.

Table 2: Classification accuracy comparison among different MKL algorithms on seven benchmark datasets. The one achieving the highest accuracy and those with no statistical difference from the best one are marked in bold.

Method                        | psortPos   | plant      | psortNeg   | protein    | flower17
UNIF                          | 87.36±3.15 | 89.41±2.46 | 88.71±1.47 | 67.67±3.95 | 87.54±2.09
MCMKL Zien and Ong (2007)     | 86.89±2.37 | 89.89±0.94 | 88.82±1.37 | 69.61±3.48 | 85.92±3.09
M3K Cortes et al. (2013a)     | 87.92±3.17 | 90.91±1.15 | 90.59±1.15 | 69.61±3.48 | 85.62±3.74
ALIGNF Cortes et al. (2012)   | 88.40±2.36 | 91.18±1.39 | 90.52±3.79 | 71.24±3.79 | 86.95±1.84
BinaryMKL Kumar et al. (2012) | 88.77±2.65 | 90.38±3.11 | 88.71±1.44 | 67.60±3.07 | 86.91±1.83
MCKM-kNN                      | 88.58±1.91 | 92.63±0.92 | 91.01±1.12 | 70.93±2.91 | 87.87±2.54
MCKM-SR                       | 90.85±2.18 | 92.80±0.89 | 91.74±1.39 | 72.95±4.14 | 88.20±1.90

In this way, we obtain 21 (7 × 3) base kernels, and use them for all the MKL algorithms compared in our experiment. Table 2 reports the accuracy of the above algorithms on the Flower17 data set. From this table, we observe that the proposed MCKM-kNN and MCKM-SR obtain superior performance over the others. They even significantly outperform the UNIF algorithm, which is a very competitive baseline on this data set.

4.3.4. Results on Caltech101

Finally, we conduct another experiment on the Caltech101 data set to evaluate the performance of the proposed algorithms. This data set is a group of kernels derived from various visual features computed on the Caltech-101 object recognition task with 102 categories. It has 48 base kernels which are publicly available on the websites. We do not report the results of MCMKL Zien and Ong (2007) and M3K Cortes et al. (2013a) due to the memory limitation. In addition, we incorporate the results of SimpleMKL Rakotomamonjy et al. (2008) and $\ell_p$ MKL ($p = 2$) Xu et al. (2010) since they are competitive baselines on this data set.

The classification accuracy curves of the above algorithms with respect to the number of training samples are plotted in Figure 2. As can be seen, the proposed MCKM-SR is significantly better than the others when the number of training samples is relatively small (less than 20). Specifically, MCKM-SR outperforms the second best one by over 5%, 1% and 3% when the number of training samples is 5, 10 and 15, respectively. In addition, all MKL algorithms attain similar classification performance as the number of training samples increases.

From the above experiments on ten UCI data sets and six MKL benchmark data sets, we conclude that: (1) the proposed MCKM-kNN and MCKM-SR achieve superior performance over the compared ones; (2) the performance of MCKM-kNN can be further improved by MCKM-SR; (3) the work in this paper provides an effective way to utilize the multi-class kernel margin in MKL.

[Figure 2: The classification accuracy of different MKL algorithms (UNIF, ALIGNF, SimpleMKL, L2-MKL, BinaryMKL and the proposed algorithm) on Caltech101 varies with the number of training samples (5 to 25); the vertical axis (classification accuracy) spans roughly 0.50 to 0.75.]

5. Conclusion

In this paper, we propose two simple and easy-to-implement two-stage MKL algorithms that effectively utilize the multi-class kernel margin in MKL. In our approach, only the k-nearest neighbours of each sample are selected to calculate the multi-class margin of a kernel. We then propose another variant that automatically determines the nearest neighbours and their weights by minimizing the reconstruction error. After that, we efficiently implement the algorithms via the widely used CVX optimization package. Comprehensive experiments have demonstrated the effectiveness of our proposed algorithms.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 61375041), the Shenzhen Science Plan (Nos. JCYJ20120903092425971, JCYJ20130402113127502, JCY20140509174140685, JSGG20150925164740726), and the GuangDong IUR Innovations Platform (No. 2012B090600032).

References

G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, M. I. Jordan, Learning the Kernel Matrix with Semidefinite Programming, JMLR 5 (2004) 27–72.

A. Rakotomamonjy, F. R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, JMLR 9 (2008) 2491–2521.

M. Gönen, E. Alpaydın, Multiple Kernel Learning Algorithms, JMLR 12 (2011) 2211–2268.

F. Yan, J. Kittler, K. Mikolajczyk, M. A. Tahir, Non-Sparse Multiple Kernel Fisher Discriminant Analysis, JMLR 13 (2012) 607–642.

C. Cortes, M. Mohri, A. Rostamizadeh, Algorithms for Learning Kernels Based on Centered Alignment, JMLR 13 (2012) 795–828.

S. Bucak, R. Jin, A. Jain, Multiple Kernel Learning for Visual Object Recognition: A Review, IEEE Trans. PAMI 36 (7) (2014) 1354–1369.

C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel PCA, Neurocomputing 73 (4-6) (2010) 959–967.

X. Liu, L. Wang, G. Huang, J. Zhang, J. Yin, Multiple kernel extreme learning machine, Neurocomputing 149 (2015) 253–264.

X. Liu, L. Wang, J. Yin, L. Liu, Incorporation of radius-info can be simple with SimpleMKL, Neurocomputing 89 (2012) 30–38.

H. Qinghui, W. Shiwei, L. Zhiyuan, L. Xiaogang, Quasi-Newton method for Lp multiple kernel learning, Neurocomputing 194 (2016) 218–226.

F. Wu, W. Wang, Y. Yang, Y. Zhuang, F. Nie, Classification by semi-supervised discriminative regularization, Neurocomputing 73 (10-12) (2010) 1641–1651.

A. Kumar, A. Niculescu-Mizil, K. Kavukcuoglu, H. Daumé III, A Binary Classification Framework for Two-Stage Multiple Kernel Learning, in: ICML, 2012.

A. Zien, C. S. Ong, Multiclass multiple kernel learning, in: ICML, 1191–1198, 2007.

J. Ye, S. Ji, J. Chen, Multi-class Discriminant Kernel Learning via Convex Programming, JMLR 9 (2008) 719–758.

Z. Xu, R. Jin, H. Yang, I. King, M. R. Lyu, Simple and Efficient Multiple Kernel Learning by Group Lasso, in: ICML, 1175–1182, 2010.

M. Gönen, Bayesian Efficient Multiple Kernel Learning, in: ICML, 2012.

C. Cortes, M. Mohri, A. Rostamizadeh, Multi-Class Classification with Maximum Margin Multiple Kernel, in: ICML, 46–54, 2013a.

X. Liu, L. Wang, J. Yin, E. Zhu, J. Zhang, An Efficient Approach to Integrating Radius Information into Multiple Kernel Learning, IEEE TCyb 43 (2) (2013) 557–569.

X. Liu, L. Wang, J. Zhang, J. Yin, Sample-Adaptive Multiple Kernel Learning, in: AAAI, 1975–1981, 2014.

S. Shalev-Shwartz, Y. Singer, N. Srebro, Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, in: ICML, 2007.

CVX Research, Inc., CVX: Matlab Software for Disciplined Convex Programming, version 2.0 beta, http://cvxr.com/cvx, 2012.

C. Cortes, M. Kloft, M. Mohri, Learning Kernels Using Local Rademacher Complexity, in: NIPS, 2760–2768, 2013b.

T. Damoulas, M. A. Girolami, Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection, Bioinformatics 24 (10) (2008) 1264–1270.

M.-E. Nilsback, A. Zisserman, A Visual Vocabulary for Flower Classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1447–1454, 2006.