Engineering Applications of Artificial Intelligence 64 (2017) 391–400
Multiple kernel learning using composite kernel functions
Shiju S.S., Asif Salim, Sumitra S.*
Department of Mathematics, Indian Institute of Space Science and Technology, India
Keywords: Multiple kernel learning; Classification; Reproducing kernel; Support vector machine; Composite kernel functions
Abstract: Multiple kernel learning (MKL) algorithms deal with learning the optimal kernel from training data along with learning the function that generates the data. Generally in MKL, the optimal kernel is defined as a combination of the kernels under consideration (base kernels). In this paper, we formulate MKL using composite kernel functions (MKLCKF), in which the optimal kernel is represented as a linear combination of composite kernel functions. Corresponding to each data point, a composite kernel function is designed whose domain is constructed as the direct product of the range spaces of the base kernels, so that the composite kernels make use of the information of all the base kernels for finding their images. Thus MKLCKF has three layers: the first layer consists of the base kernels, the second layer consists of the composite kernels, and the third layer is the optimal kernel, which is a linear combination of the composite kernels. To make the algorithm more computationally effective, we formulated one more variation of the algorithm in which the coefficients of the linear combination are replaced with a similarity function that captures the local properties of the input data. We applied the proposed approach to a number of artificial intelligence applications and compared its performance with that of other state-of-the-art techniques. Data compression techniques were used for applying the models to large data: dictionary learning for large scale classification and a pre-clustering approach for large scale regression. On the basis of performance, a rank was assigned to each model used for analysis; the proposed models scored a higher rank than the other models used for comparison. We also analyzed the performance of the MKLCKF model in combination with kernelized locality-sensitive hashing (KLSH), and the results were found to be promising. © 2017 Elsevier Ltd. All rights reserved.
1. Introduction

Kernel methods are applied to various classes of problems such as classification (Boser et al., 1992), regression (Pozdnoukhov, 2002) and dimensionality reduction (Schölkopf et al., 1998). The performance of a kernel algorithm depends on the selection of the reproducing kernel; hence the development of efficient methods for finding the best kernel is essential for the area of kernel methods. The currently available tools for kernel selection include techniques such as cross validation and multiple kernel learning (MKL), of which the latter is a data-driven approach. The advantage of MKL is that it automatically finds the best combination of kernels from a pool of available kernels (base kernels). One of the initial approaches used in MKL algorithms is to represent the optimal kernel as a linear combination of a set of kernels (Lanckriet et al., 2004). By representing the function that generates the data as a linear combination of the optimal kernel, the MKL problem finds the function as well as the optimal kernel simultaneously. Thus in the MKL learning paradigm there are two sets of parameters, where one set corresponds to the unknowns of the function to be learned and the other corresponds to the optimal kernel. MKL techniques use either a one-stage optimization (Lanckriet et al., 2004), in which both sets of parameters are solved in the same iteration, or a two-stage optimization, in which the functional parameters are updated in the first stage and the kernel parameters in the second stage, these two stages being repeated until convergence (Rakotomamonjy et al., 2008). There exist different approaches for finding the parameters of MKL. A fixed-weight approach is used in Pavlidis et al. (2001), in which the kernel parameters are fixed constants, while heuristically calculated weights are assigned to the kernels in de Diego et al. (2010). Other major approaches are regularized MKL (Varma and Babu, 2009b) and localized MKL using two stages (Gönen and Alpaydın, 2013). Bayesian techniques are also used for finding the combination of kernels by defining suitable priors. Boosting (Bennett et al., 2002), semi-supervised (Wang et al., 2012) and unsupervised (Hsu and Lee, 2011) algorithms have also been
adapted for finding the parameters associated with the representation of the optimum kernel. Classification based approaches have also been applied in MKL (Kumar et al., 2012). MKL for large data (Sonnenburg et al., 2006) and nonlinear combinations of kernels (Cortes et al., 2009) are other versions of MKL learning. MKL theory has been applied in areas such as feature selection (Dileep and Sekhar, 2009) and feature fusion (Yeh et al., 2012). Pairwise classification (Kreßel, 1999), in which pairwise kernels are defined, is another well researched area in multi-class classification.

The main contribution of the paper is the formulation of MKL using composite kernel functions for finding the best combination of kernels from a given set of $P$ base kernels for machine learning problems such as classification and regression. With reference to each data point, we designed a composite kernel function such that it makes use of the information of all the given $P$ base kernels for finding the image at each point in its domain. We propose two variants of this formulation. In the first variant, the optimal kernel is represented as a linear combination of the newly designed kernels. As each composite kernel function is built upon a data point, we introduce a second variant in which the coefficients of the linear combination are replaced with a neighborhood function of the reference data point; this representation makes the algorithm more computationally efficient. We verified the efficiency of the proposed models using real world datasets and compared their performance with existing techniques. The proposed methods showed excellent performance; of the two variants, the performance of the second was found to be better. As the data grows, the number of training points as well as the number of terms in the proposed kernel increases, and thus the overall complexity of the problem increases. In order to tackle this problem, the subset selection approach developed by Nair and Dodd (2015) is followed for regression and the dictionary learning approach of Jiang et al. (2013) for classification. We did experimental analysis by incorporating these two into the proposed method, and the results were found to be promising.

The rest of the paper is organized as follows. Section 2 describes the background theory and state-of-the-art algorithms. Section 3 details the theory behind the proposed weighted kernel approach and its applications, while Section 3.4 details the localized approach and its applications. The experimental analysis is given in Section 4.
2. Background and state-of-the-art methods

Kernel methods search for the function to be approximated in a Reproducing Kernel Hilbert Space (RKHS). Corresponding to each RKHS there exists a unique reproducing kernel and vice versa. Let $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, $x_i \in \mathcal{X} \subset \mathbb{R}^n$, be the training points and $y_i \in \mathbb{R}$, $i = 1, 2, \dots, N$, be the corresponding labels. In kernel algorithms, the function $f$ in the RKHS $\mathcal{H}$ to be learned is found by minimizing the regularized cost function

$$\sum_{i=1}^{N} l(f(x_i), y_i) + \frac{\lambda}{2}\|f\|^2 \qquad (1)$$

where $l(\cdot)$ is a loss function and $\lambda > 0$ is the regularization parameter. By the representer theorem (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001), the function that minimizes the above cost function can be represented as

$$f = \sum_{i=1}^{N} \alpha_i k_{x_i} \qquad (2)$$

where $\alpha_i \in \mathbb{R}$ and $k_{x_i}$, $i = 1, 2, \dots, N$, are the representer evaluators of the input training points. Using the reproducing property of the kernel $k$ of $\mathcal{H}$, (2) becomes

$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x). \qquad (3)$$

2.1. Combination of kernels

The first work in the domain of MKL is Lanckriet et al. (2004), in which the optimal kernel is represented as a linear combination of the base kernels and the parameters are learned from the data using semidefinite programming. Its theory can be briefly described as follows. Consider $P$ base kernels $\{k_1, k_2, \dots, k_P\}$ from which the optimal kernel has to be learned. By applying the theory from Lanckriet et al. (2004), (3) becomes

$$f(x) = \sum_{i=1}^{N} \alpha_i \sum_{j=1}^{P} d_j k_j(x_i, x) \qquad (4)$$

where $d_j \geq 0$, $j = 1, 2, \dots, P$. Other major works in this domain are SimpleMKL, generalized MKL, etc. SimpleMKL (Rakotomamonjy et al., 2008) uses the above formulation but solves the MKL problem much faster using a two-step optimization: in the first step, the function parameters ($\alpha$) are optimized with the kernel parameters ($d$) fixed, and in the second step, the kernel weights are updated using a gradient descent approach with the function parameters fixed. Generalized MKL (Varma and Babu, 2009b) incorporates a regularization term in its formulation, and Jain et al. (2012) extend generalized MKL to handle a million kernels. However, the application of MKL over large data is yet to be solved efficiently in terms of space, since a million kernels over a million data points require a large amount of memory. In localized MKL (Gönen and Alpaydın, 2013), the selection of kernels is done in a local manner; localized MKL may not work well with a large number of kernels or a large number of data points.
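To make the formulation above concrete, the following is a minimal NumPy sketch (an illustration, not the authors' code) of Eq. (4) with fixed, equal weights $d_j$: the base kernel matrices are combined into a single Gram matrix and the coefficients $\alpha$ are then obtained in closed form for the squared loss, i.e. the kernel ridge instance of the cost (1). The Gaussian bandwidths and the regularization value are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - z||^2 / sigma)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

def fixed_weight_mkl_fit(X, y, sigmas=(0.5, 1.0, 2.0), lam=0.1):
    """Eq. (4) with d_j = 1/P: K = sum_j d_j K_j, then alpha = (K + lam*I)^{-1} y
    (closed-form minimizer of the regularized squared loss (1))."""
    base_K = np.stack([gaussian_kernel(X, s) for s in sigmas])   # (P, N, N)
    d = np.full(len(sigmas), 1.0 / len(sigmas))                  # fixed weights
    K = np.tensordot(d, base_K, axes=1)                          # combined Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return alpha, d

# The fitted values on the training points are then K @ alpha, cf. Eq. (3).
```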
2.2. Large data algorithm approaches

The main disadvantage of kernel methods is their computational complexity, which scales as $O(N^3)$, where $N$ is the number of training points. Nair and Dodd (2015) developed a supervised pre-clustering approach for scaling kernel based regression by making use of the concepts of uniform continuity and compactness. In our work we used this compression technique for large scale regression problems, and dictionary based learning for classification. These compression techniques are described below.

2.2.1. Supervised pre-clustering

In the pre-clustering approach developed by Nair and Dodd (2015), the function $f$ to be learned is uniformly continuous, by assuming that it lies in a continuous RKHS $\mathcal{H}$ whose members have a compact set $\mathcal{X}$ as their domain. The idea of uniform continuity is used to define a similarity measure on the function to be estimated. As the function $f$ is uniformly continuous, corresponding to a similarity measure $\epsilon$ there exists a radius $\delta$, independent of $x \in \mathcal{X}$, such that

$$\hat{d}(f(x), f(x')) < \epsilon \quad \forall\, x' \in B(x, \delta) \qquad (5)$$

where $B(x, \delta)$ is an open ball of radius $\delta$ in the input space and $\hat{d}$ is a suitable metric on $\mathbb{R}$. An open ball $B(x, \delta)$ in the input space is called a cluster if all points associated with it satisfy (5). The basic idea of pre-clustering is that any data points which satisfy (5) can be considered "similar" and therefore form pre-clusters. The centers of the clusters are then used as a sparse dataset for the function estimation. Output information is also used to form the clusters, hence it is a supervised clustering. The working procedure of the algorithm is as follows. Corresponding to the given similarity measure $\epsilon$, the algorithm finds the radius $\delta$ in an iterative manner. In an iteration, open balls $B(x, \delta)$ are formed in a greedy manner and all non-center points that satisfy (5) are eliminated. Thus in each iteration the training points consist of the centers of the open balls and those points that do not satisfy (5). The algorithm terminates if all the training points under consideration satisfy (5); otherwise $\delta$ is updated using the formula $\delta := \delta - h$, where $h$ is the step length, and the algorithm moves to the next iteration.
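The following Python sketch gives one reading of this greedy procedure, assuming (as the supervised setting suggests) that the similarity test of (5) is applied to the training outputs $y$; it is an illustration of the loop described above, not the implementation of Nair and Dodd (2015).

```python
import numpy as np

def pre_cluster(X, y, eps, delta0, h):
    """Greedy supervised pre-clustering sketch: returns the indices used as the
    compressed training set (ball centers plus points that were never covered).
    eps    : similarity threshold of Eq. (5), checked on the outputs y (assumption)
    delta0 : initial ball radius, reduced by the step length h between passes."""
    delta = delta0
    active = list(range(len(X)))                       # current training points
    while delta > 0:
        centers, leftovers = [], []
        pool = active[:]
        while pool:
            c = pool.pop(0)                            # greedy choice of a new center
            centers.append(c)
            in_ball = [i for i in pool if np.linalg.norm(X[i] - X[c]) < delta]
            # points inside the ball that violate (5) are kept for the next pass;
            # the points that satisfy (5) are eliminated
            leftovers += [i for i in in_ball if abs(y[i] - y[c]) >= eps]
            pool = [i for i in pool if i not in in_ball]
        if not leftovers:                              # every covered point satisfied (5)
            return centers
        active = centers + leftovers                   # shrink delta and repeat
        delta -= h
    return active                                      # fallback: no compression found
```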
2.2.2. Data compression using dictionary learning

In dictionary learning, the dictionary consists of a subset of the data points such that all the input data can be represented as a linear combination of them. Let $X$ be the input data, $D$ the dictionary atoms and $W$ the sparse weights. Then the problem of dictionary learning can be formulated as

$$\min_{D, W} \|X - DW\|^2 + \lambda \|W\|_0.$$

Jiang et al. (2013) discuss a label consistent dictionary learning algorithm, which uses the label information in the cost function for learning the dictionary. We use the same algorithm in our model for applying it over large data in the case of classification.
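As an illustration of the unlabeled part of this objective, the sketch below uses scikit-learn's MiniBatchDictionaryLearning, which solves a related problem with an $\ell_1$ sparsity penalty instead of the $\ell_0$ penalty above and without the label-consistency term of Jiang et al. (2013); the number of atoms and the penalty weight are placeholder values.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(X, n_atoms=50, alpha=1.0):
    """Approximate X (n_samples x n_features) by sparse codes times atoms.
    Note the transposed convention: here X ~ W @ D, with D the atom matrix,
    whereas the objective above is written with X ~ D W."""
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=alpha,
                                     random_state=0)
    W = dl.fit_transform(X)        # sparse codes, shape (n_samples, n_atoms)
    D = dl.components_             # dictionary atoms, shape (n_atoms, n_features)
    return D, W

# Reconstruction error of the learned factorization:
# err = np.linalg.norm(X - W @ D)
```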
3. Multiple kernel learning using composite kernel functions (MKLCKF)

Consider a pool of $P$ kernels $\{k_1, k_2, \dots, k_P\}$ from which the best combination of kernels has to be chosen. For that, using each data point as a reference point, we constructed $N$ composite kernel functions. This section describes the construction of those kernels. Define an operator $\tilde{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^P$ as

$$\tilde{K}(x, z) = [k_1(x, z)\; k_2(x, z)\; \dots\; k_P(x, z)]^T \qquad (6)$$

where $P$ is the number of base kernels. Corresponding to each $x_j$, $j = 1, 2, \dots, N$, a composite function $k_j : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is constructed, where

$$k_j(x_k, x_l) = \langle k^j_{x_k}, k^j_{x_l} \rangle = k^*\big(\tilde{K}(x_k, x_j), \tilde{K}(x_j, x_l)\big). \qquad (7)$$

Here $k^* : Z \times Z \to \mathbb{R}$ is any valid reproducing kernel and $Z = \mathcal{R}(\tilde{K})$, where $\mathcal{R}(\tilde{K})$ is the range space of $\tilde{K}$.

Theorem 3.1. The composite function given by (7) is a valid kernel.

Proof. Let $\phi_j(x_k) = \tilde{K}(x_k, x_j)$ and $\phi_j(x_l) = \tilde{K}(x_j, x_l)$. Then (7) becomes $k_j(x_k, x_l) = k^*\big(\phi_j(x_k), \phi_j(x_l)\big)$. By Bishop (2006), $\hat{k}(x, x') = k_1(\phi(x), \phi(x'))$ is a valid kernel whenever $\phi : \mathcal{X} \to \mathbb{R}^M$ and $k_1$ is a valid kernel defined on $\mathbb{R}^M$. Therefore $k_j$ is a valid kernel.

Each $k_j$ consists of two layers of functions, where the first layer is defined from $\mathcal{X} \times \mathcal{X} \to Z \times Z$ and the second layer from $Z \times Z \to \mathbb{R}$. With the aid of such a design, each composite kernel makes use of the information of all the base kernels for finding its image. Using this idea we developed two variants of the MKL algorithm, which we named MKLCKF: I and MKLCKF: II.
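The sketch below (an illustration, not the authors' code) computes the Gram matrix of one composite kernel $k_j$ from precomputed base kernel matrices, taking $k^*$ to be a Gaussian kernel on $\mathbb{R}^P$; the choice of $k^*$ and its width are assumptions.

```python
import numpy as np

def composite_kernel_matrix(base_K, j, gamma_star=1.0):
    """Gram matrix of k_j from Eq. (7): k_j(x_k, x_l) = k*(K~(x_k, x_j), K~(x_j, x_l)),
    with k* a Gaussian kernel on R^P.
    base_K : array (P, N, N) of precomputed, symmetric base kernel matrices."""
    Phi = base_K[:, :, j].T                               # row k is K~(x_k, x_j) in R^P
    sq = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma_star * sq)                       # (N, N) matrix of k_j values
```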
3.1. MKLCKF: I

Let $\mathcal{H}_j$ be the RKHS corresponding to $k_j$, $j = 1, 2, \dots, N$. The function $f$ is assumed to lie in an RKHS $\mathcal{H}$ with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where

$$k(x_k, x_l) = \sum_{j=1}^{N} \beta_j k_j(x_k, x_l) = \sum_{j=1}^{N} \beta_j k^*\big(\tilde{K}(x_j, x_k), \tilde{K}(x_j, x_l)\big) \qquad (8)$$

and $\beta_j \geq 0$, $\forall j = 1, 2, \dots, N$.

Theorem 3.2. The kernel given by (8) is a valid kernel.

Proof. By Theorem 3.1, $k_j$, $j = 1, 2, \dots, N$, are valid kernels. Therefore $k$ is a conical linear combination of $N$ kernel functions. Hence by Bishop (2006), $k$ is a valid reproducing kernel.

The representation of $k$ as a linear combination of the $k_j$ helps to include the information of all the $P$ base kernels in an efficient manner for finding $f$. The cost function used for approximating $f$ is

$$\min_{f \in \mathcal{H}} \sum_{i=1}^{N} l\big(y_i, f(x_i)\big) + \eta \|f\|^2 \qquad (9)$$

where $l$ is a differentiable loss function. Then

$$f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x) = \sum_{i=1}^{N} \alpha_i \sum_{j=1}^{N} \beta_j k^*\big(\tilde{K}(x_i, x_j), \tilde{K}(x_j, x)\big). \qquad (10)$$

In order to impose a controlled regularization, the constraint $\sum_{i=1}^{N} \beta_i = 1$ can be imposed. The whole idea is summarized in Fig. 1.
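Continuing the sketch above, the optimal kernel of Eq. (8) is simply a weighted sum of the composite kernel Gram matrices; the uniform initialization of $\beta$ mirrors the first optimization stage described in Section 3.2.1 (illustrative code reusing composite_kernel_matrix, not the authors' implementation).

```python
import numpy as np

def mklckf1_kernel(base_K, beta, gamma_star=1.0):
    """Optimal kernel of MKLCKF: I, Eq. (8): K = sum_j beta_j * K_j."""
    P, N, _ = base_K.shape
    K = np.zeros((N, N))
    for j in range(N):
        K += beta[j] * composite_kernel_matrix(base_K, j, gamma_star)
    return K

# Uniform weights beta_j = 1/N satisfy the constraint sum_j beta_j = 1:
# K = mklckf1_kernel(base_K, np.full(N, 1.0 / N))
```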
3.2. Application in Support Vector Machine (SVM)

For classification problems, the SVM algorithm was used to determine the unknown function. The optimization problem corresponding to SVM classification is

$$\min \; \frac{1}{2}\|f\|^2 + C \sum_i \xi_i$$
$$\text{sub. to } \; y_i(f(x_i) + b) - 1 + \xi_i \geq 0, \quad \xi_i \geq 0.$$

The corresponding dual using the kernel given in Eq. (8) can be written as follows:

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \sum_k \beta_k k^*\big(\tilde{K}(x_i, x_k), \tilde{K}(x_k, x_j)\big)$$
$$\text{sub. to } \; 0 \leq \alpha_i \leq C, \quad \beta_k \geq 0, \quad \sum_i \alpha_i y_i = 0, \quad \sum_{k=1}^{N} \beta_k = 1.$$

3.2.1. Optimization

For solving the above optimization problem, in the first stage $\beta$ is fixed as $\frac{1}{N}$ and $\alpha$ is optimized using a conventional SVM solver, namely SMO. In the second stage, $\beta$ is updated using the reduced gradient descent method as in Rakotomamonjy et al. (2008).
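A rough Python sketch of this alternating scheme is given below, reusing composite_kernel_matrix from the sketch in Section 3. It uses scikit-learn's SVC with a precomputed kernel for the SMO stage and a simple projected-gradient step on $\beta$ in place of the reduced-gradient update of Rakotomamonjy et al. (2008); the step size, number of outer iterations and the crude renormalization onto the simplex are simplifying assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def mklckf1_two_stage(base_K, y, C=1.0, lr=0.1, outer_iters=10, gamma_star=1.0):
    """Alternate (i) SMO for alpha with beta fixed and (ii) a gradient step on beta."""
    P, N, _ = base_K.shape
    K_j = [composite_kernel_matrix(base_K, j, gamma_star) for j in range(N)]
    beta = np.full(N, 1.0 / N)                       # stage-one initialization
    for _ in range(outer_iters):
        K = sum(b * Kj for b, Kj in zip(beta, K_j))
        clf = SVC(C=C, kernel='precomputed').fit(K, y)
        ya = np.zeros(N)
        ya[clf.support_] = clf.dual_coef_[0]         # y_i * alpha_i on support vectors
        # gradient of the optimal dual value with respect to beta_k
        grad = np.array([-0.5 * ya @ Kj @ ya for Kj in K_j])
        beta = np.maximum(beta - lr * grad, 0.0)     # gradient step, keep beta >= 0
        beta /= beta.sum()                           # renormalize so beta sums to one
    return beta, clf
```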
3.3. Application in Support Vector Regression (SVR)

The SVR algorithm was applied to determine the unknown function for regression problems. The optimization problem corresponding to SVR is

$$\min \; \frac{1}{2}\|f\|^2 + C \sum_i [\xi_i + \xi_i^*]$$
$$\text{sub. to } \; y_i - f(x_i) - b - \epsilon - \xi_i \leq 0, \quad f(x_i) + b - y_i - \epsilon - \xi_i^* \leq 0, \quad \xi_i, \xi_i^* \geq 0$$

where $b \in \mathbb{R}$ is the bias and $C > 0$ is the regularization parameter.
Fig. 1. MKLCKF:I.
Fig. 2. MKLCKF:II.
The corresponding dual using the kernel given in Eq. (8) can be written as follows:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i y_i - \epsilon \sum_{i=1}^{N} |\alpha_i| - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \sum_k \beta_k k^*\big(\tilde{K}(x_i, x_k), \tilde{K}(x_k, x_j)\big)$$
$$\text{sub. to } \; -C \leq \alpha_i \leq C, \quad \beta_k \geq 0, \quad \sum_i \alpha_i = 0, \quad \sum_{k=1}^{N} \beta_k = 1.$$

The optimization is performed similarly to that explained in Section 3.2.1.

3.4. MKLCKF: II

Using (8),

$$k(x_k, x_l) = \langle k_{x_k}, k_{x_l} \rangle = \left\langle \begin{bmatrix} \sqrt{\beta_1}\, k^1_{x_k} \\ \sqrt{\beta_2}\, k^2_{x_k} \\ \vdots \\ \sqrt{\beta_N}\, k^N_{x_k} \end{bmatrix}, \begin{bmatrix} \sqrt{\beta_1}\, k^1_{x_l} \\ \sqrt{\beta_2}\, k^2_{x_l} \\ \vdots \\ \sqrt{\beta_N}\, k^N_{x_l} \end{bmatrix} \right\rangle \qquad (11)$$

where

$$k_{x_k} = \big[\sqrt{\beta_1}\, k^1_{x_k},\; \sqrt{\beta_2}\, k^2_{x_k},\; \dots,\; \sqrt{\beta_N}\, k^N_{x_k}\big]^T = \big[\sqrt{\beta_1}\, k^*_{\tilde{K}(x_1, x_k)},\; \sqrt{\beta_2}\, k^*_{\tilde{K}(x_2, x_k)},\; \dots,\; \sqrt{\beta_N}\, k^*_{\tilde{K}(x_N, x_k)}\big]^T.$$

Thus in this method $N$ parameters $\{\beta_1, \beta_2, \dots, \beta_N\}$ have to be learned. To make the approach more computationally effective we formulated the algorithm MKLCKF: II, in which a neighborhood function $\eta_i$ of $x_i$ is introduced in place of $\beta_i$. We defined

$$\eta_i(x) = \exp\big(-\gamma\, d(x_i, x)\big) \qquad (12)$$

where $d(x_i, x)$ is a distance metric and $\gamma > 0$. Such a formulation is used because the composite kernel $k_i$ depends greatly on the data point $x_i$. We used the Euclidean distance metric in our experiments. It is clear from (12) that $\eta_i$'s contribution is highest for $x_i$'s neighbors, and hence it helps to capture the local information of $x_i$. Now

$$k_{x_k} = \begin{bmatrix} \eta_1(x_k)\, k^1_{x_k} \\ \eta_2(x_k)\, k^2_{x_k} \\ \vdots \\ \eta_N(x_k)\, k^N_{x_k} \end{bmatrix} = \begin{bmatrix} \eta_1(x_k)\, k^*_{\tilde{K}(x_1, x_k)} \\ \eta_2(x_k)\, k^*_{\tilde{K}(x_2, x_k)} \\ \vdots \\ \eta_N(x_k)\, k^*_{\tilde{K}(x_N, x_k)} \end{bmatrix}. \qquad (13)$$
Therefore, the final kernel is

$$k(x_k, x_l) = \sum_{j=1}^{N} \eta_j(x_k)\, \eta_j(x_l)\, k_j(x_k, x_l) = \sum_{j=1}^{N} \eta_j(x_k)\, \eta_j(x_l)\, k^*\big(\tilde{K}(x_j, x_k), \tilde{K}(x_j, x_l)\big). \qquad (14)$$

Theorem 3.3. The kernel given by (14) is a valid kernel.

Proof. By Theorem 3.1, $k_j$, $j = 1, 2, \dots, N$, are valid kernels. Using (14), construct the $N \times N$ matrix $K = (k(x_i, x_j))$, $i, j = 1, 2, \dots, N$. Then $K$ can be represented as

$$K = \sum_{j=1}^{N} \Lambda_j K^j \Lambda_j \qquad (15)$$

where $\Lambda_j$ is a diagonal matrix of order $N \times N$ whose $i$th diagonal element is $\Lambda_j(i, i) = \eta_j(x_i)$, and $K^j$ is the kernel matrix of $k_j$ corresponding to the $N$ points $\{x_1, x_2, \dots, x_N\}$. $\Lambda_j$ is a positive semidefinite (p.s.d.) matrix as it is a diagonal matrix with positive diagonal entries. Hence $\Lambda_j K^j \Lambda_j$ is a symmetric p.s.d. matrix. As $K$ is a linear combination of symmetric p.s.d. matrices, it is symmetric p.s.d. (Bishop, 2006; Shawe-Taylor and Cristianini, 2004). Hence $k$ is a valid kernel.

Fig. 2 explains the formulation of MKLCKF: II.
3.4.1. Computational advantage of MKLCKF: II

In the case of MKLCKF: II there are no additional parameters for learning the kernel, unlike conventional MKL problems. The construction of the kernel corresponding to MKLCKF: II is given in Algorithm 1. As with all MKL models, we need the precomputed $P$ base kernel matrices, for which the complexity is $O(N^2 P n)$; they are used for computing the output of $\tilde{K}$ (line 2). In Algorithm 1, the complexity of computing the function $kernel(i, datapoints)$ is $O(N^2 P)$ (line 4) and the complexity of computing the function $eta(i, datapoints)$ is $O(Nn)$ (line 5). The time complexity of line 6 is $O(N^2)$. Lines 4, 5 and 6 are executed $N$ times. Thus the total time complexity of Algorithm 1 is $O(N^3 P) + O(N^2 P n) \simeq O(N^3 P)$ (considering $n < N$). Hence this algorithm has a time complexity similar to that of the best multiple kernel approaches.
Algorithm 1 Kernel Construction Algorithm for MKLCKF: II
1: procedure ComputeKernel(datapoints)
2:     Initialize K = zeros(N, N)            ⊳ N is the number of data points
3:     for i = 1 to N do
4:         Ktemp = kernel(i, datapoints)     ⊳ kernel() computes the composite kernel matrix from (7)
5:         eta = eta(i, datapoints)          ⊳ eta() computes the η vector from (12)
6:         K = K + ((eta * eta^T) .* Ktemp)
7:     end for
8:     return K
9: end procedure
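For reference, a vectorized NumPy transcription of Algorithm 1 might look as follows, reusing composite_kernel_matrix from the sketch in Section 3; the Euclidean distance follows Section 3.4, while $\gamma$ and the Gaussian choice of $k^*$ are assumptions.

```python
import numpy as np

def mklckf2_kernel(X, base_K, gamma=1.0, gamma_star=1.0):
    """Eq. (14) / Algorithm 1: K = sum_j (eta_j eta_j^T) .* K_j,
    with eta_j(x) = exp(-gamma * ||x_j - x||)."""
    N = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    Eta = np.exp(-gamma * dist)                 # Eta[j, i] = eta_j(x_i), Eq. (12)
    K = np.zeros((N, N))
    for j in range(N):                          # lines 3-7 of Algorithm 1
        K += np.outer(Eta[j], Eta[j]) * composite_kernel_matrix(base_K, j, gamma_star)
    return K
```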
4. Experiments

4.1. Setup

The experiments were conducted using classification and regression datasets. The classification datasets used are given in Table 1; they are binary classification problems taken from the UCI repository (Asuncion and Newman, 2007) and the IDA benchmark repository (Sonnenburg et al., 2011). The Heart Disease dataset is actually a 5-class classification dataset, but we converted it to a binary classification problem by dividing the subjects on the basis of whether or not they had heart disease. The regression datasets used for analysis are given in Table 2; they are taken from the UCI, mldata.org and DELVE repositories.

Table 1. Classification datasets.
Dataset | Repo. | Dim. | Data points
Arrythmia | UCI | 276 | 452
Haberman | UCI | 3 | 306
Heart | UCI | 13 | 303
Ionosphere | UCI | 33 | 351
Liver | UCI | 6 | 345
Musk 2 | UCI | 166 | 476
Parkinsons | UCI | 22 | 195
Pima | UCI | 8 | 768
Sonar | UCI | 60 | 208
Vert. Column | UCI | 6 | 310
WDBC | UCI | 30 | 569
Whole. Cust. | UCI | 7 | 440
Twonorm | IDA | 20 | 7400
Ringnorm | IDA | 20 | 7400

Table 2. Regression datasets.
Dataset | Repository | Points | Dim.
Ailerons | mldata.org | 13 750 | 40
Airfoil self noise | UCI | 1 503 | 5
Bank32NH | DELVE | 8 192 | 32
Commun. and crime | UCI | 2 215 | 99
Concrete slump test | UCI | 1 030 | 8
Elevators | Exp. of Rui Camacho | 16 599 | 18
Energy Eff. Cool | UCI | 768 | 8
Energy Eff. Heat | UCI | 768 | 8
2D-planes | breiman 1984 | 40 768 | 10
Video Char. | UCI | 68 784 | 20
Protein tertiary Str. | UCI | 45 730 | 9

In the MKL algorithms, the $P$ base reproducing kernels are generated from the following reproducing kernel functions:

∙ Laplacian kernel, $k(x, z) = \exp\left(-\frac{\|x - z\|}{\sigma}\right)$, where $\sigma > 0$ is the adjustable parameter.
∙ Gaussian kernel, $k(x, z) = \exp\left(-\frac{\|x - z\|^2}{\sigma}\right)$, where $\sigma > 0$ is the adjustable parameter.
∙ Polynomial kernel, $k(x, z) = (\alpha x^T z + c)^d$, where $d \in \mathbb{R}$ is the polynomial degree, $c \in \mathbb{R}$ the constant term and $\alpha \in \mathbb{R}$ the slope.
Using different hyperparameters in the above reproducing kernel functions, 42 base kernels were generated. The $\sigma$ of both the Laplacian and Gaussian kernels is assigned values from $\{2^{-9}, 2^{-8}, \dots, 2^{9}\}$, and polynomial kernels of degree 1, 2, 3 and 4 were used. The experiments were conducted on the same machine throughout under similar conditions; the machine used had an Intel i7 M4810 processor with 16 GB RAM. The hyperparameters of the models used for our study were determined using 5-fold cross validation. For our study we chose both 'large' as well as 'not large' data: large data is defined as data for which the application of MKL produces an out-of-memory problem on the machine used for computation, and the rest are called 'not large'.
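A sketch of this kernel bank in NumPy is given below (19 bandwidths for each of the Laplacian and Gaussian kernels plus four polynomial degrees gives the 42 base kernels); the Euclidean norm and the polynomial constants $\alpha = 1$, $c = 1$ are assumptions.

```python
import numpy as np

def base_kernel_bank(X):
    """Builds the 42 base kernel matrices described above for data X (N x n)."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    gram = X @ X.T
    kernels = []
    for sigma in [2.0 ** p for p in range(-9, 10)]:        # sigma = 2^-9 ... 2^9
        kernels.append(np.exp(-dist / sigma))              # Laplacian kernel
        kernels.append(np.exp(-dist ** 2 / sigma))         # Gaussian kernel
    for d in range(1, 5):                                   # degrees 1-4
        kernels.append((gram + 1.0) ** d)                  # polynomial, alpha = c = 1
    return np.stack(kernels)                               # shape (42, N, N)
```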
The performance of MKLCKF: I and MKLCKF: II was compared with the following models:

1 SMO-SVM: a standard SMO-SVM model that uses a single kernel.
2 SimpleMKL (Rakotomamonjy et al., 2008): a model that uses a linear combination of kernels.
3 TSMKL (Kumar et al., 2012): a two-stage MKL model that uses function approximation concepts and a two-stage learning process.
4 GMKL (Varma and Babu, 2009b): a regularized formulation of MKL termed generalized multiple kernel learning.
5 LpMKL (Kloft et al., 2011): an Lp-norm regularized MKL algorithm.

The code for GMKL was taken from the authors' URL (Varma and Babu, 2009a). For LpMKL we used the code from LibMKL (Xu, 2016). The Java MKL framework jKernelMachines (Picard et al., 2013) was used for the other state-of-the-art algorithms, and the same framework was customized for implementing the proposed algorithms. The kernel $k^*$ for the MKLCKF models was found using cross validation. The kernel $k^*$ and the kernel selected for the single kernel approach (SMO-SVM) were the same for all the experiments.
Table 3. Accuracy table. Performance rank is given in brackets.
Dataset | SMO SVM | SimpleMKL | TSMKL | LpMKL | GMKL | MKLCKF:I | MKLCKF:II
Arrhythmia | 76.98 ± 3.18 (2) | 75.89 ± 3.19 (3) | 74.96 ± 2.52 (4) | 76.81 ± 2.60 (2) | 76.42 ± 3.17 (2) | 75.45 ± 3.29 (3) | 79.38 ± 3.27 (1)
Haberman | 73.22 ± 5.76 (3) | 73.31 ± 5.23 (3) | 73.84 ± 4.41 (2) | 73.21 ± 1.71 (2) | 73.75 ± 5.36 (2) | 73.87 ± 3.49 (2) | 74.62 ± 3.54 (1)
Heart | 83.48 ± 3.28 (2) | 82.78 ± 2.87 (3) | 81.96 ± 3.08 (4) | 82.84 ± 3.37 (3) | 83.21 ± 3.19 (2) | 84.23 ± 2.74 (1) | 84.96 ± 3.07 (1)
Ionosphere | 93.74 ± 2.20 (3) | 93.87 ± 2.20 (3) | 94.44 ± 2.15 (2) | 94.52 ± 1.69 (2) | 93.93 ± 2.31 (3) | 95.55 ± 1.67 (1) | 95.14 ± 2.20 (2)
Liver | 71.42 ± 4.97 (3) | 71.87 ± 3.97 (3) | 73.04 ± 2.91 (1) | 72.66 ± 3.69 (2) | 71.25 ± 3.24 (3) | 73.34 ± 2.89 (1) | 72.83 ± 3.40 (2)
Musk 2 | 92.13 ± 2.76 (2) | 92.51 ± 2.68 (1) | 92.44 ± 1.80 (1) | 92.10 ± 2.66 (2) | 92.34 ± 2.61 (2) | 92.79 ± 2.09 (1) | 93.10 ± 1.88 (1)
Parkinsons | 88.21 ± 4.67 (5) | 88.39 ± 4.01 (5) | 90.34 ± 3.40 (3) | 90.29 ± 2.66 (3) | 89.32 ± 4.25 (4) | 91.42 ± 3.43 (2) | 92.47 ± 3.60 (1)
Pima | 76.53 ± 2.07 (2) | 76.76 ± 2.86 (2) | 75.78 ± 2.95 (3) | 76.28 ± 2.39 (2) | 74.89 ± 3.63 (3) | 76.81 ± 2.41 (2) | 77.74 ± 2.60 (1)
Sonar | 85.03 ± 4.10 (3) | 85.69 ± 4.47 (2) | 88.06 ± 4.27 (1) | 87.63 ± 4.01 (2) | 86.24 ± 4.92 (3) | 87.15 ± 3.29 (2) | 88.39 ± 3.54 (1)
Vertebral column | 83.90 ± 3.64 (2) | 84.13 ± 4.25 (2) | 82.65 ± 2.46 (3) | 83.23 ± 2.85 (3) | 83.06 ± 2.78 (3) | 84.08 ± 3.60 (1) | 84.80 ± 3.16 (1)
WDBC | 96.94 ± 0.91 (1) | 97.06 ± 0.83 (1) | 97.03 ± 1.28 (1) | 97.32 ± 1.01 (1) | 97.29 ± 1.13 (1) | 97.15 ± 1.13 (1) | 97.12 ± 1.16 (1)
Wholesale customers | 90.73 ± 2.00 (2) | 90.89 ± 2.52 (2) | 90.78 ± 2.06 (2) | 90.58 ± 1.45 (2) | 90.65 ± 2.82 (2) | 91.59 ± 1.96 (1) | 92.22 ± 1.68 (1)
Average rank | 2.5 | 2.5 | 2.25 | 2.17 | 2.5 | 1.5 | 1.16
Table 4. F-measure table. Performance rank is given in brackets.
Dataset | SMO SVM | SimpleMKL | TSMKL | LpMKL | GMKL | MKLCKF:I | MKLCKF:II
Arrhythmia | 79.71 ± 3.02 (2) | 77.91 ± 3.24 (4) | 77.17 ± 2.26 (4) | 77.93 ± 1.89 (4) | 79.23 ± 3.42 (2) | 78.53 ± 3.02 (3) | 82.11 ± 2.70 (1)
Haberman | 83.41 ± 4.05 (3) | 83.11 ± 5.01 (3) | 84.08 ± 2.87 (2) | 84.11 ± 1.26 (2) | 83.54 ± 4.15 (3) | 84.43 ± 2.31 (2) | 85.42 ± 2.33 (1)
Heart | 81.28 ± 4.17 (2) | 81.13 ± 3.45 (2) | 80.10 ± 4.10 (3) | 80.87 ± 4.12 (2) | 80.73 ± 3.57 (2) | 82.01 ± 3.36 (1) | 82.90 ± 3.62 (1)
Ionosphere | 95.20 ± 1.73 (2) | 95.26 ± 1.75 (2) | 95.72 ± 1.72 (2) | 95.11 ± 1.28 (2) | 95.38 ± 1.53 (2) | 96.57 ± 1.29 (1) | 96.08 ± 1.85 (2)
Liver | 76.79 ± 4.42 (3) | 77.13 ± 4.12 (3) | 78.42 ± 3.09 (1) | 77.17 ± 3.39 (3) | 77.21 ± 4.22 (3) | 78.78 ± 2.89 (1) | 77.98 ± 2.96 (2)
Musk 2 | 90.91 ± 3.46 (2) | 90.66 ± 3.10 (2) | 91.52 ± 1.77 (1) | 90.89 ± 3.01 (2) | 90.67 ± 3.06 (2) | 91.61 ± 2.38 (1) | 91.94 ± 2.31 (1)
Parkinsons | 92.53 ± 3.18 (2) | 91.98 ± 2.65 (3) | 92.69 ± 2.29 (2) | 91.91 ± 1.76 (3) | 91.83 ± 2.58 (3) | 94.54 ± 2.30 (1) | 94.93 ± 2.47 (1)
Pima | 62.37 ± 3.42 (2) | 62.61 ± 4.09 (2) | 61.46 ± 4.50 (3) | 61.38 ± 4.22 (3) | 62.29 ± 3.92 (2) | 62.54 ± 4.89 (2) | 65.41 ± 3.56 (1)
Sonar | 83.24 ± 5.12 (4) | 83.77 ± 5.36 (4) | 86.80 ± 4.71 (2) | 85.31 ± 4.31 (3) | 85.36 ± 4.62 (3) | 85.98 ± 3.90 (3) | 87.77 ± 3.55 (1)
Vertebral column | 87.99 ± 3.64 (1) | 87.81 ± 3.98 (1) | 87.07 ± 2.07 (2) | 87.63 ± 2.14 (1) | 87.27 ± 2.83 (2) | 88.37 ± 2.96 (1) | 88.66 ± 2.23 (1)
WDBC | 97.14 ± 0.76 (1) | 97.17 ± 0.64 (1) | 97.64 ± 1.04 (1) | 97.53 ± 0.79 (1) | 97.23 ± 0.60 (1) | 97.84 ± 0.92 (1) | 97.73 ± 0.91 (1)
Wholesale customers | 93.10 ± 1.55 (2) | 93.24 ± 1.87 (2) | 93.12 ± 1.58 (2) | 93.07 ± 1.07 (2) | 93.56 ± 2.37 (2) | 93.82 ± 1.37 (2) | 94.82 ± 1.36 (1)
Average rank | 2.16 | 2.41 | 2.08 | 2.33 | 2.25 | 1.58 | 1.16
Table 5. RMSE using all data points. Performance rank is given in brackets.
Dataset | SVR | SimpleMKL (SVR) | MKLCKF:I (SVR) | MKLCKF:II (SVR)
Airfoil self noise | 3.83287 ± 0.20978 (2) | 3.40593 ± 0.32411 (1) | 3.53291 ± 0.29307 (1) | 3.47891 ± 0.31476 (1)
Commun. and crime | 5.79657 ± 0.29028 (1) | 5.86840 ± 0.29237 (1) | 5.80437 ± 0.31056 (1) | 5.81379 ± 0.31648 (1)
Concrete slump test | 6.48337 ± 0.45852 (2) | 6.09983 ± 0.52536 (1) | 6.16865 ± 0.38802 (1) | 6.08913 ± 0.36248 (1)
Energy Eff. Cool | 1.33792 ± 0.10755 (2) | 1.23957 ± 0.10164 (1) | 1.25763 ± 0.10176 (1) | 1.24912 ± 0.11345 (1)
Energy Eff. Heat | 2.40471 ± 0.21294 (2) | 1.40673 ± 0.13337 (2) | 1.34312 ± 0.14548 (1) | 1.36918 ± 0.13421 (1)
Average rank | 1.8 | 1.2 | 1 | 1
4.1.1. Model evaluation

We used a 30-times holdout technique for model evaluation. Performance was assessed using root mean squared error (RMSE) for regression, and accuracy and F-measure for classification. A $t$-test was performed over the 30-times holdout results to verify the statistical significance of the results (significance level $\alpha = 0.1$). We assigned ranks to the models for their performance on each dataset using the following strategy: let $M_1$ and $M_2$ be two models, and let $P_1$ and $P_2$ be the values of a performance measure $P$ for a given dataset $D$ corresponding to $M_1$ and $M_2$ respectively. Then we say that $M_1$ is better than $M_2$ on the basis of $P$ on $D$ if $P_1 > P_2$ and the difference between $P_1$ and $P_2$ is statistically significant. We also computed the average rank of all the models corresponding to each performance measure used.
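As an illustration of this criterion, a minimal SciPy sketch of the pairwise comparison over the 30 holdout scores might look as follows (for error measures such as RMSE the inequality would be reversed):

```python
import numpy as np
from scipy.stats import ttest_ind

def is_better(scores_m1, scores_m2, alpha=0.1):
    """M1 beats M2 on this dataset if its mean score is higher and the
    difference over the 30 holdout repetitions is significant at level alpha."""
    _, p_value = ttest_ind(scores_m1, scores_m2)
    return scores_m1.mean() > scores_m2.mean() and p_value < alpha
```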
Fig. 3. Accuracy graph for twonorm data.

4.2. Classification experiments

We selected both 'not large' and large classification data for evaluating the performance of the models. For the not large data, the accuracy results are shown in Table 3 and the F-measure results in Table 4. The rank is displayed in brackets in each cell of Tables 3 and 4. On the basis of average rank, MKLCKF: II scored the highest rank, followed by MKLCKF: I. Thus for these classification experiments, the proposed models' performance was better than that of the other state-of-the-art techniques.
4.2.1. Analysis on large datasets

MKL analysis over the ringnorm and twonorm datasets ended in an out-of-memory error. We analyzed the performance on these datasets by applying the dictionary learning algorithm of Jiang et al. (2013). We followed a 30-times holdout approach for fixing the size of the dictionary. The dictionary points thus obtained are used as the fixed points in Eq. (14); that is, if $d_j$, $j = 1, \dots, N_1$, are the dictionary atoms, then (14) becomes

$$k(x_k, x_l) = \sum_{j=1}^{N_1} \eta_j(x_k)\, k^*\big(\tilde{K}(d_j, x_k), \tilde{K}(d_j, x_l)\big)\, \eta_j(x_l). \qquad (16)$$

The analysis results are plotted in Figs. 3 and 4. The superior performance of the MKLCKF models is evident from these results.

Fig. 4. Accuracy graph for ringnorm data.

4.2.2. Analysis on image datasets

Apart from the datasets described above, we also did analysis using image datasets. Using Caltech101 (Li et al., 2003) and the UIUC sports scene classification dataset (Li and Fei-Fei, 2007), the performance of MKLCKF: II was verified; since MKLCKF: I is computationally expensive, we restricted the evaluation to MKLCKF: II only. The Caltech101 dataset consists of 101 classes and a total of 3131 images, whereas the UIUC dataset consists of 1579 images. In this case, we followed 5-fold cross validation for comparing the performance of the models. A deep convolutional neural network (Krizhevsky et al., 2012) was used for extracting the features. We used a pretrained CNN, trained over ImageNet, from Vedaldi and Fulkerson (2008). As the CNN was trained over the whole of ImageNet, a feature selection algorithm was applied over the features extracted from the CNN; that is, the penultimate layer output of the CNN was fed to a feature selection algorithm. The feature selection algorithm we used was random forest, using the scikit-learn library (Pedregosa et al., 2011). The result over Caltech101 is shown in Fig. 5, while the results of the experiment on UIUC sports scene classification are shown in Fig. 6. MKLCKF: II performed well on the image data.

Fig. 5. Caltech101 image classification results.

Fig. 6. UIUC sports scene classification results.
4.2.3. Application in kernelized locality-sensitive hashing

We also chose an application domain for assessing the performance of the models. Using MKLCKF: II we performed kernelized locality-sensitive hashing (KLSH) (Grauman and Kulis, 2011) on the image data described in Section 4.2.2. A KNN based approach was used for evaluating the performance of KLSH. We compared the proposed model MKLCKF: II with the single kernel approach and the TSMKL approach (Kumar et al., 2012) only, as the other models need theoretical modification to be applied to KLSH. The accuracy results for the experiments are shown in Figs. 7 and 8; the graphs plot the mean accuracy and its standard deviation for the different models over 30 iterations of the holdout approach. MKLCKF: II showed good results in this analysis also.

Fig. 7. Caltech101 image classification using KNN-LSH results.

Fig. 8. UIUC sports scene classification using KNN-LSH results.

4.3. Regression experiments

Regression experiments were also carried out over large as well as not large data. In the case of regression experiments over large data, we used the pre-clustering algorithm for data compression. The hyperparameters of the pre-clustering algorithm are the $\epsilon$ value and the step length $h$. In order to optimize the $\epsilon$ value, we took a small part of the data from the dataset and applied cross validation over different values of $\epsilon$ from $\{0.1, 0.11, \dots, 1\}$. Based on the mean squared error performance as well as the compression rate for each $\epsilon$ value, we chose the $\epsilon$ which gave comparable MSE (within a threshold range) and maximum compression. Rather than picking a small subset from the dataset randomly, we constructed a ball $B(p, r)$ and took the points falling in this ball as the subset, where $p$ is a random point and $r$ is the radius of the ball. We adjusted the radius so as to obtain an optimal number of data points in the subset. The optimal value of $h$ is also found using a subset of the data.
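A sketch of this $\epsilon$ selection, reusing the pre_cluster sketch from Section 2.2.1, is given below; the MSE-comparability threshold and the fit_and_score callback (which is assumed to return the cross-validated MSE of a model trained on the pre-clustered points) are assumptions.

```python
import numpy as np

def choose_epsilon(X_sub, y_sub, fit_and_score, delta0, h, mse_slack=0.05):
    """Pick the epsilon giving maximum compression among those whose
    cross-validated MSE is within (1 + mse_slack) of the best MSE."""
    candidates = np.arange(0.1, 1.001, 0.01)          # epsilon in {0.1, 0.11, ..., 1}
    results = []
    for eps in candidates:
        centers = pre_cluster(X_sub, y_sub, eps, delta0, h)
        results.append((eps, len(centers), fit_and_score(centers)))
    best_mse = min(r[2] for r in results)
    comparable = [r for r in results if r[2] <= (1 + mse_slack) * best_mse]
    return min(comparable, key=lambda r: r[1])[0]     # fewest centers = max compression
```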
4.3.1. Results and discussion

We assessed the performance of the algorithms on the 'not large' data; the results are given in Table 5. We also analyzed the 'not large' data using the pre-clustering approach. Fig. 10 shows the level of compression achieved using the pre-clustering algorithm, calculated on the basis of the average number of pre-clustered points obtained in each iteration during holdout validation, while Table 6 shows the RMSE of each model with the pre-clustered points. We performed statistical significance tests as well to analyze the results of Tables 5 and 6. For each dataset, the models were assigned a rank on the basis of their performance using the criteria discussed in Section 4.1.1, and the average rank corresponding to each model was calculated. MKLCKF: II scored the highest rank in these experiments; MKLCKF: I also showed better performance than the other models we used for comparison.

Fig. 10. Data compression ratio.

Table 6. RMSE using pre-clustered data points. Performance rank is given in brackets.
Dataset | SVR | SimpleMKL (SVR) | MKLCKF:I (SVR) | MKLCKF:II (SVR)
Airfoil self noise | 4.14588 ± 0.30715 (3) | 3.96222 ± 0.39076 (2) | 3.52020 ± 0.20978 (1) | 3.49316 ± 0.21016 (1)
Commun. and crime | 6.45198 ± 0.49322 (2) | 6.53542 ± 0.49109 (2) | 6.16050 ± 0.45235 (1) | 6.09381 ± 0.41351 (1)
Concrete slump test | 6.54374 ± 0.42661 (2) | 6.35594 ± 0.55881 (1) | 6.12909 ± 0.43515 (1) | 6.13249 ± 0.41277 (1)
Energy Eff. Cool | 2.56802 ± 0.16839 (3) | 2.54866 ± 0.19781 (3) | 2.41040 ± 0.18034 (2) | 2.29346 ± 0.17199 (1)
Energy Eff. Heat | 2.59222 ± 0.22298 (3) | 2.57326 ± 0.22317 (3) | 2.47445 ± 0.18807 (2) | 2.26533 ± 0.16736 (1)
Average rank | 2.6 | 2.2 | 1.4 | 1

4.3.2. Analysis on large datasets

We analyzed the performance on the large regression datasets using the data compression approach alone, since the other approaches fail with an out-of-memory error. With the aid of pre-clustering we successfully applied the proposed models over the large datasets. For all the large datasets we used, the MKLCKF models showed superior performance over the other models (Fig. 9). Pre-clustering helps to remove the redundant points, and hence the proposed model works more efficiently with the informative data points.

Fig. 9. Root mean squared error for large datasets over all models.

5. Conclusion

In this paper we introduced MKLCKF, which uses composite kernel functions for MKL. Using each data point as a reference, we designed composite kernels in such a way that each of them consists of two functions, in which the first function makes use of all the $P$ base kernels under consideration and the second is a single valid kernel function. The optimal kernel is then represented as a linear combination of $N$ kernels. We formulated two versions of MKLCKF and assessed their performance on classification as well as regression problems. For application to large datasets, that is, data for which the normal methods produce an out-of-memory error on the machines we used, we introduced supervised pre-clustering for regression and data dictionary methods for classification for finding the vital points. The models we proposed showed superior performance on large datasets in comparison with other state-of-the-art techniques.
References
Asuncion, A., Newman, D., 2007. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bennett, K.P., Momma, M., Embrechts, M.J., 2002. MARK: a boosting algorithm for heterogeneous kernel models. In: Proceedings KDD-2002: Knowledge Discovery and Data Mining, pp. 24–31.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT'92. ACM, New York, NY, USA, pp. 144–152. URL http://doi.acm.org/10.1145/130385.130401.
Cortes, C., Mohri, M., Rostamizadeh, A., 2009. Learning non-linear combinations of kernels. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22, pp. 396–404.
de Diego, I., Muñoz, A., Moguerza, J., 2010. Methods for the combination of kernel matrices within a support vector framework. Mach. Learn. 78 (1–2), 137–174. URL http://dx.doi.org/10.1007/s10994-009-5135-5.
Dileep, A.D., Sekhar, C., 2009. Representation and feature selection using multiple kernel learning. In: Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pp. 717–722.
Gönen, M., Alpaydın, E., 2013. Localized algorithms for multiple kernel learning. Pattern Recognit. 46 (3), 795–807.
Grauman, K., Kulis, B., 2011. Kernelized locality-sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1092–1104.
Hsu, C., Lee, W.S. (Eds.), 2011. Proceedings of the 3rd Asian Conference on Machine Learning, Taoyuan, Taiwan, November 13–15, 2011. JMLR Proceedings, vol. 20, JMLR.org. URL http://jmlr.org/proceedings/papers/v20/.
Jain, A., Vishwanathan, S.V.N., Varma, M., 2012. SPG-GMKL: generalized multiple kernel learning with a million kernels. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Jiang, Z., Lin, Z., Davis, L.S., 2013. Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (11), 2651–2664.
Kimeldorf, G., Wahba, G., 1971. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 (1), 82–95.
Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., 2011. lp-norm multiple kernel learning. J. Mach. Learn. Res. 12, 953–997.
Kreßel, U., 1999. Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (Eds.), Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA, pp. 255–268.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems.
Kumar, A., Niculescu-Mizil, A., Kavukcuoglu, K., Daumé III, H., 2012. A binary classification framework for two-stage multiple kernel learning. arXiv e-prints arXiv:1206.6428.
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I., 2004. Learning the kernel matrix with semi-definite programming. J. Mach. Learn. Res. 5, 27–72.
Li, F.-F., Andreetto, M., Ranzato, M.A., 2003. Caltech101 image dataset. URL http://www.vision.caltech.edu/Image_Datasets/Caltech101/.
Li, L.-J., Fei-Fei, L., 2007. UIUC sports event dataset. In: IEEE International Conference on Computer Vision (ICCV). URL http://vision.stanford.edu/lijiali/event_dataset/.
Nair, S.S., Dodd, T.J., 2015. Supervised pre-clustering for sparse regression. Int. J. Syst. Sci. 46 (7), 1161–1171. http://dx.doi.org/10.1080/00207721.2013.811312.
Pavlidis, P., Weston, J., Cai, J., Grundy, W.N., 2001. Gene functional classification from heterogeneous data. In: Proceedings of the Fifth Annual International Conference on Computational Biology, RECOMB'01. ACM, New York, NY, USA, pp. 249–255.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Picard, D., Thome, N., Cord, M., 2013. JKernelMachines. URL http://mloss.org/software/view/409/.
Pozdnoukhov, A., 2002. The Analysis of Kernel Ridge Regression Learning Algorithm. Idiap-RR-54-2002. IDIAP, Martigny, Switzerland.
Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y., 2008. SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521.
Schölkopf, B., Herbrich, R., Smola, A., 2001. A generalized representer theorem. In: Helmbold, D., Williamson, B. (Eds.), Computational Learning Theory. Lecture Notes in Computer Science, vol. 2111. Springer, Berlin Heidelberg, pp. 416–426.
Schölkopf, B., Smola, A., Müller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 (5), 1299–1319. URL http://dx.doi.org/10.1162/089976698300017467.
Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA.
Sonnenburg, S., Ong, C.S., Henschel, S., Braun, M., 2011. mldata.org. URL https://mldata.org/repository/data/.
Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B., 2006. Large scale multiple kernel learning. J. Mach. Learn. Res. 7, 1531–1565.
Varma, M., Babu, B., 2009a. GMKL code. URL http://research.microsoft.com/en-us/um/people/manik/code/GMKL/gmkl.tgz.
Varma, M., Babu, B., 2009b. More generality in efficient multiple kernel learning. In: Proceedings of the International Conference on Machine Learning, pp. 1065–1072.
Vedaldi, A., Fulkerson, B., 2008. VLFeat: An Open and Portable Library of Computer Vision Algorithms. http://www.vlfeat.org/matconvnet/pretrained/.
Wang, S., Huang, Q., Jiang, S., Tian, Q., 2012. S³MKL: scalable semi-supervised multiple kernel learning for real-world image applications. IEEE Trans. Multimed. 14 (4), 1259–1274.
Xu, X., 2016. LibMKL. URL https://sites.google.com/site/xinxingxu666/LibMKL_14-05-06.rar?attredirects=0.
Yeh, Y.-R., Lin, T.-C., Chung, Y.-Y., Wang, Y.-C., 2012. A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection. IEEE Trans. Multimed. 14 (3), 563–574.