Accepted Manuscript

A New Support Vector Machine with an Optimal Additive Kernel
Jeonghyun Baek, Euntai Kim

PII: S0925-2312(18)31220-7
DOI: https://doi.org/10.1016/j.neucom.2018.10.032
Reference: NEUCOM 20054
To appear in: Neurocomputing
Received date: 20 October 2017
Revised date: 31 August 2018
Accepted date: 12 October 2018

Please cite this article as: Jeonghyun Baek, Euntai Kim, A New Support Vector Machine with an Optimal Additive Kernel, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.10.032

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
A New Support Vector Machine with an Optimal Additive Kernel

Jeonghyun Baek and Euntai Kim
School of Electrical and Electronic Engineering, Yonsei University, Yonsei Street 50, Seoul, 120-749, Korea
E-mail: [email protected]
Abstract

Although the kernel support vector machine (SVM) outperforms the linear SVM, its application to real-world problems is limited because the evaluation of its decision function is computationally very expensive due to kernel expansion. On the other hand, the additive kernel (AK) SVM enables fast evaluation of the decision function using look-up tables (LUTs). AKs, however, assume a specific functional form for the kernel, such as the intersection kernel (IK) or the $\chi^2$ kernel, and are problematic in that their performance is seriously degraded when a given problem is highly nonlinear. To address this issue, an optimal additive kernel (OAK) is proposed in this paper. The OAK does not assume any specific kernel form; instead, the kernel is represented by a quantized Gram table. The training of the OAK SVM is formulated as semi-definite programming (SDP) and solved by convex optimization. In the experiments, the proposed method is tested on 2D synthetic datasets, UCI repository datasets and LIBSVM datasets. The experimental results show that the proposed OAK SVM has better performance than the previous AKs and the RBF kernel while maintaining fast computation using look-up tables.

Keywords: additive kernel, kernel optimization, SVM, nonlinear classification

I. INTRODUCTION
A kernel support vector machine (SVM) is one of the most popular classifiers, and it has been applied to a variety of fields due to its excellent classification performance [1-7][30]. Typical examples of nonlinear kernels include the polynomial kernel and the radial basis function (RBF), or Gaussian, kernel, which definitely outperform the linear SVM. Unfortunately, however, the application of the kernel SVM to real-world problems is limited because evaluating its decision function takes much longer than for the linear SVM due to kernel expansion [8-10].

To solve this problem, some research has been conducted, and the solutions belong to one of three options. The first option is to decrease the computational burden of the kernel SVM by reducing the number of support vectors. This option can be categorized into two approaches: pre-pruning and post-pruning [1]. The pre-pruning methods reduce the training samples and train the SVM with the reduced samples, thereby reducing the number of support vectors. They usually employ an unsupervised clustering method to divide the training data into several clusters; the SVM is then trained using only the samples near the cluster centers of the training data [2], [3], [31]. In addition, some researchers proposed training the SVM using only samples located outside of the clusters in order to use samples located near the decision boundary [8], [9]. On the other hand, the post-pruning methods
exploit the training results of the SVM to reduce the training data. The post-pruning methods select the optimal subset of support vectors such that the performance degradation is minimized. To search for the optimal subset of support vectors, an iterative search [16-18] or a meta-heuristic genetic algorithm [10], [11] is used.

The second option for dealing with the computational burden of the kernel SVM is to combine multiple linear SVMs instead of using the kernel SVM. The training samples are divided into several clusters and multiple linear SVMs are assigned to the clusters as local classifiers. Zhang et al. proposed SVM-KNN, in which a kernel SVM is trained for every test sample using only the nearest k training neighbors of the test sample [12]. Cheng et al. proposed a localized SVM (LSVM) and a profile SVM (PSVM), in which multiple linear SVMs are trained using the similarities between the test and training samples [13]. Ye et al. proposed a piecewise linear SVM (PL-SVM) by training multiple linear SVMs to classify pedestrians having multiple postures [14].
The third option is to use the additive kernel (AK). The AK enables fast evaluation of the decision function by decomposing the evaluation into a summation of dimension-wise kernel computations [15-18, 32, 33]. Recently, Maji et al. proposed an efficient computation method for the AK SVM using look-up tables (LUTs) instead of kernel expansion with support vectors, and showed a significant enhancement in terms of computation time [15]. AKs, however, are problematic in that their performance is seriously degraded when a given problem is highly nonlinear. In this paper, an optimal additive kernel (OAK) is proposed to address this issue. Unlike previous AKs, such as the intersection kernel (IK) or the $\chi^2$ kernel, the proposed OAK does not assume any specific kernel form; instead, the kernel is represented in
terms of a quantized Gram table. The training of the OAK SVM is formulated as semi-definite programming (SDP) and solved by convex optimization.

The remainder of this paper is organized as follows. In Section II, some background on the SVM and additive kernels is reviewed. In Section III, the training of the OAK is formulated as an SDP and solved by convex optimization. In Section IV, the proposed OAK and the previous AKs are applied to synthetic and benchmark problems, and the experimental results are summarized. Finally, some conclusions are drawn in Section V.
II. PRELIMINARY FUNDAMENTALS
2.1. Support Vector Machine (SVM)

Let us assume that a training set $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{L}$ is given, where the parenthesized superscript represents the $i$th data point, $\mathbf{x}^{(i)} \in \mathbb{R}^N$ and $y^{(i)} \in \{1, -1\}$. The SVM is a machine that separates the positive ($y^{(i)} = 1$) and negative ($y^{(i)} = -1$) classes by maximizing the margin between the two classes. When an input vector is represented by a higher-dimensional feature vector $\phi(\mathbf{x})$ and the SVM is represented by

$$f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b, \qquad (1)$$
its training is formulated in the primal space as the following optimization problem:

$$p^* = \min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{L} \xi^{(i)}$$
$$\text{subject to } y^{(i)}\left(\mathbf{w}^T\phi(\mathbf{x}^{(i)}) + b\right) \ge 1 - \xi^{(i)}, \quad \xi^{(i)} \ge 0, \qquad (2)$$
where $\phi(\mathbf{x}) \in \mathbb{R}^D$ and $D \gg N$; $p^*$ denotes the optimal value of Eq. (2); and $C > 0$ controls the tradeoff between the regularization and the empirical risk [19]. Equivalently, the training is represented in the dual space by

$$p^* = \max_{\boldsymbol{\alpha}} \; \mathbf{e}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T\left(\mathbf{y}\mathbf{y}^T \circ \mathbf{K}\right)\boldsymbol{\alpha}$$
$$\text{subject to } \mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{e}, \quad \boldsymbol{\alpha}^T\mathbf{y} = 0, \qquad (3)$$

where $\mathbf{e}$ is an $L$-dimensional vector filled with 1s; $\boldsymbol{\alpha} = [\alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(L)}]^T \in \mathbb{R}^L$ is the dual variable vector; $\circ$ denotes component-wise multiplication; and $\mathbf{K} \in \mathbb{R}^{L \times L}$ is the kernel Gram matrix with components $K_{i,j} = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)})$. When the SVM is trained in the dual space, the decision function can be evaluated by

$$f(\mathbf{x}) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b = \sum_{i \in \mathcal{S}} \alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b, \qquad (4)$$

where $\mathcal{S} = \{i \mid \alpha^{(i)} > 0\}$ denotes the set of support vectors. When nonlinear kernels are used in Eq. (4), the SVM is represented as a kernel expansion over the support vectors and demonstrates good performance. Unfortunately, however, the kernel expansion requires high computation and memory resources for SVM evaluation.
2.2. Additive Kernel SVM

The additive kernel is a nonlinear yet efficient method of classification proposed in [20]. The idea of the additive kernel is to decompose a kernel value into a summation of dimension-wise components. Thus, the additive kernel is defined by
$$\kappa(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \kappa_n(x_n, z_n), \qquad (5)$$

where $\mathbf{x} = [x_1, x_2, \ldots, x_N]^T \in \mathbb{R}^N$ and $\mathbf{z} = [z_1, z_2, \ldots, z_N]^T \in \mathbb{R}^N$. Typical examples of additive kernels include the intersection kernel $\kappa_{IK}$ [17], the generalized intersection kernel $\kappa_{GIK}$ [16], the $\chi^2$ kernel $\kappa_{\chi^2}$ [18] and the linear kernel $\kappa_{LIN}$. They are defined as

$$\kappa_{IK}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \min(x_n, z_n), \qquad (6)$$

$$\kappa_{GIK}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \min(|x_n|, |z_n|), \qquad (7)$$

$$\kappa_{\chi^2}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \frac{2 x_n z_n}{x_n + z_n}, \qquad (8)$$

$$\kappa_{LIN}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} x_n z_n. \qquad (9)$$

The above four additive kernels are visualized for 1-dimensional data in Fig. 1. In the figure, the two axes denote the scalar values of $x$ and $z$, and the value of $\kappa(x, z)$ is visualized in gray: the darker an $(x, z)$ pair is, the higher the kernel value $\kappa(x, z)$ is.
[Figure 1 comprises four panels: (a) LIN, (b) IK, (c) GIK, (d) $\chi^2$.]

Figure 1. Visualization of additive kernels for 1-dimensional data
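For concreteness, the four kernels of Eqs. (6)-(9) can be coded directly from their sums. The following Python sketch is ours (illustrative function names and toy inputs, not code from the paper):

```python
def k_add(x, z, comp):
    """Additive kernel of Eq. (5): a sum of dimension-wise components."""
    return sum(comp(a, c) for a, c in zip(x, z))

def ik(x, z):    # intersection kernel, Eq. (6)
    return k_add(x, z, min)

def gik(x, z):   # generalized intersection kernel, Eq. (7)
    return k_add(x, z, lambda a, c: min(abs(a), abs(c)))

def chi2(x, z):  # chi-squared kernel, Eq. (8)
    return k_add(x, z, lambda a, c: 2 * a * c / (a + c))

def lin(x, z):   # linear kernel, Eq. (9)
    return k_add(x, z, lambda a, c: a * c)

x, z = [0.2, 0.7], [0.5, 0.1]
print(ik(x, z), gik(x, z), chi2(x, z), lin(x, z))
```

Each kernel touches every dimension exactly once, which is precisely the decomposition that the LUT-based evaluation of Eq. (10) exploits.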
From Eqs. (5) to (9), the SVM with additive kernels can be represented by

$$f(\mathbf{x}) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b
= \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \sum_{n=1}^{N} \kappa_n(x_n^{(i)}, x_n) + b
= \sum_{n=1}^{N} \left[ \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa_n(x_n^{(i)}, x_n) \right] + b
= \sum_{n=1}^{N} h_n(x_n) + b, \qquad (10)$$

where $h_n(x_n)$ is defined by

$$h_n(x_n) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa_n(x_n^{(i)}, x_n) = \sum_{i \in \mathcal{S}} \alpha^{(i)} y^{(i)} \kappa_n(x_n^{(i)}, x_n), \qquad (11)$$

and is just a one-dimensional function of $x_n$. Thus, once the SVM of Eq. (3) has been trained and $\alpha^{(i)}$, $y^{(i)}$ and $x_n^{(i)}$ are given, it is worth noting that
(1) the one-dimensional function $h_n(x_n)$ can be precalculated and its output stored in a look-up table (LUT); thus, when a new test point $\mathbf{x}_{test}$ is presented, $h_n(x_n^{test})$ can easily be read out from the LUT by applying simple interpolation;

(2) because the function $h_n(x_n)$ is implemented off-line as an LUT, the number of support vectors does not affect the evaluation cost at all, which is not the case for common kernels.

Once the functions $h_n(x_n)$ are computed, the SVM decision function $f(\mathbf{x}_{test})$ in Eq. (10) can be evaluated as the summation of the $N$ values $h_n(x_n^{test})$.
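To make points (1) and (2) concrete, here is a small Python sketch of the LUT evaluation. It is entirely a toy model of our own: the values of alpha, y, X and b are invented, and the intersection kernel stands in for $\kappa_n$. The sketch precomputes each $h_n$ on a grid and checks the LUT read-out against the direct kernel expansion of Eq. (4):

```python
# Toy "trained" additive-kernel SVM: 3 samples, 2 dimensions (invented numbers).
alpha = [0.5, 0.3, 0.8]
y = [1, -1, 1]
X = [[0.2, 0.7], [0.5, 0.1], [0.9, 0.4]]
b = -0.1

def k_n(u, v):
    """Dimension-wise intersection kernel component (Eq. 6)."""
    return min(u, v)

# Precompute one LUT per dimension: h_n of Eq. (11) sampled on a grid over [0, 1].
GRID = [g / 100.0 for g in range(101)]
luts = [[sum(a * t * k_n(xs[n], g) for a, t, xs in zip(alpha, y, X))
         for g in GRID] for n in range(2)]

def f_lut(x):
    """Evaluate the SVM via nearest-grid LUT reads (Eq. 10)."""
    total = b
    for n, xn in enumerate(x):
        idx = min(range(len(GRID)), key=lambda m: abs(GRID[m] - xn))
        total += luts[n][idx]
    return total

def f_direct(x):
    """Reference: the full kernel expansion of Eq. (4)."""
    return b + sum(a * t * sum(k_n(u, v) for u, v in zip(xs, x))
                   for a, t, xs in zip(alpha, y, X))

print(abs(f_lut([0.33, 0.66]) - f_direct([0.33, 0.66])) < 1e-9)
```

The LUT evaluation costs $N$ table reads per test point, independent of the number of support vectors.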
III. ADDITIVE KERNEL OPTIMIZATION

The additive kernel enables fast evaluation of the SVM regardless of the number of support vectors. For highly nonlinear problems, however, its performance can be degraded relative to non-additive kernels such as polynomial or radial basis function (RBF) kernels. Figure 2 shows the decision boundaries of SVMs using four kinds of additive kernels (linear, intersection, generalized intersection and $\chi^2$ kernels) for a 2-dimensional nonlinear problem. In the figures, the two classes are marked in red and blue.
[Figure 2 comprises four panels of decision boundaries on the unit square: (a) LIN, (b) IK, (c) GIK, (d) $\chi^2$.]

Figure 2. Decision boundaries of SVMs with additive kernels
As shown, all four kinds of additive kernels demonstrate performance degraded from the expected result. In this paper, a new SVM with an optimal additive kernel (OAK) is proposed. The proposed OAK enables much faster evaluation than a non-additive kernel SVM, yet its performance is almost as good as that of the non-additive version due to kernel optimization.
3.1. Basic Idea
For a nonlinear system with a fixed kernel $\kappa(\cdot,\cdot)$, the SVM can be trained by Eq. (2) in the primal space, or equivalently by Eq. (3) in the dual space. It can be observed from Eq. (3) that the optimal value $p^*$ depends on the choice of kernel $\kappa$, and choosing a good kernel decreases the objective function. To exploit the advantages of several kernels, multiple kernel learning (MKL) methods have been proposed [20-22, 34]. The idea of MKL is to simultaneously optimize not only the dual variables $\boldsymbol{\alpha}$ but also the kernel $\kappa(\cdot,\cdot)$ by solving

$$q^* = \min_{\mathbf{K} \succeq 0} \max_{\boldsymbol{\alpha}} \; \mathbf{e}^T\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^T\left(\mathbf{y}\mathbf{y}^T \circ \mathbf{K}\right)\boldsymbol{\alpha}$$
$$\text{such that } \operatorname{trace}(\mathbf{K}) = c, \quad \mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{e}, \quad \boldsymbol{\alpha}^T\mathbf{y} = 0, \qquad (12)$$

where the formulation is a convex problem and $c$ is a constant for the equality constraint on $\operatorname{trace}(\mathbf{K})$ [19]. Equation (12) can be recast into the SDP formulation
$$p^* = \min_{t, \lambda, \boldsymbol{\nu}, \boldsymbol{\delta}, \mathbf{K}} \; t$$
$$\text{such that } \operatorname{trace}(\mathbf{K}) = c, \quad \mathbf{K} \succeq 0, \quad
\begin{bmatrix}
\mathbf{y}\mathbf{y}^T \circ \mathbf{K} & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\
(\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^T & t - 2C\boldsymbol{\delta}^T\mathbf{e}
\end{bmatrix} \succeq 0, \quad \boldsymbol{\nu} \ge 0, \quad \boldsymbol{\delta} \ge 0, \qquad (13)$$

using Schur's complement, where $\mathbf{y} = [y^{(1)}, y^{(2)}, \ldots, y^{(L)}]^T$ is the vector of target values; $\boldsymbol{\nu}, \boldsymbol{\delta} \in \mathbb{R}^L$ and $\lambda \in \mathbb{R}$ are dual variables; $\mathbf{e}$ is the $L$-vector consisting of all 1s; and $\circ$ denotes component-wise multiplication. The SDP formulation of the MKL appears elegant, but its usefulness is not as high as expected because

(1) the number of optimization variables grows rapidly when we have a large number of training samples, because the kernel Gram matrix has a size of $O(L^2)$;

(2) more seriously, Eq. (13) cannot be applied to the evaluation of the SVM when a new test data point is presented, because Eq. (13) returns fixed kernel values $\kappa(\cdot,\cdot)$ only for the training data points.
Interestingly, this is not the case for the additive kernel SVM. The main idea of this paper is to exploit the decomposition property of additive kernels and thereby resolve the above two problems. The Gram matrix $\mathbf{K}$ is parameterized in terms of quantized additive kernel Gram matrices (referred to as quantized tables (QTs) hereafter), and the QTs are optimized by the SDP in Eq. (13). Once the QTs are obtained by the SDP, the dimension-wise function

$$h_n(x_n) = \sum_{l=1}^{L} \alpha^{(l)} y^{(l)} \kappa_n(x_n^{(l)}, x_n)$$

for all possible $x_n$ is stored in a look-up table (LUT). When a new test data point is presented, the classifier

$$f(\mathbf{x}) = \sum_{l=1}^{L} \alpha^{(l)} y^{(l)} \kappa(\mathbf{x}^{(l)}, \mathbf{x}) + b = \sum_{n=1}^{N} h_n(x_n) + b$$

is evaluated using the LUTs. Here, the term
quantized means that each dimension is decomposed into $N_Q$ subintervals and the QT is computed for $N_Q \times N_Q$ pairs, as in a 2-dimensional array. An example of the QT for the linear kernel $\kappa_{LIN}$ is shown in Figure 3.

Figure 3. Quantized additive kernel Gram matrix $Q_n$ according to the linear kernel $\kappa_{LIN}$
In the figure, it is assumed that the QT is for the $n$th element $x_n$, and the input domain for $x_n$ is $[-3, 5]$. If the input domain $[-3, 5]$ is decomposed into $N_Q - 1$ (here, $6 - 1 = 5$) subintervals with a width of 1.6, we obtain the 6 representative values $\{-3.0, -1.4, 0.2, 1.8, 3.4, 5.0\}$ for $x_n$, as shown in Figure 3. Hereafter, let us denote the QT for the $n$th dimensional value $x_n$ by $Q_n$. As an example, let us assume that two new data input points $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)}]^T$ and $\mathbf{x}^{(j)} = [x_1^{(j)}, x_2^{(j)}, \ldots, x_N^{(j)}]^T$ are presented, and that their $n$th elements are $x_n^{(i)} = 2.2$ and $x_n^{(j)} = 4.1$, respectively. The two values are then approximated as $x_n^{(i)} \approx 1.8$ and $x_n^{(j)} \approx 3.4$, respectively, and the $n$th kernel value for the two data points can be read out at the 4th row and the 5th column of the $n$th QT $Q_n$, as shown in Fig. 4:

$$\kappa_n(x_n^{(i)}, x_n^{(j)}) = Q_n[4, 5] = 0.48, \qquad (14)$$

where the bracket is an array notation and only positive integers can be used as arguments.
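This snapping step can be sketched in a few lines of Python (the grid is the example grid above; the helper name m is ours):

```python
# Representative values for dom(x_n) = [-3, 5] with NQ = 6, as in the example.
reps = [-3.0, -1.4, 0.2, 1.8, 3.4, 5.0]

def m(x):
    """Index of the nearest representative value, 1-based as in Eq. (14)."""
    return 1 + min(range(len(reps)), key=lambda i: abs(x - reps[i]))

# 2.2 snaps to 1.8 (index 4) and 4.1 snaps to 3.4 (index 5), so the kernel
# value is read at Q_n[4, 5].
print(m(2.2), m(4.1))
```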
Figure 4. Obtaining the kernel value according to the linear kernel $\kappa_{LIN}$

In a similar way, $\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ will be represented as a summation of elements of the $Q_n$'s, each coming from a different $Q_n$. The idea of this paper is to encode the elements of the $Q_n$'s into the full Gram matrix $\mathbf{K}$, and the $Q_n$'s are then optimized by the SDP. Figure 5 shows an example of the SDP optimization process using the Gram matrix $\mathbf{K}$ and the QTs $Q_n$.

Figure 5. Example of the SDP optimization process using the QTs $Q_n$

Once the $Q_n$'s are computed, the decision function

$$h_n(x_n) = \sum_{l=1}^{L} \alpha^{(l)} y^{(l)} \kappa_n(x_n^{(l)}, x_n)$$

for all possible $x_n$ is stored in a look-up table (LUT). Figure 6 shows an example of making the LUT of $h_1(x_1)$.

Figure 6. Example of making the LUT of $h_1(x_1)$

Using these LUTs of $h_n(x_n)$, the decision function

$$f(\mathbf{x}) = \sum_{l=1}^{L} \alpha^{(l)} y^{(l)} \kappa(\mathbf{x}^{(l)}, \mathbf{x}) + b = \sum_{n=1}^{N} h_n(x_n) + b$$

can be computed from the LUTs when new test points are presented. For the sake of simplicity, the additive kernel will first be optimized for the 2-dimensional case in Subsection 3.2, and then the idea will be generalized to the higher-dimensional case in Subsection 3.3.
3.2. Kernel Optimization: Simple Case

Let us assume that a training set is given and that it consists of 5 samples as follows:

$$\left\{ \left( \begin{bmatrix} -2.5 \\ 0.2 \end{bmatrix}, 1 \right), \left( \begin{bmatrix} 0.5 \\ 1 \end{bmatrix}, 1 \right), \left( \begin{bmatrix} 3.4 \\ 1.97 \end{bmatrix}, -1 \right), \left( \begin{bmatrix} 0.5 \\ -1 \end{bmatrix}, -1 \right), \left( \begin{bmatrix} 2.12 \\ -0.3 \end{bmatrix}, -1 \right) \right\}. \qquad (15)$$

From left to right, the training data points are numbered from 1 to 5. For example, $\mathbf{x}^{(2)} = [0.5, 1]^T$ and $y^{(2)} = 1$. Assume that the domains of $x_1$ and $x_2$ are $[-3, 5]$ and $[-2, 2]$, respectively. When $N_Q$ is set to six for both variables, the two QTs $Q_1$ and $Q_2$ are given as in Fig. 7.

Figure 7. Two-dimensional QTs for $N_Q = 6$
First, we fill out the full kernel Gram matrix $\mathbf{K}$ as a summation of the elements of the $Q_n$'s. For example,

$$K[1,1] = \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) = \kappa_1(x_1^{(1)}, x_1^{(1)}) + \kappa_2(x_2^{(1)}, x_2^{(1)}) = \kappa_1(-2.5, -2.5) + \kappa_2(0.2, 0.2) = Q_1[1,1] + Q_2[4,4], \qquad (16)$$

because $-2.5$ is approximated as $-3.0$ in $x_1$ and $0.2$ is approximated as $0.4$ in $x_2$. This is illustrated in Figure 8.

Figure 8. Obtaining the kernel value $\kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(1)})$ from the $Q_n$'s

In the same way,

$$K[1,2] = \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \kappa_1(x_1^{(1)}, x_1^{(2)}) + \kappa_2(x_2^{(1)}, x_2^{(2)}) = \kappa_1(-2.5, 0.5) + \kappa_2(0.2, 1) = Q_1[1,3] + Q_2[4,5], \qquad (17)$$

because $-2.5$ and $0.5$ are approximated as $-3.0$ and $0.2$, respectively, in $x_1$, and $0.2$ and $1$ are approximated as $0.4$ and $1.2$, respectively, in $x_2$. In the same way, the other kernel values will be parameterized in terms of the elements of the $Q_n$'s as
$$K[1,3] = Q_1[1,4] + Q_2[4,6], \quad K[1,4] = Q_1[1,3] + Q_2[4,2], \quad K[1,5] = Q_1[1,4] + Q_2[4,3],$$
$$K[2,2] = Q_1[3,3] + Q_2[5,5], \quad K[2,3] = Q_1[3,4] + Q_2[5,6], \quad K[2,4] = Q_1[3,3] + Q_2[5,2],$$
$$K[2,5] = Q_1[3,4] + Q_2[5,3], \quad \ldots, \quad K[5,5] = Q_1[4,4] + Q_2[3,3]. \qquad (18)$$

The full kernel Gram matrix $\mathbf{K}$ is then parameterized as

$$\mathbf{K} = \begin{bmatrix}
Q_1[1,1] + Q_2[4,4] & Q_1[1,3] + Q_2[4,5] & Q_1[1,4] + Q_2[4,6] & Q_1[1,3] + Q_2[4,2] & Q_1[1,4] + Q_2[4,3] \\
* & Q_1[3,3] + Q_2[5,5] & Q_1[3,4] + Q_2[5,6] & Q_1[3,3] + Q_2[5,2] & Q_1[3,4] + Q_2[5,3] \\
* & * & Q_1[4,4] + Q_2[6,6] & Q_1[4,3] + Q_2[6,2] & Q_1[4,4] + Q_2[6,3] \\
* & * & * & Q_1[3,3] + Q_2[2,2] & Q_1[3,4] + Q_2[2,3] \\
* & * & * & * & Q_1[4,4] + Q_2[3,3]
\end{bmatrix}. \qquad (19)$$

Here, it should be noted that the kernel Gram matrix $\mathbf{K}$ is symmetric, so it does not need to be fully encoded: encoding the diagonal and the upper triangular part is enough, and $*$ in Eq. (19) denotes that the corresponding element follows by symmetry. If the full Gram matrix $\mathbf{K}$ in Eq. (19) is substituted into Eq. (13) and the SDP is optimized, the optimal $Q_n$'s are obtained, as in Figure 9.

Figure 9. Optimized $Q_n$'s

In the figure, the elements of the $Q_n$'s are displayed to three decimal places. The reason why many 0s are found is that only five training samples are used in this explanation. The two 1-dimensional functions $h_1(x_1)$ and $h_2(x_2)$ can then be implemented from the $Q_n$'s and the
training results of $\boldsymbol{\alpha} = [0.08, 0.08, 0.0028, 0.08, 0.08]^T$ and $b = -1.03$. Figure 10 shows the LUTs of $h_1(x_1)$ and $h_2(x_2)$.

Figure 10. LUTs of $h_n(x_n)$

When a new test data point $\mathbf{x}_{test}$ is presented, the decision function can be computed from the above two functions given in Figure 10. For example, when $\mathbf{x}_{test} = [2, 1.5]^T$, $h_n(x_n^{test})$ can simply be computed by retrieving values from the LUTs, as in Fig. 11.

Figure 11. Computing $h_n(x_n)$ using the LUTs for $\mathbf{x}_{test} = [2, 1.5]^T$

Finally, the SVM with the optimal additive kernel can be evaluated as

$$f(\mathbf{x}_{test}) = h_1(x_1^{test}) + h_2(x_2^{test}) + b = h_1(2) + h_2(1.5) + b = 1.68 + 0 - 1.03 = 0.65. \qquad (20)$$
3.3. Kernel Optimization: General Case

In this subsection, the above idea is generalized to a higher-dimensional system. First, it is assumed that the dimension of the input vector $\mathbf{x}$ is $N$, and that the domain of each variable, $\operatorname{dom}(x_n)$, is given for $n = 1, 2, \ldots, N$. Furthermore, it is assumed that a set of $N_Q$ representative values is selected from the domain of each variable, such that

$$\mathcal{X}_1 = \{x_{1,1}, x_{1,2}, x_{1,3}, \ldots, x_{1,N_Q}\} \subset \operatorname{dom}(x_1),$$
$$\mathcal{X}_2 = \{x_{2,1}, x_{2,2}, x_{2,3}, \ldots, x_{2,N_Q}\} \subset \operatorname{dom}(x_2),$$
$$\vdots$$
$$\mathcal{X}_N = \{x_{N,1}, x_{N,2}, x_{N,3}, \ldots, x_{N,N_Q}\} \subset \operatorname{dom}(x_N), \qquad (21)$$

where $x_{n,1} < x_{n,2} < \cdots < x_{n,N_Q}$. Let us then define an index function

$$m_n : \mathbb{R} \to \{1, 2, 3, \ldots, N_Q\} \qquad (22)$$

such that

$$m_n(x_n) = \arg\min_{m} |x_n - x_{n,m}|. \qquad (23)$$
In general, when two new data input points $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ are given, their kernel value can be computed from the $N$ QTs by

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \sum_{n=1}^{N} \kappa_n(x_n^{(i)}, x_n^{(j)}) = \sum_{n=1}^{N} Q_n[m_n^{(i)}, m_n^{(j)}], \qquad (24)$$

where $m_n^{(i)} = m_n(x_n^{(i)})$ and $m_n^{(j)} = m_n(x_n^{(j)})$. As in the simple case, the full kernel Gram matrix $\mathbf{K}$ is parameterized by

$$\mathbf{K} = \begin{bmatrix}
\sum_{n=1}^{N} Q_n[m_n^{(1)}, m_n^{(1)}] & \sum_{n=1}^{N} Q_n[m_n^{(1)}, m_n^{(2)}] & \cdots & \sum_{n=1}^{N} Q_n[m_n^{(1)}, m_n^{(L)}] \\
* & \sum_{n=1}^{N} Q_n[m_n^{(2)}, m_n^{(2)}] & \cdots & \sum_{n=1}^{N} Q_n[m_n^{(2)}, m_n^{(L)}] \\
\vdots & \vdots & \ddots & \vdots \\
* & * & \cdots & \sum_{n=1}^{N} Q_n[m_n^{(L)}, m_n^{(L)}]
\end{bmatrix}. \qquad (25)$$
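This parameterization can be sketched as follows in Python (toy grids and points; the QTs here are simply filled with the linear kernel of the representative values, as in Figure 3, standing in for the SDP-optimized tables):

```python
NQ, N = 4, 2
grids = [[j / (NQ - 1) for j in range(NQ)] for _ in range(N)]  # reps on [0, 1]

# Stand-in QTs: Q_n[a][b] = linear kernel of the representative values.
Q = [[[grids[n][a] * grids[n][b] for b in range(NQ)] for a in range(NQ)]
     for n in range(N)]

X = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.6]]                       # L = 3 points

def m(x, reps):
    """Index function of Eq. (23) (0-based here)."""
    return min(range(len(reps)), key=lambda i: abs(x - reps[i]))

def kappa(xi, xj):
    """Eq. (24): one QT read per dimension, summed."""
    return sum(Q[n][m(xi[n], grids[n])][m(xj[n], grids[n])] for n in range(N))

# Eq. (25): the full Gram matrix, symmetric because each Q_n is symmetric.
K = [[kappa(xi, xj) for xj in X] for xi in X]
print(all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(3) for j in range(3)))
```

Because $\mathbf{K}$ is linear in the QT entries, the SDP can treat the $Q_n$ entries themselves as the optimization variables.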
If the full Gram matrix $\mathbf{K}$ in Eq. (25) is substituted into Eq. (13), the SDP

$$p^* = \min_{t, \lambda, \boldsymbol{\nu}, \boldsymbol{\delta}, Q_n} \; t$$
$$\text{such that } \operatorname{trace}(\mathbf{K}(Q_n)) = c, \quad \mathbf{K}(Q_n) \succeq 0, \quad
\begin{bmatrix}
\mathbf{y}\mathbf{y}^T \circ \mathbf{K}(Q_n) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\
(\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^T & t - 2C\boldsymbol{\delta}^T\mathbf{e}
\end{bmatrix} \succeq 0, \quad \boldsymbol{\nu} \ge 0, \quad \boldsymbol{\delta} \ge 0 \qquad (26)$$

is optimized and the $N$ optimal $Q_n$'s are obtained. The $N$ 1-dimensional functions $h_n(x_n)$ can then be evaluated from the $Q_n$'s by

$$h_n(x_n) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa_n(x_n^{(i)}, x_n) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} Q_n[m_n^{(i)}, m_n], \qquad (27)$$

where $m_n^{(i)} = m_n(x_n^{(i)})$ and $m_n = m_n(x_n)$, and they can be implemented as LUTs. When a new test data point $\mathbf{x}_{test}$ is presented, $h_n(x_n^{test})$ can easily be read out from the LUT, and the SVM $f(\mathbf{x}_{test})$ can be evaluated by adding the $N$ values $h_n(x_n^{test})$. The above algorithm is summarized in Table 1.
Table 1. Support vector machine with the optimal additive kernel

Data: $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{L}$, $\mathbf{x}^{(i)} \in \mathbb{R}^N$, $y^{(i)} \in \{1, -1\}$
Optimization variables: $Q_n \in \mathbb{R}^{N_Q \times N_Q}$, $\boldsymbol{\alpha} \in \mathbb{R}^L$, $\boldsymbol{\nu} \in \mathbb{R}^L$, $\boldsymbol{\delta} \in \mathbb{R}^L$, $t \in \mathbb{R}$, $\lambda \in \mathbb{R}$

1: // 1. Mapping the $Q_n$ values to the kernel $\mathbf{K}(Q_n)$
2: for $i = 1$ to $L$
3:   for $j = 1$ to $i$
4:     $K_{i,j}(Q_n) = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \sum_{n=1}^{N} \kappa_n(x_n^{(i)}, x_n^{(j)}) = \sum_{n=1}^{N} Q_n[m_n^{(i)}, m_n^{(j)}]$, where $m_n^{(i)} = m_n(x_n^{(i)})$ and $m_n^{(j)} = m_n(x_n^{(j)})$
5:     $K_{j,i}(Q_n) = K_{i,j}(Q_n)$
6:   end for
7: end for
8: // 2. Kernel optimization
9: $p^* = \min_{t, \lambda, \boldsymbol{\nu}, \boldsymbol{\delta}, Q_n} \; t$
10:   such that $\operatorname{trace}(\mathbf{K}(Q_n)) = c$, $\mathbf{K}(Q_n) \succeq 0$, $\begin{bmatrix} \mathbf{y}\mathbf{y}^T \circ \mathbf{K}(Q_n) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^T & t - 2C\boldsymbol{\delta}^T\mathbf{e} \end{bmatrix} \succeq 0$, $\boldsymbol{\nu} \ge 0$, $\boldsymbol{\delta} \ge 0$
11: // 3. Computing the dual variables $\boldsymbol{\alpha}$
12: $\boldsymbol{\alpha} = (\mathbf{y}\mathbf{y}^T \circ \mathbf{K})^{-1}(\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})$
13: // 4. Decision function evaluation for $\mathbf{x}_{test}$
14: for $n = 1$ to $N$
15:   $h_n(x_n^{test}) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} \kappa_n(x_n^{(i)}, x_n^{test}) = \sum_{i=1}^{L} \alpha^{(i)} y^{(i)} Q_n[m_n^{(i)}, m_n^{test}]$, where $m_n^{(i)} = m_n(x_n^{(i)})$ and $m_n^{test} = m_n(x_n^{test})$
16: end for
17: $\hat{y} = 1$ if $f(\mathbf{x}_{test}) \ge 0$ and $\hat{y} = -1$ otherwise, where $f(\mathbf{x}_{test}) = \sum_{n=1}^{N} h_n(x_n^{test}) + b$
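The key device in line 10 is Schur's complement, which turns the quadratic term of the objective into a linear matrix inequality. A minimal numeric sketch of the underlying equivalence (all numbers illustrative): for positive definite $A$, the block matrix $\begin{bmatrix} A & u \\ u^T & s \end{bmatrix}$ is positive semidefinite exactly when $s - u^T A^{-1} u \ge 0$.

```python
# Schur-complement check for [[A, u], [u^T, s]] with A = 2I (so A^{-1}u = u/2).
A_diag = 2.0
u = [1.0, 1.0]
s = 1.5

schur = s - sum(ui * ui / A_diag for ui in u)   # s - u^T A^{-1} u = 1.5 - 1.0
print(schur, schur >= 0)                        # nonnegative => block is PSD
```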
IV. EXPERIMENTAL RESULTS
In this section, the performance of the proposed method is compared to that of the other kernels on two kinds of datasets: artificially generated datasets and benchmark datasets obtained from practical usage. The artificial datasets are selected to be 2-dimensional for visualization, and three such datasets are considered. The benchmark datasets are taken from the UCI repository [23], and they represent multi-dimensional data. All SVM classifiers used in the experiments are trained using the MATLAB CVX toolbox [24] and run on a Core i5 3.40 GHz processor.

4.1. Performance Measures

In order to compare the classification performance, we use 5 performance measures which are popular for binary classifiers [25-29]. Table 2 shows the definitions of the performance measures used in this paper. In Table 2, $L_p$ and $L_n$ denote the numbers of positive and negative samples, respectively, and $s = L_n / L_p$ is the skew factor, i.e., the ratio of $L_n$ to $L_p$. In addition, $tp$ and $tn$ denote the numbers of true positive and true negative samples, respectively.
Table 2. Definitions of performance measures

Performance Measure        Definition
True Positive Rate (TPR)   tpr = tp / Lp
True Negative Rate (TNR)   tnr = tn / Ln = 1 − fpr
Precision                  tpr / (tpr + s·fpr)
F-measure                  2·tpr / (tpr + s·fpr + 1)
Accuracy                   (tpr + s(1 − fpr)) / (1 + s)
Testing Time (ms)          Average computational time to test a single sample
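The derived rows of Table 2 follow mechanically from tpr, fpr and s; a short sketch (function names are ours):

```python
# Measures of Table 2 as functions of tpr, fpr and the skew s = Ln / Lp.
def precision(tpr, fpr, s):
    return tpr / (tpr + s * fpr)

def f_measure(tpr, fpr, s):
    return 2 * tpr / (tpr + s * fpr + 1)

def accuracy(tpr, fpr, s):
    return (tpr + s * (1 - fpr)) / (1 + s)

# With balanced classes (s = 1), accuracy is the mean of TPR and TNR.
print(accuracy(0.9, 0.1, 1.0))
```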
In this paper, we evaluate and discuss the classification performance of the conventional kernels and the proposed OAK based on the measures in Table 2.

4.2. 2D Artificial Datasets

Three 2-dimensional synthetic datasets are utilized, referred to as 'Two moons', 'Two spirals' and 'Two circles'. For each dataset, the proposed method is compared with the previous additive kernels and the RBF kernel defined by

$$\kappa_{RBF}(\mathbf{x}, \mathbf{z}) = \exp\left( -\frac{(\mathbf{x} - \mathbf{z})^T(\mathbf{x} - \mathbf{z})}{2\sigma^2} \right). \qquad (28)$$

In the experiments with the 2D artificial datasets, all SVMs are trained with $C = 0.1$, and five-fold cross-validation is performed to test the classification performance. Figure 12 shows an example of k-fold cross-validation.

Figure 12. Example of k-fold cross-validation
To evaluate the performance of the classifiers, we compare them in terms of the decision boundaries and the 6 measures from Table 2. For all the artificial datasets, the OAK is trained with $N_Q = 21$, so that the axis interval is 0.05, and $\sigma$ of (28) is set to 0.1 for the RBF kernel.

4.2.1 Two Moons Dataset

The 'Two moons' dataset consists of 2 classes with the appearance of two half-moons facing each other. The number of samples is 450 and $\operatorname{dom}(x_n)$ ($n = 1, 2, \ldots, N$) is $[0, 1]$. Figure 13 shows the decision boundaries of the conventional kernels and the OAK. In the figure, $\kappa_{LIN}$, $\kappa_{IK}$, $\kappa_{GIK}$, $\kappa_{\chi^2}$ and OAK denote the linear kernel (9), the intersection kernel (6), the generalized intersection kernel (7), the $\chi^2$ kernel (8) and the OAK, respectively.
[Figure 13 comprises six panels of decision boundaries on the unit square: (a) LIN, (b) IK, (c) GIK, (d) $\chi^2$, (e) RBF, (f) OAK.]

Figure 13. Comparison of the decision boundaries for the 'Two moons' dataset
In Fig. 13, the previous AKs have difficulty separating the two classes in the regions where the boundary between the two classes is highly nonlinear. The OAK and RBF kernels, however, classify the given samples well and outperform the previous AKs. It is also interesting to observe that the decision boundary of the proposed method is axis-aligned and looks like a square wave, while those of the other kernels are smooth curves. The axis-aligned square boundary of the OAK is caused by the nature of the quantization in $Q_n$. The QTs $Q_1$ and $Q_2$ are visualized in Fig. 14, in which the darker a region is, the higher the associated $\kappa_n(\cdot,\cdot)$ is.

Figure 14. Optimized $Q_n$ for the 'Two moons' dataset
The QTs have some square blocks of the same values, and these blocks result in the axis-aligned square boundary of the OAK. The blocks are formed because an input feature $x_n$ is approximated as one of the representative values $\{x_{n,1}, x_{n,2}, x_{n,3}, \ldots, x_{n,N_Q}\}$. If we increase $N_Q$, the decision boundary becomes smoother, but its square property is preserved. The performance measures of the conventional kernels and the OAK are summarized in Table 3. For the additive kernels, which include the previous AKs and the OAK, the testing performance is measured using two LUTs, each of size 2 × 100. In Table 3, the best performance for each measure is highlighted in bold face and the number within brackets denotes the rank of the classifier.
Table 3. Classification performance: 'Two moons' dataset

            LIN        IK         GIK        χ²         RBF        OAK
TPR         0.8267(5)  0.9200(4)  0.9600(3)  0.8000(6)  1.0000(1)  1.0000(1)
TNR         0.8311(6)  0.8800(3)  0.8400(5)  0.8756(4)  1.0000(1)  0.9600(2)
Precision   0.8302(6)  0.8875(3)  0.8594(5)  0.8681(4)  1.0000(1)  0.9621(2)
F-measure   0.8274(6)  0.9026(4)  0.9062(3)  0.8303(5)  1.0000(1)  0.9806(2)
Accuracy    0.8289(6)  0.9000(3)  0.9000(3)  0.8378(5)  1.0000(1)  0.9800(2)
Time (ms)   0.0017(1)  0.0036(5)  0.0028(2)  0.0028(2)  4.6421(6)  0.0028(2)
To evaluate not only the accuracy but also the testing time, we plot the accuracy versus the testing time, as illustrated in Figure 15.

Figure 15. Accuracy versus testing time: 'Two moons' dataset
From Table 3 and Figure 15, our proposed OAK outperforms the other kernels except the RBF kernel. The testing time of the OAK, however, is about 1,500 times shorter than that of the RBF kernel, while its accuracy is only 2% lower.

4.2.2 Two Spirals Dataset
The 'Two spirals' dataset is also 2-dimensional and consists of two classes which have the shape of spirals. This set has a more nonlinear boundary than the 'Two moons' set. The number of samples is 450 and $\operatorname{dom}(x_n)$ ($n = 1, 2, \ldots, N$) is $[0, 1]$. Figure 16 shows the decision boundaries of the conventional kernels and the OAK.

[Figure 16 comprises six panels of decision boundaries on the unit square: (a) LIN, (b) IK, (c) GIK, (d) $\chi^2$, (e) RBF, (f) OAK.]

Figure 16. Comparison of the decision boundaries for the 'Two spirals' dataset
As shown in Figure 16, the previous AKs can separate the two classes only in the outer spirals, but fail to separate them in the inner spirals. The proposed OAK and the RBF kernel, however, can separate the two classes well not only in the outer spirals but also in the inner spirals. Compared with the RBF, the decision boundary of the OAK SVM has the shape of a rectangular spiral, while the RBF has a smooth decision boundary. Figure 17 shows the QTs $Q_n$ for the OAK. The QTs actually represent a combination of rectangles, which explains why the decision boundary of the OAK is a rectangular spiral.

Figure 17. Optimized $Q_n$ for the 'Two spirals' dataset
In Table 4, the performance of the conventional kernels and the OAK is summarized in terms of the performance measures in Table 2. As in the experiment with the 'Two moons' dataset, all AKs and the OAK utilize LUTs of size 2 × 100 for testing.
Table 4. Classification performance: 'Two spirals' dataset

            LIN        IK         GIK        χ²         RBF        OAK
TPR         0.6533(4)  0.6533(4)  0.6867(3)  0.6000(6)  0.9667(1)  0.8867(2)
TNR         0.6467(5)  0.6600(4)  0.5467(6)  0.6800(3)  0.9600(1)  0.8933(2)
Precision   0.6533(5)  0.6613(3)  0.6074(6)  0.6569(4)  0.9625(1)  0.8951(2)
F-measure   0.6520(4)  0.6558(3)  0.6419(5)  0.6235(6)  0.9638(1)  0.8864(2)
Accuracy    0.6500(4)  0.6567(3)  0.6167(6)  0.6400(5)  0.9633(1)  0.8900(2)
Time (ms)   0.0027(1)  0.0027(1)  0.0028(2)  0.0028(2)  4.0666(6)  0.0028(2)
In addition, Figure 18 shows the accuracy versus the testing time for all SVMs used in the experiment with the 'Two spirals' dataset.

Figure 18. Accuracy versus testing time: 'Two spirals' dataset
From Table 4 and Figure 18, the conventional AKs are about 65% accurate with a testing time of 0.0027 ms on the ‘Two Spirals’ dataset. The RBF kernel performs best on all classification performance measures, but its testing time is about 1500 times longer than that of the AKs. Our proposed OAK, on the other hand, is about 20% more accurate than the AKs and about 1500 times faster in testing than the RBF kernel, while its accuracy is only about 8% lower than that of the RBF.
4.1.3 Two Circles Dataset

The ‘Two circles’ dataset is also 2-dimensional and consists of two classes. Unlike the previous two examples, one class is surrounded by the other class. The number of samples is 450 and the domain of xn (n = 1, 2, …, N) is [0, 1]. Figure 19 shows the decision boundaries of the conventional kernels and the OAK.
(a) LIN (b) IK (c) GIK (d) χ² (e) RBF (f) OAK

Figure 19. Comparison of the decision boundaries for the ‘Two circles’ dataset
From the above decision boundaries, the previous AKs, except for the IK, fail to make circle-shaped decision boundaries, while the IK, RBF and OAK have circle-shaped boundaries and can separate the inner class from the outer class. As in the previous experiments, the IK and RBF kernel have a smooth circle, but our proposed OAK has a rectangular circle. The QTs for the OAK are illustrated in Figure 20.
(a) Q1 (b) Q2

Figure 20. Optimized Qn for the ‘Two circles’ dataset
As shown in Figure 20, the QTs consist of five blocks located at the four corners and the center of the table, which is why the decision boundary of the OAK is a rectangular circle for the ‘Two circles’ dataset.

In Table 5, the performances of the conventional kernels and the OAK are summarized in terms of the five performance measures. As in the previous experiments, all of the AKs and the OAK utilize an LUT with a size of 2 × 100 for testing.
Table 5. Classification Performance: ‘Circles’ Dataset

            LIN        IK         GIK        χ²         RBF        OAK
TPR         0.8523(4)  0.9760(2)  0.7340(5)  0.6726(6)  1.0000(1)  0.9449(3)
TNR         0.0583(6)  0.8583(3)  0.8500(4)  0.5833(5)  0.9167(1)  0.9000(2)
Precision   0.4879(6)  0.8834(3)  0.8394(4)  0.6320(5)  0.9311(1)  0.9093(2)
F-measure   0.6169(6)  0.9262(2)  0.7775(4)  0.6452(5)  0.9637(1)  0.9262(2)
Accuracy    0.4681(6)  0.9188(3)  0.7906(4)  0.6280(5)  0.9601(1)  0.9227(2)
Time (ms)   0.0056(5)  0.0040(1)  0.0040(1)  0.0049(3)  3.4888(6)  0.0049(3)
In addition, Figure 21 shows the accuracy versus testing time for all SVMs used in this experiment with the ‘Two circles’ dataset.
Figure 21. Accuracy versus testing time: ‘Two Circles’
As in the previous experiments, the RBF kernel outperforms the other kernels in terms of the performance measures, but it takes about 3.49 ms per sample, which is about 870 times longer than the AKs. The OAK, on the other hand, shows better performance than the other AKs, while maintaining a similar testing time.

4.2. UCI Repository DB
In this subsection, the conventional kernels and the OAK are applied to higher dimensional general datasets. Eight datasets from the UCI repository are used, and their characteristics are summarized in Table 6 [29]. In Table 6, the columns ‘# of data’ and ‘s = Ln/Lp’ denote the number of samples and the ratio of the number of negative samples to the number of positive samples, respectively. The column ‘Data type’ denotes whether the input feature xn takes categorical or real values. In the column ‘Parameters’, NQn and γ denote the size of the QT and the parameter of the RBF kernel (28), respectively.

Table 6. Summary of UCI Repository Datasets
Dataset               # of data                      s = Ln/Lp   Dimension   Data type                          Parameters
Acute Inflammations   Positive: 50, Negative: 70     1.400       6           Real, Categorical ({yes, no})      NQ1 = 11, NQ2~5 = 2, γ = 0.1
Monks1                Positive: 62, Negative: 62     1.000       6           Categorical (Integer 1~4)          NQ = 4, γ = 0.5
Monks3                Positive: 60, Negative: 62     1.033       6           Categorical (Integer 1~4)          NQ = 4, γ = 0.5
TicTacToc             Positive: 626, Negative: 332   0.530       9           Categorical ({O, X, –})            NQ = 3, γ = 0.1
Statlog-Heart         Positive: 150, Negative: 120   0.800       13          Real, Categorical                  NQ = 5, γ = 1
Votes                 Positive: 168, Negative: 267   1.589       16          Categorical ({Yes, No, Neither})   NQ = 3, γ = 1
Parkinson             Positive: 147, Negative: 48    0.326       22          Real                               NQ = 11, γ = 0.1
Ionosphere            Positive: 225, Negative: 126   0.560       33          Real                               NQ = 21, γ = 1
In the parameter setting, if the data type is categorical, we set NQn to the number of categories. Each category in the categorical data is mapped to a number in the range [0, 1]. For example, the categorical values ({O, X, –}) of ‘TicTacToc’ and ({Yes, No, Neither}) of ‘Votes’ are scaled to ({0, 0.5, 1}), respectively.

If the data is real-valued, we set NQn to the value that shows the best training accuracy among NQ ∈ {5, 11, 21}. In the case of the RBF kernel, we tune γ to the value that shows the best training accuracy among γ ∈ {0.01, 0.05, 0.1, 0.5, 1, 5, 10}.

Data points are normalized to [0, 1], and all classifiers are trained with C = 0.01. Since no testing data are specified in the UCI datasets, testing is performed by 5-fold cross-validation, as for the 2D synthetic datasets. The size of the LUT is N × 1000, where N is the number of dimensions.

Tables 7 and 8 compare all the classifiers in terms of the TPR and TNR on the UCI datasets. The highest value for each dataset is highlighted in bold face. In addition, we rank the classifiers according to the average value of each classifier, and the rank is displayed in brackets.
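The LUT-based testing described above can be sketched as follows for the intersection kernel (a minimal sketch: the support vectors, coefficients and bin count below are illustrative assumptions, not values from the paper). Tabulating the per-dimension partial sums makes testing cost O(N) per sample, independent of the number of support vectors:

```python
import numpy as np

# For an additive kernel K(x, z) = sum_n k(x_n, z_n), the SVM decision
# function f(x) = sum_l coef_l K(x, x_l) + b splits per dimension:
# f(x) = sum_n h_n(x_n) + b with h_n(v) = sum_l coef_l k(v, x_ln).
# Tabulating h_n on a grid gives the LUT used at test time.

def build_ik_luts(sv, coef, n_bins=1000):
    """sv: (L, N) support vectors in [0, 1]; coef: (L,) alpha_l * y_l."""
    grid = (np.arange(n_bins) + 0.5) / n_bins          # bin centers in [0, 1]
    # h_n(v) = sum_l coef_l * min(v, sv[l, n]) evaluated on the grid
    return np.stack([
        (coef[:, None] * np.minimum(grid[None, :], sv[:, n][:, None])).sum(0)
        for n in range(sv.shape[1])
    ])                                                  # shape (N, n_bins)

def lut_decision(x, luts, bias):
    idx = np.clip((x * luts.shape[1]).astype(int), 0, luts.shape[1] - 1)
    return luts[np.arange(len(x)), idx].sum() + bias    # N table reads

# Tiny check against the direct kernel expansion:
rng = np.random.default_rng(0)
sv, coef = rng.random((5, 3)), rng.standard_normal(5)
luts = build_ik_luts(sv, coef)
x = rng.random(3)
direct = coef @ np.minimum(x[None, :], sv).sum(axis=1)
print(abs(lut_decision(x, luts, 0.0) - direct) < 0.05)  # True, up to quantization
```

The small residual is the quantization error of the 1000-bin grid.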
Table 7. Performance Comparison for UCI Repository Datasets: TPR

                      LIN        IK         GIK        χ²         RBF        OAK
Acute Inflammations   0.7400     0.8200     0.7800     0.7000     0.0000     1.0000
Monks1                0.6643     0.6619     0.6810     0.6643     0.7262     0.4667
Monks3                0.7667     0.8667     0.9333     0.7000     0.4167     0.9500
TicTacToc             1.0000     1.0000     1.0000     1.0000     1.0000     0.7714
Statlog-Heart         0.8800     0.8867     0.8867     0.8667     0.9733     0.8800
Votes                 0.9404     0.9465     0.9525     0.9404     0.5167     0.9702
Parkinson             1.0000     0.9186     1.0000     1.0000     1.0000     0.9246
Ionosphere            1.0000     0.9867     1.0000     1.0000     1.0000     0.9689
Average               0.8739(3)  0.8859(2)  0.9042(1)  0.8589(5)  0.7041(6)  0.8665(4)
Table 8. Performance Comparison for UCI Repository Datasets: TNR

                      LIN        IK         GIK        χ²         RBF        OAK
Acute Inflammations   1.0000     1.0000     1.0000     1.0000     1.0000     1.0000
Monks1                0.7262     0.7738     0.6286     0.7571     0.7881     1.0000
Monks3                0.8571     0.9024     0.9167     0.7762     1.0000     0.9167
TicTacToc             0.0000     0.0182     0.0121     0.0000     0.0000     0.5893
Statlog-Heart         0.7750     0.7917     0.7750     0.7833     0.4333     0.8000
Votes                 0.9210     0.9210     0.9210     0.9174     1.0000     0.9476
Parkinson             0.0000     0.5944     0.0000     0.0000     0.0000     0.7278
Ionosphere            0.4129     0.7298     0.0000     0.4600     0.0000     0.8255
Average               0.5865(4)  0.7164(2)  0.5317(5)  0.5868(3)  0.5277(6)  0.8509(1)
As shown in Tables 7 and 8, the OAK shows the best performance in TNR, while it ranks 4th in TPR. The reason can be found in the ratio s = Ln/Lp of each dataset. For example, the number of positive samples is about twice the number of negative samples in ‘TicTacToc’, ‘Parkinson’ and ‘Ionosphere’, where the conventional kernels have better TPR values than the OAK. For such imbalanced datasets, decision boundaries tend to be biased toward the class with more samples. This is why the conventional kernels reach almost 1.0000 in TPR but fall below 0.5 in TNR. Our proposed OAK, however, attains values over 0.85 for both TPR and TNR, while the other methods do not.
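For reference, the five measures reported in Tables 7-11 follow directly from the confusion counts; the short sketch below (with made-up example numbers) also illustrates how a boundary biased toward the majority class inflates TPR while deflating TNR:

```python
# The five measures used in the tables, from the confusion counts
# (TP, FP, TN, FN). F-measure is the harmonic mean of Precision and
# TPR (recall), which is why it tracks TPR closely.

def measures(tp, fp, tn, fn):
    tpr = tp / (tp + fn)                  # sensitivity / recall
    tnr = tn / (tn + fp)                  # specificity
    prec = tp / (tp + fp)
    f1 = 2 * prec * tpr / (prec + tpr)    # harmonic mean
    acc = (tp + tn) / (tp + fp + tn + fn)
    return tpr, tnr, prec, f1, acc

# Predicting almost everything positive: high TPR, poor TNR.
print([round(v, 4) for v in measures(tp=95, fp=40, tn=10, fn=5)])
# [0.95, 0.2, 0.7037, 0.8085, 0.7]
```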
Tables 9, 10, and 11 show the comparison of the classifiers in terms of the Precision, F-measure and Accuracy for the UCI datasets, respectively.
Table 9. Performance Comparison for UCI Repository Datasets: Precision

                      LIN        IK         GIK        χ²         RBF        OAK
Acute Inflammations   1.0000     1.0000     1.0000     1.0000     0.0000     1.0000
Monks1                0.7056     0.7451     0.6434     0.7306     0.7806     1.0000
Monks3                0.8408     0.8931     0.9192     0.7532     1.0000     0.9218
TicTacToc             0.6535     0.6576     0.6562     0.6535     0.6534     0.7788
Statlog-Heart         0.8357     0.8457     0.8365     0.8314     0.6853     0.8527
Votes                 0.8856     0.8867     0.8873     0.8805     1.0000     0.9230
Parkinson             0.7547     0.8785     0.7547     0.7547     0.7547     0.9145
Ionosphere            0.7530     0.8678     0.6410     0.7681     0.6410     0.9089
Average               0.8036(3)  0.8468(2)  0.7923(5)  0.7965(4)  0.6894(6)  0.9125(1)
Table 10. Performance Comparison for UCI Repository Datasets: F-measure

                      LIN        IK         GIK        χ²         RBF        OAK
Acute Inflammations   0.8405     0.8980     0.8719     0.7995     0.0000     1.0000
Monks1                0.6816     0.6964     0.6582     0.6932     0.7506     0.6353
Monks3                0.7985     0.8780     0.9260     0.7188     0.5490     0.9353
TicTacToc             0.7904     0.7934     0.7924     0.7904     0.7904     0.7751
Statlog-Heart         0.8556     0.8649     0.8598     0.8480     0.8034     0.8638
Votes                 0.9112     0.9143     0.9175     0.9086     0.6795     0.9455
Parkinson             0.8601     0.8964     0.8601     0.8601     0.8601     0.9188
Ionosphere            0.8590     0.9232     0.7813     0.8688     0.7813     0.9374
Average               0.8246(4)  0.8581(2)  0.8334(3)  0.8109(5)  0.6518(6)  0.8764(1)
Table 11. Performance Comparison for UCI Repository Datasets: Accuracy

                      LIN        IK         GIK        χ²         RBF        OAK
Acute Inflammations   0.8917     0.9250     0.9083     0.8750     0.5833     1.0000
Monks1                0.6952     0.7179     0.6548     0.7107     0.7571     0.7333
Monks3                0.8109     0.8840     0.9250     0.7365     0.7090     0.9333
TicTacToc             0.6535     0.6597     0.6576     0.6535     0.6534     0.7081
Statlog-Heart         0.8333     0.8444     0.8370     0.8296     0.7333     0.8444
Votes                 0.9284     0.9307     0.9331     0.9262     0.8135     0.9563
Parkinson             0.7547     0.8405     0.7547     0.7547     0.7547     0.8761
Ionosphere            0.7892     0.8945     0.6410     0.8062     0.6410     0.9173
Average               0.7946(3)  0.8371(2)  0.7889(4)  0.7866(5)  0.7057(6)  0.8711(1)
As shown in Tables 9-11, the RBF shows better performance than the OAK in terms of the Precision for ‘Monks3’ and ‘Votes’, as well as in terms of the F-measure and Accuracy for ‘Monks1’. However, over the 8 UCI datasets, our proposed OAK has the highest average ‘Precision’, ‘F-measure’ and ‘Accuracy’ regardless of the ratio s = Ln/Lp. Table 12 presents a comparison of the classifiers in terms of testing time.
Table 12. Performance Comparison for UCI Repository Datasets: Testing Time (ms)

                      LIN        IK         GIK        χ²         RBF         OAK
Acute Inflammations   0.0120     0.0116     0.0116     0.0118     1.6807      0.0123
Monks1                0.0121     0.0119     0.0117     0.0115     1.7382      0.0120
Monks3                0.0117     0.0120     0.0125     0.0111     1.7099      0.0118
TicTacToc             0.0111     0.0111     0.0110     0.0117     10.8348     0.0113
Statlog-Heart         0.0124     0.0126     0.0115     0.0113     3.8765      0.0114
Votes                 0.0111     0.0113     0.0113     0.0120     5.9105      0.0109
Parkinson             0.0116     0.0118     0.0115     0.0116     2.7939      0.0121
Ionosphere            0.0116     0.0118     0.0116     0.0117     4.7868      0.0113
Average               0.0117(3)  0.0118(5)  0.0116(1)  0.0117(3)  4.1664(6)   0.0116(1)
As shown in Table 12, it is worth noting that the RBF kernel has a different testing time
for each dataset while the AKs and OAK have almost the same testing time for all of the datasets due to testing with LUTs. Figure 22 shows the average Accuracy versus testing time for the UCI DB.
Figure 22. Average accuracy versus testing time: UCI DB
As shown in Figure 22, it can be confirmed that our proposed OAK has the best performance among the conventional kernels while maintaining the advantage of fast testing, as in the AKs.

4.3. LIBSVM Datasets

To examine the performance of the OAK on a large-scale dataset, the conventional kernels and the OAK are applied to the aDB of the LIBSVM datasets [35]. Five datasets of the aDB are used, and their characteristics are summarized in Table 13.
Table 13. Summary of LIBSVM Datasets

        # of data (Train)                   # of data (Test)                    Dimension   Data type   Parameters
setA    Lp = 395,  Ln = 1210,  s = 3.06     Lp = 7446, Ln = 23510, s = 3.16     119         Binary      NQ = 2
setB    Lp = 572,  Ln = 1693,  s = 2.96     Lp = 7269, Ln = 23027, s = 3.17     119         Binary      NQ = 2
setC    Lp = 773,  Ln = 2412,  s = 3.12     Lp = 7068, Ln = 22308, s = 3.16     119         Binary      NQ = 2
setD    Lp = 1188, Ln = 3593,  s = 3.02     Lp = 6653, Ln = 21127, s = 3.18     119         Binary      NQ = 2
setE    Lp = 1569, Ln = 4845,  s = 3.08     Lp = 6272, Ln = 19875, s = 3.17     119         Binary      NQ = 2
All datasets consist of 119-dimensional binary data, and the number of training samples ranges from 1,605 to 6,414. Since all attributes are binary, we set NQ = 2, and all classifiers are trained with C = 0.1. For the sake of efficiency, a linear SVM is first trained with C = 0.01, and only the data points whose dual coefficient α(l) is higher than 0.005 are used to train the OAK SVM with C = 0.1. As in the other experiments, testing of the AKs and the OAK is performed with an LUT of size 119 × 2. In the case of the RBF kernel, we tune γ to the value that shows the best training accuracy among γ ∈ {0.1, 0.5, 1, 10}.
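The two implementation details above, the α-threshold pruning and the 119 × 2 LUT for binary data, can be sketched as follows (all numeric values are illustrative assumptions; the actual α(l) would come from the linear SVM and the LUT entries from the SDP-trained OAK):

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Pruning: keep only points whose hypothetical dual coefficient alpha(l)
#     exceeds 0.005 as the candidate set for the expensive OAK training.
alpha = rng.random(500) * 0.01                 # made-up dual coefficients
keep = np.flatnonzero(alpha > 0.005)           # reduced training set indices
print(keep.size < alpha.size)                  # True: the set shrinks

# (2) LUT testing on 119-dimensional binary data: with NQ = 2 the decision
#     function is f(x) = sum_n LUT[n, x_n] + b, i.e. 119 table reads.
N = 119
lut = rng.standard_normal((N, 2))              # hypothetical trained QT values
bias = 0.1
x = rng.integers(0, 2, size=N)                 # one binary test sample
f = lut[np.arange(N), x].sum() + bias
print(np.isfinite(f))                          # True: a scalar decision value
```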
All the classifiers are compared in terms of the five performance measures (TPR, TNR, Precision, F-measure, Accuracy) on the LIBSVM datasets in Tables 14-18. The highest value for each dataset is highlighted in bold face. Since the four conventional AKs (LIN, IK, GIK, χ²) produce identical results on binary data, they are shown as a single column.

Table 14. Performance Comparison for LIBSVM Datasets: TPR

          LIN/IK/GIK/χ²   RBF      OAK
setA      0.8022          0.7548   0.6621
setB      0.8243          0.7886   0.6924
setC      0.8355          0.7750   0.6689
setD      0.8568          0.7657   0.6750
setE      0.8591          0.7672   0.6615
Average   0.8356          0.7702   0.6720
Table 15. Performance Comparison for LIBSVM Datasets: TNR

          LIN/IK/GIK/χ²   RBF      OAK
setA      0.8013          0.8252   0.8663
setB      0.7930          0.8071   0.8461
setC      0.7900          0.8164   0.8799
setD      0.7731          0.8208   0.8805
setE      0.7684          0.8204   0.8833
Average   0.7852          0.8180   0.8712
Table 16. Performance Comparison for LIBSVM Datasets: Precision

          LIN/IK/GIK/χ²   RBF      OAK
setA      0.5612          0.5776   0.6107
setB      0.5570          0.5633   0.5867
setC      0.5579          0.5722   0.6382
setD      0.5432          0.5736   0.6401
setE      0.5392          0.5741   0.6414
Average   0.5517          0.5722   0.6234
Table 17. Performance Comparison for LIBSVM Datasets: F-measure

          LIN/IK/GIK/χ²   RBF      OAK
setA      0.6604          0.6544   0.6354
setB      0.6648          0.6572   0.6352
setC      0.6688          0.6583   0.6532
setD      0.6649          0.6559   0.6571
setE      0.6626          0.6567   0.6513
Average   0.6643          0.6565   0.6464
Table 18. Performance Comparison for LIBSVM Datasets: Accuracy

          LIN/IK/GIK/χ²   RBF      OAK
setA      0.8015          0.8082   0.8172
setB      0.8005          0.8026   0.8092
setC      0.8009          0.8064   0.8291
setD      0.7932          0.8076   0.8313
setE      0.7901          0.8076   0.8301
Average   0.7972          0.8065   0.8234
It is interesting to see that all four conventional AKs demonstrate the same results on the LIBSVM datasets. The reason is that the conventional AKs all produce the same kernel value when the input is binary. Table 19 shows an example in which the LIN, IK, GIK, and χ² kernels return the same value for binary inputs.
Table 19. Example of additive kernel values for binary data: x = [0 0 1 1], z = [0 1 0 1]

LIN(x, z) = Σn xn zn = 0·0 + 0·1 + 1·0 + 1·1 = 1
IK(x, z) = Σn min(xn, zn) = min(0,0) + min(0,1) + min(1,0) + min(1,1) = 1
GIK(x, z) = Σn min(xn², zn²) = min(0,0) + min(0,1²) + min(1²,0) + min(1²,1²) = 1
χ²(x, z) = Σn 2 xn zn / (xn + zn) = 0 + 0 + 0 + 2·1·1/(1+1) = 1

(for the χ² kernel, terms with xn + zn = 0 are taken as 0)
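The identity in Table 19 is easy to verify numerically; the sketch below uses the standard forms of the four additive kernels, with the 0/0 terms of the χ² kernel taken as 0:

```python
import numpy as np

# On binary vectors the linear, intersection, generalized intersection and
# chi-squared additive kernels coincide, since min(a, b), a*b and
# 2ab/(a + b) agree on {0, 1} inputs.

def lin(x, z):  return float(np.dot(x, z))
def ik(x, z):   return float(np.minimum(x, z).sum())
def gik(x, z):  return float(np.minimum(x**2, z**2).sum())
def chi2(x, z):
    num, den = 2 * x * z, x + z
    return float(np.divide(num, den, out=np.zeros_like(num, dtype=float),
                           where=den != 0).sum())

x = np.array([0.0, 0.0, 1.0, 1.0])
z = np.array([0.0, 1.0, 0.0, 1.0])
print(lin(x, z), ik(x, z), gik(x, z), chi2(x, z))  # 1.0 1.0 1.0 1.0
```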
Thus, the conventional AKs cannot improve from the linear SVM. On the other hand, our proposed OAK is completely different from the AKs. The kernel value in the OAK is
PT
trained by optimization and the OAK returns the value which is different from the those of AKs and its computation is as efficient as in the AKs,
CE
Concerning the performance, the OAK outperforms the others in terms of TNR, precision
AC
and accuracy, while AK and RBF kernel demonstrate better performance than OAK in terms of TPR and F-measure. The reason for these results might be that all the LIBSVM datasets are highly imbalanced, as shown in Table 13. The number of negative samples is about three times larger than that of positive samples. The decision boundary tends to be pushed to a class which has more samples than other because the OAK is trained to decrease the sum of
49
ACCEPTED MANUSCRIPT
L
error C (i ) in (2). Therefore, the OAK and RBF underperform the AKs (Linear) in terms i 1
of TPR and F-measure, which is proportional to TPR. In terms of other measures such as TNR, Precision and accuracy, however, our proposed OAK demonstrates better
time.
CR IP T
performance than other AKs. In Table 20, all classifiers are compared in terms of testing
Table 20. Performance Comparison for LIBSVM Datasets: Testing Time (ms)

          LIN/IK/GIK/χ²   RBF       OAK
setA      0.0034          10.4238   0.0031
setB      0.0026          13.4442   0.0026
setC      0.0026          18.9194   0.0027
setD      0.0026          24.0907   0.0028
setE      0.0026          26.8385   0.0026
Average   0.0027          18.7433   0.0028
As shown in Table 20, it is worth noting that the testing time of the RBF kernel increases as the number of training samples increases. The AKs and the OAK, which use the LUT for testing, have almost the same testing time regardless of the number of data. Figure 23 shows the average accuracy versus testing time for the LIBSVM DB.
Figure 23. Average accuracy versus testing time: LIBSVM DB
As shown in Figure 23, it can be seen that our proposed OAK has the best performance in terms of accuracy while maintaining the advantage of fast and efficient evaluation, as in the other AKs.
V. CONCLUSION

In this paper, an OAK SVM was proposed and its performance was tested through its application to synthetic and real-world classification problems. The training of the OAK SVM was formulated as semi-definite programming (SDP), and it was solved by convex optimization. Unlike previous AKs such as the intersection kernel (IK) or the χ² kernel, the proposed OAK did not assume any specific functional form for the kernel. Instead, the kernel was parameterized in terms of the QTs, thereby increasing the nonlinearity. Finally, the OAK SVM was compared with the other AKs and the RBF kernel in the experimental section, and the enhanced classification performance of the OAK over the previous AKs and the RBF kernel was confirmed.
[1]
REFERENCES H. G. Jung, G. Kim, Support vector number reduction: Survey and experimental evaluations, IEEE Trans. Intell. Transp. Syst. 15 (2) (2013) 463-476 [2]
M. Barros de Almeida, A. de Pádua Braga, J. P. Braga, SVM-KM: speeding SVMs
learning with a priori cluster selection and k-means, in: Proceedings of the Sixth Brazilian Symposium on Neural Networks, 2000, pp. 162–167. [3]
Q.-A. Tran, Q.-L. Zhang, X. Li, Reduce the number of support vectors by using clustering techniques, in: Proceedings of International Conference on Machine
Learning and Cybernetics, 2003, vol. 2, pp. 1245–1248.
ED
[4] A. Gani, A. Siddiaq, S. Shamshirband, A survey on indexing techniques for big data: taxonomy and performance evaluation, Knowl. Inf. Syst, 46 (2) (2016) 241-284.
PT
[5] C. Jing, J.Hou, SVM and PCA based fault classification approaches for complicated
CE
industrial process, Neurocomputing 167 (2015) 636-642. [6] L, Khedher, J. Ramirez, J. Gorriz, A. Brahim, F. Segovia, The Alzheimer’s Disease Neuroimaging Initiative, Early diagnosis of Alzheimer׳s disease based on partial least
AC
squares, principal component analysis and support vector machine using segmented MRI images, Neurocomputing 151 (2015) 139-150.
[7] X. Zhang, D. Qiu, F. Chen, Support vector machine with parameter optimization by a novel hybrid method and its application to fault diagnosis, Neurocomputing 149 (2015) 641-651.
52
ACCEPTED MANUSCRIPT
[8]
J. Chen, C.-L. Liu, Fast multi-class sample reduction for speeding up support vector machines, in: Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2011, pp. 1–6.
[9]
R. Koggalage, S. Halgamuge, Reducing the number of training samples for fast support vector machine classification, Neural Inf. Process. Rev.2 (3) (2004) 57–65.
CR IP T
[10] M. F. A. Hady, W. Herbawi, M. Weber, F. Schwenker, A multi-objective genetic algorithm for pruning support vector machines, in: Proceedings of IEEE Internatoinal Conference on Tools with Artificial Intelligence, 2011, pp. 269–275.
[11] H.-J. Lin and J. P. Yeh, “Optimal reduction of solutions for support vector machines,”
AN US
Appl. Math. Comput., vol. 214, no. 2, pp. 329–335, 2009.
[12] H. Zhang, A. C. Berg, M. Maire, J. Malik, SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 2126–
M
2136.
ED
[13] H. Cheng, P.-N. Tan, R. Jin, Efficient algorithm for localized support vector machine, IEEE Trans.Knowl. Data Eng. 22 (4) (2010) 537–549.
PT
[14] Q. Ye, Z. Han, J. Jiao, J. Liu, Human detection in images via piecewise linear support vector machines, IEEE Trans. Image Process.22 (2) (2013) 778–789.
CE
[15] S. Maji, A. C. Berg, J. Malik, Efficient classification for additive kernel SVMs, IEEE
AC
Trans. Pattern Anal. Mach. Intell.36 (1) (2013) 66–77. [16] S. Boughorbel, J.-P. Tarel, N. Boujemaa, Generalized histogram intersection kernel for image recognition, in: Proceedings of IEEE International Conference on Image Processing, 2005, vol. 3.
53
ACCEPTED MANUSCRIPT
[17] S. Maji, U. C. Berkeley, A. C. Berg, Classification using Intersection Kernel Support Vector Machines is Efficient, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008. . [18] M. Varma, D. Ray, Learning the discriminative power-invariance trade-off, in: Proceedings of IEEE 11th International Conference on,Computer Vision, 2007, pp. 1–
CR IP T
8. [19] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) pp. 273–297, 1995.
[20] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan, Learning the
AN US
kernel matrix with semidefinite programming, J. Mach. Learn. Res.5 (1) (2004) 27–72. [21] J. Tang, Y. Tian, A multi-kernel framework with nonparallel support vector machine, Neurocomputing, 266 (2017) 226-238.
M
[22] X. Liu, L. Wang, G, Hwang, J. Zhang, J. Yin, Multiple kernel extreme learning machine, Neurocomputing 149 (2015) 253-264
ED
[23] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming,
PT
version 2.0 beta. http://cvxr.com/cvx, September 2013. [24] P. A. Flach, The geometry of ROC space: understanding machine learning metrics through ROC isometrics, in: Proceedings of International Conference on Machine
CE
Learning, 2003, pp. 194–201.
AC
[25] K.-A. Toh G.-C. Tan, Exploiting the relationships among several binary classifiers via data transformation, Pattern Recognit., 47 (3) (2014) 1509–1522.
[26] J. Makhoul, F. Kubala, R. Schwartz, R. Weischedel, Performance measures for information extraction, in: Proceedings of DARPA broadcast news workshop, 1999, pp. 249–252.
54
ACCEPTED MANUSCRIPT
[27] M. Di Martino, G. Hernández, M. Fiori, A. Fernández, A new framework for optimal classifier design, Pattern Recognit., 46 (8) (2013) 2249–2255. [29]
A. Asuncion and D. Newman, “UCI machine learning repository.” 2007.
[30] H. Jiang, W. Ching, K. F. C. Yiu and Y. Qiu, Stationary Mahalanobis kernel SVM for
CR IP T
credit risk evaluation, Appl. Soft. Comput, 71, (2018). [31] T. Tang, S. Chen, M. Zhao, W. Huang and J. Luo, Very large-scale data classification based on K-means clustering and multi-kernel SVM, Soft Comput, (2018)
[32] J. Baek, J. Kim and E, Kim, Fast and efficient pedestrian detection via the cascade
AN US
implementation of an additive kernel support vector machine, IEEE Trans. Intell. Transp. Syst. 18 (4) (2017) 902-916.
M
Processing Systems, (2018).
[34] O. B. Ahmed, J. Benois-Pineau, M. Allard, G. Catheline, C. B. Amar, Recognition of
ED
Alzheimer's disease and Mild Cognitive Impairment with multimodal image-derived
PT
biomarkers and Multiple kernel learning, Neurocomputing, 220 (2017) 98-110. [35] C-C. Chang, C-J. Lin, LIBSVM: a library for support vector machines, 2001, software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Biography of Authors

Prof. Euntai Kim
Euntai Kim was born in Seoul, Korea, in 1970. He received B.S., M.S., and Ph.D. degrees in Electronic Engineering, all from Yonsei University, Seoul, Korea, in 1992, 1994, and 1999, respectively. From 1999 to 2002, he was a Full-Time Lecturer in the Department of Control and Instrumentation Engineering, Hankyong National University, Kyonggi-do, Korea. Since 2002, he has been with the faculty of the School of Electrical and Electronic Engineering, Yonsei University, where he is currently a Professor. He was a Visiting Scholar at the University of Alberta, Edmonton, AB, Canada, in 2003, and also was a Visiting Researcher at the Berkeley Initiative in Soft Computing, University of California, Berkeley, CA, USA, in 2008. His current research interests include computational intelligence and statistical machine learning and their application to intelligent robotics, unmanned vehicles, and robot vision.
Dr. Jeonghyun Baek
Jeonghyun Baek received the B.S. and Ph.D. degrees in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2011 and 2018, respectively. He is a senior researcher at the Agency for Defense Development (ADD). He has studied machine learning, computer vision and optimization.