To appear in: Neurocomputing
DOI: https://doi.org/10.1016/j.neucom.2018.10.032 (PII: S0925-2312(18)31220-7; Reference: NEUCOM 20054)
Received: 20 October 2017; Revised: 31 August 2018; Accepted: 12 October 2018

Please cite this article as: Jeonghyun Baek, Euntai Kim, A New Support Vector Machine with an Optimal Additive Kernel, Neurocomputing (2018), doi: https://doi.org/10.1016/j.neucom.2018.10.032


A New Support Vector Machine with an Optimal Additive Kernel


Jeonghyun Baek and Euntai Kim

School of Electrical and Electronic Engineering, Yonsei University, Yonsei Street 50, Seoul, 120-749, Korea


E-mail: [email protected]

Abstract

Although the kernel support vector machine (SVM) outperforms the linear SVM, its application to real-world problems is limited because the evaluation of its decision function is computationally very expensive due to kernel expansion. On the other hand, the additive kernel (AK) SVM enables fast evaluation of the decision function using look-up tables (LUTs). The AKs, however, assume a specific functional form for the kernel, such as the intersection kernel (IK) or the χ² kernel, and are problematic in that their performance is seriously degraded when a given problem is highly nonlinear. To address this issue, an optimal additive kernel (OAK) is proposed in this paper. The OAK does not assume any specific kernel form; instead, the kernel is represented by a quantized Gram table. The training of the OAK SVM is formulated as semi-definite programming (SDP), and it is solved by convex optimization. In the experiments, the proposed method is tested with 2D synthetic datasets, UCI repository datasets and LIBSVM datasets. The experimental results show that the proposed OAK SVM has better performance than the previous AKs and the RBF kernel while maintaining fast computation using look-up tables.

Keywords: additive kernel, kernel optimization, SVM, nonlinear classification

I. INTRODUCTION

A kernel support vector machine (SVM) is one of the most popular classifiers, and it has been applied to a variety of fields due to its excellent classification performance [1-7][30]. Typical examples of nonlinear kernels include the polynomial kernel and the radial basis function (RBF) kernel, also called the Gaussian kernel, which clearly outperform the linear SVM. Unfortunately, however, the application of the kernel SVM to real-world problems is limited because evaluating its decision function takes much longer than for a linear SVM, owing to the kernel expansion [8-10].

To address this problem, several lines of research have been pursued, and the solutions fall into one of three options. The first option is to decrease the computational burden of the kernel SVM by reducing the number of support vectors. This option can be categorized into two approaches: pre-pruning and post-pruning [1]. The pre-pruning methods reduce the training samples and train the SVM with the reduced set, thereby reducing the number of support vectors. They usually employ an unsupervised clustering method to divide the training data into several clusters, and the SVM is then trained using only the samples near the cluster centers [2], [3], [31]. In addition, some researchers proposed training the SVM using only the samples located outside the clusters, in order to use samples located near the decision boundary [8], [9]. On the other hand, the post-pruning methods exploit the training results of the SVM to reduce the training data: they select the optimal subset of support vectors such that the performance degradation is minimized. To search for this optimal subset, an iterative search [16-18] or a meta-heuristic genetic algorithm [10], [11] is used.

The second option for dealing with the computational burden of the kernel SVM is to combine multiple linear SVMs instead of using the kernel SVM. The training samples are divided into several clusters, and multiple linear SVMs are assigned to the clusters as local classifiers. Zhang et al. proposed SVM-KNN, in which a kernel SVM is trained for every test sample using only the k nearest training neighbors of that sample [12]. Cheng et al. proposed a localized SVM (LSVM) and a profile SVM (PSVM), in which multiple linear SVMs are trained using the similarities between the test and training samples [13]. Ye et al. proposed a piecewise linear SVM (PL-SVM) that trains multiple linear SVMs to classify pedestrians having multiple postures [14].

The third option is to use the additive kernel (AK). The AK enables fast evaluation of the decision function by decomposing the evaluation into a summation of dimension-wise kernel computations [15-18, 32, 33]. Recently, Maji et al. proposed an efficient computation method for the AK SVM using look-up tables (LUTs) instead of a kernel expansion over the support vectors, and showed a significant enhancement in terms of computation time [15]. AKs, however, are problematic in that their performance is seriously degraded when a given problem is highly nonlinear. In this paper, an optimal additive kernel (OAK) is proposed to address this issue. Unlike previous AKs, such as the intersection kernel (IK) or the χ² kernel, the proposed OAK does not assume any specific kernel form; instead, the kernel is represented in terms of a quantized Gram table. The training of the OAK SVM is formulated as semi-definite programming (SDP), and it is solved by convex optimization.

The remainder of this paper is organized as follows. In Section II, background on the SVM and additive kernels is reviewed. The training of the OAK is formulated as SDP and solved by convex optimization in Section III. In Section IV, the proposed OAK and the previous AKs are applied to synthetic and benchmark problems, and the experimental results are summarized. Finally, conclusions are drawn in Section V.

II. PRELIMINARY FUNDAMENTALS

2.1. Support Vector Machine (SVM)

Let us assume that a training set $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{L}$ is given, where the parenthesized superscript denotes the $i$-th data point, $\mathbf{x}^{(i)} \in \mathbb{R}^{N}$ and $y^{(i)} \in \{-1, +1\}$. The SVM is a machinery that separates the positive ($y^{(i)} = +1$) and negative ($y^{(i)} = -1$) classes by maximizing the margin between the two classes. When an input vector is represented by a higher-dimensional feature vector $\boldsymbol{\phi}(\mathbf{x})$ and the SVM is represented by

$$f(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}) + b, \qquad (1)$$

its training is formulated in the primal space by the following optimization problem:

$$p^{*} = \min_{\mathbf{w},\,b}\ \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{L}\xi^{(i)} \quad \text{subject to}\quad y^{(i)}\big(\mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x}^{(i)}) + b\big) \ge 1 - \xi^{(i)}, \qquad (2)$$

where $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^{D}$ and $D \gg N$; $p^{*}$ denotes the optimal value of Eq. (2); and $C > 0$ controls the tradeoff between the regularization and the empirical risk [19]. Equivalently, the training is represented in the dual space by

$$p^{*} = \max_{\boldsymbol{\alpha}}\ \mathbf{e}^{T}\boldsymbol{\alpha} - \boldsymbol{\alpha}^{T}\big(\mathbf{y}\mathbf{y}^{T} \circ \mathbf{K}\big)\boldsymbol{\alpha} \quad \text{subject to}\quad \mathbf{0} \le \boldsymbol{\alpha} \le C,\ \ \boldsymbol{\alpha}^{T}\mathbf{y} = 0, \qquad (3)$$

where $\mathbf{e}$ is an $L$-dimensional vector filled with 1s; $\boldsymbol{\alpha} = [\alpha^{(1)}\ \alpha^{(2)}\ \cdots\ \alpha^{(L)}]^{T} \in \mathbb{R}^{L}$ is the dual variable vector; $\circ$ denotes component-wise multiplication; and $\mathbf{K} \in \mathbb{R}^{L \times L}$ is the kernel Gram matrix with components $\mathbf{K}_{i,j} = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \boldsymbol{\phi}(\mathbf{x}^{(i)})^{T}\boldsymbol{\phi}(\mathbf{x}^{(j)})$. When the SVM is trained in the dual space, the decision function can be evaluated by

$$f(\mathbf{x}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b = \sum_{i \in \mathcal{S}}\alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b, \qquad (4)$$

where $\mathcal{S} = \{\, i \mid \alpha^{(i)} > 0 \,\}$ denotes the set of support vectors. When nonlinear kernels are used in Eq. (4), the SVM is represented as a kernel expansion over the support vectors and demonstrates good performance. Unfortunately, however, the kernel expansion requires high computation and memory resources for SVM evaluation.
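To illustrate where this cost comes from, the following minimal Python sketch (an illustration under assumed names and toy values, not code from the paper) evaluates the decision function of Eq. (4) as a kernel expansion over the support vectors; the per-test-point cost grows linearly with the number of support vectors, which is the bottleneck discussed above.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """One possible nonlinear kernel between two vectors (illustrative choice)."""
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

def decision_function(x, sv_x, sv_alpha, sv_y, b, kernel=rbf_kernel):
    """Eq. (4): f(x) = sum_i alpha_i * y_i * k(x_i, x) + b over the support vectors.

    The loop over support vectors is what makes kernel-SVM evaluation slow
    when the number of support vectors is large."""
    s = b
    for x_i, a_i, y_i in zip(sv_x, sv_alpha, sv_y):
        s += a_i * y_i * kernel(x_i, x)
    return s

# Toy usage with hypothetical support vectors (values are illustrative only).
sv_x = np.array([[0.1, 0.9], [0.8, 0.2]])
sv_alpha = np.array([0.5, 0.5])
sv_y = np.array([+1, -1])
print(decision_function(np.array([0.3, 0.7]), sv_x, sv_alpha, sv_y, b=0.0))
```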

2.2. Additive Kernel SVM

The additive kernel is a nonlinear yet efficient method of classification proposed in [20]. The idea of the additive kernel is to decompose a kernel value into a summation of dimension-wise components. Thus, the additive kernel is defined by

$$\kappa(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \kappa_{n}(x_{n}, z_{n}), \qquad (5)$$

where $\mathbf{x} = (x_{1}, x_{2}, \ldots, x_{N}) \in \mathbb{R}^{N}$ and $\mathbf{z} = (z_{1}, z_{2}, \ldots, z_{N}) \in \mathbb{R}^{N}$. Typical examples of additive kernels include the intersection kernel $\kappa_{IK}$ [17], the generalized intersection kernel $\kappa_{GIK}$ [16], the $\chi^{2}$ kernel $\kappa_{\chi^{2}}$ [18] and the linear kernel $\kappa_{LIN}$. They are defined as

$$\kappa_{IK}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \min(x_{n}, z_{n}), \qquad (6)$$

$$\kappa_{GIK}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \min\!\big(x_{n}^{2}, z_{n}^{2}\big), \qquad (7)$$

$$\kappa_{\chi^{2}}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} \frac{2\,x_{n} z_{n}}{x_{n} + z_{n}}, \qquad (8)$$

$$\kappa_{LIN}(\mathbf{x}, \mathbf{z}) = \sum_{n=1}^{N} x_{n} z_{n}. \qquad (9)$$

The above four additive kernels are visualized for 1-dimensional data in Fig. 1. In the figure, the two axes denote the scalar values of $x$ and $z$, and the value of $\kappa(x, z)$ is visualized in gray: the darker an $(x, z)$ pair is, the higher the kernel value $\kappa(x, z)$ is.

Figure 1. Visualization of the additive kernels for 1-dimensional data: (a) $\kappa_{LIN}$, (b) $\kappa_{IK}$, (c) $\kappa_{GIK}$, (d) $\kappa_{\chi^{2}}$.
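As a concrete reference, the short Python sketch below (a minimal illustration, not code from the paper) implements the four additive kernels of Eqs. (6)-(9) as sums of dimension-wise terms; the small epsilon in the χ² kernel is an assumption added only to avoid division by zero.

```python
import numpy as np

def kappa_lin(x, z):
    """Linear additive kernel, Eq. (9): sum_n x_n * z_n."""
    return float(np.sum(x * z))

def kappa_ik(x, z):
    """Intersection kernel, Eq. (6): sum_n min(x_n, z_n)."""
    return float(np.sum(np.minimum(x, z)))

def kappa_gik(x, z):
    """Generalized intersection kernel, Eq. (7): sum_n min(x_n^2, z_n^2)."""
    return float(np.sum(np.minimum(x**2, z**2)))

def kappa_chi2(x, z, eps=1e-12):
    """Chi-square kernel, Eq. (8): sum_n 2*x_n*z_n / (x_n + z_n)."""
    return float(np.sum(2.0 * x * z / (x + z + eps)))

x = np.array([0.2, 0.7, 0.1])
z = np.array([0.5, 0.4, 0.3])
print(kappa_lin(x, z), kappa_ik(x, z), kappa_gik(x, z), kappa_chi2(x, z))
```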

From Eqs. (5) to (9), the SVM with an additive kernel can be represented by

$$f(\mathbf{x}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa(\mathbf{x}^{(i)}, \mathbf{x}) + b = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \sum_{n=1}^{N}\kappa_{n}(x_{n}^{(i)}, x_{n}) + b = \sum_{n=1}^{N}\left(\sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa_{n}(x_{n}^{(i)}, x_{n})\right) + b = \sum_{n=1}^{N} h_{n}(x_{n}) + b, \qquad (10)$$

where $h_{n}(x_{n})$ is defined by

$$h_{n}(x_{n}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa_{n}(x_{n}^{(i)}, x_{n}) = \sum_{i \in \mathcal{S}}\alpha^{(i)} y^{(i)} \kappa_{n}(x_{n}^{(i)}, x_{n}), \qquad (11)$$

and is just a one-dimensional function of $x_{n}$. Thus, when the SVM of Eq. (3) is trained and $\alpha^{(i)}$, $y^{(i)}$ and $x_{n}^{(i)}$ are given, it is worth noting that

(1) the one-dimensional function $h_{n}(x_{n})$ can be precalculated and its output stored in a look-up table (LUT); when a new test point $\mathbf{x}^{test}$ is presented, $h_{n}(x_{n}^{test})$ can easily be read out from the LUT by applying simple interpolation;

(2) because the function $h_{n}(x_{n})$ is implemented off-line by an LUT, the number of support vectors is not important at all, which is not the case for common kernels.

Once the functions $h_{n}(x_{n})$ are computed, the SVM decision function $f(\mathbf{x}^{test})$ in Eq. (10) can be evaluated by summing the $N$ values $h_{n}(x_{n}^{test})$.
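The following minimal Python sketch (an illustration under assumed names, not the authors' implementation) shows how per-dimension LUTs make the evaluation cost independent of the number of support vectors: each h_n is tabulated once off-line, and a test point then needs only N table reads plus N additions.

```python
import numpy as np

def build_luts(kernel_1d, sv_x, sv_alpha, sv_y, grids):
    """Precompute one LUT per dimension: luts[n][k] = h_n(grids[n][k]) from Eq. (11)."""
    luts = []
    for n, grid in enumerate(grids):
        h_n = np.array([np.sum(sv_alpha * sv_y * kernel_1d(sv_x[:, n], g)) for g in grid])
        luts.append(h_n)
    return luts

def evaluate(x, grids, luts, b):
    """Eq. (10): f(x) = sum_n h_n(x_n) + b, reading each h_n from its LUT
    (nearest grid point here; the paper mentions simple interpolation)."""
    total = b
    for n, (grid, lut) in enumerate(zip(grids, luts)):
        k = int(np.argmin(np.abs(grid - x[n])))
        total += lut[k]
    return total

# Toy usage with the intersection kernel min(a, b) and hypothetical support vectors.
kernel_1d = np.minimum
sv_x = np.array([[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]])
sv_alpha = np.array([0.3, 0.3, 0.2]); sv_y = np.array([+1, -1, +1])
grids = [np.linspace(0.0, 1.0, 101) for _ in range(2)]   # two 101-entry LUTs
luts = build_luts(kernel_1d, sv_x, sv_alpha, sv_y, grids)
print(evaluate(np.array([0.3, 0.7]), grids, luts, b=0.0))
```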

III. ADDITIVE KERNEL OPTIMIZATION

The additive kernel enables fast evaluation of the SVM regardless of the number of support vectors. For highly nonlinear problems, however, its performance can be degraded compared with non-additive kernels such as the polynomial or radial basis function (RBF) kernel. Figure 2 shows the decision boundaries of SVMs using four kinds of additive kernels (linear, intersection, generalized intersection and χ² kernels) for a 2-dimensional nonlinear problem. In the figures, the two classes are marked in red and blue.

Figure 2. Decision boundaries of SVMs with additive kernels on a 2-dimensional nonlinear problem: (a) $\kappa_{LIN}$, (b) $\kappa_{IK}$, (c) $\kappa_{GIK}$, (d) $\kappa_{\chi^{2}}$.

As shown, all four kinds of additive kernels yield degraded performance compared with the expected result. In this paper, a new SVM with an optimal additive kernel (OAK) is proposed. The proposed OAK enables much faster evaluation than a non-additive kernel SVM, while its performance is almost as good as that of the non-additive version thanks to kernel optimization.

3.1. Basic Idea

For a nonlinear system with a fixed kernel $\kappa(\cdot,\cdot)$, the SVM can be trained by Eq. (2) in the primal space, or equivalently by Eq. (3) in the dual space. It can be observed from Eq. (3) that the optimal value $p^{*}$ depends on the choice of the kernel $\kappa(\cdot,\cdot)$, and choosing a good kernel decreases the objective function. To exploit the advantages of several kernels, multiple kernel learning (MKL) methods have been proposed [20-22, 34]. The idea of MKL is to simultaneously optimize not only the dual variables $\boldsymbol{\alpha}$ but also the kernel $\kappa(\cdot,\cdot)$ by using

$$q^{*} = \min_{\mathbf{K} \succeq 0,\ \operatorname{trace}(\mathbf{K}) = c}\ \ \max_{\mathbf{0} \le \boldsymbol{\alpha} \le C,\ \boldsymbol{\alpha}^{T}\mathbf{y} = 0}\ \ \mathbf{e}^{T}\boldsymbol{\alpha} - \frac{1}{2}\,\boldsymbol{\alpha}^{T}\big(\mathbf{y}\mathbf{y}^{T} \circ \mathbf{K}\big)\boldsymbol{\alpha}, \qquad (12)$$

where the formulation is a convex problem and $c$ is a constant for the equality constraint on $\operatorname{trace}(\mathbf{K})$ [19]. Equation (12) can be recast into an SDP formulation using Schur's complement:

$$\begin{aligned} p^{*} = \min_{t,\,\lambda,\,\boldsymbol{\nu},\,\boldsymbol{\delta},\,\mathbf{K}}\ \ & t\\ \text{such that}\ \ & \operatorname{trace}(\mathbf{K}) = c,\quad \mathbf{K} \succeq 0,\\ & \begin{bmatrix} \mathbf{y}\mathbf{y}^{T} \circ \mathbf{K} & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y}\\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{T} & t - 2C\boldsymbol{\delta}^{T}\mathbf{e} \end{bmatrix} \succeq 0,\\ & \boldsymbol{\nu} \ge \mathbf{0},\quad \boldsymbol{\delta} \ge \mathbf{0}, \end{aligned} \qquad (13)$$

where $\mathbf{y} = [y^{(1)}\ y^{(2)}\ \cdots\ y^{(L)}]^{T}$ is the vector of target values; $\boldsymbol{\delta}, \boldsymbol{\nu} \in \mathbb{R}^{L}$ and $\lambda \in \mathbb{R}$ are dual variables; $\mathbf{e}$ is the $L$-vector of all 1s; and $\circ$ denotes component-wise multiplication. The SDP formulation of MKL appears elegant, but its usefulness is not as high as expected, because

(1) the number of optimization variables grows rapidly when there are many training samples, because the kernel Gram matrix has a size of $O(L^{2})$;

(2) the more serious problem is that Eq. (13) cannot be applied to the evaluation of the SVM when a new test data point is presented, because Eq. (13) returns kernel values $\kappa(\cdot,\cdot)$ only for the training data points.

Interestingly, this is not the case for the additive kernel SVM. The main idea of this paper is to exploit the decomposition property of additive kernels to resolve the above two problems. The Gram matrix $\mathbf{K}$ is parameterized in terms of quantized additive kernel Gram matrices $Q_{n}$ (referred to as quantized tables (QTs) hereafter), and the QTs are optimized by the SDP in Eq. (13). Once the QTs are obtained by the SDP, the dimension-wise function $h_{n}(x_{n}) = \sum_{l=1}^{L}\alpha^{(l)} y^{(l)} \kappa_{n}(x_{n}^{(l)}, x_{n})$ is stored in a lookup table (LUT) for all possible $x_{n}$. When a new test data point is presented, the classifier $f(\mathbf{x}) = \sum_{l=1}^{L}\alpha^{(l)} y^{(l)} \kappa(\mathbf{x}^{(l)}, \mathbf{x}) + b = \sum_{n=1}^{N} h_{n}(x_{n}) + b$ is evaluated using the LUTs. Here, the term quantized means that each dimension is decomposed into $N_{Q}$ subintervals and the QT is computed for $N_{Q} \times N_{Q}$ pairs, as in a 2-dimensional array. An example of the QT for the linear kernel $\kappa_{LIN}$ is shown in Figure 3.

Figure 3. Quantized additive kernel Gram matrix $Q_{n}$ according to the linear kernel $\kappa_{LIN}$.

In the figure, it is assumed that the QT is for the $n$-th element $x_{n}$, and that the input domain for $x_{n}$ is $[-3, 5]$. If the input domain $[-3, 5]$ is decomposed into $N_{Q} - 1$ (here $6 - 1 = 5$) subintervals with a width of 1.6, we obtain the 6 representative values $\{-3.0, -1.4, 0.2, 1.8, 3.4, 5.0\}$ for $x_{n}$. Hereafter, let us denote the QT for the $n$-th dimensional value $x_{n}$ by $Q_{n}$. As an example, assume that two new data input points $\mathbf{x}^{(i)} = [x_{1}^{(i)}\ x_{2}^{(i)}\ \cdots\ x_{N}^{(i)}]^{T}$ and $\mathbf{x}^{(j)} = [x_{1}^{(j)}\ x_{2}^{(j)}\ \cdots\ x_{N}^{(j)}]^{T}$ are presented, and that their $n$-th elements are $x_{n}^{(i)} = 2.2$ and $x_{n}^{(j)} = 4.1$, respectively. The two values are then approximated as $x_{n}^{(i)} \approx 1.8$ and $x_{n}^{(j)} \approx 3.4$, respectively, and the $n$-th kernel value for the two data points can be read out at the 4th row and the 5th column of the $n$-th QT $Q_{n}$, as shown in Fig. 4:

$$\kappa_{n}\big(x_{n}^{(i)}, x_{n}^{(j)}\big) = Q_{n}[4, 5] = 0.48, \qquad (14)$$

where the bracket is an array notation and only positive integers can be used as arguments.
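To make the quantization step concrete, the sketch below (illustrative Python; the representative values mirror the example above, but the 6×6 table is a hypothetical placeholder, not values from a trained model) maps a raw coordinate to its nearest representative value and reads the dimension-wise kernel value out of a QT.

```python
import numpy as np

def bin_index(x_n, reps):
    """Index of the representative value closest to x_n
    (1-based in the paper, 0-based here)."""
    return int(np.argmin(np.abs(reps - x_n)))

# Six representative values covering the domain [-3, 5], as in the example above.
reps = np.array([-3.0, -1.4, 0.2, 1.8, 3.4, 5.0])

# A hypothetical symmetric 6x6 quantized table Q_n (produced by the SDP in the paper).
rng = np.random.default_rng(0)
Q_n = rng.uniform(0.0, 1.0, size=(6, 6))
Q_n = 0.5 * (Q_n + Q_n.T)

x_i, x_j = 2.2, 4.1                 # the n-th elements of two new points
r, c = bin_index(x_i, reps), bin_index(x_j, reps)
print(r, c)                         # -> 3, 4 (4th row, 5th column in 1-based indexing)
print(Q_n[r, c])                    # dimension-wise kernel value kappa_n(x_i, x_j)
```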



Figure 4. Obtaining the kernel value $\kappa_{n}(x_{n}^{(i)}, x_{n}^{(j)})$ from $Q_{n}$ according to the linear kernel $\kappa_{LIN}$.

In a similar way, $\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ is represented as a summation of elements of the $Q_{n}$'s, each coming from a different $Q_{n}$. The idea of this paper is to encode the elements of the $Q_{n}$'s into the full Gram matrix $\mathbf{K}$ and to optimize the $Q_{n}$'s by the SDP. Figure 5 shows an example of the SDP optimization process using the Gram matrix $\mathbf{K}$ and the QTs $Q_{n}$.

Figure 5. Example of the SDP optimization process using the QTs $Q_{n}$.

Once the $Q_{n}$'s are computed, the decision function $h_{n}(x_{n}) = \sum_{l=1}^{L}\alpha^{(l)} y^{(l)} \kappa_{n}(x_{n}^{(l)}, x_{n})$ is stored in a lookup table (LUT) for all possible $x_{n}$. Figure 6 shows an example of making the LUT of $h_{1}(x_{1})$.

Figure 6. Example of making the LUT of $h_{1}(x_{1})$.

Using these LUTs of $h_{n}(x_{n})$, the decision function $f(\mathbf{x}) = \sum_{l=1}^{L}\alpha^{(l)} y^{(l)} \kappa(\mathbf{x}^{(l)}, \mathbf{x}) + b = \sum_{n=1}^{N} h_{n}(x_{n}) + b$ can be computed from the LUTs when new test points are presented. For the sake of simplicity, the additive kernel will first be optimized for the 2-dimensional case in Subsection 3.2, and the idea will then be generalized to the higher-dimensional case in Subsection 3.3.

3.2. Kernel Optimization: Simple Case

Let us assume that a training set consisting of the following 5 samples is given:

$$\big\{(\mathbf{x}^{(i)}, y^{(i)})\big\}_{i=1}^{5}, \quad \mathbf{x}^{(1)} = \begin{bmatrix}-2.5\\0.2\end{bmatrix},\ \mathbf{x}^{(2)} = \begin{bmatrix}0.5\\1\end{bmatrix},\ \mathbf{x}^{(3)} = \begin{bmatrix}3.4\\1.97\end{bmatrix},\ \mathbf{x}^{(4)} = \begin{bmatrix}-0.5\\-1\end{bmatrix},\ \mathbf{x}^{(5)} = \begin{bmatrix}2.12\\-0.3\end{bmatrix}, \quad y^{(i)} \in \{-1, +1\}. \qquad (15)$$

From left to right, the training data points are numbered from 1 to 5; for example, $\mathbf{x}^{(2)} = [0.5\ \ 1]^{T}$ and $y^{(2)} = -1$. Assume that the domains of $x_{1}$ and $x_{2}$ are $[-3, 5]$ and $[-2, 2]$, respectively. When $N_{Q}$ is set to six for both variables, the two QTs $Q_{1}$ and $Q_{2}$ are given as in Fig. 7.

Figure 7. Two-dimensional QTs $Q_{1}$ and $Q_{2}$ for $N_{Q} = 6$.

First, we fill out the full kernel Gram matrix $\mathbf{K}$ as a summation of elements of the $Q_{n}$'s. For example,

$$\mathbf{K}[1,1] = \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) = \kappa_{1}(x_{1}^{(1)}, x_{1}^{(1)}) + \kappa_{2}(x_{2}^{(1)}, x_{2}^{(1)}) = \kappa_{1}(-2.5, -2.5) + \kappa_{2}(0.2, 0.2) = Q_{1}[1,1] + Q_{2}[4,4], \qquad (16)$$

because $-2.5$ should be approximated as $-3.0$ in $x_{1}$ and $0.2$ should be approximated as $0.4$ in $x_{2}$. This can be illustrated as shown in Figure 8.

Figure 8. Obtaining the kernel value $\kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(1)})$ from the $Q_{n}$'s.

In the same way,

$$\mathbf{K}[1,2] = \kappa(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \kappa_{1}(x_{1}^{(1)}, x_{1}^{(2)}) + \kappa_{2}(x_{2}^{(1)}, x_{2}^{(2)}) = \kappa_{1}(-2.5, 0.5) + \kappa_{2}(0.2, 1) = Q_{1}[1,3] + Q_{2}[4,5], \qquad (17)$$

because $-2.5$ and $0.5$ are approximated as $-3.0$ and $0.2$, respectively, in $x_{1}$, and $0.2$ and $1$ are approximated as $0.4$ and $1.2$, respectively, in $x_{2}$. In the same way, the other kernel values are parameterized in terms of the elements of the $Q_{n}$'s by

K 1, 3 

1

K 1, 4 

1

K 1, 5 

1

K  2, 2 

1

K  2, 3 

1

K  2, 4 

1

n

s by

[1, 4] 

2

[1,3] 

2

[4, 2] ,

[1, 4] 

2

[4, 3] ,

[4, 6] ,

[3,3] 

2

[3, 4] 

2

[5, 6] ,

[2,3] 

2

[5, 2] ,

16

[5, 5] ,

(18)

ACCEPTED MANUSCRIPT

K  2, 5 

[2, 4] 

1

2

[5, 3] ,

2

[3,3].



K 5, 5 

[4, 4] 

1

The full kernel Gram matrix $\mathbf{K}$ is then parameterized by

$$\mathbf{K} = \begin{bmatrix}
Q_{1}[1,1]+Q_{2}[4,4] & Q_{1}[1,3]+Q_{2}[4,5] & Q_{1}[1,4]+Q_{2}[4,6] & Q_{1}[1,3]+Q_{2}[4,2] & Q_{1}[1,4]+Q_{2}[4,3] \\
* & Q_{1}[3,3]+Q_{2}[5,5] & Q_{1}[3,4]+Q_{2}[5,6] & Q_{1}[3,3]+Q_{2}[5,2] & Q_{1}[3,4]+Q_{2}[5,3] \\
* & * & Q_{1}[4,4]+Q_{2}[6,6] & Q_{1}[4,3]+Q_{2}[6,2] & Q_{1}[4,4]+Q_{2}[6,3] \\
* & * & * & Q_{1}[3,3]+Q_{2}[2,2] & Q_{1}[3,4]+Q_{2}[2,3] \\
* & * & * & * & Q_{1}[4,4]+Q_{2}[3,3]
\end{bmatrix}. \qquad (19)$$

Here, it should be noted that the kernel Gram matrix $\mathbf{K}$ is symmetric, so it does not need to be fully encoded: encoding the diagonal and the upper triangular part is enough, and $*$ in Eq. (19) denotes that the corresponding element is determined by symmetry. If the full Gram matrix $\mathbf{K}$ in Eq. (19) is substituted into Eq. (13) and the SDP is optimized, the optimal $Q_{n}$'s are obtained, as in Figure 9.

Figure 9. Optimized QTs $Q_{1}$ and $Q_{2}$.

In the figure, the elements of $Q_{n}$ are displayed to three decimal places. The reason why many 0s are found is that only five training samples are used in this explanation. The two 1-dimensional functions $h_{1}(x_{1})$ and $h_{2}(x_{2})$ can then be implemented from the $Q_{n}$'s and the training results $\boldsymbol{\alpha} = [0.08\ \ 0.08\ \ 0.0028\ \ 0.08\ \ 0.08]^{T}$ and $b = -1.03$. Figure 10 shows the LUTs of $h_{1}(x_{1})$ and $h_{2}(x_{2})$.

Figure 10. LUTs of $h_{n}(x_{n})$.

When a new test data point $\mathbf{x}^{test}$ is presented, the decision function can be computed from the above two functions given in Figure 10. For example, when $\mathbf{x}^{test} = [2\ \ 1.5]^{T}$, the values $h_{n}(x_{n}^{test})$ can simply be computed by retrieving values from the LUTs, as in Fig. 11.

Figure 11. Computing $h_{n}(x_{n})$ using the LUTs for $\mathbf{x}^{test} = [2\ \ 1.5]^{T}$.

Finally, the SVM with the optimal additive kernel can be evaluated by

$$f(\mathbf{x}^{test}) = h_{1}(x_{1}^{test}) + h_{2}(x_{2}^{test}) + b = h_{1}(2) + h_{2}(1.5) + b = 1.68 + 0 - 1.03 = 0.65. \qquad (20)$$

3.3. Kernel Optimization: General Case

In this subsection, the above idea is generalized to a higher-dimensional system. First, it is assumed that the dimension of the input vectors $\mathbf{x}$ is $N$, and that the domain of each variable, $\mathrm{dom}(x_{n})$, is given, where $n = 1, 2, \ldots, N$. Furthermore, it is assumed that a set of $N_{Q}$ representative values is selected from the domain of each variable, such that

$$\begin{aligned}
\mathcal{X}_{1} &= \{x_{1,1}, x_{1,2}, x_{1,3}, \ldots, x_{1,N_{Q}}\} \subset \mathrm{dom}(x_{1}), \\
\mathcal{X}_{2} &= \{x_{2,1}, x_{2,2}, x_{2,3}, \ldots, x_{2,N_{Q}}\} \subset \mathrm{dom}(x_{2}), \\
&\ \ \vdots \\
\mathcal{X}_{N} &= \{x_{N,1}, x_{N,2}, x_{N,3}, \ldots, x_{N,N_{Q}}\} \subset \mathrm{dom}(x_{N}),
\end{aligned} \qquad (21)$$

where $x_{n,1} < x_{n,2} < \cdots < x_{n,N_{Q}}$. Let us then define an index function

$$\iota_{n} : \mathbb{R} \rightarrow \{1, 2, 3, \ldots, N_{Q}\} \qquad (22)$$

such that

$$\iota_{n}(x_{n}) = \arg\min_{m} \big| x_{n} - x_{n,m} \big|. \qquad (23)$$

In general, when two new data input points $\mathbf{x}^{(i)}$ and $\mathbf{x}^{(j)}$ are given, their kernel value can be computed from the $N$ QTs by

$$\kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \sum_{n=1}^{N}\kappa_{n}(x_{n}^{(i)}, x_{n}^{(j)}) = \sum_{n=1}^{N} Q_{n}\big[m_{n}^{(i)}, m_{n}^{(j)}\big], \qquad (24)$$

where $m_{n}^{(i)} = \iota_{n}(x_{n}^{(i)})$ and $m_{n}^{(j)} = \iota_{n}(x_{n}^{(j)})$. As in the simple case, the full kernel Gram matrix $\mathbf{K}$ is parameterized by

$$\mathbf{K} = \begin{bmatrix}
\sum_{n=1}^{N} Q_{n}[m_{n}^{(1)}, m_{n}^{(1)}] & \sum_{n=1}^{N} Q_{n}[m_{n}^{(1)}, m_{n}^{(2)}] & \sum_{n=1}^{N} Q_{n}[m_{n}^{(1)}, m_{n}^{(3)}] & \cdots & \sum_{n=1}^{N} Q_{n}[m_{n}^{(1)}, m_{n}^{(L)}] \\
* & \sum_{n=1}^{N} Q_{n}[m_{n}^{(2)}, m_{n}^{(2)}] & \sum_{n=1}^{N} Q_{n}[m_{n}^{(2)}, m_{n}^{(3)}] & \cdots & \sum_{n=1}^{N} Q_{n}[m_{n}^{(2)}, m_{n}^{(L)}] \\
* & * & \sum_{n=1}^{N} Q_{n}[m_{n}^{(3)}, m_{n}^{(3)}] & \cdots & \sum_{n=1}^{N} Q_{n}[m_{n}^{(3)}, m_{n}^{(L)}] \\
\vdots & & & \ddots & \vdots \\
* & * & * & \cdots & \sum_{n=1}^{N} Q_{n}[m_{n}^{(L)}, m_{n}^{(L)}]
\end{bmatrix}. \qquad (25)$$

If the full Gram matrix $\mathbf{K}$ in Eq. (25) is substituted into Eq. (13), the SDP is optimized by

$$\begin{aligned} p^{*} = \min_{t,\,\lambda,\,\boldsymbol{\nu},\,\boldsymbol{\delta},\,Q_{n}}\ \ & t\\ \text{such that}\ \ & \operatorname{trace}\!\big(\mathbf{K}(Q_{n})\big) = c,\quad \mathbf{K}(Q_{n}) \succeq 0,\\ & \begin{bmatrix} \mathbf{y}\mathbf{y}^{T} \circ \mathbf{K}(Q_{n}) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y}\\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{T} & t - 2C\boldsymbol{\delta}^{T}\mathbf{e} \end{bmatrix} \succeq 0,\\ & \boldsymbol{\nu} \ge \mathbf{0},\quad \boldsymbol{\delta} \ge \mathbf{0}, \end{aligned} \qquad (26)$$

and the $N$ optimal $Q_{n}$'s are obtained. The $N$ 1-dimensional functions $h_{n}(x_{n})$ can then be evaluated from the $Q_{n}$'s by

$$h_{n}(x_{n}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa_{n}(x_{n}^{(i)}, x_{n}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} Q_{n}\big[m_{n}^{(i)}, m_{n}\big], \qquad (27)$$

where $m_{n}^{(i)} = \iota_{n}(x_{n}^{(i)})$ and $m_{n} = \iota_{n}(x_{n})$, and they can be implemented in the LUTs. When a new test data point $\mathbf{x}^{test}$ is presented, $h_{n}(x_{n}^{test})$ can easily be read out from the LUTs, and the SVM $f(\mathbf{x}^{test})$ can be evaluated by adding the $N$ values $h_{n}(x_{n}^{test})$. The above algorithm is summarized in Table 1.

Table 1. Support vector machine with an optimal additive kernel

Data: $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{L}$, $\mathbf{x}^{(i)} \in \mathbb{R}^{N}$, $y^{(i)} \in \{-1, +1\}$
Optimization variables: $Q_{n} \in \mathbb{R}^{N_{Q_{n}} \times N_{Q_{n}}}$, $\boldsymbol{\alpha} \in \mathbb{R}^{L}$, $\boldsymbol{\nu} \in \mathbb{R}^{L}$, $\boldsymbol{\delta} \in \mathbb{R}^{L}$, $t \in \mathbb{R}$, $\lambda \in \mathbb{R}$

1: // 1. Mapping the $Q_{n}$ values to the kernel $\mathbf{K}(Q_{n})$
2: for $i = 1$ to $L$
3:   for $j = 1$ to $i$
4:     $\mathbf{K}_{i,j}(Q_{n}) = \kappa(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \sum_{n=1}^{N}\kappa_{n}(x_{n}^{(i)}, x_{n}^{(j)}) = \sum_{n=1}^{N} Q_{n}[m_{n}^{(i)}, m_{n}^{(j)}]$, where $m_{n}^{(i)} = \iota_{n}(x_{n}^{(i)})$ and $m_{n}^{(j)} = \iota_{n}(x_{n}^{(j)})$
5:     $\mathbf{K}_{j,i}(Q_{n}) = \mathbf{K}_{i,j}(Q_{n})$
6:   end for
7: end for
8: // 2. Kernel optimization
9: $p^{*} = \min_{t,\lambda,\boldsymbol{\nu},\boldsymbol{\delta},Q_{n}} t$
10:  such that $\operatorname{trace}(\mathbf{K}(Q_{n})) = c$, $\mathbf{K}(Q_{n}) \succeq 0$, $\begin{bmatrix} \mathbf{y}\mathbf{y}^{T} \circ \mathbf{K}(Q_{n}) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y}\\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{T} & t - 2C\boldsymbol{\delta}^{T}\mathbf{e}\end{bmatrix} \succeq 0$, $\boldsymbol{\nu} \ge \mathbf{0}$, $\boldsymbol{\delta} \ge \mathbf{0}$
11: // 3. Computing the dual variables $\boldsymbol{\alpha}$
12: $\boldsymbol{\alpha} = (\mathbf{y}\mathbf{y}^{T} \circ \mathbf{K})^{-1}(\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})$
13: // 4. Decision function evaluation for $\mathbf{x}^{test}$
14: for $n = 1$ to $N$
15:   $h_{n}(x_{n}^{test}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} \kappa_{n}(x_{n}^{(i)}, x_{n}^{test}) = \sum_{i=1}^{L}\alpha^{(i)} y^{(i)} Q_{n}[m_{n}^{(i)}, m_{n}^{test}]$, where $m_{n}^{(i)} = \iota_{n}(x_{n}^{(i)})$ and $m_{n}^{test} = \iota_{n}(x_{n}^{test})$
16: end for
17: $y^{test} = +1$ if $f(\mathbf{x}^{test}) \ge 0$ and $-1$ otherwise, where $f(\mathbf{x}^{test}) = \sum_{n=1}^{N} h_{n}(x_{n}^{test}) + b$

IV. EXPERIMENTAL RESULTS

In this section, the performance of the proposed method is compared with that of the other kernels on two kinds of datasets: artificially generated datasets and benchmark datasets obtained from practical usage. The artificial datasets are 2-dimensional for ease of visualization, and three such datasets are considered. The benchmark datasets are taken from the UCI repository [29] and represent multi-dimensional data. All SVM classifiers used in the experiments are trained using the MATLAB CVX toolbox [23] and run on a Core i5 3.40 GHz processor.

4.1. Performance Measures

In order to compare the classification performance, we use five performance measures that are popular for binary classifiers [25-29]. Table 2 shows the definitions of the performance measures used in this paper. In Table 2, $L_{p}$ and $L_{n}$ denote the number of positive and negative samples, respectively, and $s = L_{n}/L_{p}$ is the skew factor, i.e., the ratio between $L_{n}$ and $L_{p}$. In addition, $tp$ and $tn$ denote the number of true positive and true negative samples, respectively.

Table 2. Definitions of performance measures

Performance Measure          Definition
True Positive Rate (TPR)     $tpr = tp / L_{p}$
True Negative Rate (TNR)     $tnr = tn / L_{n} = 1 - fpr$
Precision                    $tpr / (tpr + s \cdot fpr)$
F-measure                    $2\,tpr / (tpr + s \cdot fpr + 1)$
Accuracy                     $(tpr + s(1 - fpr)) / (1 + s)$
Testing Time (ms)            Average computational time to test a single sample
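The measures in Table 2 follow directly from the confusion counts; the short Python sketch below (an illustration, not code from the paper) computes them exactly as defined above, including the skew factor s.

```python
def binary_measures(tp, fn, tn, fp):
    """TPR, TNR, precision, F-measure and accuracy as defined in Table 2."""
    L_p, L_n = tp + fn, tn + fp          # numbers of positive / negative samples
    s = L_n / L_p                        # skew factor s = L_n / L_p
    tpr = tp / L_p
    fpr = fp / L_n                       # false positive rate, = 1 - TNR
    tnr = 1.0 - fpr
    precision = tpr / (tpr + s * fpr)
    f_measure = 2.0 * tpr / (tpr + s * fpr + 1.0)
    accuracy = (tpr + s * (1.0 - fpr)) / (1.0 + s)
    return dict(TPR=tpr, TNR=tnr, Precision=precision,
                F_measure=f_measure, Accuracy=accuracy)

# Toy usage with hypothetical counts.
print(binary_measures(tp=45, fn=5, tn=60, fp=10))
```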

ACCEPTED MANUSCRIPT

In this paper, we evaluate and discuss the classification performance of the conventional kernels and the proposed OAK based on the measures in Table 2.

4.1. 2D Artificial Datasets

Three 2-dimensional synthetic datasets are utilized, referred to as 'Two moons', 'Two spirals' and 'Two circles'. For each dataset, the proposed method is compared with the previous additive kernels and with the RBF kernel, defined by

$$\kappa_{RBF}(\mathbf{x}, \mathbf{z}) = \exp\!\left( -\frac{(\mathbf{x} - \mathbf{z})^{T}(\mathbf{x} - \mathbf{z})}{2\sigma^{2}} \right). \qquad (28)$$

In the experiments with the 2D artificial datasets, all SVMs are trained with $C = 0.1$, and five-fold cross-validation is performed to test the classification performance. Figure 12 shows an example of k-fold cross-validation.

Figure 12. Example of k-fold cross-validation.
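For reference, the minimal sketch below illustrates the five-fold protocol with scikit-learn's StratifiedKFold; the data and the RBF-SVM classifier are stand-ins for illustration only, not the OAK training code or the paper's datasets.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical 2D dataset standing in for one of the synthetic sets.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(450, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # gamma = 1 / (2 * sigma^2) matches Eq. (28) with sigma = 0.1.
    clf = SVC(kernel="rbf", C=0.1, gamma=1.0 / (2 * 0.1**2))
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(scores))
```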


To evaluate the performance of the classifiers, we compare them in terms of their decision boundaries and the six measures from Table 2. For all the artificial datasets, the OAK is trained with $N_{Q} = 21$, so that the axis interval is 0.05, and $\sigma$ in Eq. (28) is set to 0.1 for the RBF kernel.

4.1.1 Two Moons Dataset

The 'Two moons' dataset consists of 2 classes with the appearance of two half-moons facing each other. The number of samples is 450 and $\mathrm{dom}(x_{n})$ ($n = 1, 2, \ldots, N$) is $[0, 1]$. Figure 13 shows the decision boundaries of the conventional kernels and the OAK. In the figure, $\kappa_{LIN}$, $\kappa_{IK}$, $\kappa_{GIK}$, $\kappa_{\chi^{2}}$ and $\kappa_{OAK}$ denote the linear kernel (9), the intersection kernel (6), the generalized intersection kernel (7), the $\chi^{2}$ kernel (8) and the OAK, respectively.

(a) $\kappa_{LIN}$  (b) $\kappa_{IK}$  (c) $\kappa_{GIK}$  (d) $\kappa_{\chi^{2}}$  (e) $\kappa_{RBF}$  (f) $\kappa_{OAK}$

Figure 13. Comparison of the decision boundaries for the ‘Two moons’ dataset

In Fig. 13, the previous AKs have difficulty separating the two classes in the regions where the boundary between the two classes is highly nonlinear. The OAK and RBF kernels, however, classify the given samples well and outperform the previous AKs. It is also interesting to observe that the decision boundary of the proposed method is axis-aligned and looks like a square wave, while those of the other kernels are smooth curves. The axis-aligned square boundary of the OAK is caused by the nature of the quantization in $Q_{n}$. The QTs $Q_{1}$ and $Q_{2}$ are visualized in Fig. 14, in which the darker a region is, the higher the associated $\kappa_{n}(\cdot,\cdot)$ is.

Figure 14. Optimized $Q_{n}$ for the 'Two moons' dataset: (a) $Q_{1}$, (b) $Q_{2}$.

The QTs have some square blocks of equal values, and these blocks result in the axis-aligned square boundary of the OAK. The blocks are formed because an input feature $x_{n}$ is approximated by one of the representative values $\{x_{n,1}, x_{n,2}, x_{n,3}, \ldots, x_{n,N_{Q}}\}$. If we increase $N_{Q}$, the decision boundary becomes smoother, but its square property is preserved. The performance measures of the conventional kernels and the OAK are summarized in Table 3. For the additive kernels, which include the previous AKs and the OAK, the testing performance is measured using two LUTs with a total size of $2 \times 100$. In Table 3, the best performance for each measure is highlighted in bold face, and the number within brackets denotes the rank of the classifier.

Table 3. Classification performance: 'Two moons' dataset

             κ_LIN      κ_IK       κ_GIK      κ_χ²       κ_RBF      κ_OAK
TPR          0.8267(5)  0.9200(4)  0.9600(3)  0.8000(6)  1.0000(1)  1.0000(1)
TNR          0.8311(6)  0.8800(3)  0.8400(5)  0.8756(4)  1.0000(1)  0.9600(2)
Precision    0.8302(6)  0.8875(3)  0.8594(5)  0.8681(4)  1.0000(1)  0.9621(2)
F-measure    0.8274(6)  0.9026(4)  0.9062(3)  0.8303(5)  1.0000(1)  0.9806(2)
Accuracy     0.8289(6)  0.9000(3)  0.9000(3)  0.8378(5)  1.0000(1)  0.9800(2)
Time (ms)    0.0017(1)  0.0036(5)  0.0028(2)  0.0028(2)  4.6421(6)  0.0028(2)

CR IP T

2

To evaluate not only the accuracy, but also the testing time, we plot the accuracy versus testing time, as illustrated in Figure 15.

M

1

0.96

PT

Linear IK GIK

0.92

2 RBF OAK

0.9

CE

Accuracy

0.94

ED

0.98

0.88

AC

0.86 0.84

0.82 -3 10

-2

10

-1

10

0

10

Testing Time (ms)

Figure 15. Accuracy versus testing time: ‘Two moons’ dataset

29

1

10

ACCEPTED MANUSCRIPT

From Table 3 and Figure 15, our proposed OAK outperforms the other kernels, except for the RBF kernel. However, the testing time of the OAK is about 1500 times faster than that of the RBF kernel, while the OAK has only 2% lower accuracy than the RBF kernel.

4.1.2 Two Spirals Dataset

The 'Two spirals' dataset is also 2-dimensional and consists of two classes that have the shape of spirals. This set has a more nonlinear boundary than the 'Two moons' set. The number of samples is 450 and $\mathrm{dom}(x_{n})$ ($n = 1, 2, \ldots, N$) is $[0, 1]$. Figure 16 shows the decision boundaries of the conventional kernels and the OAK.

1 0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4 0.3

0.2 0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.2 0.1 0

0

0.1

0.2

0.3

ED

0

M

0.3

PT

0.9

0.6

0.5

0.5

AC

0.7

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0.1

0.2

0.3

0.4

0.5

0.6

0.8

0.9

1

0.7

0.8

0.9

1

0.8

0.6

0

0.7

0.9

0.7

0

0.6

1

CE

0.8

0.5

(b)  IK

(a)  LIN 1

0.4

0.7

0.8

0.9

0

1

(c)  GIK

0

0.1

0.2

0.3

0.4

0.5

0.6

(d)   2

30

1

1

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

1

(e)  RBF

0

0.1

0.2

CR IP T

ACCEPTED MANUSCRIPT

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(f)  OAK

Figure 16. Comparison of the decision boundaries for the ‘Two spirals’ dataset

AN US

As shown in Figure 16, the previous AKs can separate two classes only in the outer spirals, but fail in separating them in the inner spirals. The proposed OAK and RBF kernel, however, can separate two classes well not only in the outer spirals, but also in the inner

M

spirals. Compared with the RBF, the decision boundary of the OAK SVM has the shape of a rectangular spiral, while RBF has a smooth decision boundary. Figure 16 shows the QTs

ED

Qn for the OAK. The QTs actually represent the combination of rectangles, and they

AC

CE

PT

explain why the decision boundary in the OAK is a rectangular spiral.

31

CR IP T

ACCEPTED MANUSCRIPT

(b) Q2

(a) Q1

AN US

Figure 17. Optimized Qn for the ‘Two spirals’ dataset

In Table 4, the performances of the conventional kernels and OAK are summarized in terms of the 5 performance measurements in Table 2. As in the experiment with the ‘Two moon’

M

dataset, all AKs and the OAK utilize an LUT with a size of 2 100 for testing.

 LIN

ED

Table 4. Classification performance: ‘Two spirals’ dataset

 IK

 GIK



2

 RBF

 OAK

0.6533(4)

0.6533(4)

0.6000(6)

0.9667(1)

0.8867(2)

TNR

0.6467(5)

0.6600(4)

0.5467(6)

0.6800(3)

0.9600(1)

0.8933(2)

0.6867(3)

Precision

0.6533(5)

0.6613(3)

0.6074(6)

0.6569(4)

0.9625(1)

0.8951(2)

F-measure

0.6520(4)

0.6558(3)

0.6419(5)

0.6235(6)

0.9638(1)

0.8864(2)

Accuracy

0.6500(4)

0.6567(3)

0.6167(6)

0.6400(5)

0.9633(1)

0.8900(2)

Time (ms)

0.0027(1)

0.0027(1)

0.0028(2)

0.0028(2)

4.0666(6)

0.0028(2)

AC

CE

PT

TPR

In addition, Figure 18 shows the accuracy versus testing time for all SVMs used in this experiment with the ‘Two spirals’ dataset. 32

ACCEPTED MANUSCRIPT

1 0.95 0.9 0.85 Linear IK GIK

CR IP T

Accuracy

0.8 0.75

2

RBF OAK

0.7 0.65

AN US

0.6 0.55 0.5 -3 10

-2

-1

10

10

0

10

1

10

Testing Time (ms)

ED

M

Figure 18. Accuracy versus testing time: ‘Two Spirals’

From the above Table 4 and Figure 18, conventional AKs are about 65% accurate with

PT

0.0027 ms of testing time for the ‘Two Spirals’ dataset. Moreover, the RBF kernel performs the best for all classification performance measures, but its testing time is 1500 times more

CE

than the AKs. On the other hand, our proposed method OAK shows 20% higher accuracy

AC

than the AKs, and has a testing time that is 1500 times faster than the RBF kernel, while its accuracy is only 8% lower than the RBF.

33

ACCEPTED MANUSCRIPT

4.1.3 Two Circles Dataset The ‘Two circles’ dataset is also 2-dimensional and it consists of two classes. Unlike the previous two examples, one class is surrounded by the other class. The number of samples is 450 and the dom  xn  ( n  1, 2,

, N ) is [0,1]. Figure 19 shows the decision boundaries of

1 0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1 0

AN US

1 0.9

CR IP T

the conventional kernels and OAK.

0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(b)  IK

M

(a)  LIN 1

1 0.9

ED

0.9 0.8 0.7 0.6

0.8 0.7 0.6 0.5

PT

0.5 0.4 0.3

0.1

0.1

0.2

AC

0

CE

0.2

0

0

0.3

0.4

0.4 0.3 0.2 0.1

0.5

0.6

0.7

0.8

0.9

0

1

(c)  GIK

0

0.1

0.2

0.3

0.4

0.5

0.6

(d)   2

34

0.7

0.8

0.9

1

ACCEPTED MANUSCRIPT

1 0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

0.4

0.4

0.3

0.3

0.2

0.2

0.1

0.1

0

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0

1

(e)  RBF

0

0.1

0.2

CR IP T

1 0.9

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

(f)  OAK

Figure 19. Comparison of the decision boundaries for the ‘Two Spirals’ dataset

AN US

From the above decision boundaries, previous AKs, except for the IK, fail to make circle shaped decision boundaries, while the IK, RBF and OAK have circle shaped boundaries and can separate the inner class from the outer class. As with previous experiments, the IK and

M

RBF kernel have a smooth circle, but our proposed OAK has a rectangular circle. The QTs

AC

CE

PT

ED

for the OAK are illustrated in Figure 20.

(b) Q2

(a) Q1

Figure 20. Optimized Qn for the ‘Corners’ dataset

35

ACCEPTED MANUSCRIPT

As shown in Figure 20, the QTs consist of 5 blocks which are located at the 4 corners and center of the table, and that is why the decision boundary in the OAK has a rectangular circle for the ‘Two circles’ dataset. In Table 5, the performances of the conventional kernels and OAK are summarized in

OAK utilize a LUT with a size of 2 100 for testing.

CR IP T

terms of the 5 performance measurements. As in previous experiments, all of the AKs and

Table 5. Classification Performance: ‘Circles’ Dataset

 LIN

 IK

TPR

0.8523(4)

0.9760(2)

TNR

0.0583(6)

0.8583(3)

Precision

0.4879(6)

0.8834(3)

F-measure

0.6169(6)

0.9262(2)

Accuracy

0.4681(6)

0.9188(3)

Time (ms)

0.0056(5)

0.0040(1)

2

 RBF

 OAK

0.7340(5)

0.6726(6)

1.0000(1)

0.9449(3)

0.8500(4)

0.5833(5)

0.9167(1)

0.9000(2)

0.8394(4)

0.6320(5)

0.9311(1)

0.9093(2)

0.7775(4)

0.6452(5)

0.9637(1)

0.9262(2)

0.7906(4)

0.6280(5)

0.9601(1)

0.9227(2)

0.0040(1)

0.0049(3)

3.4888(6)

0.0049(3)

M

ED



AN US

 GIK

PT

In addition, Figure 21 shows the accuracy versus testing time for all SVMs used in this

AC

CE

experiment with the ‘Two Circles’ dataset.

36

ACCEPTED MANUSCRIPT

1

0.9 Linear IK GIK

2

CR IP T

Accuracy

0.8

RBF OAK

0.7

0.6

0.4 -3 10

AN US

0.5

-2

-1

10

10

0

10

1

10

Testing Time (ms)

M

Figure 21. Accuracy versus testing time: ‘Two Circles’

ED

As in previous experiments, the RBF kernel outperforms other AKs in terms of the performance measures, but it takes about 3.48 ms per sample, which is 870 times longer

PT

than the AKs. On the other hand, the OAK shows better performance than the other AKs, while maintaining a similar testing time compared to the AKs.

CE

4.2. UCI Repository DB

AC

In this subsection, the conventional kernels and OAK are applied to a higher dimensional general dataset. 8 datasets from the UCI repository DB are used and their characteristics are summarized in Table 6 [29]. In Table 6, the columns ‘# of data’ and ‘ s  Ln / Lp ’ denote the number of samples and the ratio between the number of positives, respectively. In addition, the column ‘Data type’ denotes whether the input feature xn has a categorical value or a 37

ACCEPTED MANUSCRIPT

real value. In the column ‘Parameters’, N Qn and  denote the size of the QT and parameters of the RBF kernel (28), respectively. Table 6. Summary of UCI Repository Datasets

Acute Inflammations

s  Ln / Lp

Dimension

Positive: 50 1.400

6

Negative: 70

Monks1

1.000 Negative: 62 Positive: 60

Monks3

TicTacToc

0.530

9

ED

Negative: 332

PT

Positive: 150

Statlog-Heart

0.800

CE

Positive: 168 1.589

16

Negative: 267

AC

NQ 2~5  2

({yes, no})

  0.1

Categorical

NQ  4

(Integer 1~4)

  0.5

Categorical

NQ  4

(Integer 1~4)

  0.5

Categorical

NQ  3

({O,X,  })

  0.1

Real,

NQ  5

Categorical

 1

Categorical

NQ  3

({Yes, No, Neither})

 1

13

Negative: 120

Votes

Categorical

6

M

Positive: 626

NQ1  11

6

1.033 Negative: 62

NQ  11

Positive: 147

Parkinson

Parameters

Real,

AN US

Positive: 62

Data type

CR IP T

# of data

0.326

22

Real

Negative: 48

  0.1

Positive: 225

NQ  21

Ionosphere

0.560

33

Real

 1

Negative: 126

38

ACCEPTED MANUSCRIPT

In the parameter setting, if the data type is categorical, we set the N Qn as the number of categories. Each category in categorical data is mapped to a number in the range of  0,1 . For example, the categorical data ({O,X,  }) of ‘TicTacToc’ and ({Yes, No, Neither}) of

CR IP T

‘Votes’ are scaled to ({0, 0.5,1}), respectively.

If the data is a real value, we set N Qn to the parameter that shows the best training

accuracy among NQ  {5,11, 21} . In the case of the RBF kernel, we tune  to the

AN US

parameter that shows the best training accuracy among   {0.01,0.05,0.1,0.5,1,5,10} .

Data points are normalized to [0,1], and all classifiers are trained with C  0.01 . Since no testing data are clearly specified in UCI DB, testing is performed by 5-fold cross-validation

M

as in the case of the 2D artificial datasets. The size of LUT is N 1000 where N is the

ED

number of dimensions.

Tables 7 and 8 show a comparison of all the classifiers in terms of the TPR and TNR for the

PT

UCI DB. The highest value for each dataset is highlighted in bold face. In addition, we rank the classifiers according to the average value of each classifier, and it is displayed within the

AC

CE

bracket.

39

ACCEPTED MANUSCRIPT

Table 7. Performance Comparison for UCI Repository Datasets: TPR

 LIN

 IK

 GIK



Acute Inflammations

0.7400

0.8200

0.7800

Monks1

0.6643

0.6619

Monks3

0.7667

TicTacToc

 RBF

 OAK

0.7000

0.0000

1.0000

0.6810

0.6643

0.7262

0.4667

0.8667

0.9333

0.7000

0.4167

0.9500

1.0000

1.0000

1.0000

1.0000

1.0000

0.7714

Statlog-Heart

0.8800

0.8867

0.8867

0.8667

0.9733

0.8800

Votes

0.9404

0.9465

0.9525

0.9404

0.5167

0.9702

Parkinson

1.0000

0.9186

1.0000

1.0000

1.0000

0.9246

Ionosphere

1.0000

0.9867

1.0000

1.0000

1.0000

0.9689

Average

0.8739(3)

0.8859(2)

0.9042(1)

0.8589(5)

0.7041(6)

0.8665(4)

AN US

CR IP T

2

M

Table 8. Performance Comparison for UCI Repository Datasets: TNR

1.0000

 GIK



1.0000

1.0000

PT

Acute Inflammations

 IK

ED

 LIN

 RBF

 OAK

1.0000

1.0000

1.0000

2

0.7262

0.7738

0.6286

0.7571

0.7881

1.0000

0.8571

0.9024

0.9167

0.7762

1.0000

0.9167

0.0000

0.0182

0.0121

0.0000

0.0000

0.5893

Statlog-Heart

0.7750

0.7917

0.7750

0.7833

0.4333

0.8000

Votes

0.9210

0.9210

0.9210

0.9174

1.0000

0.9476

Parkinson

0.0000

0.5944

0.0000

0.0000

0.0000

0.7278

Ionosphere

0.4129

0.7298

0.0000

0.4600

0.0000

0.8255

Average

0.5865(4)

0.7164(2)

0.5317(5)

0.5868(3)

0.5277(6)

0.8509(1)

Monks1

CE

Monks3

AC

TicTacToc

40

ACCEPTED MANUSCRIPT

As shown in Tables 7 and 8, the OAK shows the best performance in TNR, while it ranked 4th in TPR. The reason for this can be found due to s  Ln / Lp for each dataset. For example, the number of positive samples is about twice as much as the number of negative

CR IP T

samples in ‘TicTacToc’, ‘Parkinson’ and ‘Ionosphere’ where conventional kernels have better TPR values than the OAK. For an imbalanced dataset, such as those three datasets, decision boundaries tend to be biased to one class which has more samples than other. This is why conventional kernels are almost 1.000 for TPR, but have a lower value than 0.5 for

AN US

TNR. However, our proposed method OAK has values of over 0.85 for both TPR and TNR, while the other methods do not.

Tables 8, 9, and 10 show the comparison of classifiers in terms of the Precision, Fmeasure and Accuracy for the UCI DB, respectively

M

Table 9. Performance Comparison for UCI Repository Datasets: Precision

Acute Inflammations

1.0000

 IK

 GIK



1.0000

1.0000

ED

 LIN

PT

s  Ln / Lp

 RBF

 OAK

1.0000

0.0000

1.0000

2

0.7056

0.7451

0.6434

0.7306

0.7806

1.0000

0.8408

0.8931

0.9192

0.7532

1.0000

0.9218

0.6535

0.6576

0.6562

0.6535

0.6534

0.7788

Statlog-Heart

0.8357

0.8457

0.8365

0.8314

0.6853

0.8527

Votes

0.8856

0.8867

0.8873

0.8805

1.0000

0.9230

Parkinson

0.7547

0.8785

0.7547

0.7547

0.7547

0.9145

Ionosphere

0.7530

0.8678

0.6410

0.7681

0.6410

0.9089

Average

0.8036(3)

0.8468(2)

0.7923(5)

0.7965(4)

0.6894(6)

0.9125(1)

Monks1

CE

Monks3

AC

TicTacToc

41

ACCEPTED MANUSCRIPT

Table 10. Performance Comparison for UCI Repository Datasets: F-measure s  Ln / Lp

 LIN

 IK

 GIK



Acute Inflammations

0.8405

0.8980

0.8719

Monks1

0.6816

0.6964

Monks3

0.7985

TicTacToc

 RBF

 OAK

0.7995

0.0000

1.0000

0.6582

0.6932

0.7506

0.6353

0.8780

0.9260

0.7188

0.5490

0.9353

0.7904

0.7934

0.7924

0.7904

0.7904

0.7751

Statlog-Heart

0.8556

0.8649

0.8598

0.8480

0.8034

0.8638

Votes

0.9112

0.9143

0.9175

0.9086

0.6795

0.9455

Parkinson

0.8601

0.8964

0.8601

0.8601

0.8601

0.9188

Ionosphere

0.8590

0.9232

0.7813

0.8688

0.7813

0.9374

Average

0.8246(4)

0.8581(2)

0.8334(3)

0.8109(5)

0.6518(6)

0.8764(1)

AN US

CR IP T

2

M

Table 11. Performance Comparison for UCI Repository Datasets: Accuracy

Acute Inflammations

0.8917

 IK

 GIK



0.9250

0.9083

ED

 LIN

PT

s  Ln / Lp

 RBF

 OAK

0.8750

0.5833

1.0000

2

0.6952

0.7179

0.6548

0.7107

0.7571

0.7333

0.8109

0.8840

0.9250

0.7365

0.7090

0.9333

0.6535

0.6597

0.6576

0.6535

0.6534

0.7081

Statlog-Heart

0.8333

0.8444

0.8370

0.8296

0.7333

0.8444

Votes

0.9284

0.9307

0.9331

0.9262

0.8135

0.9563

Parkinson

0.7547

0.8405

0.7547

0.7547

0.7547

0.8761

Ionosphere

0.7892

0.8945

0.6410

0.8062

0.6410

0.9173

Average

0.7946(3)

0.8371(2)

0.7889(4)

0.7866(5)

0.7057(6)

0.8711(1)

Monks1

CE

Monks3

AC

TicTacToc

42

ACCEPTED MANUSCRIPT

As shown in Tables 9-11, the RBF shows better performance than the OAK in terms of the Precision for ‘Monks3’ and ‘Votes’, as well as in terms of the F-measure and Accuracy for ‘Monks1’. However, for the 8 UCI datasets, our proposed OAK has the highest average

comparison of classifiers in terms of testing time.

CR IP T

value for the ‘Precision’, ‘F-measure’ and ‘Accuracy’ for s  Ln / Lp . Table 12 presents a

Table 12. Performance Comparison for UCI Repository Datasets: Testing Time

 LIN

 IK

Acute Inflammations

0.0120

0.0116

Monks1

0.0121

0.0119

Monks3

0.0117

0.0120

TicTacToc

0.0111

0.0111

Statlog-Heart

0.0124

0.0126

Votes

0.0111

Parkinson

0.0116

Ionosphere

0.0116

AN US

2

 RBF

 OAK

0.0116

1.6807

0.0123

0.0118

0.0117

1.7382

0.0120

0.0115

0.0125

1.7099

0.0118

0.0111

0.0110

10.8348

0.0113

0.0117

0.0115

3.8765

0.0114

0.0113

0.0113

0.0120

5.9105

0.0109

0.0118

0.0115

0.0116

2.7939

0.0121

0.0118

0.0116

0.0117

4.7868

0.0113

0.0118(5)

0.0116(1)

0.0117(3)

4.1664(6)

0.0116(1)

ED

M

0.0121

PT

0.0117(3)



CE

Average (ms)

 GIK

AC

As shown in Table 12, it is worth noting that the RBF kernel has a different testing time

for each dataset while the AKs and OAK have almost the same testing time for all of the datasets due to testing with LUTs. Figure 22 shows the average Accuracy versus testing time for the UCI DB.

43

ACCEPTED MANUSCRIPT

Linear IK GIK

0.85

2 RBF OAK

CR IP T

Accuracy

0.8

0.75

0.65

0.6

-2

-1

10

10

AN US

0.7

0

10

1

10

Testing Time (ms)

ED

M

Figure 22. Average accuracy versus testing time: UCI DB

As shown in Figure 22, it can be confirmed that our proposed OAK has the best

PT

performance among the other conventional kernels while maintaining the advantage of fast

CE

testing, as in the AKs.

AC

4.2. LIBSVM datasets

To exploit performance of OAK for large scale dataset, the conventional kernels and OAK

are applied to aDB of LIBSVM dataset [35]. 5 datasets of aDB are used their characteristics are summarized in Table 13.

44

ACCEPTED MANUSCRIPT

Table 13. Summary of LIBSVM Datasets # of data

s  L

n

/ Lp 

Dimensi on

Train

Test

Lp  395 , Ln  1210

Lp  7446 , Ln  23510

s  3.06

s  3.16

Lp  572 , Ln  1693

Lp  7269 , Ln  23027

s  2.96

s  3.17

Lp  773 , Ln  2412

Lp  7068 , Ln  22308

s  3.12

s  3.16

Lp  1188 , Ln  3593

Lp  6653 , Ln  21127

s  3.02

s  3.18

Lp  1569 , Ln  4845

Lp  6272 , Ln  19875

Data type

Parameters

setB

setC

CR IP T

setA

NQ  2

s  3.17

ED

s  3.08

  10

M

setD

setE

Binary

AN US

119

All datasets consist of 119-dimensional binary data and the numbers of training data are

PT

from 1500 to 6414. Since all attributes of data are binary, we set NQ  2 and all classifiers

CE

are trained with C  0.1 . C  0.1 . For the sake of efficiency, linear SVM is trained with

C  0.01 and only the data points which has  (l ) higher than 0.005 are used to train

AC

OAKSVM with C  0.1 . As in the other experiments, testing of AKs and OAK is performed with LUT with a size of 119  2 . In the case of the RBF kernel, we tune  to the parameter that shows the best training accuracy among   {0.1,0.5,1,10} .

45

ACCEPTED MANUSCRIPT

All the classifiers are compared in terms of 5 performance measures (TPR, TNR, Precision, F-measure, Accuracy) for LIBSVM dataset in Tables 14-18. The highest value for each dataset is highlighted in bold face. Table 14. Performance Comparison for LIBSVM Datasets: TPR

 GIK



2

 RBF

 OAK

0.7548

0.6621

0.7886

0.6924

0.7750

0.6689

0.7657

0.6750

CR IP T

 IK

 LIN

0.8022

setB

0.8243

setC

0.8355

setD

0.8568

setE

0.8591

0.7672

0.6615

Average

0.8356

0.7702

0.6720

M

AN US

setA

Table 15. Performance Comparison for LIBSVM Datasets: TNR

 LIN

 IK

 GIK



 OAK

0.8013

0.8252

0.8663

0.7930

0.8071

0.8461

0.7900

0.8164

0.8799

0.7731

0.8208

0.8805

setE

0.7684

0.8204

0.8833

Average

0.7852

0.8180

0.8712

setB

CE

setC

PT

setA

AC

setD

ED

 RBF

46

2

ACCEPTED MANUSCRIPT

Table 16. Performance Comparison for LIBSVM Datasets: Precision

 IK

 LIN

 GIK



2

 RBF

 OAK

0.5612

0.5776

0.6107

setB

0.5570

0.5633

0.5867

setC

0.5579

0.5722

0.6382

setD

0.5432

0.5736

0.6401

setE

0.5392

0.5741

0.6414

Average

0.5517

0.5722

0.6234

AN US

CR IP T

setA

Table 17. Performance Comparison for LIBSVM Datasets: F-measure

 IK

 LIN

setC

0.6604

0.6544

0.6354

0.6648

0.6572

0.6352

0.6688

0.6583

0.6532

0.6649

0.6559

0.6571

0.6626

0.6567

0.6513

0.6643

0.6565

0.6464

AC

CE

Average

PT

setD setE

 OAK

ED

setB



 RBF

M

setA

 GIK

47

2

ACCEPTED MANUSCRIPT

Table 18. Performance Comparison for LIBSVM Datasets: Accuracy

 IK

 LIN

 GIK



2

 RBF

 OAK

0.8015

0.8082

0.8172

setB

0.8005

0.8026

0.8092

setC

0.8009

0.8064

0.8291

setD

0.7932

0.8076

0.8313

setE

0.7901

0.8076

0.8301

Average

0.7972

0.8065

0.8234

AN US

CR IP T

setA

It is interesting to see that all four conventional AKs demonstrate the same results for LIBSVM data set and the reason for it is that the conventional AKs have the same kernel

2

have the same kernel value when the input is binary.

AC

CE

PT

ED



M

value when the input is binary. Table 19 shows an example in which  LIN ,  IK ,  GIK , and

48

ACCEPTED MANUSCRIPT

Table 19. Example of Additive Kernel value for binary data

x  0 0 1 1 , z  0 1 0 1

AK N

N

 LIN  x, z    xn zn

x z n 1

n 1

n n

 0  0  0  1  1 0  1 1  1

N

 min  x , z 

N

 IK  x, z    min  xn , zn 

n

n 1

n

 0  0  0 1  1

 min  x N



N

 GIK  x, z    min xn , zn n 1

2

2



2

n

n 1

, zn

2



 min  0, 0   min  0,12   min 12 , 0   min 12 ,12  N

N

2

2 xn zn n  zn

x n 1

2  0  0 2  0  1 2  1 0 2  1 1    00 0 1 1 0 11  0  0  0 1  1 

M

n 1

2 xn zn xn  zn

AN US

 0  0  0 1  1

   x, z   

CR IP T

 min  0, 0   min  0,1  min 1, 0   min 1,1

n 1

ED

Thus, the conventional AKs cannot improve from the linear SVM. On the other hand, our proposed OAK is completely different from the AKs. The kernel value in the OAK is

PT

trained by optimization and the OAK returns the value which is different from the those of AKs and its computation is as efficient as in the AKs,

CE

Concerning the performance, the OAK outperforms the others in terms of TNR, precision

AC

and accuracy, while AK and RBF kernel demonstrate better performance than OAK in terms of TPR and F-measure. The reason for these results might be that all the LIBSVM datasets are highly imbalanced, as shown in Table 13. The number of negative samples is about three times larger than that of positive samples. The decision boundary tends to be pushed to a class which has more samples than other because the OAK is trained to decrease the sum of

49

ACCEPTED MANUSCRIPT

L

error C   (i ) in (2). Therefore, the OAK and RBF underperform the AKs (Linear) in terms i 1

of TPR and F-measure, which is proportional to TPR. In terms of other measures such as TNR, Precision and accuracy, however, our proposed OAK demonstrates better

time.

CR IP T

performance than other AKs. In Table 20, all classifiers are compared in terms of testing

Table 20. Performance Comparison for LIBSVM Datasets: Testing Time(ms)

 IK

 LIN

 GIK



2

 RBF

 OAK

10.4238

0.0034

0.0031

setB

0.0026

13.4442

0.0026

setC

0.0026

18.9194

0.0027

setD

0.0026

24.0907

0.0028

setE

0.0026

26.8385

0.0026

0.0027

18.7433

0.0028

M

AN US

setA

ED

Average

PT

As shown in Table 20, it is worth noting that the testing time of RBF kernel increases as

CE

the number of training data samples increases. On the other hand, the AK and OAK which use the LUT for testing have almost the same testing time regardless of the number of data.

AC

Figure 21 shows the average Accuracy versus testing time for the LIBSVM DB.

50

AN US

CR IP T

ACCEPTED MANUSCRIPT

M

Figure 23. Average accuracy versus testing time: LIBSVM DB

ED

As shown in Figure 23, it can be seen that our proposed OAK has the best performance in terms of accuracy while it maintains the advantage of the fast and efficient evaluation as in

CE

PT

other AKs.

V. CONCLUSION

In this paper, an OAK SVM was proposed, and its performance was tested through its application to synthetic and benchmark problems. The training of the OAK SVM was formulated as semi-definite programming (SDP) and solved by convex optimization. Unlike previous AKs such as the intersection kernel (IK) or the χ² kernel, the proposed OAK does not assume any specific functional form of the kernel. Instead, the kernel was parameterized in terms of the QTs, thereby increasing the nonlinearity. Finally, the OAK SVM was compared with the other AKs and the RBF kernel in the experimental section, and the enhanced classification performance of the OAK over the previous AKs and the RBF kernel was confirmed.

[1]

CR IP T

REFERENCES H. G. Jung, G. Kim, Support vector number reduction: Survey and experimental evaluations, IEEE Trans. Intell. Transp. Syst. 15 (2) (2013) 463-476 [2]

M. Barros de Almeida, A. de Pádua Braga, J. P. Braga, SVM-KM: speeding SVMs

AN US

learning with a priori cluster selection and k-means, in: Proceedings of the Sixth Brazilian Symposium on Neural Networks, 2000, pp. 162–167.

Q.-A. Tran, Q.-L. Zhang, X. Li, Reduce the number of support vectors by using clustering techniques, in: Proceedings of International Conference on Machine

M

Learning and Cybernetics, 2003, vol. 2, pp. 1245–1248.

ED

[4] A. Gani, A. Siddiaq, S. Shamshirband, A survey on indexing techniques for big data: taxonomy and performance evaluation, Knowl. Inf. Syst, 46 (2) (2016) 241-284.

PT

[5] C. Jing, J.Hou, SVM and PCA based fault classification approaches for complicated

CE

industrial process, Neurocomputing 167 (2015) 636-642. [6] L, Khedher, J. Ramirez, J. Gorriz, A. Brahim, F. Segovia, The Alzheimer’s Disease Neuroimaging Initiative, Early diagnosis of Alzheimer‫׳‬s disease based on partial least

AC

squares, principal component analysis and support vector machine using segmented MRI images, Neurocomputing 151 (2015) 139-150.

[7] X. Zhang, D. Qiu, F. Chen, Support vector machine with parameter optimization by a novel hybrid method and its application to fault diagnosis, Neurocomputing 149 (2015) 641-651.

52

ACCEPTED MANUSCRIPT

[8]

J. Chen, C.-L. Liu, Fast multi-class sample reduction for speeding up support vector machines, in: Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2011, pp. 1–6.

[9]

R. Koggalage, S. Halgamuge, Reducing the number of training samples for fast support vector machine classification, Neural Inf. Process. Rev.2 (3) (2004) 57–65.

CR IP T

[10] M. F. A. Hady, W. Herbawi, M. Weber, F. Schwenker, A multi-objective genetic algorithm for pruning support vector machines, in: Proceedings of IEEE Internatoinal Conference on Tools with Artificial Intelligence, 2011, pp. 269–275.

[11] H.-J. Lin and J. P. Yeh, “Optimal reduction of solutions for support vector machines,”

AN US

Appl. Math. Comput., vol. 214, no. 2, pp. 329–335, 2009.

[12] H. Zhang, A. C. Berg, M. Maire, J. Malik, SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 2126–

M

2136.

ED

[13] H. Cheng, P.-N. Tan, R. Jin, Efficient algorithm for localized support vector machine, IEEE Trans.Knowl. Data Eng. 22 (4) (2010) 537–549.

PT

[14] Q. Ye, Z. Han, J. Jiao, J. Liu, Human detection in images via piecewise linear support vector machines, IEEE Trans. Image Process.22 (2) (2013) 778–789.

CE

[15] S. Maji, A. C. Berg, J. Malik, Efficient classification for additive kernel SVMs, IEEE

AC

Trans. Pattern Anal. Mach. Intell.36 (1) (2013) 66–77. [16] S. Boughorbel, J.-P. Tarel, N. Boujemaa, Generalized histogram intersection kernel for image recognition, in: Proceedings of IEEE International Conference on Image Processing, 2005, vol. 3.

53

ACCEPTED MANUSCRIPT

[17] S. Maji, U. C. Berkeley, A. C. Berg, Classification using Intersection Kernel Support Vector Machines is Efficient, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008. . [18] M. Varma, D. Ray, Learning the discriminative power-invariance trade-off, in: Proceedings of IEEE 11th International Conference on,Computer Vision, 2007, pp. 1–

CR IP T

8. [19] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) pp. 273–297, 1995.

[20] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan, Learning the

AN US

kernel matrix with semidefinite programming, J. Mach. Learn. Res.5 (1) (2004) 27–72. [21] J. Tang, Y. Tian, A multi-kernel framework with nonparallel support vector machine, Neurocomputing, 266 (2017) 226-238.

M

[22] X. Liu, L. Wang, G, Hwang, J. Zhang, J. Yin, Multiple kernel extreme learning machine, Neurocomputing 149 (2015) 253-264

ED

[23] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming,

PT

version 2.0 beta. http://cvxr.com/cvx, September 2013. [24] P. A. Flach, The geometry of ROC space: understanding machine learning metrics through ROC isometrics, in: Proceedings of International Conference on Machine

CE

Learning, 2003, pp. 194–201.

AC

[25] K.-A. Toh G.-C. Tan, Exploiting the relationships among several binary classifiers via data transformation, Pattern Recognit., 47 (3) (2014) 1509–1522.

[26] J. Makhoul, F. Kubala, R. Schwartz, R. Weischedel, Performance measures for information extraction, in: Proceedings of DARPA broadcast news workshop, 1999, pp. 249–252.

54

ACCEPTED MANUSCRIPT

[27] M. Di Martino, G. Hernández, M. Fiori, A. Fernández, A new framework for optimal classifier design, Pattern Recognit., 46 (8) (2013) 2249–2255. [29]

A. Asuncion and D. Newman, “UCI machine learning repository.” 2007.

[30] H. Jiang, W. Ching, K. F. C. Yiu and Y. Qiu, Stationary Mahalanobis kernel SVM for

CR IP T

credit risk evaluation, Appl. Soft. Comput, 71, (2018). [31] T. Tang, S. Chen, M. Zhao, W. Huang and J. Luo, Very large-scale data classification based on K-means clustering and multi-kernel SVM, Soft Comput, (2018)

[32] J. Baek, J. Kim and E, Kim, Fast and efficient pedestrian detection via the cascade

AN US

implementation of an additive kernel support vector machine, IEEE Trans. Intell. Transp. Syst. 18 (4) (2017) 902-916.

[33] M. Bilal and M. S. Hanif, High performance real-time pedestrian detection using light weight features and fast cascaded kernel SVM classification, Journal of Signal

M

Processing Systems, (2018).

[34] O. B. Ahmed, J. Benois-Pineau, M. Allard, G. Catheline, C. B. Amar, Recognition of

ED

Alzheimer's disease and Mild Cognitive Impairment with multimodal image-derived

PT

biomarkers and Multiple kernel learning, Neurocomputing, 220 (2017) 98-110. [35] C-C. Chang, C-J. Lin, LIBSVM: a library for support vector machines, 2001, software

AC

CE

available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

55

ACCEPTED MANUSCRIPT

Biography of Authors

Prof. Euntai Kim

Euntai Kim was born in Seoul, Korea, in 1970. He received B.S., M.S., and Ph.D. degrees in Electronic Engineering, all from Yonsei University, Seoul, Korea, in 1992, 1994, and 1999, respectively. From 1999 to 2002, he was a Full-Time Lecturer in the Department of Control and Instrumentation Engineering, Hankyong National University, Kyonggi-do, Korea. Since 2002, he has been with the faculty of the School of Electrical and Electronic Engineering, Yonsei University, where he is currently a Professor. He was a Visiting Scholar at the University of Alberta, Edmonton, AB, Canada, in 2003, and also was a Visiting Researcher at the Berkeley Initiative in Soft Computing, University of California, Berkeley, CA, USA, in 2008. His current research interests include computational intelligence and statistical machine learning and their application to intelligent robotics, unmanned vehicles, and robot vision.

Dr. Jeonghyun Baek


Jeonghyun Baek received the B.S. and Ph.D degree in electrical and electronic engineering from Yonsei University, Seoul, Korea, in 2011, and 2018. He is a senior researcher of Agency for Defense Department (ADD). He has studied machine learning, computer vision and optimization.
