
Pattern Recognition 40 (2007) 41–51

Domain described support vector classifier for multi-classification problems

Daewon Lee, Jaewook Lee∗

Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk 790-784, Republic of Korea

Received 8 November 2005; received in revised form 1 May 2006; accepted 6 June 2006

∗ Corresponding author. Tel.: +82 54 279 2209.
E-mail addresses: [email protected] (D. Lee), [email protected] (J. Lee).

Abstract

In this paper, a novel classifier for multi-classification problems is proposed. The proposed classifier, based on the Bayesian optimal decision theory, tries to model the decision boundaries via the posterior probability distributions constructed from support vector domain description rather than to model them via the optimal hyperplanes constructed from two-class support vector machines. Experimental results show that the proposed method is more accurate and efficient for multi-classification problems.
© 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Multi-class classification; Kernel methods; Bayes decision theory; Density estimation; Support vector domain description

1. Introduction

Support vector machines (SVMs), originally formulated for two-class classification problems, have been successfully applied to diverse pattern recognition problems and have, in a very short period of time, become the standard state-of-the-art tool. SVMs, based on structural risk minimization (SRM), are primarily devised to minimize the upper bound of the expected error by optimizing the trade-off between the empirical risk and the model complexity [1–3]. To achieve this, they construct an optimal hyperplane that separates binary class data with maximal margin.

Since many real-world applications are multi-class classification problems, several approaches have been proposed to extend two-class SVMs to a multi-class SVM for multi-category classification. Most of the previous approaches try to decompose a multi-class problem into a set of multiple binary classification problems to which two-class SVMs can be trained and applied.

For example, the one-against-all algorithm transforms a c-class problem into c two-class problems in which one class is separated from the remaining ones, while the one-against-one (pair-wise) algorithm converts the c-class problem into c(c − 1)/2 two-class problems in which optimal hyperplanes are constructed for each pair of classes and a max-voting strategy is used to predict the class label (cf. [4,5]). These approaches, however, have some drawbacks inherent in the architecture of multiple binary classifications: unclassifiable regions may exist when a data point belongs to more than one class or to none, resulting in lower classification accuracy. In addition, training two-class SVMs repeatedly on the same data set often leads to a very high computational cost for large-scale problems.

To overcome these drawbacks, in this paper we propose a novel support vector classifier for multi-classification problems. The proposed classifier, based on the Bayesian optimal decision theory, tries to model the posterior probability distributions via support vector domain description (SVDD) [6,7] rather than to model the decision boundaries by constructing optimal hyperplanes. The performance of the proposed method is confirmed through simulation.

The organization of this paper is as follows. In Section 2, we review the Bayesian optimal decision theory and briefly outline the SVDD algorithm. A novel method for multi-classification problems is proposed in Section 3 with an illustrative example, and Section 4 provides the theoretical basis of the proposed method. In Section 5, simulation results are given to illustrate the effectiveness and the efficiency of the proposed method.

2. Previous works

In this section, we first review the Bayesian optimal decision theory and describe the existing density estimation algorithms. Then we briefly outline the SVDD algorithm employed in our proposed method.

2.1. Bayesian optimal decision theory

According to the Bayesian decision theory, an optimal classifier can be designed if we know the prior probabilities p(w_i) and the class-conditional densities p(x|w_i); that is, with the Bayes formula, the posterior probabilities are given by

p(w_i|x) = \frac{p(x|w_i)p(w_i)}{p(x)} = \frac{p(x|w_i)p(w_i)}{\sum_{i=1}^{c} p(x|w_i)p(w_i)},    (1)

where c is the number of output class labels. The optimal decision rule to minimize the average probability of error can then be shown to be the Bayesian decision rule [8,9] that selects the w_i maximizing the posterior probability p(w_i|x):

Decide w_i if p(w_i|x) > p(w_j|x) for all j \neq i.    (2)

In typical classification problems, estimation of the prior probabilities presents no serious difficulties (normally all are assumed to be equal or N_i/N). However, estimation of the class-conditional densities is quite another matter. During the last decades, many density estimation algorithms have been proposed, and the existing algorithms may generally be categorized into three approaches: parametric, semi-parametric, and nonparametric methods.

Parametric methods assume a specific functional form of p(x|w_i) containing a number of adjustable parameters. The simplest and most widely used form is a normal distribution given by

p(x|w_i, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)\right).    (3)

The drawback of such an approach is that a particular form of parametric function might be incapable of describing the true data distribution.

Second, semi-parametric methods have the form of a finite mixture of Gaussians:

p(x|w_i, \theta_1, \ldots, \theta_M) = \sum_{k=1}^{M} p(x|w_i, \theta_k, k)\, p(k),    (4)

where p(x|w_i, \theta_k, k) is the kth component in the form of a Gaussian function and the p(k) are mixing parameters. In semi-parametric methods, the training data do not provide any component labels saying which component was responsible for generating each data point. To select the number of components and to estimate their parameters, we therefore need to resort to an iterative scheme such as the EM algorithm, which often proves to be computationally expensive.

The third approach is nonparametric methods, which estimate the class-conditional density function as a weighted sum of a set of kernel functions, K(·,·), determined entirely by the data:

p(x|w_i) = \sum_{j=1}^{N} \beta_j K(x_j, x).    (5)

Though such methods have the most descriptive capability, they typically suffer from the problem that the number of parameters grows with the size of the data set, so that the models can quickly become unwieldy.
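To make the plug-in construction above concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of a Bayes classifier that estimates the priors as N_i/N, fits the Gaussian class-conditional densities of Eq. (3), and applies the decision rule of Eq. (2); the function names are illustrative only.

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Plug-in Bayes classifier with Gaussian class-conditionals (Eqs. (1)-(3))."""
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # prior p(w_i) = N_i / N
                     Xc.mean(axis=0),             # mean mu_i
                     np.cov(Xc, rowvar=False))    # covariance Sigma_i
    return classes, params

def log_gaussian(x, mu, cov):
    # log of the normal density in Eq. (3)
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1]
                   + diff @ np.linalg.solve(cov, diff))

def predict(x, classes, params):
    # Bayes decision rule (Eq. (2)): pick the class with the largest
    # log(prior) + log(density); the evidence p(x) is common to all classes.
    scores = [np.log(p) + log_gaussian(x, mu, cov)
              for p, mu, cov in (params[c] for c in classes)]
    return classes[int(np.argmax(scores))]
```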

2.2. Support vector domain description

The existing methods for density estimation exhibit a trade-off between descriptive ability and computational burden. To address this problem, our proposed method utilizes a so-called trained kernel support function that characterizes the support of a high-dimensional distribution of a given data set, inspired by the SVMs. We first review the support vector domain description (SVDD) procedure (also called a one-class support vector machine). Then we build a trained kernel support function, to be used as a pseudo-density function, via SVDD.

The basic idea of SVDD is to map data points by means of a nonlinear transformation to a high-dimensional feature space and to find the smallest sphere that contains most of the mapped data points in the feature space [6,7]. This sphere, when mapped back to the data space, can separate into several components, each enclosing a separate cluster of points. More specifically, let {x_i} ⊂ X be a given training data set of X ⊂ R^n, the data space. Using a nonlinear transformation \Phi from X to some high-dimensional feature space, we look for the smallest enclosing sphere of radius R described by the following model:

\min\; R^2 + C \sum_j \xi_j
\text{s.t.}\;\; \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j, \quad \xi_j \ge 0 \;\text{ for } j = 1, \ldots, N,    (6)


where a is the center and the \xi_j are slack variables allowing for soft boundaries. To solve this problem, we introduce the Lagrangian

L = R^2 - \sum_j (R^2 + \xi_j - \|\Phi(x_j) - a\|^2)\beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j.

Setting \partial L/\partial R = 0 and \partial L/\partial a = 0, respectively, leads to

\sum_j \beta_j = 1 \quad\text{and}\quad a = \sum_j \beta_j \Phi(x_j).    (7)

Using these relations and transforming the objective function into a function of the variables \beta_j only, the solution of the primal (6) can then be obtained by solving the following Wolfe dual problem:

\max\; W = \sum_j K(x_j, x_j)\beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\text{s.t.}\;\; 0 \le \beta_j \le C, \quad \sum_j \beta_j = 1, \quad j = 1, \ldots, N,    (8)

where the Gaussian kernel K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j) = \exp(-q\|x_i - x_j\|^2) with width parameter q is used. Only those points with 0 < \beta_j < C lie on the boundary of the sphere and are called support vectors (SVs). The trained Gaussian kernel support function, defined by the squared radial distance of the image of x from the sphere center, is then given by

f(x) := R^2(x) = \|\Phi(x) - a\|^2 = K(x, x) - 2\sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)    (9)
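As an illustration of how the dual problem (8) and the support function (9) fit together, the sketch below solves the dual numerically with SciPy's general-purpose SLSQP solver (the paper does not prescribe a solver; dedicated SMO-style solvers would normally be used). The helper names gaussian_kernel and svdd_fit are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, q):
    # K(x_i, x_j) = exp(-q * ||x_i - x_j||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-q * d2)

def svdd_fit(X, q=1.0, C=1.0):
    """Solve the Wolfe dual (8) for one class; return (beta, f, R2_hat)."""
    N = len(X)
    K = gaussian_kernel(X, X, q)

    def neg_W(beta):                      # maximize W  <=>  minimize -W
        return -(np.diag(K) @ beta - beta @ K @ beta)

    res = minimize(neg_W, np.full(N, 1.0 / N),
                   bounds=[(0.0, C)] * N,
                   constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
                   method='SLSQP')
    beta = res.x

    def f(x):                             # trained kernel support function, Eq. (9)
        k = gaussian_kernel(np.atleast_2d(x), X, q)[0]
        return 1.0 - 2.0 * beta @ k + beta @ K @ beta   # K(x, x) = 1 for Gaussian

    # R^2_hat = f(x_i) for any unbounded support vector (0 < beta_i < C)
    sv = np.where((beta > 1e-6) & (beta < C - 1e-6))[0]
    R2_hat = f(X[sv[0]] if len(sv) else X[np.argmax(beta)])
    return beta, f, R2_hat
```

The set {x : f(x) ≤ R²̂} then serves as the estimated support (domain description) of the class.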


and the domain that describes the support of the data points is given by {x : f(x) = \hat{R}^2}, where \hat{R}^2 = R^2(x_i) for any support vector x_i.

3. The proposed method

In this section, we present a method for multi-classification problems. Suppose that a set of training data {(x_i, y_i)}_{i=1}^{N} ⊂ X × Y is given, where x_i ∈ X denotes an input pattern and y_i ∈ Y = {w_1, \ldots, w_c} denotes its output class. The central idea of the proposed method is to utilize the information of the domain description generated by the SVDD to estimate the distributions of each partitioned class data set, and then to utilize these estimates to classify a data point via the Bayesian decision rule. The detailed procedure of the proposed method is as follows:

Step 1 (Data partitioning): We first partition the given training data into c disjoint subsets {D_k}_{k=1}^{c} according to their output classes. For example, the kth class data set, D_k, contains N_k elements as follows:

D_k = \{(x_{i_1}, w_k), \ldots, (x_{i_{N_k}}, w_k)\}.    (10)

Step 2 (SVDD for each class data set): For each class data set D_k, we build a trained Gaussian kernel support function via SVDD. Specifically, we solve the dual problem Eq. (8). Let its solution be \bar{\beta}_{i_l}, l = 1, \ldots, N_k, and let J_k ⊂ {1, \ldots, N_k} be the index set of the nonzero \bar{\beta}_{i_l}. The trained Gaussian kernel support function for each class data set D_k is then given by

f_k(x) = 1 - 2\sum_{i_l \in J_k} \bar{\beta}_{i_l} e^{-q_N\|x - x_{i_l}\|^2} + \sum_{i_l, i_m \in J_k} \bar{\beta}_{i_l}\bar{\beta}_{i_m} e^{-q_N\|x_{i_l} - x_{i_m}\|^2}.    (11)

Step 3 (Constructing a pseudo-density function for each class): We construct the following pseudo-density function for each class k = 1, \ldots, c:

\hat{p}(x|w_k) = \tfrac{1}{2}\,(r_k - f_k(x)) \quad\text{for } k = 1, \ldots, c,    (12)

where r_k = R^2(x_{s_k}) for any support vector x_{s_k} of f_k(\cdot).

Step 4 (Classification using estimated pseudo-posterior probabilities): For each class k, k = 1, \ldots, c, we estimate a pseudo-posterior probability distribution function as follows:

\hat{\pi}(w_k|x) = \text{const.} \times \hat{p}(w_k) \cdot \hat{p}(x|w_k) = \frac{N_k}{N}\,(r_k - f_k(x)),    (13)

where \hat{p}(x|w_k) is the pseudo-density function obtained in Step 3. Then we classify x into the class

\arg\max_{k=1,\ldots,c} \hat{\pi}(w_k|x).

To illustrate the proposed method, see Fig. 1, where a data set with three classes is given. At Step 1, we partition the given data set into three data sets according to their output classes, and we perform the SVDD for each class-conditional data set at Step 2. At Step 3, using the trained Gaussian kernel support functions obtained in Step 2, we construct the pseudo-density functions given by Eq. (12) for each class i = 1, 2, 3, which are shown in (a)–(c) of Fig. 1. At Step 4, we determine the final decision boundary via the posterior density estimates in Eq. (13), which is shown in (d) of Fig. 1.
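The following sketch (ours, not the authors' implementation) shows how Steps 1–4 compose, reusing the hypothetical svdd_fit helper from the SVDD sketch in Section 2.2. Since the factor 1/2 in Eq. (12) and the constant in Eq. (13) do not affect the argmax, the score (N_k/N)(r_k − f_k(x)) is compared directly.

```python
import numpy as np

def train_domain_described_classifier(X, y, q=1.0, C=1.0):
    """Steps 1-2: partition the data by class and fit one SVDD per class.
    Relies on the svdd_fit helper sketched in Section 2.2 (an assumed name)."""
    model = {}
    for k in np.unique(y):
        Xk = X[y == k]                              # Step 1: class subset D_k
        _beta, f_k, r_k = svdd_fit(Xk, q=q, C=C)    # Step 2: f_k and r_k = R^2(x_sk)
        model[k] = (f_k, r_k, len(Xk))
    return model

def classify(x, model):
    """Steps 3-4: pseudo-posterior scores (Eqs. (12)-(13)) and Bayes decision."""
    N = sum(Nk for _, _, Nk in model.values())
    scores = {k: (Nk / N) * (r_k - f_k(x))          # proportional to Eq. (13)
              for k, (f_k, r_k, Nk) in model.items()}
    return max(scores, key=scores.get)
```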


Fig. 1. Illustration of the proposed method applied to the triangle data set. (a), (b), and (c) denote contours of pseudo-density functions on classes 1, 2, and 3, respectively. In (d), a red solid line represents a final decision boundary determined by the estimated posterior-density functions.

The constructed pseudo-density function in Eq. (12) has several nice properties compared with the existing methods reviewed in Section 2. Firstly, each \hat{p}(x|w_k) (up to a constant multiple) plays the role of an asymptotic estimate of the class-conditional density function, as will be shown in Theorem 1 in Section 4. Secondly, since only a small portion of the \bar{\beta}_j turn out to have nonzero values (the corresponding points are SVs) as a result of optimizing Eq. (8), the constructed estimate greatly reduces the computational burden involved in evaluating \hat{p}(x|w_k). Finally, as shown in Theorem 2 in Section 4, for a finite sample size, each set {x : \hat{p}(x|w_k) > 0} estimates the support of its class-conditional distribution and has enough descriptive ability to capture highly nonlinear and arbitrarily shaped structures, including multi-modal or noisy distributions (see Fig. 2).

4. Theoretical basis

In this section we provide a theoretical basis for the proposed method developed in the previous section. To begin with, we give an asymptotic convergence result showing that the constructed pseudo-density functions in our proposed method estimate the unknown class-conditional densities for large sample size. Then we present a theoretical result on the generalization error, which characterizes the estimation error of the support of the data distribution for a finite sample size.

Theorem 1. Let N samples x_1, \ldots, x_N \in R^d be drawn independently and identically distributed (i.i.d.) according to some unknown probability law p(x), and let the estimate p_N(x) be given by

p_N(x) = (q_N/\pi)^{d/2} \sum_{i=1}^{N} \beta_i e^{-q_N\|x - x_i\|^2},

where the \beta_i are coefficients satisfying \sum_{i=1}^{N} \beta_i = 1 and 0 \le \beta_i \le C_N.


Fig. 2. Illustrations of Steps 2 and 3 of the proposed method: (a) is the kth class data, artificially generated from a mixture of three 2D Gaussian functions; (b) is the trained Gaussian kernel support function f_k(x) obtained in Step 2; (c) presents the pseudo-density function \hat{p}(x|w_k) in Eq. (12) obtained in Step 3; (d) shows a contour plot of {x | \hat{p}(x|w_k) > 0}, where the region inside the solid line represents an estimate of the support of its class distribution.

Suppose that the parameters C_N, q_N satisfy the following conditions:

\lim_{N \to \infty} q_N = \infty \quad\text{and}\quad \lim_{N \to \infty} C_N q_N^{d/2} = 0.

Then the estimate p_N(x) converges to p(x), i.e.,

\lim_{N \to \infty} \bar{p}_N(x) = p(x) \quad\text{and}\quad \lim_{N \to \infty} \bar{\sigma}^2_N(x) = 0,

where \bar{p}_N(x) and \bar{\sigma}^2_N(x) denote the mean and the variance of p_N(x).

Proof. Let \delta_N(x - x_i) = (q_N/\pi)^{d/2} e^{-q_N\|x - x_i\|^2}. Then \int \delta_N(x - x_i)\,dx = 1, and \delta_N(x - x_i) approaches a Dirac delta function centered at x_i as q_N approaches infinity. From this fact, we get

\bar{p}_N(x) = E[p_N(x)] = \sum_{i=1}^{N} \beta_i E\!\left[(q_N/\pi)^{d/2} e^{-q_N\|x - x_i\|^2}\right]
            = \sum_{i=1}^{N} \beta_i \int \delta_N(x - v)\,p(v)\,dv
            \longrightarrow \int \delta(x - v)\,p(v)\,dv = p(x) \quad\text{as } N \to \infty,

since \sum_{i=1}^{N} \beta_i = 1 and q_N \to \infty as N \to \infty.


Table 1. Benchmark data description and experimental settings

Data set  | No. of classes | Dim.  | No. of training set | No. of test set | Structure | (h, q1, q2, q3)
twospiral | 2  | 2     | 194    | 194    | 2-20-2   | (0.5, 0.5, 0.5, 1)
tae       | 2  | 2     | 600    | 200    | 2-10-2   | (5, 0.05, 0.05, 1)
heart     | 2  | 13    | 180    | 9      | 13-20-2  | (2, 0.3, 0.3, 0.35)
sonar     | 2  | 60    | 104    | 104    | 30-30-2  | (0.5, 0.1, 0.1, 2)
OXours    | 3  | 2     | 400    | 200    | 2-20-3   | (7, 2, 2, 0.5)
triangle  | 3  | 3     | 400    | 200    | 2-5-3    | (5, 2, 2, 1)
iris      | 3  | 4     | 100    | 50     | 4-10-3   | (0.3, 0.005, 0.01, 3)
shuttle   | 3  | 9     | 30,319 | 14,442 | –        | (–, 2, 2, 30)
wine      | 3  | 13    | 118    | 60     | 13-10-3  | (2, 0.3, 0.2, 0.5)
DNA       | 3  | 180   | 2,000  | 1,186  | 180-10-3 | (1, 0.001, 0.001, 0.1)
ring      | 4  | 2     | 160    | 40     | 2-5-4    | (1, 1, 1, 2)
vehicle   | 4  | 18    | 564    | 282    | 18-10-4  | (0.9, 1, 1, 0.55)
satimage  | 6  | 36    | 4,435  | 2,000  | 36-20-6  | (0.8, 1.5, 1.3, 1.7)
segment   | 7  | 19    | 1,540  | 770    | 19-20-7  | (1, 0.01, 0.005, 0.0037)
orange    | 9  | 2     | 140    | 30     | 2-5-9    | (0.5, 1, 1, 2)
vowel     | 11 | 10    | 528    | 462    | 10-20-11 | (0.9, 1, 1, 0.9)
letter    | 26 | 16    | 10,500 | 5,000  | –        | (0.8, 0.8, –, 1.2)
Uspst     | 9  | 256   | 1,405  | 602    | 256-20-9 | (4, 0.013, 0.015, 0.11)
Coil20    | 20 | 1,024 | 1,008  | 432    | –        | (0.001, 0.007, 0.005, 0.007)

In the experimental settings columns, Structure means the network structure of BR-NN and h is the window size of BDM-Parzen; q1, q2, q3 are the Gaussian kernel parameters of 1-1-SVM, 1-all-SVM, and the proposed method, respectively.

Because p_N(x) is the sum of functions of statistically independent random variables, its variance is the sum of the variances of the separate terms, and hence

\bar{\sigma}^2_N(x) = \sum_{i=1}^{N} E\!\left[\left(\beta_i (q_N/\pi)^{d/2} e^{-q_N\|x - x_i\|^2} - \frac{1}{N}\bar{p}_N(x)\right)^2\right]
 = \sum_{i=1}^{N} E\!\left[\beta_i^2 (q_N/\pi)^{d} e^{-2q_N\|x - x_i\|^2}\right] - \frac{1}{N}\bar{p}_N(x)^2
 \le \left(\max_i \beta_i\right)(q_N/\pi)^{d/2}\, E\!\left[\sum_{i=1}^{N} \beta_i (q_N/\pi)^{d/2} e^{-q_N\|x - x_i\|^2} \cdot 1\right] - \frac{1}{N}\bar{p}_N(x)^2
 \le C_N (q_N/\pi)^{d/2} E[p_N(x)] = C_N q_N^{d/2} \pi^{-d/2} \bar{p}_N(x) \longrightarrow 0 \quad\text{as } N \to \infty,

since \max_i \beta_i \le C_N and C_N q_N^{d/2} \to 0 as N \to \infty. \square

Since the dual optimal solutions \bar{\beta}_j obtained with (C, q) = (C_N, q_N) lie in the set S_N = \{\beta = (\beta_1, \ldots, \beta_N)^T : \sum_{i=1}^{N} \beta_i = 1,\; 0 \le \beta_i \le C_N\}, and the volume of the set S_N shrinks to zero as C_N \to 0, controlling the parameters (C_N, q_N) as above leads to the second term of the trained Gaussian kernel support function f converging to an unknown density function up to a constant multiple.

In relation to our pseudo-density function constructed via the SVDD, note that for each k, k = 1, \ldots, c, Eq. (12) can be written as

\hat{p}(x|w_k) = \left(\sum_{i_l \in J_k} \bar{\beta}_{i_l} e^{-q_N\|x - x_{i_l}\|^2} - \sum_{i_l \in J_k} \bar{\beta}_{i_l} e^{-q_N\|x_{s_k} - x_{i_l}\|^2}\right).    (14)
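As a concrete illustration of the conditions in Theorem 1 (our example, not one given in the paper), note that dual feasibility already forces C_N ≥ 1/N, so one admissible schedule is the following.

```latex
% Illustrative admissible schedule (not from the paper):
% dual feasibility requires C_N \ge 1/N, so take
\[
   C_N = \tfrac{1}{N}, \qquad q_N = \log N
   \quad\Longrightarrow\quad
   q_N \to \infty, \qquad
   C_N\, q_N^{d/2} = \frac{(\log N)^{d/2}}{N} \to 0 .
\]
% With C_N = 1/N, the constraints \sum_i \beta_i = 1 and 0 \le \beta_i \le C_N
% force \beta_i = 1/N, and p_N(x) reduces to a classical Parzen-window estimate.
```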

Since the second term of the right-hand side converges to zero as q_N \to \infty, each \hat{p}(x|w_k) (up to a constant multiple) plays the role of an asymptotic estimate of the class-conditional density function.

We next present a result on the generalization error bound, which explains theoretically how the constructed pseudo-density function characterizes the support of the class-conditional distributions.

Theorem 2. Let N samples x_1, \ldots, x_N be drawn independently and identically distributed (i.i.d.) according to some unknown probability distribution law P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem Eq. (6) and obtain a solution f given explicitly by Eq. (9). Let C_{a,r} := \{x : f(x) \le r\} denote the induced region for a level value r. With probability 1 - \delta over the draw of the random sample x_1, \ldots, x_N, for any \gamma > 0,

P\{x' : x' \notin C_{a,\hat{R}^2 + \gamma}\} \le \frac{2}{N}\left(k + \log_2 \frac{N^2}{2\delta}\right),

where

k = \frac{c_1 \log(c_2 \hat{\gamma} N)}{\hat{\gamma}^2} + \frac{D}{\hat{\gamma}}\log_2\!\left(e\left(\frac{(2N-1)\hat{\gamma}}{D} + 1\right)\right) + 2,

c_1 = 16c^2, \quad c_2 = \ln(2)/(4c^2), \quad c = 103, \quad \hat{\gamma} = \gamma/\|a\|, \quad D = \sum_i \max\{0, f(x_i) - \hat{R}^2\},

and \hat{R}^2 = R^2(x_s) for any support vector x_s of f(\cdot).

Fig. 3. (a) Images of 20 different objects in the Coil20 data set. (b) Different poses from different angles of the first object (in the same class) in the Coil20 data set.

Proof. Consider the problem of returning a function that takes the value +1 in a small region capturing most of the data points and -1 elsewhere. Mapping the data into the feature space corresponding to the kernel and separating them from the origin with maximum margin can be formulated as the following quadratic programming (QP) problem:

\min\; \tfrac{1}{2}\|w\|^2 + C\sum_j \xi_j - \rho    (15)
\text{s.t.}\;\; (w \cdot \Phi(x_j)) \ge \rho - \xi_j,    (16)
\qquad\; \xi_j \ge 0.

Then the function given by

\rho(x) = w \cdot \Phi(x) = \sum_j \alpha_j K(x_j, x), \qquad w = \sum_j \alpha_j \Phi(x_j),    (17)

where the \alpha_j are the solution of the Wolfe dual form of problem Eq. (15),

\min\; W = \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
\text{s.t.}\;\; 0 \le \alpha_j \le C, \quad \sum_j \alpha_j = 1, \quad j = 1, \ldots, N,    (18)

describes the decision function that solves problem Eq. (15) by choosing the sign of (\rho(x) - \hat{\rho}), where \hat{\rho} = \rho(x_i) for any support vector x_i. For a Gaussian kernel, where K(x, x) = 1, problem Eq. (18) is equivalent to problem Eq. (8). Therefore, we have

f(x) := R^2(x) = 1 - 2\rho(x) + \|a\|^2.

Therefore, the generalization error bound in [10, Theorem 1] can be equally applied to get the result by changing

w \longmapsto a = w, \qquad \hat{\rho} \longmapsto \hat{R}^2 = 1 - 2\hat{\rho} + \|a\|^2, \qquad \gamma \longmapsto \gamma' = 2\gamma,
D = \sum_i \max\{0, \hat{\rho} - \rho(x_i)\} \longmapsto D' = \sum_i \max\{0, f(x_i) - \hat{R}^2\} = 2D. \quad \square

5. Simulation results

To demonstrate the performance of the proposed method empirically, we have conducted simulations on classification problems in the following categories; an additional description of the data sets is given in Table 1.

Artificial data: twospiral, tae, OXours, triangle, ring, and orange are generated from highly nonlinear-shaped distributions in order to demonstrate the generalization capability of the various methods.

Small-scale real-world data: heart, sonar, iris, wine, vehicle, and vowel are from the UCI machine learning repository [11] and the Statlog database [12].

Large-scale real-world data: shuttle, DNA (classification between exons and introns in the DNA sequence), satimage (classification in a satellite image), segment (image segmentation data), letter (classification of the English alphabet images) [12], Uspst (handwritten digit recognition), and Coil20 (classification of gray-scale images of 20 objects, see Fig. 3) [13].

Table 2. Simulation results on benchmark problems: error rates (%). Each cell reports training error / test error.

Data set  | LDA           | QDA           | BDM-Parzen    | BR-NN         | 1-1-SVM   | 1-all-SVM | Proposed
twospiral | 49.49 / 49.49 | 49.48 / 49.48 | 0 / 0         | 45.88 / 45.88 | 0 / 0     | 0 / 0     | 0 / 0
tae       | 4.67 / 5.5    | 3.83 / 5      | 3.33 / 4      | 0.33 / 2.0    | 0 / 5.5   | 0 / 5.5   | 0.33 / 2.5
heart*    | 15.00 / 11.11 | 12.78 / 15.56 | 11.11 / 12.22 | 0 / 24.44     | 0 / 15.56 | 0 / 15.56 | 0.56 / 15.56
sonar     | 8.65 / 25.96  | 0.00 / 36.54  | 2.88 / 10.58  | 12.78 / 14.44 | 0 / 42.22 | 0 / 42.22 | 0.96 / 6.7
OXours    | 0.5 / 0.5     | 0.25 / 0      | 1.75 / 2      | 0 / 0.5       | 0 / 2.5   | 0 / 2     | 0.25 / 0
triangle  | 4.25 / 2.5    | 1.5 / 1       | 3.25 / 1.5    | 0.25 / 1      | 0 / 2     | 0 / 1.5   | 0.25 / 0.5
iris      | 2 / 2         | 2 / 2         | 1 / 4         | 0 / 6         | 1 / 6     | 0 / 10    | 1 / 2
shuttle*  | 2.78 / 2.90   | 3.95 / 4.09   | N/A / N/A     | N/A / N/A     | 0 / 0     | 0 / 0.09  | 0.63 / 0.81
wine*     | 0 / 0         | 0 / 5         | 2.54 / 8.33   | 0 / 6.7       | 0 / 10    | 0 / 5     | 0 / 6.7
DNA       | 0 / 34.49     | 4.21 / 40.89  | 0 / 23.44     | 0 / 17.12     | 0 / 7.67  | 0 / 6.75  | 0.64 / 16.86
ring      | 1.25 / 0      | 0 / 0         | 0 / 0         | 0 / 0         | 0 / 0     | 0 / 0     | 0 / 0
vehicle*  | 18.62 / 22.7  | 10.11 / 20.21 | 6.91 / 33.69  | 3.72 / 21.63  | 0 / 39.36 | 0 / 33.33 | 3.9 / 30.14
satimage* | 15.13 / 16.05 | 24.28 / 28.55 | 7.6 / 12.5    | 5.67 / 13.15  | 0 / 24.9  | 0 / 17.6  | 1.17 / 9.6
segment   | 7.47 / 8.83   | 21.75 / 22.21 | 0 / 6.1       | 1.69 / 4.94   | 0 / 13.25 | 0 / 6.88  | 2.08 / 4.94
orange    | 4.29 / 0      | 1.43 / 0      | 4.29 / 0      | 10.71 / 10.0  | 0 / 0     | 0 / 0     | 2.86 / 0
vowel     | 31.63 / 55.63 | 1.14 / 52.81  | 5.87 / 41.77  | 5.68 / 51.95  | 0 / 38.74 | 0 / 34.85 | 2.27 / 37.23
letter*   | 29.62 / 30.14 | 9.53 / 11.28  | 2.68 / 7.8    | N/A / N/A     | 0 / 6.2   | N/A / N/A | 0.16 / 4.92
Uspst     | 3.49 / 10.63  | N/A / N/A     | 2.63 / 10.13  | 23.7 / 32.23  | 0 / 9.8   | 0 / 8.472 | 1.35 / 7.97
Coil20    | N/A / N/A     | N/A / N/A     | 95.44 / 93.98 | N/A / N/A     | 0 / 9.95  | 0 / 4.17  | 0 / 3.24

* The corresponding data set is normalized; N/A means not available.


Table 3 Simulation results on benchmark problems: model building time (s) Data sets

LDA

QDA

BDM-Parzen

BR-NN

twosprial tae hearta sonar

0.25 0.016 0.016 0.016

2.01 0.016 0.016 0.016

0.14 0.281 0.141 0.14

28.81 24.98 15.64 152.63

0.343 2.125 0.172 0.188

OXours triangle iris shuttlea winea DNA

0.015 0.016 0.015 0.375 0.016 0.75

0.015 0.015 0.016 0.453 0.016 0.766

0.219 0.219 0.156 N/A 0.297 3.172

52.39 13.84 15.36 N/A 15 6396.5

0.656 0.703 0.234 2264.7 0.25 7.093

ring vehiclea

0.046 0.032

0.032 0.031

0.172 0.313

2.66 104.16

0.391 1.625

satimagea segment orange vowel lettera Uspst Coil20

0.187 0.156 0.015 0.015 0.329 1.297 N/A

0.36 0.11 0.016 0.031 0.875 N/A N/A

10.032 1.36 0.281 0.469 101.719 2.156 4.437

12513 2558 10.53 392.96 N/A 2476 N/A

121.375 16.39 1.484 4.859 2614 33.984 125.95

a The

1-1-SVM

1-all-SVM

Proposed

0.343 2.125 0.172 0.188

0 0.063 0.016 0.016

1.047 1.047 0.39 3757.4 0.297 12.11

0.172 0.016 0.016 16.953 0 0.69

0.344 2.828

0.015 0.032

282.95 30.047 0.593 5.437 N/A 63.14 246.95

1.125 0.157 0.015 0.062 1.01 0.282 0.297

corresponding data set is normalized and N/A means not available.

The performance of the proposed method is compared with six widely used methods: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), the Bayesian decision method using Parzen windows (BDM-Parzen), the Bayesian regularization neural network (BR-NN), the one-against-one SVM (1-1-SVM), and the one-against-all SVM (1-all-SVM) [5]. The criteria for evaluating these methods are their misclassification error rates on the training and test data sets (Table 2) and their model building time (Table 3).

We chose the best parameters by performing model selection; that is, alternative models are constructed on the training data, where the test data are assumed unknown, and the parameter set with the best performance on the test set is then selected for constructing the final model. In our experiments, to reduce the search space of parameter sets, we used the same q value in all the pseudo-density functions of Eq. (12) (by setting C = 1, i.e., not using soft margins). The detailed parameter values are reported in the last column of Table 1. The structure column in Table 1 gives the network architecture of BR-NN; for example, 13-20-2 means a multi-layer neural network with an input layer of 13 nodes, a hidden layer of 20 nodes, and an output layer of two nodes.

Experimental results are shown in Fig. 4 and Tables 2 and 3.

Fig. 4 shows the decision boundaries constructed by the proposed method applied to multi-class data sets (from 2-class to 9-class problems), including highly nonlinear-shaped data sets (e.g., the two-spiral data). In Tables 2 and 3, train, test, and time denote the training error (%), the test error (%), and the computation time for model building (s), respectively. The experimental results demonstrate that the proposed method achieves better or comparable performance in terms of accuracy and efficiency for most of the multi-class data sets (and even for the two-class data sets).

To analyze the time complexity of the proposed method and the conventional SVM methods, let N be the number of training patterns and c the number of output class labels. Both the proposed method and the conventional SVMs involve a QP procedure, and most QP solvers have time complexity O(N^3) [14], so the computational load for large-scale multi-class problems is quite intensive (for example, in Table 2, 1-all-SVM on the letter data is not available due to lack of memory). Generally, the SVM approaches for multi-classification adopt either one-against-one or one-against-all. The one-against-one SVM decomposes the multi-class problem into c(c − 1)/2 binary subproblems, each composed of about 2N/c training patterns; its time complexity is therefore O(4N^3/c). The one-against-all SVM uses c different SVMs, each with N training patterns, so its time complexity is O(c · N^3). The proposed method solves c QP problems of the form of Eq. (8), each with N/c training patterns, so its time complexity is O(N^3/c^2). This complexity analysis is well demonstrated in Table 3, showing the superiority of the proposed method in computing speed.
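As a rough numerical illustration of these orders of magnitude (ours, ignoring all constant factors), plugging the letter data from Table 1 (N = 10,500 training patterns, c = 26 classes) into the expressions above gives the following comparison.

```python
# Rough arithmetic only: relative dominant QP costs for the letter data
# (N = 10,500, c = 26 from Table 1), constant factors ignored.
N, c = 10_500, 26

costs = {
    "single multi-class QP, O(N^3)": N**3,
    "one-against-one, O(4N^3/c)":    4 * N**3 / c,
    "one-against-all, O(c*N^3)":     c * N**3,
    "proposed method, O(N^3/c^2)":   N**3 / c**2,
}
for name, flops in costs.items():
    print(f"{name:32s} ~ {flops:.2e}")
```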

[Figure 4 panels: Two-spiral data (C = 1, q = 1); tae data (C = 1, q = 1); OXours data (C = 1, q = 0.5); Iris 2D data (C = 1, q = 4); Ring data (C = 1, q = 2).]

Fig. 4. Experimental results on the 2D benchmark data sets where the red solid lines represent decision boundaries.

6. Conclusions

In this paper, a new classifier for multi-class classification problems has been proposed. The proposed method utilizes the information of the domain description generated by the SVDD to estimate the distributions of each partitioned class data set. Then it classifies a data point with this estimate according to the Bayesian decision rule. Benchmark results demonstrate that the proposed method is more accurate, efficient, and robust compared to other existing methods. The application of the method to more diverse real-world problems remains to be investigated.

Acknowledgment

This work was supported partially by the Korea Research Foundation under Grant number KRF-2005-041-D00708 and partially by the KOSEF under Grant number R01-2005-000-10746-0.

References

[1] C.J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121–167.
[2] V.N. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Networks 10 (1999) 988–999.
[3] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks 12 (2) (2001) 181–202.
[4] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Networks 13 (2002) 415–425.
[5] J. Weston, C. Watkins, Multi-class support vector machines, in: M. Verleysen (Ed.), Proceedings of ESANN99, Brussels, Belgium, 1999.
[6] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognition Lett. 20 (1999) 1191–1199.
[7] J. Lee, D. Lee, An improved cluster labeling method for support vector clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 461–464.
[8] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[9] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley-Interscience, New York, 2001.
[10] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (2001) 1443–1471.
[11] UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[12] D. Michie, D.J. Spiegelhalter, C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, UK, 1994.
[13] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57–64.
[14] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999, pp. 185–208.

About the Author—DAEWON LEE received B.S. in industrial engineering from Pohang University of Science and Technology (POSTECH) in 2002 and is currently a Ph.D. candidate in the Department of Industrial and Management Engineering at POSTECH. He is interested in pattern recognition, support vector machine, and their applications to data mining. About the Author—JAEWOOK LEE is an associate professor in the Department of Industrial and Management Engineering at Pohang University of Science and Technology (POSTECH), Pohang, Korea. He received the B.S. degree from Seoul National University, and the Ph.D. degree from Cornell University in 1993 and 1999, respectively. His research interests include pattern recognition, neural networks, global optimization, nonlinear systems, and their applications to data mining and financial engineering.