Journal of Computational and Applied Mathematics 263 (2014) 288–298
A nonparallel support vector machine for a classification problem with universum learning

Zhiquan Qi (a), Yingjie Tian (a,*), Yong Shi (a,b,*)

(a) Research Center for Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, China
(b) College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA

Article history: Received 25 March 2012; received in revised form 29 June 2013.
Keywords: Universum; Twin support vector machine; Multi-class classification
Abstract: Universum samples, defined as samples that do not belong to any class of the classification problem of interest, have proved useful in supervised learning. Here we design a new nonparallel support vector machine (U-NSVM) that exploits the prior knowledge embedded in the universum to construct a more robust classifier. To this end, U-NSVM maximizes the two margins associated with the two closest neighboring classes, which are determined by two nonparallel hyperplanes. Therefore, U-NSVM has greater flexibility and can yield a more reasonable classifier in most cases. In addition, our method includes fewer parameters than U-SVM and is therefore easier to implement. Experiments demonstrate that U-NSVM outperforms the traditional SVM and U-SVM.
1. Introduction

Supervised learning with universum samples is an interesting research topic in machine learning. A universum sample is defined as a sample not belonging to any class of a given learning task. For instance, for classification of '5' versus '8' in handwritten digit recognition, '0', '1', '2', '3', '4', '6', '7', and '9' can be considered as universum samples. Since the universum is not required to have the same distribution as the training data, it can reveal prior information about possible classifiers. Several studies have used universum samples in machine learning. Weston et al. proposed a new support vector machine (SVM) framework called U-SVM, and experimental results confirmed that U-SVM outperforms SVMs that do not consider universum data [1]. Sinz et al. analyzed U-SVM and presented a least-squares version of the U-SVM algorithm [2]. Zhang et al. proposed a graph-based semi-supervised algorithm that learns from labeled data, unlabeled data, and universum data at the same time [3]. Other studies have also been published in the literature [4-6]. Mangasarian and Wild proposed a nonparallel plane classifier (GEPSVM) that attempts to generate two nonparallel planes such that each plane is closer to one of the two classes and is at a distance of at least one from the other [7]. Motivated by GEPSVM, Jayadeva et al. proposed a twin support vector machine (TWSVM) classifier for binary classification [8]. Experimental results showed that nonparallel plane classifiers can improve the performance of traditional SVMs [7,8]. Other extensions to TWSVM have also been described [9-13]. Inspired by this previous success [7,8], we propose a nonparallel SVM algorithm with universum learning that we call U-NSVM. It has the following innovative points:
♦ U-NSVM is a very useful extension of the nonparallel hyperplanes classifier. To obtain two nonparallel hyperplanes, GEPSVM and TWSVM have to construct two quadratic programming problems (QPPs) separately. Although it is claimed that this approach can efficiently improve the algorithm training speed, the calculation time for the inverse matrix of
samples is not considered. In fact, solving for the inverse matrix is itself a difficult task. By contrast, U-NSVM uses universum samples, so the extra step of inverse matrix computation is not required and the method has the property of sparseness.
♦ As the U-NSVM classifier is built from two nonparallel hyperplanes, compared to U-SVM [1] it has better algorithm flexibility and can yield a more reasonable classifier in most cases (Fig. 2). In addition, U-NSVM includes fewer parameters and is therefore easier to implement. In practice, the U-SVM algorithm uses an ε-insensitive loss function to divide the universum data (Eq. (11)), while our algorithm does not include a corresponding parameter. Experiments confirm that our method is superior to a traditional SVM and U-SVM.

The remainder of the paper is organized as follows. Section 2 briefly introduces the background to SVM and U-SVM. Section 3 describes U-NSVM in detail. Experimental results for public data sets are presented in Section 4. The final section provides conclusions.

We first describe our notation. All vectors are column vectors unless transposed to a row vector by a superscript ⊤. The inner product of two vectors x and y in the n-dimensional real space ℜⁿ is denoted by (x · y). For w ∈ ℜⁿ, ∥w∥₁ denotes the 1-norm and ∥w∥₂ the 2-norm. B ∈ ℜ^{m×n} denotes a real m × n matrix; for such a matrix, B⊤ is the transpose of B, B_i is the ith row of B, and B_{·,j} is the jth column of B. A vector of all ones or all zeros of arbitrary dimension is denoted by e and 0, respectively. For x ∈ ℜⁿ, e⊤x denotes the sum of the components of x; x^{(*)} denotes the two cases x and x*.

2. Background

2.1. Support vector classification

SVMs were introduced in the framework of structural risk minimization and the theory of Vapnik-Chervonenkis bounds [14] for classification of training data
\[
T = \{(x_1, y_1), \dots, (x_l, y_l)\} \in (\Re^n \times Y)^l, \tag{1}
\]
where x_i ∈ ℜⁿ, y_i ∈ Y = {1, −1}, and i = 1, . . . , l. The linear soft-margin algorithm of an SVM solves the following primal QPP:
\[
\min_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{l}\xi_i
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\ i = 1,\dots,l, \tag{2}
\]
where C is a penalty parameter and the ξ_i are slack variables. The goal is to find an optimal separating hyperplane
\[
w^\top x + b = 0, \tag{3}
\]
where x ∈ ℜⁿ. The Wolfe dual of (2) can be expressed as
\[
\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j (x_i\cdot x_j)\,\alpha_i\alpha_j - \sum_{j=1}^{l}\alpha_j
\quad\text{s.t.}\quad \sum_{i=1}^{l} y_i\alpha_i = 0,\quad 0\le\alpha_i\le C,\ i = 1,\dots,l, \tag{4}
\]
where α ∈ ℜ^l are the Lagrange multipliers. The optimal separating hyperplane (3) is then given by
\[
w = \sum_{i=1}^{l}\alpha_i^{*} y_i x_i,\qquad
b = \frac{1}{N_{sv}}\sum_{j\in SV}\Big(y_j - \sum_{i=1}^{N_{sv}}\alpha_i^{*} y_i (x_i\cdot x_j)\Big), \tag{5}
\]
where α* is the solution of (4) and N_sv is the number of support vectors, i.e., the samples with α_i* > 0. A new sample is classified as +1 or −1 according to the final decision function.
2.2. SVM with universum: U-SVM

Suppose that the training set T̆ consists of two parts, labeled data and universum data:
\[
\check{T} = T \cup U, \tag{6}
\]
where
\[
T = \{(x_1^1, 1),\dots,(x_{l_1}^1, 1),(x_1^3,-1),\dots,(x_{l_3}^3,-1)\} \in (\Re^n\times Y)^l,\qquad
U = \{x_1^2,\dots,x_{l_2}^2\} \subset \Re^n, \tag{7}
\]
with x_i^1, x_i^3 ∈ ℜⁿ, y ∈ Y = {−1, 1}, i = 1, . . . , l, l = l₁ + l₃, and x_j^2 ∈ ℜⁿ, j = 1, . . . , l₂. The goal of U-SVM is to induce a real-valued function
\[
y = \operatorname{sgn}(g(x)) \tag{8}
\]
to infer the label y corresponding to any sample x in ℜⁿ. U-SVM uses an ε-insensitive loss function for the universum and minimizes
\[
\frac{1}{2}\|w\|_2^2 + c\sum_{i=1}^{l}\varphi[y_i f_{w,b}(x_i)] + d\sum_{j=1}^{u}\rho[f_{w,b}(x_j^{*})], \tag{9}
\]
where φ_ε[t] = max{0, ε − t} is the hinge loss function and
\[
\rho[t] = \rho_{-\varepsilon}[t] + \rho_{-\varepsilon}[-t] \tag{10}
\]
is the ε-insensitive loss that carries the prior knowledge embedded in the universum. In this way, the prior knowledge embedded in the universum is reflected in the sum of losses \(\sum_{j=1}^{u}\rho[f_{w,b}(x_j^{*})]\): the smaller this value, the higher the prior possibility of the classifier f_{w,b}, and vice versa [3]. Thus, U-SVM can be expressed as the following QPP [1]:
\[
\begin{aligned}
\min_{w,\,b,\,\xi^{1},\xi^{3},\xi^{(*)2}}\quad & \frac{1}{2}\|w\|_2^2 + C\Big(\sum_{i=1}^{l_1}\xi_i^1 + \sum_{i=1}^{l_3}\xi_i^3 + \sum_{s=1}^{l_2}(\xi_s^2+\xi_s^{*2})\Big)\\
\text{s.t.}\quad & (w\cdot x_i^1)+b \ge 1-\xi_i^1,\quad \xi_i^1\ge 0,\ i=1,\dots,l_1,\\
& -((w\cdot x_i^3)+b) \ge 1-\xi_i^3,\quad \xi_i^3\ge 0,\ i=1,\dots,l_3,\\
& -\varepsilon-\xi_s^{*2} \le (w\cdot x_s^2)+b \le \varepsilon+\xi_s^2,\quad \xi_s^2,\ \xi_s^{*2}\ge 0,\ s=1,\dots,l_2,
\end{aligned} \tag{11}
\]
where ξ^1 = (ξ_1^1, . . . , ξ_{l_1}^1)^⊤, ξ^{(*)2} = (ξ_1^{(*)2}, . . . , ξ_{l_2}^{(*)2})^⊤, ξ^3 = (ξ_1^3, . . . , ξ_{l_3}^3)^⊤, and C, ε ∈ [0, +∞) are the parameters.
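To make the role of the two loss terms in (9)-(10) concrete, here is a small illustration (ours, not from the paper) of the hinge loss φ_ε and the ε-insensitive universum loss ρ, taking ρ_{−ε}[t] = max{0, −ε − t} so that ρ[t] = max{0, |t| − ε}.

```python
# Illustration of the losses behind (9)-(10); not from the paper.
import numpy as np

def hinge(t, eps=1.0):
    """phi_eps[t] = max(0, eps - t); with eps = 1 this is the usual hinge loss."""
    return np.maximum(0.0, eps - t)

def universum_loss(t, eps=0.1):
    """rho[t] = rho_{-eps}[t] + rho_{-eps}[-t] = max(0, |t| - eps):
    zero whenever f(x) lies inside the eps-tube around the decision boundary."""
    return np.maximum(0.0, -eps - t) + np.maximum(0.0, -eps + t)

t = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
print(universum_loss(t, eps=0.1))   # -> 0.4, 0, 0, 0, 0.4
```

Universum outputs close to zero (inside the ε-tube) incur no loss, which is exactly the behavior the constraints on x_s^2 in (11) encode.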
3. Nonparallel SVM for classification with a universum: U-NSVM

3.1. Linear U-NSVM

U-SVM requires the hyperplane to satisfy the maximum-margin principle for the labeled data while keeping all the universum data as close as possible to it. Following previous success with nonparallel classifiers [7,8], we realize universum learning with two nonparallel hyperplanes. We first construct two nonparallel hyperplanes that divide the training set (6) into three parts, with the universum data sandwiched between the two hyperplanes. This is achieved by maximizing the two margins associated with the two closest neighboring classes (labeled data and universum data). A new data point x is then predicted as belonging to the positive or negative class depending on its distances to the two hyperplanes. It is not hard to see that the perpendicular bisector (decision function) of the two hyperplanes lies as close as possible to the central area of the universum data distribution and far away from the labeled data. According to this idea, the primal optimization problem of U-NSVM can be expressed as
\[
\begin{aligned}
\min_{w,\,b,\,\xi^{(*)}}\quad & \frac{1}{2}\big(\|w_1\|_2^2+\|w_2\|_2^2\big) + C\Big(\sum_{i=1}^{l_1}\xi_i^1 + \sum_{i=1}^{l_2}(\xi_i^2+\xi_i^{*2}) + \sum_{i=1}^{l_3}\xi_i^{*3}\Big)\\
\text{s.t.}\quad & (w_1\cdot x_i^1)-b_1 \le -1+\xi_i^1,\quad \xi_i^1\ge 0,\ i=1,\dots,l_1,\\
& (w_1\cdot x_i^2)-b_1 \ge 1-\xi_i^{*2},\quad \xi_i^{*2}\ge 0,\ i=1,\dots,l_2,\\
& (w_2\cdot x_i^2)-b_2 \le -1+\xi_i^2,\quad \xi_i^2\ge 0,\ i=1,\dots,l_2,\\
& (w_2\cdot x_i^3)-b_2 \ge 1-\xi_i^{*3},\quad \xi_i^{*3}\ge 0,\ i=1,\dots,l_3,
\end{aligned} \tag{12}
\]
where w = (w_1, w_2), b = (b_1, b_2), and ξ^{(*)} = (ξ_1^1, . . . , ξ_{l_1}^1, ξ_1^2, . . . , ξ_{l_2}^2, ξ_1^{*2}, . . . , ξ_{l_2}^{*2}, ξ_1^{*3}, . . . , ξ_{l_3}^{*3}).
By introducing the Lagrange function
\[
\begin{aligned}
L(\Theta) = {} & \frac{1}{2}\big(\|w_1\|_2^2+\|w_2\|_2^2\big) + C\Big(\sum_{i=1}^{l_1}\xi_i^1 + \sum_{i=1}^{l_3}\xi_i^{*3} + \sum_{i=1}^{l_2}(\xi_i^2+\xi_i^{*2})\Big)\\
& + \sum_{k=1}^{2}\sum_{i=1}^{l_k}\alpha_i^k\big((w_k\cdot x_i^k)-b_k+1-\xi_i^k\big)
 - \sum_{k=2}^{3}\sum_{i=1}^{l_k}\alpha_i^{*k}\big((w_{k-1}\cdot x_i^k)-b_{k-1}-1+\xi_i^{*k}\big)\\
& - \sum_{k=1}^{2}\sum_{i=1}^{l_k}\eta_i^k\xi_i^k - \sum_{k=2}^{3}\sum_{i=1}^{l_k}\eta_i^{*k}\xi_i^{*k},
\end{aligned} \tag{13}
\]
where Θ = (w, b, ξ^{(*)}, α^{(*)}, η^{(*)}) and
\[
\begin{aligned}
\alpha^{(*)} &= (\alpha_1^1,\dots,\alpha_{l_1}^1,\alpha_1^{*2},\dots,\alpha_{l_2}^{*2},\alpha_1^2,\dots,\alpha_{l_2}^2,\alpha_1^{*3},\dots,\alpha_{l_3}^{*3}),\\
\eta^{(*)} &= (\eta_1^1,\dots,\eta_{l_1}^1,\eta_1^{*2},\dots,\eta_{l_2}^{*2},\eta_1^2,\dots,\eta_{l_2}^2,\eta_1^{*3},\dots,\eta_{l_3}^{*3})
\end{aligned} \tag{14}
\]
are the Lagrange multipliers, the dual problem of (12) can be formulated as
\[
\max_{\Theta}\ L(\Theta)\quad\text{s.t.}\quad \nabla_{w,b,\xi^{(*)}}L(\Theta)=0,\quad \alpha^{(*)},\ \eta^{(*)}\ge 0. \tag{15}
\]
From (15) we obtain
\[
\nabla_{w_k}L = w_k + \sum_{i=1}^{l_k}\alpha_i^k x_i^k - \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1}x_i^{k+1} = 0,\quad k=1,2, \tag{16}
\]
\[
\nabla_{b_k}L = -\sum_{i=1}^{l_k}\alpha_i^k + \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1} = 0,\quad k=1,2, \tag{17}
\]
\[
\nabla_{\xi_i^k}L = C - \alpha_i^k - \eta_i^k = 0,\quad k=1,2,\ i=1,\dots,l_k, \tag{18}
\]
\[
\nabla_{\xi_i^{*k}}L = C - \alpha_i^{*k} - \eta_i^{*k} = 0,\quad k=2,3,\ i=1,\dots,l_k. \tag{19}
\]
Substituting the above equations into (12), the dual problem can be expressed as
\[
\begin{aligned}
\max_{\alpha^{(*)}}\quad & -\frac{1}{2}\sum_{k=1}^{2}\Big(\sum_{i=1}^{l_k}\sum_{j=1}^{l_k}\alpha_i^k\alpha_j^k (x_i^k\cdot x_j^k)
 - 2\sum_{i=1}^{l_k}\sum_{j=1}^{l_{k+1}}\alpha_i^k\alpha_j^{*\,k+1}(x_i^k\cdot x_j^{k+1})
 + \sum_{i=1}^{l_{k+1}}\sum_{j=1}^{l_{k+1}}\alpha_i^{*\,k+1}\alpha_j^{*\,k+1}(x_i^{k+1}\cdot x_j^{k+1})\Big)\\
& + \sum_{k=1}^{2}\sum_{i=1}^{l_k}\alpha_i^k + \sum_{k=2}^{3}\sum_{i=1}^{l_k}\alpha_i^{*k}\\
\text{s.t.}\quad & \sum_{i=1}^{l_k}\alpha_i^k = \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1},\quad k=1,2,\\
& 0\le\alpha_i^k\le C,\quad k=1,2;\ i=1,\dots,l_k,\\
& 0\le\alpha_i^{*k}\le C,\quad k=2,3;\ i=1,\dots,l_k.
\end{aligned} \tag{20}
\]
Theorem 3.1. Optimization problem (20) is a convex quadratic program.

Proof. Concisely, (20) can be reformulated as
\[
\max_{\hat\lambda}\ -\frac{1}{2}\hat\lambda^\top\hat\Lambda\hat\lambda + \hat\kappa^\top\hat\lambda
\quad\text{s.t.}\quad \hat\Omega\hat\lambda = 0,\quad 0\le\hat\lambda\le\hat C, \tag{21}
\]
where
\[
\hat\lambda = (\alpha_1^1,\dots,\alpha_{l_1}^1,\alpha_1^{*2},\dots,\alpha_{l_2}^{*2},\alpha_1^2,\dots,\alpha_{l_2}^2,\alpha_1^{*3},\dots,\alpha_{l_3}^{*3})^\top, \tag{22}
\]
\[
\hat\kappa = e,\qquad \hat C = Ce,\qquad e\in\Re^{\sum_{k=1}^{2}(l_k+l_{k+1})}, \tag{23}
\]
\[
\hat\Omega = \begin{pmatrix} e_1^\top & -e_2^\top & 0 & 0\\ 0 & 0 & e_2^\top & -e_3^\top \end{pmatrix}\in\Re^{2\times\sum_{k=1}^{2}(l_k+l_{k+1})},
\quad\text{with } e_k\in\Re^{l_k}\text{ a vector of ones}, \tag{24}
\]
\[
\hat\Lambda = \begin{pmatrix} Q^1 & 0\\ 0 & Q^2 \end{pmatrix}\in\Re^{\sum_{k=1}^{2}(l_k+l_{k+1})\times\sum_{k=1}^{2}(l_k+l_{k+1})}, \tag{25}
\]
and, for k = 1, 2,
\[
Q^k = \begin{pmatrix} Q_1^k & -Q_2^k\\ -Q_3^k & Q_4^k \end{pmatrix}\in\Re^{(l_k+l_{k+1})\times(l_k+l_{k+1})}, \tag{26}
\]
\[
Q_1^k = \begin{pmatrix} (x_1^k\cdot x_1^k) & \cdots & (x_1^k\cdot x_{l_k}^k)\\ \vdots & \ddots & \vdots\\ (x_{l_k}^k\cdot x_1^k) & \cdots & (x_{l_k}^k\cdot x_{l_k}^k) \end{pmatrix}\in\Re^{l_k\times l_k}, \tag{27}
\]
\[
Q_2^k = (Q_3^k)^\top = \begin{pmatrix} (x_1^k\cdot x_1^{k+1}) & \cdots & (x_1^k\cdot x_{l_{k+1}}^{k+1})\\ \vdots & \ddots & \vdots\\ (x_{l_k}^k\cdot x_1^{k+1}) & \cdots & (x_{l_k}^k\cdot x_{l_{k+1}}^{k+1}) \end{pmatrix}\in\Re^{l_k\times l_{k+1}}, \tag{28}
\]
\[
Q_4^k = \begin{pmatrix} (x_1^{k+1}\cdot x_1^{k+1}) & \cdots & (x_1^{k+1}\cdot x_{l_{k+1}}^{k+1})\\ \vdots & \ddots & \vdots\\ (x_{l_{k+1}}^{k+1}\cdot x_1^{k+1}) & \cdots & (x_{l_{k+1}}^{k+1}\cdot x_{l_{k+1}}^{k+1}) \end{pmatrix}\in\Re^{l_{k+1}\times l_{k+1}}. \tag{29}
\]
It is easy to see that Λ̂ is a positive semi-definite matrix. According to convex programming theory, we obtain that (20) is a convex quadratic program. □

Theorem 3.2. Suppose that λ̂ = (α_1^1, . . . , α_{l_1}^1, α_1^{*2}, . . . , α_{l_2}^{*2}, α_1^2, . . . , α_{l_2}^2, α_1^{*3}, . . . , α_{l_3}^{*3})^⊤ is a solution of the dual problem (21). If there exist components of λ̂ with values in the interval (0, C), then a solution (w_1, b_1), (w_2, b_2) of (12) can be obtained in the following way. Let
\[
w_k = \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1}x_i^{k+1} - \sum_{i=1}^{l_k}\alpha_i^k x_i^k,\quad k=1,2. \tag{30}
\]
Choose a component α_j^k ∈ (0, C), k = 1, 2, and compute
\[
b_k = 1 + \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1}(x_i^{k+1}\cdot x_j^k) - \sum_{i=1}^{l_k}\alpha_i^k(x_i^k\cdot x_j^k), \tag{31}
\]
or choose a component α_j^{*k+1} ∈ (0, C), k = 1, 2, and compute
\[
b_k = -1 + \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1}(x_i^{k+1}\cdot x_j^{k+1}) - \sum_{i=1}^{l_k}\alpha_i^k(x_i^k\cdot x_j^{k+1}). \tag{32}
\]
Proof. First we show that for w̄ given by (30) there exists b̄ = (b̄_1, b̄_2) such that (w̄, b̄) is a solution to (12). Theorem 3.1 shows that (20) can be rewritten as (21), and it is easy to see that (21) satisfies the Slater condition. Accordingly, if λ̄̂ is a solution to (21), there exist multipliers b̄, s̄, and ξ̄ such that
\[
0\le\bar{\hat\lambda}\le\hat C,\qquad \hat\Omega\bar{\hat\lambda}=0, \tag{33}
\]
\[
-\hat\Lambda\bar{\hat\lambda} + \hat\kappa + \bar b_1\hat\Omega_{1,\cdot}^\top + \bar b_2\hat\Omega_{2,\cdot}^\top - \bar s + \bar\xi = 0, \tag{34}
\]
\[
\bar s\ge 0,\quad \bar\xi\ge 0,\quad \bar\xi^\top(\bar{\hat\lambda}-\hat C)=0,\quad \bar s^\top\bar{\hat\lambda}=0. \tag{35}
\]
From (34) and (35) we have
\[
-\hat\Lambda\bar{\hat\lambda} + \hat\kappa + \bar b_1\hat\Omega_{1,\cdot}^\top + \bar b_2\hat\Omega_{2,\cdot}^\top + \bar\xi \ge 0. \tag{36}
\]
By (30), this is equivalent to
\[
(\bar w_1\cdot x_i^1)-\bar b_1 \le -1+\bar\xi_i^1,\quad i=1,\dots,l_1, \tag{37}
\]
\[
(\bar w_2\cdot x_i^2)-\bar b_2 \le -1+\bar\xi_i^2,\quad i=1,\dots,l_2, \tag{38}
\]
\[
(\bar w_1\cdot x_i^2)-\bar b_1 \ge 1-\bar\xi_i^{*2},\quad i=1,\dots,l_2, \tag{39}
\]
\[
(\bar w_2\cdot x_i^3)-\bar b_2 \ge 1-\bar\xi_i^{*3},\quad i=1,\dots,l_3, \tag{40}
\]
which implies that (w̄, b̄) is a feasible solution of the primal problem (12). Furthermore, from (33)-(35) we have
\[
\begin{aligned}
&\frac{1}{2}\big(\|\bar w_1\|_2^2+\|\bar w_2\|_2^2\big) + C\sum_{i=1}^{l_1}\bar\xi_i^1 + C\sum_{i=1}^{l_2}(\bar\xi_i^2+\bar\xi_i^{*2}) + C\sum_{i=1}^{l_3}\bar\xi_i^{*3}\\
&\quad = \frac{1}{2}\bar{\hat\lambda}^\top\hat\Lambda\bar{\hat\lambda} + C\sum_{i=1}^{l_1}\bar\xi_i^1 + C\sum_{i=1}^{l_2}(\bar\xi_i^2+\bar\xi_i^{*2}) + C\sum_{i=1}^{l_3}\bar\xi_i^{*3}
 + \bar{\hat\lambda}^\top\big(-\hat\Lambda\bar{\hat\lambda} + \hat\kappa + \bar b_1\hat\Omega_{1,\cdot}^\top + \bar b_2\hat\Omega_{2,\cdot}^\top - \bar s + \bar\xi\big)\\
&\quad = -\frac{1}{2}\bar{\hat\lambda}^\top\hat\Lambda\bar{\hat\lambda} + \hat\kappa^\top\bar{\hat\lambda}.
\end{aligned} \tag{41}
\]
This shows that the value of the objective function of the primal problem at (w̄, b̄) equals the optimal value of its dual problem; by convex duality theory, (w̄, b̄) is therefore an optimal solution to the primal problem (12). Finally, note that λ̄̂ ≠ 0. If some component α_j^k (or α_j^{*k+1}) of λ̄̂ lies in (0, C), then s̄_j = 0 and ξ̄_j = 0 by (35), so the corresponding entry of −Λ̂λ̄̂ + κ̂ + b̄_1Ω̂_{1,·}^⊤ + b̄_2Ω̂_{2,·}^⊤ is zero. Solving this equation with respect to b̄ yields expressions (31) and (32). □
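Before turning to the kernel case, the following is a minimal sketch (our own variable names, not the authors' MATLAB implementation) of how the linear dual (20)-(21) can be assembled and solved with a generic QP solver, with (w_k, b_k) then recovered via (30) and (31). It assumes NumPy and the cvxopt package; the small ridge added to Λ̂ and the tolerance are purely implementation choices.

```python
# Sketch: linear U-NSVM dual (21) via cvxopt; X1 = positive class, X2 = universum, X3 = negative class.
import numpy as np
from cvxopt import matrix, solvers

def linear_unsvm(X1, X2, X3, C=1.0, tol=1e-6):
    l1, l2, l3 = len(X1), len(X2), len(X3)

    def block(Xa, Xb):
        # Q^k = [[Xa Xa^T, -Xa Xb^T], [-Xb Xa^T, Xb Xb^T]], cf. (26)-(29)
        Q1, Q2, Q4 = Xa @ Xa.T, Xa @ Xb.T, Xb @ Xb.T
        return np.block([[Q1, -Q2], [-Q2.T, Q4]])

    n_var = (l1 + l2) + (l2 + l3)                     # lambda = (a^1, a*^2, a^2, a*^3)
    Lam = np.zeros((n_var, n_var))
    Lam[:l1 + l2, :l1 + l2] = block(X1, X2)           # hat Lambda = diag(Q^1, Q^2)
    Lam[l1 + l2:, l1 + l2:] = block(X2, X3)
    Lam += 1e-8 * np.eye(n_var)                       # tiny ridge for numerical stability

    P, q = matrix(Lam), matrix(-np.ones(n_var))       # maximize -> minimize with q = -kappa
    G = matrix(np.vstack([-np.eye(n_var), np.eye(n_var)]))
    h = matrix(np.hstack([np.zeros(n_var), C * np.ones(n_var)]))
    Om = np.zeros((2, n_var))                         # equality constraints of (20)
    Om[0, :l1], Om[0, l1:l1 + l2] = 1.0, -1.0         # sum a^1 = sum a*^2
    Om[1, l1 + l2:l1 + 2 * l2], Om[1, l1 + 2 * l2:] = 1.0, -1.0   # sum a^2 = sum a*^3
    lam = np.ravel(solvers.qp(P, q, G, h, matrix(Om), matrix(np.zeros(2)))['x'])

    a1, a2s = lam[:l1], lam[l1:l1 + l2]
    a2, a3s = lam[l1 + l2:l1 + 2 * l2], lam[l1 + 2 * l2:]
    w1 = X2.T @ a2s - X1.T @ a1                       # (30), k = 1
    w2 = X3.T @ a3s - X2.T @ a2                       # (30), k = 2
    j1 = np.argmax((a1 > tol) & (a1 < C - tol))       # some alpha_j^1 in (0, C)
    j2 = np.argmax((a2 > tol) & (a2 < C - tol))
    b1 = 1.0 + X1[j1] @ w1                            # (31), k = 1
    b2 = 1.0 + X2[j2] @ w2                            # (31), k = 2
    return (w1, b1), (w2, b2)
```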
3.2. Nonlinear U-NSVM

We now extend the linear U-NSVM to the nonlinear case by introducing the Gaussian kernel function
\[
K(x, x') = (\Phi(x)\cdot\Phi(x')) \tag{42}
\]
and the corresponding transformation
\[
\mathbf{x} = \Phi(x), \tag{43}
\]
where Φ(x) ∈ H and H is a Hilbert space. The training set (6) then becomes
\[
\tilde T\cup\tilde U = \{(\Phi(x_1^1),1),\dots,(\Phi(x_{l_1}^1),1),(\Phi(x_1^3),-1),\dots,(\Phi(x_{l_3}^3),-1)\}\cup\{\Phi(x_1^2),\dots,\Phi(x_{l_2}^2)\}. \tag{44}
\]
The nonlinear optimization problem to be solved is
\[
\begin{aligned}
\max_{\alpha^{(*)}}\quad & -\frac{1}{2}\sum_{k=1}^{2}\Big(\sum_{i=1}^{l_k}\sum_{j=1}^{l_k}\alpha_i^k\alpha_j^k K(x_i^k\cdot x_j^k)
 - 2\sum_{i=1}^{l_k}\sum_{j=1}^{l_{k+1}}\alpha_i^k\alpha_j^{*\,k+1}K(x_i^k\cdot x_j^{k+1})
 + \sum_{i=1}^{l_{k+1}}\sum_{j=1}^{l_{k+1}}\alpha_i^{*\,k+1}\alpha_j^{*\,k+1}K(x_i^{k+1}\cdot x_j^{k+1})\Big)\\
& + \sum_{k=1}^{2}\sum_{i=1}^{l_k}\alpha_i^k + \sum_{k=2}^{3}\sum_{i=1}^{l_k}\alpha_i^{*k}\\
\text{s.t.}\quad & \sum_{i=1}^{l_k}\alpha_i^k = \sum_{i=1}^{l_{k+1}}\alpha_i^{*\,k+1},\quad k=1,2,\\
& 0\le\alpha_i^k\le C,\quad k=1,2;\ i=1,\dots,l_k,\\
& 0\le\alpha_i^{*k}\le C,\quad k=2,3;\ i=1,\dots,l_k.
\end{aligned} \tag{45}
\]
The corresponding theorems in the nonlinear case are similar to Theorems 3.1 and 3.2; we only need to use K(x, x') in place of (x · x'). The U-NSVM algorithm can now be stated as follows.

(U-NSVM)
(1) Input the training set (44);
(2) Choose an appropriate kernel K(x, x'), its parameters, and C > 0;
(3) Construct and solve the optimization problem (45) to obtain the solution
\[
\hat\lambda = (\alpha_1^1,\dots,\alpha_{l_1}^1,\alpha_1^{*2},\dots,\alpha_{l_2}^{*2},\alpha_1^2,\dots,\alpha_{l_2}^2,\alpha_1^{*3},\dots,\alpha_{l_3}^{*3}); \tag{46}
\]
(4) Construct the decision functions
\[
f_1(x) = \sum_{i=1}^{l_2}\alpha_i^{*2}K(x_i^2\cdot x) - \sum_{i=1}^{l_1}\alpha_i^1 K(x_i^1\cdot x) - b_1, \tag{47}
\]
\[
f_2(x) = \sum_{i=1}^{l_3}\alpha_i^{*3}K(x_i^3\cdot x) - \sum_{i=1}^{l_2}\alpha_i^2 K(x_i^2\cdot x) - b_2, \tag{48}
\]
where b_1 and b_2 are computed according to Theorems 3.1 and 3.2 for the kernel case;
(5) For any new input x, assign it to class k (k = 1, 2) according to
\[
\arg\min_{k=1,2}\ \frac{|f_k(x)|}{\|\Delta_k\|}, \tag{49}
\]
where
\[
\Delta_1 = \hat\lambda_1^\top Q^1\hat\lambda_1,\qquad \Delta_2 = \hat\lambda_2^\top Q^2\hat\lambda_2, \tag{50}
\]
and
\[
\hat\lambda_1 = (\alpha_1^1,\dots,\alpha_{l_1}^1,\alpha_1^{*2},\dots,\alpha_{l_2}^{*2}),\qquad
\hat\lambda_2 = (\alpha_1^2,\dots,\alpha_{l_2}^2,\alpha_1^{*3},\dots,\alpha_{l_3}^{*3}). \tag{51}
\]
4. Experiments

Our algorithm was implemented in MATLAB 2010 on a PC with an Intel Core i5 processor and 2 GB of RAM; the quadprog function in MATLAB was used to solve the QPPs. Accuracy was used to evaluate the methods, defined as accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. To demonstrate the capabilities of our algorithm, we report results for four data sets: toy, MNIST, UCI, and ABCDETC. In all experiments, our method was compared with a standard SVM and U-SVM. Testing accuracy was computed using standard 10-fold cross-validation [15]. C and the RBF kernel parameter σ were selected from the set {2^i | i = −7, . . . , 7} (ε was selected from the range [0.001, 0.5]) by 10-fold cross-validation on a tuning set comprising a random 10% of the training data. Once the parameters were selected, the tuning set was returned to the training set to learn the final decision function.

4.1. Toy data set

To demonstrate the intuitive behavior of U-NSVM for different numbers of universum data, we first used a toy 2D data set to test the influence of universum data on accuracy. Positive and negative data were generated randomly from two normal distributions with means u1 = 0 and u2 = 4 and standard deviations σ1 = 4 and σ2 = 4. Unlabeled data were also generated from a Gaussian distribution with mean u3 = 2 and standard deviation σ3 = 1.5. We used 80 positive and 80 negative points as the training set, and 10, 20, 80, or 160 unlabeled points as universum samples. A further set of 100 positive and 100 negative points was taken as the test set. We selected a linear kernel for the toy data set. Fig. 1(a)-(c) gives an intuitive picture of U-NSVM, which looks for two nonparallel hyperplanes by maximizing the margins between labeled and unlabeled data. Fig. 1(d)-(f) shows the final prediction results for 10, 80, and 160 universum points. It is clear that the universum data help the algorithm to identify a more reasonable classifier, and the accuracy of U-NSVM on the toy data increases with the number of universum points. The second toy data set is a nonlinearly separable example; Fig. 2 shows the U-SVM and U-NSVM results. This example clearly demonstrates the advantage of U-NSVM for nonlinear classification.
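For reference, the first toy data set described above could be generated as follows; the means and standard deviations are those given in the text, while the isotropic 2-D form, the chosen universum size, and the random seed are our assumptions.

```python
# Sketch of the toy data generation (assumed isotropic 2-D Gaussians).
import numpy as np

rng = np.random.default_rng(0)

def gauss2d(mean, std, n):
    return rng.normal(mean, std, size=(n, 2))

X_pos  = gauss2d(0.0, 4.0, 80)    # positive class: u1 = 0, sigma1 = 4
X_neg  = gauss2d(4.0, 4.0, 80)    # negative class: u2 = 4, sigma2 = 4
X_univ = gauss2d(2.0, 1.5, 20)    # universum: u3 = 2, sigma3 = 1.5 (20 of the 10/20/80/160 settings)
X_test = np.vstack([gauss2d(0.0, 4.0, 100), gauss2d(4.0, 4.0, 100)])
y_test = np.hstack([np.ones(100), -np.ones(100)])
```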
[Figure 1: six panels; (a)-(c) Universum = 10, 80, 160; (d)-(f) Universum = 10, 80, 160.]
Fig. 1. U-NSVM results for the toy data set. (a-c) Positive points, red squares; negative points, blue diamonds; universum data, cyan stars; mean of the positive class, red solid square; mean of the negative class, blue solid diamond; ideal classifier, dotted line; two nonparallel hyperplanes, solid lines. (d-f) Scatter plots of '+' positive points and '*' negative points.
[Figure 2: two panels; (a) U-SVM, (b) U-NSVM.]
Fig. 2. U-SVM and U-NSVM results for a crossing data set. Positive points, red squares; negative points, blue diamonds; universum data, cyan stars; hyperplanes, solid lines.

Table 1
Description of the Iris and Wine data sets.

Name | Dimension (N) | Classes (K) | Examples (L)
Iris | 4  | 3 | 150
Wine | 13 | 3 | 178
4.2. UCI data sets

The Iris and Wine data sets are from the UCI machine learning repository.¹ Table 1 describes the data sets. Our experiment is a binary classification problem. For the Iris data set, class 1 (50 instances) and class 2 (50 instances) were taken as the training set and 50 universum examples were randomly generated. For the Wine data set, we used class 1 (59 instances) and class 3 (48 instances) for classification, with 60 universum examples.

¹ UCI repository of machine learning databases, University of California. http://www.ics.uci.edu/mlearn/MLRepository.html.
Table 2
Percentage accuracy for tenfold testing for the Iris data set (columns: training subset size).

Method | 20 | 40 | 60 | 80 | 100
SVM    | 85.26 ± 4.58 | 87.24 ± 3.45 | 90.39 ± 2.16 | 92.72 ± 1.87 | 93.86 ± 2.45
U-SVM  | 88.21 ± 3.24 | 89.71 ± 2.34 | 93.31 ± 1.63 | 94.12 ± 1.32 | 95.21 ± 0.85
U-NSVM | 91.28 ± 3.57 | 91.22 ± 2.11 | 95.01 ± 1.44 | 95.67 ± 1.12 | 98.05 ± 0.69
Table 3
Percentage accuracy for tenfold testing for the Wine data set (columns: training subset size).

Method | 30 | 50 | 70 | 90 | 107
SVM    | 78.24 ± 5.21 | 81.78 ± 4.55 | 87.49 ± 3.33 | 93.42 ± 3.49 | 94.88 ± 2.55
U-SVM  | 81.25 ± 4.27 | 84.51 ± 3.54 | 91.21 ± 2.36 | 95.42 ± 2.62 | 97.31 ± 0.75
U-NSVM | 81.29 ± 4.38 | 85.61 ± 3.25 | 92.45 ± 2.29 | 97.34 ± 2.21 | 97.52 ± 0.65
Table 4
Testing accuracy for the '5' versus '8' data set with 350 universum examples (columns: training subset size).

Method | 400 | 800 | 1500 | 2000 | 2500
SVM    | 94.31 ± 1.78 | 95.98 ± 1.76 | 96.64 ± 1.21 | 97.11 ± 1.08 | 97.79 ± 0.91
U-SVM  | 96.45 ± 1.54 | 96.91 ± 1.12 | 97.66 ± 0.80 | 98.02 ± 0.73 | 98.31 ± 0.65
U-NSVM | 96.51 ± 1.65 | 97.08 ± 1.14 | 97.78 ± 0.67 | 98.48 ± 0.65 | 98.58 ± 0.54
Table 5
Testing accuracy of U-NSVM for the '5' versus '8' data set (2500 training examples) for differing numbers of universum examples.

Number of universum examples | 500 | 1000 | 2000 | 4000 | 6000
U-NSVM accuracy | 97.67 ± 0.81 | 98.24 ± 0.64 | 98.91 ± 0.56 | 99.12 ± 0.51 | 99.28 ± 0.42
Many methods can be used to collect universum data in practice. Here we used the UMean method [1],² which performs better than other methods, to construct the universum data. Tables 2 and 3 list the experimental results for an RBF kernel.

4.3. MNIST data set

The MNIST data set comprises handwritten digits from '0' to '9'; each sample is 28 × 28 pixels. We used the digits '5' and '8' to form a binary classification problem [1]. For experiment 1, training set sizes of 400, 800, 1500, 2000, and 2500 were used ('5' and '8' had the same number of samples), and another 350 digits uniformly distributed over the remaining classes were selected as universum samples. For experiment 2, we selected 2500 '5' and '8' digits as the training set, and 500, 1000, 2000, 4000, and 6000 other digits as universum samples. Tables 4 and 5 list the results for experiments 1 and 2 with a linear kernel.

4.4. ABCDETC data set

We used the ABCDETC data set [1] to further test our algorithm. The data set contains 78 classes comprising 19 646 images: lowercase letters (a-z), uppercase letters (A-Z), digits (0-9), and other symbols (, . : ; ! ? + − = / $ % ( ) @ " '). The symbols were written in pen by 51 subjects, with five replicates per symbol, on a single gridded sheet and saved as 100 × 100 binary images. To improve computing speed, we shrank the original images to 30 × 30 pixels by bilinear interpolation. Our goal was to compare the performance of SVM, U-SVM, U-NSVM, and a multi-class SVM using a ''one versus the rest'' strategy. Our experiments were carried out on raw pixel features rather than histogram-of-gradient features [16]. The lowercase letters 'a' and 'b' were taken as the labeled data, and the remaining data (lowercase letters, uppercase letters, digits, other symbols) as universum data. We used training sets of size 30, 60, 120, 180, and 250, a validation set of 200, and the remaining data for testing. The results are shown in Table 6 and Fig. 3.
² Each universum example is generated by selecting two samples from two different classes and then combining them with a mean coefficient.
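As a concrete illustration, a minimal sketch of the UMean-style construction described in the footnote is given below; the 0.5/0.5 averaging coefficient and the random pairing are our assumptions.

```python
# Sketch of UMean-style universum generation: average one sample from each class.
import numpy as np

def umean_universum(X_pos, X_neg, n_univ, coef=0.5, seed=0):
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_pos), size=n_univ)
    j = rng.integers(0, len(X_neg), size=n_univ)
    return coef * X_pos[i] + (1.0 - coef) * X_neg[j]
```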
Table 6
Classification accuracy and standard deviation for the ABCDETC data set in the case of a linear kernel. Lowercase letters 'a' and 'b' were used as the training data. Four different unlabeled data sets were taken as the universum data: lowercase letters (c-z), uppercase letters (A-Z), digits (0-9), and other symbols. The standard deviation is shown in brackets; columns give the size of the training data.

Method      | Type of universum | 30 | 60 | 120 | 180 | 250
SVM         | None      | 57.84 [7.28] | 64.14 [4.69] | 72.74 [3.12] | 74.16 [2.18] | 77.32 [3.26]
U-SVM       | Lowercase | 61.68 [2.86] | 68.16 [4.23] | 75.18 [2.11] | 76.87 [3.17] | 78.46 [3.67]
U-SVM       | Uppercase | 63.11 [2.12] | 67.82 [3.71] | 76.01 [3.02] | 77.64 [3.76] | 79.13 [4.12]
U-SVM       | Digits    | 61.09 [2.98] | 67.54 [2.89] | 74.88 [2.66] | 77.21 [2.90] | 78.59 [2.80]
U-SVM       | Symbols   | 62.91 [2.54] | 69.34 [3.44] | 74.33 [3.41] | 76.89 [3.24] | 79.61 [3.54]
U-NSVM      | Lowercase | 63.24 [3.12] | 69.24 [3.23] | 75.91 [4.09] | 77.63 [2.64] | 78.64 [2.16]
U-NSVM      | Uppercase | 64.32 [2.45] | 68.64 [4.24] | 76.60 [2.42] | 78.35 [3.22] | 79.06 [3.39]
U-NSVM      | Digits    | 62.65 [2.68] | 68.38 [3.38] | 75.79 [3.77] | 77.91 [2.45] | 78.69 [3.85]
U-NSVM      | Symbols   | 64.38 [3.41] | 70.51 [2.97] | 75.29 [2.74] | 78.43 [4.05] | 81.02 [1.98]
Multi-class | Lowercase | 58.24 [2.32] | 64.64 [1.53] | 70.01 [2.34] | 74.66 [1.54] | 76.33 [1.56]
Multi-class | Uppercase | 61.23 [2.11] | 65.48 [3.21] | 72.64 [2.79] | 75.25 [1.21] | 78.03 [2.29]
Multi-class | Digits    | 57.85 [3.21] | 64.32 [1.83] | 71.49 [2.47] | 74.55 [1.53] | 77.65 [2.95]
Multi-class | Symbols   | 55.64 [2.51] | 65.80 [3.17] | 72.38 [2.57] | 75.67 [2.45] | 78.33 [2.86]
[Figure 3: four panels plotting accuracy (with standard deviations) against training size; (a) Lowercase, (b) Uppercase, (c) Digits, (d) Symbols.]
Fig. 3. U-SVM and U-NSVM results for the ABCDETC data set in the case of the RBF kernel.
4.5. Discussion

Our experimental results show that U-SVM and U-NSVM outperform the traditional SVM in most cases (Tables 2, 3, 4, and 6 and Fig. 3). Thus, universum data are helpful in improving classifier performance (Fig. 1). Table 5 shows that the number of universum data directly influences the performance of U-SVM and U-NSVM. The more universum data that are available,
the better the performance. Tables 2 and 6 and Fig. 3 show that U-NSVM performs better than U-SVM for the Iris and ABCDETC data sets, and Tables 3 and 4 show that U-NSVM performs slightly better than U-SVM for the Wine and '5' versus '8' data sets. These results confirm that a classifier that combines two nonparallel hyperplanes has better algorithm flexibility than the other methods. In fact, for binary classification, maximizing the margin generated by parallel hyperplanes does not make full use of the actual sample distribution when constructing a classifier. A classifier that combines two nonparallel hyperplanes can usually yield a greater margin and has better generalization capability (Fig. 2). In addition, our method has fewer parameters than U-SVM, which directly reduces the operational difficulty; this is one reason why U-NSVM can yield better results.

In addition, taking the universum as a third class, we compared our method to a multi-class SVM with a ''one versus the rest'' strategy. Table 6 and Fig. 3 reveal lower performance for the multi-class SVM compared to U-SVM and U-NSVM. The main reason is that the goals of U-NSVM and the multi-class method are different. U-SVM and U-NSVM deal with binary classification problems, and the universum samples are used to improve the accuracy of this binary classification model. The goal of the multi-class method, by contrast, is to deal with multi-class classification problems; the universum samples are taken as a third class to improve the classifier accuracy for the other two classes. Moreover, the multi-class SVM needs to solve more optimization problems, which negatively affects the computing speed. Thus, U-SVM and U-NSVM are superior to the multi-class SVM with a ''one versus the rest'' strategy in terms of both computation time and classification accuracy.

5. Conclusion

We proposed a new U-NSVM algorithm that can exploit prior knowledge embedded in a universum sample to construct a more robust classifier. U-NSVM maximizes the two margins associated with the two closest neighboring classes (among the positive class, negative class, and universum class). Compared to U-SVM, the U-NSVM classifier combines two nonparallel hyperplanes, which can yield a greater margin and better generalization capability. In addition, our method has fewer parameters than U-SVM, so it is easier to implement. Experiments showed that U-NSVM outperforms a traditional SVM and U-SVM. In some sense, our method may represent a new starting point for universum learning. A possible direction for future work would be to extend U-NSVM to a semi-supervised universum problem. How to generate or select universum data is also an interesting issue that we plan to consider.

Acknowledgments

This work was partly supported by the China Postdoctoral Science Foundation (Grant No. 2013M530702), grants from the National Natural Science Foundation of China (Nos. 11271361 and 11271367), a key project of the National Natural Science Foundation of China (No. 71331005), the Major International (Regional) Joint Research Project (No. 71110107026), the CAS/SAFEA International Partnership Program for Creative Research Teams, and the Ministry of Water Resources' special funds for scientific research on public causes (No. 201301094).

References

[1] J. Weston, R. Collobert, F. Sinz, L. Bottou, V. Vapnik, Inference with the universum, in: Proceedings of the 23rd International Conference on Machine Learning, ICML'06, ACM, 2006, pp. 1009-1016.
[2] F.H. Sinz, O. Chapelle, A. Agarwal, B. Schölkopf, An analysis of inference with the universum, in: Advances in Neural Information Processing Systems, vol. 20, MIT Press, 2008, pp. 1369-1376.
[3] D. Zhang, J. Wang, F. Wang, C. Zhang, Semi-supervised classification with universum, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 323-333.
[4] S. Chen, C. Zhang, Selecting informative universum sample for semi-supervised learning, in: Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009, pp. 1016-1021.
[5] V. Cherkassky, S. Dhar, W. Dai, Practical conditions for effectiveness of the universum learning, IEEE Trans. Neural Netw. 22 (2011) 1241-1255.
[6] C. Shen, P. Wang, F. Shen, H. Wang, UBoost: boosting with the universum, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 825-832.
[7] O. Mangasarian, E. Wild, Multisurface proximal support vector classification via generalized eigenvalues, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 69-74.
[8] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 905-910.
[9] M.A. Kumar, M. Gopal, Application of smoothing technique on twin support vector machines, Pattern Recognit. Lett. 29 (2008) 1842-1848.
[10] R. Khemchandani, Jayadeva, S. Chandra, Optimal kernel selection in twin support vector machines, Optim. Lett. 3 (2009) 77-88.
[11] M.A. Kumar, M. Gopal, Least squares twin support vector machines for pattern classification, Expert Syst. Appl. 36 (2009) 7535-7543.
[12] S. Ghorai, A. Mukherjee, P.K. Dutta, Nonparallel plane proximal classifier, Signal Process. 89 (2009) 510-522.
[13] Y.-H. Shao, C.-H. Zhang, X.-B. Wang, N.-Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (2011) 962-968.
[14] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1996.
[15] N. Deng, Y. Tian, Support Vector Machines: Theory, Algorithms and Extensions, Science Press, Beijing, 2009.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886-893.