Semi-supervised support vector classification with self-constructed Universum

Neurocomputing 189 (2016) 33–42

Yingjie Tian a,b, Ying Zhang c, Dalian Liu d,*

a Research Center on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100190, China
b Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
c School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China
d Department of Basic Course Teaching, Beijing Union University, Beijing 100101, China
* Corresponding author.

Article history: Received 12 March 2015; received in revised form 16 October 2015; accepted 15 November 2015; available online 26 November 2015. Communicated by Yongdong Zhang.

Abstract

In this paper, we propose a strategy for the semi-supervised classification problem in which a support vector machine with self-constructed Universum is solved iteratively. Universum data, which belong to neither class of interest, have been shown to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. Our new method seeks more reliable positive and negative examples from the unlabeled dataset step by step, and the Universum support vector machine (U-SVM) is applied iteratively. Since different Universum data result in different performance, several effective approaches for constructing Universum datasets are explored. Experimental results demonstrate that an appropriately constructed Universum improves the accuracy and reduces the number of iterations.

Keywords: Semi-supervised; Classification; Universum; Support vector machine

1. Introduction

In traditional supervised learning, we acquire the decision function only by learning from a labeled dataset. However, in many applications of machine learning, such as image retrieval [1], text classification [2], and natural language parsing [3], abundant unlabeled data can be acquired cheaply and automatically, whereas labeling samples manually is labor-intensive and very time consuming. In such situations, traditional supervised learning usually deteriorates because of the lack of sufficient supervised information. Semi-supervised learning (SSL) [4–9], which addresses this problem by using a large amount of unlabeled data together with the labeled data to build a better classifier, has therefore attracted an increasing amount of interest.

Semi-supervised learning problem: Given a training set

$$T = \{(x_1, y_1), \ldots, (x_l, y_l)\} \cup \{x_{l+1}, \ldots, x_{l+q}\}, \qquad (1)$$

where x_i ∈ R^n, y_i ∈ {−1, 1}, i = 1, …, l, x_i ∈ R^n, i = l+1, …, l+q, and the set {x_{l+1}, …, x_{l+q}} is a collection of unlabeled inputs known to belong to one of the classes, predict the outputs y_{l+1}, …, y_{l+q} for {x_{l+1}, …, x_{l+q}} and find a real function g(x) on R^n such that the output y for any input x can be predicted by

$$f(x) = \operatorname{sgn}(g(x)). \qquad (2)$$

The motivation of semi-supervised methods is to take advantage of the unlabeled data to improve performance. There are roughly five kinds of methods for solving the semi-supervised learning problem: generative methods [10–13], graph-based methods [14–16], co-training methods [17,18], low-density separation methods [19,20], and self-training methods [21–23]. Self-training is probably the earliest idea for using unlabeled data and is a commonly used technique. It is also known as self-learning, self-labeling, or bootstrapping (not to be confused with the statistical procedure of the same name). It is a wrapper algorithm that repeatedly uses a supervised method: first, a classifier is trained on the small set of labeled examples and used to classify the unlabeled data; the most confident unlabeled points are added into the training set, the classifier is re-trained with the new data, and the process is repeated. The idea has been used in many applications [24–26], and our method belongs to this family.

Universum, which is defined as a collection of unlabeled points known not to belong to either class, was first proposed in [27]. It captures a general backdrop of the problem of interest and is expected to represent meaningful information connected with the classification task at hand. A Universum dataset is easy to acquire, since few requirements are placed on it. Additionally, it can capture some prior information about the ground-truth decision boundary, because it need not have the same distribution as the training set. Several algorithms involving Universum have therefore been proposed in machine learning. In [27], the authors proposed a new SVM framework, called U-SVM, to handle the supervised problem, and their experimental results showed that U-SVM outperforms SVMs that do not consider Universum data. An analysis of U-SVM was given by Sinz et al. [28], who also presented a least squares (LS) version of the U-SVM algorithm. Zhang et al. [29] proposed a graph-based semi-supervised algorithm in which the labeled data, unlabeled data, and Universum data are utilized simultaneously to improve classification performance. Qi et al. [30,31] used Universum to design new nonparallel support vector machines in order to improve classification performance. Other related work can be found in [32,33].

Inspired by the success of U-SVM, in this paper we propose an iterative support vector machine with self-constructed Universum for semi-supervised classification. It has the following advantages:

- We simultaneously utilize Universum data and an iterative method to improve performance. The Universum data are used to capture prior knowledge about the training set, and the iterative method is used to seek more reliable positive and negative examples from the unlabeled dataset step by step. In each step we train U-SVM on the labeled points, and the most confident unlabeled points are added into the training set. The experimental results show that this performs better than the other methods compared.
- Different Universum data lead to different results, so it is crucial to construct an appropriate Universum. In this paper we generate the Universum data only from the dataset itself instead of constructing it from some other dataset, on the grounds that the dataset itself carries more useful information. For example, for the classification of '5' and '8' in handwritten digit recognition, we use only '5' and '8' to generate Universum data instead of the full set of digits '1'–'9'. Moreover, several methods for constructing an appropriate Universum from the dataset itself are compared and recommended.

This paper is organized as follows. Section 2 reviews the transductive support vector machine (TSVM) [2], the Laplacian SVM (LapSVM) [6], and U-SVM. Section 3 proposes our new method, the U-SVM with self-constructed Universum, termed Us-SVM. Section 4 presents experimental results, and Section 5 contains concluding remarks.

Fig. 1. Positive points (marked by "+"), negative points (marked by "∗"), unlabeled points (marked by "."), Universum points (marked by "⋆"), the ideal decision boundary (red dotted line), the decision boundary of standard SVM (blue dotted line), the decision boundary of U-SVM (black solid line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

2. Background

In this section, we briefly introduce TSVM and LapSVM for the semi-supervised classification problem, and U-SVM for the Universum classification problem.

2.1. TSVM

TSVM aims to identify the classification model following the maximum-margin framework for both labeled and unlabeled examples. The popular version of TSVM [2] solves the following primal problem on the training set (1):

$$\min_{w,b,\xi,\xi^{*},y^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\xi_{i} + C^{*}\sum_{i=l+1}^{l+q}\xi_{i}^{*},$$
$$\text{s.t.}\ \ y_{i}((w\cdot x_{i})+b) \ge 1-\xi_{i},\quad i=1,\ldots,l,$$
$$\qquad y_{i}^{*}((w\cdot x_{i})+b) \ge 1-\xi_{i}^{*},\quad i=l+1,\ldots,l+q,$$
$$\qquad \xi_{i}\ge 0,\ i=1,\ldots,l;\qquad \xi_{i}^{*}\ge 0,\ i=l+1,\ldots,l+q. \qquad (3)$$

We can see that the above problem is a non-convex optimization problem due to the product terms y_i^*((w·x_i)+b) in the constraints, since the labels y_i^* of the unlabeled points are themselves optimization variables. Extensive research efforts have been devoted to finding approximate solutions to TSVM [19,20,34,35]; for example, a label-switching-retraining procedure is proposed in [2] to speed up the computation.

2.2. LapSVM

The regularization framework of the Laplacian support vector machine extends SVM to the SSL setting [6]. With the training set (1) and a kernel function K(·,·), the decision function can be obtained by

$$f = \arg\min_{f\in\mathcal{H}_{K}}\ \frac{1}{l}\sum_{i=1}^{l}(1-y_{i}f(x_{i}))_{+} + \gamma_{H}\|f\|_{H}^{2} + \gamma_{M}\|f\|_{M}^{2}, \qquad (4)$$

where f is the decision function

$$f(x) = \sum_{i=1}^{l+q}\alpha_{i}K(x_{i},x)+b, \qquad (5)$$

the regularization term ||f||_H^2 can be expressed as

$$\|f\|_{H}^{2} = \|w\|^{2} = (\Phi\alpha)^{T}(\Phi\alpha) = \alpha^{T}K\alpha, \qquad (6)$$

and the manifold regularization is written as

$$\|f\|_{M}^{2} = \frac{1}{(l+q)^{2}}\sum_{i,j=1}^{l+q}W_{ij}(f(x_{i})-f(x_{j}))^{2} = \mathbf{f}^{T}L\mathbf{f}, \qquad (7)$$


where L = D − W is the graph Laplacian, D is the diagonal matrix with i-th diagonal entry D_ii = Σ_{j=1}^{l+q} W_ij, W is the adjacency matrix, and f = [f(x_1), …, f(x_{l+q})]^T = Kα; the weight γ_H controls the complexity of f in the reproducing kernel Hilbert space, and γ_M penalizes f along the Riemannian manifold M.
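To make the manifold regularizer concrete, the following is a minimal NumPy sketch (ours, not from the paper) that builds a symmetric k-nearest-neighbour adjacency matrix W, forms L = D − W, and evaluates the penalty of Eq. (7); the neighbourhood size and the binary 0/1 weights are illustrative assumptions, since the paper does not specify how W is constructed.

```python
import numpy as np

def graph_laplacian(X, k=5):
    """Symmetric k-NN adjacency matrix W (0/1 weights) and L = D - W."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist2[i])[1:k + 1]               # k nearest neighbours, self excluded
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    D = np.diag(W.sum(axis=1))                           # D_ii = sum_j W_ij
    return D - W

def manifold_penalty(f_vals, L):
    """||f||_M^2 = f^T L f / (l+q)^2 as in Eq. (7); f_vals = [f(x_1), ..., f(x_{l+q})]."""
    return float(f_vals @ L @ f_vals) / len(f_vals) ** 2
```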

2.3. U-SVM

Universum classification problem: Given a training set

$$\tilde{T} = T \cup U = \{(x_1,y_1),\ldots,(x_l,y_l)\} \cup \{x_1^{*},\ldots,x_u^{*}\}, \qquad (8)$$

where x_i ∈ R^n, y_i ∈ {−1, 1}, i = 1, …, l, x_j^* ∈ R^n, j = 1, …, u, and the set

$$U = \{x_1^{*},\ldots,x_u^{*}\} \qquad (9)$$

is a collection of unlabeled inputs (Universum) known not to belong to either class. The Universum is expected to represent meaningful information related to the classification task at hand. We intend to find a real function g(x) on R^n such that the value of y for any x can be predicted by the decision function

$$f(x) = \operatorname{sgn}(g(x)). \qquad (10)$$

U-SVM [27] formulates the problem as the following quadratic programming problem (QPP):

$$\min_{w,b,\xi,\psi^{(*)}} \ \frac{1}{2}\|w\|^{2} + C_{u}\sum_{i=1}^{u}(\psi_{i}+\psi_{i}^{*}) + C_{t}\sum_{j=1}^{l}\xi_{j},$$
$$\text{s.t.}\ \ -\varepsilon-\psi_{i}^{*} \le (w\cdot x_{i}^{*})+b \le \varepsilon+\psi_{i},\quad i=1,\ldots,u,$$
$$\qquad \psi_{i},\ \psi_{i}^{*} \ge 0,\quad i=1,\ldots,u,$$
$$\qquad y_{j}((w\cdot x_{j})+b) \ge 1-\xi_{j},\quad j=1,\ldots,l,$$
$$\qquad \xi_{j} \ge 0,\quad j=1,\ldots,l, \qquad (11)$$

where ψ^(*) stands for (ψ_1, ψ_1^*, …, ψ_u, ψ_u^*)^T, and C_u, C_t ≥ 0, ε ≥ 0 are parameters.
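For illustration only, the following is a minimal linear sketch (ours, not the authors' MATLAB implementation) that minimizes the primal objective corresponding to (11) by subgradient descent on the hinge and ε-insensitive losses, instead of solving the QPP exactly; the learning rate and iteration count are arbitrary assumptions.

```python
import numpy as np

def usvm_fit(X, y, X_univ, C_t=1.0, C_u=100.0, eps=0.001, lr=1e-3, n_iter=2000):
    """Linear U-SVM via subgradient descent on the primal:
    1/2 ||w||^2 + C_t * sum_j hinge(y_j (w.x_j + b))
                + C_u * sum_i [max(0, (w.x*_i + b) - eps) + max(0, -(w.x*_i + b) - eps)]."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        g_w, g_b = w.copy(), 0.0                     # gradient of the 1/2||w||^2 term
        margins = y * (X @ w + b)                    # hinge loss on the labeled points
        viol = margins < 1
        g_w -= C_t * (y[viol, None] * X[viol]).sum(axis=0)
        g_b -= C_t * y[viol].sum()
        out = X_univ @ w + b                         # eps-insensitive loss on Universum points
        upper, lower = out > eps, out < -eps
        g_w += C_u * (X_univ[upper].sum(axis=0) - X_univ[lower].sum(axis=0))
        g_b += C_u * (upper.sum() - lower.sum())
        w -= lr * g_w
        b -= lr * g_b
    return w, b

def usvm_decision(X, w, b):
    """Values of g(x) = w.x + b; the predicted label is sign(g(x))."""
    return X @ w + b
```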

Fig. 2. (a)–(d) Umean(all), (e)–(h) Umean(nearest), (i)–(l) Umean(self), (m)–(p) Umean(vary). Positive points (marked by "+"), negative points (marked by "∗"), unlabeled points (marked by "."), Universum points (marked by "⋆"), the ideal decision boundary (red dotted line), the original decision boundary of SVM (blue dotted line), the decision boundary of Us-SVM (black solid line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

3. U-SVM with self-constructed Universum

In this section, we propose the U-SVM with self-constructed Universum, termed Us-SVM, to solve the semi-supervised problem. First, the Universum is constructed from the original training set; then U-SVM is trained on the constructed Universum and the labeled examples only, in order to classify the unlabeled data and select the most confident unlabeled points, which are added into the training set; then U-SVM is re-trained with the new data and the process is repeated.

3.1. Self-constructed Universum

The experimental results for U-SVM [27] have shown that the obtained performance depends on the quality of the Universum, so choosing an appropriate Universum is the first and most important step of Us-SVM. For the supervised problem, a possible and efficient way to construct the Universum is to create artificial inputs by first selecting random positive and negative points from the training set and then taking the mean of these two points. However, only a few labeled positive and negative examples are available here, so this may be unsuitable for the semi-supervised problem. Chen and Zhang [36] proposed a method to select an informative Universum, termed the in-between Universum (IBU), instead of out-of-sample points; however, they constructed the Universum considering only the labeled dataset, which runs into the same problem when the labeled dataset is too small. In contrast, we construct the Universum set U from the whole training set, i.e., both the labeled and the unlabeled points, and, more importantly, we want the Universum to lie near the center of the training set so that it can drag the boundary to traverse low-density regions, which coincides with the ideology of low-density separation. We therefore propose the following four schemes. Schemes 1, 2, and 3 construct the Universum once for the whole training procedure, while Scheme 4 constructs the Universum dynamically as the set of unlabeled points changes; Scheme 3 simply uses part of the training set itself as the Universum, whereas Schemes 1, 2, and 4 use the whole or part of the training set to generate new Universum points (a small construction sketch is given after the list):

- Umean(all): All of the examples contribute to U. First, we compute the mean of all the examples as the center; then we take the midpoint of each example and the center to constitute U.
- Umean(nearest/farthest): Only the examples located near to (far from) the center are used to compose U. Similarly, we compute the mean of all the points as the center, select the N examples located nearest to (farthest from) the center, and finally take the midpoint of each selected example and the center to constitute U.
- Umean(self): Instead of generating new examples, we use the original examples as U. First, we compute the mean of all the points as the center; then we choose the N examples located nearest to the center to constitute U.
- Umean(vary): Different from the above three cases, we construct a varying U in each U-SVM training step. First, we compute the mean of all the unlabeled examples as the center and take the midpoint of each example and the center to constitute the initial U; after each U-SVM training step a part of the unlabeled data becomes labeled, and U varies accordingly.
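The following NumPy sketch (ours; the paper describes the constructions only in words) implements the three static schemes. Umean(vary) would simply re-apply umean_all to the currently unlabeled points after each U-SVM training step.

```python
import numpy as np

def umean_all(X):
    """Umean(all): midpoints between every example and the overall mean."""
    center = X.mean(axis=0)
    return (X + center) / 2.0

def umean_nearest(X, N, farthest=False):
    """Umean(nearest/farthest): midpoints for the N examples closest to
    (or farthest from) the mean of all examples."""
    center = X.mean(axis=0)
    dist = np.linalg.norm(X - center, axis=1)
    order = np.argsort(dist)
    idx = order[-N:] if farthest else order[:N]
    return (X[idx] + center) / 2.0

def umean_self(X, N):
    """Umean(self): the N original examples nearest to the mean, used as-is."""
    center = X.mean(axis=0)
    dist = np.linalg.norm(X - center, axis=1)
    return X[np.argsort(dist)[:N]]
```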

3.2. Us-SVM

Based on the self-constructed U, Us-SVM consists of two main steps. Step 1: train U-SVM on the labeled data and the Universum data U to classify the unlabeled data and U iteratively. Step 2: identify a set of reliable positive and negative points from the predicted unlabeled dataset, and add these new positive and negative samples to the labeled dataset. We regard as reliable those predicted unlabeled samples located far from the boundary, or far from the mean of the whole training set, or far from the mean of the predicted Universum. These two steps together can be seen as an iterative method that increases the number of classified unlabeled examples. The algorithm is constructed as follows.

Algorithm 1. Us-SVM for the semi-supervised problem.

Input: the labeled positive set PT; the labeled negative set NT; the unlabeled set UT; the Universum set U; the iteration index i and the maximum number of iterations N; the number of labeled examples in each step M; the parameters Ct, Cu; the kernel function K; the number m of positive (negative) examples labeled from the unlabeled dataset in each step.
Output: the decision function f(x) = sgn(g(x)).
for each i ∈ [1, N] do
  1. Construct U; train U-SVM with parameters Ct, Cu and kernel function K on the dataset Pt ∪ Nt ∪ U, and obtain the decision function f(x);
  2. Apply the decision function to split the unlabeled set Ut into Ut+ ∪ Ut− and the Universum set U into U+ ∪ U−;
  3. Calculate the mean of U+ as p-mean and the mean of U− as n-mean;
  4. Choose the m examples in Ut+ that are farthest from p-mean as PU, and the m examples in Ut− that are farthest from n-mean as NU;
  5. Set Pt = {Pt, PU}, Nt = {Nt, NU}, and i = i + 1;
end for
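The loop below is a compact sketch of Algorithm 1 in the linear case, reusing the usvm_fit / usvm_decision helpers sketched in Section 2.3. Measuring "farthest from p-mean / n-mean" on the decision values g(x) rather than in input space, and the handling of small candidate sets, are our own assumptions.

```python
import numpy as np

def us_svm(P_t, N_t, U_t, U, n_rounds=9, m=10, **usvm_kwargs):
    """Iterative Us-SVM (Algorithm 1), linear case.
    P_t, N_t: labeled positive/negative arrays; U_t: unlabeled array;
    U: self-constructed Universum array.
    Requires usvm_fit / usvm_decision from the Section 2.3 sketch."""
    for _ in range(n_rounds):
        X = np.vstack([P_t, N_t])
        y = np.hstack([np.ones(len(P_t)), -np.ones(len(N_t))])
        w, b = usvm_fit(X, y, U, **usvm_kwargs)          # Step 1: train U-SVM
        scores_u = usvm_decision(U_t, w, b)              # classify the unlabeled set
        scores_univ = usvm_decision(U, w, b)             # classify the Universum set
        pos_univ = scores_univ[scores_univ >= 0]
        neg_univ = scores_univ[scores_univ < 0]
        p_mean = pos_univ.mean() if len(pos_univ) else 0.0
        n_mean = neg_univ.mean() if len(neg_univ) else 0.0
        # Step 2: pick the m most reliable unlabeled points on each side
        pos_idx = np.where(scores_u >= 0)[0]
        neg_idx = np.where(scores_u < 0)[0]
        pos_pick = pos_idx[np.argsort(scores_u[pos_idx] - p_mean)[-m:]]
        neg_pick = neg_idx[np.argsort(n_mean - scores_u[neg_idx])[-m:]]
        P_t = np.vstack([P_t, U_t[pos_pick]])
        N_t = np.vstack([N_t, U_t[neg_pick]])
        U_t = np.delete(U_t, np.concatenate([pos_pick, neg_pick]), axis=0)
    return w, b
```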

Fig. 3. Two artificial datasets: class A (marked by red "∗") and class B (marked by blue "+"). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

5

5

5

5

0

0

0

0

−5

−5

−5

−5

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

0

5

10

15

20

5

5

5

5

0

0

0

0

−5

−5

−5

−5

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

5

5

5

5

0

0

0

0

−5

−5

−5

−5

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

5

5

5

5

0

0

0

0

−5

−5

−5

−5

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

0

5

10

15

20

−10

0

5

10

15

20

0

5

10

15

20

0

5

10

15

20

0

5

10

15

20

Fig. 4. (a)–(d) Umean ðallÞ, (e)–(h) Umean ðnearestÞ, (i)–(l) Umean ðself Þ, (m)–(p) Umean ðvaryÞ. Positive points (marked by “n”), negative points (marked by “n”), unlabeled points (marked by “.”), Universum points (marked by “⋆”), the ideal decision boundary (real dotted line), the original decision boundary of SVM (blue dotted line), the decision boundary of Us -SVM (black solid line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


We can now give the complexity of Algorithm 1. Us-SVM can be solved efficiently by a sequential minimal optimization (SMO) type technique [37], and [38] proved that for a convex quadratic programming problem (QPP) such as the one in Us-SVM, an SMO-type decomposition method [39] implemented in LIBSVM has complexity

$$M \times O(l + 2u) \qquad (12)$$

if most columns of the kernel matrix are cached throughout the iterations ([38] also pointed out that there is no theoretical result yet on LIBSVM's number of iterations; empirically, the number of iterations M may grow more than linearly with the number of training data). Therefore the complexity of Algorithm 1 is

$$N \times (M \times O(l + 2u)), \qquad (13)$$

where N is the maximum number of iterations. For the convergence of Algorithm 1, we refer to [23], which proves the convergence of the proposed "self-training semi-supervised SVM"; the only difference between our algorithm and that method is that we apply U-SVM instead of the standard SVM, so the convergence of Algorithm 1 is easy to prove in the same way.

Fig. 5. (a)–(d) Umean(all), (e)–(h) Umean(nearest), (i)–(l) Umean(self), (m)–(p) Umean(vary). Positive points (marked by "+"), negative points (marked by "∗"), unlabeled points (marked by "."), Universum points (marked by "⋆"), the ideal decision boundary (red dotted line), the original decision boundary of SVM (blue dotted line), the decision boundary of Us-SVM (black solid line). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 6. (a) Labeled dataset, (b) unlabeled dataset (x-axis: number of labeled/unlabeled examples; y-axis: accuracy). The test accuracy and standard deviations of TSVM, LapSVM, and Us-SVM on the MNIST dataset for the case of the RBF kernel.

3.3. Illustration of Algorithm 1

We now discuss the method geometrically using Fig. 1. First, we construct the decision function (black solid line) by U-SVM using 2 positive points (PT, marked by blue "+"), 2 negative points (NT, marked by red "∗"), and the Universum set (U, marked by "⋆"). It clearly performs better than the decision function (blue dotted line) constructed by the standard SVM using only Pt ∪ Nt (Fig. 1(a)). Then we choose m = 10 samples (marked by "□") that are farthest from the means of U+ and U− (marked by "⋆") and add them to the labeled dataset. Next, the classifier is retrained with the new data and the process is repeated; the result is shown in Fig. 1(b). Eventually (Fig. 1(c)), we hope that the final decision hyperplane lies as close as possible to the ideal decision hyperplane (the red dotted line).

It is easy to extend the linear case to the nonlinear case: we only need to apply nonlinear U-SVM instead of linear U-SVM by introducing the kernel function K(x, x') = (Φ(x) · Φ(x')) and the corresponding transformation x = Φ(x), where Φ(x) ∈ H and H is the Hilbert space.
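As a minimal example of such a kernel (the Gaussian width gamma below is an arbitrary choice, not a value taken from the paper), the RBF kernel matrix between two sets of points can be computed as:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows in X1, X2."""
    sq1 = np.sum(X1 ** 2, axis=1)[:, None]
    sq2 = np.sum(X2 ** 2, axis=1)[None, :]
    return np.exp(-gamma * (sq1 + sq2 - 2.0 * X1 @ X2.T))
```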

4. Experimental results

In this section, we show the effectiveness of the proposed method on artificial and real datasets. To demonstrate the capabilities of our algorithm, we compare it with LapSVM [6] and TSVM [2]. All methods are implemented in MATLAB 2013a on a PC with an Intel Core i5 processor and 2 GB RAM. The quadprog function in MATLAB was used to solve the QPPs. The testing accuracies of all experiments are computed using 10-fold cross validation [40].

4.1. Artificial datasets

To illustrate the performance of our method in both the linear and the nonlinear case, we create one linearly separable dataset and two nonlinearly separable datasets.

4.1.1. Linear case

For the linear case, 100 original positive and 100 original negative points are generated randomly from two normal distributions with means ν1 = −10 and ν2 = 10 and standard deviations σ1 = σ2 = 5. To construct the semi-supervised problem, we randomly select 10 original positive and 10 original negative points as new positive and negative data, and treat the rest as the unlabeled dataset.

First, we compare the performance of different Universum datasets. With Cu = 100, Ct = 1, ε = 0.001, there are only 2 labeled positive examples and 2 labeled negative examples; the remaining 196 points are unlabeled. In each step we choose m = 10 examples located farthest from the Universum examples as new positive and negative examples, and the number of iterations is 9. In Fig. 2, we show only the results of the 1st, 3rd, 5th, and 7th iterations; the decision boundary tends toward the ideal boundary in all cases, and there is no large difference between the different Universum datasets used.

4.1.2. Nonlinear case

For the nonlinear case, we create two artificial datasets (see Fig. 3). As in the linear case, 2 original positive examples and 2 original negative examples are chosen as labeled positive and negative data, and the remaining 196 points are used as unlabeled data. The RBF kernel and Cu = 100, Ct = 1, ε = 0.001 are used. Fig. 4(a)–(d) shows that the result is better when we choose the Umean(all) Universum, since Umean(all) can reflect the distribution of the original dataset; Umean(nearest/farthest), Umean(self) and Umean(vary) are not suitable for this case.

Fig. 7. Samples from the ABCDETC dataset.
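As a hedged sketch of the linear toy setup of Section 4.1.1 (the paper does not state the dimensionality or a random seed; 2-D Gaussians and a fixed seed are assumed here), the data and the labeled/unlabeled split could be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 100

# two normal distributions with means -10 and 10 and standard deviation 5
pos = rng.normal(loc=10.0, scale=5.0, size=(n_per_class, 2))
neg = rng.normal(loc=-10.0, scale=5.0, size=(n_per_class, 2))

# keep a handful of points labeled and treat the rest as unlabeled
n_labeled = 2
P_t, N_t = pos[:n_labeled], neg[:n_labeled]
U_t = np.vstack([pos[n_labeled:], neg[n_labeled:]])   # 196 unlabeled points
```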


Fig. 8. The test accuracy of TSVM, LapSVM, and Us-SVM on the ABCDETC dataset (x-axis: number of unlabeled examples; y-axis: accuracy). (a) The result of ',' and '.'. (b) The result of '!' and '?'. (c) The result of ':' and ';'. (d) The result of 'a' and 'b'.

The same conclusion can be drawn from Fig. 5. Fig. 5(e)–(h) shows that the result is worse when we choose Umean(nearest/farthest), because the number of Universum examples is too small to reflect the distributions of the positive and negative datasets. Umean(vary) performs no better than Umean(all) and Umean(self).

4.2. Handwritten symbol recognition

4.2.1. MNIST dataset

In this section, the MNIST dataset with samples of the digits '0'–'9' is used to show the performance of our method. This dataset contains 10,000 images of 28 × 28 pixels, with 1000 for each category and 10 categories in total. We use '5' and '8' to form a binary classification problem (the numbers of '5' and '8' are the same). First, we constructed the semi-supervised problem: the number of labeled examples is chosen from {300, 600, 1200, 1800}, another 420 '5's and '8's are selected as the unlabeled data, and the number of test data is 1500. Then we selected 500 '5's and '8's for the training set and another 400, 600, 1500, 2400 '5's and '8's as unlabeled data. We construct the Universum dataset using Umean(all). Fig. 6 shows the results for the RBF kernel; our method performs better on most of the datasets, and the larger the dataset, the better the result.

4.2.2. The ABCDETC dataset

In this section, the ABCDETC dataset [27], which contains 19,646 images in 78 classes, is used to further test our method. The categories include lowercase letters ('a'–'z'), uppercase letters ('A'–'Z'), digits ('0'–'9'), and other symbols (comma, period, colon, semicolon, '!', '?', '+', '-', '=', '/', '$', '%', parentheses, and quotation marks) (Fig. 7). Subjects wrote five versions of each symbol in pen on a single gridded sheet. We performed experiments on predicting whether a sample is a ',' or a '.', a '!' or a '?', a ':' or a ';', and a lowercase 'a' or 'b'. All labeled data sizes are set to 50, and the unlabeled data sizes are set to 20, 40, 60, 80, and 100, respectively. We construct the Universum dataset using Umean(all). All results, for the RBF kernel, are shown in Fig. 8; our method performs better on most of the datasets, and the larger the dataset, the better the result.

4.3. UCI datasets

In this section, we evaluate the methods on UCI datasets [41]. All data are scaled so that the features lie in [0, 1] before training. In order to create the semi-supervised problem, we randomly select the same number of samples from the different classes and verify the above methods on the resulting sets.

Table 1. The average tenfold cross-validation results on UCI datasets in terms of accuracy (%). The best accuracy for each dataset is underlined in the original.

Datasets        Us-SVM         LapSVM         TSVM
Australian      68.14 ± 4.15   66.89 ± 4.87   65.49 ± 3.98
BUPA liver      67.65 ± 4.22   67.61 ± 5.63   65.72 ± 4.28
CMC             66.10 ± 3.43   63.21 ± 1.98   61.71 ± 3.57
Heart-Statlog   73.75 ± 5.77   74.31 ± 6.19   72.69 ± 5.49
Hepatitis       79.12 ± 4.42   77.26 ± 6.24   76.54 ± 5.28
Ionosphere      74.39 ± 3.61   73.88 ± 2.29   69.44 ± 3.87
Pima Indian     72.65 ± 4.31   75.61 ± 4.21   63.24 ± 4.12
Sonar           69.54 ± 4.12   67.24 ± 4.02   66.21 ± 4.13
Votes           70.56 ± 4.35   70.23 ± 3.14   69.54 ± 4.08
Credit          75.39 ± 3.53   77.68 ± 2.42   72.36 ± 3.20
Image           84.24 ± 3.51   82.81 ± 2.17   79.32 ± 3.51
Spect           73.60 ± 3.89   71.48 ± 4.41   70.51 ± 5.50
WPBC            75.24 ± 3.56   73.21 ± 3.14   72.14 ± 3.35
Diabetes        61.41 ± 3.81   62.99 ± 2.78   59.66 ± 2.84

We construct the Universum dataset using Umean(all). The RBF kernel and Cu = 100, Ct = 1, ε = 0.001 are used. The results of the other methods are reported in [42]. From Table 1 and Fig. 9, we can see that our method performs better on most of the datasets.
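A sketch of this evaluation protocol (using scikit-learn only for the [0, 1] scaling and the fold split; `train_and_predict` is a placeholder for any classifier, e.g. a wrapper around the Us-SVM routine sketched earlier):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold

def ten_fold_accuracy(X, y, train_and_predict):
    """Scale features to [0, 1], then report mean/std accuracy over 10 folds.
    train_and_predict(X_tr, y_tr, X_te) is any routine returning +/-1 labels."""
    X = MinMaxScaler().fit_transform(X)
    accs = []
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        y_pred = train_and_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(y_pred == y[te]))
    return np.mean(accs), np.std(accs)
```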

5. Conclusion

In this paper, we proposed a new method, Us-SVM, to deal with the semi-supervised classification problem. Based on solving U-SVM iteratively, the method seeks more reliable positive and negative examples from the unlabeled dataset step by step. Since different Universum data result in different performance, several effective approaches for constructing Universum datasets are explored. A variety of experimental results demonstrate that an appropriately constructed Universum improves the accuracy and reduces the number of iterations. Extending this idea to the multi-class classification problem remains under consideration: although it can easily be handled with the "decomposition-reconstruction" strategy by solving a series of binary problems, solving a single optimization problem based on an "all-together" strategy seems more challenging and interesting. Furthermore, developing efficient solvers, such as sequential minimal optimization (SMO) or dual coordinate descent (DCD) methods, for large-scale problems is also a future direction.


Fig. 9. The average tenfold cross-validation accuracy and variations on UCI datasets for Us-SVM, LapSVM and TSVM.

Acknowledgment

This work has been partially supported by grants from the National Natural Science Foundation of China (Nos. 61472390, 11271361, 71331005), the Major International (Regional) Joint Research Project (No. 71110107026), the "New Start" Academic Research Project of Beijing Union University (No. ZK10201409), and the Ministry of Water Resources special funds for scientific research on public causes (No. 201301094).

References

[1] L. Wang, K.L. Chan, Z.H. Zhang, Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2003, pp. 629–634.
[2] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, Morgan Kaufmann Publishers Inc., 1999, pp. 200–209.
[3] G. Tur, D. Hakkani-Tür, R.E. Schapire, Combining active and semi-supervised learning for spoken language understanding, Speech Commun. 45 (2) (2005) 171–186.
[4] O. Chapelle, B. Schölkopf, A. Zien, Semi-supervised Learning, MIT Press, 2006.
[5] Z.Q. Qi, Y.J. Tian, Y. Shi, Laplacian twin support vector machine for semi-supervised classification, Neural Netw. 35 (2012) 46–53.
[6] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[7] W. Liu, J. Wang, S.F. Chang, Robust and scalable graph-based semi-supervised learning, Proc. IEEE 100 (9) (2012) 2624–2638.
[8] Z.H. Zhou, M. Li, Semi-supervised learning by disagreement, Knowl. Inf. Syst. 24 (3) (2010) 415–439.
[9] X.J. Zhu, Z.B. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: ICML, 2003, pp. 912–919.
[10] D.J. Miller, H.S. Uyar, A mixture of experts classifier with learning based on both labelled and unlabelled data, in: Proceedings of Advances in Neural Information Processing Systems 9, 1997, pp. 571–577.
[11] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. (1977) 1–38.
[12] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Mach. Learn. 39 (2) (2000) 103–134.
[13] S. Baluja, Probabilistic modeling for face orientation discrimination: learning from labeled and unlabeled data, in: Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, 1998, pp. 854–860.
[14] M. Belkin, I. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: 17th Annual Conference on Learning Theory, Springer, Berlin, Heidelberg, 2004, pp. 624–634.
[15] D.Y. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 321–328.
[16] X.J. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: ICML, 2003, pp. 912–919.
[17] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, 1998, pp. 92–100.

[18] Z.H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Trans. Knowl. Data Eng. 17 (11) (2005) 1529–1541.
[19] O. Chapelle, M.M. Chi, A. Zien, A continuation method for semi-supervised SVMs, in: ICML '06: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 185–192.
[20] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57–64.
[21] H.J. Scudder, Probability of error of some adaptive pattern-recognition machines, IEEE Trans. Inf. Theory 11 (3) (1965) 363–371.
[22] S. Fralick, Learning to recognize patterns without a teacher, IEEE Trans. Inf. Theory 13 (1) (1967) 57–64.
[23] Y. Li, C.T. Guan, H. Li, Z.Y. Chin, A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system, Pattern Recognit. Lett. 29 (2008) 1285–1294.
[24] D. Yarowsky, Unsupervised word sense disambiguation rivaling supervised methods, in: The 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189–196.
[25] E. Riloff, J. Wiebe, T. Wilson, Learning subjective nouns using extraction pattern bootstrapping, in: Proceedings of the Seventh Conference on Natural Language Learning, vol. 4, Association for Computational Linguistics, 2003, pp. 25–32.
[26] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Seventh IEEE Workshop on Applications of Computer Vision, 2005, pp. 29–36.
[27] J. Weston, R. Collobert, F. Sinz, L. Bottou, V. Vapnik, Inference with the universum, in: The 23rd International Conference on Machine Learning, 2006, pp. 1009–1016.
[28] F.H. Sinz, O. Chapelle, A. Agarwal, B. Schölkopf, An analysis of inference with the universum, in: Neural Information Processing Systems 20, 2007.
[29] D. Zhang, J.D. Wang, F. Wang, C.S. Zhang, Semi-supervised classification with universum, in: SIAM International Conference on Data Mining, 2008.
[30] Z.Q. Qi, Y.J. Tian, Y. Shi, Twin support vector machine with universum data, Neural Netw. 36 (2012) 112–119.
[31] Z.Q. Qi, Y.J. Tian, Y. Shi, A nonparallel support vector machine for a classification problem with universum learning, Comput. Appl. Math. 263 (2014) 288–298.
[32] V. Cherkassky, S. Dhar, W. Dai, Practical conditions for effectiveness of the universum learning, IEEE Trans. Neural Netw. 22 (8) (2011) 1241–1255.
[33] C. Shen, P. Wang, F. Shen, H. Wang, UBoost: boosting with the universum, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2011) 825–832.
[34] V. Sindhwani, S.S. Keerthi, O. Chapelle, Deterministic annealing for semi-supervised kernel machines, in: ICML '06: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 841–848.
[35] R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, J. Mach. Learn. Res. 7 (2006) 1687–1712.
[36] S. Chen, C.S. Zhang, Selecting informative universum sample for semi-supervised learning, in: International Joint Conference on Artificial Intelligence, 2009, pp. 1016–1021.
[37] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge, MA.
[38] C. Chang, C. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 1–27.
[39] R. Fan, P. Chen, C. Lin, Working set selection using second order information for training SVM, J. Mach. Learn. Res. 6 (2005) 1889–1918.
[40] N.Y. Deng, Y.J. Tian, C.H. Zhang, Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions, CRC Press, 2012.
[41] C.L. Blake, C.J. Merz, UCI repository of machine learning databases [online], Dept. Inf. Comput. Sci., Univ. California, Irvine. Available: 〈http://www.ics.uci.edu/~mlearn/MLRepository.html〉.
[42] Z.Q. Qi, Y.J. Tian, Y. Shi, Laplacian twin support vector machine for semi-supervised classification, Neural Netw. 35 (2012) 46–53.


Yingjie Tian received the first degree in mathematics in 1994, the master's degree in applied mathematics in 1997, and the Ph.D. degree in management science and engineering. He is currently a Professor with the Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, China. He has published four books about support vector machines (SVMs), one of which has been cited over 1400 times. His current research interests include SVMs, optimization theory and applications, data mining, intelligent knowledge management, and risk management.

Ying Zhang is currently a student with the Chinese Academy of Sciences, Beijing, China. Her current research interests include support vector machines and data mining.

Dalian Liu is currently an Associate Professor with the Beijing Union University, and also a Ph.D. student with Beijing Jiaotong University. Her current research interests include optimization and data mining.