Semi-supervised learning combining transductive support vector machine with active learning

Xibin Wang a, Junhao Wen a,b,*, Shafiq Alam c, Zhuo Jiang a, Yingbo Wu b

a College of Computer Science, Chongqing University, Chongqing, China
b School of Software Engineering, Chongqing University, Chongqing, China
c Department of Computer Science, University of Auckland, Auckland, New Zealand

* Corresponding author at: College of Computer Science, Chongqing University, Chongqing, China. E-mail address: [email protected] (J. Wen).
Article history: Received 15 April 2015; Received in revised form 15 July 2015; Accepted 30 August 2015. Communicated by Liang Lin.

Abstract
In typical data mining applications, labeling large amounts of data is difficult, expensive, and time consuming if done manually. To avoid manual labeling, semi-supervised learning uses unlabeled data along with labeled data in the training process. The transductive support vector machine (TSVM) is one such semi-supervised method, which has been found effective in enhancing classification performance. However, TSVM has some deficiencies, such as the need to preset the number of positive class samples, the frequent exchange of class labels, and its requirement for a large amount of unlabeled data. To tackle these deficiencies, in this paper we propose a new semi-supervised learning algorithm that combines TSVM with active learning. The algorithm applies active learning to select the most informative instances, based on the version space minimum–maximum division principle, for human annotation in order to improve classification performance. Simultaneously, in order to make full use of the distribution characteristics of the unlabeled data, we add a manifold regularization term to the objective function. Experiments performed on several UCI datasets and a real-world book review case study demonstrate that our proposed method achieves significant improvement over other benchmark methods while consuming less human effort, which is very important when labeling data manually.
© 2015 Elsevier B.V. All rights reserved.
Keywords: Transductive support vector machine; Active learning; Version space minimum–maximum division principle; Graph-based method
1. Introduction

Support vector machine (SVM) is a supervised machine learning approach for solving binary pattern recognition problems. It adopts the maximum margin principle to find the decision surface that separates the positive and negative labeled training examples of a class [1]. For a given data point, the regular SVM yields a distance to the separating hyper-plane ranging from 0 to 1, where the value 0 indicates that the data point lies on the hyper-plane and the value 1 means that the data point is a support vector. Although SVM has been successfully used in various fields [2–6], in many real-world applications there is not enough labeled data to train a good classification model. Compared to the standard SVM, which uses only labeled training data, many semi-supervised SVMs employ unlabeled data along with some labeled data to train classifiers with improved generalization and performance. Semi-supervised SVM has received considerable attention for two reasons. Firstly, labeling a large number of examples is time-consuming and labor-intensive; this task also has to be carried out by qualified experts and is thus expensive. Secondly, some studies show that using unlabeled data for learning can improve the accuracy of classifiers [7,8].

The transductive support vector machine (TSVM) [9] is an efficient method for improving the generalization accuracy of SVM by finding a labeling for the unlabeled data, so that a linear boundary has the maximum margin on both the original labeled data and the newly labeled unlabeled data [10]. The notable characteristic of TSVM, being transductive, is that it aims at learning problems that are only interested in a particular dataset of testing or working (training) data [9,11], while traditional inductive learning estimates a classifier from training data that generalizes to any input examples. The main idea of transductive learning is to build models for the best prediction performance on a particular testing dataset instead of developing generalized models to be applied to any testing dataset [12]. In other words, by explicitly including the working dataset consisting of unlabeled examples in the problem formulation, better generalization can be achieved on problems with insufficient labeled data points [13]. One of the most common problems is that the machine may incorrectly label the training dataset, which will lead to classification errors. The solution to this problem lies in active learning.
Active learning (AL) is a technique for selecting a small subset of the unlabeled data such that labeling this subset maximizes the learning accuracy. The selected subset is manually labeled by experts. In this way, AL can complement the TSVM by reducing labeling errors [14]. In this paper, we explore combining the TSVM with AL to improve the performance of the classification task. The major contributions of our work are:

1. In the learning process, TSVM exploits a large amount of unlabeled data whose geometrical distribution characteristics are implicitly included. To capture the geometrical structure of the data, we define L0 as the graph Laplacian. In this way, the structure of the data manifold can be explored by adding a regularization term that penalizes any "abrupt changes" of the function values evaluated on neighboring samples in the Laplacian graph.

2. Active learning uses a query framework in which an active learner queries instances for labeling. Our algorithm defines the version space minimum–maximum division principle as the selection criterion to achieve the best labeling results. The selected most informative instance is considered the most likely support vector, which reduces the learning cost by discarding non-support vectors. At the same time, it enables a batch-sampling mode, which improves training efficiency. Overall, this method achieves a more significant improvement while consuming the same amount of human effort, and produces desirable results with considerably fewer labeled data.

The rest of this paper is organized as follows. Section 2 describes the concepts of TSVM, active learning, and graph-based methods. Section 3 reviews some of the issues in TSVM and active learning. Section 4 introduces the proposed algorithm. Section 5 reports experimental results on several UCI datasets and the book review dataset, and further analyzes the underlying reasons for choosing the algorithm. Finally, Section 6 concludes and highlights future work.
2. Background and related work

2.1. Transductive SVM

Transductive SVM (TSVM) is a semi-supervised large-margin classification method based on the low density separation assumption. Similar to the traditional SVM, TSVM searches for a hyper-plane with the largest margin to separate the classes, while simultaneously taking into account labeled and unlabeled examples. Detailed descriptions and proofs of the concepts can be found in [9]. Consider a group of independent and identically distributed labeled examples

$$(x_1, y_1), \ldots, (x_l, y_l) \in \mathbb{R}^n \times \mathbb{R}, \quad y_i \in \{-1, +1\}, \quad i = 1, \ldots, l \qquad (1)$$

and $u$ unlabeled examples

$$x_{l+1}, \ldots, x_{l+u} \qquad (2)$$

In general, the learning process of TSVM can be formulated as the following optimization problem:

$$\min_{y_{l+1}, \ldots, y_{l+u},\, w,\, b,\, \xi_1, \ldots, \xi_{l+u}} \;\; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{j=l+1}^{l+u} \xi_j$$

subject to

$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l,$$
$$y_j (w \cdot x_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad j = l+1, \ldots, l+u \qquad (3)$$
where C1 and C2 are user-specified parameters used to penalize misclassified samples. C2 is called the "effect factor" of the unlabeled examples in the training process, and $C_2 \xi_j$ is called the "effect term" of the j-th unlabeled example in the objective function. Training the TSVM amounts to solving the above optimization problem. The TSVM training algorithm can be described as follows:

Step 1: Specify the parameters C1 and C2, conduct inductive learning on the labeled samples, and obtain an initial classifier. Specify N, an estimate of the number of positive examples among the unlabeled examples.

Step 2: Compute the values of the decision function for all unlabeled examples using the initial classifier. The examples with the N largest decision function values are labeled as positive, and the remaining examples are labeled as negative. Set a temporary effect factor C_temp.

Step 3: Retrain the SVM on all the examples. For the newly obtained classifier, switch the labels of one pair of differently labeled unlabeled examples according to a certain rule so as to lower the value of the objective function in (3) as much as possible. Repeat this step until no pair of examples meets the switching condition.

Step 4: Uniformly increase the value of C_temp and return to Step 3. Once $C_{temp} \ge C_2$, the algorithm terminates and the labels of all unlabeled examples are returned.

2.2. Active learning

Many real-world tasks involve classification, and the number of unlabeled data points in the training dataset is often very large, so it is time consuming and costly to label these unlabeled data in order to train the learning algorithm to a reasonable level of performance. Active learning (AL), a sub-area of machine learning, attempts to overcome this labeling bottleneck by asking "queries" in the form of unlabeled examples to be labeled by an "oracle" (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled examples as possible, thereby minimizing the cost of obtaining labeled data. AL is well suited to many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain [15]. The key question in AL is how to choose the next unlabeled example that is the most informative, i.e., typically the most uncertain one, so the main aspect of AL is to measure the classification uncertainty of unlabeled examples. Zhu et al. [16] proposed a new semi-supervised learning framework that allows the combination of AL with Gaussian fields and harmonic functions. Hassanzadeh et al. [17] proposed a combined semi-supervised and AL approach for sequence labeling which reduces manual annotation cost such that only highly uncertain tokens need to be manually labeled, while other sequences and subsequences are labeled automatically. Elhamifar et al. [18] proposed an efficient AL framework based on convex programming. They use the two principles of classifier uncertainty and example diversity to guide the optimization program towards selecting the most informative unlabeled examples, which have the least information overlap. Their results show the effectiveness of the framework in person detection, scene categorization, and face recognition.

2.3. Graph-based methods

Graph-based methods are popular semi-supervised learning methods which assume that similar data points should have the same class labels. Such a method first creates a fully connected graph whose vertices are all the labeled and unlabeled data points. The edge between any two examples i, j has a weight $W_{i,j}$, which represents the similarity of the pair of samples.
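As an illustration of this graph construction, the following Python sketch builds a symmetric similarity matrix W with Gaussian (RBF) weights restricted to a k-nearest-neighbour graph. The bandwidth `sigma` and neighbourhood size `k` are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def similarity_matrix(X, k=10, sigma=1.0):
    """Build a symmetric weight matrix W over all (labeled + unlabeled) points.

    W[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) if j is among the k nearest
    neighbours of i (or vice versa), and 0 otherwise.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)              # guard against negative round-off

    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                 # no self-loops

    # Keep only the k strongest connections per row, then symmetrise.
    keep = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, keep.ravel()] = True
    return np.where(mask | mask.T, W, 0.0)
```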
There are many graph-based methods, which differ mainly in the choice of regularization terms and loss functions. Zhu et al. [19] adopt Gaussian fields and harmonic functions for semi-supervised learning; they introduce a quadratic loss function with infinite weight for the labeled data and incorporate the unlabeled data through a regularization term based on the graph combinatorial Laplacian. The local and global consistency method, with the normalized Laplacian as the regularization term and a quadratic loss function, was proposed in [20]. There are also newer graph-based learning methods, such as discriminatively trained and-or graph models for object shape detection [21] and dynamical and-or graph learning for object shape modeling and detection [22]. The geometry of the probability distribution that generates the data is exploited and incorporated as an additional regularization term in the generative manifold regularization framework [23]. Generally speaking, the regularization framework can be described as an optimization problem with two regularization terms and a loss function, as follows:

$$\arg\min_{f \in H_K} \; \sum_{i=1}^{l} V(x_i, y_i, f) + r_H \|f\|_H^2 + r_M \|f\|_M^2 \qquad (4)$$

where the first term represents a loss function on the labeled data, e.g., the hinge loss in SVM, which enforces a large margin between the distributions of the two classes. The second term prefers the decision function to be a simple classifier, and $r_H$ is the weight of $\|f\|_H^2$, controlling the complexity of $f$ in the reproducing kernel Hilbert space $H_K$. The third term enforces that similar examples have similar outputs, based on the similarity weight matrix $W$ of all training examples. The parameter $r_M$ is the weight of $\|f\|_M^2$ and controls the complexity of the function in the intrinsic geometry of the marginal distribution; $\|f\|_M^2$ penalizes $f$ along the Riemannian manifold $M$. The manifold regularization term can be defined as

$$\|f\|_M^2 = \frac{1}{(l+u)^2} \sum_{i,j=1}^{l+u} W_{i,j}\big(f(x_i) - f(x_j)\big)^2 = f^T L_0 f \qquad (5)$$

where $f = \big(f(x_1), \ldots, f(x_{l+u})\big)^T$ is the vector of evaluations on the labeled and unlabeled data, and $f$ is defined as $f(x) = w \cdot \phi(x) + b = \sum_{i=1}^{l+u} \alpha_i K(x_i, x) + b$, where $\phi$ is a nonlinear mapping from a low dimensional space to a higher dimensional Hilbert space $H$ and $K$ is a kernel function. $L_0$ is the graph Laplacian, which can be expressed as $L_0 = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$, and $W_{ij}$ are the edge weights in the data adjacency graph, with $W_{ij} = 1$ if vertices $i$ and $j$ are connected by an edge and $W_{ij} = 0$ otherwise.
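Given a weight matrix W such as the one produced by the hypothetical `similarity_matrix` sketch above, the following sketch forms the unnormalized graph Laplacian L0 = D − W and evaluates the smoothness term f^T L0 f used in Eq. (5); it illustrates the definitions rather than the authors' implementation.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L0 = D - W, with D_ii = sum_j W_ij."""
    D = np.diag(W.sum(axis=1))
    return D - W

def manifold_penalty(f_values, W):
    """Evaluate f^T L0 f, the graph-smoothness term of Eq. (5).

    Note that sum_ij W_ij (f_i - f_j)^2 equals 2 * f^T L0 f; the 1/(l+u)^2
    scaling in Eq. (5) can be absorbed into the regularization weight.
    """
    L0 = graph_laplacian(W)
    f = np.asarray(f_values, dtype=float)
    return float(f @ L0 @ f)
```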
3. Active learning and issues in transductive SVM

3.1. Characteristics of active learning

Active learning (AL) is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain. It is an interactive learning technique designed to reduce the labor cost of labeling, in which the learning algorithm can actively choose which unlabeled examples are added to the training set. The main idea is to select the most informative examples and ask the expert for their labels in successive learning rounds. The strategy of AL is to select, with human involvement, the most useful set of unlabeled examples that minimizes the expected risk of the next round. In this way, it can greatly improve the performance of the learning model and also accelerate convergence.
From Section 2.1, it is evident that training the TSVM uses unlabeled training examples, but it does not follow that using more examples for training will certainly increase the learning performance. In many practical applications, the unlabeled examples used by semi-supervised learning methods may come from different environments, where the distribution of examples is complex, unknown, and most likely contains noise. AL, by its nature, takes advantage of the existing knowledge and actively selects the examples most likely to solve the problem. It effectively reduces the number of examples that need to be assessed, and can therefore be used by TSVM to improve the selection of unlabeled examples. This results in selecting the examples most favorable to the TSVM classification model and hence improves TSVM's performance.

3.2. Issues in transductive SVM

TSVM can achieve better performance than inductive learning because it takes into account the distribution information implicitly embodied in the large number of unlabeled examples. However, it also has some drawbacks: its objective function is a non-convex quadratic programming problem (the non-convex problem) and thus difficult to minimize; the parameter N has to be specified in advance (the presetting-N problem); and there is no guarantee that using more unlabeled examples for training will lead to better learning performance. In fact, it may introduce incorrect labels into the training data, as the labeling is done by the machine, and such labeling errors are critical to the classification performance (the exploiting-unlabeled-examples problem).

In the past few years, researchers have attempted to solve the non-convex problem. Chapelle and Zien [24] proposed semi-supervised classification by low density separation, which combines a smooth loss function with a gradient descent method to find the decision boundary in low density regions; however, the approximation accuracy is unsatisfactory [25]. Chapelle and Sindhwani [26] proposed another technique, a branch-and-bound method that searches for exact, globally optimal solutions, but it is only suitable for a small number of examples. For large-scale data problems, large-scale transductive SVMs were proposed in [27], applying the improved concave–convex procedure to TSVM.

Regarding the presetting-N problem, correctly presetting N before training the TSVM is in general very difficult, because the data distribution is hard to estimate. An incorrectly estimated N will lead to a classifier that cannot describe the actual data distribution, and this drawback severely limits TSVM's applications. To address this problem, improved versions were proposed in [11,28]. In [11] the authors proposed an improved algorithm, called PTSVM, which labels one negative example and one positive example simultaneously according to the maximal absolute decision function value, but it is only suitable for a small number of unlabeled samples. In [28] the authors tackled the problem of remote sensing and proposed a novel transductive procedure that exploits a weighting strategy for unlabeled patterns based on a time-dependent criterion, but it targets a specialized application and has its own limitations.
To tackle the exploiting-unlabeled-examples problem, Li and Zhou [29] proposed the S3VM-us method, which uses hierarchical clustering to select the unlabeled instances so as to reduce the chance of performance degeneration caused by using unlabeled data. Settles [15] showed that active learning (AL) and semi-supervised learning attack the same problem from opposite directions: semi-supervised methods exploit what the learner thinks it knows about the unlabeled data, while AL methods attempt to explore the unknown aspects. AL can select the most informative of the
unlabeled instances to be labeled by users (e.g., a human annotator). In this way, the active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data. By incorporating AL in the process, semi-supervised learning can better exploit the unlabeled data.

TSVM is an important technique for improving predictive accuracy when labeled data is scarce; however, it is considered to underperform in real-world applications. Thus, in order to overcome the drawbacks of TSVM, such as the presetting of the number of positive samples and the exploiting-unlabeled-examples problem, and inspired by the above analysis and the literature, in this paper we design a new semi-supervised SVM algorithm based on AL which takes advantage of the benefits of both techniques and overcomes TSVM's shortcomings. In this way, it can reduce the annotation effort of operators and improve training efficiency.
4. Combining transductive SVM with active learning

In this section we present a new semi-supervised learning algorithm based on active learning (AL), which combines TSVM with AL and is called the ALTSVM algorithm. Firstly, to explore the data manifold structure, we add a regularization term that penalizes any "abrupt changes" of the function values evaluated on neighboring samples in the Laplacian graph. Secondly, we propose a new unlabeled-sample selection principle for AL, called the version space minimum–maximum division principle. Thirdly, we describe the ALTSVM algorithm.

4.1. Adding the manifold regularization term to the objective function

Adding a regularization term defined over the unlabeled data to the traditional SVM optimization function leads to the following optimization problem of the TSVM:

$$\min \;\; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} H_1\big(y_i f(x_i)\big) + C_2 \sum_{i=l+1}^{l+u} H_1\big(|f(x_i)|\big) \qquad (6)$$

where $H_1(\cdot) = \max(0, 1 - \cdot)$ is the classical hinge loss for labeled data and $H_1(|\cdot|) = \max(0, 1 - |\cdot|)$ is the symmetric hinge loss for unlabeled data. For $C_2 = 0$ in (6) we obtain the standard SVM optimization problem; for $C_2 > 0$ we penalize unlabeled data lying inside the margin. Note that the non-convex hat shape of the symmetric hinge loss makes (6) a hard optimization problem. Collobert et al. [27] proposed the CCCP approximate optimization technique, which decomposes a non-convex function into a convex and a concave part and then solves it iteratively: each iteration of the CCCP approximates the concave part by its tangent and minimizes the resulting convex function. In [27], the loss function applied to unlabeled data is replaced by the "ramp loss", which decomposes into the sum of a hinge loss function and a concave loss function. The ramp loss function is expressed as follows:

$$R_s(z) = H_1(z) - H_s(z) = \min\big(1 - s, \max(0, 1 - z)\big) \qquad (7)$$

where $H_1(z)$ is the hinge loss function, $H_s(z) = \max(0, s - z)$ is the concave loss function, and $s$ is a predefined parameter such that $-1 < s \le 0$; here we set $s = -0.3$. Training the TSVM is equivalent to training an SVM using the hinge loss $H_1(z)$ for labeled data and the ramp loss $R_s(z)$ for unlabeled data, where each unlabeled example appears twice, labeled with both possible classes. After introducing

$$y_i = 1, \quad i \in \{l+1, \ldots, l+u\}; \qquad y_i = -1, \quad i \in \{l+u+1, \ldots, l+2u\};$$
$$x_i = x_{i-u}, \quad i \in \{l+u+1, \ldots, l+2u\} \qquad (8)$$

(6) can be rewritten as

$$\min \;\; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} H_1\big(y_i f(x_i)\big) + C_2 \sum_{i=l+1}^{l+2u} R_s\big(y_i f(x_i)\big) \qquad (9)$$
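For concreteness, the following Python sketch implements the hinge, concave, and ramp losses of (7) and (9); it is illustrative only (the original experiments were implemented in MATLAB) and uses s = −0.3 as stated above.

```python
import numpy as np

def hinge_loss(z):
    """Classical hinge loss H_1(z) = max(0, 1 - z), used for labeled data."""
    return np.maximum(0.0, 1.0 - z)

def concave_loss(z, s=-0.3):
    """Concave part H_s(z) = max(0, s - z)."""
    return np.maximum(0.0, s - z)

def ramp_loss(z, s=-0.3):
    """Ramp loss R_s(z) = H_1(z) - H_s(z) = min(1 - s, max(0, 1 - z)),
    applied to the duplicated unlabeled data in Eq. (9)."""
    return np.minimum(1.0 - s, np.maximum(0.0, 1.0 - z))
```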
Based on [27], the CCCP for TSVM has the following objective function:

$$\min \;\; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{l+2u} \xi_i + \sum_{i=l+1}^{l+2u} \beta_i f(x_i) \qquad (10)$$

subject to

$$\frac{1}{u} \sum_{i=l+1}^{l+2u} f(x_i) = \frac{1}{l} \sum_{i=1}^{l} y_i,$$
$$y_i f(x_i) \ge 1 - \xi_i, \quad \forall\, 1 \le i \le l+2u,$$
$$\xi_i \ge 0, \quad \forall\, 1 \le i \le l+2u \qquad (11)$$
where $\beta_i$ is related to the derivative of the concave loss function and is given by

$$\beta_i = \begin{cases} C_2\, R_s'\big(y_i f(x_i)\big) & \text{if } i \ge l+1 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} C_2 & \text{if } y_i f(x_i) < s \text{ and } i \ge l+1 \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$

In order to capture the geometrical structure of the data, a common method is to define $L_0$ as the graph Laplacian. In this way, we can explore the structure of the data manifold by adding a regularization term that penalizes any "abrupt changes" of the function values evaluated on neighboring samples in the Laplacian graph. The corresponding optimization problem of TSVM can be defined as follows:

$$\min \;\; \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{l+2u} \xi_i + \sum_{i=l+1}^{l+2u} \beta_i f(x_i) + C_3\, f^T L_0 f \qquad (13)$$

subject to

$$\frac{1}{u} \sum_{i=l+1}^{l+2u} f(x_i) = \frac{1}{l} \sum_{i=1}^{l} y_i,$$
$$y_i f(x_i) \ge 1 - \xi_i, \quad \forall\, 1 \le i \le l+2u,$$
$$\xi_i \ge 0, \quad \forall\, 1 \le i \le l+2u \qquad (14)$$
where $C_2$ controls the influence of the unlabeled data on the objective function and $C_3$ controls the influence of the graph-based regularization term. Setting $C_3$ to zero causes the TSVM to ignore the manifold information of the training data. Substituting the kernel expansion of $w$, the optimization problem (13)–(14) can be rewritten as follows:

$$\min \;\; \frac{1}{2}\alpha^T K \alpha + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{i=l+1}^{l+2u} \xi_i + \sum_{i=l+1}^{l+2u} \beta_i\, y_i \Big( \sum_{j=1}^{l+2u} \alpha_j K(x_i, x_j) + b \Big) + C_3\, \alpha^T K L_0 K \alpha \qquad (15)$$

subject to

$$\frac{1}{2u} \sum_{i=l+1}^{l+2u} \Big( \sum_{j=1}^{l+2u} \alpha_j K(x_i, x_j) + b \Big) = \frac{1}{l} \sum_{i=1}^{l} y_i,$$
$$y_i \Big( \sum_{j=1}^{l+2u} \alpha_j K(x_i, x_j) + b \Big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (16)$$
Introducing Lagrange multipliers and solving the dual problem, we obtain the decision function

$$f(x) = \sum_{i=1}^{l+2u} y_i \big(\bar{\rho}_i + \gamma_i\big) K(x_i, x) + b \qquad (17)$$

where $\bar{\rho} = \rho - \beta$, and $\rho$ and $\gamma_i$ are Lagrange multipliers.
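To make the CCCP ingredients above concrete, the sketch below computes the coefficients β_i of Eq. (12) from current decision values and evaluates the manifold-regularized objective of Eq. (13) for a candidate solution. The function and argument names (`decision_values`, `slacks`, `f_graph`) are illustrative and not taken from the paper.

```python
import numpy as np

def cccp_beta(decision_values, y, l, C2, s=-0.3):
    """beta_i of Eq. (12): C2 if y_i * f(x_i) < s and i belongs to the
    duplicated unlabeled part (index >= l, 0-based), else 0."""
    yf = np.asarray(y, dtype=float) * np.asarray(decision_values, dtype=float)
    beta = np.zeros_like(yf)
    is_unlabeled = np.arange(len(yf)) >= l
    beta[is_unlabeled & (yf < s)] = C2
    return beta

def regularized_objective(w_norm_sq, slacks, beta, f_dup, f_graph, L0, l, C1, C2, C3):
    """Objective of Eq. (13): 0.5*||w||^2 + C1*sum_labeled xi + C2*sum_unlabeled xi
    + sum_unlabeled beta_i * f(x_i) + C3 * f^T L0 f.

    slacks, beta and f_dup run over the l labeled + 2u duplicated unlabeled points;
    f_graph holds f evaluated on the l + u distinct points used by the Laplacian L0.
    """
    return (0.5 * w_norm_sq
            + C1 * np.sum(slacks[:l])
            + C2 * np.sum(slacks[l:])
            + float(np.dot(beta[l:], f_dup[l:]))
            + C3 * float(f_graph @ L0 @ f_graph))
```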
4.2. Version space minimum–maximum division principle for active learning

In the regularization framework of (6), let $R(f, L)$ denote the objective function, that is,

$$R(f, L) = \sum_{i=1}^{l} \max\big(0, 1 - y_i f(x_i)\big) + \sum_{i=l+1}^{l+u} \min\big(1 - s, \max(0, 1 - y_i f(x_i))\big) + \frac{\lambda}{2}\|f\|_H^2 \qquad (18)$$
In order to identify the most informative example, we select the unlabeled example $x^*$ that leads to a small value of the objective function regardless of its assigned class label $y^*$. Based on this idea, the minimum–maximum division framework can be cast as follows:

$$\min_{x^* \in U} \; \max_{y^* \in \{-1, 1\}} \; R\big(f, L \cup \{(x^*, y^*)\}\big) \qquad (19)$$

Furthermore, it can be expressed as

$$\min_{x_j^* \in U} \; \max_{y_j^* \in \{-1, 1\}} \; \min_{f \in H} \Bigg[ \sum_{i=1}^{l} \max\big(0, 1 - y_i f(x_i)\big) + \sum_{i \in U \cup \{j\}} \min\big(1 - s, \max(0, 1 - y_i f(x_i))\big) + \frac{\lambda}{2}\|f\|_H^2 \Bigg]$$

Let $f^*$ be the optimal decision function found for (6); then (19) can be simplified to

$$\min_{x_j^* \in U} \; \max_{y_j^* \in \{-1, 1\}} \Bigg[ \sum_{i=1}^{l} \max\big(0, 1 - y_i f^*(x_i)\big) + \sum_{i=l+1}^{l+u} \min\big(1 - s, \max(0, 1 - y_j^* f^*(x_j^*))\big) \Bigg]$$

$$= \min_{x_j^* \in U} \Big[ \max\big(0, 1 - f^*(x_j^*), 1 + f^*(x_j^*)\big) + \min\big(1 - s, 0, 1 - f^*(x_j^*), 1 + f^*(x_j^*)\big) \Big]$$

$$= 1 + \min_{x_j^* \in U} \big|f^*(x_j^*)\big|, \quad \text{which is equivalent to selecting} \quad \min_{x_j^* \in U} \big|f^*(x_j^*)\big| \qquad (20)$$
From the above discussion, we can see that an approximation to the minimum–maximum division framework is to select the unlabeled example closest to the decision boundary $f^*$ trained on the current labeled dataset. This result is similar to the result derived from the version space analysis in [30]. Next, we apply the minimum–maximum division framework to active learning.

4.3. The description of the ALTSVM algorithm

Because formula (20) is equal to $\min_{x_j^* \in U} y_j^* f^*(x_j^*)$, the following proposition holds:

Proposition 1. Let the $l + u$ examples determine the version space as follows:

$$V = \big\{ f \in H_K \;\big|\; \forall i \in \{1, 2, \ldots, l+u\},\; y_i f(x_i) > 0 \big\}$$

When the examples $(x_{l+1}, y_{l+1})$ and $(x_{l+2}, y_{l+2})$ are labeled, we get new version spaces $V^{new}_{l+1}$ and $V^{new}_{l+2}$. If $y_{l+1} f(x_{l+1}) > y_{l+2} f(x_{l+2})$, then $Area(V^{new}_{l+1}) > Area(V^{new}_{l+2})$, where $Area(V)$ represents the size of the version space.

Proof. After labeling the examples $(x_{l+1}, y_{l+1})$ and $(x_{l+2}, y_{l+2})$, we obtain the new version spaces $V^{new}_{l+1}$ and $V^{new}_{l+2}$. According to the expression of $f(x)$, these new version spaces can be expressed as follows:

$$V^{new}_{l+1} = \big\{ \omega \;\big|\; \|\omega\| = 1;\; y_i(\omega \cdot \phi(x_i) + b) > 0,\; i = 1, 2, \ldots, l+u;\; y_{l+1}(\omega \cdot \phi(x_{l+1}) + b) > 0 \big\} \qquad (21)$$

$$V^{new}_{l+2} = \big\{ \omega \;\big|\; \|\omega\| = 1;\; y_i(\omega \cdot \phi(x_i) + b) > 0,\; i = 1, 2, \ldots, l+u;\; y_{l+2}(\omega \cdot \phi(x_{l+2}) + b) > 0 \big\} \qquad (22)$$

If $y_{l+1} f(x_{l+1}) > y_{l+2} f(x_{l+2})$, then $y_{l+2}(\omega \cdot \phi(x_{l+2}) + b) > 0$ implies $y_{l+1}(\omega \cdot \phi(x_{l+1}) + b) > 0$. Based on the above analysis, we obtain

$$V^{new}_{l+2} \subset V^{new}_{l+1},$$

and therefore $Area(V^{new}_{l+1}) > Area(V^{new}_{l+2})$. □

From Proposition 1, we find that for an example $(x_i, y_i)$, the smaller the value of $y_i f(x_i)$, the smaller the portion of the current version space retained by the hyper-plane obtained after labeling this example; thus, the example is more valuable for training the classification model. The main idea of the active sample selection strategy is to select a series of samples in order of decreasing $y_i f(x_i)$, so that each currently selected sample carries more labeling value than the previous one and the finally selected set of samples carries as much information as possible. The framework of the TSVM based on active learning (ALTSVM) is described as Algorithm 1.

Algorithm 1. TSVM based on active learning (ALTSVM).

Input: L, U /* labeled sample set; unlabeled sample set */
       k    /* the number of samples to be labeled in each round of interaction */
Output: f(x) /* classification function */

1: Specify the parameters C1 and C2. Select several examples from U, label them (with at least one positive and one negative example), and add them to L. Use all labeled examples to build an initial classification model by inductive learning.
2: Compute the decision function values of all unlabeled examples and form a sequence S of unlabeled examples ordered by increasing f(x_i).
3: Select the example x_i with the minimum objective function value, that is, p = min_{x_i ∈ U} f(x_i), label it, and record the corresponding label y_per = y_i. Delete x_i from U and S, and simultaneously add x_i to L:
   U ← U − {x_i}; S ← S − {x_i}; L ← L + {x_i}.
4: while m = 1, 2, ..., k do:
   if y_per = 1, select the adjacent example x_{i+q} in the opposite direction of S and label it: y_per = y_{i+q}, where q can be either a positive or a negative value.
   if y_per = −1, select the adjacent example x_{i+q} in the increasing direction of S and label it: y_per = y_{i+q}, where q can be either a positive or a negative value.
   Delete x_{i+q} from U and S, and simultaneously add x_{i+q} to L:
   U ← U − {x_{i+q}}; S ← S − {x_{i+q}}; L ← L + {x_{i+q}}.
5: Retrain the TSVM over L and return f(x). If there are still unlabeled examples, return to step 2.
In this algorithm, it is worth noting that we use y_per f(x) instead of y f(x) to measure the sample information, where y_per is not the true label of the sample but the class label of the previously labeled adjacent sample. In this strategy, the approximation does not depend directly on the accuracy of the current classification model, but on the relative relationship that samples with similar classification results have the same class labels. Also, the algorithm only starts retraining after a certain number of unlabeled samples have been labeled. The advantage of this strategy is that, under the premise of promising precision, the algorithm improves computational efficiency compared with retraining after each single unlabeled sample is labeled [31], which is computationally expensive [18].
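A condensed Python sketch of this selection loop is given below. It assumes a hypothetical trainer `train_tsvm(L_X, L_y, U_X)` that returns a model with a `decision_function` method and a human oracle `query_label(x)`; it illustrates Algorithm 1 under these assumptions rather than reproducing the authors' MATLAB implementation.

```python
import numpy as np

def altsvm_active_loop(L_X, L_y, U_X, k, n_rounds, train_tsvm, query_label):
    """Sketch of Algorithm 1 (ALTSVM): query k informative samples per round."""
    for _ in range(n_rounds):
        model = train_tsvm(L_X, L_y, U_X)            # steps 1/5: (re)train on current L, U
        if len(U_X) == 0:
            break
        scores = model.decision_function(U_X)        # step 2: f(x_i) for all unlabeled x_i
        order = np.argsort(scores)                   # sequence S, increasing f(x_i)
        # Step 3: start from the sample closest to the boundary (minimum |f(x)|),
        # following the min-max analysis of Section 4.2.
        pos = int(np.argmin(np.abs(scores[order])))
        picked = [order[pos]]
        labels = [query_label(U_X[order[pos]])]
        lo, hi = pos - 1, pos + 1
        # Step 4: walk along S; the direction depends on the previously queried label.
        while len(picked) < k and (lo >= 0 or hi < len(order)):
            if labels[-1] == 1 and lo >= 0:          # previous label positive: step down S
                idx, lo = order[lo], lo - 1
            elif hi < len(order):                    # previous label negative: step up S
                idx, hi = order[hi], hi + 1
            else:
                idx, lo = order[lo], lo - 1
            picked.append(idx)
            labels.append(query_label(U_X[idx]))
        # Move the queried samples from U to L.
        L_X = np.vstack([L_X, U_X[picked]])
        L_y = np.concatenate([L_y, labels])
        U_X = np.delete(U_X, picked, axis=0)
    return train_tsvm(L_X, L_y, U_X)                 # final retraining over the enlarged L
```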
5. Experimental results and analysis

To evaluate the performance of the proposed algorithm, we conduct a set of experiments comparing it with several state-of-the-art active learning methods on benchmark UCI datasets [32], as well as on a book review dataset as a real-world application.

5.1. Experimental testbed

For our experiments we choose four datasets from the UCI machine learning repository: Hepatitis, WPBC, Bupa liver, and Votes (see Table 1). These datasets have been used in many studies, such as [33–35], and all four are binary classification problems. For each dataset, we choose a certain number of examples as the labeled examples and put them into the labeled data pool L; we remove the labels of the remaining data and put them, as unlabeled data, into the unlabeled data pool U. In this way, the unlabeled data can be labeled by different example selection strategies, which allows us to test strategies such as active learning methods and a random sampling strategy.

The book review dataset was created by crawling the online bookstore JD.com. Superfluous punctuation marks and stop words were deleted, and reviews containing fewer than two characters were removed. We manually labeled 5 true reviews and 5 spam reviews. A review of a book is considered to be non-spam, or useful, if (1) it contains a declarative sentence (all questions are considered spam reviews), and (2) it expresses opinions on a book or book feature. Opinions include the reviewer's positive or negative personal emotion about a book or book feature, and/or an analysis of the advantages and disadvantages of a book or book feature, such as that a book is only suitable for beginners or only covers the basics that a certain kind of developer would need for in-depth study. Table 2 shows each of the six features along with some sample dictionary terms and the dictionary size.

Table 1. The description of the UCI datasets.

Data set     Attributes   Size   Classes
Hepatitis    19           155    2
WPBC         34           198    2
Bupa liver   6            345    2
Votes        16           435    2
5.2. Compared schemes and experimental setup

We compare the proposed algorithm (TSVM_AL+Graph) against TSVM_OAL, TSVM_Random, TSVM, SVM_AL [36,37], and the standard SVM [38]. TSVM_AL is the TSVM_AL+Graph algorithm without the manifold regularization term. TSVM_OAL iteratively requests the label of the example closest to the current hyper-plane, and it uses the currently predicted class label instead of the label of the previously labeled adjacent example; at the same time, it starts retraining after each single unlabeled example is labeled rather than waiting for a certain number of unlabeled examples to be labeled. The TSVM algorithm initially trains a classifier on both labeled and unlabeled examples, exploiting the cluster structure of the examples as prior knowledge about the learning task. The SVM algorithm uses only the labeled examples; it performs well when there is a sufficient number of labeled examples, but its performance degrades when labeled examples are scarce. The TSVM_AL+Graph algorithm not only exploits the manifold structure of the data to improve classifier performance, but also selects informative examples for the human annotator.

To compare these algorithms, TSVM, TSVM_AL, TSVM_Random, and SVM are employed as benchmarks. SVM was solved using the available MATLAB toolbox [38], and TSVM_AL was coded and implemented by ourselves in MATLAB. TSVM is solved by the concave–convex procedure (CCCP) proposed by Collobert et al. [27]. SVM trains the classifier over only the initial labeled training examples, while TSVM trains classifiers on labeled examples together with unlabeled examples under the cluster assumption. SVM_AL trains the classifier on the initial labeled training examples and uses an active learning strategy to query unlabeled examples to be labeled by experts. The solution of the TSVM_AL+Graph algorithm has been discussed in Section 4.1.

5.3. Experiment I: experimental results on UCI datasets

5.3.1. Fixed initial labeled dataset size equal to 10 (L = 10) and fixed k = 1

First, we conduct experiments with the labeled size fixed to 10 and k = 1. Fig. 1 shows the classification accuracy on the four datasets Bupa liver, Hepatitis, Votes, and WPBC. From this figure, it can be seen clearly that:

1. The classification performance when using active learning methods is better than without active learning, and the classification accuracy is higher when active learning is used. When the number of labeled samples is relatively small, active learning can actively select the samples considered most likely to be support vectors and provide them to experts for labeling; these labeled samples are considered to play the biggest role in improving the classification performance. In this way, the labeled sample dataset is continuously expanded and a high-quality training sample dataset is collected for training the classifier. Although the traditional SVM has higher performance in the small-sample case than other classification models, such as Back Propagation (BP) neural networks and decision trees, it cannot utilize the useful information implied by large amounts of unlabeled samples to improve the classification accuracy. Fig. 1(c) and (d) also illustrate this point. With an increasing number of labeled samples, the performance of TSVM_AL+Graph increases gradually.
Table 2 The six extracted features and their sample dictionary terms.
Fig. 1. The classification accuracy of each comparing algorithm as the number of labeled training instances increases, on the Bupa liver, Hepatitis, Votes, and WPBC datasets (x-axis: label rate (%); y-axis: accuracy (%); curves: SVM, TSVM, TSVM_Random, TSVM_Random+Graph, TSVM_AL, TSVM_AL+Graph).
When it reaches a certain labeled percentage, its classification performance is better than that of the traditional SVM. This also shows that the active learning strategy is effective and reasonable for sample selection, and is very helpful in improving the classifier's performance.

2. The samples selected using active learning reflect the true distribution of the sample data better than the random sampling strategy, which ensures that the selected samples can further improve the classifier's accuracy. Meanwhile, the random sampling strategy is highly random and cannot guarantee that more labeled samples yield better performance. On the contrary, when the samples labeled by random sampling do not reflect the data characteristics well, the classification accuracy degrades faster as more samples are labeled. Fig. 1(a)–(d) shows that the accuracy curves corresponding to the random sampling strategy exhibit greater volatility and instability, making the strategy unsuitable for practical applications. In particular, for the Votes dataset, the volatility of TSVM_Random is the largest, which may be due to the distribution of that dataset.

3. After introducing the manifold regularization term, the proposed method can fully utilize the manifold structure of the unlabeled data. For the Bupa liver, Hepatitis, and WPBC datasets, the performance difference between TSVM_AL+Graph and TSVM_AL is marginal. However, for the Votes dataset, the classification performance of TSVM_AL+Graph improves rapidly, while that of TSVM_AL improves relatively slowly.
Especially when the label rate equals 10%, the performance of TSVM_AL+Graph improves faster. Obviously, introducing the manifold regularization term helps to enhance the performance of TSVM.

5.3.2. Different values of k and fixed label size equal to 10 (L = 10)

In order to compare the effect of different values of k, we set the initial labeled dataset size to 10 (L = 10) and changed the value of k. Meanwhile, to avoid the random error of a single experiment, we conducted the experiments five times and report the average accuracy of the four methods. Table 3 shows the average classification accuracy on the Votes dataset with three active learning iterations. Comparing TSVM_AL and TSVM_Random with SVM_AL, we can see that the classification performance gap among them varies with k (the number of unlabeled examples labeled in one iteration). The proposed TSVM_AL+Graph achieves considerably better performance, with a 0.9–3.9% improvement over SVM_AL. For values of k of 15 and above, the proposed algorithm consistently outperforms SVM_AL with a significant improvement. This is because the more unlabeled examples are labeled in one iteration, the better the proposed algorithm can take advantage of the manifold structure implicit in the unlabeled data. From a detailed analysis of the experimental results, we find that as the value of k increases, the relative improvement of TSVM_AL+Graph tends to become more significant. For example, when k equals 15, the improvement of TSVM_AL+Graph over SVM_AL is about 0.9%, whereas the improvement achieved by TSVM_AL is negative.
Table 3. The average classification accuracy (%) on the Votes dataset with different values of k; the growth rate (%) relative to SVM_AL is given in parentheses.

k        SVM_AL   TSVM_Random     TSVM_Random+Graph   TSVM_AL          TSVM_AL+Graph
5        83.1     70.4 (-15.3)    72.6 (-12.6)        78.5 (-5.54)     80.2 (-3.49)
10       86.6     75.2 (-13.2)    78.1 (-9.82)        83.6 (-3.46)     84.4 (-2.54)
15       88.5     81.2 (-8.25)    83.7 (-5.42)        87.8 (-0.79)     89.3 (+0.90)
20       89.1     85.7 (-3.81)    86.5 (-2.92)        89.5 (+0.45)     92.6 (+3.93)
25       91.4     87.1 (-4.70)    88.6 (-3.06)        92.2 (+0.88)     94.8 (+3.72)
Average  87.7     79.92 (-9.05)   81.9 (-6.76)        86.32 (-1.69)    88.26 (+0.54)
Furthermore, when k equals 20, the accuracy increment over SVM_AL is even more pronounced than that of TSVM_AL. These results again show that the proposed active learning algorithm is more effective in selecting a batch of informative unlabeled examples for labeling.

5.3.3. Comparison of different active learning strategies

To compare the effects of our proposed active learning strategy (TSVM_AL+Graph) and other active learning strategies on classification performance, this experiment was carried out on the Votes and Hepatitis datasets. The active learning strategies are described as follows: the TSVM_OAL+Graph method adopts the class label predicted by the current classifier as the measure for active learning; the TSVM_AL+Graph method adopts the class label of the previously labeled adjacent sample, which makes full use of the cluster assumption among the data, that is, samples with similar predicted results have the same predicted class label; the TSVM_T_AL method selects, for labeling, the samples located geometrically around the hyper-plane, and defines a threshold T for ending the training based on the last two labeling results and all samples that need to be labeled [39]; the TSVM_BT_AL method adopts the breaking ties (BT) criterion to select the samples with the smallest classification confidence, i.e., with the smallest difference between the two highest class probabilities [40].

Fig. 2 shows the results of the proposed TSVM_AL+Graph method compared with TSVM_OAL+Graph, TSVM_T_AL, and TSVM_BT_AL. For the Votes dataset (Fig. 2(a)), the classification performance difference between TSVM_AL+Graph and TSVM_OAL+Graph is not obvious, which may be related to the distribution of the dataset, but the proposed TSVM_AL+Graph method significantly outperforms TSVM_T_AL and TSVM_BT_AL. This may be closely related to the active learning strategies. For the TSVM_T_AL method, the value of T reflects the stability of the recently labeled sample categories: the smaller the value of T, the greater the consistency of the classifier and the more credible its classification results; conversely, the larger the value of T, the more the last two classification results vary and the less credible the classification results of the classifier. That is to say, if the value of T is too large, the classification accuracy of the model cannot meet the requirements; if it is too small, the accuracy of the classification model can meet the requirements, but the computational complexity becomes too high. For the TSVM_BT_AL method, SVM is not a probabilistic approach and thus does not directly output probabilistic quantities; meanwhile, when the two highest probabilities are close, the classifier confidence is low. In other words, it could be trapped at a single (complex) boundary.
Fig. 2. The comparative results of different active learning strategies on (a) the Votes dataset and (b) the Hepatitis dataset (x-axis: label rate (%); y-axis: accuracy (%); curves: TSVM_AL+Graph, TSVM_OAL+Graph, TSVM_T_AL, TSVM_BT_AL).
For the Hepatitis dataset (Fig. 2(b)), compared with the other three methods, the TSVM_AL+Graph method also has better performance. In particular, when the proportion of labeled samples is 15% and 20%, the classification performance of TSVM_AL+Graph is significantly superior to the other three methods. This further confirms that our proposed sample selection strategy is effective and feasible.

5.4. Experiment II: the book review application

The book review task is one of the main focuses of our research group. It analyzes users' review information about books and differentiates spam from legitimate reviews. Another task is to further analyze key words or key patterns in those valuable reviews to discover the characteristics of the users' purchases; some of such key words or key patterns are shown in Table 2. After discovering the users' buying habits, we can train a personalized recommendation model based on these behavioral data to provide targeted products for users. Here, we only introduce how to distinguish true reviews from spam reviews, and we regard the book review screening task as a classification problem. For training the classifier, six types of features were selected as the input of the classification model, as listed in Table 2.

In order to fully demonstrate the superiority of our proposed method, we conduct comparative experiments on the book review dataset from two perspectives. The first perspective is lexical. Because there are no space delimiters between Chinese characters, which makes them difficult to process, we adopted ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) to perform Chinese word segmentation and part-of-speech tagging, and then used TF-IDF (term frequency–inverse document frequency) to convert each sentence into a word vector (see the sketch after the list below). The second perspective is statistical. The main statistics collected are the proportions of key words or key patterns in each review sentence, forming a quantization matrix, including:

1. The proportion of opinion phrases in a review sentence.
2. The proportion of questioning patterns in a review sentence.
3. The proportion of language in a review sentence.
4. The proportion of categories mentioned in a review sentence.
5. The length of the review sentence.
6. How many stars are given by the user in the overall review of the book.
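As an illustration of the lexical feature construction described above, the following sketch converts already-segmented review sentences into TF-IDF vectors using scikit-learn's TfidfVectorizer; the segmented example strings are hypothetical, and this stands in for (rather than reproduces) the ICTCLAS + TF-IDF pipeline used in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each review is assumed to be already word-segmented (e.g., by ICTCLAS),
# with tokens separated by spaces.
segmented_reviews = [
    "这本 书 适合 初学者",      # hypothetical segmented review
    "内容 很 好 推荐 购买",     # hypothetical segmented review
]

# Treat whitespace-separated tokens as terms (the default pattern drops single characters).
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
X_lexical = vectorizer.fit_transform(segmented_reviews)  # sparse TF-IDF matrix, one row per review
```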
To assess the importance of unlabeled data in situations where labeled data is rare, we evaluate the performance of TSVM_AL+Graph versus SVM, TSVM, and TSVM_AL by varying the ratio of labeled data from 2% to 20%, where the total number of unlabeled book review data points is 500; SVM and TSVM serve as the benchmark classifiers. The classification accuracy of the TSVM_AL and TSVM_AL+Graph algorithms improves as the number of labeled samples increases. When the number of labeled data is very small, e.g., 4%, as shown in Fig. 4, the classification accuracy of TSVM_AL+Graph is significantly better than that of SVM and TSVM. As shown in Fig. 3, when the labeled data is 4%, the accuracy of TSVM_AL+Graph increases by approximately 3% compared with SVM and TSVM. In Fig. 4, the accuracy of TSVM_AL+Graph increases by approximately 3% compared with SVM and by approximately 2.5% compared with TSVM. At the same time, these two figures demonstrate that TSVM_AL shines when the amount of labeled data is very small, and it also outperforms the TSVM and SVM classifiers as the number of labeled data increases. We can therefore conclude that the TSVM_AL+Graph algorithm is consistently superior to the compared algorithms as the ratio of labeled data increases. Therefore, we can employ TSVM_AL+Graph in practical applications regardless of the amount of labeled data at hand, as it always produces comparable or better results than the traditional SVM and TSVM.
Fig. 3. The classification accuracy rate curve corresponding to the lexical perspective (x-axis: label rate (%); y-axis: accuracy (%); curves: SVM, TSVM, TSVM_AL, TSVM_AL+Graph).

Fig. 4. The classification accuracy rate curve corresponding to the statistical perspective (x-axis: label rate (%); y-axis: accuracy (%); curves: SVM, TSVM, TSVM_AL, TSVM_AL+Graph).
6. Conclusions

In this paper, we address the problems of the transductive support vector machine (TSVM), in particular the need to preset the number N of positive class samples. Correctly presetting N before training the TSVM is very difficult and leads to considerable estimation error, especially when the number of labeled examples is very small. To avoid using more unlabeled examples in a naive way, we turn to active learning: studies have found no guarantee that using more unlabeled examples leads to better learning performance, so a more careful selection of examples to label is required rather than simply a larger number, and active learning solves this problem. The main idea of our proposed algorithm is that, in order to capture the geometrical structure of the data, we define L0 as the graph Laplacian; in this way, we can explore the structure of the data manifold by adding a regularization term that penalizes any "abrupt changes" of the function values evaluated on neighboring samples in the Laplacian graph. Similarly, using active learning to select the most informative instances reduces the learning cost by discarding non-support vectors and achieves significant improvement with considerably fewer labeled data. Compared with sample selection based on random sampling, the proposed algorithm has a positive effect on classifier performance, with the performance improving as the number of labeled samples increases.
In contrast, the effect of sample selection based on random sampling on classifier performance is very volatile. Compared with TSVM based on active learning alone, the proposed algorithm adds a manifold regularization term, which makes full use of the distribution characteristics of the unlabeled examples. Compared with TSVM, the proposed algorithm has further advantages: it does not need to preset the number of positive class samples, does not repeatedly exchange class labels, and makes use of active learning, which selects the best unlabeled data to maximize classifier performance. In addition, our algorithm works well on small labeled training datasets, has strong generalization ability, and is applicable to real-world applications. In future work, we will conduct more insightful theoretical analyses of the effectiveness of our proposed method. Furthermore, in order to select the informative examples, more appropriate selection strategies will be designed and discussed, and we will use the experimental results for personalized recommendation.
Acknowledgements

This work was supported by the National Key Basic Research Program (973) of China under Grant no. 2013CB328903, the National Natural Science Foundation of China under Grant nos. 61379158 and 71301177, and the Basic and Advanced Research Program of Chongqing under Grant nos. cstc2013jcyjA1658 and cstc2014jcyjA40054.
References

[1] C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (2) (1998) 121–167.
[2] Y. Guerbai, Y. Chibani, B. Hadjadji, The effective use of the one-class SVM classifier for handwritten signature verification based on writer-independent parameters, Pattern Recognit. 48 (1) (2015) 103–113.
[3] S. Yin, X. Zhu, C. Jing, Fault detection based on a robust one class support vector machine, Neurocomputing 145 (2014) 263–268.
[4] C. Chen, S. Li, Credit rating with a monotonicity-constrained support vector machine model, Expert Syst. Appl. 41 (16) (2014) 7235–7247.
[5] G. Guraksin, H. Hakli, H. Uguz, Support vector machines classification based on particle swarm optimization for bone age determination, Appl. Soft Comput. 24 (2014) 597–602.
[6] X. Wang, J. Wen, Y. Zhang, Y. Wang, Real estate price forecasting based on SVM optimized by PSO, Optik – Int. J. Light Electron Opt. 125 (3) (2014) 1439–1443.
[7] K. Bennett, A. Demiriz, Semi-supervised support vector machines, Adv. Neural Inf. Process. Syst. (1999) 368–374.
[8] G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 839–846.
[9] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the International Conference on Machine Learning, 1999, pp. 200–209.
[10] Y. Tian, Y. Shi, X. Liu, Recent advances on support vector machines research, Technol. Econ. Dev. Econ. 18 (1) (2012) 5–33.
[11] Y. Chen, G. Wang, S. Dong, Learning with progressive transductive support vector machine, Pattern Recognit. Lett. 24 (12) (2003) 1845–1855.
[12] E. Kondratovich, I. Baskin, A. Varnek, Transductive support vector machines: promising approach to model small and unbalanced datasets, Mol. Inf. 32 (3) (2013) 261–266.
[13] A. Gammerman, V. Vovk, V. Vapnik, Learning by transduction, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998, pp. 148–155.
[14] G. Tur, D. Hakkani-Tur, R. Schapire, Combining active and semi-supervised learning for spoken language understanding, Speech Commun. 45 (2) (2005) 171–186.
[15] B. Settles, Active Learning Literature Survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[16] X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions, in: ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003, pp. 58–65.
[17] H. Hassanzadeh, M. Keyvanpour, A two-phase hybrid of semi-supervised and active learning approach for sequence labeling, Intell. Data Anal. 17 (2) (2013) 251–270.
[18] E. Elhamifar, G. Sapiro, A. Yang, S.S. Sastry, A convex optimization framework for active learning, in: 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp. 209–216.
[19] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the International Conference on Machine Learning, 2003, pp. 912–919.
[20] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Scholkopf, Learning with local and global consistency, Adv. Neural Inf. Process. Syst. 16 (2004) 321–328.
[21] L. Lin, X. Wang, W. Yang, J.H. Lai, Discriminatively trained and-or graph models for object shape detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (5) (2015) 959–972.
[22] X. Wang, L. Lin, Dynamical and-or graph learning for object shape modeling and detection, in: Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 242–250.
[23] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[24] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57–64.
[25] L. Yang, L. Wang, A class of smooth semi-supervised SVM by difference of convex functions programming and algorithm, Knowl. Based Syst. 41 (2013) 1–7.
[26] O. Chapelle, V. Sindhwani, S. Keerthi, Branch and bound for semi-supervised support vector machines, Adv. Neural Inf. Process. Syst. 19 (2007).
[27] R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, J. Mach. Learn. Res. 7 (2006) 1687–1712.
[28] L. Bruzzone, M. Chi, M. Marconcini, A novel transductive SVM for semisupervised classification of remote-sensing images, IEEE Trans. Geosci. Remote Sens. 44 (11) (2006) 3363–3373.
[29] Y. Li, Z. Zhou, Improving semi-supervised support vector machines through unlabeled instances selection, arXiv preprint arXiv:1005.1545, 2010.
[30] M. Hady, F. Schwenker, Combining committee-based semi-supervised learning and active learning, J. Comput. Sci. Technol. 25 (4) (2010) 681–698.
[31] K. Brinker, Incorporating diversity in active learning with support vector machines, in: ICML, vol. 3, 2003, pp. 59–66.
[32] C. Blake, C. Merz, UCI Repository for Machine Learning Databases, <http://www.ics.uci.edu/mlearn/MLRepository.html>, 1998.
[33] Y. Zhang, J. Wen, X. Wang, Z. Jiang, Semi-supervised learning combining co-training with active learning, Expert Syst. Appl. 41 (5) (2014) 2372–2378.
[34] M. Sariyar, A. Borg, Bagging, bumping, multiview, and active learning for record linkage with empirical results on patient identity data, Comput. Methods Prog. Biomed. 108 (3) (2012) 1160–1169.
[35] C. Constantinopoulos, A. Likas, Semi-supervised and active learning with the probabilistic RBF classifier, Neurocomputing 71 (13) (2008) 2489–2498.
[36] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: Proceedings of the International Conference on Machine Learning, 2000, pp. 111–118.
[37] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: Proceedings of the Ninth ACM International Conference on Multimedia, 2001, pp. 107–118.
[38] C. Chang, C. Lin, LIBSVM: A Library for Support Vector Machines, <http://www.csie.ntu.edu.tw/cjlin/libsvm>, 2001.
[39] C. Li, C. Wu, A new semi-supervised support vector machine learning algorithm based on active learning, in: 2010 2nd International Conference on Future Computer and Communication (ICFCC), IEEE, 2010, pp. V3-638–V3-641.
[40] T. Luo, K. Kramer, S. Samson, A. Remsen, D.B. Goldgof, L.O. Hall, T. Hopkins, Active learning to recognize multiple types of plankton, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), IEEE, 2004, pp. 478–481.
Xibin Wang is a PhD student in College of Computer Science at Chongqing University, China. He received his MS degree in computer science from Guizhou University in 2012. His research focuses on computational intelligence, data mining and business intelligence, and machine learning.
Junhao Wen received his BS, MS and PhD degrees in computer science from Chongqing University, China, in 1991, 1999 and 2008, respectively. Currently, he is a professor and PhD supervisor at Chongqing University. His research focuses on recommender systems, data mining and business intelligence, and machine learning.
Shafiq Alam received his PhD degree from the University of Auckland, New Zealand. He is currently a postdoctoral research fellow at the Department of Computer Science, University of Auckland. He has published in international journals of high repute and A*-ranked conferences, and edited a book in his research area. He has also been general chair of a workshop and served on the program committees of various conferences. His research interests include computational intelligence, shilling and fraud detection in recommender systems, data mining, and decision support systems.
Yingbo Wu received his BS, MS and PhD degrees in Computer Science from Chongqing University, China, in 2000, 2003, and 2012, respectively. Currently, he is a professor and MS supervisor at Chongqing University. His research focuses on distributed and intelligent computing, service systems engineering, field engineering and industry information.
Zhuo Jiang received his BS degree from the Mathematics department of Xinjiang University, China, in 2008 and his MS degree in computer science from Chongqing University, China, in 2010. Currently, he is a PhD candidate at Chongqing University, China. His research interests include AI planning, Web services and data mining.