MSSBoost: A new multiclass boosting to semi-supervised learning


Neurocomputing 314 (2018) 251–266


Jafar Tanha a,b,*

a Electrical and Computer Engineering Department, University of Tabriz, Bahman 29 St., P.O. Box 19395-4697, Tabriz, Iran
b School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746, Tehran, Iran

Article info

Article history: Received 15 March 2017; Revised 18 March 2018; Accepted 19 June 2018; Available online 3 July 2018. Communicated by Feiping Nie.

Keywords: Multiclass classification; Semi-supervised learning; Similarity learning; Boosting

Abstract

In this article, we focus on the multiclass classification problem in semi-supervised learning. Semi-supervised learning is a learning task that uses both labeled and unlabeled data points. We formulate multiclass semi-supervised classification as an optimization problem in which the classifier predictions, based on the labeled data, are combined with the pairwise similarity between examples. The goal is to minimize the inconsistency between the classifier predictions and the pairwise similarity. A boosting algorithm is proposed to solve the multiclass classification problem directly. The proposed multiclass approach uses a new multiclass loss formulation that includes two terms: the first term is the multiclass margin cost on the labeled data and the second term is a regularization term on the unlabeled data. The regularization term minimizes the inconsistency between the pairwise similarity and the classifier predictions; in effect, it assigns soft labels weighted by the similarity between unlabeled and labeled examples. First, a gradient descent approach is used to solve the resulting optimization problem and derive a boosting algorithm, named MSSBoost. The derived algorithm also learns an optimal similarity function for the given data. The second approach to solving the optimization problem applies coordinate gradient descent; the resulting algorithm is called CD-MSSB. We also use a variation of CD-MSSB in the experiments. The results of our experiments on a number of UCI and real-world text classification benchmark datasets show that MSSBoost and CD-MSSB outperform the state-of-the-art boosting methods for multiclass semi-supervised learning. Another observation is that the proposed methods exploit the informative unlabeled data. © 2018 Elsevier B.V. All rights reserved.

* Correspondence to: Electrical and Computer Engineering Department, University of Tabriz, Bahman 29 St., P.O. Box 19395-4697, Tabriz, Iran. E-mail address: [email protected]

https://doi.org/10.1016/j.neucom.2018.06.047
0925-2312/© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Supervised learning algorithms are effective when there are sufficient labeled data points. However, in many real-world application domains, such as object detection, document and web-page categorization, and medical domains, labeled data are difficult, expensive, or time consuming to obtain, because they typically require empirical research or experienced human annotators to assign labels [37]. Semi-supervised learning algorithms employ not only the labeled data but also the unlabeled data to build an adequate classification model. The goal of semi-supervised learning is to use unlabeled examples and combine the implicit information in the unlabeled data with the explicit classification information of the labeled data to improve the classification performance. The main issue for semi-supervised learning algorithms is how to find a set of informative data points among the unlabeled data. A number of different algorithms have been proposed

for semi-supervised learning, such as generative models [20,27], self-training [37,40], co-training [6,34], the Transductive Support Vector Machine (TSVM) [18], Semi-Supervised SVM (S3VM) [4], graph-based methods [3,29,42,43], and boosting-based semi-supervised learning methods [5,7,8,23,35,36]. The main focus of this article is the boosting approach to multiclass semi-supervised learning. The boosting framework is a popular approach to supervised learning: a set of weak learners is combined to build a strong classification model. It is therefore well motivated to extend the boosting approach to semi-supervised classification problems. In [7] a boosting algorithm, called MarginBoost, is presented for semi-supervised learning using a new definition of the pseudo-margin for unlabeled data. [5] uses the same approach but a different pseudo-margin definition for unlabeled data. The main issue in these approaches is that although they can improve the classification margin, they do not exploit additional information from the unlabeled examples, such as the similarity between examples or the marginal distribution. Consequently, the new classifier that is trained on the newly-labeled examples is likely to share the same decision boundary with the first classifier instead of constructing


a new one. The reason is that, by adapting the decision boundary, poor predictions will not gain higher confidence; instead, the examples with high classification confidence will gain even higher confidence, see [8,23], and [36]. More recently, new boosting methods have been proposed for semi-supervised classification problems, e.g. SemiBoost [23] and RegBoost [8], which use both the classifier predictions and pairwise similarity to maximize the margin. In this approach the pairwise similarity information between labeled and unlabeled data is used to guide the resulting classifier to assign more reliable pseudo-labels to the unlabeled examples. The experimental results show that these boosting approaches outperform the state-of-the-art methods in this field [19,21,46] and are comparable to LapSVM [3]. The key advantage of the boosting approach is that it can boost any type of base learner and is not limited to a specific base learner.

The aforementioned approaches are basically proposed to solve binary semi-supervised classification problems. Two main approaches can be used to handle multiclass classification problems. The first approach converts the multiclass problem into a set of binary classification problems; examples of this method include one-vs-all, one-vs-one, and error-correcting output codes [2,9]. This approach may suffer from various problems, such as imbalanced class distributions, increased complexity, no guarantee of an optimal joint classifier or probability estimate, and different scales for the outputs of the generated binary classifiers, which complicates combining them, see [26,30,44]. The second approach uses a multiclass classifier directly to solve the multiclass classification problem. Although a number of approaches have recently been presented for multiclass semi-supervised classification problems, e.g. [35,36,41], none of them has been shown to maximize the multiclass margin properly, which is the aim of this article.

The second important point in many promising semi-supervised learning approaches, especially those based on graphs or pairwise similarity, e.g. LapSVM [3], SemiBoost [23], RegBoost [8] and MSAB [35,36], is that the performance of the algorithm strongly depends on the pairwise similarity function used. A good quality similarity measure can significantly influence the classification performance of the learning algorithm. However, there is no unique method that can effectively measure the pairwise similarity between data points. Hence, the aforementioned methods suffer from the lack of an adequate similarity function. Recently a number of methods have been proposed for distance/similarity learning in the context of classification, clustering, and information retrieval, see [15,17,33]. Most of these works learn a Mahalanobis distance function, e.g. [15]. These approaches often use the parametric Mahalanobis distance in combination with K-means or EM clustering methods and a constraint-based approach in order to learn the optimized Mahalanobis function. Hillel and Weinshall [16] propose an approach for the specific application of distance/similarity learning for continuous variables under a Gaussian assumption. More recently, in [28] a semi-supervised metric learning method is presented based on entropy maximization; it optimizes the distance function by optimizing the probability parameterized by that distance function.
A new boosting approach is proposed in [32] to learn a Mahalanobis distance for supervised learning. In this article, we propose a new form of boosting framework for learning the optimal similarity function for the multiclass semi-supervised classification problem. The main contribution of this article is a new loss function formulation for multiclass semi-supervised classification problems. Our proposed approach uses the regular simplex vertices as a new formulation of the multiclass classification problem and combines the similarity information between labeled and unlabeled data with the classifier predictions to assign pseudo-labels to the unlabeled data, using a new boosting formulation.

We propose a new multiclass exponential loss function for semi-supervised learning, which includes two main terms: the first term is used to find a large-margin multiclass classifier, and the second term is a regularization term on the unlabeled data, which combines the pairwise similarity and the classifier predictions. The goal of the regularization term is to minimize the inconsistency between data points, meaning that similar data points must share the same class label. In effect, it assigns soft labels weighted by the similarity between unlabeled and labeled examples. Unlike existing methods that use a predefined similarity function, we propose a boosting framework that learns from weak similarity functions as well as weak base classifiers. To solve the resulting optimization problem, we first employ a functional gradient descent procedure and derive a boosting algorithm from the resulting loss function, named MSSBoost. The proposed boosting approach can boost any type of multiclass weak base classifier, e.g. a decision tree. At each boosting iteration, MSSBoost updates one multiclass ensemble predictor; these updates minimize the loss function. We obtain the weighting factors for the labeled and unlabeled data by solving the optimization problem. We also derive a boosting method for learning the similarity functions, which are used to guide the multiclass predictor to learn more from the unlabeled data. The second approach we use to solve the optimization problem is a functional coordinate gradient descent procedure, from which we obtain a boosting algorithm called CD-MSSB. This boosting approach can boost any type of weak base learner, e.g. a decision stump. At each boosting iteration, CD-MSSB updates one component of the multiclass predictor. We also present a variation of CD-MSSB in the experiments. The experiments on a number of UCI [10] benchmark datasets and a set of real-world text classification [38] datasets show that MSSBoost and CD-MSSB outperform the state-of-the-art boosting methods for semi-supervised learning. The results also emphasize that MSSBoost and CD-MSSB can effectively exploit information from the unlabeled data to improve the classification performance.

The rest of this article is organized as follows. Section 2 addresses multiclass boosting on labeled and unlabeled data. Section 3 presents the resulting risk function. A variation of the proposed algorithm is discussed in Section 4. Section 5 presents the time complexity. Section 6 addresses the related work. The experimental setup and the results are presented in Sections 7, 8, and 9. Section 10 presents the discussion and conclusion.

2. Multiclass supervised and semi-supervised learning

In this section we first review one of the current formulations of multiclass supervised learning using the boosting framework [30], and then extend it to multiclass semi-supervised classification problems. We formulate the problem as an optimization problem and then use the gradient descent approach to solve it. This article is an extension of our previously presented work in [33].

2.1. Multiclass supervised boosting

In binary classification problems, the labels ±1 play a main role in the definition of the margin and the formulation of the loss function. However, this formulation is not directly applicable to multiclass classification problems.
Recently, several studies have used boosting methods to formulate the multiclass classification problem directly, see [1,11,26,30,44]. Methods of this type include AdaBoost.M1 [11], SAMME [44], and AdaBoost-Cost [25]. Typically, these methods need strong weak learners as the base learner, which substantially increases the complexity of the resulting predictor and may lead to overfitting [30]. This is due to the fact that none of these


methods has been shown to maximize the multiclass margin properly. In this article we use the MCBoost formulation, which has recently been shown to outperform other multiclass methods [30]. This method proposes a new formulation of multiclass supervised learning within the boosting framework, using gradient descent in functional space. It has been shown that MCBoost is Bayes consistent, margin enforcing, and convergent to the global minimum [30]. MCBoost uses the vertices of a regular simplex in R^{M-1} as codewords for the multiclass classification problem; a regular (M-1)-dimensional simplex is the convex hull of M unit-norm vectors with equal pairwise distances. Suppose we are given a set of training data (X, Z) = \{(x_i, z_i)\}_{i=1}^{n}, where x_i \in R^d, z_i \in \{1, 2, ..., M\} denotes a class label, M is the number of classes, and (X, Z) are drawn from an unknown underlying distribution. Then, if \hat{Y} = \{y_1, ..., y_M\} is the set of vertices of a regular simplex in R^{M-1}, MCBoost first assigns codewords y_1, ..., y_M \in R^{M-1} to the examples of classes \{1, ..., M\}, respectively. It then learns a predictor f(x) \in R^{M-1} by minimizing the classification risk:

R(f) = E_{X,\hat{Y}}\{L[y_i, f(x_i)]\} \approx \frac{1}{n}\sum_{i=1}^{n} L[y_i, f(x_i)],    (1)

where y, y_i \in \hat{Y} and L[\cdot,\cdot] is an exponential multiclass loss function

L[y, f(x)] = \sum_{k=1}^{M} e^{-\frac{1}{2}[\langle f(x), y\rangle - \langle f(x), y_k\rangle]}    (2)

where \langle\cdot,\cdot\rangle is the standard dot-product operator. The learning/minimization procedure in MCBoost is based on gradient descent in functional space [12,24]. In this work we mainly focus on GD-MCBoost, the gradient descent implementation of MCBoost in functional space for minimizing (1). After learning a predictor f, the decision rule for classifying a new example x is:

F(x) = \arg\max_{k} \langle f(x), y_k\rangle    (3)

where F(x) is the class of largest margin for the predictor f. It has been shown that minimizing (1) maximizes the multiclass margin

M(f(x), k) = \langle f(x), y_k\rangle - \max_{l \neq k} \langle f(x), y_l\rangle.    (4)
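For illustration, the following minimal NumPy sketch (ours, not the authors' published implementation; all function and variable names are illustrative) constructs the regular-simplex codewords and evaluates the loss (2), the decision rule (3), and the margin (4) for a given predictor output:

```python
import numpy as np

def simplex_codewords(M):
    # Vertices of a regular simplex in R^(M-1): unit-norm codewords with equal pairwise distances.
    E = np.eye(M) - np.full((M, M), 1.0 / M)          # centered standard basis, lies in {v : sum(v) = 0}
    Q, _ = np.linalg.qr(E[:, :M - 1])                 # orthonormal basis of that hyperplane
    Y = E @ Q                                         # coordinates of the M centered points in R^(M-1)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def mc_loss(f_x, y, Y):
    # Eq. (2): sum_k exp(-0.5 * (<f(x), y> - <f(x), y_k>))
    return np.exp(-0.5 * (f_x @ y - Y @ f_x)).sum()

def decide(f_x, Y):
    # Eq. (3): class with the largest <f(x), y_k>
    return int(np.argmax(Y @ f_x))

def margin(f_x, k, Y):
    # Eq. (4): <f(x), y_k> - max_{l != k} <f(x), y_l>
    scores = Y @ f_x
    return scores[k] - np.max(np.delete(scores, k))

Y = simplex_codewords(3)                              # three-class codewords in R^2
f_x = np.array([0.9, -0.1])                           # a hypothetical predictor output
print(decide(f_x, Y), mc_loss(f_x, Y[0], Y), margin(f_x, 0, Y))
```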

In addition, in the asymptotic case the minimizer of (1), f^*(x), implements the Bayes decision rule, i.e.

\arg\max_{k} \langle f^*(x), y_k\rangle = \arg\max_{k} \log P_{C|X}(k|x).    (5)

In the next section, we extend this formulation to multiclass semi-supervised learning, proposing a new risk function and using a novel pairwise similarity function.

2.2. Multiclass semi-supervised boosting

In this section we design a new multiclass semi-supervised boosting algorithm, which employs the classifier predictions and a pairwise similarity function in its risk function. The proposed method assumes that if two data points are similar, then they must share the same class label. This assumption, called the cluster assumption, is commonly used in many graph-based semi-supervised learning algorithms, e.g. [3,35,45]. Based on this assumption and on the margin-maximizing approach of MCBoost for the multiclass case, we introduce a new risk function for the multiclass semi-supervised classification problem as follows. Assume that L = \{x_1, x_2, ..., x_l\} is the set of labeled training examples and U = \{x_{l+1}, x_{l+2}, ..., x_{l+u}\} is the set of unlabeled examples. We start with a new risk function for the multiclass semi-supervised classification problem:

R_s(\hat{Y}, f, S) = C_1 R_L(\hat{Y}, f) + C_2 R_U(\hat{Y}, f, S)    (6)

where \hat{Y} \subset R^{M-1} is the set of vertices of a regular simplex, C_1 and C_2 control the contributions of the labeled and unlabeled examples respectively, and f(\cdot) is an ensemble classifier. As seen, (6) includes two terms. The first term is similar to the supervised multiclass boosting loss of MCBoost and is formulated as:



RL (Yˆ , f ) =

( xi

,yi

LL [yi , f ( xi )

(7)

)∈L

where y_i \in \hat{Y} \subset R^{M-1}. This term defines the margin cost for the labeled data. The second term in (6) is the penalty term related to the unlabeled examples. Since there is no true label for the unlabeled examples, this term defines the cost of the pseudo-margin for the unlabeled examples as follows:



RU (Yˆ , f, S ) =



S (xi , x˜ j )LU [yi , f (x˜ j )]

(8)

(xi ,yi )∈L x˜ j ∈U

where S is a real-valued function that measures the similarity between x and \tilde{x}; a larger S(\cdot,\cdot) corresponds to a more similar pair. We now define C_1 and C_2 in (6). Based on our empirical experience, we set C_1 = \frac{1}{|L|} and C_2 = \frac{\lambda}{|L||U|}, where \lambda is a tuning parameter. The resulting risk function is:

R_s(\hat{Y}, f, S) = \frac{1}{|L|} \sum_{(x_i, y_i)\in L} L_L[y_i, f(x_i)] + \frac{\lambda}{|L||U|} \sum_{(x_i, y_i)\in L} \sum_{\tilde{x}_j \in U} S(x_i, \tilde{x}_j) L_U[y_i, f(\tilde{x}_j)]    (9)

The risk function (9) has two terms. As mentioned, the first term is identical to MCBoost: it operates on the labeled examples, and its minimization results in a large-margin multiclass classifier. The second term is a weighted summation of the loss of assigning codeword y_i to the unlabeled example \tilde{x}_j. The weights in this summation are determined by S(x_i, \tilde{x}_j), which measures the similarity between x_i and \tilde{x}_j. If x_i and \tilde{x}_j are similar, then with high probability they belong to the same class, i.e. \tilde{x}_j belongs to class y_i; in this case S(x_i, \tilde{x}_j) will be large and L_U[y_i, f(\tilde{x}_j)] will have more effect on the second term of (9). In fact, the second term can be interpreted as a soft label assignment, where the hypothesis of assigning each label y_i to example \tilde{x}_j is weighted by the similarity between x_i and \tilde{x}_j. Therefore (9) uses the similarity between unlabeled and labeled examples to assign a soft label to the unlabeled data. The final classifier is then trained on these soft label assignments along with the labeled training examples. Unlike previous approaches such as ASSEMBLE [5] and MarginBoost [7], which use only the classifier predictions to assign labels to the unlabeled data, the main advantage of our method is that it assigns the soft labels to the unlabeled data based on both the pairwise similarity and the classifier predictions. In other words, our method on one hand maximizes the multiclass margin on the labeled data and on the other hand maximizes the consistency between the classifier predictions and the pairwise similarity. To minimize the risk function (9), one approach is to use the boosting framework to find the best weak learner. As can be seen, the second term of (9) includes a similarity measure, which can be any type of kernel function, such as the radial basis function, a quadratic function, the Laplacian function, or a Mahalanobis-based similarity function. In this case a combination of the similarity information and the classifier predictions is used to assign weights to the unlabeled examples. A similar approach has been used in SemiBoost [23] and RegBoost [8] for binary semi-supervised


learning. These methods sample a set of high-confidence predictions from the unlabeled data at each iteration; the sampled examples, along with their pseudo-labels, are then used to train a new component classifier. However, finding these subsets is challenging and requires some intuitive criteria. Moreover, these methods employ a predefined similarity measure in their loss functions to exploit the informative unlabeled examples. The main issue here is that finding a suitable similarity measure and tuning its parameters is a challenging task that needs a lot of effort and has a high time complexity [23,35]. Our proposed method, on the one hand, assigns weights to all data points and therefore does not need to sample from the unlabeled data; the weighted data points are then used to find a new optimal classifier that most decreases the risk function. On the other hand, it uses the boosting framework to learn from weak similarity functions, so there is no need to tune the parameters of the similarity measure.
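To make the role of the two terms in (9) concrete, the following NumPy sketch (an illustration under our own naming, not the paper's code) evaluates the risk for given predictor outputs, labeled class indices, simplex codewords, and a precomputed similarity matrix; the default λ = 0.1 follows the value tuned later in Section 7.2:

```python
import numpy as np

def mssboost_risk(F_L, y_idx, F_U, S, Y, lam=0.1):
    """Empirical risk (9): labeled multiclass exponential loss plus the
    similarity-weighted loss on unlabeled points (soft label assignment).

    F_L : (l, M-1) predictor outputs on labeled points
    y_idx: (l,) class indices of the labeled points
    F_U : (u, M-1) predictor outputs on unlabeled points
    S   : (l, u) pairwise similarities S(x_i, x~_j) in [0, 1]
    Y   : (M, M-1) simplex codewords
    """
    l, u = F_L.shape[0], F_U.shape[0]

    def loss(F, codes):
        # sum_k exp(-0.5 * (<f(x), y> - <f(x), y_k>)) for every row of F
        proj = F @ Y.T                                           # <f(x), y_k> for all k
        own = np.take_along_axis(proj, codes[:, None], axis=1)   # <f(x), y>
        return np.exp(-0.5 * (own - proj)).sum(axis=1)

    R_L = loss(F_L, y_idx).sum() / l
    # each unlabeled point inherits every labeled codeword, weighted by similarity
    per_label = np.stack([loss(F_U, np.full(u, c)) for c in y_idx])   # (l, u)
    R_U = (S * per_label).sum() * lam / (l * u)
    return R_L + R_U
```

The inner loop over the labeled examples makes explicit that each unlabeled point receives every labeled codeword as a candidate soft label, weighted by S(x_i, \tilde{x}_j).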

3. Proposed risk function

We start by formulating the risk function of (9) as an optimization problem over the multiclass predictor and the similarity function. This results in:

minimize_{f, S}   R_s(\hat{Y}, f, S)
subject to   f(x) = [f_1(x), ..., f_{M-1}(x)],   f_m \in Span(H)   \forall m = 1, ..., M-1,   S \in Span(K),    (10)

where H = \{h_1(x), ..., h_p(x)\} is a set of weak classifiers h_i : X \to R^{M-1}, in which h_i can be any type of multiclass weak base learner, such as a decision tree or Naive Bayes, and K = \{K_1(x, \tilde{x}; A_1), ..., K_q(x, \tilde{x}; A_q)\} is a set of real-valued similarity functionals K_n : X \times \tilde{X} \to [0, 1], where A \in R^{d \times d} is a square matrix, see Section 3.2. As we show in Appendix A, Eq. (10) is a convex function; therefore, it converges to the global optimal point, see Appendix A. There are several approaches to solving the optimization problem (10). As we mentioned in [33], one approach is to use coordinate gradient descent, in which case the coordinate gradient descent is applied for each class label, i.e. the mth coordinate, in order to optimize the risk function. The second approach to solving (10) is to use gradient descent; in this case the best multiclass classifier is added to the ensemble multiclass classifier at each boosting iteration, as in supervised multiclass boosting (SAMME [44]), without any extra inner iterations. In this article we follow both approaches, as addressed in more detail in the next sections. For a given similarity function S in (10), we first employ gradient descent in functional space; the goal in this case is to find the optimal multiclass predictor. We then solve the problem in terms of the similarity function, using gradient descent in matrix space to find the optimal similarity function. We next derive an algorithm based on the resulting optimization problem. We also apply coordinate descent in functional space to solve the optimization problem as our second approach.

3.1. Learning weak base learners using the boosting framework

In this section, given a similarity function S, we propose a boosting algorithm to build a multiclass predictor f which utilizes information from labeled and unlabeled training examples by solving the following optimization problem:

minimize_{f}   R_s(\hat{Y}, f, S)
subject to   f(x) = [f_1(x), ..., f_{M-1}(x)],   f_m \in Span(H)   \forall m = 1, ..., M-1,    (11)

where each element of H can be any type of multiclass weak base learner, such as a decision tree, Naive Bayes, or another multiclass base learner. We now solve (11) by using gradient descent in functional space. Let f^t \in R^{M-1} denote the multiclass classifier after t iterations of the boosting framework. At the next iteration, iteration (t+1), we seek a g(x) \in H, g : X \to R^{M-1}, that most decreases the risk function R_s(\hat{Y}, f^t + \alpha g, S). Therefore, the best weak learner to add to f^t is:

g^*(x) = \arg\min_{g \in H} \delta R_s(\hat{Y}, f^{t+1}, S),    (12)

where \delta R_s is the functional derivative of R_s along the direction of the functional g, at the point f(x) = f^t(x). As shown in Appendix B, we obtain \delta R_s(\hat{Y}, f^{t+1}, S) as follows:

\delta R_s(\hat{Y}, f^{t+1}, S) = \left.\frac{\partial R_s(\hat{Y}, f^t + \epsilon g, S)}{\partial \epsilon}\right|_{\epsilon=0} = -\left[ \sum_{(x_i, y_i)\in L} \langle g(x_i), w_i\rangle + \sum_{\tilde{x}_j \in U} \langle g(\tilde{x}_j), v_j\rangle \right],    (13)

We then compute the weight of the labeled data as follows:

w_i = \frac{1}{2} \sum_{k=1}^{M} (y^i - y_k)\, e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y_k\rangle},    (14)

and the weighting factor for the unlabeled data in the form:

v_j = \frac{1}{2} \sum_{(x_i, y_i)\in L} S(x_i, \tilde{x}_j) \times \sum_{k=1}^{M} (y^i - y_k)\, e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y_k\rangle}.    (15)

Finally, the optimal step size along this direction is:

\alpha^*(x) = \arg\min_{\alpha \in R} R_s(\hat{Y}, f^t + \alpha g^*, S)    (16)

As mentioned, we employ the boosting framework in our proposed approach. At each iteration of boosting, we find the best learner, i.e. the one that results in the lowest risk R_s.

3.2. Learning the similarity function using the boosting framework

As shown in the experiments, the performance of semi-supervised learning algorithms, especially those based on pairwise similarity, critically depends on the quality of the distance/metric function used to measure the similarity between points. On the other hand, finding a well-fitted similarity function for a specific domain is a very challenging and time consuming task [23,35,47]. In this section, given a multiclass predictor f, we design a boosting algorithm to construct a similarity function S using the pairwise similarity between labeled and unlabeled examples, by solving the following optimization problem:

minimize_{S}   R_s(\hat{Y}, f, S)
subject to   S \in Span(K)    (17)

where K = \{K_1(x, \tilde{x}; A_1), ..., K_q(x, \tilde{x}; A_q)\} is a set of weak similarity functions. Here, we solve Eq. (17) by using gradient descent in matrix space. Let S^t(x, \tilde{x}; A^t) = e^{-(x-\tilde{x})^T A^t (x-\tilde{x})} denote the available similarity function after t iterations. The goal here is to find a matrix A^t such that it most decreases the risk function (17) and thus gives a suitable similarity function. Note that, as seen in (18), using this similarity function results in minimizing the inconsistency between the classifier predictions and the similarity information.


For simplicity, we assume the matrix A to be a diagonal matrix with all entries -1 \le A_{ii} \le 1. At the (t+1)th iteration, similar to the boosting procedure, the best A to add to A^t is derived as:

A^* = \arg\min_{A \in K} \delta R_s(\hat{Y}, f, S^{t+1}; A)    (18)

where S^{t+1}(x, \tilde{x}; A^t + \beta A) = e^{-(x-\tilde{x})^T (A^t + \beta A)(x-\tilde{x})}. As shown in Appendix C, we derive \delta R_s(\hat{Y}, f, S^{t+1}; A) as follows:

\delta R_s(\hat{Y}, f, S^{t+1}; A) = \left.\frac{\partial R_s(\hat{Y}, f, S^{t+1}(x, \tilde{x}; A^t + \epsilon A))}{\partial \epsilon}\right|_{\epsilon=0} = -\sum_{(x_i, y_i)\in L} \sum_{\tilde{x}_j \in U} P_{i,j} W_{i,j},    (19)

where W_{i,j} and P_{i,j} are the weights, computed as follows:

W_{i,j} = e^{-(x_i - \tilde{x}_j)^T A^t (x_i - \tilde{x}_j)} \sum_{k} e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y_k\rangle},    (20)

and

P_{i,j} = (x_i - \tilde{x}_j)^T A (x_i - \tilde{x}_j).    (21)

At the end, the optimal step size along the direction of A∗ will be the solution of

\beta^* = \arg\min_{\beta \in R} R_s(\hat{Y}, f, S^{t+1}(x, \tilde{x}; A^t + \beta A^*)).    (22)
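As an illustration of one similarity-learning step, the sketch below (our simplification: a diagonal metric, a finite candidate set for A, and a grid search over β) computes the weights (20)-(21), selects the direction as in (18)-(19), and chooses the step size as in (22). Since only the unlabeled term of the risk depends on S, the grid search is carried out over that term; all names are illustrative, not the paper's code.

```python
import numpy as np

def similarity_boost_step(X_L, y_idx, X_U, F_U, Y, A_t, candidates, betas):
    """One similarity-learning step (Section 3.2) for a diagonal metric A_t of shape (d,)."""
    D = X_L[:, None, :] - X_U[None, :, :]                  # (l, u, d): x_i - x~_j
    sq = D ** 2                                            # diagonal metric -> squared differences
    proj = F_U @ Y.T                                       # (u, M): <f(x~_j), y_k>
    own = proj[:, y_idx]                                   # (u, l): <f(x~_j), y_{c_i}>
    # L_U[y_i, f(x~_j)] = sum_k exp(-0.5*(<f, y_i> - <f, y_k>)), arranged as (l, u)
    loss_u = np.exp(-0.5 * (own[:, :, None] - proj[:, None, :])).sum(axis=2).T
    S_t = np.exp(-(sq @ A_t))                              # current similarity, Eq. (33)-style
    W = S_t * loss_u                                       # Eq. (20)

    # Eqs. (18)-(19): candidate direction with the most negative functional derivative
    def dR(A):
        P = sq @ A                                         # Eq. (21) for a diagonal A
        return -np.sum(P * W)
    A_star = min(candidates, key=dR)

    # Eq. (22): only the unlabeled term depends on S, so a grid search over it suffices
    def R_u(beta):
        S_new = np.exp(-(sq @ (A_t + beta * A_star)))
        return np.sum(S_new * loss_u)
    beta_star = min(betas, key=R_u)

    # keep the diagonal metric positive semi-definite, as described in the text below
    return np.maximum(A_t + beta_star * A_star, 0.0), beta_star
```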

After each iteration, if there are entries with negative values in A^t + \beta^* A^*, we set those entries to zero in order to maintain the positive semi-definite property of the metric. Next, the similarity function is updated to S^{t+1} = S^t(x, \tilde{x}; A^t + \beta^* A^*). The next section presents the resulting algorithm.

3.3. The proposed MSSBoost algorithm

In this section, based on the previous discussion, we provide the details of our proposed algorithm, which uses the boosting framework throughout the training procedure. Algorithm 1 gives the details of the MSSBoost algorithm.

Algorithm 1 MSSBoost
Initialize: L: labeled data; U: unlabeled data; S^0: initial similarity function; f^0(x) = 0 \in R^{M-1}; t <- 0; N: number of iterations
While (t < N) do
Begin
    For each (x_i, y_i) \in L do
        - Compute w_i for x_i using (14)
    For each \tilde{x}_j \in U do
        - Compute v_j for \tilde{x}_j using (15)
    Train a weak classifier on the labeled and unlabeled examples using (13)
    Compute g^*(x) and \alpha^* using (12) and (16)
    Update the predictor f^{t+1}(x)
    After several iterations of learning weak classifiers, start to learn weak similarity functions:
        For each (x_i, y_i) \in L and \tilde{x}_j \in U do
            - Compute W_{i,j} and P_{i,j} using (20) and (21)
        Compute A^* and \beta^* using (18) and (22)
        Update the similarity function S^{t+1}
    t <- t + 1
End // end of iterations
Output decision rule: F(x) = \arg\max_k \langle f^N(x), y_k\rangle
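The following Python sketch mirrors the weak-learner part of Algorithm 1. It is a simplified stand-in, not the published implementation (which boosts WEKA base classifiers in Java): a multi-output regression tree is fitted to the negative functional gradient, a standard gradient-boosting surrogate for the arg min in (12); the similarity matrix S is kept fixed; and the step size (16) is found by a grid search.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_mssboost(X_L, y_idx, X_U, Y, S, n_rounds=20, lam=0.1,
                 alphas=np.linspace(0.05, 2.0, 40)):
    """Sketch of the weak-learner loop of Algorithm 1 (names and choices are ours)."""
    l, u = len(X_L), len(X_U)
    dim = Y.shape[1]
    F_L, F_U = np.zeros((l, dim)), np.zeros((u, dim))
    ensemble = []                                       # list of (alpha, weak learner)

    def point_loss(F, codes):                           # sum_k exp(-0.5 <f, y_c - y_k>)
        proj = F @ Y.T
        own = np.take_along_axis(proj, codes[:, None], axis=1)
        return np.exp(-0.5 * (own - proj)).sum(axis=1)

    def risk(FL, FU):                                   # Eq. (9)
        R_L = point_loss(FL, y_idx).sum() / l
        per = np.stack([point_loss(FU, np.full(u, c)) for c in y_idx])
        return R_L + lam * (S * per).sum() / (l * u)

    for _ in range(n_rounds):
        diffs = Y[y_idx][:, None, :] - Y[None, :, :]                    # (l, M, dim): y_i - y_k
        e_L = np.exp(-0.5 * np.einsum('lkd,ld->lk', diffs, F_L))
        W = 0.5 * np.einsum('lk,lkd->ld', e_L, diffs)                   # Eq. (14)
        e_U = np.exp(-0.5 * np.einsum('lkd,ud->luk', diffs, F_U))
        V = 0.5 * np.einsum('lu,luk,lkd->ud', S, e_U, diffs)            # Eq. (15)

        tree = DecisionTreeRegressor(max_depth=3)                       # weak direction g
        tree.fit(np.vstack([X_L, X_U]), np.vstack([W, V]))
        G_L, G_U = tree.predict(X_L), tree.predict(X_U)

        alpha = min(alphas, key=lambda a: risk(F_L + a * G_L, F_U + a * G_U))  # Eq. (16)
        F_L, F_U = F_L + alpha * G_L, F_U + alpha * G_U
        ensemble.append((alpha, tree))
    return ensemble

def predict(ensemble, X, Y):
    """Decision rule of Algorithm 1: arg max_k <f^N(x), y_k>."""
    F = sum(a * t.predict(X) for a, t in ensemble)
    return np.argmax(F @ Y.T, axis=1)
```

The similarity-learning part of Algorithm 1 can be interleaved by periodically updating S with a routine such as the similarity_boost_step sketch given after Eq. (22).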

As can be seen, the algorithm first computes the weights for the labeled and unlabeled examples according to Eqs. (14) and (15), respectively. It then selects the best weak learner to add to the ensemble classifier f^t, i.e. the one that most decreases the risk function, see (11). After several iterations of this training procedure, the algorithm starts to update the similarity function: first the weights are computed according to Eqs. (20) and (21), and then the best weak similarity function is selected and added to the previous similarity functions, as described in the previous section, see Eqs. (18) and (22). Next, the similarity function is updated with respect to Eq. (17). The training procedure of MSSBoost is repeated until the stopping conditions are reached. For the similarity learning step, we use \beta^* \le 0 as a stopping criterion, as is usual in boosting algorithms. Although it has been empirically determined that a fixed number of classifiers, around 20, works best for AdaBoost [12], we use both conditions, \alpha^* \le 0 and a fixed number of iterations, as the stopping criterion for learning the weak predictors. Finally, the classification model F(x) is formed as the final hypothesis, which is the weighted combination of the weak classifiers, as shown in Algorithm 1. Unlike methods such as ASSEMBLE, SemiBoost and MSAB, the MSSBoost algorithm uses all unlabeled examples throughout the training process: it assigns weights and labels to the unlabeled examples and then uses these data points in the training procedure, whereas SemiBoost and MSAB use several intuitive heuristics to sample from the unlabeled examples according to the assigned weights.

4. Variations of MSSBoost

As mentioned in [33], coordinate gradient descent can also be used to handle the optimization problem (10). In this case the coordinate descent is applied for each class label, i.e. the mth coordinate. Therefore, to solve (10) for a given similarity function S, we use coordinate gradient descent in functional space. The goal in this case is to find the optimal multiclass predictor; we then solve the problem in terms of the similarity function using gradient descent in matrix space to find the optimal similarity function. Let f^t(x) = [f_1^t(x), ..., f_{M-1}^t(x)] denote the available predictor after t iterations. At the (t+1)th iteration, as addressed in [24], the mth coordinate is updated in the direction of a functional g, g : X \to R, such that it most decreases the risk function R_s(\hat{Y}, [f_1^t, ..., f_m^t + \alpha g, ..., f_{M-1}^t], S). Therefore the best weak learner to add to the mth coordinate is as follows:

g_m^*(x) = \arg\min_{g \in H} \delta R_s(\hat{Y}, f^{t+1}, S; m),    (23)

where \delta R_s is the functional derivative of R_s along the direction of the functional g, at the point f(x) = f^t(x), with respect to the mth component f_m(x) of f(x). We then obtain \delta R_s(\hat{Y}, f^{t+1}, S; m) as follows:

\delta R_s(\hat{Y}, f^{t+1}, S; m) = \left.\frac{\partial R_s(\hat{Y}, [f^t + \epsilon g e_m], S)}{\partial \epsilon}\right|_{\epsilon=0} = -\left[ \sum_{(x_i, y_i)\in L} g(x_i)\, w_i^m + \sum_{\tilde{x}_j \in U} g(\tilde{x}_j)\, v_j^m \right],    (24)

where e_m \in R^{M-1} is the canonical unit vector. We now compute the weights as:

w_i^m = \frac{1}{2} \sum_{k=1}^{M} \langle e_m, y^i - y_k\rangle\, e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y_k\rangle},    (25)


and

v_j^m = \frac{1}{2} \sum_{(x_i, y_i)\in L} S(x_i, \tilde{x}_j) \times \sum_{k=1}^{M} \langle e_m, y^i - y_k\rangle\, e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y_k\rangle}.    (26)
where w_i^m and v_j^m are the weights for the labeled and unlabeled examples, respectively. Finally, the optimal step size along this direction is

\alpha_m^*(x) = \arg\min_{\alpha \in R} R_s(\hat{Y}, f^t + \alpha g_m^* e_m, S)    (27)
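For the coordinate-wise variant, the sketch below (again with our own, illustrative naming) computes the per-coordinate weights (25) and (26); CD-MSSB would fit a scalar weak learner, e.g. a decision stump, to these weights for the chosen coordinate m and then take the step (27).

```python
import numpy as np

def coordinate_weights(F_L, y_idx, F_U, S, Y, m):
    """Weights (25)-(26) for the m-th coordinate used by CD-MSSB.

    F_L: (l, M-1) and F_U: (u, M-1) are the current predictor outputs,
    S: (l, u) pairwise similarities, Y: (M, M-1) codewords, m: coordinate index.
    """
    diffs = Y[y_idx][:, None, :] - Y[None, :, :]                   # (l, M, M-1): y_i - y_k
    proj_m = diffs[:, :, m]                                        # <e_m, y_i - y_k>
    e_L = np.exp(-0.5 * np.einsum('lkd,ld->lk', diffs, F_L))
    w_m = 0.5 * (proj_m * e_L).sum(axis=1)                         # Eq. (25), shape (l,)
    e_U = np.exp(-0.5 * np.einsum('lkd,ud->luk', diffs, F_U))
    v_m = 0.5 * np.einsum('lu,luk,lk->u', S, e_U, proj_m)          # Eq. (26), shape (u,)
    return w_m, v_m
```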

The next step in solving the optimization problem (10) is to find the optimal similarity function. Similar to Section 3.2, we follow the same approach. We then employ the boosting framework in our proposed approach: at each iteration of boosting, we compute the best update for all components and update the coordinate which results in the lowest risk R_s. The resulting algorithm is named CD-MSSBoost (CD-MSSB). Since in CD-MSSB the underlying optimization problem is solved by coordinate descent, it needs several updates for each class label, and for datasets with more classes its time complexity will be high. To address this issue, we use only one inner iteration of the boosting procedure. The resulting algorithm is called One CD-MSSB (OCD-MSSB). In this case the optimization problem (10) is optimized only once in the direction of each coordinate, which indeed decreases the time complexity of the resulting algorithm.

5. Discussion of the time complexity

In this section, we give the details of the time complexity of the proposed algorithms. We first discuss the MSSBoost algorithm. As mentioned in Section 3, MSSBoost employs a multiclass weak base learner. It computes the weights for the labeled and unlabeled examples and then selects the best weak multiclass learner to add to the ensemble classifier f^t, i.e. the one that most decreases the risk function. After several iterations, MSSBoost starts to update the similarity function: first the weights are computed and then the best weak similarity function is selected and added to the previous similarity functions. The training procedure of MSSBoost is repeated until the stopping conditions are reached. Let T denote the number of boosting iterations, H the number of weak base learner candidates, and K the number of candidates for the weak similarity functions. The time complexity of MSSBoost is then O(T × (H + K)). We now discuss the CD-MSSB algorithm. Similar to MSSBoost, the algorithm first computes the weights for the labeled and unlabeled examples according to Eqs. (25) and (26), respectively. CD-MSSB then selects the best weak learner to add to the mth coordinate such that it most decreases the risk function. This procedure is repeated until all components of the multiclass predictor are updated according to the proposed formulation. The similarity function is updated as in MSSBoost. Let M denote the number of classes; the time complexity of CD-MSSB is then O(T × (M × H + K)). As seen, this approach requires several inner iterations in terms of the number of classes and therefore has a high time complexity. Meanwhile, each component is updated independently, which may not lead to the best multiclass weak classifier as a whole. To handle this problem, we propose a variation of CD-MSSB, OCD-MSSB. Similar to CD-MSSB, this algorithm first calculates the weights for the labeled and unlabeled data points according to Eqs. (25) and (26), respectively. OCD-MSSB then finds the best weak learner for each coordinate such that it most decreases the risk function. Finally, the best similarity function is selected as in MSSBoost. Regarding the training procedure of OCD-MSSB, the time complexity is O(M × (H + K)).

6. Comparison to previous related work

Using a boosting approach to handle semi-supervised learning problems has been addressed in several recent studies, see [5,8,23,35,41]. In this section we compare MSSBoost to the related methods as follows. Bennett et al. [5] and d'Alché-Buc et al. [7] proposed boosting frameworks to handle the binary semi-supervised classification problem. These methods introduce a pseudo-margin concept for the unlabeled examples in their loss functions. The goal of these methods is to find, at each iteration, a new classifier that minimizes the loss function. However, although this minimization reduces the margin cost, it may not change the decision boundary. To solve this problem, new boosting algorithms for binary semi-supervised learning were recently proposed in [23] and [8], which use the classifier predictions and pairwise similarity to sample from the unlabeled data using some intuitive criterion. These methods sample a set of high-confidence predictions from the unlabeled data, which is challenging; the sampled examples, along with their pseudo-labels, are used to train a new component classifier at each iteration of the boosting procedure. Moreover, these methods employ a predefined similarity function, which is not optimal in practice, and their performance strongly depends on the similarity function used and on its tuned parameters. We address these methods and compare them to the proposed method in this article. In [41], a boosting algorithm is presented for multiclass semi-supervised learning. This method employs a multiclass weak learner h_t : X \to \{1, 2, ..., M\} as the base learner and a classification rule of the form:

H(x) = \arg\max_{i \in \{1,2,...,M\}} \sum_{t \,|\, h_t(x)=i} \alpha_t h_t(x)    (28)

where \alpha_t is the coefficient of the weak learner h_t(x) at the tth iteration. This is equivalent to the decision rule of (3) when f(x) is an M-dimensional classifier with ith component f_i = \sum_{t \,|\, h_t(x)=i} \alpha_t h_t(x) and class labels y_k = e_k. However, in this case maximizing the multiclass margin is the main issue, because the codewords used to model the multiclass problem may not lead to the optimal solution for maximizing the margin. It also employs a fixed similarity function. Most recently, in [35] the use of an M-dimensional predictor is proposed with the following M-dimensional labels, called MSAB:



y_{i,j} = 1 if i = j, and y_{i,j} = -\frac{1}{M-1} if i \neq j,    (29)

and decision rule

H(x) = \arg\max_{i \in \{1,2,...,M\}} f_i(x).    (30)

The proposed method then employs L_{Ml} = e^{-\frac{1}{M}\langle f(x), y\rangle} as the margin cost on the labeled data and L_{Mu} = \sum_{j \in U}\sum_{k} e^{-\frac{1}{M}\langle f(\tilde{x}_j), y_k\rangle} as the regularization on the unlabeled data. With the label formulation (29), the margin is then:

M(f(x), y_k) = f_k(x) - \frac{1}{M}\sum_{i \neq k} f_i(x).    (31)

However, M(f(x), y_k) > 0 does not imply correct classification, i.e. f_k(x) > f_i(x) for all i \neq k; for example, with M = 3 and f(x) = (4, 5, -6), we have M(f(x), y_1) = 4 - \frac{1}{3}(5 - 6) > 0 although f_2(x) > f_1(x). Therefore, MSAB is not guaranteed to lead to a large-margin solution for the multiclass classification problem. Moreover, like the other methods, this method uses a fixed similarity function, the RBF function, in its loss formulation.


Another key advantage of our proposed method is the use of boosting to learn the weak similarity functions. The previous methods, like RegBoost, SemiBoost, MSAB, MCSSB [41], and graph-based methods, e.g. LapSVM, use pairwise similarity in the loss function. Most of these methods employ the radial basis function as the similarity measure,

S(x_i, x_j) = e^{-\frac{\|x_i - x_j\|_2^2}{\sigma^2}}    (32)

where σ is the scale parameter controlling the spread of the radial basis function (RBF). The choice of σ has a large impact on the performance of the learning algorithm. However, finding this scale parameter is a difficult problem, and in most cases it is even impossible to tune the σ parameter properly. Typically, all similarity measures have several parameters that should be tuned, e.g. the radial basis function, the Laplacian similarity measure, and the rational quadratic kernel function. Instead of tuning such parameters, our proposed method uses a boosting approach in order to learn the similarity function of (9). There are also several studies on distance/metric learning, e.g. [15-17,28], which are somewhat related to our work. Some of these works consider the learning task as a semi-supervised clustering problem, e.g. [15]; their goal is to improve the performance of the clustering algorithm. These methods often employ the parametric Mahalanobis distance in combination with K-means or EM clustering and a constraint-based approach in order to learn the optimized Mahalanobis function. A new boosting approach is presented in [32] for learning a Mahalanobis distance for supervised learning. The main difficulty in learning a Mahalanobis distance is to ensure that the Mahalanobis matrix remains positive semi-definite. The main advantage of our proposed method for learning a similarity function is that it uses

S^t(x, z; X^t) = e^{-(x-z)^T X^t (x-z)}    (33)

as the similarity function. At each iteration, the algorithm selects X^* such that it most decreases the risk with the optimal step size, which basically leads to minimizing the inconsistency.

7. Experiments

In this section, we perform several experiments to compare the classification performance of MSSBoost to the state-of-the-art semi-supervised methods using several different datasets: synthetic, UCI [10], and real-world text classification datasets. We also set up several experiments to show the impact of using unlabeled data to improve the classification performance. The first experiment is a comparison between CD-MSSB and OCD-MSSB; in this experiment, we also compare these two algorithms to other semi-supervised boosting methods when the weak base learner is a simple Decision Stump. In the second experiment, we compare MSSBoost to several other algorithms. One comparison uses different base learners on the labeled data only, and a second employs several boosting approaches (AdaBoost.MH [31], SAMME [44], and MCBoost) with the same base learner. Next, we compare MSSBoost to supervised learning and to the state-of-the-art semi-supervised learning methods. The purpose of this experiment is to evaluate whether MSSBoost exploits the information from the unlabeled data. We also compare MSSBoost with a version that does not include the similarity learning function. The goal here is to show the impact of the similarity learning function. In this experiment, we use different static similarity functions: 1) the radial basis function (RBF), 2) the Laplacian function, and 3) the rational quadratic function. We further include comparisons to the state-of-the-art semi-supervised boosting algorithms, in particular SemiBoost [23], RegBoost [8], and MSAB, and to graph-based semi-supervised learning methods, like LapSVM [3]. For comparison, we use the one-vs-all approach to handle the multiclass classification problems in RegBoost, SemiBoost, and LapSVM. We perform experiments with three different types of datasets: (i) synthetic datasets, on which we present the main points of MSSBoost, (ii) UCI datasets, from a general machine learning data repository [10], and (iii) real-world text classification data, to show the efficiency of MSSBoost in an application domain.

7.1. Experimental setup

For each dataset, 30% of the data are kept as a test set, and the rest is used as training data. The training data in each experiment are first partitioned into 90% unlabeled data and 10% labeled data, keeping the class proportions in all sets similar to the original dataset. We run each experiment 10 times with different subsets of training and testing data, and report the mean classification accuracy and the standard deviation (std) over the 10 repetitions in the results section. Each experiment includes 20 training iterations for its boosting procedure. The reported results refer to the test set. In our experiments, we use the WEKA implementation of the base classifiers with default parameter settings [13] for all of the compared approaches, SemiBoost, MSAB, OCD-MSSB, CD-MSSB, and MSSBoost, in Java. As base learners we use three multiclass classifiers: a decision tree (J48, the Java implementation of C4.5), Naive Bayes, and a Support Vector Machine (SVM) with the default WEKA setting (polynomial kernel and the sequential minimal optimization algorithm). We also use Decision Stump, a binary classification method, in its default setting. As addressed earlier, SemiBoost and MSAB employ the RBF as similarity function. The RBF similarity function typically requires a tuning parameter σ. For a fair comparison, we tune the σ value on the generated synthetic dataset; the σ value varies between 1 and 5 in this experiment, and the best setting is then used in SemiBoost and MSAB. In LapSVM, we use the default setting as mentioned in [3].

7.2. Synthetic data

We start with a synthetic dataset for which the optimal decision rule is known. This is a three-class problem, with two-dimensional Gaussian classes of means [1, 2], [−1, 0], [2, −1] and covariances [1, 0.5; 0.5, 2], [1, 0.3; 0.3, 1], [0.4, 0.1; 0.1, 0.8], respectively. We randomly sampled 300 data points and kept 30% of them as a separate test set to evaluate the resulting classification model. In the experiments, we use only 10% labeled data. The associated Bayes error rate is 12.81% on the training data and 12.32% on the test data. In this experiment, we employ the J48 classifier as the base learner. As shown in Fig. 1b, our proposed method effectively exploits information from the unlabeled data and improves the classification accuracy of the supervised classifier, the J48 base learner trained only on the labeled data. Another observation is that our method assigns mostly true labels to the unlabeled examples. The performance of MSSBoost is indeed close to the performance of the supervised model trained on the fully labeled data, see Fig. 1c. This comparison shows that the proposed method effectively exploits information from the unlabeled data and improves the classification performance. In order to find a suitable λ value for the experiments, we try different λ values on the synthetic dataset; Fig. 2 shows the results. We also show the changes of the loss function for the different λ values, see Fig. 3. As shown, when λ is 0.1, MSSBoost achieves the best classification performance. Therefore, we set λ to 0.1 in the rest of the experiments in this study.
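The synthetic experiment can be reproduced in outline with the sketch below; the random seed, the equal class sizes, and the unstratified split are our simplifications of the protocol described above.

```python
import numpy as np

def make_synthetic(n=300, seed=0, test_frac=0.30, labeled_frac=0.10):
    """Three-class Gaussian toy data of Section 7.2 plus the 30% test / 10% labeled splits."""
    rng = np.random.default_rng(seed)
    means = [np.array([1.0, 2.0]), np.array([-1.0, 0.0]), np.array([2.0, -1.0])]
    covs = [np.array([[1.0, 0.5], [0.5, 2.0]]),
            np.array([[1.0, 0.3], [0.3, 1.0]]),
            np.array([[0.4, 0.1], [0.1, 0.8]])]
    per = n // 3
    X = np.vstack([rng.multivariate_normal(m, c, per) for m, c in zip(means, covs)])
    y = np.repeat(np.arange(3), per)
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]

    n_test = int(test_frac * len(X))
    X_test, y_test = X[:n_test], y[:n_test]
    X_train, y_train = X[n_test:], y[n_test:]
    n_lab = int(labeled_frac * len(X_train))
    return (X_train[:n_lab], y_train[:n_lab],    # labeled training data
            X_train[n_lab:],                     # unlabeled training data (labels withheld)
            X_test, y_test)
```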


Fig. 1. Plot of the classification performance of MSSBoost on synthetic dataset using J48 classifier as the base learner.

Fig. 2. Sensitivity of MSSBoost to the λ value.

Table 1. Overview of the used UCI datasets.

Dataset     | #Samples | #Attributes | #Classes
Balance     | 625      | 4           | 3
Car         | 1728     | 6           | 4
Cmc         | 1473     | 9           | 3
Dermatology | 366      | 34          | 6
Glass       | 214      | 9           | 6
Iris        | 150      | 4           | 3
Optdigits   | 1409     | 64          | 10
Segment     | 2310     | 19          | 7
Sonar       | 208      | 60          | 2
Vehicle     | 846      | 17          | 4
Vowel       | 990      | 14          | 11
Wave        | 5000     | 21          | 3
Wine        | 178      | 13          | 3
Zoo         | 101      | 17          | 7

7.3. UCI datasets

We use the UCI benchmark machine learning datasets for assessing the proposed algorithm. Table 1 summarizes the specifications of 14 benchmark datasets from the UCI data repository

Fig. 3. The changes of the risk function in terms of different λ value.


Table 2. The classification accuracy and standard deviation of different algorithms with 10% labeled data and Decision Stump as the base learner. OneVsAll, AdaBoost.MH, and MCBoost are supervised; SB (SemiBoost), RB (RegBoost), CD-MSSB, and OCD-MSSB are semi-supervised.

Dataset     | OneVsAll     | AdaBoost.MH  | MCBoost      | SB           | RB           | CD-MSSB      | OCD-MSSB
Balance     | 61.46 ± 3.6  | 77.00 ± 5.6  | 77.60 ± 5.3  | 72.51 ± 6.8  | 73.32 ± 6.1  | 82.03 ± 3.9  | 72.8 ± 6.7
Car         | 70.24 ± 0.0  | 73.27 ± 2.3  | 70.06 ± 2.3  | 70.24 ± 0.0  | 71.98 ± 1.7  | 76.76 ± 2.0  | 70.24 ± 0.0
Cmc         | 45.14 ± 4.2  | 47.72 ± 4.5  | 45.51 ± 3.6  | 48.24 ± 7.1  | 47.95 ± 5.3  | 50.45 ± 1.5  | 48.51 ± 3.1
Dermatology | 83.89 ± 5.3  | 85.59 ± 5.6  | 86.86 ± 3.5  | 90.53 ± 3.1  | 88.54 ± 5.6  | 95.50 ± 2.2  | 86.7 ± 3.6
Glass       | 44.41 ± 4.3  | 46.81 ± 6.1  | 52.45 ± 8.5  | 56.61 ± 3.0  | 53.12 ± 2.3  | 59.19 ± 7.6  | 54.25 ± 4.2
Iris        | 84.22 ± 5.7  | 87.79 ± 7.7  | 87.79 ± 7.7  | 92.85 ± 2.9  | 90.28 ± 3.7  | 95.53 ± 2.2  | 93.75 ± 2.1
Optdigits   | 51.52 ± 2.7  | 75.07 ± 2.5  | 78.31 ± 1.2  | 63.64 ± 4.3  | 70.56 ± 2.4  | 83.94 ± 2.6  | 70.32 ± 3.3
Segment     | 79.76 ± 2.0  | 90.97 ± 0.7  | 91.82 ± 0.8  | 86.03 ± 8.6  | 88.32 ± 7.1  | 92.18 ± 0.5  | 87.23 ± 2.9
Sonar       | 58.21 ± 8.7  | 59.62 ± 9.0  | 58.57 ± 9.1  | 65.44 ± 4.3  | 62.19 ± 6.1  | 70.35 ± 5.2  | 58.21 ± 8.7
Vehicle     | 52.41 ± 2.4  | 61.05 ± 2.3  | 60.92 ± 4.0  | 59.4 ± 4.1   | 58.74 ± 3.8  | 63.98 ± 3.9  | 59.34 ± 3.1
Vowel       | 32.91 ± 3.9  | 41.45 ± 2.3  | 45.37 ± 2.0  | 41.19 ± 5.2  | 40.23 ± 4.3  | 51.58 ± 2.8  | 37.54 ± 2.9
Wave        | 65.08 ± 2.9  | 79.67 ± 1.4  | 79.59 ± 1.3  | 73.08 ± 2.1  | 75.18 ± 3.2  | 81.10 ± 1.0  | 74.81 ± 1.4
Wine        | 73.09 ± 4.5  | 72.22 ± 6.9  | 76.6 ± 9.9   | 95.32 ± 2.1  | 90.56 ± 2.3  | 95.12 ± 3.1  | 76.6 ± 4.1
Zoo         | 76.66 ± 0.0  | 76.66 ± 0.0  | 90 ± 3.6     | 89.44 ± 5.7  | 87.41 ± 4.6  | 96.31 ± 3.9  | 76.66 ± 0.0

Table 3. The classification accuracy and standard deviation of different algorithms with 10% labeled data and J48 as the base learner. J48, SAMME, and MCBoost are supervised; SemiBoost, MSAB, CD-MSSB, and MSSBoost are semi-supervised.

Dataset     | J48          | SAMME        | MCBoost      | SemiBoost    | MSAB         | CD-MSSB      | MSSBoost
Balance     | 65.16 ± 4.4  | 69.98 ± 3.1  | 70.23 ± 3.2  | 75.32 ± 4.1  | 80.01 ± 2.3  | 80.64 ± 2.8  | 81.01 ± 2.1
Car         | 73.99 ± 3.9  | 75.56 ± 2.6  | 70.96 ± 3.5  | 78.6 ± 1.1   | 80.65 ± 1.6  | 84.47 ± 6.1  | 84.9 ± 5.0
Cmc         | 45.48 ± 3.5  | 44.56 ± 4.2  | 45.73 ± 1.7  | 48.46 ± 2.2  | 49.26 ± 2.7  | 49.72 ± 1.5  | 50.01 ± 1.6
Dermatology | 71.18 ± 7.4  | 74.89 ± 5.3  | 84.9 ± 5.8   | 86.68 ± 3.9  | 88.76 ± 2.3  | 93.56 ± 2.3  | 93.15 ± 2.1
Glass       | 50.29 ± 9.4  | 50.94 ± 7.6  | 51.15 ± 7.0  | 60.34 ± 4.2  | 65.45 ± 4.1  | 59.26 ± 5.5  | 62.5 ± 4.1
Iris        | 86.39 ± 6.5  | 86.39 ± 6.5  | 88.72 ± 5.2  | 93.12 ± 5.1  | 94.12 ± 4.6  | 96.94 ± 3.3  | 97.01 ± 3.2
Optdigits   | 63.82 ± 3.2  | 70.01 ± 4.2  | 70.12 ± 2.2  | 74.65 ± 1.3  | 80.21 ± 2.1  | 71.33 ± 2.1  | 78.03 ± 1.9
Segment     | 83.46 ± 1.1  | 84.43 ± 2.3  | 87.72 ± 1.1  | 85.23 ± 3.1  | 84.98 ± 2.1  | 88.73 ± 1.4  | 89.02 ± 1.7
Sonar       | 59.68 ± 8.3  | 58.12 ± 6.9  | 59.19 ± 7.0  | 70.32 ± 5.7  | 72.87 ± 4.9  | 64.35 ± 3.8  | 71.5 ± 3.1
Vehicle     | 54.88 ± 5.3  | 54.12 ± 4.8  | 54.94 ± 4.7  | 55.89 ± 4.6  | 54.12 ± 4.8  | 55.06 ± 6.7  | 56.5 ± 5.5
Vowel       | 38.08 ± 6.9  | 40.13 ± 5.4  | 41.5 ± 3.3   | 42.14 ± 2.9  | 48.31 ± 3.5  | 52.50 ± 0.7  | 51.9 ± 1.1
Wave        | 69.64 ± 1.9  | 73 ± 1.3     | 72.98 ± 1.3  | 76.67 ± 1.1  | 78.34 ± 1.2  | 73.01 ± 1.8  | 79.5 ± 1.3
Wine        | 70.17 ± 6.9  | 70.17 ± 6.9  | 78 ± 4.0     | 92.23 ± 4.1  | 93.41 ± 3.7  | 95.57 ± 5.0  | 94.7 ± 4.3
Zoo         | 78.88 ± 2.7  | 70.54 ± 8.1  | 80.01 ± 6.5  | 87.98 ± 5.1  | 89.56 ± 4.9  | 90.56 ± 7.1  | 91.5 ± 6.0

which are used in our experiments. We selected these datasets because (i) they involve multiclass classification and (ii) they are used in several other studies on semi-supervised learning, for example [23,35,36].

8. Results

The results of the experiments are shown in Tables 2-6. The tables consist of two parts: supervised and semi-supervised. The first part

Table 4. The classification accuracy and standard deviation of different algorithms with 10% labeled data and Naive Bayes as the base learner. NB, SAMME, and MCBoost are supervised; SemiBoost, MSAB, and MSSBoost are semi-supervised.

Dataset     | NB           | SAMME        | MCBoost      | SemiBoost    | MSAB         | MSSBoost
Balance     | 75.88 ± 3.5  | 76.71 ± 2.5  | 78.9 ± 4.6   | 77.36 ± 4.1  | 82.09 ± 3.2  | 84.03 ± 3.1
Car         | 77.5 ± 1.9   | 78.51 ± 0.5  | 78.9 ± 2.1   | 78.66 ± 2.1  | 80.75 ± 1.8  | 82.6 ± 2.1
Cmc         | 46.98 ± 1.4  | 48.05 ± 1.3  | 48.5 ± 1.8   | 49.09 ± 1.4  | 51.11 ± 1.7  | 53.5 ± 1.8
Dermatology | 81.52 ± 7.4  | 81.52 ± 7.4  | 79.5 ± 1.7   | 86.44 ± 3.9  | 90.00 ± 2.4  | 94.3 ± 2.4
Glass       | 50.0 ± 7.0   | 50.21 ± 7.6  | 50.8 ± 6.2   | 56.51 ± 7.1  | 60.08 ± 3.4  | 59.90 ± 2.1
Iris        | 81.25 ± 5.5  | 80.95 ± 6.6  | 84.6 ± 2.5   | 90.47 ± 6.3  | 93.45 ± 4.7  | 94.9 ± 4.2
Optdigits   | 72.04 ± 1.9  | 70.57 ± 1.7  | 73.6 ± 1.0   | 77.95 ± 2.5  | 83.61 ± 2.0  | 83.9 ± 2.6
Segment     | 81.6 ± 1.9   | 82.1 ± 2.1   | 84.8 ± 2.1   | 86.2 ± 2.5   | 86.8 ± 2.01  | 90.8 ± 1.5
Sonar       | 64.41 ± 7.1  | 63.23 ± 5.1  | 58.57 ± 7.7  | 70.91 ± 5.4  | 70.58 ± 6.5  | 70.35 ± 5.9
Vehicle     | 53.2 ± 5.4   | 53.04 ± 4.8  | 54.0 ± 3.2   | 58.1 ± 2.6   | 54.56 ± 2.5  | 62.3 ± 1.9
Vowel       | 47.52 ± 3.4  | 48.04 ± 2.0  | 49.08 ± 3.0  | 44.5 ± 2.2   | 52.53 ± 3.3  | 51.9 ± 2.9
Wave        | 80.44 ± 1.9  | 80.91 ± 3.8  | 80.8 ± 2.4   | 81.83 ± 1.3  | 83.43 ± 0.6  | 82.8 ± 1.2
Wine        | 84.56 ± 5.8  | 84.56 ± 5.8  | 84.56 ± 5.8  | 92.28 ± 2.6  | 95.17 ± 2.9  | 96.9 ± 3.1
Zoo         | 87.77 ± 4.4  | 87.77 ± 4.4  | 89.9 ± 3.7   | 89.52 ± 3.8  | 91.11 ± 3.4  | 94.2 ± 3.8

Table 5. The classification accuracy and standard deviation of different algorithms with 10% labeled data and SVM as the base learner. SVM, AdaBoost.MH, and MCBoost are supervised; SB (SemiBoost), LapSVM, CD-MSSB, and MSSBoost are semi-supervised.

Dataset     | SVM          | AdaBoost.MH  | MCBoost      | SB           | LapSVM       | CD-MSSB      | MSSBoost
Balance     | 82.66 ± 3.2  | 81.23 ± 4.1  | 83.21 ± 4.1  | 83.84 ± 3.4  | 84.32 ± 4.1  | 85.85 ± 0.9  | 86.01 ± 1.1
Car         | 80.92 ± 2.0  | 83.54 ± 2.1  | 85.01 ± 1.6  | 80.00 ± 0.9  | 82.78 ± 1.2  | 86.65 ± 2.1  | 85.9 ± 1.9
Cmc         | 45.12 ± 1.9  | 45.69 ± 2.3  | 45.66 ± 2.7  | 47.72 ± 4.7  | 47.87 ± 3.1  | 48.40 ± 1.1  | 49.4 ± 0.7
Dermatology | 86.54 ± 1.7  | 89.01 ± 3.1  | 91.2 ± 4.2   | 90.67 ± 3.8  | 89.79 ± 2.7  | 95.79 ± 0.7  | 95.6 ± 0.6
Glass       | 47.64 ± 5.2  | 51.72 ± 6.1  | 54.41 ± 7.5  | 52.94 ± 4.6  | 56.8 ± 4.1   | 58.65 ± 4.6  | 60.7 ± 3.9
Iris        | 84.32 ± 6.7  | 84.12 ± 3.2  | 89.88 ± 5.9  | 86.19 ± 9.7  | 93.45 ± 4.5  | 96.06 ± 2.6  | 97.01 ± 1.9
Optdigits   | 89.92 ± 3.0  | 89.16 ± 1.9  | 90.43 ± 2.4  | 88.6 ± 4.6   | 90.1 ± 2.8   | 90.96 ± 3.7  | 91.6 ± 2.3
Segment     | 85.88 ± 4.0  | 86.31 ± 4.5  | 85.73 ± 3.1  | 87.6 ± 4.0   | 88.91 ± 2.4  | 87.96 ± 1.2  | 89.12 ± 1.5
Sonar       | 65.88 ± 4.0  | 64.56 ± 3.5  | 66.73 ± 3.4  | 70 ± 4.6     | 69.1 ± 2.8   | 70.96 ± 3.2  | 71.5 ± 2.3
Vehicle     | 56.5 ± 2.6   | 57.43 ± 2.7  | 58.32 ± 2.9  | 57.50 ± 1.2  | 62.47 ± 2.6  | 60.6 ± 1.4   | 62.3 ± 1.9
Vowel       | 40.52 ± 3.9  | 35.76 ± 2.9  | 33.12 ± 2.5  | 40.50 ± 2.2  | 40.34 ± 2.1  | 36.36 ± 3.4  | 39.4 ± 3.8
Wave        | 80.44 ± 1.4  | 78.54 ± 1.7  | 77.21 ± 0.6  | 81.83 ± 1.3  | 82.89 ± 0.6  | 82.55 ± 0.5  | 82.8 ± 0.9
Wine        | 88.81 ± 4.9  | 87.41 ± 5.7  | 88.6 ± 4.0   | 96.49 ± 3.0  | 93.40 ± 3.2  | 92.61 ± 3.5  | 95.6 ± 3.4
Zoo         | 90.83 ± 2.3  | 90.12 ± 2.8  | 89.54 ± 2.9  | 92.00 ± 3.3  | 91.40 ± 3.8  | 92.01 ± 2.8  | 92.79 ± 1.6

presents the classification accuracy of the supervised learning algorithms using only labeled data. The second part of the tables shows the classification performance of the semi-supervised learning algorithms. The used base learners in the experiments are Decision Stump, J48, Naive Bayes (NB), and SVM. In each table, the best classification performance is boldfaced for each dataset.

CD-MSSB and the supervised Decision Stump base learner

The results of the first experiment are shown in Table 2. In this table, the columns OneVsAll, AdaBoost.MH, and MCBoost show the results of the supervised one-vs-all, AdaBoost.MH, and MCBoost classifiers, respectively, when the base learner is Decision Stump. It also


Table 6. The classification accuracy and standard deviation of different semi-supervised learning algorithms with 10% labeled data and J48 as the base learner.

Dataset     | SB           | MSAB         | MSSB-RBF     | MSSB-Lap     | MSSB-RQK     | MSSBoost
Balance     | 75.32 ± 4.1  | 80.01 ± 2.3  | 80.1 ± 5.4   | 77.3 ± 6.1   | 80.19 ± 2.1  | 81.01 ± 2.1
Car         | 78.6 ± 1.1   | 80.65 ± 1.6  | 82.03 ± 3.3  | 81.19 ± 1.9  | 79.3 ± 2.0   | 84.9 ± 5.0
Cmc         | 48.46 ± 7.1  | 49.26 ± 5.3  | 48.89 ± 4.2  | 47.98 ± 5.3  | 48.17 ± 2.8  | 50.01 ± 1.6
Dermatology | 86.68 ± 3.9  | 88.76 ± 2.3  | 91.5 ± 3.1   | 90.2 ± 4.1   | 91.2 ± 2.1   | 93.15 ± 2.1
Glass       | 60.34 ± 4.2  | 65.45 ± 4.1  | 64.59 ± 4.2  | 63.3 ± 4.2   | 59.2 ± 3.6   | 62.5 ± 5.5
Iris        | 93.12 ± 2.9  | 94.12 ± 3.7  | 95.89 ± 2.2  | 95.1 ± 2.8   | 94.39 ± 2.0  | 97.01 ± 3.2
Optdigits   | 74.65 ± 1.3  | 80.21 ± 2.1  | 84.23 ± 1.8  | 82.08 ± 1.2  | 80.1 ± 1.7   | 78 ± 1.9
Segment     | 85.23 ± 3.1  | 84.98 ± 2.1  | 85.9 ± 1.8   | 87.9 ± 1.9   | 84.76 ± 1.9  | 89.02 ± 1.7
Sonar       | 70.33 ± 5.7  | 72.87 ± 4.9  | 73.9 ± 4.1   | 70.1 ± 2.9   | 72.42 ± 4.1  | 71.5 ± 3.1
Vehicle     | 55.89 ± 4.6  | 54.12 ± 4.8  | 54.22 ± 2.9  | 54.7 ± 3.6   | 54.5 ± 3.8   | 56.05 ± 5.5
Vowel       | 42.14 ± 2.9  | 48.31 ± 3.5  | 50.5 ± 3.2   | 49.7 ± 2.7   | 50.18 ± 4.1  | 51.9 ± 1.1
Wave        | 76.67 ± 1.1  | 78.34 ± 1.2  | 77.9 ± 1.3   | 77.44 ± 1.1  | 76.9 ± 1.7   | 79.5 ± 1.3
Wine        | 92.23 ± 4.1  | 93.41 ± 3.7  | 94.12 ± 2.9  | 94.19 ± 2.6  | 93.52 ± 3.5  | 94.7 ± 4.3
Zoo         | 87.98 ± 5.1  | 89.56 ± 4.9  | 88.55 ± 3.9  | 87.9 ± 3.1   | 88.1 ± 4.2   | 91.5 ± 6.0

shows the results of the semi-supervised SemiBoost (SB), RegBoost (RB), CD-MSSB, and OCD-MSSB. As shown in Table 2, CD-MSSB improves the classification performance of the supervised base classifiers for nearly all the datasets. Using a statistical t-test, we also observe that CD-MSSB significantly improves the performance of the supervised MCBoost on 14 out of 14 datasets. The results also show that the supervised MCBoost gives better performance than the other supervised learning algorithms used in this study. Furthermore, the results indicate that CD-MSSB outperforms the state-of-the-art methods on 13 out of 14 datasets. Another comparison is between CD-MSSB and OCD-MSSB. The results show that in all cases CD-MSSB gives better classification performance than the OCD-MSSB algorithm; however, OCD-MSSB outperforms the supervised base learner on 10 out of 14 datasets. The results also indicate that although OCD-MSSB is considerably faster than CD-MSSB, it may not reach the optimal point using only one inner iteration, which can be the main reason for its lower classification performance.

MSSBoost and supervised J48 learning

The results of the second experiment are shown in Table 3. In this experiment, the J48 classifier is used as the base learner for the boosting algorithms. The columns J48, SAMME, and MCBoost show the results of the supervised J48, SAMME, and MCBoost multiclass boosting classifiers; the remaining columns show the results of the semi-supervised SemiBoost, MSAB, CD-MSSB, and MSSBoost algorithms. As can be seen, in this experiment MSSBoost (MSSB) performs better than the supervised learning algorithms on all of the datasets. Using a statistical t-test, we observe that MSSBoost significantly improves the performance of the supervised MCBoost and J48 base classifiers on 14 out of 14 datasets. Another observation is that MSSBoost outperforms the semi-supervised SemiBoost and MSAB on

MSSBoost and supervised J48 learning

The results of the second experiment are shown in Table 3. In this experiment, the J48 classifier is used as the base learner for the boosting algorithms. The columns J48, SAMME, and MCBoost show the results of the supervised J48, SAMME, and MCBoost multiclass boosting classifiers. The table also shows the results of the semi-supervised SemiBoost, MSAB, CD-MSSB, and MSSBoost algorithms. As can be seen, in this experiment MSSBoost (MSSB) performs better than the supervised learning algorithms on all of the datasets used. Using a statistical t-test, we observe that MSSBoost significantly improves the performance of the supervised MCBoost and J48 base classifiers on 14 out of 14 datasets. Another observation is that MSSBoost outperforms the semi-supervised SemiBoost and MSAB on 11 out of 14 datasets. We further observe that the classification models generated by MSSBoost are relatively more stable than the J48 base classifier, as indicated by the lower standard deviations in classification accuracy.

MSSBoost and supervised Naive Bayes base learner

The results of the third experiment are shown in Table 4. In this table, the columns NB, SAMME, and MCBoost show the results of the supervised Naive Bayes, SAMME, and MCBoost classifiers, respectively, when the base learner is Naive Bayes. The table also shows the results of the semi-supervised SemiBoost, RegBoost, and MSSBoost algorithms. As shown in Table 4, MSSBoost improves the classification performance of the supervised base classifiers for nearly all the datasets. Using a statistical t-test, we also observe that MSSBoost significantly improves the classification performance of the supervised Naive Bayes and MCBoost on 14 out of 14 datasets. We also observe that the supervised MCBoost gives better performance than the other supervised learning algorithms used in this study. Furthermore, the results indicate that MSSBoost outperforms the state-of-the-art methods on 10 out of 14 datasets.

MSSBoost and supervised SVM learning

In the fourth experiment, we employ SVM as the base classifier; the results are shown in Table 5. In order to compare MSSBoost to the state-of-the-art semi-supervised methods, we include SemiBoost and LapSVM in the comparison when the base learner is an SVM classifier. In the supervised columns, we include supervised SVM, AdaBoost.MH, and MCBoost. Note that we use the one-vs-all approach in SemiBoost and LapSVM to handle the multiclass classification problems.
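For reference, a minimal sketch of the one-vs-all reduction used to run a binary learner on a multiclass problem. It assumes scikit-learn's OneVsRestClassifier and SVC on a toy dataset; these are illustrative choices, not the exact baseline implementation used in the experiments.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One binary SVM per class; the class with the largest decision value wins.
ova_svm = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
ova_svm.fit(X_tr, y_tr)
print("one-vs-all SVM accuracy:", ova_svm.score(X_te, y_te))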


Fig. 4. Average Classification Performance of MSSBoost with increasing proportions of labeled data on different datasets using J48 and SVM as the base learners.

Similar to the previous results, we observe that MSSBoost improves the performance of the supervised classifiers on 13 out of 14 datasets. The results also show that MSSBoost outperforms the LapSVM and SemiBoost algorithms. We further perform a set of experiments to show the impact of the similarity learning. We next show the convergence of MSSBoost as well as the changes of the margin cost on labeled data.

Learning optimal similarity function

The goal here is to show the impact of using the optimal similarity learning approach to exploit information from unlabeled data in order to improve the classification performance. Besides the proposed similarity approach in MSSBoost, we employ three different similarity functions in this experiment: the radial basis function (RBF), the Laplacian kernel, and the rational quadratic kernel. We then use these similarity functions in MSSBoost instead of the learned similarity function described in Section 3.2. Table 6 shows the results; J48 is used as the base learner in this experiment.

Table 6
The classification accuracy and standard deviation of different semi-supervised learning algorithms with 10% labeled data and J48 as the base learner.

Dataset      SB           MSAB         MSSB-RBF     MSSB-Lap     MSSB-RQK     MSSBoost
Balance      75.32 ± 4.1  80.01 ± 2.3  80.1 ± 5.4   77.3 ± 6.1   80.19 ± 2.1  81.01 ± 2.1
Car          78.6 ± 1.1   80.65 ± 1.6  82.03 ± 3.3  81.19 ± 1.9  79.3 ± 2.0   84.9 ± 5.0
Cmc          48.46 ± 7.1  49.26 ± 5.3  48.89 ± 4.2  47.98 ± 5.3  48.17 ± 2.8  50.01 ± 1.6
Dermatology  86.68 ± 3.9  88.76 ± 2.3  91.5 ± 3.1   90.2 ± 4.1   91.2 ± 2.1   93.15 ± 2.1
Glass        60.34 ± 4.2  65.45 ± 4.1  64.59 ± 4.2  63.3 ± 4.2   59.2 ± 3.6   62.5 ± 5.5
Iris         93.12 ± 2.9  94.12 ± 3.7  95.89 ± 2.2  95.1 ± 2.8   94.39 ± 2.0  97.01 ± 3.2
Optdigits    74.65 ± 1.3  80.21 ± 2.1  84.23 ± 1.8  82.08 ± 1.2  80.1 ± 1.7   78 ± 1.9
Segment      85.23 ± 3.1  84.98 ± 2.1  85.9 ± 1.8   87.9 ± 1.9   84.76 ± 1.9  89.02 ± 1.7
Sonar        70.33 ± 5.7  72.87 ± 4.9  73.9 ± 4.1   70.1 ± 2.9   72.42 ± 4.1  71.5 ± 3.1
Vehicle      55.89 ± 4.6  54.12 ± 4.8  54.22 ± 2.9  54.7 ± 3.6   54.5 ± 3.8   56.05 ± 5.5
Vowel        42.14 ± 2.9  48.31 ± 3.5  50.5 ± 3.2   49.7 ± 2.7   50.18 ± 4.1  51.9 ± 1.1
Wave         76.67 ± 1.1  78.34 ± 1.2  77.9 ± 1.3   77.44 ± 1.1  76.9 ± 1.7   79.5 ± 1.3
Wine         92.23 ± 4.1  93.41 ± 3.7  94.12 ± 2.9  94.19 ± 2.6  93.52 ± 3.5  94.7 ± 4.3
Zoo          87.98 ± 5.1  89.56 ± 4.9  88.55 ± 3.9  87.9 ± 3.1   88.1 ± 4.2   91.5 ± 6.0

In Table 6, the columns SB, MSAB, MSSB-RBF, MSSB-Lap, and MSSB-RQK give the classification performance of the semi-supervised methods SemiBoost, MSAB, MSSBoost with the RBF similarity measure, MSSBoost with the Laplacian similarity measure, and MSSBoost with the rational quadratic similarity measure, respectively. This experiment aims to show the impact of using different similarity measures on the performance. The results in Table 6 indicate that MSSBoost outperforms SemiBoost and RegBoost, which are the state-of-the-art boosting methods for semi-supervised learning, on 9 out of 14 datasets. Another observation is that MSSBoost with similarity learning outperforms MSSBoost with a fixed similarity function on 10 out of 14 datasets. Note that the results for MSSBoost without similarity learning were obtained by manually tuning the parameters of the related similarity functions, which indeed has a high time cost.
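As a concrete reference, the sketch below computes the three fixed similarity measures compared in Table 6 between labeled and unlabeled points. The bandwidth and scale parameters (gamma, alpha, ell) are illustrative and would have to be tuned per dataset, which is exactly the cost that the learned similarity function avoids.

import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A (labeled) and B (unlabeled)."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

def rbf_similarity(A, B, gamma=1.0):           # S = exp(-gamma * ||x - x~||^2)
    return np.exp(-gamma * pairwise_sq_dists(A, B))

def laplacian_similarity(A, B, gamma=1.0):     # S = exp(-gamma * ||x - x~||_1)
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-gamma * l1)

def rational_quadratic_similarity(A, B, alpha=1.0, ell=1.0):
    # S = (1 + ||x - x~||^2 / (2 * alpha * ell^2))^(-alpha)
    return (1.0 + pairwise_sq_dists(A, B) / (2.0 * alpha * ell ** 2)) ** (-alpha)

X_lab = np.random.RandomState(0).randn(5, 4)    # toy labeled points
X_unl = np.random.RandomState(1).randn(3, 4)    # toy unlabeled points
for name, S in [("RBF", rbf_similarity(X_lab, X_unl)),
                ("Laplacian", laplacian_similarity(X_lab, X_unl)),
                ("Rational quadratic", rational_quadratic_similarity(X_lab, X_unl))]:
    print(name, S.shape, S.min().round(3), S.max().round(3))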


Fig. 5. The margin cost, regularization value, and risk over the boosting iterations for Segment and Iris datasets, when the base learner is J48.

8.1. Different proportions of labeled data

To study the sensitivity of MSSBoost to the number of labeled data points, we run a set of experiments with different proportions of labeled data, varying from 5% to 50%. We compare the supervised MCBoost to the semi-supervised MSSBoost. We expect the difference between the performance of the supervised algorithm and MSSBoost to decrease when more labeled data are available. As in the previous experiments, we use a separate test set to evaluate the performance. Fig. 4 shows the performance on three datasets, Iris, Sonar, and Glass, with two different base classifiers: J48 and SVM. It is observed that with the additional unlabeled data and only 5% labeled data, MSSBoost significantly improves the performance of the base classifiers. Consistent with our hypothesis, MSSBoost improves the performance of the base classifier and performs better than the supervised MCBoost and the base classifiers used.

8.2. Convergence of MSSBoost

In this section, we empirically show how MSSBoost minimizes the margin cost on labeled data, addressed in the first term of (9), and how it minimizes the value of the regularization term on unlabeled data, the second term of (9). We further show the convergence of MSSBoost. Two datasets, Segment and Iris, are chosen in this experiment and J48 is used as the base learner. Fig. 5a and c shows the margin cost. As can be seen, as the iterations progress the classes become more and more separated, because the cost decreases steadily. Fig. 5b and d depicts the same results for the regularization term.


We further demonstrate the changes in the risk function of (9) as new weak classifiers and similarity functions are added through the iterations in Fig. 5e and f. As shown, for the first 10 iterations the value of the risk function falls rapidly, and after that it decreases slowly. We conclude that although MSSBoost still needs more iterations to converge, the new weak classifiers will not significantly change the decision value.
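A minimal sketch of how the two terms of a risk of this general form can be monitored over boosting iterations. The exponential margin cost and the similarity-weighted regularizer below follow the shape of (9), but the constants, the one-hot codewords Y, and the staged score matrices are illustrative stand-ins rather than the paper's exact quantities.

import numpy as np

def risk_terms(F_lab, F_unl, y_lab, S, C1=1.0, C2=1.0):
    """Margin cost on labeled data and similarity-weighted regularizer on
    unlabeled data.  F_lab/F_unl are current ensemble outputs (n_lab x K,
    n_unl x K), y_lab the labeled class indices, S the (n_lab x n_unl)
    similarity matrix.  One-hot codewords stand in for the class codewords."""
    K = F_lab.shape[1]
    Y = np.eye(K)                                   # codeword of class k = e_k
    margin_cost, reg = 0.0, 0.0
    for i, yi in enumerate(y_lab):
        for k in range(K):
            if k == yi:
                continue
            d = Y[yi] - Y[k]
            margin_cost += np.exp(-0.5 * F_lab[i] @ d)
            reg += (S[i] * np.exp(-0.5 * F_unl @ d)).sum()
    return C1 * margin_cost, C2 * reg

# toy monitoring loop: 'boosting' is faked by scaling a fixed score matrix
rng = np.random.RandomState(0)
F_lab, F_unl = rng.randn(6, 3), rng.randn(4, 3)
S = rng.rand(6, 4)
for t in range(1, 6):
    mc, rg = risk_terms(t * F_lab, t * F_unl, y_lab=[0, 1, 2, 0, 1, 2], S=S)
    print(f"iter {t}: margin cost {mc:.3f}, regularization {rg:.3f}")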

8.3. Discussion

Based on our experiments in this study, there are datasets where the proposed algorithms may not significantly improve the classification performance of the base classifiers. As can be seen in Tables 3–6, in these cases the supervised algorithm outperforms all the semi-supervised algorithms; for example, in Table 5 the SVM classifier performs better than the proposed method on the Vowel dataset and outperforms the other methods as well. This kind of result emphasizes that the unlabeled examples are not guaranteed to be useful and to improve the classification performance. Comparing the results across the tables, we observe that in almost all cases CD-MSSB and MSSBoost improve the classification performance of the base learners, and in most cases they outperform the state-of-the-art semi-supervised boosting algorithms.

9. MSSBoost for text classification problem

In this section, we evaluate the performance of the MSSBoost algorithm on the text classification problem using popular text datasets. The specifications of the datasets are summarized in Table 7. Datasets re0 and re1 are derived from Reuters-21578 [22] and tr11, tr31, and tr45 are from TREC [39]. These datasets have been used widely in the text classification literature, such as [14,36]. We took these datasets from [38].

Table 7
Overview of text classification datasets.

Dataset  Source         # Samples  # Words  # Min class size  # Classes
re0      Reuters-21578  1504       2886     11                13
re1      Reuters-21578  1657       3758     10                25
tr11     TREC           414        6429     6                 9
tr31     TREC           927        10128    2                 7
tr45     TREC           690        8261     14                10

We evaluate the classification performance of MSSBoost using J48 as the base classifier. Similar to the previous experiments, we use 10% labeled data and a separate test set, and we run each experiment 10 times. The mean classification accuracy is reported in Table 8.

Table 8
The mean classification accuracy (and standard deviation) of different algorithms with 10% labeled examples and J48 as the base learner.

         Supervised                              Semi-supervised           Fully
Dataset  J48          SAMME        MCBoost       MSAB         MSSBoost     labeled
re0      63.35 ± 1.4  65.43 ± 2.1  68.32 ± 1.2   68.74 ± 3.6  70.2 ± 3.2   73.61 ± 0.7
re1      67.33 ± 2.6  67.18 ± 3.0  64.36 ± 5.0   69.69 ± 1.7  72.6 ± 0.8   79.84 ± 1.2
tr11     69.84 ± 5.6  72.13 ± 4.9  73.02 ± 3.9   77.6 ± 2.6   79.6 ± 2.2   81.8 ± 1.2
tr31     88.43 ± 3.3  91.29 ± 3.5  92.02 ± 8.0   92.22 ± 2.1  93.06 ± 2.1  94.68 ± 1.0
tr45     80.86 ± 3.5  86.73 ± 2.4  86.03 ± 2.2   89.67 ± 2.3  89.2 ± 1.1   93.95 ± 1.2

The results show that MSSBoost improves the performance of the supervised base classifier for nearly all datasets. Using a statistical t-test, we observed that MSSBoost significantly improves the performance of J48 on 5 out of 5 datasets. It also outperforms the other semi-supervised algorithms on 4 out of 5 datasets when the base classifier is J48.

10. Conclusion

In this paper, we proposed two multiclass boosting methods for semi-supervised learning, named CD-MSSB and MSSBoost. Our assumption is that labeled and unlabeled data with high similarity must share the same labels. Therefore, we combine the similarity information between labeled and unlabeled data with the classifier predictions to assign pseudo-labels to the unlabeled examples. We design a new multiclass loss function consisting of the multiclass margin cost on labeled data and a regularization term on unlabeled data. For the regularization term, as in graph-based methods, we define a consistency term between the pairwise similarity and the classifier predictions. It assigns soft labels weighted by the similarity between unlabeled and labeled examples. We then derive weights for labeled and unlabeled data from the loss function and use the boosting framework to derive the algorithms, which aim at minimizing the empirical risk at each boosting iteration. Although the proposed methods improve the classification performance, finding a well-fitted similarity function is tedious. To solve this problem, we further propose a boosting algorithm that learns from weak similarity functions. The final classification model is formed by a combination of the weak classifiers and similarity functions. Our experimental results showed that in almost all cases CD-MSSB and MSSBoost significantly improve the classification performance of the base learners and outperform the state-of-the-art methods for semi-supervised learning. However, there are datasets where CD-MSSB and MSSBoost may not significantly improve the classification performance of the base classifiers. In these cases the supervised algorithm outperforms all the semi-supervised algorithms; for example, see the Vowel dataset in Table 5. This kind of result emphasizes that the unlabeled examples are not guaranteed to be useful and to improve the classification performance.


Acknowledgment

This research was partially supported by a grant from IPM (No. CS1396-4-69). We also thank the anonymous reviewers for their valuable comments.

Appendix A

Regarding the fact that

\[
e^{\sum_i \lambda_i x_i} \le \sum_i \lambda_i e^{x_i},
\tag{A.1}
\]

where $\lambda_i \in [0, 1]$, the first term of Eq. (10) is convex. The second term includes two exponential functions, which are convex; since the multiplication of these two convex functions is convex, the second term in (10) is also convex. Finally, the combination of two convex functions leads to a convex function. Therefore, (10) is a convex function.

Appendix B

We use the functional derivative of (11) along the direction of $g$ at the point $f(x) = f^t(x)$:

\begin{align}
\delta R_s(\hat{Y}, f^t, S)
&= \frac{\partial}{\partial \epsilon}\Bigg[
C_1 \sum_{(x_i, y_i) \in L} L_L\big[y_i, f^t(x_i) + \epsilon g(x_i)\big]
+ C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} S(x_i, \tilde{x}_j)\, L_U\big[y_i, f^t(\tilde{x}_j) + \epsilon g(\tilde{x}_j)\big]
\Bigg]\Bigg|_{\epsilon = 0} \tag{B.1}\\
&= C_1 \sum_{(x_i, y_i) \in L} \sum_k \frac{\partial}{\partial \epsilon}
e^{-\frac{1}{2}\langle f^t(x_i) + \epsilon g(x_i),\, y^i - y^k\rangle}\Big|_{\epsilon = 0}
+ C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k \frac{\partial}{\partial \epsilon}
S(x_i, \tilde{x}_j)\, e^{-\frac{1}{2}\langle f^t(\tilde{x}_j) + \epsilon g(\tilde{x}_j),\, y^i - y^k\rangle}\Big|_{\epsilon = 0} \tag{B.2}\\
&= C_1 \sum_{(x_i, y_i) \in L} \sum_k \frac{\partial}{\partial \epsilon}
\Big[e^{-\frac{1}{2}\epsilon \langle g(x_i),\, y^i - y^k\rangle}\Big]\Big|_{\epsilon = 0}
e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y^k\rangle}
+ C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k \frac{\partial}{\partial \epsilon}
\Big[e^{-\frac{1}{2}\epsilon \langle g(\tilde{x}_j),\, y^i - y^k\rangle}\Big]\Big|_{\epsilon = 0}
S(x_i, \tilde{x}_j)\, e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle} \tag{B.3}\\
&= -\frac{1}{2} C_1 \sum_{(x_i, y_i) \in L} \sum_k \langle g(x_i),\, y^i - y^k\rangle\,
e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y^k\rangle}
- \frac{1}{2} C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k \langle g(\tilde{x}_j),\, y^i - y^k\rangle\,
S(x_i, \tilde{x}_j)\, e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle} \tag{B.4}\\
&= -C_1 \sum_{(x_i, y_i) \in L} \sum_k \frac{1}{2}\langle g(x_i),\, y^i - y^k\rangle\,
e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y^k\rangle}
- \frac{1}{2} C_2 \sum_{\tilde{x}_j \in U} \sum_{(x_i, y_i) \in L} \sum_k
S(x_i, \tilde{x}_j)\, \langle g(\tilde{x}_j),\, y^i - y^k\rangle\,
e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle} \tag{B.5}
\end{align}

Appendix C

We use the functional derivative of (17) along the direction of $A$ at the point $S = S^t$:

\begin{align}
\delta R_s(\hat{Y}, f, S^t)
&= C_1 \sum_{(x_i, y_i) \in L} \sum_k \frac{\partial}{\partial \epsilon}
e^{-\frac{1}{2}\langle f^t(x_i),\, y^i - y^k\rangle}\Big|_{\epsilon = 0}
+ C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k \frac{\partial}{\partial \epsilon}
e^{-(x_i - \tilde{x}_j)^{T}(A^t + \epsilon A)(x_i - \tilde{x}_j)}\,
e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle}\Big|_{\epsilon = 0} \tag{C.1}\\
&= C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k \frac{\partial}{\partial \epsilon}
\Big[e^{-(x_i - \tilde{x}_j)^{T}(A^t + \epsilon A)(x_i - \tilde{x}_j)}\Big]\Big|_{\epsilon = 0}\,
e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle} \tag{C.2}\\
&= -C_2 \sum_{(x_i, y_i) \in L} \sum_{\tilde{x}_j \in U} \sum_k
(x_i - \tilde{x}_j)^{T} A (x_i - \tilde{x}_j)\,
e^{-(x_i - \tilde{x}_j)^{T} A^t (x_i - \tilde{x}_j)}\,
e^{-\frac{1}{2}\langle f^t(\tilde{x}_j),\, y^i - y^k\rangle} \tag{C.3}
\end{align}
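In practice, the coefficients that multiply $g$ in (B.4)–(B.5) act as the per-example, per-class weights used at each boosting round. The sketch below computes such weights under illustrative conventions (one-hot codewords, toy scores); it is a reading aid for the derivation, not the released implementation.

import numpy as np

def boosting_weights(F_lab, F_unl, y_lab, S, C1=1.0, C2=1.0):
    """Per-class weights suggested by the functional gradient (B.4)-(B.5):
    labeled:   wL[i, k]  = 0.5*C1 * exp(-0.5*<F_lab[i], y^i - y^k>)
    unlabeled: wU[j, k]  = 0.5*C2 * sum_i S[i, j] * exp(-0.5*<F_unl[j], y^i - y^k>)
    One-hot codewords stand in for the paper's class codewords."""
    n_lab, K = F_lab.shape
    n_unl = F_unl.shape[0]
    Y = np.eye(K)
    wL = np.zeros((n_lab, K))
    wU = np.zeros((n_unl, K))
    for i, yi in enumerate(y_lab):
        for k in range(K):
            d = Y[yi] - Y[k]
            wL[i, k] = 0.5 * C1 * np.exp(-0.5 * F_lab[i] @ d)
            wU[:, k] += 0.5 * C2 * S[i] * np.exp(-0.5 * F_unl @ d)
    return wL, wU

rng = np.random.RandomState(0)
wL, wU = boosting_weights(rng.randn(6, 3), rng.randn(4, 3),
                          y_lab=[0, 1, 2, 0, 1, 2], S=rng.rand(6, 4))
print(wL.round(2))   # weights on the labeled examples
print(wU.round(2))   # similarity-weighted weights on the unlabeled examples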

References

[1] E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers, J. Mach. Learn. Res. 1 (2001) 113–141.
[2] M.A. Bagheri, G.A. Montazer, E. Kabir, A subspace approach to error correcting output codes, Pattern Recognit. Lett. 34 (2013) 176–184.
[3] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[4] K. Bennett, A. Demiriz, Semi-supervised support vector machines, in: NIPS, 1999, pp. 368–374.
[5] K. Bennett, A. Demiriz, R. Maclin, Exploiting unlabeled data in ensemble methods, in: Proceedings of the ACM SIGKDD Conference, 2002, pp. 289–296.
[6] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ACM, 1998, pp. 92–100.
[7] F. d'Alché-Buc, Y. Grandvalet, C. Ambroise, Semi-supervised marginboost, in: NIPS, 14, 2002, pp. 553–560.
[8] K. Chen, S. Wang, Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions, Pattern Anal. Mach. Intell. 33 (2011) 129–143.
[9] T.G. Dietterich, G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (1995) 263–286.
[10] A. Frank, A. Asuncion, UCI Machine Learning Repository, 2010, http://archive.ics.uci.edu/ml.
[11] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: ICML, 1996, pp. 148–156.
[12] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors), Ann. Stat. 28 (2000) 337–407.
[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The weka data mining software: an update, SIGKDD Explor. Newsl. 11 (2009) 10–18.
[14] E.H. Han, G. Karypis, Centroid-based document classification: analysis and experimental results, in: D. Zighed, J. Komorowski, J. Zytkow (Eds.), Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science, volume 1910, Springer Berlin/Heidelberg, 2000, pp. 116–123.
[15] T. Hertz, A. Bar-Hillel, D. Weinshall, Boosting margin based distance functions for clustering, in: ICML, 2004, p. 50.
[16] A. Hillel, D. Weinshall, Learning distance function by coding similarity, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 65–72.
[17] S. Hoi, W. Liu, S. Chang, Semi-supervised distance metric learning for collaborative image retrieval, in: CVPR, 2008, pp. 1–7.
[18] T. Joachims, Transductive inference for text classification using support vector machines, in: ICML, 1999, pp. 200–209.
[19] T. Joachims, Transductive learning via spectral graph partitioning, in: ICML, 2003, pp. 290–297.
[20] D.P. Kingma, S. Mohamed, D.J. Rezende, M. Welling, Semi-supervised learning with deep generative models, in: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.


[21] N. Lawrence, M. Jordan, Semi-supervised learning via gaussian processes, in: NIPS, 17, 2005, pp. 753–760.
[22] D. Lewis, Reuters-21578 Text Categorization Test Collection Distribution, 1999, http://www.research.att.com/~lewis.
[23] P. Mallapragada, R. Jin, A. Jain, Y. Liu, Semiboost: boosting for semi-supervised learning, Pattern Anal. Mach. Intell. 31 (2009) 2000–2014.
[24] L. Mason, J. Baxter, P. Bartlett, M. Frean, Boosting algorithms as gradient descent in function space, in: NIPS, 1999.
[25] I. Mukherjee, R. Schapire, A theory of multiclass boosting, in: NIPS, 23, 2010, pp. 1714–1722.
[26] I. Mukherjee, R.E. Schapire, A theory of multiclass boosting, J. Mach. Learn. Res. 14 (2013) 437–497.
[27] K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using em, Mach. Learn. 39 (2000) 103–134.
[28] G. Niu, B. Dai, M. Yamada, M. Sugiyama, Information-theoretic semi-supervised metric learning via entropy regularization, Neural Comput. 26 (8) (2014) 1717–1762, doi:10.1162/NECO_a_00614.
[29] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, T. Raiko, Semi-supervised learning with ladder networks, in: Advances in Neural Information Processing Systems, 2015, pp. 3546–3554.
[30] M. Saberian, N. Vasconcelos, Multiclass boosting: theory and algorithms, in: Proceedings of the Neural Information Processing Systems (NIPS), 2011.
[31] R. Schapire, Y. Singer, Boostexter: a boosting-based system for text categorization, Mach. Learn. 39 (2000) 135–168.
[32] C. Shen, J. Kim, L. Wang, A. van den Hengel, Positive semidefinite metric learning with boosting, in: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, Vancouver, British Columbia, Canada, 2009, pp. 1651–1659.
[33] J. Tanha, M.J. Saberian, M. van Someren, Multiclass semi-supervised boosting using similarity learning, in: 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7–10, 2013, pp. 1205–1210.
[34] J. Tanha, M. van Someren, H. Afsarmanesh, Disagreement-based co-training, in: Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2011, pp. 803–810.
[35] J. Tanha, M. van Someren, H. Afsarmanesh, An adaboost algorithm for multiclass semi-supervised learning, in: Proceedings of the 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10–13, 2012, pp. 1116–1121.
[36] J. Tanha, M. van Someren, H. Afsarmanesh, Boosting for multiclass semi-supervised learning, Pattern Recognit. Lett. 37 (2014) 63–77.

[37] J. Tanha, M. van Someren, H. Afsarmanesh, Semi-supervised self-training for decision tree classifiers, Int. J. Mach. Learn. Cybernet. 8 (2017) 355–370.
[38] TextDatasets, Public datasets, 1999, http://tunedit.org.
[39] TREC, Text Retrieval Conference, 1999, http://trec.nist.gov.
[40] I. Triguero, S. García, F. Herrera, Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study, Knowl. Inf. Syst. 42 (2015) 245–284.
[41] H. Valizadegan, R. Jin, A. Jain, Semi-supervised boosting for multi-class classification, in: ECML, 2008, pp. 522–537.
[42] M. Wang, W. Fu, S. Hao, D. Tao, X. Wu, Scalable semi-supervised learning by efficient anchor graph regularization, IEEE Trans. Knowl. Data Eng. 28 (2016) 1864–1877.
[43] D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: NIPS, 16, 2004, pp. 321–328.
[44] J. Zhu, S. Rosset, H. Zou, T. Hastie, Multi-class adaboost, Stat. Interface 2 (2009) 349–360.
[45] X. Zhu, Semi-Supervised Learning Literature Survey, Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
[46] X. Zhu, Semi-Supervised Learning Literature Survey, 2006.
[47] X. Zhu, Z. Ghahramani, J.D. Lafferty, Semi-supervised learning using gaussian fields and harmonic functions, in: ICML, 2003, pp. 912–919.

Jafar Tanha was born in Bonab, Iran. He received the B.Sc. and M.Sc. degrees in computer science from the Amirkabir University of Technology (Polytechnic), Tehran, Iran, in 1999 and 2001, respectively, and the Ph.D. degree in computer science (artificial intelligence) from the University of Amsterdam (UvA), Amsterdam, The Netherlands, in 2013. He joined the INL institute, Leiden, The Netherlands, as a researcher from 2013 to 2015. Since 2015, he has been with the Department of Computer Engineering, Payame-Noor University, Tehran, Iran, where he is an Assistant Professor. He held a lecturing position at the Iran University of Science & Technology, Tehran, Iran, in 2016. His current position is IT manager at Payame-Noor University, Tehran, Iran. His main areas of research interest are machine learning, pattern recognition, data mining, and document analysis. He was a PC member of the 11th International Conference on E-learning (icelet 2017) held in Tehran, Iran.