A General Framework for Co-Training and Its Applications
Weifeng Liu1, Yang Li1, Dapeng Tao2*, Yanjiang Wang1
1 College of Information and Control Engineering, China University of Petroleum (East China), Qingdao 266580, P.R. China
2 School of Information Science and Engineering, Yunnan University, Kunming 650091, Yunnan, P.R. China
Abstract

Co-training is one of the major semi-supervised learning paradigms, in which two classifiers are alternately trained on two distinct views and teach each other by adding their predictions on unlabeled data to the training set of the other view. Co-training can achieve promising performance, especially when there is only a small number of labeled examples. Hence, co-training has received considerable attention, and many variant co-training algorithms have been developed. It is essential and informative to provide a systematic framework for a better understanding of the common properties and differences in these algorithms. In this paper, we propose a general framework for co-training according to the diverse learners constructed in co-training. Specifically, we provide three types of co-training implementations, including co-training on multiple views, co-training on multiple classifiers, and co-training on multiple manifolds. Finally, comprehensive experiments with different methods are conducted on the UCF-iPhone dataset for human action recognition and the USAA dataset for social activity recognition. The experimental results demonstrate the effectiveness of the proposed solutions.
Key Words: Semi-supervised learning; co-training; multi-view; human action recognition; social activity recognition
1. Introduction

Currently, it is easy to obtain a massive amount of multimedia data with the fast growth of personal devices such as smart phones and digital cameras. However, most of these multimedia data are unlabeled, and it is difficult and time-consuming to conduct manual annotation. As a consequence, semi-supervised learning [1][2][3], which attempts to make use of costless and abundant unlabeled data in addition to labeled data to improve performance, has attracted considerable attention. During the past decade, many semi-supervised learning approaches have been developed, such as generative-based methods, graph-based methods, semi-supervised support vector machines (S3VMs) and co-training [5][6][7]. Co-training, first proposed by Blum and Mitchell [5], is one of the most attractive paradigms of semi-supervised learning and is also an important part of multi-view learning [8]-[11]. In recent years, a
great number of co-training variants under different names have been reported and have achieved great success in many applications, such as natural language processing [12][13][14], content-based image retrieval (CBIR) [15]-[19], image classification [20], computer-aided diagnosis [22] and others [21][23][25]. One main requirement of Blum and Mitchell's standard co-training is that the dataset can be described by two sufficient and redundant attribute subsets; namely, each view is sufficient to predict the class perfectly, and the two views are conditionally independent given the class label. For example, web pages [5] can be described by either the text on the web page itself or the text on the hyperlinks pointing to the web page. Standard co-training works in an iterative manner on two distinct feature sets: two classifiers are first trained using the initial labeled data on the two different views, and then each classifier is reinforced by the prediction results of unlabeled data in the other view; the classifiers are iteratively reinforced until a fixed point is reached or some other stopping criterion is met. Specifically, the two classifiers teach each other on the two views to improve the classification performance. In many practical situations, it is not intuitively obvious how to obtain two sufficient and redundant natural feature sets; hence, many co-training variants with other assumptions that guarantee its success have been proposed to relax the sufficiency and redundancy assumptions. Furthermore, many real-world datasets have only a single view instead of two; therefore, some variants of co-training that do not require two views have been developed successively, and some empirical studies have shown that these co-training algorithms still work well. A key to the success of these co-training algorithms is to generate different learners by exploiting different techniques: one learner helps improve the accuracy of the other by providing it with unknown information. By taking advantage of the correlations between the learners, many co-training algorithms show their effectiveness. However, very little work has been performed to bring these methods into a unified framework. As a result, the common essence and differences of these algorithms are not completely clear. Therefore, it is essential and informative to provide a systematic framework for a better understanding of the common properties and differences in these algorithms. In this paper, we present a simple and general framework in which diverse learners are constructed to learn from each other. Specifically, we summarize these approaches in three groups, including (1) learning with multiple views [5][26]-[31][32], (2) learning with multiple classifiers [33][34][35] and (3) learning with multiple manifolds [36]. We also conduct extensive experiments on the UCF-iPhone dataset for human action
recognition and the USAA dataset for social activity recognition, respectively. The experimental results demonstrate the effectiveness of co-training algorithms.
The rest of this paper is organized as follows. Section 2 presents related work. Section 3 then introduces a unified framework for co-training and some theoretical analysis. Section 4 describes the experimental details and some discussion, followed by the conclusion in Section 5.
2. Related work

A key reason for the success of co-training algorithms is that multiple learners are trained with different techniques and their predictions are combined so that the learners teach each other and decrease the classification error. In this section, we review the previous work related to co-training and summarize these approaches in three groups, including (1) learning with multiple views, (2) learning with multiple classifiers and (3) learning with multiple manifolds.
2.1 Co-training on multiple views

In a few special applications, the dataset has natural disjoint subsets of attributes, e.g., web page classification [5]. In most real-world applications, the datasets have only one attribute set as opposed to two. Methods that split features artificially or manually have therefore been developed to take advantage of the interaction between multiple learners. Natural or artificial feature sets are called views. Standard co-training was applied in domains with truly sufficient and independent feature splits. The procedure is simple, and it works as follows. Two classifiers with reasonable performance are first built using the original labeled data on each view separately. Then, in a loop, both classifiers label all of the unlabeled examples, and each classifier takes turns in selecting the most confidently predicted examples and adding them to the training set of the other. Later, both classifiers are refined using the newly added examples provided by the other view. The loop repeats until a fixed point is reached or some other stopping criterion is met. Dasgupta et al. [26] justified the sufficiency and independence assumptions and showed that the co-trained classifiers can make fewer generalization errors by maximizing their agreement over the unlabeled data. However, even when the data has two views in real-world applications, it is rare that the two views are conditionally independent given the class label. Thus, several other assumptions on co-training were proposed
to relax these two powerful assumptions. Abney [27] showed that the conditional independence can be relaxed to weak dependence for co-training to work well, and he presented a new co-training algorithm named the greedy agreement algorithm. Balcan et al. [28] suggested a weaker assumption called ε-expansion, and they theoretically showed that given an appropriately strong PAC-learner on the two different views, the ε-expansion assumption on the underlying data distribution guarantees the success of co-training. However, most real-world datasets only have a single attribute set as opposed to two. To exploit the advantages of co-training, effective methods that do not rely on the existence of two views are needed. A straightforward method is to split the attribute set into two disjoint sets, where the aim is to maximize the disagreement between the two feature subsets, and then conduct standard co-training based on the manually generated views. Nigam and Ghani [29] showed that when attribute sets do not have natural splits but do have sufficient redundancy, co-training on a random division of the feature set may work well; however, many applications are not described by a large number of attributes. Du et al. [30] proposed four simple heuristic splitting methods to split a single view into two views. Unfortunately, their empirical results showed that view splitting is unreliable when the number of labeled examples is small. Chen et al. [31] proposed a novel feature decomposition algorithm named pseudo multi-view co-training (PMC), which automatically divides the features of a single-view data set into two mutually exclusive subsets for co-training to succeed. In addition, Zhou and Li [32] proposed a novel co-training style algorithm called tri-training, in which three different classifiers are trained on bootstrap-sampled labeled examples: the original labeled example set is bootstrap sampled to generate three different training sets, and each training set can be seen as a view. In summary, for natural or artificial feature sets, each view has a unique feature space; co-training can generate learners with a disagreement on multiple views, and then one learner can use the disagreement to provide the other with unknown information and boost the performance of co-training.
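As a concrete illustration of the random-division idea discussed above [29], the following is a minimal, hypothetical sketch (the function and variable names are ours, not from the cited work) of how a single feature set can be split into two disjoint pseudo-views that are then fed to standard two-view co-training:

```python
import numpy as np

def random_feature_split(X, seed=0):
    """Randomly divide a single feature set into two disjoint pseudo-views.

    X is an (n_samples, n_features) array; the two returned matrices use
    complementary halves of the (shuffled) feature indices and can then be
    used as the two views of a standard co-training procedure.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, idx[:half]], X[:, idx[half:]]
```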
2.2 Co-training on multiple classifiers

As described in subsection 2.1, co-training on randomly partitioned views is not always effective, and the effective methods [29][30][31] that tailor the feature sets for standard two-view co-training are limited. Thus, some methods that use a single view without feature splitting have been developed [33][34][35] by designing co-training on multiple classifiers. Goldman and Zhou [33] proposed a co-training algorithm that does not rely on the existence of two views but instead requires different learning algorithms (e.g., a decision tree algorithm) to construct classifiers that can partition the input instance space into a set of equivalence classes. Additionally, they used 10-fold cross-validation to identify the unlabeled examples to label. Later, they extended this single-view method to democratic co-learning [34], which uses three or more learning algorithms to build multiple classifiers. Wang and Zhou [35] presented a new PAC analysis of co-training style algorithms, and their theoretical study showed that if the two learners have large differences, the performance can be improved through the co-training process. If the two initial learners have small differences, the performance can still be improved when the number of labeled examples is small. Moreover, they analyzed the reason why the performance of the co-training process cannot be improved further after a number of rounds, a problem that is often encountered in practical applications of co-training. As a short summary, learning with a single view and two different classification algorithms (e.g., a decision tree algorithm and naïve Bayes) is used to generate two different learners. Different learners have different biases, which is an intuitive explanation of why co-training on multiple classifiers can succeed.
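To make the notion of learner diversity concrete, here is a small, hedged sketch (not taken from the cited papers; the synthetic dataset and names are purely illustrative) that trains two inducers with different biases on the same single-view labeled data and measures their empirical disagreement:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def disagreement(h1, h2, X):
    """Empirical difference d(h1, h2) = Pr[h1(x) != h2(x)] over a sample X."""
    return np.mean(h1.predict(X) != h2.predict(X))

# Synthetic single-view data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_lab, y_lab, X_unlab = X[:50], y[:50], X[50:]

# Two inducers with different biases trained on the same labeled data.
h1 = DecisionTreeClassifier(random_state=0).fit(X_lab, y_lab)
h2 = GaussianNB().fit(X_lab, y_lab)
print("d(h1, h2) =", disagreement(h1, h2, X_unlab))
```

A larger disagreement leaves more room for one learner to supply the other with labels it could not produce itself, which is the quantity that Theorem 1 in Section 3 ties to the achievable error bound.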
2.3 Co-training on multiple manifolds

Manifold regularization tries to explore the geometry of the intrinsic data probability distribution by penalizing the classification function along the implicit manifold. Belkin et al. [37] proposed a manifold regularization framework in reproducing kernel Hilbert space (RKHS). Sindhwani et al. [38] embedded manifold regularization into a semi-supervised kernel defined over the overall input space. Then, Sindhwani et al. [39] proposed a co-regularization framework in which classifiers are learned in each view through forms of multi-view regularization. The co-regularization algorithm formulates co-training as joint complexity regularization between the two hypothesis spaces, each of which contains a predictor approximating the target function. Liu et al. [40] proposed Hessian-regularized co-training, in which Hessian regularization is integrated into the learner training process of each view to boost performance. After that, Li et al. [36] proposed the manifold regularized co-training (Co-Re) framework, which combines co-training and manifold regularization. Different manifolds are integrated into the classifier training process to promote each other and approximate the intrinsic manifold.
Note that co-training on multiple manifolds does not rely on the existence of two views. Different classifiers are generated by two different manifold regularization algorithms on a single view, and the correlation between two manifolds can provide some helpful information and boost the performance.
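For intuition about what a manifold-regularized learner looks like, the following is a hedged sketch of Laplacian-regularized least squares in the spirit of [37]; a Hessian-regularized learner would substitute a Hessian energy matrix for the graph Laplacian L below. It is not the LapSVM/HesSVM implementation used in the experiments, and all parameter names are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

def lap_rls(X, y_lab, l, gamma_A=1e-2, gamma_I=1e-2, n_neighbors=5):
    """Sketch of Laplacian-regularized least squares (manifold regularization).

    X     : all l + u examples (labeled examples first), shape (n, d)
    y_lab : labels (+1 / -1) for the first l examples
    """
    n = X.shape[0]
    K = rbf_kernel(X)                                   # kernel matrix over labeled + unlabeled data
    W = kneighbors_graph(X, n_neighbors, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T).toarray()                       # symmetrized k-NN adjacency
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian approximating the manifold
    J = np.zeros((n, n))
    J[:l, :l] = np.eye(l)                               # selects the labeled part
    y = np.zeros(n)
    y[:l] = y_lab
    # Closed-form expansion coefficients of the manifold-regularized solution.
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    alpha = np.linalg.solve(A, y)
    return lambda X_new: rbf_kernel(X_new, X) @ alpha   # f(x) = sum_i alpha_i k(x, x_i)
```

In the co-training setting of [36], two such learners with different manifold regularizers are trained on the same view and exchange confident predictions exactly as in the general procedure described in Section 3.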
3. A framework and theoretical foundations for co-training
Let $X$ be the input space and $Y = \{0, 1\}$ be the output space. Suppose the data set $X = L \cup U$, where $L = \{(x_i, y_i)\}_{i=1}^{l} \in X \times Y$ is the set of labeled examples and $U = \{x_i\}_{i=l+1}^{l+u} \in X$ is the unlabeled
data set. $l$ and $u$ represent the numbers of labeled and unlabeled training examples, respectively. Assume $f = \{f_1, f_2\}$ is the target function, where $f_1$ and $f_2$ can denote target functions defined over different views, classifiers or manifolds separately. We formalize the framework of co-training as follows. Initially, two classifiers with different target functions ($f_1$ and $f_2$) are built using the original labeled data. Then, each classifier labels all of the unlabeled data on its training set, ranks the examples by the confidence in their predictions and adds several relatively confident examples into the other training set. Later, both classifiers are retrained on the enlarged training sets so that they take into account the newly added (previously unlabeled) data. The loop repeats for a number of iterations until a certain stopping criterion is met. Traditional stopping criteria include the two classifiers approaching convergence and a preset iteration number being reached. For simple implementation, we employ the latter criterion by setting an empirical iteration number. Generally, the two classifiers approach convergence quickly. Hence, the computational complexity is mostly determined by each classifier. Suppose the time costs of the classifiers are $O(n^{s_1})$ and $O(n^{s_2})$, respectively, and the number of iterations is $k$. Then, the total time cost is approximately $kO(n^{s_1}) + kO(n^{s_2})$. Usually, $k \ll n$, and the total computational complexity can be simplified as $O(n^{\max(s_1, s_2)})$. Table I summarizes the procedure of a general co-training algorithm.
TABLE I
A GENERAL FRAMEWORK OF CO-TRAINING

Inputs: training set $X = L \cup U$, where $L$ contains $l$ labeled training examples and $U$ contains $u$ unlabeled training examples.
Outputs: classifier $f = \{f_1, f_2\}$.
Initialize classifiers $f_1$, $f_2$ using the original labeled data in $L$;
Repeat
1. Apply classifier $f_i$ ($i = 1, 2$) to predict labels of the unlabeled training data in $U$;
2. Estimate the labeling confidence of each classifier; choose several relatively confident examples to augment the labeled training set of the other classifier, and form the new training set $X = L' \cup U'$;
3. Update classifiers $f = \{f_1, f_2\}$ using the new training set $X = L' \cup U'$.
Until {certain stopping criterion is met}.
Return classifier $f = \{f_1, f_2\}$.
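A minimal executable sketch of the procedure in TABLE I is given below, assuming scikit-learn-style classifiers that expose predict_proba; the variable names and the SVC example are illustrative rather than the authors' implementation:

```python
import numpy as np
from sklearn.svm import SVC

def co_train(clf1, clf2, X1, X2, y, U1, U2, n_iter=3, n_add=5):
    """Sketch of the general co-training procedure of TABLE I.

    X1, X2 : labeled examples as seen by learner 1 and learner 2 (two distinct
             views, or two copies of a single view); y : their labels.
    U1, U2 : the same unlabeled examples described in the two views.
    """
    L1_X, L1_y = X1.copy(), y.copy()
    L2_X, L2_y = X2.copy(), y.copy()

    for _ in range(n_iter):                           # preset iteration number as stopping criterion
        clf1.fit(L1_X, L1_y)
        clf2.fit(L2_X, L2_y)
        if len(U1) == 0:
            break
        # Each learner predicts labels for the unlabeled pool in its own view
        # and ranks the examples by prediction confidence.
        p1, p2 = clf1.predict_proba(U1), clf2.predict_proba(U2)
        pick1 = np.argsort(p1.max(axis=1))[-n_add:]   # most confident for learner 1
        pick2 = np.argsort(p2.max(axis=1))[-n_add:]   # most confident for learner 2
        # Confident predictions of one learner enlarge the training set of the other.
        L2_X = np.vstack([L2_X, U2[pick1]])
        L2_y = np.concatenate([L2_y, clf1.classes_[p1.argmax(axis=1)[pick1]]])
        L1_X = np.vstack([L1_X, U1[pick2]])
        L1_y = np.concatenate([L1_y, clf2.classes_[p2.argmax(axis=1)[pick2]]])
        # Remove the newly labeled examples from the unlabeled pool in both views.
        keep = np.setdiff1d(np.arange(len(U1)), np.union1d(pick1, pick2))
        U1, U2 = U1[keep], U2[keep]

    # Final refit on the fully enlarged training sets.
    clf1.fit(L1_X, L1_y)
    clf2.fit(L2_X, L2_y)
    return clf1, clf2

# Example (two-view case): co_train(SVC(probability=True), SVC(probability=True),
#                                   view1_lab, view2_lab, y_lab, view1_unlab, view2_unlab)
```

For the two-view case, clf1 and clf2 can be the same inducer applied to the two views; for the single-view cases, X1 and X2 (and U1 and U2) are the same matrix and two different inducers or regularizers supply the required diversity.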
Blum and Mitchell [5] first analyzed the effectiveness of co-training. They supposed that the instance space $X$ can be divided into two different views $X = X_1 \times X_2$, where $X_1$ and $X_2$ correspond to two different views of an example; hence, a training example can be represented by a pair $x = (x_1, x_2) \in X_1 \times X_2$. Specifically, let $D$ denote the distribution over $X$, and $C_1$, $C_2$ denote the concept classes defined over $X_1$ and $X_2$. Due to the sufficiency of the two views, for any example $x = (x_1, x_2)$ observed with label $y$, $f(x) = f_1(x_1) = f_2(x_2) = y$. They defined compatibility between $f$ and $D$: the target function $f = (f_1, f_2) \in C_1 \times C_2$ is considered compatible with $D$ if $D$ assigns probability zero to any instance $(x_1, x_2)$ such that $f_1(x_1) \neq f_2(x_2)$, i.e., $\Pr_D[(x_1, x_2): f_1(x_1) \neq f_2(x_2)] = 0$. They proved the theorem that "if $C_2$ is learnable in the PAC model with classification noise, and if the conditional independence assumption is satisfied, then $(C_1, C_2)$ is learnable in the co-training model from unlabeled data only, given an initial weakly useful predictor $h(x_1)$". This is a very strong conclusion, which implies that if the two assumptions are satisfied and the target class is learnable from random classification noise, then the predictive accuracy of an initial weak learner can be boosted arbitrarily high using only unlabeled examples by co-training. Later, some theoretical analysis [7][13] was performed to relax the two remarkably powerful and easily violated assumptions. The theoretical details are not covered in this section.
For co-training on a single view, many approaches have empirically verified the effectiveness. Wang and Zhou [35] presented a theoretical analysis that can explain why co-training without two views can succeed. Let $H: X \rightarrow Y$ be the hypothesis space and assume that $H$ is finite. Suppose $D$ is generated by a ground truth $h^* \in H$, and we can obtain a classifier $h^i \in H$. Let $d(h^i, h^*)$ denote the difference between the two classifiers $h^i$ and $h^*$; then $d(h^i, h^*) = \Pr_{x \in D}[h^i(x) \neq h^*(x)]$, and $h_1^i$ and $h_2^i$ denote the classifiers in the $i$-th round of the iterative co-training process. The main result is shown in Theorem 1.
Theorem 1. Given the initial labeled data set $L$, which is clean, and assuming that the size of $L$ is sufficient to learn two classifiers $h_1^0$ and $h_2^0$ whose upper bounds of generalization error are $a_0 < 0.5$ and $b_0 < 0.5$, respectively, i.e., $l \geq \max\left[\frac{1}{a_0}\ln\frac{|H|}{\delta}, \frac{1}{b_0}\ln\frac{|H|}{\delta}\right]$. Then, $h_1^0$ selects $u$ number of unlabeled instances from $U$ to label and puts them into $\sigma_2$, from which $h_2^1$ is trained by minimizing the empirical risk. If $lb_0 \leq e^{M}\sqrt{M!} - M$, then $\Pr_D\left[d(h_2^1, h^*) \geq b_1\right] \leq \delta$, where $M = ua_0$ and $b_1 = \max\left[\frac{lb_0 + ua_0 - u\,d(h_1^0, h_2^1)}{l},\, 0\right]$.
Theorem 1 suggests that the difference between the two classifiers is the key for successful co-training, which also explains the reason why co-training algorithms with a single view can work well. It is worth noting that the theorem does not assume that the data has two specific views. In other words, the existence of two views is a sufficient condition instead of a necessary condition for co-training. When there are two specific views, two classifiers are trained from the two view features and then boost the performance.
When there is only a single view, it will also succeed by employing two diverse learners that are constructed on different classification algorithms or on different manifolds. Therefore, we can construct successful co-training algorithms by only considering the diversity of two learners with different techniques. In the above subsections, we summarize three types of co-training algorithms based on the different techniques used, which are (1) using two views [5][26]-[28] or splitting feature subsets [29]-[31] (co-training on multiple views), (2) using different supervised learning algorithms [33][34][35](co-training on multiple classifiers) and (3) employing different manifold regularization algorithms [36] (co-training on multiple manifolds). Then, in the following section, we implement these three types of co-training algorithms.
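As a small numerical illustration of the bound in Theorem 1 (the values below are chosen arbitrarily for illustration and do not come from the paper), take $l = 50$, $u = 100$, $a_0 = 0.2$, $b_0 = 0.3$ and a measured difference $d(h_1^0, h_2^1) = 0.25$:
\[
b_1 = \max\!\left[\frac{lb_0 + ua_0 - u\,d(h_1^0, h_2^1)}{l},\, 0\right]
    = \max\!\left[\frac{50(0.3) + 100(0.2) - 100(0.25)}{50},\, 0\right]
    = \frac{10}{50} = 0.2 ,
\]
which is smaller than the initial bound $b_0 = 0.3$; the larger the difference $d(h_1^0, h_2^1)$ between the two learners, the smaller $b_1$ becomes, consistent with the discussion above.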
4. Experiments

We conducted experiments for the mentioned co-training variants on the UCF-iPhone dataset for human action recognition and the USAA dataset for social activity recognition [41]-[43][44]. In particular, we verified the performance of standard co-training (Std-Co) on two views to show co-training on multiple views. Then, we ran experiments on a single view using two different classifiers (Co-Cla) to show the effect of co-training on multiple classifiers. Finally, we adopted the results of Co-Re [36] to illustrate the effectiveness of co-training on multiple manifolds.
4.1 Dataset

4.1.1 UCF-iPhone dataset

The UCF-iPhone dataset is provided by the University of Central Florida [4]. Ten subjects were asked to attach an Apple iPhone 4 smartphone to the right-hand side of their belts and to perform a series of aerobic exercises. Nine actions (Biking, Climbing, Descending, Exercise biking, Jump Roping, Running, Standing, Treadmill Walking, and Walking) were recorded by the inertial measurement unit (IMU) on the smartphone. Each action was repeated five times, and the IMU (60 Hz) simultaneously recorded instantaneous 3D acceleration (accelerometer), angular velocity (gyroscope), and orientation (magnetometer). Each sample was then manually trimmed to 500 samples (8.33 s). Several typical time series of different actions are shown in Fig. 1. From Fig. 1, we can see that the sensor signals of jumping and running change acutely and periodically, while the signal of standing remains stable except for some noise. It is worth noting that the signal from the magnetometer varies smoothly for all actions, and the experiments in [4] show that the data from the magnetometer is not useful.

Fig. 1. Time series of different actions. From left to right: the time series of jumping, running, and standing.

4.1.2 USAA dataset

The USAA dataset is short for the Unstructured Social Activity Attribute dataset [45]. This dataset is a subset of the CCV database [46] and contains videos of eight different semantic classes, which are home videos of social occasions: birthday party, graduation party, music performance, non-music performance, parade, wedding ceremony, wedding dance and wedding reception. It has tagging features, which are 69 ground-truth attributes, and visual features, which are low-level features concatenating SIFT, STIP and MFCC [46][47][48]. In our experiments, we only use the visual features.
4.2 Experimental setup

For the Std-Co and Co-Cla experiments, we used the UCF-iPhone dataset. We randomly selected six actions (Biking, Climbing, Jump Roping, Running, Treadmill Walking, Walking) to conduct experiments. In our experiments, any two of the six classes were selected to evaluate performance, resulting in a total of 15 one-vs.-one binary classification experiments. Because the data from the magnetometer is not useful, for the Std-Co experiments we only considered the data from the accelerometer and gyroscope, which can be seen as two sufficient natural views. For the Co-Cla experiments, we selected the data from the gyroscope as the single view. In each experiment, the number of labeled examples was selected from the candidate set {1, 5, 10, …, 45, 50}. For the Co-Re experiments [36], we used the USAA dataset, and visual features of all eight classes were utilized to conduct one-vs.-one binary classification experiments. In each experiment, we successively selected one, five, and ten training instances as labeled data and the remaining instances as unlabeled data. For all experiments, the most confident examples were chosen to enlarge the training set during each iterative round, and
the number of iterations was set to three. To examine the robustness of the different methods, each experiment was repeated for 10 runs, and the average results were recorded. We used the average precision (AP) for each action and the mean average precision (mAP) over all activities as assessment criteria. Error bars, which show the confidence intervals of the data, are used to report AP, and box plots, which are composed of five conventionally used values (the extremes, the upper and lower hinges, and the median), are used to reflect mAP. Additionally, to reflect the overall recognition results, we analyzed the confusion matrices of all methods.
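A rough sketch of this evaluation protocol is given below (run_experiment and its outputs are placeholders standing in for one random split plus classifier training; this is not the authors' code):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(run_experiment, n_runs=10):
    """Repeat an experiment and report the per-action AP and the overall mAP.

    run_experiment is assumed to return (y_true, y_score), two arrays of
    shape (n_test_samples, n_actions), for one random labeled/unlabeled split.
    """
    ap_per_run = []
    for _ in range(n_runs):
        y_true, y_score = run_experiment()
        ap = [average_precision_score(y_true[:, a], y_score[:, a])
              for a in range(y_true.shape[1])]
        ap_per_run.append(ap)
    ap_per_run = np.asarray(ap_per_run)
    return ap_per_run.mean(axis=0), ap_per_run.mean()  # per-action AP, mAP
```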
4.3 Experimental results

4.3.1 Results of co-training on multiple views

First, we ran the Std-Co experiments on the two views (data from the accelerometer and gyroscope in the UCF-iPhone dataset) with the same learning algorithm. Here, we used the support vector machine (SVM) algorithm to train the classifiers. At the same time, learners trained on the different single views by the same algorithm (SVM) were employed for comparison. The learner trained on the accelerometer data is called View1, and the learner trained on the gyroscope data is called View2. The AP, mAP and confusion matrix of the different methods for the six actions are shown in Fig. 2, Fig. 3 and Fig. 4, respectively. Fig. 2 shows the AP of the different methods with different sizes of labeled examples. Each subfigure represents the performance curves of a particular category from Climbing, Running, Jump Roping, Walking, Biking, and Treadmill Walking. The x-axis shows the number of labeled examples, from 1 to 50 with interval 5, and the y-axis reflects the average precision. The red line shows the results of the Std-Co experiments, and the two dotted lines in other colors show the performance of the two comparison methods. The figure shows that the performance of Std-Co is considerably better than that of the learners on a single view in most cases. Fig. 3 illustrates the mAP boxplots of the different methods, with each subfigure corresponding to one case of labeled examples. The x-axis shows the different methods from left to right (View1, View2, and Std-Co), and the y-axis reflects the mean average precision. The mAP performance also shows that Std-Co performs better than the learners on the different single views in all cases. Fig. 4 shows the confusion matrices of the different methods on the six actions, and each row corresponds to the overall recognition results of the different methods with a particular number of labeled examples. Here, we show the confusion matrices of the different methods with 1, 25, and 50 labeled examples. For each subfigure, the labels of the abscissa and ordinate are the names of the six actions in the UCF-iPhone dataset. The values on the main diagonal represent the average recognition accuracy for the different actions, and the element in row i and column j shows the error rate of action i incorrectly recognized as action j. As shown in all confusion matrices, Jumping and Running are usually confused with each other. This result is reasonable because the raw signals for Jumping and Running are indeed similar in Fig. 1. From Fig. 4, we can see that Std-Co improves the performance significantly.
Fig. 2. The average precision of different methods for six actions. Each subfigure corresponds to one action class in the UCF-iPhone dataset.
Fig. 3. The mAP performance of different methods, with each subfigure corresponding to one case of labeled examples from 1 to 50 with interval 5.
Fig. 4. The confusion matrix of different methods on six actions. Each row corresponds to the overall recognition results of different methods with a particular number of labeled examples.

4.3.2 Results of co-training on multiple classifiers

Then, we conducted the Co-Cla experiments with two different classifiers on a single view. Here, we only considered the gyroscope view in the UCF-iPhone dataset, and the SVM and least squares (LS) classification algorithms were employed to construct the different classifiers. In addition, we used learners trained by SVM and LS on the same view as baseline methods. Similarly, the AP, mAP and confusion matrix of the different methods on the six actions are shown in Fig. 5, Fig. 6 and Fig. 7, respectively. Fig. 5 shows the AP of the different methods with different sizes of labeled examples. Each subfigure represents one activity class. The x-axis is the number of labeled examples, from 1 to 50 with interval 5, and the y-axis is the average precision. The red line shows the experimental results of Co-Cla, and the other two dotted lines show the performance of the comparison methods. From Fig. 5, we can see that the performance of Co-Cla is better than that of the baseline learners except in very few cases. Fig. 6 shows the mAP boxplots of the different methods. Each subfigure corresponds to the performance of the three methods with a certain number of labeled examples. The x-axis shows the different methods from left to right (SVM, LS, and Co-Cla), and the y-axis reflects the mean average precision. The mAP performance shows that in most cases, Co-Cla is superior to the baseline learners trained only on the initial labeled data. Fig. 7 shows the confusion matrices of the different methods on the six actions, and each row corresponds to the performance of the three methods with a given number of labeled examples. From all subfigures, we can see that when only a single natural feature set is available, Co-Cla is still beneficial compared with the baseline methods in the application of human action recognition.
Fig. 5. The average precision of different methods for the six actions. Each subfigure corresponds to one action class in the UCF-iPhone dataset.
Fig. 6. The mAP boxplots for different methods, with each subfigure corresponding to one case of labeled examples. Each subfigure shows the mAP for different classifier inducers, from left to right: SVM, LS, and Co-Cla.
Fig. 7. The confusion matrix of different methods on six actions. Each row corresponds to the overall recognition results of different methods with a particular number of labeled examples.

4.3.3 Results of co-training on multiple manifolds

Finally, we quoted the results of Co-Re that we previously obtained on the USAA visual features [36]. Two manifold-regularized semi-supervised learning algorithms were used to collaborate during the Co-Re process. Here, we used the Laplacian-regularized SVM (LapSVM) and Hessian-regularized SVM (HesSVM) methods to train the classifiers. At the same time, learners trained by LapSVM and HesSVM on the same view were used as baseline classifiers. The AP and mAP of the different methods on the eight social activities are shown in Fig. 8 and Fig. 9, respectively. Fig. 8 illustrates the AP of the different methods for the eight activity classes as the number of labeled training examples increases. Each subfigure represents one activity class in the dataset. The red line shows the experimental results of Co-Re, and the other two dotted lines show the performance of the comparison methods on a single view. As shown in Fig. 8, in most cases the performance of Co-Re is superior or at least comparable to that of the two baseline algorithms, especially when the number of labeled instances is small.
Fig. 9 reports the mAP boxplots for the different methods, with each subfigure corresponding to one case of labeled examples. As shown in Fig. 9, the gain of Co-Re over the other comparison algorithms is relatively remarkable when there are few examples in the labeled training set L.
Fig. 8. The AP of different methods for the eight activity classes. Each subfigure corresponds to one activity class in the USAA dataset.
Fig. 9. The mAP boxplots for different methods on eight classes. Each subfigure corresponds to one case of labeled examples. The recall changes as the number of labeled examples increases. For each subfigure, the methods from left to right are LapSVM, HesSVM and Co-Re.
5. Conclusions

During the past decade, a large number of co-training algorithms have been proposed, and many theoretical analyses have been reported. However, it is more informative to provide a systematic framework for understanding the common properties and differences in these algorithms. In this paper, we unify co-training into a simple and general framework that can summarize the existing co-training algorithms. Specifically, diverse learners are constructed, and the correlation between the learners ensures the success of co-training. We further divide co-training into three categories according to the different techniques used. Additionally, comprehensive experiments on several popular datasets are conducted to demonstrate the effectiveness of the proposed implementations.
Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (Grant No. 61301242, 61271407), the Guangdong Natural Science Funds (Grant No. 2014A030310252), the Shenzhen Technology Project (Grant No. JCYJ20140901003939001) and the Fundamental Research Funds for the Central Universities, China University of Petroleum (East China) (Grant No. YCX201549).
References [1] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, Cambridge, MIT Press, 2006. [2] X. Zhu, Semi-supervised learning literature survey, Computer Science, 2006. [3] A. Iosifidis, A. Tefas, I. Pitas, Regularized extreme learning machine for multi-view semi-supervised action recognition, Neurocomputing, 145: 250-262, 2014. [4] C. McCall, K. Reddy and M. Shah, Macro-class selection for hierarchical K-NN classification of inertial sensor data, in: Proceedings 2nd Int. Conf. Pervasive and Embedded Computing and Communication Systems, PECCS 2012, 2012, pp. 106–114. [5] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proc. 11th Ann. Conf. Comput. Learn. Theory, pp. Madison, WI, 1998, pp. 92–100. [6] S. Wang, L. Wu, L. Jiao, H. Liu, Improve the performance of co-training by committee with refinement of class probability estimations, Neurocomputing, 136 (2014) 30-40. [7] Y. Ren, Y. Wu, Y. Ge, A co-training algorithm for EEG classification with biomimetic pattern recognition and sparse representation, Neurocomputing, 137 (2014) 212-222. [8] A. Iosifidis, A. Tefas, I. Pita, Regularized extreme learning machine for multi-view semi-supervised action recognition. Neurocomputing, 2014, 145: 250-262.. [9] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, IEEE Transactions on Image Processing (TIP),23(5): 2019-2032, 2014. [10] J. Yu, D. Liu, D. Tao, et al. Complex object correspondence construction in two-dimensional animation, IEEE Transactions on Image Processing, 2011, 20(11): 3257-3269. [11] C. Xu, D. Tao, and C. Xu, Large-Margin multi-view information bottleneck, IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 36, no. 8, pp. 1559-1572, August 2014.
[12] D. Pierce and C. Cardie, Limitations of co-training for natural language learning from large data sets, in: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 1-9, 2001 [13] Z.-J Zha, L. Yang, T. Mei, et al., Visual query suggestion: towards capturing user intent in internet image search, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMMCAP), vol. 6(3), Article No. 13, 2010. [14] M. Steedman, M. Osborne, A. Sarkar, et al., Bootstrapping statistical parsers from small data sets, in: Proceedings of 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 331-338, 2003. [15] D. Tao, X. Tang, X. Li and X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 28, no.7, pp. 1088-1099, July 2006.. [16] Y. Wu, Q. Tian, T.S. Huang, Discriminant EM algorithm with application to image retrieval, in: Proceedings of the IEEE international conference on computer vision and pattern recognition, Hilton Head, SC, pp 222–227, 2000. [17] Z.-J Zha, M. Wang, Y.-T. Zheng, Y. Yang, and et al., Interactive video indexing with statistical active learning, IEEE Transactions on Multimedia, vol. 14(1), pp.17-27, 2012. [18] Z.-J Zha, H. Zhang, M. Wang and et al., Detecting group activities with multi-camera context, IEEE Transactions on Circuits and Systems for Video Technologies, vol. 2(5), pp. 856-869, 2013. [19] Z.-H. Zhou, Learning with unlabeled data and its application to image retrieval, in: Proceedings of the 9th international conference on Artificial Intelligence, Guilin, China, pp 5–10, 2006. [20] J. Yu, M. Wang, and D. Tao, Semi-supervised multiview distance metric learning for cartoon synthesis, IEEE Transactions on Image Processing, 21(11): 4636-4648, 2012. [21] M. Li, H. Li, Z.-H. Zhou, Semi-supervised document retrieval, Information Processing & Management, vol.45, no.3, pp. 341-355, 2009. [22] M. Li, Z.-H. Zhou, Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples, IEEE Tran. Systems, Man and Cybernetics (SMC), Part A: Systems and Humans, 2007, pp.1088-1098. [23] Z.-J Zha, L. Yang, T. Mei, et al., Visual query suggestion: towards capturing user intent in internet image search, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMMCAP): 6(3), Article No. 13, 2010. [24] Z-J Zha, L. Yang, T. Mei, M. Wang, Z. Wang, Visual query suggestion, in: Proceedings of the 17th ACM international conference on Multimedia. 2009, pp. 15-24. [25] J. Yu and D. Tao, Modern machine learning techniques and their applications in cartoon animation research, John Wiley & Sons, 2013. [26] S. Dasgupta, M. Littman, and D. McAllester, PAC generalization bounds or co-training, Advances in Neural
Information Processing Systems 14, MIT Press, 2002, pp. 375–382. [27] S. Abney, Bootstrapping, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 360-367. [28] M.-F. Balcan, A. Blum, and K. Yang, Co-training and expansion: Towards bridging theory and practice, Advances in neural information processing systems, 2004,pp. 89-96. [29] K. Nigam, R. Ghani. Analyzing the effectiveness and applicability of co-training, in: Proceedings of the 9th international conference on Information and knowledge management, ACM, 2000, pp. 86-93. [30] J. Du, C. X. Ling, Z.-H. Zhou, When does co-training work in real data? IEEE Transactions on Knowledge Discovery and Data Mining, vol. 23, no. 5, pp. 596-603, 2009. [31] M.Chen, Y.Chen, and K. Q.Weinberger, Automatic feature decomposition for single view co-training, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 953-960, 2011. [32] Z.-H. Zhou and M. Li, Tri-training: Exploiting unlabeled data using three classifiers, IEEE Trans on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1529-1541, 2005. [33] S. Goldman and Y. Zhou, Enhancing supervised learning with unlabeled data, in: Proceedings of the 28th International Conference on Machine Learning, pp. 953-960, 2011. [34] Y. Zhou, S. Goldman, Democratic co-learning, in: Proceedings of the 16th IEEE international conference on tools with artificial intelligence, pp 594–602, 2004. [35] W. Wang, Z.-H. Zhou, Analyzing co-training style algorithms, in: Proceedings of the 18th European conference on machine learning, pp 454–465, 2007. [36] Y. Li, D. P. Tao, W.F. Liu, Y. J. Wang,
Manifold regularization for classification, in: Proceedings of the
International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), 2014, pp 218-222. [37] M. Belkin, P. Niyogi, and V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research, vol. 7, pp. 2399-2434, 2006. [38] V. Sindhwani, P. Niyogi, M.Belkin, Beyond the point cloud: From transductive to semi-supervised learning, in: Proceedings of the 22nd international conference on machine learning, pp 824–831, 2005. [39] V. Sindhwani, P. Niyogi, M.Belkin, A co-regularization approach to semi-supervised learning with multiple views, in: Proceedings of ICML workshop on learning with multiple views, pp. 74-79, 2005. [40] Liu W F, Li Y, Lin X, Tao D, Wang Y J. Hessian-Regularized Co-Training for Social Activity Recognition[J]. PLOS ONE 9(9),2014. [41] C. Xu, D. Tao, and C. Xu, Multi-view intact space learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015. [42] D. Tao, X. Li, X. Wu, et al. Geometric mean for subspace selection, IEEE Transactions on Pattern Analysis and
Machine Intelligence (T-PAMI), vol.31, no.2, pp. 260-274, February 2009. [43] D. Tao, X. Li, X. Wu and S.J. Maybank, General tensor discriminant analysis and gabor features for gait recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1700-1715, October 2007. [44] D. Zhao, L. Shao, X. Zhen, Y. Liu, Combining appearance and structural features for human action recognition, Neurocomputing, 113 (2013) 88-96. [45] Y. Fu, T. M. Hospedales, T. Xiang, et al., Attribute learning for understanding unstructured social activity, Computer Vision ECCV, pp. 530-543, 2012. [46] Y. G. Jiang, G. Ye, S. F. Chang, et al., Consumer video understanding: A benchmark database and an evaluation of human and machine performance, in: Proceedings of the 1st ACM International Conference on Multimedia Retrieval, 2011. [47] G. Zhu, Q. Wang, Y. Yuan, P. Yan, SIFT on manifold: An intrinsic description, Neurocomputing, 113 (2013) 227233. [48] Y. Ming, Hand fine-motion recognition based on 3D Mesh MoSIFT feature descriptor, Neurocomputing, 151, Part 2 (2015) 574-582.
Weifeng Liu received the double B.S. degree in automation and business administration and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2002 and 2007, respectively. He is currently an Associate Professor with the College of Information and Control Engineering, China University of Petroleum (East China), China. His current research interests include computer vision, pattern recognition, and machine learning.
Yang Li received the B.S. degree in electronic and information engineering from Yantai University in 2014. She is currently pursuing her master's degree in electronic and information engineering. Her research focuses on the applications of co-training algorithms in computer vision.
Dapeng Tao received the B.E. degree from Northwestern Polytechnical University and the Ph.D. degree from South China University of Technology. He is currently with the School of Information Science and Engineering, Yunnan University, Kunming, China, as an engineer. He has authored and co-authored more than 30 scientific articles. He has served as a reviewer for more than 10 international journals, including IEEE TNNLS, IEEE TMM, IEEE SPL, and PLOS ONE. His research interests include machine learning, computer vision and cloud computing.
Yanjiang Wang received his Ph.D. degree from Beijing Jiaotong University in 2001. He is currently a professor with the College of Information and Control Engineering, China University of Petroleum (East China), China. His research focuses on intelligent information processing, computer vision and pattern recognition.