Knowledge-Based Systems 23 (2010) 563–571
A classification algorithm based on local cluster centers with a few labeled training examples

Tianqiang Huang a,b,*, Yangqiang Yu a, Gongde Guo a, Kai Li a

a Department of Computer Science, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
ARTICLE INFO

Article history: Received 4 June 2009; Received in revised form 23 February 2010; Accepted 26 March 2010; Available online 29 March 2010.

Keywords: Classification learning; Supervised clustering; Semi-supervised learning

ABSTRACT

Semi-supervised learning techniques, such as co-training paradigms, have been proposed to deal with data sets that contain only a few labeled examples. However, the family of co-training paradigms, such as Tri-training and Co-Forest, is likely to mislabel an unlabeled example and thus downgrade the final performance. Moreover, in practical applications the labeling process is not always free of error, for subjective reasons: even the few labeled examples given may contain mislabeled ones. Supervised clustering provides many benefits in data mining research, but it is generally ineffective with only a few labeled examples. In this paper, a Classification algorithm based on Local Cluster Centers (CLCC) for data sets with a few labeled training examples is proposed. It can reduce the interference of mislabeled data, whether introduced by domain experts or by co-training paradigm algorithms. Experimental results on UCI data sets show that CLCC achieves competitive classification accuracy compared with traditional and state-of-the-art algorithms such as SMO, AdaBoost, RandomTree, RandomForest, and Co-Forest.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

In many practical applications, obtaining a large number of labeled training examples is difficult because labeling is expensive. In contrast, unlabeled training examples can be obtained easily and cheaply in large quantities. Semi-supervised learning has therefore been proposed to combine the few labeled examples with the vast unlabeled ones when extracting knowledge from data sets. A typical semi-supervised learning approach is the family of co-training paradigms, which includes Co-training [2], Tri-training [18], and Co-Forest [11]. Co-training is an attractive semi-supervised learning paradigm that trains two classifiers, which label unlabeled examples for each other. The original Co-training requires the data to be describable by two sufficient and redundant attribute subsets, each of which is sufficient for learning and conditionally independent of the other given the class label. Tri-training and Co-Forest are extensions of Co-training that use three classifiers and multiple classifiers, respectively, instead of two. The problem, however, is that in the process of labeling the unlabeled examples, these algorithms are likely to assign a wrong label to an unlabeled example, which affects the final classification performance. Domain experts may likewise commit mistakes when labeling the
* Corresponding author at: Department of Computer Science, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China. Tel.: +86 13665040506. E-mail address: [email protected] (T. Huang).
doi:10.1016/j.knosys.2010.03.015
original examples. Consider the few mislabeled examples in the original labeled data set in Fig. 1. Running the Co-Forest algorithm on this set together with some unlabeled examples yields the additional labeled examples in Fig. 2. As Fig. 2 shows, Co-Forest produces additional objects with wrong labels because of the original mislabeled examples. Once mislabeled examples are used to guide the learning process, classification performance is heavily affected. Supervised clustering, an interesting variant of supervised learning, deviates from traditional clustering in that it is applied to labeled examples. It can enhance understanding of a data set (e.g., summaries can be generated for each cluster) as well as improve classification performance. Good supervised clustering, however, requires a considerable number of class-labeled training examples; with only a few labeled data its performance is unreliable. Based on these observations, this paper proposes a Classification algorithm based on Local Cluster Centers with a few labeled training examples (CLCC), which has the following merits: (1) with a few labeled training examples, CLCC uses a semi-supervised learning algorithm to augment the labeled examples, which paves the way for the subsequent supervised clustering; (2) CLCC effectively reduces the interference of mislabeled data, whether produced by domain experts or by semi-supervised learning algorithms, because it inherits the idea of supervised clustering and represents the whole data set by local cluster centers that reflect its distribution; and (3) the
Fig. 1. Original labeled examples with wrong labels: six objects are mislabeled; four positive objects are labeled as the negative class, and two negative objects are labeled as the positive class.
Fig. 2. Labeled examples with more wrong labels produced by Co-Forest. Compared with the objects in Fig. 1, 11 objects are mislabeled when the Co-Forest algorithm works on some unlabeled examples together with the original labeled examples in Fig. 1.
CLCC algorithm, as tested on various UCI data sets, outperforms other classification algorithms, especially Co-Forest. This paper is organized as follows. Section 2 briefly reviews the relevant techniques, namely semi-supervised learning and supervised clustering. Section 3 presents the CLCC algorithm. Section 4 reports the experimental results on UCI data sets. Section 5 concludes the paper.
2. Background

2.1. Semi-supervised learning

In traditional supervised learning, classifiers must be trained on abundant labeled data. When a portion of the training data is unlabeled, classifiers can instead be built with semi-supervised learning, an effective way to combine labeled and unlabeled data. Many semi-supervised classification algorithms have been proposed, including the generative model [13–15], the transductive support vector machine approach [9], and the graph-cut algorithm [3]. The Co-training algorithm [2], another well-known semi-supervised learning algorithm, requires sufficient and redundant attribute subsets, but this constraint on the data can be relaxed by using two supervised learning algorithms, each of which produces a hypothesis that partitions the instance space into a set of equivalence classes [8]. Tri-training, proposed by Zhou and Li [18], and the Co-Forest algorithm [11], which require neither sufficient and redundant attribute subsets nor a special supervised learning algorithm, can partition the instance space into a set of equivalence classes no less desirable than Co-training does. In particular, Tri-training employs three classifiers so that it can smoothly choose examples to label and use multiple classifiers to compose the final hypothesis. The three classifiers are first trained on the labeled data set. Then, for any one of them, an unlabeled example can be labeled as long as the other two classifiers agree on its label, so the labeling confidence need not be measured explicitly. The Tri-training algorithm thus has its own mechanism for controlling the augmented labeled data under certain conditions. Co-Forest, an extension of Tri-training, uses N classifiers instead of three in the learning process. It trains an ensemble of N classifiers on the labeled data and then refines each component classifier with unlabeled examples selected by its concomitant ensemble.
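To make the agreement rule concrete, the following is a minimal sketch of how one round of peer labeling in Tri-training might look, assuming scikit-learn-style classifiers with a predict method; the function name and signature are ours, not from [18].

```python
def label_by_peers(h_j, h_k, U_X):
    """Tri-training rule for classifier h_i: an unlabeled example is given
    a label only when the other two classifiers h_j and h_k agree on it."""
    pred_j = h_j.predict(U_X)
    pred_k = h_k.predict(U_X)
    agree = pred_j == pred_k                 # the two peers vote identically
    return U_X[agree], pred_j[agree]         # newly labeled examples for h_i
```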
2.2. Supervised clustering

Supervised clustering deviates from traditional clustering: it is applied to classified examples with the aim of producing clusters that have high probability density with respect to single classes. Supervised clustering algorithms are widely used to reveal the internal structure of a data set. Dettling and Buhlmann [4] proposed a partition-based, incremental active clustering algorithm for the supervised clustering of genes from a microarray experiment; the algorithm clusters genes in such a way that discriminating between tissue types becomes as simple as possible. Eick et al. [5] proposed several representative-based supervised clustering algorithms built around a special fitness function, which can improve classification performance. Another supervised clustering method by Eick et al. [6] can discover interesting regions in spatial data sets. In addition, Li et al. [12] dealt with data sets with mixed attributes by combining the k-prototype algorithm with supervised clustering. Li and Ye [10] presented a data mining algorithm based on supervised clustering that learns data patterns and uses them for classification. Pu et al. [16] presented a supervised bin-split hierarchical clustering method.

3. CLCC algorithm

This section presents the CLCC algorithm. Let L and U denote the labeled and unlabeled example sets, respectively, with sizes |L| and |U|, where |L| << |U|. The outline of CLCC consists of four steps (a code sketch of the whole pipeline follows the list):

(i) The semi-supervised paradigm Co-Forest-Sim is used to label the unlabeled examples in U, producing a new labeled set denoted by L*. It also computes a matrix of confidences between each unlabeled example and its class-label.
(ii) A center-based supervised clustering guided by an objective function is trained on the new labeled set L*, and some representative local cluster centers are selected.
(iii) The result of the center-based supervised clustering is processed to find the "best" local cluster centers. In our experiments, this step is treated as a candidate (optional) operation.
(iv) A K-NN classifier is trained on the local cluster centers.
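As a reading aid, here is a minimal sketch of the four-step pipeline in Python. The helpers co_forest_sim, local_cluster_center, and process_cluster are hypothetical stand-ins for steps (i)-(iii), which the following subsections describe; only the final 1-NN step uses a real library call.

```python
from sklearn.neighbors import KNeighborsClassifier

def clcc(L_X, L_y, U_X, theta=0.75, N=6, k=3, z=10, beta=1.0, refine=True):
    # (i) augment the labeled set and record a confidence matrix W*
    Ls_X, Ls_y, W = co_forest_sim(L_X, L_y, U_X, theta, N)
    # (ii) center-based supervised clustering guided by the objective E
    cx, cy = local_cluster_center(Ls_X, Ls_y, W, k, z, beta)
    # (iii) optional candidate operation: relabel or remove suspicious members
    if refine:
        Ls_X, Ls_y, W = process_cluster(Ls_X, Ls_y, W, cx, cy)
        cx, cy = local_cluster_center(Ls_X, Ls_y, W, k, z, beta)
    # (iv) train a K-NN classifier (K = 1) on the local cluster centers only
    return KNeighborsClassifier(n_neighbors=1).fit(cx, cy)
```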
3.1. Augmenting the labeled example set
In CLCC, a semi-supervised learning method named Co-Forest-Sim is used to augment the labeled example set and to obtain the confidence of each unlabeled example. In this respect, Co-Forest-Sim is similar to Co-Forest [11]. It differs from Co-Forest in that it calculates the matrix of confidences over the unlabeled example set, which CLCC uses to augment the labeled example set rather than to train a classifier. Before proceeding, some necessary notation is summarized. An ensemble of N classifiers is denoted by H*, with components hi, i = 1, ..., N. The ensemble of all components of H* except hi is denoted by Hi = H* − {hi} and is called the concomitant ensemble of hi. The class-label set of L is C = {Ci}, i = 1, ..., m. First, Co-Forest-Sim trains an ensemble H* of N random tree classifiers on the few labeled examples L. Then the learning iterations start. In the tth iteration, Hi examines each example in the subset of the unlabeled set U. For an unlabeled example u, if the number of classifiers voting for a particular label exceeds a preset threshold θ, the example u along with the newly assigned label is copied into the newly labeled set Li,t, which stores the new labeled examples for classifier hi. In this step, the confidence vector wi,u between u and each label in C is recorded. Element wi,u,k of the confidence vector denotes the confidence that the unlabeled example u has label Ck for classifier hi; it can be estimated by the degree of agreement on label Ck within Hi. For example, when num classifiers assign the unlabeled example u to the class-label Ck in the current iteration, wi,u,k = num/(N − 1). In the vector wi,u = {wi,u,1, wi,u,2, ..., wi,u,m}, the class-label with the most votes for example u is denoted by Cmajor, and its confidence is wi,u,major. Without loss of generality, let ωi,t,j be the predictive confidence of Hi on the jth example in Li,t (with its most-voted class-label) in the tth iteration, let the error rate of Hi on Li,t be ei,t, and let the total confidence in the tth iteration be ωi,t = Σj=0..M ωi,t,j, where M is the size of Li,t. The set L ∪ Li,t is used to refine hi in the next iteration if ei,tωi,t < ei,t−1ωi,t−1.
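A minimal sketch of this confidence estimate, assuming class labels are encoded as integers 0, ..., m−1; the function name is ours.

```python
import numpy as np

def confidence_vector(peer_votes, m):
    """peer_votes: the class index that each of the N-1 classifiers in the
    concomitant ensemble H_i assigns to an unlabeled example u.
    Returns w_{i,u} with w_{i,u,k} = num_k / (N - 1), plus C_major."""
    votes = np.asarray(peer_votes)
    w = np.bincount(votes, minlength=m) / len(votes)
    c_major = int(np.argmax(w))          # the label with the most votes
    return w, c_major, w[c_major]        # w_{i,u}, C_major, w_{i,u,major}
```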
3.2. Design of the objective function

A good objective function is very important for a clustering algorithm because it directly affects performance. Our CLCC algorithm uses the objective function in Eq. (1), defined over the new labeled set L* whose matrix of confidences is W* (see Section 3.1). The objective function E(X) has two terms: the first is the incredible degree of the clusters in the clustering result X according to the confidence matrix W*, and the second controls the number of clusters and grows when that number is large. A lower value of E(X) indicates a better solution.

E(X) = \left( \sum_{j=0}^{k} \sum_{i=0}^{N_j} \left(1 - \bar{W}_{ij}\right) \right) / n + \beta \cdot \mathrm{Penalty}(k),   (1)

where

\mathrm{Penalty}(k) = \begin{cases} \sqrt{(k - m)/n}, & k \ge m, \\ 0, & k < m. \end{cases}

In Eq. (1), X = {R1, ..., Rk} is the clustering result consisting of clusters R1–Rk. For each Ri ∈ X, its class-label is the class-label of its center object in the algorithm below; k is the number of current clusters, and m is the size of the class-label set C = {Ci}, i = 1, ..., m, consisting of labels C1–Cm. Nj is the size of cluster Rj, \bar{W}_{ij} is the confidence that the ith object in Rj has the class-label of Rj, and n is the size of the new labeled set L*. The parameter β determines the penalty associated with the number of clusters k in the clustering process: the larger β is, the heavier the penalty on the number of clusters. Let a circle denote an example, with the areas of its black and white parts giving its degrees of membership in each class (the black and the white class). As Fig. 3 illustrates, the incredible degree of cluster A is the sum of the white parts of the examples in cluster A, and the incredible degree of cluster B is the sum of the black parts of the examples in cluster B. We incorporate the label confidences into the objective function so that it fairly reflects the incredible degree of the clustering result.
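Eq. (1) translates directly into code. Below is a minimal sketch, assuming each cluster is represented by the list of confidences \bar{W}_{ij} its members carry for the cluster's own label; the names are ours.

```python
import math

def penalty(k, m, n):
    # Penalty(k) = sqrt((k - m) / n) if k >= m, else 0
    return math.sqrt((k - m) / n) if k >= m else 0.0

def objective_E(cluster_confidences, m, n, beta):
    """cluster_confidences: one list per cluster R_j, holding, for each
    member i, the confidence that it belongs to R_j's class.
    A lower E(X) means a purer, more credible clustering."""
    k = len(cluster_confidences)
    impurity = sum(1.0 - w for conf in cluster_confidences for w in conf) / n
    return impurity + beta * penalty(k, m, n)
```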
Fig. 3. Incredible degree of a cluster. Each circle represents an example; black and white represent the two classes, and the areas of the black and white parts of an example give its degree of membership in each class. The class-label of cluster A is black: six examples clearly belong to the black class, two examples may belong to the black class, and three examples may belong to the white class. Its incredible degree is therefore the sum of the white parts of all examples in cluster A.
Table 1
Notation for the CLCC algorithm.

Symbol    Definition
L*        Labeled example set with confidences
k         Initial number of centers
W*        The label confidence matrix of L*
z         Running frequency
set_num   Number of the best center sets
CX        The set of local cluster centers, with size z
E         Objective function of the supervised clustering
Fig. 4. The value of the center set. Three solid objects rather than two objects can reflect the distribution of the whole data set well.
iteration, the algorithm checks two tentative operations: adding a single non-center object in L* to the set of centers, and removing a center object from the set of centers. The better clustering result of the two operations, measured by the objective function E of Section 3.2, is selected as the current result if it improves the quality of the clustering compared with the result of the last iteration; otherwise, the search stops. Because of randomness, the second step of CLCC is performed z times, and the first set_num best center sets are selected according to the value of the objective function E in the experiment. The pseudocode of the search algorithm is shown in Fig. 5. In Fig. 5, CRAdd records the clustering result partitioned by the corresponding center set, and similarly for CRRe. The function Local_Cluster_Center returns the best center set found in each run. The search algorithm follows a locally optimal strategy. Because of wrong class-labels produced in step (i) and mistakes by domain experts, the best center set of the data set may not be found, especially when there are many mislabeled examples. Consider the cluster in Fig. 6: the solid circle is the current cluster center, and the three triangles are mislabeled objects; obviously, the solid triangle would be the best cluster center. In step (iii), the result of step (ii) is therefore processed to find a better center set for training the K-NN classifier. For each cluster in the clustering result, the examples whose labels differ from the label of the cluster are divided into three parts according to their confidence and their distance to the cluster center. The labels of the examples with the highest confidence that lie closest to the cluster center are changed to the class-label of the cluster, and the examples with the lowest confidence that lie farthest from the cluster center are removed. The remaining examples are left unprocessed to avoid over-fitting. In our experiments, step (iii) is treated as a candidate (optional) operation, as shown in Fig. 7.
Fig. 5. The pseudocode of step (ii) in CLCC.
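The following is a minimal sketch of this add/remove greedy search, reusing objective_E from the sketch in Section 3.2; the score function's way of assembling per-cluster confidences, and details such as tie-breaking, are our assumptions. The rng argument is a numpy Generator, e.g. np.random.default_rng(0).

```python
import numpy as np

def nearest_center(X, center_idx):
    # assign every example to the cluster of its closest center (Euclidean)
    d = np.linalg.norm(X[:, None, :] - X[center_idx][None, :, :], axis=2)
    return np.argmin(d, axis=1)

def score(X, y, W, centers, m, beta):
    # build the per-cluster confidence lists required by objective_E (Eq. (1));
    # each cluster's label is the label of its center object
    assign = nearest_center(X, np.array(centers))
    confs = [[W[i, y[c]] for i in np.where(assign == j)[0]]
             for j, c in enumerate(centers)]
    return objective_E(confs, m, len(X), beta)

def greedy_center_search(X, y, W, m, k, beta, rng):
    centers = list(rng.choice(len(X), size=k, replace=False))
    best = score(X, y, W, centers, m, beta)
    while True:
        # tentative operations: add one non-center, or remove one center
        candidates = [centers + [i] for i in range(len(X)) if i not in centers]
        if len(centers) > 1:
            candidates += [[c for c in centers if c != r] for r in centers]
        cand = min(candidates, key=lambda cs: score(X, y, W, cs, m, beta))
        e = score(X, y, W, cand, m, beta)
        if e >= best:            # no tentative move improves E: stop
            return centers, best
        centers, best = cand, e
```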
Fig. 6. Candidate operation: change the solid triangle object to a circle object and remove the bottom triangle object.
The CLCC algorithm is described in detail in Fig. 8. Note that, because of randomness and local convergence, CLCC selects one particular center set as its result. The operations in the third and fourth lines of Fig. 8 are optional: if accuracy is not critical compared with speed, these two lines can be skipped.

3.4. Analysis of CLCC's advantages

According to the manifold assumption, objects in a small local neighborhood share similar properties. The local cluster centers conform to the manifold assumption, each representing a certain local property. Accordingly, CLCC selects one or more objects from each set of examples with the same class-label, namely, from each cluster. For a complex distribution of data with the same class-label, involving complex cluster shapes or multiple densities, more local cluster centers can reflect the distribution effectively. Furthermore, practical learning applications contain exceptional data, such as outliers and mislabeled examples, which can affect the performance of a classifier trained on the data set. Local cluster centers not only reflect the distribution of the original data set effectively but also reduce the interference of such exceptional data, because only the local cluster centers are used to train the classifier.

4. Experiments
Fifteen data sets from the UCI machine learning repository [1] are tested in the experiments; part of the information about them is presented in Table 2. Euclidean distance is used to compute the distance between two instances. For each data set, 5-fold cross-validation is employed for evaluation: in each fold, 20% of the data set is selected as test data and the other 80% as training data. Then 20% of the training data is randomly selected as the labeled set L and the other 80% as the unlabeled set U; note that the class distributions in L and U are kept similar to that of the original data set. Instances with missing values are removed from the affected data sets. For convenience, the Wisconsin breast cancer data set is denoted by WBC, the "Blood Transfusion Service Center" data set by BTSC, the "Mammographic Mass" data set by MM, and the "Kdd Synthetic Control" data set by KSC. Table 3 shows the result and some parameters of step (ii) in the CLCC algorithm, including the final number of clusters k, the parameter β, and the value of the objective function E; the final k is the size of the center set produced when the supervised clustering of step (ii) stops iterating. The experiments compare the classifier trained by CLCC and Co-Forest, both given few labeled and many unlabeled data, with SMO, RTree, RForest, and AdaBoost [7], which are trained on the original labeled example set without using the unlabeled examples. The average errors of CLCC, Co-Forest, SMO (SMO in WEKA [17]), AdaBoost (AdaBoostM1 in WEKA), RTree (RandomTree in WEKA), and RForest (RandomForest in WEKA) on the same test data, together with the improvement of CLCC over Co-Forest, are shown in Table 4. All error rates are given to three decimal places. The balance-scale data set is abbreviated BS.
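One fold of this protocol might be set up as follows; this is a minimal sketch using scikit-learn's stratified splitting (the paper's own splitting code is not given), where the stratify argument keeps the class distributions in L and U similar to the original data set.

```python
from sklearn.model_selection import train_test_split

def one_fold_split(X, y, seed=0):
    # 20% test / 80% train, then 20% of the train part becomes the labeled
    # set L and the remaining 80% the unlabeled set U
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    L_X, U_X, L_y, U_y = train_test_split(
        X_tr, y_tr, train_size=0.2, stratify=y_tr, random_state=seed)
    return (L_X, L_y), (U_X, U_y), (X_te, y_te)  # U_y is hidden from learning
```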
Function Process_Cluster(CX)
1. Select the best result R = {R1, ..., R|CMS|}, the one whose objective function E is lowest in CX; its center set is CMS = {cmi}, i = 1, ..., |CMS|, and the label of cmi is Clai.
2. For each Ri in R:
3.   In cluster Ri, assume there are numobject objects whose class-label differs from Clai; divide these numobject objects into three segments of numobject/3 objects each. The examples with the highest confidence that are closest to the cluster center cmi form the first segment, while those with the lowest confidence that are farthest from cmi form the last segment.
4.   Change the class-label of the objects in the first segment to Clai and set their confidence for label Clai to 1. At the same time, remove the objects in the last segment.
5. End for
6. Return the new labeled example set L** and its confidence matrix W**
Fig. 7. Candidate operation of CLCC.
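In Python, the candidate operation for a single cluster might look as follows. How confidence and distance are combined into one ranking is not pinned down in the text, so the score used below (distance minus confidence) is our assumption, as is the function name.

```python
import numpy as np

def refine_cluster(member_idx, center, cluster_label, labels, conf, X):
    """member_idx: indices of cluster members whose label differs from the
    cluster's label; conf[i] is example i's confidence for the cluster's
    label. Relabels the most credible third, removes the least credible
    third, and leaves the middle third untouched."""
    member_idx = np.asarray(member_idx)
    if len(member_idx) < 3:
        return np.array([], dtype=int)
    dist = np.linalg.norm(X[member_idx] - center, axis=1)
    order = np.argsort(dist - conf[member_idx])   # close and confident first
    third = len(member_idx) // 3
    for i in member_idx[order[:third]]:           # first segment: relabel
        labels[i] = cluster_label
        conf[i] = 1.0                             # confidence set to 1
    return member_idx[order[-third:]]             # last segment: to remove
```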
Algorithm: CLCC
Input: the labeled set L, the unlabeled set U, the confidence threshold θ, the size of the classifier ensemble N, the manipulative parameter β, the number of class-labels m, the initial number of clusters k, the running frequency z, and the number of best center sets set_num.
Output: a K-NN classifier
Process:
1. (L*, W*) = Co-Forest-Sim(L, U, θ, N)
2. CX = Local_Cluster_Center(L*, k, W*, z, β)
3. (L**, W**) = Process_Cluster(CX)
4. CX = Local_Cluster_Center(L**, k, W**, z, β)
5. Sort the values of the objective function E in CX in ascending order and select the first set_num center sets as CMS_Set.
6. Train set_num K-NN classifiers, one for each center set CMS in CMS_Set (in general, K = 1). Use these classifiers to predict the test data and select the best classifier.
Fig. 8. The pseudocode of CLCC.
Table 2
Experimental data sets.

Data set        #Features   #Instances   #Classes
Wine            13          178          3
Glass           9           214          6
Diabetes        8           768          2
Iris            4           150          3
WBC             9           683          2
Balance-scale   4           625          3
E.coli          7           336          8
Heart-statlog   13          270          2
Live-disorder   6           345          2
Haberman        3           306          2
BTSC            4           748          2
Ionosphere      34          351          2
wpbc            33          194          2
MM              5           830          2
KSC             60          600          6
Table 3
Result and parameters of step (ii) in CLCC.

Data set        Initial k   β     Cluster impurity E
Wine            4           0.1   0.157168778
Glass           9           0.4   0.136307962
Diabetes        3           1     0.206736658
Iris            5           1     0.016666667
WBC             3           1     0.026366843
Balance-scale   3           1     0.139675174
E.coli          12          1     0.137246377
Heart-statlog   3           0.1   0.213277002
Live-disorder   3           0.4   0.149421146
Haberman        3           1     0.162764866
BTSC            3           1     0.132116496
Ionosphere      3           0.1   0.072870958
wpbc            3           1     0.08757716
MM              3           0.1   0.070050909
KSC             9           0.1   0.092194621
The value of set_num in our experiment is set to 6. In general, the value of N in Co-Forest-Sim is set to 6, the confidence threshold θ to 0.75, and K to 1 in the K-NN classifiers. According to Table 4 and Figs. 9 and 10, the K-NN classifier produced by the CLCC algorithm outperforms Co-Forest on the 15 data sets by 5.1% on average. On some unbalanced data sets, the CLCC algorithm achieves even better performance: the improvement on the balance-scale data set is 9.7%, on the haberman data set 7.3%, and on the WPBC data set 6.7%. The reasons for the improvement are that (1) the local cluster center set reflects the distribution of the original data set effectively, and (2) it reduces the interference of exceptional data such as wrong class-labels. In Eq. (1), the parameter β is designed to control the number of clusters, denoted final k, as in the algorithm of [5]. Table 5 presents the detailed results of CLCC on the Wine data set for values of β from 0.1 to 2. As Table 5 shows, once β is larger than 0.6, the number of clusters equals the number of classes of the Wine data set, and some values of the cluster impurity coincide for different values of β (such as 0.183680556 and 0.233680556). Fig. 11 shows the influence of β on the error rate on identical test data for the Wine data set. In general, the more complex the distribution of the data set, the smaller the value of β should be: a smaller β allows more local cluster centers to be selected to fit the local properties of the data set. Selecting β requires user-directed tuning in real applications; in our experiment, β is selected from {0.1, 0.4, 1}. The parameter K of the final K-NN classification step is generally set to 1 because each local cluster center represents a certain local feature of the data set, so a new example should be classified by the most similar local cluster feature. The algorithm is insensitive to the initial cluster number k, since the number of clusters is dynamic rather than static during clustering; the supervised clustering of Fig. 5 can find proper local clusters by searching the data space. In other words, different values of β and k do affect the performance of our algorithm,
Table 4
Average error of different algorithms on the test data sets.

                Traditional classifiers               Co-training paradigm
Data set        SMO     RTree   RForest  AdaBoost     CLCC    Co-Forest  Improv.
Wine            0.111   0.167   0.278    0.139        0.057   0.107      0.05
Glass           0.562   0.497   0.335    0.611        0.367   0.404      0.037
Diabetes        0.273   0.370   0.325    0.279        0.234   0.288      0.054
Iris            0.067   0.100   0.100    0.100        0.040   0.047      0.007
WBC             0.064   0.099   0.088    0.070        0.031   0.057      0.026
BS              0.159   0.230   0.222    0.325        0.152   0.249      0.097
E. coli         0.250   0.353   0.221    0.353        0.204   0.260      0.056
Heart-statlog   0.179   0.272   0.321    0.204        0.167   0.247      0.08
Live-disorder   0.42    0.471   0.406    0.399        0.319   0.37       0.051
Haberman        0.262   0.357   0.307    0.295        0.23    0.303      0.073
BTSC            0.238   0.286   0.263    0.266        0.222   0.265      0.043
Ionosphere      0.21    0.154   0.123    0.2          0.11    0.129      0.019
wpbc            0.238   0.347   0.306    0.301        0.249   0.316      0.067
MM              0.227   0.275   0.254    0.262        0.213   0.254      0.041
KSC             0.083   0.433   0.167    0.667        0.075   0.142      0.067
Avg.            0.223   0.294   0.248    0.298        0.178   0.229      0.051
Fig. 9. Error rate of different algorithms on different data sets.
Fig. 10. Average error rate of different algorithms for all data sets.
but the influence is not major, and suitable values of β and k are not difficult to determine in real applications. In many real applications, experts may make mistakes when labeling the original examples; in this case, there are wrong labels in the original labeled example set L, as mentioned in Section 3.1. If mislabeled examples are used to guide the learning process, they affect the classification performance. In our experiment, the originally correct labels of some examples were changed artificially and at random in each original few-labeled example set of three UCI data sets.
Table 5
Result of CLCC on the Wine data set with different β.

β     Rate of error   Final k   Cluster impurity
0.1   0.138889        8         0.128444791
0.2   0.111111        16        0.142507291
0.3   0.055556        9         0.156893691
0.4   0.111111        6         0.172569444
0.5   0.055556        5         0.175868056
0.6   0.083333        11        0.183680556
0.7   0.083333        3         0.183680556
0.8   0.083333        3         0.233680556
0.9   0.166667        3         0.233680556
1.0   0.111111        3         0.191493056
1.1   0.083333        3         0.233680556
1.2   0.111111        3         0.238368056
1.3   0.111111        3         0.241493056
1.4   0.111111        3         0.246180556
1.5   0.111111        3         0.191493056
1.6   0.055556        3         0.233680556
1.7   0.083333        3         0.238368056
1.8   0.055556        3         0.238368056
1.9   0.111111        3         0.241493056
2.0   0.055556        3         0.246180556
The rate of wrong labels refers to the proportion of wrong labels in the original few-labeled set L; for instance, 14% (4/28) means that of the 28 objects in the original labeled set L, four are mislabeled. As shown in Table 6 and Fig. 12, Co-Forest does not perform well when there are wrong labels in the original labeled example set.
Fig. 11. Error rate of CLCC on the Wine data set with a different parameter b.
Table 6
Average error of CLCC and Co-Forest on the test data sets with wrong labels.

Data set   Rate of wrong labels   CLCC       Co-Forest     Improv.
Wine       0%                     0.083333   0.138888889   0.055555889
Wine       14% (4/28)             0.083333   0.166666667   0.083333667
Diabetes   0%                     0.171053   0.230263158   0.059210158
Diabetes   16% (20/123)           0.184211   0.302631579   0.118421
WBC        0%                     0.02924    0.064285714   0.035045714
WBC        4% (4/101)             0.035088   0.087719298   0.052631
Fig. 13. Error rate of CLCC and Co-Forest on the Wine data set with different rates of wrong labels.
Table 8
Average error of CLCC and Co-Forest on the Wine data set with different rates of wrong labels.

Rate of wrong labels   CLCC       Co-Forest
0%                     0.083333   0.138888889
14% (4/28)             0.083333   0.166666667
21% (6/28)             0.138889   0.194444444
28% (8/28)             0.083333   0.25
35% (10/28)            0.111111   0.333333333
Fig. 12. Error rate of CLCC and Co-Forest on the test data set with wrong labels.
Table 7
Result and parameters of step (ii) in CLCC with wrong labels.

Data set   Final k   β   Cluster impurity E   K in K-NN
Wine       3         1   0.212727273          1
Diabetes   2         1   0.218053097          1
WBC        2         1   0.049895178          1
Under the same rate of wrong labels, the performance of CLCC is better because CLCC can eliminate the interference of mislabeled data. Table 7 shows the detailed results and parameters of step (ii) of CLCC in the presence of wrong labels. Table 8 and Fig. 13 give the performance of CLCC and Co-Forest under different rates of wrong labels on the UCI Wine data set. As the rate of wrong labels grows, the performance of Co-Forest changes greatly, from 0.138888889 to 0.333333333, whereas the average error of CLCC changes only slightly, from 0.083333 to 0.138889. In particular, the third step of CLCC in Fig. 7 revises the wrong labels of examples, which reduces the negative impact of the mislabeled examples on performance; even under different ratios of mislabeled examples, CLCC attains the same error rate at the 14% and 28% wrong-label settings. In short, CLCC performs better than Co-Forest in the presence of mislabeled examples.
5. Conclusions

This paper proposes a classification algorithm based on local cluster centers with only a few labeled data (CLCC), which combines the traditional techniques of semi-supervised learning and supervised clustering. A co-training paradigm is used to create a large new labeled set from the unlabeled set. Then a center-based supervised clustering guided by a new objective function works on the new labeled set to obtain the best local cluster center set, which both reflects the distribution of the whole data set effectively and reduces the interference of exceptional data (e.g., mislabeled data). Finally, a K-NN classifier is trained on the center set. Experiments on 15 UCI data sets show that CLCC can effectively improve classification accuracy in comparison with other classification algorithms. Using other semi-supervised algorithms to label the unlabeled examples is a possible focus of future work. Furthermore, extracting information from supervised clustering results could help domain experts identify and check the labels of exceptional data, which is also an interesting research direction.

Acknowledgments

This work was supported in part by the Natural Science Foundation of Fujian Province of China (2008J04004, 2007J0016), the Innovation Project of Young Scientific Talents in Fujian Province (2006F3045), the University Services HaiXi Major Project in Fujian Province (information technology research based on mathematics), and the Spatial Data Mining and Information Sharing Key Laboratory of Ministry of Education Fund (201008).
References

[1] C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[2] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM Press, Madison, 1998, pp. 92–100.
[3] A. Blum, J. Lafferty, M. Rwebangira, R. Reddy, Semi-supervised learning using randomized mincuts, in: Proceedings of the 21st International Conference on Machine Learning, ACM Press, Banff, 2004, pp. 934–947.
[4] M. Dettling, P. Buhlmann, Supervised clustering of genes, Genome Biology 3 (12) (2002) research0069.1–0069.15.
[5] C.-F. Eick, N. Zeidat, Z.-H. Zhao, Supervised clustering: algorithms and application, in: International Conference on Tools with AI, Boca Raton, FL, 2004, pp. 774–776.
[6] C.-F. Eick, B. Vaezian, D. Jiang, J. Wang, Discovery of interesting regions in spatial data sets using supervised clustering, in: PKDD'06, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, 2006, pp. 127–138.
[7] Y. Freund, R.E. Schapire, A decision-theoretic generalization of online learning and an application to boosting, in: Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, Spain, 1995, pp. 23–37.
[8] S. Goldman, Y. Zhou, Enhancing supervised learning with unlabeled data, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, 2000, pp. 327–334.
[9] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, Bled, Slovenia, 1999, pp. 200–209.
[10] X.-Y. Li, N. Ye, A supervised clustering and classification algorithm for mining data with mixed variables, IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans 36 (2) (2006) 396–406.
[11] M. Li, Z.-H. Zhou, Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples, IEEE Transactions on Systems, Man and Cybernetics – Part A 37 (6) (2007) 1088–1098.
[12] S.-J. Li, J. Liu, Y.-L. Zhu, X.-H. Zhang, A new supervised clustering algorithm for data set with mixed attributes, in: Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007, pp. 844–849.
[13] D.J. Miller, H.S. Uyar, A mixture of experts classifier with learning based on both labeled and unlabeled data, in: M. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997, pp. 571–577.
[14] D.J. Miller, J. Browning, A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (11) (2003) 1468–1483.
[15] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2–3) (2000) 103–134.
[16] L.-P. Pu, P.-D. Zhao, G.-D. Hu, Z.-F. Zhang, Q.-L. Xia, PCA and K-means based supervised split hierarchy clustering method, Application Research of Computers 25 (5) (2008) 1412–1414.
[17] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[18] Z.-H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1529–1541.