Knowledge-Based Systems 23 (2010) 563–571
A classification algorithm based on local cluster centers with a few labeled training examples

Tianqiang Huang a,b,*, Yangqiang Yu a, Gongde Guo a, Kai Li a

a Department of Computer Science, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
ARTICLE INFO

Article history: Received 4 June 2009; Received in revised form 23 February 2010; Accepted 26 March 2010; Available online 29 March 2010.

Keywords: Classification learning; Supervised clustering; Semi-supervised learning

ABSTRACT

Semi-supervised learning techniques, such as co-training paradigms, have been proposed to deal with data sets that contain only a few labeled examples. However, the family of co-training paradigms, such as Tri-training and Co-Forest, is likely to mislabel an unlabeled example and thus downgrade the final performance. Moreover, in practical applications the labeling process is not always free of error, for subjective reasons: even the few labeled examples given may contain mislabeled ones. Supervised clustering provides many benefits in data mining research, but it is generally ineffective with only a few labeled examples. In this paper, a Classification algorithm based on Local Cluster Centers (CLCC) for data sets with a few labeled training examples is proposed. It can reduce the interference of mislabeled data, whether introduced by domain experts or by co-training paradigm algorithms. Experimental results on UCI data sets show that CLCC achieves competitive classification accuracy compared with traditional and state-of-the-art algorithms such as SMO, AdaBoost, RandomTree, RandomForest, and Co-Forest.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

In many practical applications, obtaining a large number of labeled training examples is difficult because labeling is expensive. In contrast, unlabeled training examples can be obtained easily and cheaply in large quantities. Semi-supervised learning has therefore been proposed to combine the few labeled examples with the vast unlabeled ones when extracting knowledge from data sets. A typical semi-supervised learning approach is the family of co-training paradigms, which includes Co-training [2], Tri-training [18], and Co-Forest [11]. Co-training is an attractive semi-supervised learning paradigm that trains two classifiers, which label unlabeled examples for each other. The original Co-training requires the data to be describable by two sufficient and redundant attribute subsets, each of which is sufficient for learning and conditionally independent of the other given the class label. Tri-training and Co-Forest are extensions of Co-training that use three classifiers and multiple classifiers, respectively, instead of two. The problem, however, is that in the process of labeling the unlabeled examples, these algorithms are likely to assign a wrong label to an unlabeled example, which affects the final classification performance. Domain experts may likewise commit mistakes when labeling the
* Corresponding author at: Department of Computer Science, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China. Tel.: +86 13665040506. E-mail address: [email protected] (T. Huang).
doi:10.1016/j.knosys.2010.03.015
original examples. Consider the few mislabeled examples in the original labeled data set in Fig. 1. Running the Co-Forest algorithm on this set together with some unlabeled examples yields the additional labeled examples in Fig. 2. As Fig. 2 shows, Co-Forest produces additional objects with wrong labels because of the original mislabeled examples. Once mislabeled examples are used to guide the learning process, classification performance is heavily affected. Supervised clustering, an interesting variant of supervised learning, deviates from traditional clustering in that it is applied to labeled examples. It can enhance understanding of a data set (e.g., summaries can be generated for each cluster) as well as improve classification performance. Good supervised clustering, however, requires a considerable number of class-labeled training examples; with only a few labeled data its performance is unreliable. Based on these observations, this paper proposes a Classification algorithm based on Local Cluster Centers with a few labeled training examples (CLCC), which has the following merits: (1) with a few labeled training examples, CLCC uses a semi-supervised learning algorithm to augment the labeled examples, which paves the way for the subsequent supervised clustering; (2) CLCC effectively reduces the interference of mislabeled data, whether produced by domain experts or by semi-supervised learning algorithms, because it inherits the idea of supervised clustering and represents the whole data set by local cluster centers that reflect its distribution; and (3) the
Fig. 1. Original labeled examples with wrong labels: six objects are mislabeled; four positive objects are labeled as the negative class, and two negative objects are labeled as the positive class.
Fig. 2. Labeled examples with more wrong labels produced by Co-Forest. Compared with the objects in Fig. 1, 11 objects are mislabeled when the Co-Forest algorithm works on some unlabeled examples together with the original labeled examples in Fig. 1.
CLCC algorithm, as tested on various UCI data sets, outperforms other classification algorithms, especially Co-Forest. This paper is organized as follows. Section 2 briefly reviews the relevant techniques, namely semi-supervised learning and supervised clustering. Section 3 presents the CLCC algorithm. Section 4 reports the experimental results on UCI data sets. Section 5 concludes the paper.
2. Background

2.1. Semi-supervised learning

In traditional supervised learning, classifiers must be trained on abundant labeled data. When a portion of the training data is unlabeled, classifiers can instead be built with semi-supervised learning, an effective way to combine labeled and unlabeled data. Many semi-supervised classification algorithms have been proposed, including the generative model [13–15], the transductive support vector machine approach [9], and the graph-cut algorithm [3]. The Co-training algorithm [2], another well-known semi-supervised learning algorithm, requires sufficient and redundant attribute subsets, but this constraint on the data can be relaxed by using two supervised learning algorithms, each of which produces a hypothesis that partitions the instance space into a set of equivalence classes [8]. Tri-training, proposed by Zhou and Li [18], and the Co-Forest algorithm [11], which require neither sufficient and redundant attribute subsets nor a special supervised learning algorithm, can partition the instance space into a set of equivalence classes no less desirable than Co-training does. In particular, Tri-training employs three classifiers so that it can smoothly choose examples to label and use multiple classifiers to compose the final hypothesis. The three classifiers are first trained on the labeled data set. Then, for any one of them, an unlabeled example can be labeled as long as the other two classifiers agree on its label, so the labeling confidence need not be measured explicitly. The Tri-training algorithm thus has its own mechanism for controlling the augmented labeled data under certain conditions. Co-Forest, an extension of Tri-training, uses N classifiers instead of three in the learning process. It trains an ensemble of N classifiers on the labeled data and then refines each component classifier with unlabeled examples selected by its concomitant ensemble.
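To make the agreement rule concrete, the following is a minimal sketch of how one round of peer labeling in Tri-training might look, assuming scikit-learn-style classifiers with a predict method; the function name and signature are ours, not from [18].

```python
def label_by_peers(h_j, h_k, U_X):
    """Tri-training rule for classifier h_i: an unlabeled example is given
    a label only when the other two classifiers h_j and h_k agree on it."""
    pred_j = h_j.predict(U_X)
    pred_k = h_k.predict(U_X)
    agree = pred_j == pred_k                 # the two peers vote identically
    return U_X[agree], pred_j[agree]         # newly labeled examples for h_i
```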
2.2. Supervised clustering

Supervised clustering deviates from traditional clustering: it is applied to classified examples with the aim of producing clusters that have high probability density with respect to single classes. Supervised clustering algorithms are widely used to reveal the internal structure of a data set. Dettling and Buhlmann [4] proposed a partition-based, incremental active clustering algorithm for the supervised clustering of genes from a microarray experiment; the algorithm clusters genes in such a way that discriminating between tissue types becomes as simple as possible. Eick et al. [5] proposed several representative-based supervised clustering algorithms built around a special fitness function, which can improve classification performance. Another supervised clustering method by Eick et al. [6] can discover interesting regions in spatial data sets. In addition, Li et al. [12] dealt with data sets with mixed attributes by combining the k-prototype algorithm with supervised clustering. Li and Ye [10] presented a data mining algorithm based on supervised clustering that learns data patterns and uses them for classification. Pu et al. [16] presented a supervised bin-split hierarchical clustering method.

3. CLCC algorithm

This section presents the CLCC algorithm. Let L and U denote the labeled and unlabeled example sets, respectively, with sizes |L| and |U|, where |L| << |U|. The outline of CLCC consists of four steps (a code sketch of the whole pipeline follows the list):

(i) The semi-supervised paradigm Co-Forest-Sim is used to label the unlabeled examples in U, producing a new labeled set denoted by L*. It also computes a matrix of confidences between each unlabeled example and its class-label.
(ii) A center-based supervised clustering guided by an objective function is trained on the new labeled set L*, and some representative local cluster centers are selected.
(iii) The result of the center-based supervised clustering is processed to find the "best" local cluster centers. In our experiments, this step is treated as a candidate (optional) operation.
(iv) A K-NN classifier is trained on the local cluster centers.
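As a reading aid, here is a minimal sketch of the four-step pipeline in Python. The helpers co_forest_sim, local_cluster_center, and process_cluster are hypothetical stand-ins for steps (i)-(iii), which the following subsections describe; only the final 1-NN step uses a real library call.

```python
from sklearn.neighbors import KNeighborsClassifier

def clcc(L_X, L_y, U_X, theta=0.75, N=6, k=3, z=10, beta=1.0, refine=True):
    # (i) augment the labeled set and record a confidence matrix W*
    Ls_X, Ls_y, W = co_forest_sim(L_X, L_y, U_X, theta, N)
    # (ii) center-based supervised clustering guided by the objective E
    cx, cy = local_cluster_center(Ls_X, Ls_y, W, k, z, beta)
    # (iii) optional candidate operation: relabel or remove suspicious members
    if refine:
        Ls_X, Ls_y, W = process_cluster(Ls_X, Ls_y, W, cx, cy)
        cx, cy = local_cluster_center(Ls_X, Ls_y, W, k, z, beta)
    # (iv) train a K-NN classifier (K = 1) on the local cluster centers only
    return KNeighborsClassifier(n_neighbors=1).fit(cx, cy)
```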
3.1. Augmenting the labeled example set
In CLCC, a semi-supervised learning method named Co-Forest-Sim is used to augment the labeled example set and to obtain the confidence of each unlabeled example. In this respect, Co-Forest-Sim is similar to Co-Forest [11]. It differs from Co-Forest in that it calculates the matrix of confidences over the unlabeled example set, which CLCC uses to augment the labeled example set rather than to train a classifier. Before proceeding, some necessary notation is summarized. An ensemble of N classifiers is denoted by H*, with components hi, i = 1, ..., N. The ensemble of all components of H* except hi is denoted by Hi = H* − {hi} and is called the concomitant ensemble of hi. The class-label set of L is C = {Ci}, i = 1, ..., m. First, Co-Forest-Sim trains an ensemble H* of N random tree classifiers on the few labeled examples L. Then the learning iterations start. In the tth iteration, Hi examines each example in the subset of the unlabeled set U. For an unlabeled example u, if the number of classifiers voting for a particular label exceeds a preset threshold θ, the example u along with the newly assigned label is copied into the newly labeled set Li,t, which stores the new labeled examples for classifier hi. In this step, the confidence vector wi,u between u and each label in C is recorded. Element wi,u,k of the confidence vector denotes the confidence that the unlabeled example u has label Ck for classifier hi; it can be estimated by the degree of agreement on label Ck within Hi. For example, when num classifiers assign the unlabeled example u to the class-label Ck in the current iteration, wi,u,k = num/(N − 1). In the vector wi,u = {wi,u,1, wi,u,2, ..., wi,u,m}, the class-label with the most votes for example u is denoted by Cmajor, and its confidence is wi,u,major. Without loss of generality, let ωi,t,j be the predictive confidence of Hi on the jth example in Li,t (with its most-voted class-label) in the tth iteration, let the error rate of Hi on Li,t be ei,t, and let the total confidence in the tth iteration be ωi,t = Σj=0..M ωi,t,j, where M is the size of Li,t. The set L ∪ Li,t is used to refine hi in the next iteration if ei,tωi,t < ei,t−1ωi,t−1.
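A minimal sketch of this confidence estimate, assuming class labels are encoded as integers 0, ..., m−1; the function name is ours.

```python
import numpy as np

def confidence_vector(peer_votes, m):
    """peer_votes: the class index that each of the N-1 classifiers in the
    concomitant ensemble H_i assigns to an unlabeled example u.
    Returns w_{i,u} with w_{i,u,k} = num_k / (N - 1), plus C_major."""
    votes = np.asarray(peer_votes)
    w = np.bincount(votes, minlength=m) / len(votes)
    c_major = int(np.argmax(w))          # the label with the most votes
    return w, c_major, w[c_major]        # w_{i,u}, C_major, w_{i,u,major}
```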
3.2. Design of the objective function

A good objective function is very important for a clustering algorithm because it directly affects performance. Our CLCC algorithm uses the objective function in Eq. (1), defined over the new labeled set L* whose matrix of confidences is W* (see Section 3.1). The objective function E(X) has two terms: the first is the incredible degree of the clusters in the clustering result X according to the confidence matrix W*, and the second controls the number of clusters and grows when that number is large. A lower value of E(X) indicates a better solution.

E(X) = \left( \sum_{j=0}^{k} \sum_{i=0}^{N_j} \left(1 - \bar{W}_{ij}\right) \right) / n + \beta \cdot \mathrm{Penalty}(k),   (1)

where

\mathrm{Penalty}(k) = \begin{cases} \sqrt{(k - m)/n}, & k \ge m, \\ 0, & k < m. \end{cases}

In Eq. (1), X = {R1, ..., Rk} is the clustering result consisting of clusters R1–Rk. For each Ri ∈ X, its class-label is the class-label of its center object in the algorithm below; k is the number of current clusters, and m is the size of the class-label set C = {Ci}, i = 1, ..., m, consisting of labels C1–Cm. Nj is the size of cluster Rj, \bar{W}_{ij} is the confidence that the ith object in Rj has the class-label of Rj, and n is the size of the new labeled set L*. The parameter β determines the penalty associated with the number of clusters k in the clustering process: the larger β is, the heavier the penalty on the number of clusters. Let a circle denote an example, with the areas of its black and white parts giving its degrees of membership in each class (the black and the white class). As Fig. 3 illustrates, the incredible degree of cluster A is the sum of the white parts of the examples in cluster A, and the incredible degree of cluster B is the sum of the black parts of the examples in cluster B. We incorporate the label confidences into the objective function so that it fairly reflects the incredible degree of the clustering result.
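Eq. (1) translates directly into code. Below is a minimal sketch, assuming each cluster is represented by the list of confidences \bar{W}_{ij} its members carry for the cluster's own label; the names are ours.

```python
import math

def penalty(k, m, n):
    # Penalty(k) = sqrt((k - m) / n) if k >= m, else 0
    return math.sqrt((k - m) / n) if k >= m else 0.0

def objective_E(cluster_confidences, m, n, beta):
    """cluster_confidences: one list per cluster R_j, holding, for each
    member i, the confidence that it belongs to R_j's class.
    A lower E(X) means a purer, more credible clustering."""
    k = len(cluster_confidences)
    impurity = sum(1.0 - w for conf in cluster_confidences for w in conf) / n
    return impurity + beta * penalty(k, m, n)
```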
Fig. 3. Incredible degree of a cluster. Each circle represents an example; black and white represent the two classes, and the areas of the black and white parts of an example give its degree of membership in each class. The class-label of cluster A is black: six examples clearly belong to the black class, two examples may belong to the black class, and three examples may belong to the white class. Its incredible degree is therefore the sum of the white parts of all examples in cluster A.
Table 1
Notation for the CLCC algorithm.

Symbol    Definition
L*        Labeled example set with confidences
k         Initial number of centers
W*        The label confidence matrix of L*
z         Running frequency
set_num   Number of the best center sets
CX        The set of local cluster centers, with size z
E         Objective function of the supervised clustering
Fig. 4. The value of the center set. Three solid objects rather than two objects can reflect the distribution of the whole data set well.
iteration, the algorithm checks two tentative operations: adding a single non-center object in L* to the set of centers, and removing a center object from the set of centers. The better clustering result of the two operations, measured by the objective function E of Section 3.2, is selected as the current result if it improves the quality of the clustering compared with the result of the last iteration; otherwise, the search stops. Because of randomness, the second step of CLCC is performed z times, and the first set_num best center sets are selected according to the value of the objective function E in the experiment. The pseudocode of the search algorithm is shown in Fig. 5. In Fig. 5, CRAdd records the clustering result partitioned by the corresponding center set, and similarly for CRRe. The function Local_Cluster_Center returns the best center set found in each run. The search algorithm follows a locally optimal strategy. Because of wrong class-labels produced in step (i) and mistakes by domain experts, the best center set of the data set may not be found, especially when there are many mislabeled examples. Consider the cluster in Fig. 6: the solid circle is the current cluster center, and the three triangles are mislabeled objects; obviously, the solid triangle would be the best cluster center. In step (iii), the result of step (ii) is therefore processed to find a better center set for training the K-NN classifier. For each cluster in the clustering result, the examples whose labels differ from the label of the cluster are divided into three parts according to their confidence and their distance to the cluster center. The labels of the examples with the highest confidence that lie closest to the cluster center are changed to the class-label of the cluster, and the examples with the lowest confidence that lie farthest from the cluster center are removed. The remaining examples are left unprocessed to avoid over-fitting. In our experiments, step (iii) is treated as a candidate (optional) operation, as shown in Fig. 7.
Fig. 5. The pseudocode of step (ii) in CLCC.
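The following is a minimal sketch of this add/remove greedy search, reusing objective_E from the sketch in Section 3.2; the score function's way of assembling per-cluster confidences, and details such as tie-breaking, are our assumptions. The rng argument is a numpy Generator, e.g. np.random.default_rng(0).

```python
import numpy as np

def nearest_center(X, center_idx):
    # assign every example to the cluster of its closest center (Euclidean)
    d = np.linalg.norm(X[:, None, :] - X[center_idx][None, :, :], axis=2)
    return np.argmin(d, axis=1)

def score(X, y, W, centers, m, beta):
    # build the per-cluster confidence lists required by objective_E (Eq. (1));
    # each cluster's label is the label of its center object
    assign = nearest_center(X, np.array(centers))
    confs = [[W[i, y[c]] for i in np.where(assign == j)[0]]
             for j, c in enumerate(centers)]
    return objective_E(confs, m, len(X), beta)

def greedy_center_search(X, y, W, m, k, beta, rng):
    centers = list(rng.choice(len(X), size=k, replace=False))
    best = score(X, y, W, centers, m, beta)
    while True:
        # tentative operations: add one non-center, or remove one center
        candidates = [centers + [i] for i in range(len(X)) if i not in centers]
        if len(centers) > 1:
            candidates += [[c for c in centers if c != r] for r in centers]
        cand = min(candidates, key=lambda cs: score(X, y, W, cs, m, beta))
        e = score(X, y, W, cand, m, beta)
        if e >= best:            # no tentative move improves E: stop
            return centers, best
        centers, best = cand, e
```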
Fig. 6. Candidate operation: change the solid triangle object to a circle object and remove the bottom triangle object.
The CLCC algorithm is described in detail in Fig. 8. Note that, because of randomness and local convergence, CLCC selects one particular center set as its result. The operations in the third and fourth lines of Fig. 8 are optional: if accuracy is not critical compared with speed, these two lines can be skipped.

3.4. Analysis of CLCC's advantages

According to the manifold assumption, objects in a small local neighborhood share similar properties. The local cluster centers conform to the manifold assumption, each representing a certain local property. Accordingly, CLCC selects one or more objects from each set of examples with the same class-label, namely, from each cluster. For a complex distribution of data with the same class-label, involving complex cluster shapes or multiple densities, more local cluster centers can reflect the distribution effectively. Furthermore, practical learning applications contain exceptional data, such as outliers and mislabeled examples, which can affect the performance of a classifier trained on the data set. Local cluster centers not only reflect the distribution of the original data set effectively but also reduce the interference of such exceptional data, because only the local cluster centers are used to train the classifier.

4. Experiments
Fifteen data sets from the UCI machine learning repository [1] are tested in the experiments; part of the information about them is presented in Table 2. Euclidean distance is used to compute the distance between two instances. For each data set, 5-fold cross-validation is employed for evaluation: in each fold, 20% of the data set is selected as test data and the other 80% as training data. Then 20% of the training data is randomly selected as the labeled set L and the other 80% as the unlabeled set U; note that the class distributions in L and U are kept similar to that of the original data set. Instances with missing values are removed from the affected data sets. For convenience, the Wisconsin breast cancer data set is denoted by WBC, the "Blood Transfusion Service Center" data set by BTSC, the "Mammographic Mass" data set by MM, and the "Kdd Synthetic Control" data set by KSC. Table 3 shows the result and some parameters of step (ii) in the CLCC algorithm, including the final number of clusters k, the parameter β, and the value of the objective function E; the final k is the size of the center set produced when the supervised clustering of step (ii) stops iterating. The experiments compare the classifier trained by CLCC and Co-Forest, both given few labeled and many unlabeled data, with SMO, RTree, RForest, and AdaBoost [7], which are trained on the original labeled example set without using the unlabeled examples. The average errors of CLCC, Co-Forest, SMO (SMO in WEKA [17]), AdaBoost (AdaBoostM1 in WEKA), RTree (RandomTree in WEKA), and RForest (RandomForest in WEKA) on the same test data, together with the improvement of CLCC over Co-Forest, are shown in Table 4. All error rates are given to three decimal places. The balance-scale data set is abbreviated BS.
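One fold of this protocol might be set up as follows; this is a minimal sketch using scikit-learn's stratified splitting (the paper's own splitting code is not given), where the stratify argument keeps the class distributions in L and U similar to the original data set.

```python
from sklearn.model_selection import train_test_split

def one_fold_split(X, y, seed=0):
    # 20% test / 80% train, then 20% of the train part becomes the labeled
    # set L and the remaining 80% the unlabeled set U
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    L_X, U_X, L_y, U_y = train_test_split(
        X_tr, y_tr, train_size=0.2, stratify=y_tr, random_state=seed)
    return (L_X, L_y), (U_X, U_y), (X_te, y_te)  # U_y is hidden from learning
```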
Function Process_Cluster(CX)
1. Select the best result R = {R1, ..., R|CMS|}, the one whose objective function E is lowest in CX; its center set is CMS = {cmi}, i = 1, ..., |CMS|, and the label of cmi is Clai.
2. For each Ri in R:
3.   In cluster Ri, assume there are numobject objects whose class-label differs from Clai; divide these numobject objects into three segments of numobject/3 objects each. The examples with the highest confidence that are closest to the cluster center cmi form the first segment, while those with the lowest confidence that are farthest from cmi form the last segment.
4.   Change the class-label of the objects in the first segment to Clai and set their confidence for label Clai to 1. At the same time, remove the objects in the last segment.
5. End for
6. Return the new labeled example set L** and its confidence matrix W**
Fig. 7. Candidate operation of CLCC.
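In Python, the candidate operation for a single cluster might look as follows. How confidence and distance are combined into one ranking is not pinned down in the text, so the score used below (distance minus confidence) is our assumption, as is the function name.

```python
import numpy as np

def refine_cluster(member_idx, center, cluster_label, labels, conf, X):
    """member_idx: indices of cluster members whose label differs from the
    cluster's label; conf[i] is example i's confidence for the cluster's
    label. Relabels the most credible third, removes the least credible
    third, and leaves the middle third untouched."""
    member_idx = np.asarray(member_idx)
    if len(member_idx) < 3:
        return np.array([], dtype=int)
    dist = np.linalg.norm(X[member_idx] - center, axis=1)
    order = np.argsort(dist - conf[member_idx])   # close and confident first
    third = len(member_idx) // 3
    for i in member_idx[order[:third]]:           # first segment: relabel
        labels[i] = cluster_label
        conf[i] = 1.0                             # confidence set to 1
    return member_idx[order[-third:]]             # last segment: to remove
```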
Algorithm: CLCC
Input: the labeled set L, the unlabeled set U, the confidence threshold θ, the size of the classifier ensemble N, the manipulative parameter β, the number of class-labels m, the initial number of clusters k, the running frequency z, and the number of best center sets set_num.
Output: a K-NN classifier
Process:
1. (L*, W*) = Co-Forest-Sim(L, U, θ, N)
2. CX = Local_Cluster_Center(L*, k, W*, z, β)
3. (L**, W**) = Process_Cluster(CX)
4. CX = Local_Cluster_Center(L**, k, W**, z, β)
5. Sort the values of the objective function E in CX in ascending order and select the first set_num center sets as CMS_Set.
6. Train set_num K-NN classifiers, one for each center set CMS in CMS_Set (in general, K = 1). Use these classifiers to predict the test data and select the best classifier.
Fig. 8. The pseudocode of CLCC.
Table 2
Experimental data sets.

Data set        #Features   #Instances   #Classes
Wine            13          178          3
Glass           9           214          6
Diabetes        8           768          2
Iris            4           150          3
WBC             9           683          2
Balance-scale   4           625          3
E.coli          7           336          8
Heart-statlog   13          270          2
Live-disorder   6           345          2
Haberman        3           306          2
BTSC            4           748          2
Ionosphere      34          351          2
wpbc            33          194          2
MM              5           830          2
KSC             60          600          6
Table 3
Result and parameters of step (ii) in CLCC.

Data set        Initial k   β     Cluster impurity E
Wine            4           0.1   0.157168778
Glass           9           0.4   0.136307962
Diabetes        3           1     0.206736658
Iris            5           1     0.016666667
WBC             3           1     0.026366843
Balance-scale   3           1     0.139675174
E.coli          12          1     0.137246377
Heart-statlog   3           0.1   0.213277002
Live-disorder   3           0.4   0.149421146
Haberman        3           1     0.162764866
BTSC            3           1     0.132116496
Ionosphere      3           0.1   0.072870958
wpbc            3           1     0.08757716
MM              3           0.1   0.070050909
KSC             9           0.1   0.092194621
The value of set_num in our experiment is set to 6. In general, the value of N in Co-Forest-Sim is set to 6, the confidence threshold θ to 0.75, and K to 1 in the K-NN classifiers. According to Table 4 and Figs. 9 and 10, the K-NN classifier produced by the CLCC algorithm outperforms Co-Forest on the 15 data sets by 5.1% on average. On some unbalanced data sets, the CLCC algorithm achieves even better performance: the improvement on the balance-scale data set is 9.7%, on the haberman data set 7.3%, and on the WPBC data set 6.7%. The reasons for the improvement are that (1) the local cluster center set reflects the distribution of the original data set effectively, and (2) it reduces the interference of exceptional data such as wrong class-labels. In Eq. (1), the parameter β is designed to control the number of clusters, denoted final k, as in the algorithm of [5]. Table 5 presents the detailed results of CLCC on the Wine data set for values of β from 0.1 to 2. As Table 5 shows, once β is larger than 0.6, the number of clusters equals the number of classes of the Wine data set, and some values of the cluster impurity coincide for different values of β (such as 0.183680556 and 0.233680556). Fig. 11 shows the influence of β on the error rate on identical test data for the Wine data set. In general, the more complex the distribution of the data set, the smaller the value of β should be: a smaller β allows more local cluster centers to be selected to fit the local properties of the data set. Selecting β requires user-directed tuning in real applications; in our experiment, β is selected from {0.1, 0.4, 1}. The parameter K of the final K-NN classification step is generally set to 1 because each local cluster center represents a certain local feature of the data set, so a new example should be classified by the most similar local cluster feature. The algorithm is insensitive to the initial cluster number k, since the number of clusters is dynamic rather than static during clustering; the supervised clustering of Fig. 5 can find proper local clusters by searching the data space. In other words, different values of β and k do affect the performance of our algorithm,
Table 4
Average error of different algorithms on the test data sets.

                Traditional classifiers               Co-training paradigm
Data set        SMO     RTree   RForest  AdaBoost     CLCC    Co-Forest  Improv.
Wine            0.111   0.167   0.278    0.139        0.057   0.107      0.05
Glass           0.562   0.497   0.335    0.611        0.367   0.404      0.037
Diabetes        0.273   0.370   0.325    0.279        0.234   0.288      0.054
Iris            0.067   0.100   0.100    0.100        0.040   0.047      0.007
WBC             0.064   0.099   0.088    0.070        0.031   0.057      0.026
BS              0.159   0.230   0.222    0.325        0.152   0.249      0.097
E. coli         0.250   0.353   0.221    0.353        0.204   0.260      0.056
Heart-statlog   0.179   0.272   0.321    0.204        0.167   0.247      0.08
Live-disorder   0.42    0.471   0.406    0.399        0.319   0.37       0.051
Haberman        0.262   0.357   0.307    0.295        0.23    0.303      0.073
BTSC            0.238   0.286   0.263    0.266        0.222   0.265      0.043
Ionosphere      0.21    0.154   0.123    0.2          0.11    0.129      0.019
wpbc            0.238   0.347   0.306    0.301        0.249   0.316      0.067
MM              0.227   0.275   0.254    0.262        0.213   0.254      0.041
KSC             0.083   0.433   0.167    0.667        0.075   0.142      0.067
Avg.            0.223   0.294   0.248    0.298        0.178   0.229      0.051
Fig. 9. Error rate of different algorithms on different data sets.
Fig. 10. Average error rate of different algorithms for all data sets.
but the influence is not major, and suitable values of β and k are not difficult to determine in real applications. In many real applications, experts may make mistakes when labeling the original examples; in this case, there are wrong labels in the original labeled example set L, as mentioned in Section 3.1. If mislabeled examples are used to guide the learning process, they affect the classification performance. In our experiment, the originally correct labels of some examples were changed artificially and at random in each original few-labeled example set of three UCI data sets.
Table 5
Result of CLCC on the Wine data set with different β.

β     Rate of error   Final k   Cluster impurity
0.1   0.138889        8         0.128444791
0.2   0.111111        16        0.142507291
0.3   0.055556        9         0.156893691
0.4   0.111111        6         0.172569444
0.5   0.055556        5         0.175868056
0.6   0.083333        11        0.183680556
0.7   0.083333        3         0.183680556
0.8   0.083333        3         0.233680556
0.9   0.166667        3         0.233680556
1.0   0.111111        3         0.191493056
1.1   0.083333        3         0.233680556
1.2   0.111111        3         0.238368056
1.3   0.111111        3         0.241493056
1.4   0.111111        3         0.246180556
1.5   0.111111        3         0.191493056
1.6   0.055556        3         0.233680556
1.7   0.083333        3         0.238368056
1.8   0.055556        3         0.238368056
1.9   0.111111        3         0.241493056
2.0   0.055556        3         0.246180556
The rate of wrong labels refers to the proportion of wrong labels in the original few-labeled set L; for instance, 14% (4/28) means that of the 28 objects in the original labeled set L, four are mislabeled. As shown in Table 6 and Fig. 12, Co-Forest does not perform well when there are wrong labels in the original labeled example set.
Fig. 11. Error rate of CLCC on the Wine data set with a different parameter b.
Table 6
Average error of CLCC and Co-Forest on the test data sets with wrong labels.

Data set   Rate of wrong labels   CLCC       Co-Forest     Improv.
Wine       0%                     0.083333   0.138888889   0.055555889
Wine       14% (4/28)             0.083333   0.166666667   0.083333667
Diabetes   0%                     0.171053   0.230263158   0.059210158
Diabetes   16% (20/123)           0.184211   0.302631579   0.118421
WBC        0%                     0.02924    0.064285714   0.035045714
WBC        4% (4/101)             0.035088   0.087719298   0.052631
Fig. 13. Error rate of CLCC and Co-Forest on the Wine data set with different rates of wrong labels.
Table 8
Average error of CLCC and Co-Forest on the Wine data set with different rates of wrong labels.

Rate of wrong labels   CLCC       Co-Forest
0%                     0.083333   0.138888889
14% (4/28)             0.083333   0.166666667
21% (6/28)             0.138889   0.194444444
28% (8/28)             0.083333   0.25
35% (10/28)            0.111111   0.333333333
Fig. 12. Error rate of CLCC and Co-Forest on the test data set with wrong labels.
Table 7
Result and parameters of step (ii) in CLCC with wrong labels.

Data set   Final k   β   Cluster impurity E   K in K-NN
Wine       3         1   0.212727273          1
Diabetes   2         1   0.218053097          1
WBC        2         1   0.049895178          1
Under the same rate of wrong labels, the performance of CLCC is better because CLCC can eliminate the interference of mislabeled data. Table 7 shows the detailed results and parameters of step (ii) of CLCC in the presence of wrong labels. Table 8 and Fig. 13 give the performance of CLCC and Co-Forest under different rates of wrong labels on the UCI Wine data set. As the rate of wrong labels grows, the performance of Co-Forest changes greatly, from 0.138888889 to 0.333333333, whereas the average error of CLCC changes only slightly, from 0.083333 to 0.138889. In particular, the third step of CLCC in Fig. 7 revises the wrong labels of examples, which reduces the negative impact of the mislabeled examples on performance; even under different ratios of mislabeled examples, CLCC attains the same error rate at the 14% and 28% wrong-label settings. In short, CLCC performs better than Co-Forest in the presence of mislabeled examples.
5. Conclusions

This paper proposes a classification algorithm based on local cluster centers with only a few labeled data (CLCC), which combines the traditional techniques of semi-supervised learning and supervised clustering. A co-training paradigm is used to create a large new labeled set from the unlabeled set. Then a center-based supervised clustering guided by a new objective function works on the new labeled set to obtain the best local cluster center set, which both reflects the distribution of the whole data set effectively and reduces the interference of exceptional data (e.g., mislabeled data). Finally, a K-NN classifier is trained on the center set. Experiments on 15 UCI data sets show that CLCC can effectively improve classification accuracy in comparison with other classification algorithms. Using other semi-supervised algorithms to label the unlabeled examples is a possible focus of future work. Furthermore, extracting information from supervised clustering results could help domain experts identify and check the labels of exceptional data, which is also an interesting research direction.

Acknowledgments

This work was supported in part by the Natural Science Foundation of Fujian Province of China (2008J04004, 2007J0016), the Innovation Project of Young Scientific Talents in Fujian Province (2006F3045), the University Services HaiXi Major Project in Fujian Province (information technology research based on mathematics), and the Spatial Data Mining and Information Sharing Key Laboratory of Ministry of Education Fund (201008).
References

[1] C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[2] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM Press, Madison, 1998, pp. 92–100.
[3] A. Blum, J. Lafferty, M. Rwebangira, R. Reddy, Semi-supervised learning using randomized mincuts, in: Proceedings of the 21st International Conference on Machine Learning, ACM Press, Banff, 2004, pp. 934–947.
[4] M. Dettling, P. Buhlmann, Supervised clustering of genes, Genome Biology 3 (12) (2002) research0069.1–0069.15.
[5] C.-F. Eick, N. Zeidat, Z.-H. Zhao, Supervised clustering: algorithms and application, in: International Conference on Tools with AI, Boca Raton, FL, 2004, pp. 774–776.
[6] C.-F. Eick, B. Vaezian, D. Jiang, J. Wang, Discovery of interesting regions in spatial data sets using supervised clustering, in: PKDD'06, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, 2006, pp. 127–138.
[7] Y. Freund, R.E. Schapire, A decision-theoretic generalization of online learning and an application to boosting, in: Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, Spain, 1995, pp. 23–37.
[8] S. Goldman, Y. Zhou, Enhancing supervised learning with unlabeled data, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, 2000, pp. 327–334.
[9] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the 16th International Conference on Machine Learning, Bled, Slovenia, 1999, pp. 200–209.
[10] X.-Y. Li, N. Ye, A supervised clustering and classification algorithm for mining data with mixed variables, IEEE Transactions on Systems, Man and Cybernetics – Part A: Systems and Humans 36 (2) (2006) 396–406.
[11] M. Li, Z.-H. Zhou, Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples, IEEE Transactions on Systems, Man and Cybernetics – Part A 37 (6) (2007) 1088–1098.
[12] S.-J. Li, J. Liu, Y.-L. Zhu, X.-H. Zhang, A new supervised clustering algorithm for data set with mixed attributes, in: Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2007, pp. 844–849.
[13] D.J. Miller, H.S. Uyar, A mixture of experts classifier with learning based on both labeled and unlabeled data, in: M. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997, pp. 571–577.
[14] D.J. Miller, J. Browning, A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (11) (2003) 1468–1483.
[15] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2–3) (2000) 103–134.
[16] L.-P. Pu, P.-D. Zhao, G.-D. Hu, Z.-F. Zhang, Q.-L. Xia, PCA and K-means based supervised split hierarchy clustering method, Application Research of Computers 25 (5) (2008) 1412–1414.
[17] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[18] Z.-H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1529–1541.