A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents



Expert Systems with Applications 39 (2012) 3127–3134

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents ☆

JuiHsi Fu ⇑, SingLing Lee

Department of Computer Science and Information Engineering, National Chung Cheng University, 168 University Road, Minhsiung Township, 62162 Chiayi, Taiwan, ROC

Article info

Keywords: Support Vector Machines (SVM); Multi-class classification; Chinese official document classification; Indistinguishability identification; Incremental learning

Abstract

Support Vector Machines (SVM) have been applied to Chinese official document classification in the One-against-All (OAA) multi-class scheme. Several data retrieval techniques, including sentence segmentation, term weighting, and feature extraction, are used in preprocessing. We observe that documents whose contents are indistinguishable account for most of the poor classification results. The traditional solution is to add misclassified documents to the training set in order to adjust the classification rules. In this paper, indistinguishable documents are observed to be informative for strengthening prediction performance, since the current model predicts their labels with low confidence. A general approach is proposed that uses SVM decision values to identify indistinguishable documents. Based on verified classification results and the distinguishability of documents, four learning strategies that select certain documents for the training set are proposed to improve classification performance. Experiments show that indistinguishable documents can be identified with high probability and are informative for the learning strategies. Furthermore, LMID, which adds both misclassified and indistinguishable documents to the training set, is the most effective learning strategy for SVM classification of a large set of Chinese official documents in terms of computing efficiency and classification accuracy.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

In government departments and companies, some kinds of articles and documents are still handled manually. Among these, Chinese official documents are used very often to inform and communicate officially with companies and corporations. Each employee is responsible for dispatching official documents to all related departments. Hence, an accurate classification system for Chinese official documents would improve the working efficiency of government employees. However, in Chinese documents the segmented terms often cannot represent the original meaning of the content completely, since no delimiter exists between words. Moreover, unlike English, Chinese has no formal stop-word list for removing meaningless words. Because complete content representations and stop-word lists are lacking, distinct features are difficult to extract from the document content. Additionally, Chinese official documents, which are well-formed and textual, have the special characteristics listed below:

☆ This work is supported by NSC, Taiwan, ROC under Grant No. NSC 97-2221-E194-029-MY2.
⇑ Corresponding author. E-mail addresses: [email protected] (J. Fu), [email protected] (S. Lee).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.08.176

1. Short and brief content. The abstract of an official document is usually used to represent the document for classification. The abstract needs to be short and brief when describing government affairs.
2. Fewer distinct features. In official affairs, Chinese official documents belonging to different units (classes) tend to be similar because most terms in a document are not discriminative. Classifying Chinese official documents is difficult when it depends on only a few distinct terms.

Our objective is to classify Chinese official documents more precisely. Each department in the company corresponds to a class label, so the problem of automatically dispatching an official document reduces to a multi-class classification problem. However, the special characteristics of Chinese official documents usually cause poor classification results. The traditional learning strategy is to add misclassified documents to the training set in order to adjust the classification rules. In this paper, some correctly classified documents are observed to be indistinguishable, since their labels are predicted with low confidence. Notably, indistinguishable documents should be informative for strengthening the prediction rules, since subsequent similar documents could then be classified correctly with higher probability. Hence, a distinguishing method, Identifying Possibly Misclassified Documents (IPMD), is proposed to


distinguish whether verified documents tend to be misclassified or not. Based on verified classification results and document distinguishability, four learning strategies that select certain documents for the training set are proposed to enhance classification accuracy and reduce the size of training sets. They are introduced in more detail in Section 3.

Fig. 1 gives an overview of our document classification system. Initially, training documents are processed by the Text Preprocessing and Classifier Training modules to build a prediction model. Feature extraction eliminates terms with weights lower than a predefined threshold, and term weighting methods (Combarro, Montanes, Diaz, Ranilla, & Mones, 2005; Quinlan, 1986; Salton & Buckley, 1988; Salton & McGill, 1983) are used to represent document vectors. Classifier Training is the core of a classification system. Several well-known classification methods, such as K-Nearest Neighbors (KNN) (Yuan, Yang, & Yu, 2005), Support Vector Machines (SVM) (Cortes & Vapnik, 1995; Cristianini & Taylor, 2000; Liang, 2004), Naive Bayes (Lewis, 1998; Lewis & Ringuette, 1994), and neural networks (Wiener, 1995), have been well studied. Notably, SVM is adopted for our document classification problem since it has been shown to perform very effectively in many studies (Deng & Peng, 2006; Diaz, Ranilla, Montanes, Fernandez, & Combarro, 2004; Dumais, Platt, Heckerman, & Sahami, 1998; Joachims, 1998; Kecman, 2001; Lee & Lee, 2005; Ramirez, Durdle, Raso, & Hill, 2006; Özgür & Güngör, 2006; Wang, 2005; Wang, Sun, Zhang, & Li, 2006; Wang & Fu, 2005) and is able to deal with high-dimensional feature spaces. Geometrically speaking, SVM (Cortes & Vapnik, 1995) generates a hyperplane that separates positive instances from negative ones. The objective is to maximize the distance from the nearest training instance to the separating hyperplane.

When the prediction model has been generated, testing documents are also processed by Text Preprocessing, and their class labels are predicted by the trained classifier. The Verified module judges the prediction results (supervised learning; Alpaydin, 2004). Then, the IPMD module uses the decision values of verified documents in SVM classification to determine whether they are distinguishable or not. Next, a revised Learning Strategy module, which uses verified classification results and the distinguishability of documents to select new training instances, updates the prediction model and improves classification performance.

The objective of semi-supervised learning (SSL) (Joachims, 1999) is to utilize unlabeled samples in order to decrease the use of labeled samples. In this paper, the proposed methods identify

Fig. 1. Text Preprocessing, Classifier Training, and classifier testing of our SVM classification system.

indistinguishable documents among the classified testing documents and put them into the training set in order to build SVM models more precisely. The testing documents are part of the labeled dataset. Hence, our methods focus on constructing new training sets based on the original classification models, unlike SSL, which utilizes unlabeled samples for generating prediction models. Moreover, in traditional online learning the SVM prediction model is updated by each new training instance to achieve a specific objective greedily (Bordes, Ertekin, Weston, & Bottou, 2005; Cauwenberghs & Poggio, 2000; Lau & Wu, 2003). Although the proposed learning strategies keep adding new training instances to the training set, they differ from traditional online learning. This paper focuses on how to construct a training set of a certain size appropriately, so that it can be used later to rebuild the SVM classifier and improve classification accuracy. When a new training instance arrives, it is first kept on disk without updating the SVM model immediately. Once the new training instances have accumulated to a specific amount (two sizes, 100 and 300, are chosen in our experiments), the SVM model is rebuilt by adding part of the new training instances to the current training set. Our prediction models are rebuilt by the original classifiers with a newly created training set, unlike online models, which process each new training instance without storage or reprocessing. Ultimately, the proposed methods are designed for selecting informative labeled documents to build an accurate classification system.

The rest of this paper is organized as follows. Section 2 introduces multi-class SVM classification methods. Section 3 presents the proposed methods: one distinguishing method and four learning strategies. Section 4 reports the experimental performance of our SVM document classification method. Section 5 is the summary.

2. Related works

SVM was originally designed for binary classification and can be applied to multi-class classification problems through multi-class schemes (Chang, Chou, Lin, & Chen, 2004; Hsu & Lin, 2002; Liang, 2003, 2004; Lin, Peng, & Liu, 2006; Rennie & Rifkin, 2001; Zou, Chen, & Guo, 2005). The One-against-All (OAA) and One-against-One (OAO) multi-class schemes are the most frequently adopted in current research (Chang et al., 2004; Chin, 1998; Hsu & Lin, 2002; Rennie & Rifkin, 2001). One widely used approach in the OAA scheme is called "winner-takes-all": the label of an instance is predicted according to the maximum output value among all SVM classifiers. In practice, the OAA SVM classification scheme trains C binary SVM classifiers for C classes. The classifier $\mathrm{SVM}_i$ is trained with positive instances whose class labels are $c_i$ and negative instances whose class labels are not $c_i$. After these C binary SVM classifiers are trained, C corresponding decision functions are generated. In the classification procedure, the predicted class label of a testing instance x is the label of the decision function that has the maximum decision value:

$$\arg\max_{i=1,\dots,C} \; (w_i \cdot x + b_i) \tag{1}$$

Here $(w_i \cdot x + b_i)$ is the decision function for class $c_i$. Thus, OAA SVM classification solves C quadratic programming problems, each with l variables (l is the size of the whole training set).

In the OAO scheme, one major method is pairwise classification with majority voting. The class label is assigned to an instance by winning the most pairwise comparisons, which is also called "Max Wins". Consequently, multi-class SVM classification problems are solved by training C(C − 1)/2 binary SVM classifiers for all pairs of classes and formulating C(C − 1)/2 corresponding decision functions. $\mathrm{SVM}_{ij}$ is trained with positive instances whose class labels are $c_i$ and negative instances whose class labels are $c_j$. The predicted class label of


a testing instance is determined by all decision values and a voting strategy. If the sign of $(w_{ij} \cdot x + b_{ij})$, the decision function for classes $c_i$ and $c_j$, indicates that the predicted class label of instance x is $c_i$, class $c_i$ receives a vote:

$$\arg\max_{i=1,\dots,C} \; vote_x(i), \qquad vote_x(i) = \sum_{j=1,\, j \neq i}^{C} v_x(i, j), \tag{2}$$

$$v_x(i, j) = \begin{cases} 1, & \text{if } \operatorname{sign}(w_{ij} \cdot x + b_{ij}) \text{ says } x \text{ is in class } c_i, \\ 0, & \text{otherwise.} \end{cases}$$

In the OAO SVM classification scheme, C(C − 1)/2 binary SVM classifiers are constructed, and the average number of training documents in each classifier is 2l/C, where l is the size of the whole training set. Thus, OAO SVM classification needs to solve C(C − 1)/2 quadratic programming problems, each with 2l/C variables.

Error-Correcting Output Codes (ECOC) (Allwein, Schapire, & Singer, 2000; Dietterich & Bakiri, 1995) are methods for combining different binary classifiers. In practice, OAA and OAO schemes with well-tuned classifiers have been shown to be as effective as other ECOC methods (Rifkin & Klautau, 2004). Furthermore, the prediction time of binary SVM classifiers increases with the number of support vectors. Experiments in Hsu and Lin (2002) and Milgram, Cheriet, and Sabourin (2006) demonstrate that there are fewer support vectors in the OAO SVM scheme and conclude that the prediction time of OAO SVM classification is shorter than that of OAA SVM classification. However, Rifkin and Klautau (2004) argue that, when the binary SVM classifiers in the OAA and OAO schemes are tuned properly, the difference between the two schemes is small. Thus, it is difficult to conclude whether the OAO or the OAA SVM scheme performs better. In this paper, we focus on the OAA scheme and propose revised learning strategies based on the decision values of SVM classification.

3. The proposed methods

Ertekin, Huang, Bottou, and Lee Giles (2007) argue that adding training documents that are close to the SVM hyperplane to the training set will improve classification performance. However, indistinguishable documents are not always close to the hyperplane, since their labels are predicted with low confidence (low SVM decision values). Thus, the proposed method uses SVM decision values to find indistinguishable verified documents before they are added to the training set. Section 3 is divided into two parts.
Section 3.1 presents the distinguishing method, which is the preprocessing step of the learning strategies. The other subsections present the learning strategies, which handle verified and distinguished documents differently.

3.1. Identifying Possibly Misclassified Documents (IPMD)

In SVM classification, the signs of the decision values and the shortest distance between the testing document and the hyperplane are used to decide the predicted class label. However, two situations cause poor SVM classification results in the OAA scheme. In the first, all decision values are negative, which implies that no SVM classifier can recognize the testing document with proper confidence. In the second, the maximum and second maximum decision values are very close, so the testing document is hard to differentiate between these two SVM classifiers. Therefore, the class label of a testing document might be predicted with low confidence when it depends only on the maximum decision value.
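The winner-takes-all rule and the two low-confidence situations described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `oaa_predict` and the tolerance `close_margin` are hypothetical, and the decision values are assumed to be already computed by the C binary classifiers.

```python
def oaa_predict(decision_values, close_margin):
    """Winner-takes-all over the C one-against-all decision values.

    decision_values[i] stands for w_i . x + b_i of classifier SVM_i.
    close_margin is a hypothetical tolerance, not a value from the paper.
    """
    if len(decision_values) < 2:
        raise ValueError("OAA classification needs at least two classes")
    # Rank class indices by decision value, largest first.
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: decision_values[i], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    gap = decision_values[best] - decision_values[runner_up]
    return {
        "label": best,                              # predicted class index
        "all_negative": decision_values[best] < 0,  # first situation
        "top_two_close": gap < close_margin,        # second situation
    }


# No classifier accepts the document, and classes 1 and 2 are hard to separate.
print(oaa_predict([-0.8, -0.1, -0.3], close_margin=0.3))
# A confident prediction for class 0.
print(oaa_predict([1.2, -0.9, -1.1], close_margin=0.3))
```

Both flags are derived from the same ranked list, so the check adds no cost beyond the argmax already needed for prediction.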


Fig. 2. RV distribution of testing documents whose predicted labels are the arbitrary class.

Fig. 3. DV distribution of testing documents with RV smaller than 0.4 in the arbitrary class.

Before introducing the distinguishing method, we first define some notation. When the class label of a testing document is predicted by the OAA SVM classifiers, the Result-Value (RV) and Difference-Value (DV) are defined by Eqs. (3) and (4). Given a testing document x, RV(x) is the maximum decision value of x in OAA SVM classification. If RV(x) is generated by the decision function of $\mathrm{SVM}_i$, it indicates how likely it is that the predicted class label of x is $c_i$. Assume that the second largest decision value is generated by the decision function of $\mathrm{SVM}_j$. DV(x) is the difference between RV(x) and the second largest decision value, which indicates how clearly x can be differentiated between $\mathrm{SVM}_i$ and $\mathrm{SVM}_j$.

$$RV(x) = \max_{i=1,\dots,C} \; (w_i \cdot x + b_i), \tag{3}$$

$$DV(x) = RV(x) - \max_{j=1,\dots,C;\; j \neq i} \; (w_j \cdot x + b_j), \tag{4}$$

where x is the document vector and $c_i$ is the class label of the decision function that has the maximum decision value. Fig. 2 shows the RV distribution of testing documents whose predicted labels are the arbitrary class (3000 training documents and 600 testing documents) in SVM classification under our experimental setting. The x-axis is RV, and the y-axis is the number of documents. The white bars show the RV distribution of correctly classified documents, and the gray bars show that of misclassified documents. We observe that when RV(x) is larger than 0.4, the probability of classifying x correctly is over 0.5. When RV(x) is smaller than 0.4, DV(x) is calculated to check the second condition. Fig. 3 shows the DV distribution of testing documents whose RVs are smaller than 0.4. The x-axis is DV, and the y-axis is the number of documents. The white bars show the DV distribution of correctly classified documents, and the gray bars show that of misclassified documents. When DV(x) is larger than 0.3, the probability that x is correctly classified is over 0.5. Thus, if RV(x) is smaller than 0.4 and DV(x) is smaller than 0.3, x is labeled as "indistinguishable". The predicted labels of indistinguishable documents are decided by the SVM prediction model with low confidence and are inconsistent with the true labels with high probability.

The main idea of our distinguishing method is to find two thresholds for each class, named the RV bound (RVB) and the DV bound (DVB), to define indistinguishable documents. Let $\gamma_i$ and $\xi_i$ denote the RVB and DVB of class $c_i$. Assuming that the predicted class of x is $c_i$, RV(x) and DV(x) are compared with $\gamma_i$ and $\xi_i$ to decide whether x is "distinguishable" or "indistinguishable". In Figs. 2 and 3, $\gamma_i$ and $\xi_i$ of the dominant class are 0.4 and 0.3, respectively. Typically, the RV and DV distributions of testing documents can be simulated by those of training documents, because the decision values of testing documents are based on the training documents in the SVM classifiers. Consequently, $\gamma_i$ and $\xi_i$ are determined from the training set of $\mathrm{SVM}_i$. Given a training document t, RV(t) is generated by the decision function of $\mathrm{SVM}_i$. Then, t is defined as a true document if it is a positive training instance of $\mathrm{SVM}_i$, and as a false document if it is a negative one. That is, true documents in the training set correspond to correctly classified documents in the testing set, and false documents in the training set correspond to misclassified documents in the testing set. Once all SVM classifiers and decision functions are generated, RV(t) and DV(t) are evaluated by Eqs. (3) and (4) for each training document t. Then, the score distributions of false and true documents are used to calculate $\gamma_i$ and $\xi_i$ by Eqs. (5) and (6).
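The RV and DV definitions of Eqs. (3) and (4), together with the rule that a document is indistinguishable when both values fall below their bounds, can be sketched as below. The function names are hypothetical, and the bounds 0.4 and 0.3 are only the example values read from Figs. 2 and 3 for one class; in the method itself the bounds are computed per class.

```python
def rv_dv(decision_values):
    """Return (RV, DV): the largest decision value and its gap to the second largest."""
    if len(decision_values) < 2:
        raise ValueError("need decision values from at least two classifiers")
    top_two = sorted(decision_values, reverse=True)[:2]
    return top_two[0], top_two[0] - top_two[1]


def is_indistinguishable(decision_values, rv_bound, dv_bound):
    """A document is indistinguishable when RV < RV bound and DV < DV bound."""
    rv, dv = rv_dv(decision_values)
    return rv < rv_bound and dv < dv_bound


# Example bounds taken from the class illustrated in Figs. 2 and 3.
RV_BOUND, DV_BOUND = 0.4, 0.3

print(is_indistinguishable([0.35, 0.20, -0.9], RV_BOUND, DV_BOUND))  # low RV, low DV
print(is_indistinguishable([0.90, 0.80, -1.0], RV_BOUND, DV_BOUND))  # high RV
print(is_indistinguishable([0.30, -0.5, -1.0], RV_BOUND, DV_BOUND))  # low RV, high DV
```

Only the first document is flagged: the second has a large RV, and the third has a small RV but is clearly separated from the runner-up class.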

$$\gamma_i = \widehat{RV}_{D_i^F} + \left( \widehat{RV}_{D_i^T} - \widehat{RV}_{D_i^F} \right) \frac{|D_i^F|}{|D_i|}, \tag{5}$$

where $D_i$ is the set of documents in class $c_i$, $D_i^F$ is the set of false documents in class $c_i$, $D_i^T$ is the set of true documents in class $c_i$, and $\widehat{RV}_{D_i^F}$ is the average RV of the documents in $D_i^F$.

The RV bound $\gamma_i$ is a threshold above which the probability that a training document is true is higher than the probability that it is false. So, $\gamma_i$ should fall between the average values of the score distributions of false and true documents. In Eq. (5), $\widehat{RV}_{D_i^F}$ and $\widehat{RV}_{D_i^T}$ are calculated; they are the lower and upper bounds of $\gamma_i$. Then, the sizes of the instance sets of these two distributions are used to calculate a weighting percentage, $|D_i^F| / |D_i|$, that decides where $\gamma_i$ should fall between the lower and upper bounds. If $|D_i^F| / |D_i|$ is large, $\gamma_i$ should be far from $\widehat{RV}_{D_i^F}$, so that more testing documents that are similar to misclassified (or false) instances are identified as "indistinguishable".
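As a rough numerical sketch of Eq. (5), the bound can be computed by interpolating between the two averages. The function name `weighted_bound` is hypothetical, and the code assumes that $D_i$ is exactly the union of the true and false documents, which the text does not state explicitly.

```python
def weighted_bound(true_scores, false_scores):
    """Interpolate between the false and true averages as in Eq. (5).

    bound = mean(false) + (mean(true) - mean(false)) * |D_F| / |D|
    Assumption: |D| = |D_T| + |D_F|.
    """
    if not true_scores or not false_scores:
        raise ValueError("both score lists must be non-empty")
    mean_true = sum(true_scores) / len(true_scores)
    mean_false = sum(false_scores) / len(false_scores)
    false_share = len(false_scores) / (len(true_scores) + len(false_scores))
    return mean_false + (mean_true - mean_false) * false_share


# Three true documents with RV = 1.0 and one false document with RV = -1.0:
# the false share is 1/4, so the bound sits a quarter of the way up from -1.0.
print(weighted_bound([1.0, 1.0, 1.0], [-1.0]))  # -0.5
```

With few false documents the bound stays near the false average, so only documents scoring close to typical false instances are flagged; as the share of false documents grows, the bound moves toward the true average and more documents are flagged.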

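$$\xi_i = \widehat{DV}_{\tilde{D}_i^F} + \left( \widehat{DV}_{\tilde{D}_i^T} - \widehat{DV}_{\tilde{D}_i^F} \right) \frac{|\tilde{D}_i^F|}{|\tilde{D}_i|}, \tag{6}$$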


where $\tilde{D}_i = \{x \mid x \in D_i,\ RV(x) < \gamma_i\}$ is the set of training documents in class $c_i$ with RVs smaller than the RV bound $\gamma_i$; $\tilde{D}_i^F$ is the set of false documents in $\tilde{D}_i$; $\tilde{D}_i^T$ is the set of true documents in $\tilde{D}_i$; and $\widehat{DV}_{\tilde{D}_i^F}$ is the average DV of the false documents in $\tilde{D}_i^F$. As in Eq. (5), the DV bound $\xi_i$ in Eq. (6) is placed between two average values, $\widehat{DV}_{\tilde{D}_i^F}$ and $\widehat{DV}_{\tilde{D}_i^T}$, according to the percentage of false documents, $|\tilde{D}_i^F| / |\tilde{D}_i|$. If $|\tilde{D}_i^F| / |\tilde{D}_i|$ is large, $\xi_i$ should be far from $\widehat{DV}_{\tilde{D}_i^F}$, so that more testing documents similar to misclassified instances can be identified as "indistinguishable".

IPMD is the preprocessing step of the learning strategies: it distinguishes verified documents before they are added to the training set. Then, four different approaches that handle verified and distinguished documents are designed to improve classification performance. For example, verified classification results of testing documents whose predicted class label is $c_i$ are presented in Fig. 4. When new training instances are selected, the gray instances are the ones chosen by each of the proposed methods. Furthermore, note that some instances on the right of $w_i \cdot x + b_i = \gamma_i$ are distinguishable, since their DV values are assumed to be larger than $\xi_i$. In the following, the four learning approaches, which apply different kinds of verified instances, are introduced in detail.

Fig. 4. Classification results of testing documents whose predicted class label is $c_i$ in OAA SVM classification. Triangle instances are misclassified, square instances are correctly classified, and dotted ones are indistinguishable. Among these instances, the gray ones are selected as new training instances by the (a) LMD, (b) LID, (c) LDMD, and (d) LMID methods, respectively.

3.2. Learning strategies

3.2.1. Learning from the Misclassified Documents (LMD)

Adding misclassified documents to the training set can adjust the classification rules, since the current prediction model learns from its mistakes. This strategy is called "Learning from the Misclassified Documents (LMD)". An example is shown in Fig. 4(a), where LMD adds the triangle instances to the training set. The distributions of instances in the positive and negative zones are both adjusted, since triangle instances usually lie in the positive zone or around the separating hyperplane. They are informative for correcting the separating hyperplane since they lie around the boundary between the positive and negative zones (Ertekin et al., 2007). However, certain correctly classified documents are not selected even though they are useful for adjusting prediction models. Moreover, adding all correctly classified documents (positive instances) to the training set would cause overfitting. For this reason, and to improve computing efficiency, we do not select correctly classified documents in the learning process.

3.2.2. Learning from Indistinguishable Documents (LID)

After indistinguishable verified documents, which cause poor SVM classification results, are added to the training set, other similar testing documents can be classified correctly with higher probability. This strategy is called "Learning from Indistinguishable Documents (LID)". An illustration is given in Fig. 4(b), where LID adds all dotted instances, which are indistinguishable, to the training set. Most of the dotted instances are located in the negative zone, since their labels are predicted with low confidence. So, LID pays particular attention to adjusting the distribution of instances in the negative zone. As a result, the prediction models may be only slightly updated, since the number of new positive training documents is much smaller than that of new negative ones. We do not add distinguishable verified documents to the training set because they cannot improve the classification performance for testing documents that are similar to the indistinguishable verified documents.

3.2.3. Learning from Distinguishably Misclassified Documents (LDMD)

From another point of view, if indistinguishable verified documents are noise for classification, adding them to the training set would bring no improvement, or could even make results worse. Hence, this learning strategy uses only distinguishable misclassified documents as new training instances. An example is shown in Fig. 4(c), where LDMD adds the triangle instances that are not dotted, which are distinguishable and misclassified, to the training set. Most of the gray instances are located in the positive zone, since their labels are predicted with high confidence. LDMD concentrates on adjusting the distribution of positive instances. However, minor errors might be ignored and remain in the prediction models. We do not add distinguishable and correctly classified documents to the training set because they would cause overfitting.

3.2.4. Learning from the Misclassified and the Indistinguishable Documents (LMID)

This learning strategy takes advantage of both the LMD and LID strategies: misclassified documents and indistinguishable documents are added to the classifiers' training sets. This strategy is called "Learning from the Misclassified and Indistinguishable Documents (LMID)". An example is illustrated in Fig. 4(d), where LMID adds the triangle instances and the dotted square instances, which are misclassified or indistinguishable, to the training set. Triangle instances are useful for correcting the classification model, since they lie in the positive zone or around the separating hyperplane. The other selected instances enhance the accuracy of low-confidence classification, because they are indistinguishable training documents. Hence, the current prediction model can be corrected more appropriately than by applying LMD. We do not add documents that are both distinguishable and correctly classified to the training set because they have no obvious advantage for improving classification results.
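The four selection rules of Sections 3.2.1 to 3.2.4 differ only in which combination of the two flags (misclassified, indistinguishable) they accept. A minimal sketch follows, assuming each verified document is represented as a dictionary with hypothetical keys `id`, `misclassified`, and `indistinguishable`.

```python
# Each rule maps the two verified flags of a document to a keep/skip decision.
STRATEGIES = {
    "LMD": lambda mis, ind: mis,               # all misclassified documents
    "LID": lambda mis, ind: ind,               # all indistinguishable documents
    "LDMD": lambda mis, ind: mis and not ind,  # misclassified and distinguishable
    "LMID": lambda mis, ind: mis or ind,       # misclassified or indistinguishable
}


def select_new_training(verified_docs, strategy):
    """Return the ids of verified documents that the strategy adds to the training set."""
    rule = STRATEGIES[strategy]
    return [d["id"] for d in verified_docs
            if rule(d["misclassified"], d["indistinguishable"])]


# One document for each of the four cases HD, MD, HI, and MI.
docs = [
    {"id": "HD", "misclassified": False, "indistinguishable": False},
    {"id": "MD", "misclassified": True, "indistinguishable": False},
    {"id": "HI", "misclassified": False, "indistinguishable": True},
    {"id": "MI", "misclassified": True, "indistinguishable": True},
]

for name in STRATEGIES:
    print(name, select_new_training(docs, name))
```

No strategy selects the HD document, which matches the argument above that distinguishable, correctly classified documents are left out of the training set.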


4. Experiment results

4.1. Environment

A total of 4200 Chinese official documents of National Chung Cheng University from the years 2002–2005 are used in our experiments. There are 20 units (classes), with 210 documents in each unit. Chinese official documents have the special characteristics mentioned in Section 1. In the Sentence Segmentation module illustrated in Fig. 1, we use the Chinese sentence segmentation tool (The Chinese Knowledge & Information Processing), developed by the Institute of Information Science, Academia Sinica, to segment Chinese sentences. Document Frequency with filtering level 0.8 is the feature selection method, and TFIDF plus L2 normalization is the term weighting method. As the classification tool, SVMlight (Joachims, 1999), developed by Thorsten Joachims, is chosen for the SVM classifiers (with a linear kernel for text classification problems; Joachims, 1998).

In our classification analysis, HD is the number of correctly classified and distinguishable documents, MD is the number of misclassified and distinguishable documents, HI is the number of correctly classified and indistinguishable documents, and MI is the number of misclassified and indistinguishable documents. The frequently used classification accuracy metric is defined as (HD + HI)/(HD + MD + HI + MI). For measuring the distinguishing ability, the accuracy of distinguishable documents is HD/(HD + HI), and the miss ratio of indistinguishable documents is MI/(HI + MI).

Table 2. Classification accuracy of learning strategies in OAA SVM classification.

| Round | Original (%) | LID (%) | LDMD (%) | LMD (%) | LMID (%) | Learning all (%) |
|-------|--------------|---------|----------|---------|----------|------------------|
| 1     | 69.87 | 69.87 | 69.87 | 69.87 | 69.87 | 69.87 |
| 2     | 72.70 | 72.87 | 73.93 | 74.20 | 74.17 | 75.90 |
| 3     | 68.90 | 69.20 | 72.73 | 73.33 | 73.93 | 74.73 |
| 4     | 69.90 | 70.23 | 73.73 | 74.77 | 75.03 | 75.83 |
| 5     | 69.57 | 69.57 | 75.70 | 76.80 | 77.10 | 77.83 |
| 6     | 72.07 | 72.17 | 77.47 | 79.07 | 78.93 | 80.53 |
| 7     | 71.90 | 72.17 | 77.30 | 78.17 | 78.83 | 79.93 |
| 8     | 70.33 | 70.80 | 77.10 | 78.30 | 78.30 | 78.80 |
| 9     | 70.57 | 70.67 | 77.33 | 78.57 | 79.63 | 80.57 |
| 10    | 70.53 | 70.70 | 76.80 | 80.37 | 80.37 | 82.07 |
| 11    | 70.47 | 71.03 | 79.24 | 81.87 | 82.47 | 83.90 |
| 12    | 70.73 | 71.27 | 77.93 | 80.13 | 80.73 | 83.07 |
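The three evaluation metrics defined in Section 4.1 can be computed directly from the four counts. The sketch below uses a hypothetical function name and, as sample input, the counts reported for the IPMD experiment in Section 4.2 (HD = 876, MD = 121, HI = 81, MI = 122).

```python
def ipmd_metrics(hd, md, hi, mi):
    """Compute the Section 4.1 metrics from the four document counts."""
    total = hd + md + hi + mi
    return {
        # (HD + HI) / (HD + MD + HI + MI)
        "accuracy": (hd + hi) / total,
        # HD / (HD + HI)
        "distinguishable_accuracy": hd / (hd + hi),
        # MI / (HI + MI)
        "indistinguishable_miss_ratio": mi / (hi + mi),
    }


metrics = ipmd_metrics(hd=876, md=121, hi=81, mi=122)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts the last two metrics come to about 91.5% and 60.1%, the figures quoted for IPMD, and the overall accuracy on the 1200 testing documents is 79.75%.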

4.2. Performance of identifying indistinguishable documents

This experiment shows that IPMD can identify indistinguishable documents that tend to be misclassified according to their SVM decision values. The whole document set is randomly divided into 3000 training documents and 1200 testing documents. Ten combinations of training and testing sets are generated for averaging the decision values of SVM classification. In Table 1, the RV bound $\gamma_i$ and the DV bound $\xi_i$ are calculated from the training documents. Given the predicted label $c_i$ and the testing set D, $\widehat{RV}_{D_i^F}$ and $\widehat{DV}_{\tilde{D}_i^F}$ are the averages of RV(t) and DV(t) over the documents $t \in D$ that are misclassified; $\widehat{RV}_{D_i^T}$ and $\widehat{DV}_{\tilde{D}_i^T}$ are those over the documents $t \in D$ that are correctly classified. It is observed that both $\widehat{RV}_{D_i^T}$ and $\widehat{DV}_{\tilde{D}_i^T}$ are larger than $\gamma_i$ and $\xi_i$, respectively. That means a large part of the correctly classified documents would not be identified as "indistinguishable". Next, bold values in Table 1 mark the cases $\widehat{RV}_{D_i^F} < \gamma_i$ or $\widehat{DV}_{\tilde{D}_i^F} < \xi_i$. In the cases where both $\widehat{RV}_{D_i^F}$ and $\widehat{DV}_{\tilde{D}_i^F}$ are bold, more of the misclassified documents whose predicted label is $c_i$ are identified as "indistinguishable" than in other cases. Overall, the numerical results of IPMD are HD = 876, MD = 121, HI = 81, and MI = 122. That is, the accuracy of distinguishable documents, HD/(HD + HI), is 91.5%, and the miss ratio of indistinguishable documents, MI/(HI + MI), is 60.1%, so 60.1% of the indistinguishable documents are misclassified. Hence, our experiments show that IPMD is effective at identifying indistinguishable and distinguishable documents.

4.3. Comparisons of learning strategies

Chinese official documents yield poor classification results because of indistinguishable contents. In the following experiments, indistinguishable documents are observed to be useful for strengthening classification accuracy. The whole document set is randomly divided into 14 sets. Any two of the 14 sets are selected as the training set, and the others serve as testing sets in 12 rounds, simulating an environment in which classifiers are corrected by new training documents. Ten combinations of training and testing sets are generated for averaging the classification performance in each round. The classification accuracy of each method is presented in Table 2. In detail, the numbers of newly added and current training documents in each round are shown in Figs. 5 and 6. Compared with the proposed strategies, "original" is the learning strategy

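Table 1. $\gamma_i$, $\xi_i$, and average decision values of classified instances.

| $c_i$ | $\gamma_i$ | $\xi_i$ | $\widehat{RV}_{D_i^T}$ | $\widehat{RV}_{D_i^F}$ | $\widehat{DV}_{\tilde{D}_i^T}$ | $\widehat{DV}_{\tilde{D}_i^F}$ |
|-------|------|------|------|------|------|------|
| H100  | 0.22 | 0.39 | 0.60 | 0.48 | 1.39 | 0.21 |
| M062  | 0.31 | 0.45 | 0.91 | 0.34 | 1.70 | 0.37 |
| T000  | 0.28 | 0.45 | 0.69 | 0.35 | 1.52 | 0.36 |
| L011  | 0.34 | 0.35 | 0.73 | 0.18 | 1.51 | 0.45 |
| 7256  | 0.26 | 0.38 | 0.97 | 0.24 | 1.79 | 0.43 |
| 7104  | 0.13 | 1.00 | 1.03 | 0.36 | 1.90 | 0.33 |
| M070  | 0.02 | 0.24 | 0.73 | 0.44 | 1.59 | 0.29 |
| L031  | 0.11 | 0.49 | 0.52 | 0.29 | 1.28 | 0.37 |
| U000  | 0.00 | 0.52 | 0.76 | 0.38 | 1.55 | 0.30 |
| N020  | 0.11 | 0.41 | 0.56 | 0.27 | 1.34 | 0.37 |
| 7206  | 0.16 | 0.31 | 0.61 | 0.36 | 1.38 | 0.33 |
| X000  | 0.11 | 0.26 | 0.48 | 0.27 | 1.27 | 0.46 |
| N050  | 0.41 | 0.22 | 0.59 | 0.43 | 1.36 | 0.27 |
| P000  | 0.48 | 0.22 | 0.54 | 0.41 | 1.36 | 0.27 |
| Z030  | 0.21 | 0.40 | 0.80 | 0.08 | 1.54 | 0.49 |
| M040  | 0.13 | 0.41 | 0.55 | 0.04 | 1.33 | 0.75 |
| Z000  | 0.52 | 0.28 | 0.25 | 0.33 | 0.98 | 0.35 |
| N040  | 0.42 | 0.40 | 0.99 | 0.17 | 1.80 | 0.54 |
| V000  | 0.79 | 0.07 | 0.56 | 0.39 | 1.35 | 0.32 |
| o000  | 0.82 | 0.09 | 1.15 | 0.33 | 2.07 | 0.33 |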

Fig. 5. Number of new training documents added in each round.

sets. Importantly, the size of training sets is greatly reduced by our learning strategies. Hence, ‘‘learning all’’ is not a suitable learning strategy in our experimental environment.

5. Conclusion

Fig. 6. number of total training documents in each round.

without modifying the original training set, and ‘‘learning all’’ is the learning strategy that adds every veriﬁed documents to the training set. LID corrects prediction models by adding indistinguishable documents to the training set. However, classiﬁcation rules are just slightly adjusted since their locations are usually on the negative zone (divided by the separating hyperplane). Thus, prediction models could not be signiﬁcantly improved by LID. The experiment in Table 2 shows that only little improvement on classiﬁcation accuracy is made by LID compared to ‘‘original’’. LMD updates prediction models by adding new samples that are improperly classiﬁed. In Table 2, classiﬁcation accuracy of ‘‘original’’ is improved by LMD from Round 2 since selecting misclassiﬁed documents could immediately correct prediction models. LDMD adds only distinguishable and misclassiﬁed documents into training sets. Consequently, classiﬁcation models that have signiﬁcant errors will be properly corrected. However, some parts of errors still exist in prediction models by applying LDMD since indistinguishable documents are not utilized for adjustment. In Fig. 5 more documents which are misclassiﬁed and indistinguishable, (LMD–LDMD), are identiﬁed from Round 2, and the difference of accuracy between LMD and LDMD becomes large from Round 3. It is observed that LDMD is not a proper learning strategy even though it selects less new training documents than LMD. Experiments also demonstrate that selecting indistinguishable and misclassiﬁed documents is helpful for adjusting prediction models. LMID corrects prediction models by adding both misclassiﬁed documents and indistinguishable documents to the training set. It is worth noting in Table 2 that, the classiﬁcation performance of LMD is not accurate enough. LMID gains equal and higher prediction accuracy than LMD since indistinguishable documents could be classiﬁed more appropriately. 
LMID does select slightly more new training documents than LMD, as shown in Fig. 6. This small difference in training-set size is necessary for accurate classification of indistinguishable documents, and its effect on the time of the subsequent SVM training is negligible. In summary, our experiments show that adding indistinguishable documents to training sets helps strengthen the accuracy of classification models. The experimental results in Table 2 also show that "learning all" achieves the best classification accuracy. However, the time needed to train SVM classifiers increases with the size of the training set. Fig. 6 shows that the curve of "learning all" climbs very fast since it lacks a mechanism of selecting training

The proposed multi-class SVM classification method addresses the classification of Chinese official documents. The traditional solution is to add misclassified documents to the training set in order to adjust classification rules. In this paper, selecting indistinguishable documents for training sets is observed to strengthen classification accuracy, since current models predict their labels with low confidence. After they are added to the training set, incoming documents that are similar to them can be classified more appropriately. Specifically, a general method, IPMD, is proposed that uses SVM decision values to identify indistinguishable documents among verified documents. Experiments show that indistinguishable documents are misclassified with high probability. Furthermore, four learning strategies are proposed to enhance classification accuracy. Experimental results show that LMID classifies documents more accurately than LMD while keeping training sets much smaller than "learning all". Hence, our results show that adding both the misclassified and the indistinguishable documents to the training set is the best learning strategy in OAA SVM classification for large sets of Chinese official documents. Our future direction is to analyze the properties of indistinguishable documents, especially in multi-class classification, so as to identify more of the documents that are likely to be misclassified and to improve classification accuracy. Moreover, IPMD could be applied to testing documents before classification, so that other prediction methods can be adopted to handle indistinguishable documents differently according to their contents.
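
To make the idea of low-confidence prediction from decision values concrete, the sketch below flags a document as indistinguishable when the gap between its two largest OAA decision values is small. This is an assumed heuristic for illustration, not the IPMD rule itself, whose exact definition is not restated in this section; the function name, the threshold value of 0.5, and the category labels are all hypothetical.

```python
def predict_with_confidence(decision_values, threshold=0.5):
    """Pick the OAA label and flag low-confidence predictions.

    decision_values maps each class label to the decision value of that
    class's one-against-all SVM for a single document. The document is
    flagged as indistinguishable when the margin between the best and the
    second-best decision value is below `threshold` (an assumed criterion).
    """
    if len(decision_values) < 2:
        raise ValueError("need decision values for at least two classes")
    ranked = sorted(decision_values.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_value), (_, runner_up_value) = ranked[0], ranked[1]
    margin = best_value - runner_up_value
    return best_label, margin < threshold


# Hypothetical decision values for two documents over three categories.
clear_doc = {"finance": 1.2, "personnel": -0.8, "general": -1.1}
unclear_doc = {"finance": 0.10, "personnel": 0.05, "general": -0.9}

print(predict_with_confidence(clear_doc))
print(predict_with_confidence(unclear_doc))
```

Under this assumption, the first document is predicted confidently, while the second is flagged as indistinguishable because two classifiers give nearly equal scores; such a flag is what a strategy like LMID would consume when choosing documents for retraining.
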

