A multi-class SVM classification system based on learning methods from indistinguishable chinese official documents




Expert Systems with Applications 39 (2012) 3127–3134

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa

A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents ☆

JuiHsi Fu *, SingLing Lee

Department of Computer Science and Information Engineering, National Chung Cheng University, 168 University Road, Minhsiung Township, 62162 Chiayi, Taiwan, ROC


Keywords: Support Vector Machines (SVM); multi-class classification; Chinese official document classification; indistinguishability identification; incremental learning

Abstract: Support Vector Machines (SVM) have been developed for Chinese official document classification in the One-against-All (OAA) multi-class scheme. Several data retrieving techniques, including sentence segmentation, term weighting, and feature extraction, are used in preprocessing. We observe that documents whose contents are indistinguishable yield poor classification results. The traditional solution is to add misclassified documents to the training set in order to adjust classification rules. In this paper, indistinguishable documents are observed to be informative for strengthening prediction performance, since their labels are predicted by the current model in low confidence. A general approach is proposed that utilizes decision values in SVM to identify indistinguishable documents. Based on verified classification results and the distinguishability of documents, four learning strategies that select certain documents for the training sets are proposed to improve classification performance. Experiments report that indistinguishable documents can be identified with high probability and are informative for learning strategies. Furthermore, LMID, which adds both misclassified documents and indistinguishable documents to training sets, is the most effective learning strategy in SVM classification for large sets of Chinese official documents in terms of computing efficiency and classification accuracy. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

In government departments and companies, some kinds of articles and documents are still handled by human labor. Among these data, Chinese official documents are used very often to inform and communicate officially with companies/corporations. Each employee is responsible for dispatching official documents to all related departments. Hence, designing an accurate classification system to handle Chinese official documents will improve government employees' working efficiency. However, in Chinese documents, segmented terms very often cannot represent the original meaning of the content completely, since no delimiter exists in the content. Moreover, no formal stop-word list is defined, as in English, to remove meaningless words. Due to the lack of complete content representation and stop-word lists, distinct features are difficult to extract from the document content. Additionally, Chinese official documents, which are well-formed and textual, have some special characteristics, listed below:

☆ This work is supported by NSC, Taiwan, ROC under Grant No. NSC 97-2221-E-194-029-MY2. * Corresponding author. E-mail addresses: [email protected] (J. Fu), [email protected] (S. Lee).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.08.176

1. Short and brief content. The abstract of an official document is usually used to represent the document for classification. The abstract needs to be short and brief to describe government affairs.
2. Fewer distinct features. In official affairs, Chinese official documents belonging to different units (classes) tend to be similar, since most terms in a document are not discriminative. Classifying Chinese official documents is difficult when it depends on only a few distinct terms.

Our objective is to classify Chinese official documents more precisely. One department in the organization corresponds to a class label, so the problem of automatically dispatching an official document is reduced to a multi-class classification problem. However, the special characteristics of Chinese official documents usually cause poor classification results. The traditional learning strategy is to add misclassified documents to the training set in order to adjust classification rules. In this paper, some correctly classified documents are observed to be indistinguishable, since their labels are predicted in low confidence. It is worth noting that indistinguishable documents should be informative for strengthening prediction rules, since subsequent similar documents could then be correctly classified with higher probability. Hence, a distinguishing method, Identifying Possibly Misclassified Documents (IPMD), is proposed to


distinguish whether verified documents tend to be misclassified or not. Based on verified classification results and document distinguishability, four learning strategies that select certain documents for the training sets are proposed to enhance classification accuracy and reduce the size of training sets. They are introduced in more detail in Section 3. Fig. 1 is an overview of our document classification system. Initially, training documents are processed by the Text Preprocessing and Classifier Training modules to build a prediction model. Feature extraction eliminates terms with weights lower than a predefined threshold, and term weighting methods (Combarro, Montañés, Díaz, Ranilla, & Mones, 2005; Quinlan, 1986; Salton & Buckley, 1988; Salton & McGill, 1983) are used to represent document vectors. Classifier Training is the kernel of a classification system. Some well-known classification methods, K-Nearest Neighbors (KNN) (Yuan, Yang, & Yu, 2005), Support Vector Machines (SVM) (Cortes & Vapnik, 1995; Cristianini & Taylor, 2000; Liang, 2004), Naive Bayes (Lewis, 1998; Lewis & Ringuette, 1994), and neural networks (Wiener, 1995), have been well studied. Notably, SVM is adopted for solving our document classification problems since it has been proven to perform very effectively in many research results (Deng & Peng, 2006; Díaz, Ranilla, Montañés, Fernández, &

Combarro, 2004; Dumais, Platt, Heckerman, & Sahami, 1998; Joachims, 1998; Kecman, 2001; Lee & Lee, 2005; Ramirez, Durdle, Raso, & Hill, 2006; Özgür & Güngör, 2006; Wang, 2005; Wang, Sun, Zhang, & Li, 2006; Wang & Fu, 2005) and is able to deal with high-dimensional feature spaces. Geometrically speaking, SVM (Cortes & Vapnik, 1995) generates a hyperplane to separate positive instances from negative ones. The objective function maximizes the distance from the nearest training instance to the separating hyperplane. When the prediction model is generated, testing documents are also processed by Text Preprocessing, and their class labels are predicted by Classifier Training. The Verified module judges the prediction results (supervised learning; Alpaydin, 2004). Then, the IPMD module utilizes the decision values of verified documents in SVM classification to determine whether they are distinguishable or not. Next, a revised Learning Strategy module that utilizes verified classification results and the distinguishability of documents to select new training instances is developed in order to update prediction models and improve classification performance. The objective of semi-supervised learning (SSL) (Joachims, 1999) is to utilize unlabeled samples to decrease the use of labeled samples. In this paper, the proposed methods identify

Fig. 1. Text Preprocessing, Classifier Training, and classifier testing of our SVM classification system.


indistinguishable documents from classified testing ones and put them into training sets in order to build SVM models more precisely. Testing documents are part of the labeled dataset. Hence, our methods focus on constructing new training sets based on the original classification models, unlike SSL, which utilizes unlabeled samples for generating prediction models. Moreover, the SVM prediction model of traditional online learning is updated by each new training instance to achieve the specific objective greedily (Bordes, Ertekin, Weston, & Bottou, 2005; Cauwenberghs & Poggio, 2000; Lau & Wu, 2003). Although the proposed learning strategies keep adding new training instances to the training set, they are different from traditional online learning. This paper focuses on how to appropriately construct the training set up to a certain size, such that it is later used to rebuild the SVM classifier and improve classification accuracy. So, when a new training instance arrives, it is first kept on disk without updating our SVM model immediately. Once new training instances have accumulated to a specific amount (two sizes, 100 and 300, are chosen in our experiments), the SVM model is rebuilt by adding a part of the new training instances to the current training set. Our prediction models are rebuilt by the original classifiers with a newly created training set, unlike online ones that process each new training instance with no need for storage and reprocessing. Eventually, the proposed methods are designed to select informative labeled documents to build an accurate classification system. The rest of this paper is organized as follows: Section 2 introduces multi-class SVM classification methods. Section 3 presents the proposed methods: one distinguishing method and four learning strategies. Section 4 shows the experimental performance of our SVM document classification method. Section 5 is the summary.

2. Related works

SVM is originally designed for binary classification and can be applied to multi-class classification problems with multi-class schemes (Chang, Chou, Lin, & Chen, 2004; Hsu & Lin, 2002; Liang, 2003, 2004; Lin, Peng, & Liu, 2006; Rennie & Rifkin, 2001; Zou, Chen, & Guo, 2005). One-against-All (OAA) and One-against-One (OAO) multi-class schemes are adopted most frequently in current research (Chang et al., 2004; Chin, 1998; Hsu & Lin, 2002; Rennie & Rifkin, 2001). One widely used approach in the OAA scheme is called "winner-takes-all": the label of an instance is predicted according to the maximum output value among all SVM classifiers. Practically, the OAA SVM classification scheme trains C binary SVM classifiers for C classes. The SVM_i classifier is trained with positive instances whose class labels are c_i and negative instances whose class labels are not c_i. After these C binary SVM classifiers are trained, C corresponding decision functions are generated. In the classification procedure, the predicted class label of a testing instance x is the label of the decision function with the maximum decision value:

arg max_{i=1...C} (w_i · x + b_i)    (1)

(w_i · x + b_i) is the decision function for class c_i. Thus, OAA SVM classification solves C quadratic programming problems, each with l variables (l is the size of the whole training set). In the OAO scheme, one major method is pairwise comparison with majority voting: the class label assigned to an instance is the one winning the most pairwise comparisons, also called "Max Wins". Consequently, multi-class SVM classification problems are solved by training C(C − 1)/2 binary SVM classifiers for all pairs of classes and formulating C(C − 1)/2 corresponding decision functions. SVM_ij is trained with positive instances whose class labels are c_i and negative instances whose class labels are c_j. The predicted class label of


a testing instance is determined by all decision values and a voting strategy. If the sign of (w_ij · x + b_ij), the decision function for classes c_i and c_j, shows that the predicted class label of instance x is c_i, class c_i is voted for:

arg max_{i=1...C} vote_x(i),    (2)

vote_x(i) = Σ_{j=1, j≠i}^{C} v_x(i, j),

v_x(i, j) = 1 if sign(w_ij · x + b_ij) says x is in class c_i, and 0 otherwise.

In the OAO SVM classification scheme, C(C − 1)/2 binary SVM classifiers are constructed, and the average number of training documents in each classifier is 2l/C, where l is the size of the whole training set. Thus, OAO SVM classification needs to solve C(C − 1)/2 quadratic programming problems, each with 2l/C variables. Error-Correcting Output Codes (ECOC) (Allwein, Schapire, & Singer, 2000; Dietterich & Bakiri, 1995) are methods for combining different binary classifiers. Practically, it has been demonstrated that OAA and OAO schemes with well-tuned classifiers are as effective as other ECOC methods (Rifkin & Klautau, 2004). Furthermore, the predicting time of binary SVM classifiers increases with the number of support vectors. Experiments in Hsu and Lin (2002) and Milgram, Cheriet, and Sabourin (2006) demonstrate that there are fewer support vectors in the OAO SVM scheme, and conclude that the predicting time of OAO SVM classification is shorter than that of OAA SVM classification. However, Rifkin and Klautau (2004) argue that, when the binary SVM classifiers in the OAA and OAO schemes are tuned properly, the difference between these two schemes is small. Thus, it is difficult to conclude whether the OAO or OAA SVM scheme performs better. In this paper, we focus on the OAA scheme and propose revised learning strategies based on the decision values of SVM classification.

3. The proposed methods

Ertekin, Huang, Bottou, and Lee Giles (2007) argue that adding training documents that are close to the SVM hyperplane to the training set improves classification performance. However, indistinguishable documents are not always close to the hyperplane, even though their labels are predicted in low confidence (low SVM decision values). Thus, the proposed method utilizes decision values in SVM to find indistinguishable verified documents before they are applied to the training set. Section 3 is divided into two parts.
Section 3.1 presents the distinguishing method, which is the preprocessing step of the learning strategies. The other subsections present the learning strategies, which handle verified and distinguished documents differently.

3.1. Identifying Possibly Misclassified Documents (IPMD)

In SVM classification, the signs and the shortest distance between the testing document and the hyperplane are used to decide the predicted class label. However, two situations cause poor SVM classification results in the OAA scheme. In the first case, all decision values might be negative, which implies that no SVM classifier can recognize the testing document with proper confidence. In the second, the maximum and second maximum decision values are very close, so the testing document is hard to differentiate between these two SVM classifiers. Therefore, the class label of a testing document might be predicted in low confidence when depending only on the maximum decision value.
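These two low-confidence situations can be checked directly from the per-class decision values. The sketch below is a minimal illustration, not the paper's implementation: `decision_values` stands for the hypothetical list of OAA outputs (w_i · x + b_i), and the default bounds 0.4 and 0.3 are the example thresholds read off Figs. 2 and 3; the per-class bounds derived later in this section would replace them in practice.

```python
def low_confidence(decision_values, rv_bound=0.4, dv_bound=0.3):
    """Flag a prediction whose OAA decision values suggest low confidence.

    decision_values: one (w_i . x + b_i) per class.
    RV is the maximum decision value; DV is its margin over the runner-up.
    """
    ranked = sorted(decision_values, reverse=True)
    rv = ranked[0]          # Result-Value of the winning classifier
    dv = rv - ranked[1]     # Difference-Value to the second-best one
    # The document is "indistinguishable" when the winner is weak (RV below
    # its bound, e.g. all values negative) AND the top two classifiers are
    # hard to tell apart (DV below its bound).
    return rv < rv_bound and dv < dv_bound

print(low_confidence([0.9, -0.2, -0.7]))    # False: RV = 0.9 is confident
print(low_confidence([-0.05, -0.1, -0.8]))  # True: all negative, near tie
```

Note that the rule is conjunctive: a document with a small RV is still distinguishable if its DV clearly separates the top two classifiers.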


Fig. 2. RV distribution of testing documents of which the predicted labels are the arbitrary class.

Fig. 3. DV distribution of testing documents with RV smaller than 0.4 in the arbitrary class.

Before introducing the distinguishing method, we first define some notation. When the class label of a testing document is predicted by OAA SVM classifiers, Result-Value (RV) and Difference-Value (DV) are defined by Eqs. (3) and (4). Given a testing document x, RV(x) is the maximum decision value when x is classified in OAA SVM classification. If RV(x) is generated by the decision function in SVM_i, it stands for how likely x's predicted class label is c_i. Assume the second largest decision value is generated by the decision function in SVM_j. DV(x) is the difference between RV(x) and the second largest decision value, which stands for how clearly x can be differentiated between SVM_i and SVM_j.

RV(x) = max_{i=1...C} (w_i · x + b_i),    (3)

DV(x) = RV(x) − max_{j=1...C, j≠i} (w_j · x + b_j),    (4)

where x is the document vector and c_i is the class label of the decision function with the maximum decision value. Fig. 2 shows the RV distribution of testing documents whose predicted labels are the arbitrary class (3000 training documents and 600 testing documents) in SVM classification with our experimental setting. The x-axis is RV, and the y-axis is the number of documents. The white bars are the RV distribution of correctly classified documents, and the gray bars are the RV distribution of misclassified documents. We observe that when RV(x) is larger than 0.4, the probability of classifying x correctly is over 0.5. When RV(x) is smaller than 0.4, DV(x) is calculated for identifying the other condition. Fig. 3 is the DV distribution of testing documents with RVs smaller than 0.4. The x-axis is DV, and the y-axis is the number of documents. The white bars are the DV distribution of correctly classified documents, and the gray bars are the DV distribution of misclassified documents. When DV(x) is larger than 0.3, the probability that x is correctly classified is over 0.5. Thus, if RV(x) is smaller than 0.4 and DV(x) is smaller than 0.3, x is labeled as "indistinguishable". Undoubtedly, predicted labels of the indistinguishable documents are decided by the SVM prediction model in low confidence and are inconsistent with the true labels with high probability.

The main idea of our distinguishing method is to find two thresholds in each class, named the RV bound (RVB) and the DV bound (DVB), to define indistinguishable documents. Let γ_i and η_i denote the RVB and DVB of class c_i. Assuming x's predicted class is c_i, RV(x) and DV(x) are compared with γ_i and η_i to decide whether x is "distinguishable" or "indistinguishable". In Figs. 2 and 3, γ_i and η_i in the dominant class are 0.4 and 0.3, respectively. Typically, the RV and DV distributions of testing documents can be simulated by those of training documents, because the decision values of testing documents are based on the training documents in the SVM classifiers. Consequently, γ_i and η_i are determined by the training set in SVM_i. Given a training document t, RV(t) is generated by the decision function in SVM_i. Then, t is defined as a true document if t is a positive training instance in SVM_i, and as a false document if t is a negative one. This implies that true documents in the training set correspond to correctly classified documents in the testing set, and false documents in the training set correspond to misclassified documents in the testing set. When all SVM classifiers and decision functions are generated, RV(t) and DV(t) are evaluated by Eqs. (3) and (4) for each training document t. Then, the two score distributions of false and true documents are utilized to calculate γ_i and η_i by Eqs. (5) and (6).













γ_i = \hat{RV}(D_i^F) + [\hat{RV}(D_i^T) − \hat{RV}(D_i^F)] · |D_i^F| / |D_i|,    (5)

where D_i is the set of documents in class c_i, D_i^F is the set of false documents in class c_i, D_i^T is the set of true documents in class c_i, and \hat{RV}(D_i^F) is the average RV value of the documents in D_i^F.

γ_i is a threshold above which training documents are more likely to be true than false. So, γ_i should fall between the average values of the score distributions of the false and true documents. In Eq. (5), \hat{RV}(D_i^F) and \hat{RV}(D_i^T) are calculated as the lower and upper bounds of γ_i. Then, the sizes of the instance sets in these two distributions are used to calculate a weighting percentage, |D_i^F|/|D_i|, that decides where γ_i falls between the lower and upper bounds. If |D_i^F|/|D_i| is large, γ_i should be far from \hat{RV}(D_i^F): more testing documents that are similar to misclassified (false) instances should be identified as "indistinguishable".
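As a concrete reading of Eq. (5), the bound can be computed from the two score distributions. The sketch below is an illustration only, with variable names of our own choosing: `rv_true` and `rv_false` hold RV(t) for the true and false training documents of one class.

```python
def rv_bound(rv_true, rv_false):
    """Compute gamma_i of Eq. (5) for one class.

    rv_true:  RV values of true documents (positive instances of SVM_i),
    rv_false: RV values of false documents (negative instances).
    The bound lies between the two distribution averages and moves away
    from the false-document average as false documents become frequent.
    """
    avg_true = sum(rv_true) / len(rv_true)      # upper bound of gamma_i
    avg_false = sum(rv_false) / len(rv_false)   # lower bound of gamma_i
    weight = len(rv_false) / (len(rv_true) + len(rv_false))  # |D_i^F|/|D_i|
    return avg_false + (avg_true - avg_false) * weight

# One false document out of three, distribution averages 0.2 and 1.0:
print(round(rv_bound([1.0, 1.0], [0.2]), 4))  # 0.2 + 0.8 * (1/3) = 0.4667
```

η_i of Eq. (6) follows the same shape, with DV values taken over the reduced document set of the class.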













η_i = \hat{DV}(\tilde{D}_i^F) + [\hat{DV}(\tilde{D}_i^T) − \hat{DV}(\tilde{D}_i^F)] · |\tilde{D}_i^F| / |\tilde{D}_i|,    (6)


where \tilde{D}_i = {x | x ∈ D_i, RV(x) < γ_i} is the set of training documents in class c_i with RVs smaller than γ_i, \tilde{D}_i^F is the set of false documents in \tilde{D}_i, \tilde{D}_i^T is the set of true documents in \tilde{D}_i, and \hat{DV}(\tilde{D}_i^F) is the average DV of the false documents in \tilde{D}_i^F. Similarly to Eq. (5), in Eq. (6), η_i is determined between the two average values, \hat{DV}(\tilde{D}_i^F) and \hat{DV}(\tilde{D}_i^T), according to the percentage of false documents, |\tilde{D}_i^F|/|\tilde{D}_i|. If |\tilde{D}_i^F|/|\tilde{D}_i| is large, η_i should be far from \hat{DV}(\tilde{D}_i^F): more testing documents similar to misclassified instances can be distinguished as "indistinguishable".

IPMD is the preprocessing step of the learning strategies: it distinguishes verified documents before they are added to the training set. Then, four different approaches that handle verified and distinguished documents are designed in order to improve classification performance. For example, verified classification results of testing documents whose predicted class labels are c_i are presented in Fig. 4. In selecting new training instances, the gray instances are those chosen by the proposed methods, respectively. Furthermore, note that some instances on the right of w_i · x + b_i = γ_i are distinguishable, since their DV values are assumed to be larger than η_i. In the following, the four learning approaches that apply different kinds of verified instances are introduced in detail.

Fig. 4. Classification results of testing documents of which the predicted class labels are c_i in OAA SVM classification. Triangle instances are misclassified, square instances are correctly classified, and dotted ones are indistinguishable. Among these instances, gray instances are selected as new training instances by (a) LMD, (b) LID, (c) LDMD, and (d) LMID, respectively.

3.2. Learning strategies

3.2.1. Learning from the Misclassified Documents (LMD)

Adding misclassified documents to the training set might adjust classification rules, since the current prediction model can learn from mistakes. This strategy is called "Learning from the Misclassified Documents (LMD)". An example is shown in Fig. 4(a), where LMD adds the triangle instances to the training set. The distributions of instances in both the positive and negative zones are adjusted, since triangle instances are usually in the positive zone or around the separating hyperplane. They are informative for correcting the separating hyperplane, since they are around the boundary between the positive/negative zones (Ertekin et al., 2007). However, certain correctly classified documents are not selected even though they are useful for adjusting prediction models. Moreover, adding all correctly classified documents (positive instances) to the training set would cause overfitting. That is the reason we do not select correctly classified documents in the learning process, which also improves computing efficiency.

3.2.2. Learning from Indistinguishable Documents (LID)

After adding indistinguishable verified documents, which cause poor SVM classification results, to the training set, other similar testing documents should be correctly classified with higher probability. This strategy is called "Learning from Indistinguishable Documents (LID)". An illustration is presented in Fig. 4(b), where LID adds all dotted instances, which are indistinguishable, to the training set. Most of the dotted instances lie in the negative zone, since their labels are predicted in low confidence. So, LID pays particular attention to adjusting the distribution of instances in the negative zone. This may mean that the prediction models are only slightly updated, since the number of new positive training documents is much smaller than that of new negative ones. The reason we do not consider adding distinguishable verified documents to the training set is that they cannot improve the classification performance for testing documents that are similar to the indistinguishable verified documents.

3.2.3. Learning from Distinguishably Misclassified Documents (LDMD)

From another point of view, if indistinguishable verified documents are noisy for classification, adding them to the training set would make no improvement, or even make it worse. Hence, this learning strategy considers using only distinguishable misclassified documents as new training instances. An example is shown in Fig. 4(c), where LDMD adds the triangle, non-dotted instances, which are distinguishable and misclassified, to the training set. Most of the gray instances lie in the positive zone, since their labels are predicted in high confidence. LDMD concentrates on adjusting the distribution of positive instances. However, minor errors might be ignored and still exist in the prediction models. The reason we do not consider adding distinguishable and correctly classified documents to the training set is that they would cause overfitting.

3.2.4. Learning from the Misclassified and the Indistinguishable Documents (LMID)

In this learning strategy, taking advantage of both the LMD and LID strategies, misclassified documents and indistinguishable documents are added to the classifiers' training sets. This strategy is called "Learning from the Misclassified and Indistinguishable Documents (LMID)". An example is illustrated in Fig. 4(d), where LMID adds the triangle instances and the dotted square instances, which are misclassified or indistinguishable, to the training set. The triangle instances are useful for correcting the classification model, since they are in the positive zone or around the separating hyperplane. Additionally, the others enhance the accuracy of low-confidence classification, because they are indistinguishable training documents. Hence, the current prediction model can be corrected more appropriately than by applying LMD alone. The reason we do not consider adding documents that are distinguishable and correctly classified to the training set is that they have no obvious advantage for improving classification results.
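The four strategies differ only in which verified documents they select, which can be summarized in one sketch (an illustration, not the authors' code). Each verified document is assumed to carry two hypothetical Boolean flags: `correct` from the Verified module and `indistinct` from IPMD.

```python
def select_new_training(docs, strategy):
    """Pick which verified documents join the training set.

    docs: iterable of (doc, correct, indistinct) triples, where
    `correct` is the verified classification result and `indistinct`
    is the IPMD label.
    """
    rules = {
        "LMD":  lambda c, i: not c,            # misclassified only
        "LID":  lambda c, i: i,                # indistinguishable only
        "LDMD": lambda c, i: not c and not i,  # distinguishable misclassified
        "LMID": lambda c, i: not c or i,       # misclassified or indistinct
    }
    keep = rules[strategy]
    return [d for d, c, i in docs if keep(c, i)]

verified = [("d1", True, False),   # correct, distinguishable
            ("d2", False, False),  # misclassified, distinguishable
            ("d3", True, True),    # correct, indistinguishable
            ("d4", False, True)]   # misclassified, indistinguishable
print(select_new_training(verified, "LMID"))  # ['d2', 'd3', 'd4']
```

By construction, LDMD selects a subset of the LMD documents and LMID a superset, which is consistent with the relative training-set sizes discussed in Section 4.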

Table 2. Classification accuracy of learning strategies in OAA SVM classification.

4. Experiment results

4.1. Environment

4200 Chinese official documents from National Chung Cheng University, from the years 2002–2005, are used in our experiments. There are 20 units (classes), and 210 documents in each unit. Chinese official documents have the special characteristics mentioned in Section 1. In the Sentence Segmentation module illustrated in Fig. 1, we use the Chinese sentence segmentation tool (The Chinese Knowledge & Information Processing group), developed by the Institute of Information Science, Academia Sinica, to segment Chinese sentences. Document Frequency with filtering level 0.8 is the feature selection method, and TFIDF plus L2 normalization is the term weighting method. As the classification tool, SVMlight (Joachims, 1999), developed by Thorsten Joachims, is chosen for the SVM classifiers (with a linear kernel for text classification problems; Joachims, 1998). In our classification analysis, HD is the number of correctly classified and distinguishable documents, MD is the number of misclassified and distinguishable documents, HI is the number of correctly classified and indistinguishable documents, and MI is the number of misclassified and indistinguishable documents. The frequently used classification accuracy metric is defined as (HD + HI)/(HD + MD + HI + MI). For measuring the distinguishing ability, the accuracy of distinguishable documents is HD/(HD + HI), and the miss ratio of indistinguishable documents is MI/(HI + MI).
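With these counts, the three metrics can be computed directly. This is a small sketch; the numbers in the usage lines are the IPMD totals reported in Section 4.2.

```python
def metrics(hd, md, hi, mi):
    """Accuracy and distinguishing-ability metrics from the four counts.

    hd/md: correctly classified / misclassified among distinguishable docs,
    hi/mi: correctly classified / misclassified among indistinguishable docs.
    """
    accuracy = (hd + hi) / (hd + md + hi + mi)
    dist_accuracy = hd / (hd + hi)   # accuracy of distinguishable documents
    miss_ratio = mi / (hi + mi)      # miss ratio of indistinguishable docs
    return accuracy, dist_accuracy, miss_ratio

# The IPMD counts reported in Section 4.2: HD=876, MD=121, HI=81, MI=122.
acc, dist_acc, miss = metrics(876, 121, 81, 122)
print(round(dist_acc, 3), round(miss, 3))  # 0.915 0.601, as in the text
```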

Round | Original (%) | LID (%) | LDMD (%) | LMD (%) | LMID (%) | Learning all (%)
1     | 69.87 | 69.87 | 69.87 | 69.87 | 69.87 | 69.87
2     | 72.70 | 72.87 | 73.93 | 74.20 | 74.17 | 75.90
3     | 68.90 | 69.20 | 72.73 | 73.33 | 73.93 | 74.73
4     | 69.90 | 70.23 | 73.73 | 74.77 | 75.03 | 75.83
5     | 69.57 | 69.57 | 75.70 | 76.80 | 77.10 | 77.83
6     | 72.07 | 72.17 | 77.47 | 79.07 | 78.93 | 80.53
7     | 71.90 | 72.17 | 77.30 | 78.17 | 78.83 | 79.93
8     | 70.33 | 70.80 | 77.10 | 78.30 | 78.30 | 78.80
9     | 70.57 | 70.67 | 77.33 | 78.57 | 79.63 | 80.57
10    | 70.53 | 70.70 | 76.80 | 80.37 | 80.37 | 82.07
11    | 70.47 | 71.03 | 79.24 | 81.87 | 82.47 | 83.90
12    | 70.73 | 71.27 | 77.93 | 80.13 | 80.73 | 83.07

4.2. Performance of identifying indistinguishable documents

This experiment shows that IPMD can identify indistinguishable documents, which tend to be misclassified, according to their SVM decision values. The whole document set is randomly divided into 3000 training documents and 1200 testing documents. Ten combinations of training and testing sets are generated for averaging the decision values of SVM classification. In Table 1, γ_i and η_i are calculated from the training documents. Given the predicted label c_i and the testing set D, \hat{RV}(D_i^F) and \hat{DV}(\tilde{D}_i^F) are the averages of RV(t) and DV(t) over the t ∈ D that are misclassified; \hat{RV}(D_i^T) and \hat{DV}(\tilde{D}_i^T) are those over the t ∈ D that are correctly classified. It is observed that both \hat{RV}(D_i^T) and \hat{DV}(\tilde{D}_i^T) are larger than γ_i and η_i, respectively. That means a large part of the correctly classified documents would not be identified as "indistinguishable". Next, bold values mark the cases where \hat{RV}(D_i^F) < γ_i or \hat{DV}(\tilde{D}_i^F) < η_i. In the cases where both \hat{RV}(D_i^F) and \hat{DV}(\tilde{D}_i^F) are bold, misclassified documents whose predicted label is c_i are identified as "indistinguishable" more often than in the other cases. On the whole, the numerical results of IPMD are: HD = 876, MD = 121, HI = 81, MI = 122. That is, 91.5% of the distinguishable documents are correctly classified, and 60.1% of the indistinguishable documents are misclassified. Hence, our experiments show that IPMD has an effective ability to identify indistinguishable/distinguishable documents.

4.3. Comparisons of learning strategies

Chinese official documents yield poor classification results because of their indistinguishable contents. In the following experiments, indistinguishable documents are observed to be useful for strengthening classification accuracy. The whole document set is randomly divided into 14 sets. Any two of the 14 sets are selected as the training set, and the others are testing sets in 12 rounds, simulating the environment of correcting classifiers with new training documents. Ten combinations of training and testing sets are generated for averaging classification performance in each round. The classification accuracy of each method is presented in Table 2. In detail, the numbers of added new training documents and of current ones in each round are shown in Figs. 5 and 6. Compared with the proposed strategies, "original" is the learning strategy

Table 1. γ_i, η_i, and average decision values of classified instances.

C_i  | γ_i  | η_i  | \hat{RV}(D_i^T) | \hat{RV}(D_i^F) | \hat{DV}(\tilde{D}_i^T) | \hat{DV}(\tilde{D}_i^F)
H100 | 0.22 | 0.39 | 0.60 | 0.48 | 1.39 | 0.21
M062 | 0.31 | 0.45 | 0.91 | 0.34 | 1.70 | 0.37
T000 | 0.28 | 0.45 | 0.69 | 0.35 | 1.52 | 0.36
L011 | 0.34 | 0.35 | 0.73 | 0.18 | 1.51 | 0.45
7256 | 0.26 | 0.38 | 0.97 | 0.24 | 1.79 | 0.43
7104 | 0.13 | 1.00 | 1.03 | 0.36 | 1.90 | 0.33
M070 | 0.02 | 0.24 | 0.73 | 0.44 | 1.59 | 0.29
L031 | 0.11 | 0.49 | 0.52 | 0.29 | 1.28 | 0.37
U000 | 0.00 | 0.52 | 0.76 | 0.38 | 1.55 | 0.30
N020 | 0.11 | 0.41 | 0.56 | 0.27 | 1.34 | 0.37
7206 | 0.16 | 0.31 | 0.61 | 0.36 | 1.38 | 0.33
X000 | 0.11 | 0.26 | 0.48 | 0.27 | 1.27 | 0.46
N050 | 0.41 | 0.22 | 0.59 | 0.43 | 1.36 | 0.27
P000 | 0.48 | 0.22 | 0.54 | 0.41 | 1.36 | 0.27
Z030 | 0.21 | 0.40 | 0.80 | 0.08 | 1.54 | 0.49
M040 | 0.13 | 0.41 | 0.55 | 0.04 | 1.33 | 0.75
Z000 | 0.52 | 0.28 | 0.25 | 0.33 | 0.98 | 0.35
N040 | 0.42 | 0.40 | 0.99 | 0.17 | 1.80 | 0.54
V000 | 0.79 | 0.07 | 0.56 | 0.39 | 1.35 | 0.32
o000 | 0.82 | 0.09 | 1.15 | 0.33 | 2.07 | 0.33

Fig. 5. Number of new training documents added in each round.


Fig. 6. Number of total training documents in each round.

without modifying the original training set, and "learning all" is the learning strategy that adds every verified document to the training set.

LID corrects prediction models by adding indistinguishable documents to the training set. However, classification rules are only slightly adjusted, since these documents usually lie in the negative zone (divided by the separating hyperplane). Thus, prediction models cannot be significantly improved by LID. The experiment in Table 2 shows that LID yields only a small improvement in classification accuracy over "original".

LMD updates prediction models by adding new samples that are improperly classified. In Table 2, the classification accuracy of "original" is improved by LMD from Round 2, since selecting misclassified documents immediately corrects the prediction models.

LDMD adds only distinguishable, misclassified documents to the training sets. Consequently, classification models with significant errors are properly corrected. However, some errors remain in the prediction models under LDMD, since indistinguishable documents are not utilized for adjustment. In Fig. 5, more documents that are both misclassified and indistinguishable (LMD − LDMD) are identified from Round 2, and the accuracy gap between LMD and LDMD widens from Round 3. It is observed that LDMD is not a proper learning strategy even though it selects fewer new training documents than LMD. The experiments also demonstrate that selecting misclassified, indistinguishable documents is helpful for adjusting prediction models.

LMID corrects prediction models by adding both misclassified documents and indistinguishable documents to the training set. It is worth noting in Table 2 that the classification performance of LMD alone is not accurate enough; LMID attains equal or higher prediction accuracy than LMD, since indistinguishable documents can be classified more appropriately. Certainly, LMID selects slightly more new training documents than LMD, as shown in Fig. 6. This small difference in training-set size is necessary for accurate classification of indistinguishable documents, but its cost in the subsequent SVM training procedure is negligible. In summary, our experiments show that adding indistinguishable documents to training sets helps strengthen the accuracy of classification models.

Experimental results in Table 2 also show that "learning all" has the best classification accuracy. However, the time needed to train SVM classifiers increases with the size of the training set. Experiments in Fig. 6 show that the curve of "learning all" climbs very fast, since it lacks a mechanism for selecting training sets. Importantly, the size of the training sets is greatly reduced by our learning strategies. Hence, "learning all" is not a suitable learning strategy in our experimental environment.

5. Conclusion

Chinese official document classification is addressed by the proposed multi-class SVM classification method. The traditional solution is to add misclassified documents to the training set in order to adjust classification rules. In this paper, selecting indistinguishable documents for training sets is shown to be useful for strengthening classification accuracy, since current models predict their labels with low confidence. After they are added to the training sets, incoming documents that are similar to them can be classified more appropriately. Specifically, a general method, IPMD, is proposed that utilizes decision values in SVM to distinguish verified documents. Experiments report that indistinguishable documents are misclassified with high probability. Furthermore, four learning strategies are proposed to enhance classification accuracy. Experimental results show that LMID classifies documents more accurately and reduces the size of training sets dramatically. Hence, our results show that adding both the misclassified and the indistinguishable documents to the training set is the best learning strategy in OAA SVM classification for large sets of Chinese official documents. Our future direction is to analyze interesting properties of indistinguishable documents, especially in multi-class classification, so as to identify more of the possibly misclassified documents and improve classification accuracy. Moreover, IPMD could be used to distinguish testing documents before classification, so that other prediction methods may be adopted to improve accuracy by processing indistinguishable documents in different ways based on their contents.
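As a compact restatement, each of the four strategies compared above is a selection rule over a batch of verified documents. In this illustrative sketch (not the paper's code), misclassified and indistinguishable are assumed to be boolean masks obtained from label verification and an IPMD-style confidence test.

```python
# Sketch: the four learning strategies as selection rules over
# a batch of verified documents.
import numpy as np

def select(strategy, misclassified, indistinguishable):
    """Return indices of verified documents to add to the training set."""
    if strategy == "LID":      # indistinguishable documents only
        keep = indistinguishable
    elif strategy == "LMD":    # misclassified documents only
        keep = misclassified
    elif strategy == "LDMD":   # distinguishable and misclassified
        keep = misclassified & ~indistinguishable
    elif strategy == "LMID":   # misclassified or indistinguishable
        keep = misclassified | indistinguishable
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.flatnonzero(keep)
```

Note that LMID always selects a superset of LMD's documents, which matches the small gap between their training-set curves in Fig. 6.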
