Expert Systems with Applications 33 (2007) 215–220
www.elsevier.com/locate/eswa
Large margin DragPushing strategy for centroid text categorization

Songbo Tan *
Software Department, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing, 100080, PR China
Graduate School of the Chinese Academy of Sciences, PR China
Abstract

Among conventional methods for text categorization, the centroid classifier is a simple and efficient one. However, it often suffers from inductive bias (or model misfit) incurred by its assumption. DragPushing is a very simple and yet efficient method to address this so-called inductive bias problem. However, DragPushing employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability. In this paper, we propose a generalized DragPushing strategy for the centroid classifier, which we call "Large Margin DragPushing" (LMDP). Experiments conducted on three benchmark evaluation collections show that LMDP achieved about one percent improvement over the performance of DragPushing and delivered performance nearly as good as state-of-the-art SVM without incurring significant computational costs.

© 2006 Published by Elsevier Ltd.

Keywords: Text classification; Information retrieval; Machine learning
1. Introduction

With the exponential growth of text data on the Internet, text categorization has received more and more attention in the information retrieval, machine learning and natural language processing communities. Numerous machine learning algorithms have been introduced to deal with text classification, such as K-Nearest Neighbor (KNN) (Yang & Lin, 1999), the centroid classifier (Han & Karypis, 2000), Naive Bayes (Lewis, 1998), support vector machines (SVM) (Joachims, 1998), Rocchio (Joachims, 1997), decision trees (Apte, Damerau, & Weiss, 1998), Winnow (Zhang, 2001), Perceptron (Zhang, 2001) and voting (Schapire & Singer, 2000).

Among all these methods, the centroid classifier is a simple and effective method for text categorization. However, it often suffers from inductive bias or model misfit (Liu, Yang, & Carbonell, 2002; Wu, Phang, Liu, & Li, 2002) incurred by its assumption. As we all know, the centroid classifier makes a simple assumption that a given document
* Tel.: +86 10 88449181 713; fax: +86 10 88455011. E-mail address: [email protected]
0957-4174/$ – see front matter © 2006 Published by Elsevier Ltd.
doi:10.1016/j.eswa.2006.04.008
should be assigned to a particular class if the similarity of this document to the centroid of the class is the largest. However, this assumption is often violated (misfit) when a document from class A shares more similarity with the centroid of class B than with that of class A. Consequently, the more serious the model misfit, the poorer the classification performance.

In order to address this issue effectively, numerous researchers have devoted their energy to refining the classifier model with efficient strategies. One popular strategy is voting (Schapire & Singer, 2000). Wu et al. (2002) presented another novel approach to cope with this problem. However, each of these methods suffers from its own shortcomings. The computational cost of voting is quite significant, which often severely limits its application in the text classification community; meanwhile, Wu's strategy may be sensitive to the imbalance of text corpora because it needs to split the training set.

In order to improve the performance of the centroid classifier, an effective and yet efficient strategy, i.e., "DragPushing", was introduced to refine the centroid classifier model (Tan, Cheng, Ghanem, Wang, & Xu, 2005). The main idea behind this strategy is that it takes advantage
of training errors to successively refine the classification model. For example, if one training example d labeled as class A is misclassified into class B, DragPushing "drags" the centroid of class A toward example d, and "pushes" the centroid of class B away from example d. After this operation, document d is more likely to be correctly classified by the refined classifier. However, DragPushing employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability.

In this paper, we further generalize DragPushing to Large Margin DragPushing (LMDP), which refines the centroid classifier model not only using training errors but also utilizing training margins. The goal of LMDP is to reduce the training errors as well as to enhance the generalization capability. Extensive experiments conducted on three benchmark document corpora show that LMDP always performs better than DragPushing. Like DragPushing, LMDP is also much faster than many state-of-the-art approaches, e.g., SVM, while delivering performance nearly as good as SVM.

The rest of this paper is organized as follows: the next section describes the basic centroid classifier. The DragPushing strategy is presented in Section 3. LMDP is introduced in Section 4. Experimental results are given in Section 5. Finally, Section 6 concludes this paper.

2. Centroid classifier

In our work, documents are represented using the vector space model (VSM). In this model, each document d is considered to be a vector in the term space. For the sake of brevity, we denote the summed centroid and normalized centroid of class C_i by \vec{C}_i^S and \vec{C}_i^N, respectively. First we compute the summed centroid by summing the document vectors in class C_i:

\vec{C}_i^S = \sum_{d \in C_i} \vec{d}    (1)
Next we evaluate the normalized centroid by the following formula:

\vec{C}_i^N = \frac{\vec{C}_i^S}{\| \vec{C}_i^S \|_2}    (2)
where \|z\|_2 denotes the 2-norm of vector z. Afterwards we calculate the similarity between a document d and each centroid C_i using the inner-product measure:

sim(d, C_i) = \vec{d} \cdot \vec{C}_i^N    (3)

It is worth mentioning that the inner-product measure in formula (3) is equivalent to the cosine measure, since the document vector \vec{d} is computed using the TFIDF formula (19), which amounts to computing TFIDF by the following formula (4) and then normalizing by the 2-norm:
w_{tfidf}(t, \vec{d}) = tf(t, \vec{d}) \cdot \log(D / n_t + 0.01)    (4)
Finally, based on these similarities, we assign d the class label corresponding to the most similar centroid:

C = \arg\max_{C_i} \vec{d} \cdot \vec{C}_i^N    (5)
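As a concrete illustration (a minimal sketch, not the paper's implementation), the following Python code implements formulas (1)–(3) and (5), assuming each document is represented as a sparse dictionary mapping terms to TFIDF weights:

```python
import math
from collections import defaultdict

def summed_centroid(docs):
    """Formula (1): sum the document vectors of one class."""
    c = defaultdict(float)
    for d in docs:
        for term, w in d.items():
            c[term] += w
    return c

def normalize(c):
    """Formula (2): divide a vector by its 2-norm."""
    norm = math.sqrt(sum(w * w for w in c.values()))
    return {t: w / norm for t, w in c.items()} if norm > 0 else dict(c)

def similarity(d, centroid):
    """Formula (3): inner product between a document and a centroid."""
    return sum(w * centroid.get(t, 0.0) for t, w in d.items())

def classify(d, centroids):
    """Formula (5): return the label of the most similar centroid."""
    return max(centroids, key=lambda label: similarity(d, centroids[label]))
```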
3. DragPushing strategy

Since a text collection may conflict with the assumption of the centroid classifier to some degree, the classification rules, i.e., the class representatives or class centroids, induced by the centroid classifier inevitably contain biases that degrade classification accuracy on both training and test documents. An intuitive and straightforward solution to this problem is to make use of training errors to adjust the class centroids so that the biases are reduced gradually.

Initialization: to start, we load the training data and the parameters, including MaxIteration and ErrorWeight. Then for each category C_i we calculate one summed centroid \vec{C}_i^{S,0} and one normalized centroid \vec{C}_i^{N,0}, where the superscript denotes the iteration step (0 at initialization).

DragPushing: in one iteration, we categorize all training documents. If one document d labeled as class "A" is classified into class "B", DragPushing modifies the summed and normalized centroids of class "A" and class "B" by the following formulas:

C_{A,l}^{S,t+1} = C_{A,l}^{S,t} + ErrorWeight \cdot d_l,  if d_l > 0    (6)

C_{A,l}^{N,t+1} = \frac{C_{A,l}^{S,t+1}}{\| \vec{C}_A^{S,t+1} \|_2},  if C_{A,l}^{S,t+1} > 0    (7)

C_{B,l}^{S,t+1} = \left[ C_{B,l}^{S,t} - ErrorWeight \cdot d_l \right]_+,  if d_l > 0    (8)

C_{B,l}^{N,t+1} = \frac{C_{B,l}^{S,t+1}}{\| \vec{C}_B^{S,t+1} \|_2},  if C_{B,l}^{S,t+1} > 0    (9)
where t denotes the current iteration step, l indexes the features of the document and centroid vectors, and [z]_+ denotes the hinge function, which equals z if z > 0 and zero otherwise, i.e., [z]_+ = max{z, 0}. The reason for introducing [z]_+ is that, in our experience, nonnegative centroids perform better than real-valued centroids. We call formulas (6) and (7) the "drag" formulas and formulas (8) and (9) the "push" formulas. After executing the "drag" and "push" formulas, the centroid of class "A" shares more similarity with document d than that of class "B": the similarity between document d and the centroid of class "A" is enlarged, while the similarity between document d and the centroid of class "B" is reduced.

Time requirements: assume that there are D training documents, T test documents, W words in total, K classes and M iteration steps. The time complexity of calculating one summed centroid and one normalized centroid for each class, i.e., training a centroid classifier, is O(DW + KW); since D > K, this is O(DW). For one iteration of the DragPushing stage, we categorize D training documents, and for each misclassified document we update four class centroids, so the running time is O(D(KW + 4W)), i.e., O(DKW) for K > 3. Consequently the training of DragPushing can be done in O(DW + MDKW), i.e., O(MDKW). As a result, the training time of DragPushing scales linearly with the number of training documents (D). Since the refined classifier still consists of K centroids, its prediction time is the same as that of the centroid classifier, i.e., O(TKW). Accordingly, DragPushing is still a linear classifier.
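To make the update concrete, here is a minimal Python sketch of one DragPushing iteration implementing formulas (6)–(9), reusing the helper functions from the centroid-classifier sketch above (an illustration, not the author's C++ code):

```python
def drag_push_iteration(train, summed, error_weight=1.0):
    """One DragPushing pass. train: list of (doc, true_label) pairs;
    summed: dict mapping label -> summed centroid (a defaultdict(float))."""
    normalized = {lab: normalize(c) for lab, c in summed.items()}
    for d, label_a in train:
        label_b = classify(d, normalized)
        if label_b != label_a:
            for t, w in d.items():
                if w > 0:
                    # "drag" formula (6): pull the true class centroid toward d
                    summed[label_a][t] += error_weight * w
                    # "push" formula (8): push the predicted class centroid
                    # away from d, clipped at zero by the hinge [z]+
                    summed[label_b][t] = max(summed[label_b][t] - error_weight * w, 0.0)
            # formulas (7) and (9): renormalize the two modified centroids
            normalized[label_a] = normalize(summed[label_a])
            normalized[label_b] = normalize(summed[label_b])
    return normalized
```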
4. Large margin DragPushing strategy

4.1. Motivation

Margin is introduced to measure how far the positive and negative training examples are separated by the decision surface. Based on the margin concept, Vapnik (1995) defined the objective function of SVM as both minimization of training-set error and maximization of margin width. As a result, SVM has turned out to be a top performer in text categorization, often delivering the best performance in benchmark evaluations (Yang & Lin, 1999). DragPushing, however, employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability. Consequently, with the aim of boosting the performance of DragPushing, a natural and straightforward scheme is to introduce the margin concept (Fig. 1).

Since DragPushing operates in an online update mode, we cannot directly take advantage of the margin definition in SVM. To overcome this problem, we introduce an approximate margin, i.e., the hypothesis-margin, for each training example. Hypothesis-margin requires the existence of a distance measure on the hypothesis class: the margin of a hypothesis with respect to an instance is the distance between the hypothesis and the closest hypothesis that assigns an alternative label to the given instance. For example, the 1-NN classifier computes the hypothesis-margin (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002) of instance x_p with respect to a set of points P by the following definition:

HM_P(x_p) = \frac{1}{2} \left( \| x_p - x_R \| - \| x_p - x_M \| \right)    (10)

where x_R and x_M denote the nearest point to x_p in P with the same and different label, respectively.

Similar to the above definition of hypothesis-margin, we formulate the margin of an instance x_i with respect to a centroid classifier C as follows:

HM_C(x_i) = Sim(x_i, C_R) - Sim(x_i, C_M)    (11)

where C_R and C_M denote the most similar centroid to x_i with the same and different label, respectively.
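Under the sparse-dictionary representation used in the earlier sketches, formula (11) can be computed as follows (illustrative; since there is one centroid per class, C_R is simply the centroid of the instance's own class):

```python
def hypothesis_margin(d, label, centroids):
    """Formula (11): similarity to the same-label centroid minus the
    similarity to the best-scoring different-label centroid."""
    sim_r = similarity(d, centroids[label])
    sim_m = max(similarity(d, centroids[lab])
                for lab in centroids if lab != label)
    return sim_r - sim_m
```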
Load training data and parameters;
Calculate C_i^S and C_i^N for each class C_i;
For iter = 1 to MaxIteration Do
    For each document d in training set Do
        Classify d labeled "A1" into class "A2";
        If (A1 != A2) Do
            Drag normalized centroid of class A1 to d;
            Push normalized centroid of class A2 against d;
        End If
    End For
End For

Fig. 1. The outline of the DragPushing strategy.

4.2. The LMDP algorithm

Corresponding to the definition of training error, we introduce a generalized error, "MarginError", for each example. If the hypothesis-margin of an example d is bigger than zero but smaller than a small positive constant, such as 0.05, we say the centroid classifier makes a "MarginError" with respect to the example d. For convenience, we call this small positive constant the "MinMargin", and we say the example d is MinMargined by the centroid classifier. In order to apply the "drag" and "push" formulas to MinMargined examples, we introduce the parameter MarginWeight, which revises formulas (6) and (8) as follows:

C_{A,l}^{S,t+1} = C_{A,l}^{S,t} + MarginWeight \cdot d_l,  if d_l > 0    (12)

C_{B,l}^{S,t+1} = \left[ C_{B,l}^{S,t} - MarginWeight \cdot d_l \right]_+,  if d_l > 0    (13)
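Combining the margin test with the two update weights, the per-document decision of LMDP can be sketched as below, ahead of the formal outline in Fig. 2 (again illustrative Python building on the earlier sketches; MinMargin is set to 0.05 as suggested above):

```python
def lmdp_update(d, label, summed, normalized,
                error_weight=1.0, margin_weight=0.2, min_margin=0.05):
    """Apply the LMDP rule to one document: full-strength update for a
    misclassified document, milder update for a MinMargined one."""
    hm = hypothesis_margin(d, label, normalized)
    if hm >= min_margin:
        return  # correctly classified with a comfortable margin: no update
    weight = error_weight if hm < 0 else margin_weight  # (6)/(8) vs (12)/(13)
    # the competing centroid C_M: the best-scoring wrong-label centroid
    wrong = max((lab for lab in normalized if lab != label),
                key=lambda lab: similarity(d, normalized[lab]))
    for t, w in d.items():
        if w > 0:
            summed[label][t] += weight * w                              # drag
            summed[wrong][t] = max(summed[wrong][t] - weight * w, 0.0)  # push
    normalized[label] = normalize(summed[label])  # renormalize, formula (7)
    normalized[wrong] = normalize(summed[wrong])  # renormalize, formula (9)
```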
Load training data and parameters;
Calculate C_i^S and C_i^N for each class C_i;
For iter = 1 to MaxIteration Do
    For each document d_i in training set Do
        Select C_R and C_M for document d_i;
        If (HM(d_i) < 0) Do
            Apply "drag" formulas (6) and (7);
            Apply "push" formulas (8) and (9);
        Else If (HM(d_i) < MinMargin) Do
            Apply "drag" formulas (12) and (7);
            Apply "push" formulas (13) and (9);
        End If
    End For
End For

Fig. 2. The outline of LMDP.
In one iteration, if example d_i is misclassified, LMDP updates the two centroids using formulas (6)–(9); if example d_i is MinMargined, LMDP refines the two centroids using formulas (12), (7), (13) and (9). The outline of LMDP is shown in Fig. 2.

Since LMDP conducts the "drag" and "push" operations not only for misclassified examples but also for MinMargined examples, the running time of one iteration is O(D(KW + 4W)), i.e., O(DKW) for K > 3, which equals that of DragPushing. As a result, the training time of LMDP is the same as that of DragPushing, and LMDP is still a linear classifier.

5. Experiment results

5.1. The datasets

In our experiments, we use three corpora: OHSUMED¹, Reuter-21578² and Sector-48³.

5.1.1. OHSUMED

The OHSUMED dataset is a bibliographical document collection developed by William Hersh and colleagues at the Oregon Health Sciences University; it is a subset of the MEDLINE database. We use a subset (called ohscal⁴ in Han & Karypis (2000)) of OHSUMED that contains 11,162 documents in 10 categories: antibodies, carcinoma, DNA, in vitro, molecular-sequence-data, pregnancy, prognosis, receptors, risk-factors and tomography.
¹ Downloadable at ftp://medir.ohsu.edu/pub/OHSUMED/.
² Downloadable at http://www.research.att.com/~lewis/reuters21578.html.
³ Downloadable at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.
⁴ Downloadable at http://www.cs.umn.edu/~han/data/tmdata.tar.gz.
5.1.2. Reuter-21578

The Reuter-21578 text categorization test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. We use a subset consisting of 92 categories and 10,346 documents in total.

5.1.3. Sector-48

The industry sector dataset is based on data made available by Market Guide, Inc. (www.marketguide.com). It consists of company homepages categorized in a hierarchy of industry sectors, but we disregarded the hierarchy. There were 9,637 documents in the dataset, divided into 105 classes. We use a subset, called Sector-48, consisting of 48 categories and 4,581 documents in all.

5.2. The performance measure

To evaluate a text classification system, we use the F1 measure introduced by van Rijsbergen (1979). This measure combines recall and precision in the following way:

Recall = number of correct positive predictions / number of positive examples    (14)

Precision = number of correct positive predictions / number of positive predictions    (15)

F1 = 2 \cdot Recall \cdot Precision / (Recall + Precision)    (16)
For ease of comparison, we summarize the F1 scores over the different categories using the Micro- and Macro-averages of F1 scores (Lewis, Schapire, Callan, & Papka, 1996):
MicroF1 = F1 over categories and documents    (17)

MacroF1 = average of within-category F1 values    (18)
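As a small worked sketch (not from the paper), formulas (14)–(18) can be computed from per-category contingency counts as follows:

```python
def f1(tp, pp, ap):
    """Formulas (14)-(16) from true positives (tp), predicted
    positives (pp) and actual positives (ap) of one category."""
    recall = tp / ap if ap else 0.0
    precision = tp / pp if pp else 0.0
    denom = recall + precision
    return 2 * recall * precision / denom if denom else 0.0

def micro_macro_f1(counts):
    """Formulas (17)-(18). counts: one (tp, pp, ap) tuple per category."""
    tp = sum(c[0] for c in counts)
    pp = sum(c[1] for c in counts)
    ap = sum(c[2] for c in counts)
    micro = f1(tp, pp, ap)                             # pooled counts
    macro = sum(f1(*c) for c in counts) / len(counts)  # per-category average
    return micro, macro
```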
The MicroF1 and MacroF1 emphasize the performance of the system on common and rare categories, respectively. Using these averages, we can observe the effect of different kinds of data on a text classification system.

Table 2
The best MacroF1 of different methods on three corpora

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.7916   0.7799        0.7580     0.7433
Reuter-21578    0.6113   0.5934        0.5617     0.6013
Sector-48       0.8969   0.8836        0.8152     0.9049
5.3. Experimental design
We evenly split each dataset into three parts, using two parts for training and the remaining third for testing. We perform the training-test procedure three times and use the average of the three performances as the final result, i.e., three-fold cross validation. In all experiments we employ Information Gain as the feature selection method, since it consistently performs well in most cases (Yang & Pedersen, 1997). Algorithms are coded in C++ and run on a Pentium-4 machine with a 3.0 GHz CPU and 512 MB of memory.

We employ TFIDF rather than binary word occurrences as input features. The formula for calculating the normalized TFIDF weight is:

w_{tfidf}^N(t, \vec{d}) = \frac{tf(t, \vec{d}) \cdot \log(D / n_t + 0.01)}{\sqrt{\sum_{t \in \vec{d}} \left[ tf(t, \vec{d}) \cdot \log(D / n_t + 0.01) \right]^2}}    (19)

where D is the total number of training documents and n_t is the number of documents containing the word t; tf(t, \vec{d}) indicates the number of occurrences of word t in document \vec{d}.
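A short sketch of formula (19), assuming term frequencies and document frequencies are available as dictionaries (illustrative, not the paper's code):

```python
import math

def tfidf_vector(tf, df, n_docs):
    """Formula (19): TFIDF weights normalized by their 2-norm.
    tf: term -> raw count in one document; df: term -> n_t; n_docs: D."""
    raw = {t: c * math.log(n_docs / df[t] + 0.01) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw
```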
Table 3
Training time in seconds

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         29.766   4.583         0.328      223.302
Reuter-21578    22.901   16.682        0.505      124.698
Sector-48       9.854    9.942         0.683      103.219
For experiments involving SVM we employed SVMTorch (www.idiap.ch/~bengio/projects/SVMTorch.html), which uses a one-versus-the-rest decomposition and can directly deal with multi-class classification problems. We employed a linear kernel, which has been found to be as competitive as other kernels on Reuter-21578 (Yang & Lin, 1999). All parameters were left at their default values.
Table 4
Testing time in seconds

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.068    0.062         0.068      79.245
Reuter-21578    0.906    0.921         0.932      68.057
Sector-48       0.521    0.500         0.521      65.078
5.4. Comparison and analysis

Now we present and discuss the experimental results. Here we compare LMDP against DragPushing, the centroid classifier and SVM on the three text corpora.

Tables 1 and 2 show the best-performance comparison in MicroF1 and MacroF1. Note that for DragPushing, MaxIteration is set to 8 and ErrorWeight is fixed at 1.0; for LMDP, MaxIteration is set to 8 and MarginWeight is fixed at 0.2. LMDP outperforms all the other three methods on OHSUMED, while SVM performs best on Sector-48. On all three corpora, the MicroF1 of LMDP beats DragPushing by approximately 1%. We therefore conclude that the hypothesis-margin can boost the performance of DragPushing.

Tables 3 and 4 report the training time and test time of the four methods on the three text collections. Note that the running time does not include the time for loading data from disk, and the feature number is set to 10,000; for DragPushing, MaxIteration is set to 8 and ErrorWeight is fixed at 1.0; for LMDP, MaxIteration is set to 8 and MarginWeight is fixed at 0.2.
Table 1
The best MicroF1 of different methods on three corpora

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.8031   0.7912        0.7676     0.7565
Reuter-21578    0.8691   0.8619        0.7818     0.8706
Sector-48       0.8950   0.8812        0.8055     0.9026
Fig. 3. The error rate vs. MaxIteration on Sector-48.
For the training phase, the CPU time required by SVM is about 10 times that of LMDP on Sector-48. The prediction speed of LMDP is as fast as that of the centroid classifier and about 100 times faster than that of SVM. On Sector-48 and Reuter-21578, the training time of LMDP is nearly the same as that of DragPushing.

Fig. 3 illustrates the error rate vs. MaxIteration on Sector-48 (feature number 10,000; for DragPushing, ErrorWeight is fixed at 1.0; for LMDP, MarginWeight is fixed at 0.2). DragPushing yields nearly the same training-error curve as LMDP; however, LMDP additionally decreases the margin error as MaxIteration grows. That is to say, LMDP decreases the training error as well as enlarging the training margin.

6. Concluding remarks

In this paper we propose a generalized DragPushing strategy for the centroid classifier, i.e., Large Margin DragPushing (LMDP). LMDP refines the centroid classifier model not only using training errors but also employing training margins. Extensive experiments conducted on three corpora showed that LMDP achieved nearly one percent improvement over the performance of DragPushing and delivered performance nearly as good as state-of-the-art SVM without incurring significant computational costs.
References

Apte, C., Damerau, F., & Weiss, S. (1998). Text mining with decision rules and decision trees. In Proceedings of the workshop with conference on automated learning and discovery: Learning from text and the Web.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, N. (2002). Margin analysis of the LVQ algorithm. In NIPS.
Han, E., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. In PKDD.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In ICML (pp. 143–151).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In ECML (pp. 137–142).
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML (pp. 4–15).
Lewis, D. D., Schapire, R. E., Callan, J. P., & Papka, R. (1996). Training algorithms for linear text classifiers. In SIGIR (pp. 298–306).
Liu, Y., Yang, Y., & Carbonell, J. (2002). Boosting to correct inductive bias in text classification. In CIKM (pp. 348–355).
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39, 135–168.
Tan, S., Cheng, X., Ghanem, M. M., Wang, B., & Xu, H. (2005). A novel refinement approach for text categorization. In CIKM (pp. 469–476).
van Rijsbergen, C. (1979). Information retrieval. London: Butterworths.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Wu, H., Phang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In SIGKDD (pp. 207–216).
Yang, Y., & Lin, X. (1999). A re-examination of text categorization methods. In SIGIR (pp. 42–49).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML (pp. 412–420).
Zhang, T. (2001). Regularized Winnow methods. Advances in Neural Information Processing Systems, 13, 703–709.