Expert Systems with Applications 33 (2007) 215–220
www.elsevier.com/locate/eswa
Large margin DragPushing strategy for centroid text categorization

Songbo Tan *
Software Department, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing, 100080, PR China
Graduate School of the Chinese Academy of Sciences, PR China
Abstract

Among conventional methods for text categorization, the centroid classifier is a simple and efficient one. However, it often suffers from inductive bias (or model misfit) incurred by its assumption. DragPushing is a very simple and yet efficient method to address this so-called inductive bias problem. However, DragPushing employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability. In this paper, we propose a generalized DragPushing strategy for the centroid classifier, which we call "Large Margin DragPushing" (LMDP). Experiments conducted on three benchmark evaluation collections show that LMDP achieved about one percent improvement over the performance of DragPushing and delivered performance nearly as good as state-of-the-art SVM without incurring significant computational costs.

© 2006 Published by Elsevier Ltd.

Keywords: Text classification; Information retrieval; Machine learning
1. Introduction

With the exponential growth of text data on the Internet, text categorization has received more and more attention in the information retrieval, machine learning and natural language processing communities. Numerous machine learning algorithms have been introduced to deal with text classification, such as K-Nearest Neighbor (KNN) (Yang & Lin, 1999), the centroid classifier (Han & Karypis, 2000), Naive Bayes (Lewis, 1998), support vector machines (SVM) (Joachims, 1998), Rocchio (Joachims, 1997), decision trees (Apte, Damerau, & Weiss, 1998), Winnow (Zhang, 2001), Perceptron (Zhang, 2001) and voting (Schapire & Singer, 2000).

Among all these methods, the centroid classifier is a simple and effective method for text categorization. However, it often suffers from inductive bias or model misfit (Liu, Yang, & Carbonell, 2002; Wu, Phang, Liu, & Li, 2002) incurred by its assumption. As we all know, the centroid classifier makes a simple assumption that a given document
* Tel.: +86 10 88449181 713; fax: +86 10 88455011. E-mail address: [email protected]
0957-4174/$ – see front matter © 2006 Published by Elsevier Ltd.
doi:10.1016/j.eswa.2006.04.008
should be assigned to a particular class if the similarity of this document to the centroid of the class is the largest. However, this assumption is often violated (misfit) when a document from class A shares more similarity with the centroid of class B than with that of class A. Consequently, the more serious the model misfit, the poorer the classification performance.

In order to address this issue effectively, numerous researchers have devoted their energy to refining the classifier model with efficient strategies. One popular strategy is voting (Schapire & Singer, 2000). Wu et al. (2002) presented another novel approach to cope with this problem. However, each of these methods suffers from its own shortcomings. The computational cost of voting is quite significant, which often severely limits its application in the text classification community; meanwhile, Wu's strategy may be sensitive to the imbalance of text corpora because it needs to split the training set.

In order to improve the performance of the centroid classifier, an effective and yet efficient strategy, i.e., "DragPushing", was introduced to refine the centroid classifier model (Tan, Cheng, Ghanem, Wang, & Xu, 2005). The main idea behind this strategy is that it takes advantage
of training errors to successively refine the classification model. For example, if one training example d labeled as class A is misclassified into class B, DragPushing "drags" the centroid of class A toward example d, and "pushes" the centroid of class B away from example d. After this operation, document d is more likely to be correctly classified by the refined classifier. However, DragPushing employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability.

In this paper, we further generalize DragPushing to Large Margin DragPushing (LMDP), which refines the centroid classifier model not only using training errors but also utilizing training margins. The goal of LMDP is to reduce the training errors as well as to enhance the generalization capability. Extensive experiments conducted on three benchmark document corpora show that LMDP always performs better than DragPushing. Like DragPushing, LMDP is also much faster than many state-of-the-art approaches, e.g., SVM, while delivering performance nearly as good as SVM.

The rest of this paper is organized as follows: the next section describes the basic centroid classifier. The DragPushing strategy is presented in Section 3. LMDP is introduced in Section 4. Experimental results are given in Section 5. Finally, Section 6 concludes this paper.

2. Centroid classifier

In our work, documents are represented using the vector space model (VSM). In this model, each document d is considered to be a vector in the term space. For the sake of brevity, we denote the summed centroid and normalized centroid of class C_i by \vec{C}_i^S and \vec{C}_i^N, respectively. First we compute the summed centroid by summing the document vectors in class C_i:

\vec{C}_i^S = \sum_{d \in C_i} \vec{d}    (1)
Next we evaluate the normalized centroid by the following formula:

\vec{C}_i^N = \frac{\vec{C}_i^S}{\| \vec{C}_i^S \|_2}    (2)
where \|z\|_2 denotes the 2-norm of vector z. Afterwards we calculate the similarity between a document d and each centroid C_i using the inner-product measure:

sim(d, C_i) = \vec{d} \cdot \vec{C}_i^N    (3)

It is worth mentioning that the inner-product measure in formula (3) is equivalent to the cosine measure, since the document vector \vec{d} is computed using the TFIDF formula (19), which amounts to computing TFIDF by the following formula (4) and then normalizing by the 2-norm:
w_{tfidf}(t, \vec{d}) = tf(t, \vec{d}) \cdot \log(D / n_t + 0.01)    (4)
Finally, based on these similarities, we assign d the class label corresponding to the most similar centroid:

C = \arg\max_{C_i} \vec{d} \cdot \vec{C}_i^N    (5)
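As a concrete illustration (a minimal sketch, not the paper's implementation), the following Python code implements formulas (1)–(3) and (5), assuming each document is represented as a sparse dictionary mapping terms to TFIDF weights:

```python
import math
from collections import defaultdict

def summed_centroid(docs):
    """Formula (1): sum the document vectors of one class."""
    c = defaultdict(float)
    for d in docs:
        for term, w in d.items():
            c[term] += w
    return c

def normalize(c):
    """Formula (2): divide a vector by its 2-norm."""
    norm = math.sqrt(sum(w * w for w in c.values()))
    return {t: w / norm for t, w in c.items()} if norm > 0 else dict(c)

def similarity(d, centroid):
    """Formula (3): inner product between a document and a centroid."""
    return sum(w * centroid.get(t, 0.0) for t, w in d.items())

def classify(d, centroids):
    """Formula (5): return the label of the most similar centroid."""
    return max(centroids, key=lambda label: similarity(d, centroids[label]))
```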
3. DragPushing strategy

Since a text collection may conflict with the assumption of the centroid classifier to some degree, the classification rules, i.e., the class representatives or class centroids, induced by the centroid classifier inevitably contain biases that degrade classification accuracy on both training and test documents. An intuitive and straightforward solution to this problem is to make use of training errors to adjust the class centroids so that the biases are reduced gradually.

Initialization: to start, we load the training data and the parameters, including MaxIteration and ErrorWeight. Then for each category C_i we calculate one summed centroid \vec{C}_i^{S,0} and one normalized centroid \vec{C}_i^{N,0}, where the superscript denotes the iteration step (0 at initialization).

DragPushing: in one iteration, we categorize all training documents. If one document d labeled as class "A" is classified into class "B", DragPushing modifies the summed and normalized centroids of class "A" and class "B" by the following formulas:

C_{A,l}^{S,t+1} = C_{A,l}^{S,t} + ErrorWeight \cdot d_l,  if d_l > 0    (6)

C_{A,l}^{N,t+1} = \frac{C_{A,l}^{S,t+1}}{\| \vec{C}_A^{S,t+1} \|_2},  if C_{A,l}^{S,t+1} > 0    (7)

C_{B,l}^{S,t+1} = \left[ C_{B,l}^{S,t} - ErrorWeight \cdot d_l \right]_+,  if d_l > 0    (8)

C_{B,l}^{N,t+1} = \frac{C_{B,l}^{S,t+1}}{\| \vec{C}_B^{S,t+1} \|_2},  if C_{B,l}^{S,t+1} > 0    (9)
where t denotes the current iteration step, l indexes the features of the document and centroid vectors, and [z]_+ denotes the hinge function, which equals z if z > 0 and zero otherwise, i.e., [z]_+ = max{z, 0}. The reason for introducing [z]_+ is that, in our experience, nonnegative centroids perform better than real-valued centroids. We call formulas (6) and (7) the "drag" formulas and formulas (8) and (9) the "push" formulas. After executing the "drag" and "push" formulas, the centroid of class "A" shares more similarity with document d than that of class "B": the similarity between document d and the centroid of class "A" is enlarged, while the similarity between document d and the centroid of class "B" is reduced.

Time requirements: assume that there are D training documents, T test documents, W words in total, K classes and M iteration steps. The time complexity of calculating one summed centroid and one normalized centroid for each class, i.e., training a centroid classifier, is O(DW + KW); since D > K, this is O(DW). For one iteration of the DragPushing stage, we categorize D training documents, and for each misclassified document we update four class centroids, so the running time is O(D(KW + 4W)), i.e., O(DKW) for K > 3. Consequently the training of DragPushing can be done in O(DW + MDKW), i.e., O(MDKW). As a result, the training time of DragPushing scales linearly with the number of training documents (D). Since the refined classifier still consists of K centroids, its prediction time is the same as that of the centroid classifier, i.e., O(TKW). Accordingly, DragPushing is still a linear classifier.
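To make the update concrete, here is a minimal Python sketch of one DragPushing iteration implementing formulas (6)–(9), reusing the helper functions from the centroid-classifier sketch above (an illustration, not the author's C++ code):

```python
def drag_push_iteration(train, summed, error_weight=1.0):
    """One DragPushing pass. train: list of (doc, true_label) pairs;
    summed: dict mapping label -> summed centroid (a defaultdict(float))."""
    normalized = {lab: normalize(c) for lab, c in summed.items()}
    for d, label_a in train:
        label_b = classify(d, normalized)
        if label_b != label_a:
            for t, w in d.items():
                if w > 0:
                    # "drag" formula (6): pull the true class centroid toward d
                    summed[label_a][t] += error_weight * w
                    # "push" formula (8): push the predicted class centroid
                    # away from d, clipped at zero by the hinge [z]+
                    summed[label_b][t] = max(summed[label_b][t] - error_weight * w, 0.0)
            # formulas (7) and (9): renormalize the two modified centroids
            normalized[label_a] = normalize(summed[label_a])
            normalized[label_b] = normalize(summed[label_b])
    return normalized
```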
4. Large margin DragPushing strategy

4.1. Motivation

Margin is introduced to measure how far the positive and negative training examples are separated by the decision surface. Based on the margin concept, Vapnik (1995) defined the objective function of SVM as both minimization of training-set error and maximization of margin width. As a result, SVM has turned out to be a top performer in text categorization, often delivering the best performance in benchmark evaluations (Yang & Lin, 1999). DragPushing, however, employs only one criterion, i.e., training-set error, as its objective function, which cannot guarantee generalization capability. Consequently, with the aim of boosting the performance of DragPushing, a natural and straightforward scheme is to introduce the margin concept (Fig. 1).

Since DragPushing operates in an online update mode, we cannot directly take advantage of the margin definition in SVM. To overcome this problem, we introduce an approximate margin, i.e., the hypothesis-margin, for each training example. Hypothesis-margin requires the existence of a distance measure on the hypothesis class: the margin of a hypothesis with respect to an instance is the distance between the hypothesis and the closest hypothesis that assigns an alternative label to the given instance. For example, the 1-NN classifier computes the hypothesis-margin (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002) of instance x_p with respect to a set of points P by the following definition:

HM_P(x_p) = \frac{1}{2} \left( \| x_p - x_R \| - \| x_p - x_M \| \right)    (10)

where x_R and x_M denote the nearest point to x_p in P with the same and different label, respectively.

Similar to the above definition of hypothesis-margin, we formulate the margin of an instance x_i with respect to a centroid classifier C as follows:

HM_C(x_i) = Sim(x_i, C_R) - Sim(x_i, C_M)    (11)

where C_R and C_M denote the most similar centroid to x_i with the same and different label, respectively.
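Under the sparse-dictionary representation used in the earlier sketches, formula (11) can be computed as follows (illustrative; since there is one centroid per class, C_R is simply the centroid of the instance's own class):

```python
def hypothesis_margin(d, label, centroids):
    """Formula (11): similarity to the same-label centroid minus the
    similarity to the best-scoring different-label centroid."""
    sim_r = similarity(d, centroids[label])
    sim_m = max(similarity(d, centroids[lab])
                for lab in centroids if lab != label)
    return sim_r - sim_m
```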
Load training data and parameters;
Calculate C_i^S and C_i^N for each class C_i;
For iter = 1 to MaxIteration Do
    For each document d in training set Do
        Classify d labeled "A1" into class "A2";
        If (A1 != A2) Do
            Drag normalized centroid of class A1 to d;
            Push normalized centroid of class A2 against d;
        End If
    End For
End For

Fig. 1. The outline of the DragPushing strategy.

4.2. The LMDP algorithm

Corresponding to the definition of training error, we introduce a generalized error, "MarginError", for each example. If the hypothesis-margin of an example d is bigger than zero but smaller than a small positive constant, such as 0.05, we say the centroid classifier makes a "MarginError" with respect to the example d. For convenience, we call this small positive constant the "MinMargin", and we say the example d is MinMargined by the centroid classifier. In order to apply the "drag" and "push" formulas to MinMargined examples, we introduce the parameter MarginWeight, which revises formulas (6) and (8) as follows:

C_{A,l}^{S,t+1} = C_{A,l}^{S,t} + MarginWeight \cdot d_l,  if d_l > 0    (12)

C_{B,l}^{S,t+1} = \left[ C_{B,l}^{S,t} - MarginWeight \cdot d_l \right]_+,  if d_l > 0    (13)
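Combining the margin test with the two update weights, the per-document decision of LMDP can be sketched as below, ahead of the formal outline in Fig. 2 (again illustrative Python building on the earlier sketches; MinMargin is set to 0.05 as suggested above):

```python
def lmdp_update(d, label, summed, normalized,
                error_weight=1.0, margin_weight=0.2, min_margin=0.05):
    """Apply the LMDP rule to one document: full-strength update for a
    misclassified document, milder update for a MinMargined one."""
    hm = hypothesis_margin(d, label, normalized)
    if hm >= min_margin:
        return  # correctly classified with a comfortable margin: no update
    weight = error_weight if hm < 0 else margin_weight  # (6)/(8) vs (12)/(13)
    # the competing centroid C_M: the best-scoring wrong-label centroid
    wrong = max((lab for lab in normalized if lab != label),
                key=lambda lab: similarity(d, normalized[lab]))
    for t, w in d.items():
        if w > 0:
            summed[label][t] += weight * w                              # drag
            summed[wrong][t] = max(summed[wrong][t] - weight * w, 0.0)  # push
    normalized[label] = normalize(summed[label])  # renormalize, formula (7)
    normalized[wrong] = normalize(summed[wrong])  # renormalize, formula (9)
```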
Load training data and parameters;
Calculate C_i^S and C_i^N for each class C_i;
For iter = 1 to MaxIteration Do
    For each document d_i in training set Do
        Select C_R and C_M for document d_i;
        If (HM(d_i) < 0) Do
            Apply "drag" formulas (6) and (7);
            Apply "push" formulas (8) and (9);
        Else If (HM(d_i) < MinMargin) Do
            Apply "drag" formulas (12) and (7);
            Apply "push" formulas (13) and (9);
        End If
    End For
End For

Fig. 2. The outline of LMDP.
In one iteration, if example d_i is misclassified, LMDP updates the two centroids using formulas (6)–(9); if example d_i is MinMargined, LMDP refines the two centroids using formulas (12), (7), (13) and (9). The outline of LMDP is shown in Fig. 2.

Since LMDP conducts the "drag" and "push" operations not only for misclassified examples but also for MinMargined examples, the running time of one iteration is O(D(KW + 4W)), i.e., O(DKW) for K > 3, which equals that of DragPushing. As a result, the training time of LMDP is the same as that of DragPushing, and LMDP is still a linear classifier.

5. Experiment results

5.1. The datasets

In our experiments, we use three corpora: OHSUMED¹, Reuter-21578² and Sector-48³.

5.1.1. OHSUMED

The OHSUMED dataset is a bibliographical document collection developed by William Hersh and colleagues at the Oregon Health Sciences University; it is a subset of the MEDLINE database. We use a subset (called ohscal⁴ in Han & Karypis (2000)) of OHSUMED that contains 11,162 documents in 10 categories: antibodies, carcinoma, DNA, in vitro, molecular-sequence-data, pregnancy, prognosis, receptors, risk-factors and tomography.
¹ Downloadable at ftp://medir.ohsu.edu/pub/OHSUMED/.
² Downloadable at http://www.research.att.com/~lewis/reuters21578.html.
³ Downloadable at http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.
⁴ Downloadable at http://www.cs.umn.edu/~han/data/tmdata.tar.gz.
5.1.2. Reuter-21578

The Reuter-21578 text categorization test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. We use a subset consisting of 92 categories and 10,346 documents in total.

5.1.3. Sector-48

The industry sector dataset is based on data made available by Market Guide, Inc. (www.marketguide.com). It consists of company homepages categorized in a hierarchy of industry sectors, but we disregarded the hierarchy. There were 9,637 documents in the dataset, divided into 105 classes. We use a subset, called Sector-48, consisting of 48 categories and 4,581 documents in all.

5.2. The performance measure

To evaluate a text classification system, we use the F1 measure introduced by van Rijsbergen (1979). This measure combines recall and precision in the following way:

Recall = number of correct positive predictions / number of positive examples    (14)

Precision = number of correct positive predictions / number of positive predictions    (15)

F1 = 2 \cdot Recall \cdot Precision / (Recall + Precision)    (16)
For ease of comparison, we summarize the F1 scores over the different categories using the Micro- and Macro-averages of F1 scores (Lewis, Schapire, Callan, & Papka, 1996):
MicroF1 = F1 over categories and documents    (17)

MacroF1 = average of within-category F1 values    (18)
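As a small worked sketch (not from the paper), formulas (14)–(18) can be computed from per-category contingency counts as follows:

```python
def f1(tp, pp, ap):
    """Formulas (14)-(16) from true positives (tp), predicted
    positives (pp) and actual positives (ap) of one category."""
    recall = tp / ap if ap else 0.0
    precision = tp / pp if pp else 0.0
    denom = recall + precision
    return 2 * recall * precision / denom if denom else 0.0

def micro_macro_f1(counts):
    """Formulas (17)-(18). counts: one (tp, pp, ap) tuple per category."""
    tp = sum(c[0] for c in counts)
    pp = sum(c[1] for c in counts)
    ap = sum(c[2] for c in counts)
    micro = f1(tp, pp, ap)                             # pooled counts
    macro = sum(f1(*c) for c in counts) / len(counts)  # per-category average
    return micro, macro
```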
The MicroF1 and MacroF1 emphasize the performance of the system on common and rare categories, respectively. Using these averages, we can observe the effect of different kinds of data on a text classification system.

Table 2
The best MacroF1 of different methods on three corpora

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.7916   0.7799        0.7580     0.7433
Reuter-21578    0.6113   0.5934        0.5617     0.6013
Sector-48       0.8969   0.8836        0.8152     0.9049
5.3. Experimental design
We evenly split each dataset into three parts, using two parts for training and the remaining third for testing. We perform the training-test procedure three times and use the average of the three performances as the final result, i.e., three-fold cross validation. In all experiments we employ Information Gain as the feature selection method, since it consistently performs well in most cases (Yang & Pedersen, 1997). Algorithms are coded in C++ and run on a Pentium-4 machine with a 3.0 GHz CPU and 512 MB of memory.

We employ TFIDF rather than binary word occurrences as input features. The formula for calculating the normalized TFIDF weight is:

w_{tfidf}^N(t, \vec{d}) = \frac{tf(t, \vec{d}) \cdot \log(D / n_t + 0.01)}{\sqrt{\sum_{t \in \vec{d}} \left[ tf(t, \vec{d}) \cdot \log(D / n_t + 0.01) \right]^2}}    (19)

where D is the total number of training documents and n_t is the number of documents containing the word t; tf(t, \vec{d}) indicates the number of occurrences of word t in document \vec{d}.
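A short sketch of formula (19), assuming term frequencies and document frequencies are available as dictionaries (illustrative, not the paper's code):

```python
import math

def tfidf_vector(tf, df, n_docs):
    """Formula (19): TFIDF weights normalized by their 2-norm.
    tf: term -> raw count in one document; df: term -> n_t; n_docs: D."""
    raw = {t: c * math.log(n_docs / df[t] + 0.01) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw
```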
Table 3
Training time in seconds

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         29.766   4.583         0.328      223.302
Reuter-21578    22.901   16.682        0.505      124.698
Sector-48       9.854    9.942         0.683      103.219
For experiments involving SVM we employed SVMTorch (www.idiap.ch/~bengio/projects/SVMTorch.html), which uses a one-versus-the-rest decomposition and can directly deal with multi-class classification problems. We employed a linear kernel, which has been found to be as competitive as other kernels on Reuter-21578 (Yang & Lin, 1999). All parameters were left at their default values.
Table 4
Testing time in seconds

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.068    0.062         0.068      79.245
Reuter-21578    0.906    0.921         0.932      68.057
Sector-48       0.521    0.500         0.521      65.078
5.4. Comparison and analysis

Now we present and discuss the experimental results. Here we compare LMDP against DragPushing, the centroid classifier and SVM on the three text corpora.

Tables 1 and 2 show the best-performance comparison in MicroF1 and MacroF1. Note that for DragPushing, MaxIteration is set to 8 and ErrorWeight is fixed at 1.0; for LMDP, MaxIteration is set to 8 and MarginWeight is fixed at 0.2. LMDP outperforms all the other three methods on OHSUMED, while SVM performs best on Sector-48. On all three corpora, the MicroF1 of LMDP beats DragPushing by approximately 1%. We therefore conclude that the hypothesis-margin can boost the performance of DragPushing.

Tables 3 and 4 report the training time and test time of the four methods on the three text collections. Note that the running time does not include the time for loading data from disk, and the feature number is set to 10,000; for DragPushing, MaxIteration is set to 8 and ErrorWeight is fixed at 1.0; for LMDP, MaxIteration is set to 8 and MarginWeight is fixed at 0.2.
Table 1
The best MicroF1 of different methods on three corpora

                LMDP     DragPushing   Centroid   SVMTorch
OHSUMED         0.8031   0.7912        0.7676     0.7565
Reuter-21578    0.8691   0.8619        0.7818     0.8706
Sector-48       0.8950   0.8812        0.8055     0.9026
Fig. 3. The error rate vs. MaxIteration on Sector-48.
For the training phase, the CPU time required by SVM is about 10 times that of LMDP on Sector-48. The prediction speed of LMDP is as fast as that of the centroid classifier and about 100 times faster than that of SVM. On Sector-48 and Reuter-21578, the training time of LMDP is nearly the same as that of DragPushing.

Fig. 3 illustrates the error rate vs. MaxIteration on Sector-48 (feature number 10,000; for DragPushing, ErrorWeight is fixed at 1.0; for LMDP, MarginWeight is fixed at 0.2). DragPushing yields nearly the same training-error curve as LMDP; however, LMDP additionally decreases the margin error as MaxIteration grows. That is to say, LMDP decreases the training error as well as enlarging the training margin.

6. Concluding remarks

In this paper we propose a generalized DragPushing strategy for the centroid classifier, i.e., Large Margin DragPushing (LMDP). LMDP refines the centroid classifier model not only using training errors but also employing training margins. Extensive experiments conducted on three corpora showed that LMDP achieved nearly one percent improvement over the performance of DragPushing and delivered performance nearly as good as state-of-the-art SVM without incurring significant computational costs.
References

Apte, C., Damerau, F., & Weiss, S. (1998). Text mining with decision rules and decision trees. In Proceedings of the workshop with conference on automated learning and discovery: Learning from text and the Web.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, N. (2002). Margin analysis of the LVQ algorithm. In NIPS.
Han, E., & Karypis, G. (2000). Centroid-based document classification: Analysis and experimental results. In PKDD.
Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In ICML (pp. 143–151).
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In ECML (pp. 137–142).
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML (pp. 4–15).
Lewis, D. D., Schapire, R. E., Callan, J. P., & Papka, R. (1996). Training algorithms for linear text classifiers. In SIGIR (pp. 298–306).
Liu, Y., Yang, Y., & Carbonell, J. (2002). Boosting to correct inductive bias in text classification. In CIKM (pp. 348–355).
Schapire, R. E., & Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39, 135–168.
Tan, S., Cheng, X., Ghanem, M. M., Wang, B., & Xu, H. (2005). A novel refinement approach for text categorization. In CIKM (pp. 469–476).
van Rijsbergen, C. (1979). Information retrieval. London: Butterworths.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Wu, H., Phang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In SIGKDD (pp. 207–216).
Yang, Y., & Lin, X. (1999). A re-examination of text categorization methods. In SIGIR (pp. 42–49).
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In ICML (pp. 412–420).
Zhang, T. (2001). Regularized Winnow methods. Advances in Neural Information Processing Systems, 13, 703–709.