Adapting centroid classifier for document categorization

Expert Systems with Applications 38 (2011) 10264–10273. doi:10.1016/j.eswa.2011.02.114

Songbo Tan a,*, Yuefen Wang b, Gaowei Wu a

a Key Laboratory of Network, Institute of Computing Technology, Chinese Academy of Sciences, China
b Information Center, Chinese Academy of Geological Sciences, China
* Corresponding author. E-mail address: [email protected] (S. Tan).

Keywords: Centroid classifier; Text categorization; Information retrieval; Data mining

Abstract

In the community of information retrieval, Centroid Classifier has been shown to be a simple yet effective method for text categorization. However, it is often plagued by model misfit (or inductive bias) incurred by its assumption. Various methods have been proposed to address this issue, such as Weight Adjustment, Voting, Refinement and DragPushing. However, existing methods employ only one criterion, namely training-set error. Research in machine learning indicates that a method based only on training-set error cannot guarantee the generalization capability of the base classifier on unseen examples. To overcome this problem, we propose a novel Model Adjustment algorithm that makes use of training-set errors as well as training-set margins. Furthermore, we prove that for a linearly separable problem the proposed method converges after finite updates using any learning parameter η (η > 0). The empirical assessment conducted on four benchmark collections indicates that the proposed method performs slightly better than the SVM classifier in prediction accuracy and beats it in running time.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid growth of texts on the Internet, text classification has been attracting more and more attention in the information retrieval and natural language processing communities. In most cases, the use of statistical or machine learning techniques has proven successful in this context, since it is typically more feasible to induce categorization rules from example documents than to extract such rules from domain experts. Numerous machine learning approaches have been introduced to deal with text classification, including Centroid Classifier (Lertnattee & Theeramunkong, 2002; Shankar & Karypis, 2000; Shin, Abraham, & Han, 2006; Tan & Cheng, 2007a, 2007b; Tan, Wu, Tang, & Cheng, 2007), K-Nearest Neighbor (KNN) (Ishii, Murai, Yamada, & Bao, 2006; Li & Hu, 2003; Tan, 2005, 2006; Yuan, Yang, & Yu, 2005), Naive Bayes (Liu, Sun, & Song, 2006; Lu, Hu, Wu, Lu, & Zhou, 2002; Tan, Cheng, Wang, & Xu, 2009; Wang & Zhang, 2005), Winnow or Perceptron (van Mun, 1999), Rocchio (Tsay & Wang, 2004), Voting (Aas & Eikvil, 1999) and Support Vector Machines (SVM) (Wang, Sun, Zhang, & Li, 2006; Zhang, Su, & Xu, 2006).

Despite its simplicity and straightforwardness, Centroid Classifier has proved to be an efficient and robust method for text categorization. Its basic idea is to construct a prototype vector, or centroid, per class using a training set of documents. This method is

easy to implement and computationally efficient. However, it is often plagued by inductive bias (Liu, Yang, & Carbonell, 2002) or model misfit (Wu, Phang, Liu, & Li, 2002). Centroid Classifier makes the simple assumption that a given document should be assigned to a particular class if the similarity of this document to the centroid of that class is the largest. This supposition is violated (misfit) whenever a document from class A shares more similarity with the centroid of class B than with that of class A. The more serious the model misfit, the poorer the classification performance will be. Whereas for an individual domain (e.g. a particular text collection) the best set of parameters can be found through tedious experimentation, a generic approach for addressing model misfit is typically needed. Numerous researchers have therefore investigated generic methods that automatically improve the performance of base text classifiers. These methods include Weight Adjustment (Shankar & Karypis, 2000), Voting (Aas & Eikvil, 1999), Refinement (Wu et al., 2002) and DragPushing (Tan, 2008; Tan, Cheng, Ghanem, Wang, & Xu, 2005).

However, existing methods employ only one criterion, namely training-set error. Research in machine learning indicates that a method based only on training-set error cannot guarantee the generalization capability of the base classifier on unseen examples; in other words, a low training error rate does not imply a low error rate on unseen examples. This is the so-called over-training (or over-fitting) problem.

To overcome this problem, we propose a novel Model Adjustment (MA) algorithm to boost the performance of Centroid Classifier, which makes use of training-set errors as well as training-set margins. A margin (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002) is a geometric measure for evaluating the confidence of a classifier with respect to its decision, and margins already play a crucial role in current machine learning research. The novelty of this paper is the use of the large-margin principle for Model Adjustment of Centroid Classifier. MA offers several advantages: ease of implementation, efficiency in training, and high accuracy in classification.

From the perspective of mathematics, we first justify that for a linearly separable problem the proposed method converges to the optimal solution after finite online updates if we select an appropriate learning parameter η. We then further prove that for a linearly separable problem the proposed method converges after finite online/batch updates using any learning parameter η (η > 0).

To investigate the performance of the proposed method, we conduct an extensive experimental comparison against three other methods, i.e., Centroid Classifier, Winnow and SVM, on four benchmark document corpora. The experimental results show that the proposed technique is able to enhance the classification performance of Centroid Classifier dramatically. Furthermore, the resulting classifier performs slightly better than SVM in classification accuracy and beats it in running time.

The rest of this paper is organized as follows. The next section reviews related work. Section 3 describes Centroid Classifier. Model Adjustment of Centroid Classifier is presented in Section 4. Experimental results are given in Section 5. Finally, Section 6 concludes the paper.

2. Related work

In this section, we briefly review related research and compare it with the proposed method.

It is well known that SVM is a classical margin-based classifier. The core of SVM is to find a decision surface that "best" separates the data points into two classes; specifically, the "best" decision surface in a linearly separable space is the hyperplane that maximizes the "margin", that is, the distance between two parallel hyperplanes separating the two classes of training points. In contrast to the SVM classifier, MA has two characteristics: first, MA starts from the centroid classifier, while SVM starts from a randomly selected point; second, MA employs the hypothesis margin (or approximate margin) (Crammer et al., 2002), while SVM uses the sample margin (or accurate margin) (Cortes & Vapnik, 1995).

Both Winnow and Perceptron (van Mun, 1999) are derived from the online mistake-driven learning model: a mistake-driven algorithm updates its weight vector only when a mistake is made. Winnow and Perceptron differ in the way they update their weight vectors during the training phase. Different from Winnow or Perceptron, MA starts from the base classifier, while Perceptron and Winnow begin with randomly selected weight vectors; a second distinction is that the training and classification of Perceptron and Winnow depend upon given thresholds.

Voting (Aas & Eikvil, 1999) is a well-known strategy for the correction of inductive bias. It takes a classifier and a training set as input and trains the classifier multiple times on different versions of the training set; the generated classifiers are then combined to create a final classifier that is used to classify the test set. Wu et al. (2002) presented another novel approach to handling model misfit: based on prediction errors on a training set, their technique retrains a sub-classifier for each predicted class on its misclassified training examples, using the same learning method.


Compared to Voting or Wu's technique, MA has three particularities. First, MA does not need to retrain the classifier multiple times on different versions or subsets of the entire training set; consequently, it consumes much less training time than the above two methods. Second, MA produces only one refined classifier, so prediction is much faster. Third, Voting and Wu's technique utilize only training-set errors, while MA employs training-set errors as well as training-set margins.

A "drag-pushing" strategy for Centroid and Naive Bayes Classifiers was proposed by Tan et al. (2005). With some similarity to Wu's method (Wu et al., 2002), it takes advantage of misclassified training examples to successively refine the classification model by online modification. Compared to this method, MA has three differences. First, MA employs training error as well as training margin, which Tan's method does not use, as the learning objective. Second, MA uses batch update, while Tan's method uses online modification, which may lead to the to-and-fro movement problem. Third, we present a convergence analysis for MA.

A weight adjustment scheme for Centroid Classifier was proposed by Shankar and Karypis (2000). The main idea is to use a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently, based on the assumption that terms with higher discriminating power should play a more important role in classification than terms with lower discriminating power. Compared with the weight-adjustment scheme, MA has two differences: first, MA employs training error and training margin rather than discriminating power as the adjustment goal; second, the weight-adjustment scheme needs to split the training set into a smaller training set and a validation set.

3. Centroid classifier

The idea behind the centroid classification algorithm (Shankar & Karypis, 2000) is extremely simple and straightforward. First, we compute the weighted representation of each training document; second, we calculate the prototype vector or centroid vector C_i for each training class c_i; then we compute the similarity of a testing document d to all centroids; finally, based on these similarities, we assign d to the class corresponding to the most similar centroid. In the following, we elaborate these steps in detail.

In this work, documents are represented using the vector space model. In this model, each document d is considered to be a vector in term-space. For term weighting we employ TFIDF (Sebastiani, 2002):

\[
w(t,d) = \frac{tf(t,d)\cdot\log(N/n_t)}{\sqrt{\sum_{t\in d}\left[tf(t,d)\cdot\log(N/n_t)\right]^2}}, \tag{1}
\]

where N is the total number of training documents and n_t is the number of documents containing the term t; tf(t, d) denotes the number of occurrences of term t in document d. Given this representation of documents, the centroid of each class can be computed as follows:

\[
C_i = \frac{1}{|c_i|}\sum_{d\in c_i} d, \tag{2}
\]

where |z| denotes the cardinality of a set z. We then compute the similarity of a document d to each centroid with the normalized inner-product (cosine) measure,

\[
Sim(d, C_i) = \frac{d\cdot C_i}{\|d\|_2\cdot\|C_i\|_2}, \tag{3}
\]

where ||z||_2 denotes the 2-norm of z, and "·" denotes the dot product of two vectors.


Lastly, based on these similarities, we assign d the class label corresponding to the most similar centroid:

\[
c^{*} = \arg\max_{c_i}\, Sim(d, C_i). \tag{4}
\]
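For illustration, the whole pipeline of this section (TFIDF weighting per formula (1), class centroids per formula (2), and the decision rule of formulas (3) and (4)) can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions about data layout; the function names and dictionary-based vectors are illustrative choices, not code from the paper.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(doc_terms, df, N):
    # doc_terms: list of tokens in one document; df: term -> document frequency;
    # N: total number of training documents. Implements formula (1).
    tf = Counter(doc_terms)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}       # unit-length vector

def build_centroids(docs, labels, df, N):
    # Average the document vectors of each class, formula (2).
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter(labels)
    for terms, c in zip(docs, labels):
        for t, v in tfidf_vector(terms, df, N).items():
            sums[c][t] += v
    return {c: {t: v / counts[c] for t, v in vec.items()} for c, vec in sums.items()}

def classify(doc_terms, centroids, df, N):
    # Decision rule of formulas (3) and (4): most similar centroid wins.
    d = tfidf_vector(doc_terms, df, N)
    def cos(w):
        dot = sum(d.get(t, 0.0) * x for t, x in w.items())
        return dot / (math.sqrt(sum(x * x for x in w.values())) or 1.0)
    return max(centroids, key=lambda c: cos(centroids[c]))
```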

The time complexity of learning a centroid classifier is linear in the number of documents (N) and the number of words (W) in the training set. The computation of the vector-space representation of the documents can be done by scanning the training set at most three times, i.e., O(NW). Similarly, all K centroids can be computed in a single scan over the training set, i.e., O(KW). Furthermore, the running time required to classify a new document is at most O(KW). Therefore, the centroid classifier is a linear classifier.

4. Proposed technique

In this section, we present the model adjustment algorithm in detail. First we describe the ways to solve three problems: model bias, over-training, and to-and-fro movement. Then the algorithm outline is discussed in Section 4.4. The last subsection provides the convergence analysis.

4.1. Cope with model bias problem

The model bias inherent in Centroid Classifier is captured by the prototype vectors, or class centroids, computed by the classifier. The core assumption is that a given document should be assigned to a particular class if the similarity of this document to the centroid of its true class is the largest. Nevertheless, this supposition is often violated when there exists a document from class A sharing more similarity with the centroid of class B than with that of class A.

Let us take two-class text data as an example; the data distribution is illustrated in Fig. 1. Class A, shown in grey, is elliptically distributed, while class B, shown in white, is roundly distributed. C_A and C_B are the centroids of class A and class B respectively. The Middle Line is the perpendicular bisector of the line segment between C_A and C_B; from another perspective, the Middle Line serves as the decision hyperplane that separates class A and class B. Obviously, the examples of category A to the right of the Middle Line share more similarity with centroid C_B than with C_A, so they will be misclassified into class B. This is a case where the supposition of Centroid Classifier is violated by the data distribution.

In order to reduce this model bias, we make use of training errors to adjust the prototype vectors. For example, if document d of class A is misclassified into class B, both centroids C_A and C_B should be moved right by the following formulas (5) and (6) respectively:

\[
C_A = C_A + \eta\cdot d, \tag{5}
\]
\[
C_B = C_B - \eta\cdot d, \tag{6}
\]

where η denotes the "LearnRate" used to control the strength of the update. Under this move operation, C_A and C_B both move right gradually. At the end of this process (see Fig. 2), no example of class A lies to the right of the Middle Line, so no example will be misclassified.
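In code, the move operation of formulas (5) and (6) is a single pair of vector updates. The following one-function sketch (our own illustrative naming, assuming NumPy document vectors) is not the authors' implementation:

```python
import numpy as np

def online_adjust(C_true, C_rival, d, eta=0.5):
    # Formula (5): pull the true-class centroid toward the misclassified document d.
    # Formula (6): push the rival centroid away from it.
    return C_true + eta * d, C_rival - eta * d
```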

Fig. 1. The outline of original centroids.

Fig. 2. The outline of refined centroids.

4.2. Address model bias and over-train problem

However, the above adjustment approach employs only one criterion, namely training-set error. From the point of view of machine learning, a training-set-error-based method cannot guarantee the generalization capability of the base classifier on unseen examples; in other words, a low training error rate does not imply a low error rate on unseen examples. This is the so-called over-training (or over-fitting) problem.

To demonstrate this problem, we return to the two-class dataset above. Without loss of generality, we can construct the future distribution of class A and class B; the training examples are only a small portion of the unseen examples of the two classes (as illustrated in Fig. 3, where the unseen examples of class A are shown in grey and those of class B in white). After adjusting the classifier model with misclassified training examples, the Middle Line moves right to the border of class A (see Fig. 4). In this case all training examples are correctly classified, but not all unseen examples are: the unseen examples of class A to the right of the Middle Line will still be misclassified into class B. This observation indicates that training-set-error-based model update cannot guarantee the classification performance of the base classifier on unseen documents.

To improve the classification ability of the classifier on unseen examples, the Middle Line should be moved right again; that is, centroids C_A and C_B should both be moved right. To achieve this, some correctly classified examples near the Middle Line in class A should also be employed to adjust C_A and C_B. That is, for each training example d in class A, we not only require that Sim(d, C_A) be bigger than Sim(d, C_B), but also demand that Sim(d, C_A) exceed Sim(d, C_B) by a wide margin. In other words, for an example d of class A, what we need to do is maximize the "margin":

\[
\rho(d, C_A, C_B) = Sim(d, C_A) - Sim(d, C_B). \tag{7}
\]

We can generalize this formula to the multi-class setting:

\[
\rho(d, C_R, C_M) = Sim(d, C_R) - Sim(d, C_M), \tag{8}
\]

where C_R denotes the most similar centroid to d with the same label, and C_M the most similar centroid with a different label.
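The hypothesis margin of formula (8) can be read off directly from the similarity scores. A minimal sketch, assuming a `sims` array that holds Sim(d, C_i) for every class and that `r` is the index of the document's true class (the helper name is our own, not the paper's):

```python
def hypothesis_margin(sims, r):
    # sims: sequence of K similarities of one document to all centroids.
    rival = max((i for i in range(len(sims)) if i != r), key=lambda i: sims[i])
    return sims[r] - sims[rival]   # rho(d) = Sim(d, C_R) - Sim(d, C_M)
```

A document then counts as a training error when this value is negative, and as a small-margin example when it lies between 0 and the threshold θ introduced below.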

Fig. 3. The distribution of unseen examples of Class A and Class B.

Fig. 4. Refining the centroids by training examples.

Fig. 5. Refining the centroids by training examples and unseen examples.

Fig. 6. The original centroids of three categories.

It is worth noting that this definition of margin has nearly the same form as the hypothesis margin introduced by Crammer et al. (2002). For the sake of brevity, we write ρ(d) instead of ρ(d, C_R, C_M) in the rest of this paper.

To further illustrate this kind of margin, take document d in Fig. 4 as an example. Although d is correctly classified once the Middle Line has moved to the border of class A, its margin is very close to zero since it lies exactly on the Middle Line. Hence, in order to enlarge the margin, both centroids C_A and C_B should be moved right again by formulas (5) and (6). After a few such moves, the Middle Line reaches the border of the unseen examples of class A (as demonstrated in Fig. 5), and all unseen examples can be correctly categorized. This is the mechanism by which the margin further boosts the classification ability of the classifier on unseen examples.

According to formula (3), Sim(x, y) ranges from 0 to 1, so the margin ρ(d) ranges from -1 to 1. If an example's margin exceeds but is near 0, the margin is quite small and needs to be enlarged; on the other hand, if the margin approaches 1, it is very large and does not need to be increased. Accordingly, in order to concentrate on small-margin examples, we introduce a small positive margin threshold, MinMargin (denoted by θ): if the margin of example d is smaller than θ, d is employed to adjust the classifier model as a small-margin example.

Obviously, maximization of formula (8) involves two criteria: training errors and training margins. On one hand, if ρ(d) < 0, instance d is misclassified and serves as a training error; on the other hand, if 0 < ρ(d) < θ, it is correctly classified but its margin is smaller than θ, so it serves as a training-margin example.

4.3. Tackle model bias, over-train and to-and-fro movement problem

There are two ways to update the classifier model with misclassified examples and small-margin examples: on-line and by-batch. On-line update selects one misclassified or small-margin example, uses it to adjust the classifier model, and continues until termination. On-line update is very simple and easy to implement, but it often leads to to-and-fro movement of some centroids.

To illustrate this situation, we take three-class text data as an example (see Fig. 6). Class A, shown in grey, is elliptically distributed, while classes B and D, shown in white, are roundly distributed. C_A, C_B and C_D are the centroids of classes A, B and D respectively. Obviously, the examples of category A to the right of Middle Line BA or to the left of Middle Line DA will be misclassified, such as d_1, d_2, d_3, d_4 and d_5. For ease of explanation, we assume d_1 = d_2 and d_3 = d_4. As a result, the example series d_1, d_2, d_3, d_4 moves centroid C_A to and fro: C_A is first moved left by d_1, then moved back right by d_2, moved left again by d_3, and moved back right again by d_4. In a word, this example series cannot move centroid C_A at all.

To overcome this problem, we employ by-batch update to combine training-set error and training-set margin. That is, in each update we categorize all training documents and then use the misclassified and small-margin examples to adjust the corresponding centroids. The batch-update formula can be written as

\[
C_A = C_A + \eta\cdot\Bigg(\sum_{\substack{d\in c_A\\ \rho(d)<0}} d \;-\; \sum_{\substack{d\notin c_A\\ \rho(d)<0}} d \;+\; \sum_{\substack{d\in c_A\\ 0<\rho(d)<\theta}} d \;-\; \sum_{\substack{d\notin c_A\\ 0<\rho(d)<\theta}} d\Bigg), \tag{9}
\]

where the sums over d ∉ c_A range over the documents for which C_A is the most similar centroid with a different label (cf. the sets u_i^t and v_i^t in Section 4.5).

According to formula (9), the sum of d_1, d_2, d_3 and d_4 is a zero vector, so these examples cannot exert any influence on the adjustment of centroid C_A; the so-called to-and-fro movement is thus overcome. As a result, among the five misclassified documents in Fig. 6, only document d_5 exerts influence on the adjustment of centroid C_A. After a few such batch updates, as displayed in Fig. 7, both Middle Line DA and Middle Line BA are moved out of class A, and all examples can be correctly categorized.

Fig. 7. The moved centroids of three categories.

In order to balance training errors and training margins, we introduce a constant parameter "Weight" (denoted by ω). As a result, the batch-update formula can be modified as

\[
C_A = C_A + \eta\cdot\Bigg(\sum_{\substack{d\in c_A\\ \rho(d)<0}} d \;-\; \sum_{\substack{d\notin c_A\\ \rho(d)<0}} d \;+\; \omega\Big(\sum_{\substack{d\in c_A\\ 0<\rho(d)<\theta}} d \;-\; \sum_{\substack{d\notin c_A\\ 0<\rho(d)<\theta}} d\Big)\Bigg). \tag{10}
\]

For convenience, we refer to this batch-update formula as the Model-Adjustment formula.
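To make the procedure concrete, the following Python sketch implements the batch Model-Adjustment pass of formula (10) on unit-normalized TFIDF vectors. The NumPy layout, the function name `batch_model_adjust`, and the re-normalization step are our own illustrative assumptions, not the authors' code:

```python
import numpy as np

def batch_model_adjust(X, y, C, eta=0.5, theta=0.1, omega=0.2, max_iter=10):
    # X: (N, W) L2-normalized document vectors; y: (N,) labels in [0, K).
    # C: (K, W) initial class centroids per formula (2); returns refined centroids.
    K = C.shape[0]
    for _ in range(max_iter):
        Cn = C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-12)
        sims = X @ Cn.T                    # cosine similarities Sim(d, C_i)
        delta = np.zeros_like(C)
        for n in range(X.shape[0]):
            r = y[n]
            others = np.delete(np.arange(K), r)
            m = others[np.argmax(sims[n, others])]   # most similar wrong class
            margin = sims[n, r] - sims[n, m]         # rho(d), formula (8)
            if margin < 0:                           # training error
                delta[r] += X[n]; delta[m] -= X[n]
            elif margin < theta:                     # small-margin example
                delta[r] += omega * X[n]; delta[m] -= omega * X[n]
        if not delta.any():
            break                  # no errors or small-margin examples remain
        C = C + eta * delta        # batch update, formula (10)
    return C
```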


4.4. The model adjustment algorithm

After explaining the mechanism of Model Adjustment for Centroid Classifier, we present the detailed algorithm in this section. As illustrated in Fig. 8, we first load the training data and parameters (θ, ω and η), and then calculate one centroid for each category. In each iteration of the updating phase, we categorize all training documents and then use the misclassified and small-margin examples to adjust the centroids by formula (10). For the sake of brevity, we refer to the model-adjustment algorithm as MA.

Fig. 8. The outline of model adjustment for centroid classifier.

Assume that there are N training documents, T test documents, W words in total, K classes and M iteration steps. The time complexity of step 2 is O(NW + KW); since K < N, this is O(NW). Steps 3.1 and 3.2 can be done in O(NKW) and O(KW) respectively, so the running time of step 3 is O(M(NKW + KW)), i.e., O(MNKW). As a result, the training time of MA scales linearly with the number of training documents (N). Since the improved classifier still consists of K centroids, the prediction time required by MA is the same as for Centroid Classifier, i.e., O(TKW). Accordingly, MA is still a linear classifier.

4.5. The convergence analysis

Given a training set \(S = \bigcup_{i=1}^{K} S_i\), where K denotes the number of training classes and \(S_i\) denotes the training examples of class i. In the following analysis, we suppose the data is 2-norm bounded, that is, \(\forall d \in S,\ \|d\|_2 \le R\ (R > 0)\). Since the size of the training set is finite, this assumption always holds.

Definition 1. A training set S is a linearly separable problem if there exists \(\{C_1^{opt}, C_2^{opt}, \ldots, C_K^{opt}\}\) such that for all \(i \in [1, K]\),
\[
C_i^{opt}\cdot d - C_j^{opt}\cdot d \ge \gamma\ (\gamma > 0), \quad \text{where } d \in S_i,\ j \ne i.
\]

Theorem 1. With respect to a linearly separable problem, if we select an appropriate learning parameter η, the proposed method converges to the optimal solution \(\{C_i^{opt}\}\) after finite online updates.

Proof. In iteration t, assume example d (d ∈ S_A) is a misclassified or small-margin example, that is, \(C_A^t\cdot d - C_B^t\cdot d < \delta\ (0 < \delta < \gamma)\), where \(C_B^t\) denotes the most similar centroid to d with a different label. Then
\[
\begin{aligned}
\sum_{i=1}^{K}\|C_i^{t+1}-C_i^{opt}\|^2
&= \sum_{i\ne A,B}\|C_i^{t}-C_i^{opt}\|^2 + \|C_A^{t}+\eta d-C_A^{opt}\|^2 + \|C_B^{t}-\eta d-C_B^{opt}\|^2\\
&= \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2\|d\|^2 + 2\eta d\cdot(C_A^{t}-C_B^{t}) - 2\eta d\cdot(C_A^{opt}-C_B^{opt})\\
&\le \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2R^2 + 2\eta\delta - 2\eta\gamma
 = \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2R^2 + 2\eta(\delta-\gamma).
\end{aligned}
\]
As long as we select \(\eta < (\gamma-\delta)/R^2\), we can guarantee that \(\sum_{i}\|C_i^{t+1}-C_i^{opt}\|^2 < \sum_{i}\|C_i^{t}-C_i^{opt}\|^2\); in other words, after each update the class centroids approach the optimal centroids. Furthermore, if we select an appropriate ρ such that \(0 < \rho < \eta(\gamma-\delta) - \eta^2R^2\), then
\[
\sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 < \sum_{i=1}^{K}\|C_i^{t-1}-C_i^{opt}\|^2 - 2\rho \le \sum_{i=1}^{K}\|C_i^{0}-C_i^{opt}\|^2 - 2t\rho.
\]
Obviously \(\sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 \ge 0\). Letting \(\zeta = \sum_{i=1}^{K}\|C_i^{0}-C_i^{opt}\|^2\), we get \(\zeta - 2t\rho > 0\), that is, \(t < \zeta/(2\rho)\). ∎

Lemma 1. \(\big(\sum_{i=1}^{K} a_i\big)^2 \le K\sum_{i=1}^{K} a_i^2\) when \(a_i \ge 0\).

Proof.
\[
K\sum_{i=1}^{K}a_i^2 - \Big(\sum_{i=1}^{K}a_i\Big)^2 = (K-1)\sum_{i=1}^{K}a_i^2 - 2\sum_{i<j}a_i a_j = \sum_{i<j}(a_i-a_j)^2 \ge 0. \qquad ∎
\]

Theorem 2. With respect to a linearly separable problem, the proposed method converges after finite online updates using any learning parameter η (η > 0).

Proof. In iteration t, assume example d (d ∈ S_A) is a misclassified or small-margin example, that is, \(C_A^t\cdot d - C_B^t\cdot d < \delta\ (0 < \delta < \gamma)\), where \(C_B^t\) denotes the most similar centroid to d with a different label. Then
\[
\sum_{i=1}^{K}C_i^{t+1}\cdot C_i^{opt}
= \sum_{i\ne A,B}C_i^{t}\cdot C_i^{opt} + (C_A^{t}+\eta d)\cdot C_A^{opt} + (C_B^{t}-\eta d)\cdot C_B^{opt}
= \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta d\cdot(C_A^{opt}-C_B^{opt})
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\gamma,
\]
which indicates
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} \ge \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma. \tag{11}
\]
In the same way,
\[
\sum_{i=1}^{K}\|C_i^{t+1}\|^2
= \sum_{i\ne A,B}\|C_i^{t}\|^2 + \|C_A^{t}+\eta d\|^2 + \|C_B^{t}-\eta d\|^2
= \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2\eta^2\|d\|^2 + 2\eta d\cdot(C_A^{t}-C_B^{t})
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2\eta^2R^2 + 2\eta\delta,
\]
therefore
\[
\sum_{i=1}^{K}\|C_i^{t}\|^2 \le \sum_{i=1}^{K}\|C_i^{0}\|^2 + 2t(\eta^2R^2 + \eta\delta). \tag{12}
\]
Let \(\tau = \max_i \|C_i^{opt}\|\). Then, using Lemma 1,
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt}
\le \sum_{i=1}^{K}\|C_i^{t}\|\,\|C_i^{opt}\|
\le \tau\sum_{i=1}^{K}\|C_i^{t}\|
\le \tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{t}\|^2\Big)^{1/2}.
\]
Combining this with (11) and (12), and using \(\sqrt{a+b}\le\sqrt{a}+\sqrt{b}\), we obtain
\[
\tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{0}\|^2\Big)^{1/2} + \tau\sqrt{2K(\eta^2R^2+\eta\delta)\,t}
\;\ge\; \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma.
\]
The left-hand side grows as \(\sqrt{t}\) while the right-hand side grows linearly in t, so if the above inequality holds, t must be finite. That is to say, the proposed method converges after finite online updates. ∎

Theorem 3. With respect to a linearly separable problem, the proposed method converges after finite batch updates using any learning parameter η (η > 0).

Proof. In iteration t, let \(U^t\ (|U^t| = n_t \ge 1)\) denote the set of examples whose margin is smaller than δ (0 < δ < γ). Let \(u_i^t\) denote the examples that belong to class i and whose margin is smaller than δ, and let \(v_i^t\) denote the examples whose most similar centroid with a different label is \(C_i\) and whose margin is smaller than δ. Obviously,
\[
u_i^t \cap v_i^t = \emptyset, \qquad \bigcup_i u_i^t = \bigcup_i v_i^t = U^t, \qquad \sum_i |u_i^t| = \sum_i |v_i^t| = n_t.
\]
Moreover, if \(d \in u_i^t\) and \(d \in v_j^t\ (i \ne j)\), then \(d\cdot C_i^{t} - d\cdot C_j^{t} < \delta < \gamma\). The batch update of Section 4.3 can be written as \(C_i^{t+1} = C_i^{t} + \eta\big(\sum_{d\in u_i^t} d - \sum_{d'\in v_i^t} d'\big)\). Therefore
\[
\sum_{i=1}^{K}C_i^{t+1}\cdot C_i^{opt}
= \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\!\!\sum_{d\in u_i^t \,\&\, d\in v_j^t}\!\! d\cdot(C_i^{opt}-C_j^{opt})
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + n_t\eta\gamma
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\gamma,
\]
which indicates
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} \ge \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma. \tag{13}
\]
In the same way,
\[
\sum_{i=1}^{K}\|C_i^{t+1}\|^2
= \sum_{i=1}^{K}\Big\|C_i^{t}+\eta\Big(\sum_{d\in u_i^t}d - \sum_{d'\in v_i^t}d'\Big)\Big\|^2
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2n_t\eta^2R^2 + 2\eta\!\!\sum_{d\in u_i^t \,\&\, d\in v_j^t}\!\! d\cdot(C_i^{t}-C_j^{t})
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2n_t(\eta^2R^2+\eta\delta)
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2N(\eta^2R^2+\eta\delta),
\]
where N denotes the number of training examples. As a result,
\[
\sum_{i=1}^{K}\|C_i^{t}\|^2 \le \sum_{i=1}^{K}\|C_i^{0}\|^2 + 2tN(\eta^2R^2+\eta\delta). \tag{14}
\]
Let \(\tau = \max_i \|C_i^{opt}\|\). Similar to Theorem 2, combining (13) and (14) yields
\[
\tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{0}\|^2\Big)^{1/2} + \tau\sqrt{2KN(\eta^2R^2+\eta\delta)\,t}
\;\ge\; \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma.
\]
Obviously, if the above inequality holds, t must be finite. That is to say, the proposed method converges after finite batch updates. ∎
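To make the convergence behaviour tangible, the following toy simulation (entirely our own construction, not an experiment from the paper) runs the batch update of Section 4.3 with ω = 1 on a small linearly separable two-class problem and reports the number of passes until no errors or small-margin examples remain; by Theorem 3 this terminates for any η > 0:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two linearly separable classes of unit vectors in the positive quadrant.
a = rng.normal([4, 1], 0.3, (50, 2)); b = rng.normal([1, 4], 0.3, (50, 2))
X = np.vstack([a, b]); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.array([0] * 50 + [1] * 50)

C = np.vstack([X[y == 0].mean(0), X[y == 1].mean(0)])  # initial centroids
eta, theta = 1.0, 0.05
for t in range(1, 1000):
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    sims = X @ Cn.T
    margin = sims[np.arange(100), y] - sims[np.arange(100), 1 - y]
    bad = margin < theta                      # errors and small-margin examples
    if not bad.any():
        print(f"converged after {t - 1} batch updates")
        break
    sign = np.where(y == 0, 1.0, -1.0) * bad  # +d pulls C_0, -d pulls C_1
    delta = (sign[:, None] * X).sum(0)
    C[0] += eta * delta                       # formula (9) for class 0
    C[1] -= eta * delta                       # mirrored update for class 1
```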

5. Empirical assessment

In this section, we conduct experiments to verify the effectiveness of the proposed method. First, we empirically compare the proposed method with other classification algorithms; then we investigate whether training error and training margin can enhance the performance of the base classifier effectively and robustly; finally, we tune the performance of the proposed method using its parameters.

5.1. Datasets

In our experiments we use four corpora: Reuter-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/), 20NewsGroup (http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/wwkb), Industry Sector (http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) and OHSUMED (ftp://medir.ohsu.edu/pub/OHSUMED/).

Reuter-21578. The Reuters-21578 text categorization test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. We used a subset consisting of 92 categories and 10,346 documents in total.

20NewsGroup. The 20Newsgroup (20NG) dataset contains approximately 20,000 articles evenly divided among 20 Usenet newsgroups. We use a subset consisting of all 20 categories and 19,446 documents.

Industry Sector. The Industry Sector dataset is based on the data made available by Market Guide, Inc. (www.marketguide.com). The set consists of company homepages categorized in a hierarchy of industry sectors, but we disregard the hierarchy. There were 9,637 documents in the dataset, divided into 105 classes. We use a subset called Sector-48 consisting of 48 categories and 4,581 documents in all.

OHSUMED. The OHSUMED dataset (Hersh, Buckley, Leone, & Hickam, 1994) is a bibliographical document collection developed by William Hersh and colleagues at the Oregon Health Sciences University; it is a subset of the MEDLINE database. We use a subset from OHSUMED (called ohscal in Shankar & Karypis (2000); available at http://www.cs.umn.edu/han/data/tmdata.tar.gz) that contains 11,162 documents in 10 categories: Antibodies, Carcinoma, DNA, In-Vitro, Molecular-Sequence-Data, Pregnancy, Prognosis, Receptors, Risk-Factors and Tomography.

5.2. Performance measure

To evaluate the classification system, we use the F1 measure introduced by van Rijsbergen (1979). This measure combines recall and precision in the following way:

\[
\text{Recall} = \frac{\text{number of correct positive predictions}}{\text{number of positive examples}}, \qquad
\text{Precision} = \frac{\text{number of correct positive predictions}}{\text{number of positive predictions}},
\]
\[
F_1 = \frac{2\cdot\text{Recall}\cdot\text{Precision}}{\text{Recall} + \text{Precision}}.
\]

For ease of comparison, we summarize the F1 scores over the different categories using the micro- and macro-averages of the F1 scores:

\[
\text{Micro-}F_1 = F_1 \text{ computed over all categories and documents pooled}, \qquad
\text{Macro-}F_1 = \text{average of the within-category } F_1 \text{ values}.
\]

The MicroF1 and MacroF1 emphasize the performance of the system on common and rare categories respectively. Using these averages, we can observe the effect of different kinds of data on a classification system.
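As a minimal sketch of how these averages can be computed from per-category counts (the helper names and the `(tp, fp, fn)` layout are our own assumptions, not tooling from the paper):

```python
def f1(tp, fp, fn):
    # Precision = tp/(tp+fp); Recall = tp/(tp+fn); F1 is their harmonic mean.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    # counts: list of (tp, fp, fn) triples, one per category.
    micro = f1(*map(sum, zip(*counts)))                # pool counts, then score
    macro = sum(f1(*c) for c in counts) / len(counts)  # average per-category F1
    return micro, macro
```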

5.3. Experimental design

We evenly split each dataset into three parts, using two parts for training and the remaining third for testing. We perform this train-test procedure three times and report the average of the three runs as the final result (threefold cross-validation).

In order to remove redundant features and save running time, we employ Information Gain for feature selection, because it consistently performs well in most cases. The algorithms are coded in C++ and run on a Pentium 4 machine with a single 3.0 GHz CPU and 512 MB of memory.

For the SVM classifier, we employed LibSvm and BSvm (www.csie.ntu.edu.tw/cjlin/), which can directly deal with multi-class classification problems; all parameters were left at their defaults. LibSvm is a simple, easy-to-use support vector machine tool for classification, regression, and distribution estimation; we use version 2.84, released in April 2007. BSvm borrows the structure of LibSvm and adopts similar options. For the bound-constrained formulation for classification and regression, BSvm uses a decomposition method with a simple working-set selection, which leads to faster convergence on difficult cases than LibSvm; we use version 2.06, released in April 2006.

In our experiments we run Balanced Winnow, since it consistently yields better performance than Positive Winnow (van Mun, 1999).


Balanced Winnow keeps two weights for each feature l in category C_i, w⁺_il and w⁻_il. The weight values are initialized as w⁺_il = 2.0 and w⁻_il = 1.0, and the threshold is set to 1.0. The promotion parameter α and the demotion parameter β (the learning rates) were fixed at 1.2 and 0.8 respectively.

5.4. Comparison and analysis

5.4.1. Comparison with other methods

Tables 1 and 2 show the performance comparison in MicroF1 and MacroF1. The feature number is set to 10,000; for MA, MaxIteration, Weight, LearnRate and MinMargin are set to 10, 0.2, 0.5 and 0.1 respectively.

According to the two tables, MA improves the performance of Centroid Classifier dramatically, and the improvement is especially significant on Sector-48: MA improves Centroid Classifier by about 9% on Sector-48, about 7% on Reuter, about 5% on NewsGroup, and about 4% on OHSUMED. In a word, Model Adjustment is an effective and robust method to boost the performance of Centroid Classifier.

MA outperforms all the other methods on OHSUMED, Reuter and Sector-48. Notably, on Reuter the MicroF1 of MA is one percent lower than that of LibSvm (or BSvm), but its MacroF1 is 12 percent higher. On the whole, MA performs a little better than LibSvm (or BSvm), so we can say that MA is an efficient and competitive algorithm for text classification.

Table 3 reports the training time of the five methods on the four text collections. Note that the running time does not include the time for loading data from disk; the feature number and MA parameters are as above. As we can observe from this table, the CPU time required by LibSvm is about 40 times that of MA on OHSUMED and about 20 times on NewsGroup, so the time saving of MA over LibSvm is very pronounced. To the best of our knowledge, BSvm is indeed one of the fastest SVM classifiers available; however, under some conditions its speed costs considerable accuracy, e.g., on OHSUMED it reduces both MicroF1 and MacroF1 by 6 percent relative to LibSvm.

Table 1. The MicroF1 of different methods. (Bold values in the original indicate the best results.)

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     0.8049   0.7676     0.7193   0.7906   0.7342
  Reuter      0.8565   0.7820     0.8263   0.8694   0.8643
  Sector-48   0.8970   0.8055     0.8003   0.8732   0.8755
  NewsGroup   0.8892   0.8429     0.8105   0.9040   0.9020

Table 2. The MacroF1 of different methods. (Bold values in the original indicate the best results.)

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     0.7940   0.7600     0.7110   0.7800   0.7252
  Reuter      0.6061   0.5617     0.4891   0.4875   0.4880
  Sector-48   0.9000   0.8152     0.8389   0.8780   0.8791
  NewsGroup   0.8859   0.8389     0.8161   0.9029   0.9008

Table 3. Training time in seconds.

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     1.39     0.40       1.72     62.28    18.32
  Reuter      18.41    0.40       7.75     80.77    33.33
  Sector-48   11.91    0.50       4.92     38.31    24.43
  NewsGroup   7.56     0.48       4.90     160.11   55.21


Despite its high speed, BSvm still consumes at least twice the CPU time of MA. In summary, these experiments show that MA offers an attractive alternative for text categorization.

5.4.2. Training error, margin and performance vs. MaxIteration

Figs. 9-11 show the training-error, training-margin and prediction-performance curves of MA vs. MaxIteration on the four datasets. Weight, LearnRate and MinMargin are set to 0.2, 0.5 and 0.1 respectively; the feature number is set to 10,000.

The first observation is that the proposed Model Adjustment can decrease training error, enlarge the margin and boost prediction performance. The three figures demonstrate that increasing MaxIteration decreases training error and increases training margin and prediction performance. However, the change in the three measures is not directly proportional to the increase in MaxIteration: as MaxIteration grows, the curves of all three measures start to level off.

The second observation is that the first updating operation achieves the biggest performance improvement. MaxIteration equal to 0 means that no updating operation is used at all, i.e., plain Centroid Classifier. From Fig. 11 we can observe that a wide improvement is achieved by running only one round of the adjustment operation over the training set.

Fig. 9. Training-error-rate curves of MA vs. Iteration.

Fig. 10. Training-margin curves of MA vs. Iteration.

Fig. 11. MicroF1 curves of MA vs. Iteration.

5.4.3. Performance vs. Weight

Fig. 12 illustrates the performance of MA with respect to the varying value of Weight, which balances training error and training margin. MaxIteration, LearnRate and MinMargin are set to 10, 0.5 and 0.1 respectively; the feature number is set to 10,000. As we can observe from this figure, all the curves peak at Weight values larger than zero: the peaks of MA on OHSUMED, Reuter, Sector-48 and NewsGroup are around 0.4, 0.2, 0.2 and 0.8 respectively. Consequently, acceptably performing values for Weight range from 0.2 to 0.8.

Fig. 12. MicroF1 curves of MA vs. Weight.

5.4.4. Performance vs. LearnRate

Fig. 13 shows the performance of MA with respect to the varying value of LearnRate, which controls the step size of the updating operation. The feature number is set to 10,000; MaxIteration, Weight and MinMargin are set to 10, 0.2 and 0.1 respectively. From this figure, the best value for OHSUMED, Reuter and NewsGroup is about 0.5, while for Sector-48 it is about 1.5. As a result, empirical values for LearnRate range from 0.5 to 1.5.

Fig. 13. MicroF1 curves of MA vs. LearnRate.

5.4.5. Performance vs. MinMargin

Fig. 14 displays the performance of MA vs. MinMargin. The feature number is set to 10,000; MaxIteration, Weight and LearnRate are set to 10, 0.2 and 0.5 respectively.

Fig. 14. MicroF1 curves of MA vs. MinMargin.

From this figure we can obtain one observation: when MinMargin = 0, which means that only the training-set-error criterion is employed to update the base classifier, MA performs worse than in any case with a positive MinMargin. This indicates that the incorporation of margin can further improve the performance of Model Adjustment for Centroid Classifier, which is in line with our analysis of margin in Section 4 and provides evidence for the rationality and feasibility of incorporating margin into Model Adjustment.

6. Concluding remarks

In this work, a novel Model Adjustment (MA) algorithm was proposed to deal with the model misfit problem of Centroid Classifier. The basic idea is to pick out certain training examples to adjust the Centroid Classifier model. The main research findings are as follows.

Firstly, in order to avoid the over-training problem, we combine two measures for Model Adjustment: training-set errors and training-set margins. That is, misclassified examples as well as small-margin examples are picked out to update the classifier model.

Secondly, in consideration of the to-and-fro movement problem caused by online update, we employ by-batch update. That is, in each update all training documents are first categorized, and then the misclassified and small-margin examples are used to adjust the corresponding centroids.

Thirdly, from the perspective of mathematics, we proved that for a linearly separable problem the proposed method converges after finite online/batch updates using any learning parameter η (η > 0).

Lastly, extensive experiments were conducted on four benchmark evaluation collections. The results show that Model Adjustment makes a significant difference to the performance of Centroid Classifier, and furthermore that margin can further improve the performance of Model Adjustment for Centroid Classifier.

We believe that this research only scratches the surface of what can be achieved with Model Adjustment. Future work will seek new techniques to enhance the performance of Model Adjustment and to apply it to other classifiers.

Acknowledgments

This work was mainly supported by two funds, i.e., 60933005 and 60803085.

References

Aas, K., & Eikvil, L. (1999). Text categorisation: A survey. Report NR 941, Norwegian Computing Center.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, N. (2002). Margin analysis of the LVQ algorithm. In NIPS.
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In SIGIR (pp. 192–201).
Ishii, N., Murai, T., Yamada, T., & Bao, Y. (2006). Text classification by combining grouping, LSA and kNN. In The fifth IEEE/ACIS international conference on computer and information science (pp. 148–154).
Lertnattee, V., & Theeramunkong, T. (2002). Combining homogeneous classifiers for centroid-based text classification. In ISCC (pp. 1034–1039).
Li, R., & Hu, Y. (2003). Noise reduction to text categorization based on density for KNN. In ICMLC (pp. 3119–3124).
Liu, L., Sun, X., & Song, H. (2006). Combining fuzzy clustering with Naive Bayes augmented learning in text classification. In The first IEEE international symposium on pervasive computing and applications (pp. 168–171).
Liu, Y., Yang, Y., & Carbonell, J. (2002). Boosting to correct inductive bias in text classification. In CIKM (pp. 348–355).
Lu, M., Hu, K., Wu, Y., Lu, Y., & Zhou, L. (2002). SECTCS: Towards improving VSM and Naive Bayesian classifier. In SMC.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shankar, S., & Karypis, G. (2000). Weight adjustment schemes for a centroid based classifier. Technical report, Dept. of Computer Science, University of Minnesota.
Shin, K., Abraham, A., & Han, S. (2006). Enhanced centroid-based classification technique by filtering outliers. In TSD (pp. 159–163).
Tan, S. (2005). Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4), 667–671.
Tan, S. (2006). An effective refinement strategy for KNN text classifier. Expert Systems with Applications, 30(2), 290–298.
Tan, S. (2008). An improved centroid classifier for text categorization. Expert Systems with Applications, 35(1–2), 279–285.
Tan, S., & Cheng, X. (2007a). An effective approach to enhance centroid classifier for text categorization. In PKDD (pp. 581–588).
Tan, S., & Cheng, X. (2007b). Using hypothesis margin to boost centroid text classifier. In SAC (pp. 398–403).
Tan, S., Cheng, X., Ghanem, M. M., Wang, B., & Xu, H. (2005). A novel refinement approach for text categorization. In CIKM (pp. 469–476).
Tan, S., Cheng, X., Wang, Y., & Xu, H. (2009). Adapting Naive Bayes to domain adaptation for sentiment analysis. In ECIR.
Tan, S., Wu, G., Tang, H., & Cheng, X. (2007). A novel scheme for domain-transfer problem in the context of sentiment analysis. In CIKM (pp. 979–982).
Tsay, J., & Wang, J. (2004). Improving linear classifier for Chinese text categorization. Information Processing and Management, 223–237.
van Mun, P. P. T. M. (1999). Text classification in information retrieval using Winnow.
van Rijsbergen, C. (1979). Information retrieval. London: Butterworths.
Wang, B., & Zhang, S. (2005). A novel text classification algorithm based on Naive Bayes and KL-divergence. In PDCAT (pp. 913–915).
Wang, Z., Sun, X., Zhang, D., & Li, X. (2006). An optimal SVM-based text classification algorithm. In ICMLC (pp. 1378–1381).
Wu, H., Phang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In SIGKDD (pp. 207–216).
Yuan, F., Yang, L., & Yu, G. (2005). Improving the k-NN and applying it to Chinese text classification. In ICMLC (pp. 1547–1553).
Zhang, B., Su, J., & Xu, X. (2006). A class-incremental learning method for multi-class support vector machines in text classification. In ICMLC (pp. 2581–2585).