Adapting centroid classifier for document categorization

Expert Systems with Applications 38 (2011) 10264–10273. doi:10.1016/j.eswa.2011.02.114

Songbo Tan a,*, Yuefen Wang b, Gaowei Wu a

a Key Laboratory of Network, Institute of Computing Technology, Chinese Academy of Sciences, China
b Information Center, Chinese Academy of Geological Sciences, China
* Corresponding author. E-mail address: [email protected] (S. Tan).

Keywords: Centroid classifier; Text categorization; Information retrieval; Data mining

Abstract

In the community of information retrieval, Centroid Classifier has been shown to be a simple yet effective method for text categorization. However, it is often plagued by model misfit (or inductive bias) incurred by its assumption. Various methods have been proposed to address this issue, such as Weight Adjustment, Voting, Refinement and DragPushing. However, existing methods employ only one criterion, namely training-set error. Research in machine learning indicates that a method based only on training-set error cannot guarantee the generalization capability of the base classifier on unseen examples. To overcome this problem, we propose a novel Model Adjustment algorithm that makes use of training-set errors as well as training-set margins. Furthermore, we prove that for a linearly separable problem the proposed method converges after finite updates using any learning parameter η (η > 0). The empirical assessment conducted on four benchmark collections indicates that the proposed method performs slightly better than the SVM classifier in prediction accuracy and beats it in running time.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

With the rapid growth of texts on the Internet, text classification has been attracting more and more attention in the information retrieval and natural language processing communities. In most cases, the use of statistical or machine learning techniques has proven successful in this context, since it is typically more feasible to induce categorization rules from example documents than to extract such rules from domain experts. Numerous machine learning approaches have been introduced to deal with text classification, including Centroid Classifier (Lertnattee & Theeramunkong, 2002; Shankar & Karypis, 2000; Shin, Abraham, & Han, 2006; Tan & Cheng, 2007a, 2007b; Tan, Wu, Tang, & Cheng, 2007), K-Nearest Neighbor (KNN) (Ishii, Murai, Yamada, & Bao, 2006; Li & Hu, 2003; Tan, 2005, 2006; Yuan, Yang, & Yu, 2005), Naive Bayes (Liu, Sun, & Song, 2006; Lu, Hu, Wu, Lu, & Zhou, 2002; Tan, Cheng, Wang, & Xu, 2009; Wang & Zhang, 2005), Winnow or Perceptron (van Mun, 1999), Rocchio (Tsay & Wang, 2004), Voting (Aas & Eikvil, 1999) and Support Vector Machines (SVM) (Wang, Sun, Zhang, & Li, 2006; Zhang, Su, & Xu, 2006).

Despite its simplicity and straightforwardness, Centroid Classifier has proved to be an efficient and robust method for text categorization. Its basic idea is to construct a prototype vector, or centroid, per class using a training set of documents. This method is

easy to implement and computationally efficient. However, it is often plagued by inductive bias (Liu, Yang, & Carbonell, 2002) or model misfit (Wu, Phang, Liu, & Li, 2002). Centroid Classifier makes the simple assumption that a given document should be assigned to a particular class if the similarity of this document to the centroid of that class is the largest. This supposition is violated (misfit) whenever a document from class A shares more similarity with the centroid of class B than with that of class A. The more serious the model misfit, the poorer the classification performance will be. Whereas for an individual domain (e.g. a particular text collection) the best set of parameters can be found through tedious experimentation, a generic approach for addressing model misfit is typically needed. Numerous researchers have therefore investigated generic methods that automatically improve the performance of base text classifiers. These methods include Weight Adjustment (Shankar & Karypis, 2000), Voting (Aas & Eikvil, 1999), Refinement (Wu et al., 2002) and DragPushing (Tan, 2008; Tan, Cheng, Ghanem, Wang, & Xu, 2005).

However, existing methods employ only one criterion, namely training-set error. Research in machine learning indicates that a method based only on training-set error cannot guarantee the generalization capability of the base classifier on unseen examples; in other words, a low training error rate does not imply a low error rate on unseen examples. This is the so-called over-training (or over-fitting) problem.

To overcome this problem, we propose a novel Model Adjustment (MA) algorithm to boost the performance of Centroid Classifier, which makes use of training-set errors as well as training-set margins. A margin (Crammer, Gilad-Bachrach, Navot, & Tishby, 2002) is a geometric measure for evaluating the confidence of a classifier with respect to its decision, and margins already play a crucial role in current machine learning research. The novelty of this paper is the use of the large-margin principle for Model Adjustment of Centroid Classifier. MA offers several advantages: ease of implementation, efficiency in training, and high accuracy in classification.

From the perspective of mathematics, we first justify that for a linearly separable problem the proposed method converges to the optimal solution after finite online updates if we select an appropriate learning parameter η. We then further prove that for a linearly separable problem the proposed method converges after finite online/batch updates using any learning parameter η (η > 0).

To investigate the performance of the proposed method, we conduct an extensive experimental comparison against three other methods, i.e., Centroid Classifier, Winnow and SVM, on four benchmark document corpora. The experimental results show that the proposed technique is able to enhance the classification performance of Centroid Classifier dramatically. Furthermore, the resulting classifier performs slightly better than SVM in classification accuracy and beats it in running time.

The rest of this paper is organized as follows. The next section reviews related work. Section 3 describes Centroid Classifier. Model Adjustment of Centroid Classifier is presented in Section 4. Experimental results are given in Section 5. Finally, Section 6 concludes the paper.

2. Related work

In this section, we briefly review related research and compare it with the proposed method.

It is well known that SVM is a classical margin-based classifier. The core of SVM is to find a decision surface that "best" separates the data points into two classes; specifically, the "best" decision surface in a linearly separable space is the hyperplane that maximizes the "margin", that is, the distance between two parallel hyperplanes separating the two classes of training points. In contrast to the SVM classifier, MA has two characteristics: first, MA starts from the centroid classifier, while SVM starts from a randomly selected point; second, MA employs the hypothesis margin (or approximate margin) (Crammer et al., 2002), while SVM uses the sample margin (or accurate margin) (Cortes & Vapnik, 1995).

Both Winnow and Perceptron (van Mun, 1999) are derived from the online mistake-driven learning model: a mistake-driven algorithm updates its weight vector only when a mistake is made. Winnow and Perceptron differ in the way they update their weight vectors during the training phase. Different from Winnow or Perceptron, MA starts from the base classifier, while Perceptron and Winnow begin with randomly selected weight vectors; a second distinction is that the training and classification of Perceptron and Winnow depend upon given thresholds.

Voting (Aas & Eikvil, 1999) is a well-known strategy for the correction of inductive bias. It takes a classifier and a training set as input and trains the classifier multiple times on different versions of the training set; the generated classifiers are then combined to create a final classifier that is used to classify the test set. Wu et al. (2002) presented another novel approach to handling model misfit: based on prediction errors on a training set, their technique retrains a sub-classifier for each predicted class on its misclassified training examples, using the same learning method.


Compared to Voting or Wu's technique, MA has three particularities. First, MA does not need to retrain the classifier multiple times on different versions or subsets of the entire training set; consequently, it consumes much less training time than the above two methods. Second, MA produces only one refined classifier, so prediction is much faster. Third, Voting and Wu's technique utilize only training-set errors, while MA employs training-set errors as well as training-set margins.

A "drag-pushing" strategy for Centroid and Naive Bayes Classifiers was proposed by Tan et al. (2005). With some similarity to Wu's method (Wu et al., 2002), it takes advantage of misclassified training examples to successively refine the classification model by online modification. Compared to this method, MA has three differences. First, MA employs training error as well as training margin, which Tan's method does not use, as the learning objective. Second, MA uses batch update, while Tan's method uses online modification, which may lead to the to-and-fro movement problem. Third, we present a convergence analysis for MA.

A weight adjustment scheme for Centroid Classifier was proposed by Shankar and Karypis (2000). The main idea is to use a measure of the discriminating power of each term to gradually adjust the weights of all features concurrently, based on the assumption that terms with higher discriminating power should play a more important role in classification than terms with lower discriminating power. Compared with the weight-adjustment scheme, MA has two differences: first, MA employs training error and training margin rather than discriminating power as the adjustment goal; second, the weight-adjustment scheme needs to split the training set into a smaller training set and a validation set.

3. Centroid classifier

The idea behind the centroid classification algorithm (Shankar & Karypis, 2000) is extremely simple and straightforward. First, we compute the weighted representation of each training document; second, we calculate the prototype vector or centroid vector C_i for each training class c_i; then we compute the similarity of a testing document d to all centroids; finally, based on these similarities, we assign d to the class corresponding to the most similar centroid. In the following, we elaborate these steps in detail.

In this work, documents are represented using the vector space model. In this model, each document d is considered to be a vector in term-space. For term weighting we employ TFIDF (Sebastiani, 2002):

\[
w(t,d) = \frac{tf(t,d)\cdot\log(N/n_t)}{\sqrt{\sum_{t\in d}\left[tf(t,d)\cdot\log(N/n_t)\right]^2}}, \tag{1}
\]

where N is the total number of training documents and n_t is the number of documents containing the term t; tf(t, d) denotes the number of occurrences of term t in document d. Given this representation of documents, the centroid of each class can be computed as follows:

\[
C_i = \frac{1}{|c_i|}\sum_{d\in c_i} d, \tag{2}
\]

where |z| denotes the cardinality of a set z. We then compute the similarity of a document d to each centroid with the normalized inner-product (cosine) measure,

\[
Sim(d, C_i) = \frac{d\cdot C_i}{\|d\|_2\cdot\|C_i\|_2}, \tag{3}
\]

where ||z||_2 denotes the 2-norm of z, and "·" denotes the dot product of two vectors.


Lastly, based on these similarities, we assign d the class label corresponding to the most similar centroid:

\[
c^{*} = \arg\max_{c_i}\, Sim(d, C_i). \tag{4}
\]
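For illustration, the whole pipeline of this section (TFIDF weighting per formula (1), class centroids per formula (2), and the decision rule of formulas (3) and (4)) can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions about data layout; the function names and dictionary-based vectors are illustrative choices, not code from the paper.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(doc_terms, df, N):
    # doc_terms: list of tokens in one document; df: term -> document frequency;
    # N: total number of training documents. Implements formula (1).
    tf = Counter(doc_terms)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}       # unit-length vector

def build_centroids(docs, labels, df, N):
    # Average the document vectors of each class, formula (2).
    sums, counts = defaultdict(lambda: defaultdict(float)), Counter(labels)
    for terms, c in zip(docs, labels):
        for t, v in tfidf_vector(terms, df, N).items():
            sums[c][t] += v
    return {c: {t: v / counts[c] for t, v in vec.items()} for c, vec in sums.items()}

def classify(doc_terms, centroids, df, N):
    # Decision rule of formulas (3) and (4): most similar centroid wins.
    d = tfidf_vector(doc_terms, df, N)
    def cos(w):
        dot = sum(d.get(t, 0.0) * x for t, x in w.items())
        return dot / (math.sqrt(sum(x * x for x in w.values())) or 1.0)
    return max(centroids, key=lambda c: cos(centroids[c]))
```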

The time complexity of learning a centroid classifier is linear in the number of documents (N) and the number of words (W) in the training set. The computation of the vector-space representation of the documents can be done by scanning the training set at most three times, i.e., O(NW). Similarly, all K centroids can be computed in a single scan over the training set, i.e., O(KW). Furthermore, the running time required to classify a new document is at most O(KW). Therefore, the centroid classifier is a linear classifier.

4. Proposed technique

In this section, we present the model adjustment algorithm in detail. First we describe the ways to solve three problems: model bias, over-training, and to-and-fro movement. Then the algorithm outline is discussed in Section 4.4. The last subsection provides the convergence analysis.

4.1. Cope with model bias problem

The model bias inherent in Centroid Classifier is captured by the prototype vectors, or class centroids, computed by the classifier. The core assumption is that a given document should be assigned to a particular class if the similarity of this document to the centroid of its true class is the largest. Nevertheless, this supposition is often violated when there exists a document from class A sharing more similarity with the centroid of class B than with that of class A.

Let us take two-class text data as an example; the data distribution is illustrated in Fig. 1. Class A, shown in grey, is elliptically distributed, while class B, shown in white, is roundly distributed. C_A and C_B are the centroids of class A and class B respectively. The Middle Line is the perpendicular bisector of the line segment between C_A and C_B; from another perspective, the Middle Line serves as the decision hyperplane that separates class A and class B. Obviously, the examples of category A to the right of the Middle Line share more similarity with centroid C_B than with C_A, so they will be misclassified into class B. This is a case where the supposition of Centroid Classifier is violated by the data distribution.

In order to reduce this model bias, we make use of training errors to adjust the prototype vectors. For example, if document d of class A is misclassified into class B, both centroids C_A and C_B should be moved right by the following formulas (5) and (6) respectively:

\[
C_A = C_A + \eta\cdot d, \tag{5}
\]
\[
C_B = C_B - \eta\cdot d, \tag{6}
\]

where η denotes the "LearnRate" used to control the strength of the update. Under this move operation, C_A and C_B both move right gradually. At the end of this process (see Fig. 2), no example of class A lies to the right of the Middle Line, so no example will be misclassified.
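In code, the move operation of formulas (5) and (6) is a single pair of vector updates. The following one-function sketch (our own illustrative naming, assuming NumPy document vectors) is not the authors' implementation:

```python
import numpy as np

def online_adjust(C_true, C_rival, d, eta=0.5):
    # Formula (5): pull the true-class centroid toward the misclassified document d.
    # Formula (6): push the rival centroid away from it.
    return C_true + eta * d, C_rival - eta * d
```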

Fig. 1. The outline of original centroids.

Fig. 2. The outline of refined centroids.

4.2. Address model bias and over-train problem

However, the above adjustment approach employs only one criterion, namely training-set error. From the point of view of machine learning, a training-set-error-based method cannot guarantee the generalization capability of the base classifier on unseen examples; in other words, a low training error rate does not imply a low error rate on unseen examples. This is the so-called over-training (or over-fitting) problem.

To demonstrate this problem, we return to the two-class dataset above. Without loss of generality, we can construct the future distribution of class A and class B; the training examples are only a small portion of the unseen examples of the two classes (as illustrated in Fig. 3, where the unseen examples of class A are shown in grey and those of class B in white). After adjusting the classifier model with misclassified training examples, the Middle Line moves right to the border of class A (see Fig. 4). In this case all training examples are correctly classified, but not all unseen examples are: the unseen examples of class A to the right of the Middle Line will still be misclassified into class B. This observation indicates that training-set-error-based model update cannot guarantee the classification performance of the base classifier on unseen documents.

To improve the classification ability of the classifier on unseen examples, the Middle Line should be moved right again; that is, centroids C_A and C_B should both be moved right. To achieve this, some correctly classified examples near the Middle Line in class A should also be employed to adjust C_A and C_B. That is, for each training example d in class A, we not only require that Sim(d, C_A) be bigger than Sim(d, C_B), but also demand that Sim(d, C_A) exceed Sim(d, C_B) by a wide margin. In other words, for an example d of class A, what we need to do is maximize the "margin":

\[
\rho(d, C_A, C_B) = Sim(d, C_A) - Sim(d, C_B). \tag{7}
\]

We can generalize this formula to the multi-class setting:

\[
\rho(d, C_R, C_M) = Sim(d, C_R) - Sim(d, C_M), \tag{8}
\]

where C_R denotes the most similar centroid to d with the same label, and C_M the most similar centroid with a different label.
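The hypothesis margin of formula (8) can be read off directly from the similarity scores. A minimal sketch, assuming a `sims` array that holds Sim(d, C_i) for every class and that `r` is the index of the document's true class (the helper name is our own, not the paper's):

```python
def hypothesis_margin(sims, r):
    # sims: sequence of K similarities of one document to all centroids.
    rival = max((i for i in range(len(sims)) if i != r), key=lambda i: sims[i])
    return sims[r] - sims[rival]   # rho(d) = Sim(d, C_R) - Sim(d, C_M)
```

A document then counts as a training error when this value is negative, and as a small-margin example when it lies between 0 and the threshold θ introduced below.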

Fig. 3. The distribution of unseen examples of Class A and Class B.

Fig. 4. Refining the centroids by training examples.

Fig. 5. Refining the centroids by training examples and unseen examples.

Fig. 6. The original centroids of three categories.

It is worth noting that this definition of margin has nearly the same form as the hypothesis margin introduced by Crammer et al. (2002). For the sake of brevity, we write ρ(d) instead of ρ(d, C_R, C_M) in the rest of this paper.

To further illustrate this kind of margin, take document d in Fig. 4 as an example. Although d is correctly classified once the Middle Line has moved to the border of class A, its margin is very close to zero since it lies exactly on the Middle Line. Hence, in order to enlarge the margin, both centroids C_A and C_B should be moved right again by formulas (5) and (6). After a few such moves, the Middle Line reaches the border of the unseen examples of class A (as demonstrated in Fig. 5), and all unseen examples can be correctly categorized. This is the mechanism by which the margin further boosts the classification ability of the classifier on unseen examples.

According to formula (3), Sim(x, y) ranges from 0 to 1, so the margin ρ(d) ranges from -1 to 1. If an example's margin exceeds but is near 0, the margin is quite small and needs to be enlarged; on the other hand, if the margin approaches 1, it is very large and does not need to be increased. Accordingly, in order to concentrate on small-margin examples, we introduce a small positive margin threshold, MinMargin (denoted by θ): if the margin of example d is smaller than θ, d is employed to adjust the classifier model as a small-margin example.

Obviously, maximization of formula (8) involves two criteria: training errors and training margins. On one hand, if ρ(d) < 0, instance d is misclassified and serves as a training error; on the other hand, if 0 < ρ(d) < θ, it is correctly classified but its margin is smaller than θ, so it serves as a training-margin example.

4.3. Tackle model bias, over-train and to-and-fro movement problem

There are two ways to update the classifier model with misclassified examples and small-margin examples: on-line and by-batch. On-line update selects one misclassified or small-margin example, uses it to adjust the classifier model, and continues until termination. On-line update is very simple and easy to implement, but it often leads to to-and-fro movement of some centroids.

To illustrate this situation, we take three-class text data as an example (see Fig. 6). Class A, shown in grey, is elliptically distributed, while classes B and D, shown in white, are roundly distributed. C_A, C_B and C_D are the centroids of classes A, B and D respectively. Obviously, the examples of category A to the right of Middle Line BA or to the left of Middle Line DA will be misclassified, such as d_1, d_2, d_3, d_4 and d_5. For ease of explanation, we assume d_1 = d_2 and d_3 = d_4. As a result, the example series d_1, d_2, d_3, d_4 moves centroid C_A to and fro: C_A is first moved left by d_1, then moved back right by d_2, moved left again by d_3, and moved back right again by d_4. In a word, this example series cannot move centroid C_A at all.

To overcome this problem, we employ by-batch update to combine training-set error and training-set margin. That is, in each update we categorize all training documents and then use the misclassified and small-margin examples to adjust the corresponding centroids. The batch-update formula can be written as

\[
C_A = C_A + \eta\cdot\Bigg(\sum_{\substack{d\in c_A\\ \rho(d)<0}} d \;-\; \sum_{\substack{d\notin c_A\\ \rho(d)<0}} d \;+\; \sum_{\substack{d\in c_A\\ 0<\rho(d)<\theta}} d \;-\; \sum_{\substack{d\notin c_A\\ 0<\rho(d)<\theta}} d\Bigg), \tag{9}
\]

where the sums over d ∉ c_A range over the documents for which C_A is the most similar centroid with a different label (cf. the sets u_i^t and v_i^t in Section 4.5).

According to formula (9), the sum of d_1, d_2, d_3 and d_4 is a zero vector, so these examples cannot exert any influence on the adjustment of centroid C_A; the so-called to-and-fro movement is thus overcome. As a result, among the five misclassified documents in Fig. 6, only document d_5 exerts influence on the adjustment of centroid C_A. After a few such batch updates, as displayed in Fig. 7, both Middle Line DA and Middle Line BA are moved out of class A, and all examples can be correctly categorized.

Fig. 7. The moved centroids of three categories.

In order to balance training errors and training margins, we introduce a constant parameter "Weight" (denoted by ω). As a result, the batch-update formula can be modified as

\[
C_A = C_A + \eta\cdot\Bigg(\sum_{\substack{d\in c_A\\ \rho(d)<0}} d \;-\; \sum_{\substack{d\notin c_A\\ \rho(d)<0}} d \;+\; \omega\Big(\sum_{\substack{d\in c_A\\ 0<\rho(d)<\theta}} d \;-\; \sum_{\substack{d\notin c_A\\ 0<\rho(d)<\theta}} d\Big)\Bigg). \tag{10}
\]

For convenience, we refer to this batch-update formula as the Model-Adjustment formula.
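To make the procedure concrete, the following Python sketch implements the batch Model-Adjustment pass of formula (10) on unit-normalized TFIDF vectors. The NumPy layout, the function name `batch_model_adjust`, and the re-normalization step are our own illustrative assumptions, not the authors' code:

```python
import numpy as np

def batch_model_adjust(X, y, C, eta=0.5, theta=0.1, omega=0.2, max_iter=10):
    # X: (N, W) L2-normalized document vectors; y: (N,) labels in [0, K).
    # C: (K, W) initial class centroids per formula (2); returns refined centroids.
    K = C.shape[0]
    for _ in range(max_iter):
        Cn = C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-12)
        sims = X @ Cn.T                    # cosine similarities Sim(d, C_i)
        delta = np.zeros_like(C)
        for n in range(X.shape[0]):
            r = y[n]
            others = np.delete(np.arange(K), r)
            m = others[np.argmax(sims[n, others])]   # most similar wrong class
            margin = sims[n, r] - sims[n, m]         # rho(d), formula (8)
            if margin < 0:                           # training error
                delta[r] += X[n]; delta[m] -= X[n]
            elif margin < theta:                     # small-margin example
                delta[r] += omega * X[n]; delta[m] -= omega * X[n]
        if not delta.any():
            break                  # no errors or small-margin examples remain
        C = C + eta * delta        # batch update, formula (10)
    return C
```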


4.4. The model adjustment algorithm

After explaining the mechanism of Model Adjustment for Centroid Classifier, we present the detailed algorithm in this section. As illustrated in Fig. 8, we first load the training data and parameters (θ, ω and η), and then calculate one centroid for each category. In each iteration of the updating phase, we categorize all training documents and then use the misclassified and small-margin examples to adjust the centroids by formula (10). For the sake of brevity, we refer to the model-adjustment algorithm as MA.

Fig. 8. The outline of model adjustment for centroid classifier.

Assume that there are N training documents, T test documents, W words in total, K classes and M iteration steps. The time complexity of step 2 is O(NW + KW); since K < N, this is O(NW). Steps 3.1 and 3.2 can be done in O(NKW) and O(KW) respectively, so the running time of step 3 is O(M(NKW + KW)), i.e., O(MNKW). As a result, the training time of MA scales linearly with the number of training documents (N). Since the improved classifier still consists of K centroids, the prediction time required by MA is the same as for Centroid Classifier, i.e., O(TKW). Accordingly, MA is still a linear classifier.

4.5. The convergence analysis

Given a training set \(S = \bigcup_{i=1}^{K} S_i\), where K denotes the number of training classes and \(S_i\) denotes the training examples of class i. In the following analysis, we suppose the data is 2-norm bounded, that is, \(\forall d \in S,\ \|d\|_2 \le R\ (R > 0)\). Since the size of the training set is finite, this assumption always holds.

Definition 1. A training set S is a linearly separable problem if there exists \(\{C_1^{opt}, C_2^{opt}, \ldots, C_K^{opt}\}\) such that for all \(i \in [1, K]\),
\[
C_i^{opt}\cdot d - C_j^{opt}\cdot d \ge \gamma\ (\gamma > 0), \quad \text{where } d \in S_i,\ j \ne i.
\]

Theorem 1. With respect to a linearly separable problem, if we select an appropriate learning parameter η, the proposed method converges to the optimal solution \(\{C_i^{opt}\}\) after finite online updates.

Proof. In iteration t, assume example d (d ∈ S_A) is a misclassified or small-margin example, that is, \(C_A^t\cdot d - C_B^t\cdot d < \delta\ (0 < \delta < \gamma)\), where \(C_B^t\) denotes the most similar centroid to d with a different label. Then
\[
\begin{aligned}
\sum_{i=1}^{K}\|C_i^{t+1}-C_i^{opt}\|^2
&= \sum_{i\ne A,B}\|C_i^{t}-C_i^{opt}\|^2 + \|C_A^{t}+\eta d-C_A^{opt}\|^2 + \|C_B^{t}-\eta d-C_B^{opt}\|^2\\
&= \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2\|d\|^2 + 2\eta d\cdot(C_A^{t}-C_B^{t}) - 2\eta d\cdot(C_A^{opt}-C_B^{opt})\\
&\le \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2R^2 + 2\eta\delta - 2\eta\gamma
 = \sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 + 2\eta^2R^2 + 2\eta(\delta-\gamma).
\end{aligned}
\]
As long as we select \(\eta < (\gamma-\delta)/R^2\), we can guarantee that \(\sum_{i}\|C_i^{t+1}-C_i^{opt}\|^2 < \sum_{i}\|C_i^{t}-C_i^{opt}\|^2\); in other words, after each update the class centroids approach the optimal centroids. Furthermore, if we select an appropriate ρ such that \(0 < \rho < \eta(\gamma-\delta) - \eta^2R^2\), then
\[
\sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 < \sum_{i=1}^{K}\|C_i^{t-1}-C_i^{opt}\|^2 - 2\rho \le \sum_{i=1}^{K}\|C_i^{0}-C_i^{opt}\|^2 - 2t\rho.
\]
Obviously \(\sum_{i=1}^{K}\|C_i^{t}-C_i^{opt}\|^2 \ge 0\). Letting \(\zeta = \sum_{i=1}^{K}\|C_i^{0}-C_i^{opt}\|^2\), we get \(\zeta - 2t\rho > 0\), that is, \(t < \zeta/(2\rho)\). ∎

Lemma 1. \(\big(\sum_{i=1}^{K} a_i\big)^2 \le K\sum_{i=1}^{K} a_i^2\) when \(a_i \ge 0\).

Proof.
\[
K\sum_{i=1}^{K}a_i^2 - \Big(\sum_{i=1}^{K}a_i\Big)^2 = (K-1)\sum_{i=1}^{K}a_i^2 - 2\sum_{i<j}a_i a_j = \sum_{i<j}(a_i-a_j)^2 \ge 0. \qquad ∎
\]

Theorem 2. With respect to a linearly separable problem, the proposed method converges after finite online updates using any learning parameter η (η > 0).

Proof. In iteration t, assume example d (d ∈ S_A) is a misclassified or small-margin example, that is, \(C_A^t\cdot d - C_B^t\cdot d < \delta\ (0 < \delta < \gamma)\), where \(C_B^t\) denotes the most similar centroid to d with a different label. Then
\[
\sum_{i=1}^{K}C_i^{t+1}\cdot C_i^{opt}
= \sum_{i\ne A,B}C_i^{t}\cdot C_i^{opt} + (C_A^{t}+\eta d)\cdot C_A^{opt} + (C_B^{t}-\eta d)\cdot C_B^{opt}
= \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta d\cdot(C_A^{opt}-C_B^{opt})
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\gamma,
\]
which indicates
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} \ge \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma. \tag{11}
\]
In the same way,
\[
\sum_{i=1}^{K}\|C_i^{t+1}\|^2
= \sum_{i\ne A,B}\|C_i^{t}\|^2 + \|C_A^{t}+\eta d\|^2 + \|C_B^{t}-\eta d\|^2
= \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2\eta^2\|d\|^2 + 2\eta d\cdot(C_A^{t}-C_B^{t})
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2\eta^2R^2 + 2\eta\delta,
\]
therefore
\[
\sum_{i=1}^{K}\|C_i^{t}\|^2 \le \sum_{i=1}^{K}\|C_i^{0}\|^2 + 2t(\eta^2R^2 + \eta\delta). \tag{12}
\]
Let \(\tau = \max_i \|C_i^{opt}\|\). Then, using Lemma 1,
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt}
\le \sum_{i=1}^{K}\|C_i^{t}\|\,\|C_i^{opt}\|
\le \tau\sum_{i=1}^{K}\|C_i^{t}\|
\le \tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{t}\|^2\Big)^{1/2}.
\]
Combining this with (11) and (12), and using \(\sqrt{a+b}\le\sqrt{a}+\sqrt{b}\), we obtain
\[
\tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{0}\|^2\Big)^{1/2} + \tau\sqrt{2K(\eta^2R^2+\eta\delta)\,t}
\;\ge\; \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma.
\]
The left-hand side grows as \(\sqrt{t}\) while the right-hand side grows linearly in t, so if the above inequality holds, t must be finite. That is to say, the proposed method converges after finite online updates. ∎

Theorem 3. With respect to a linearly separable problem, the proposed method converges after finite batch updates using any learning parameter η (η > 0).

Proof. In iteration t, let \(U^t\ (|U^t| = n_t \ge 1)\) denote the set of examples whose margin is smaller than δ (0 < δ < γ). Let \(u_i^t\) denote the examples that belong to class i and whose margin is smaller than δ, and let \(v_i^t\) denote the examples whose most similar centroid with a different label is \(C_i\) and whose margin is smaller than δ. Obviously,
\[
u_i^t \cap v_i^t = \emptyset, \qquad \bigcup_i u_i^t = \bigcup_i v_i^t = U^t, \qquad \sum_i |u_i^t| = \sum_i |v_i^t| = n_t.
\]
Moreover, if \(d \in u_i^t\) and \(d \in v_j^t\ (i \ne j)\), then \(d\cdot C_i^{t} - d\cdot C_j^{t} < \delta < \gamma\). The batch update of Section 4.3 can be written as \(C_i^{t+1} = C_i^{t} + \eta\big(\sum_{d\in u_i^t} d - \sum_{d'\in v_i^t} d'\big)\). Therefore
\[
\sum_{i=1}^{K}C_i^{t+1}\cdot C_i^{opt}
= \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\!\!\sum_{d\in u_i^t \,\&\, d\in v_j^t}\!\! d\cdot(C_i^{opt}-C_j^{opt})
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + n_t\eta\gamma
\ge \sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} + \eta\gamma,
\]
which indicates
\[
\sum_{i=1}^{K}C_i^{t}\cdot C_i^{opt} \ge \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma. \tag{13}
\]
In the same way,
\[
\sum_{i=1}^{K}\|C_i^{t+1}\|^2
= \sum_{i=1}^{K}\Big\|C_i^{t}+\eta\Big(\sum_{d\in u_i^t}d - \sum_{d'\in v_i^t}d'\Big)\Big\|^2
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2n_t\eta^2R^2 + 2\eta\!\!\sum_{d\in u_i^t \,\&\, d\in v_j^t}\!\! d\cdot(C_i^{t}-C_j^{t})
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2n_t(\eta^2R^2+\eta\delta)
\le \sum_{i=1}^{K}\|C_i^{t}\|^2 + 2N(\eta^2R^2+\eta\delta),
\]
where N denotes the number of training examples. As a result,
\[
\sum_{i=1}^{K}\|C_i^{t}\|^2 \le \sum_{i=1}^{K}\|C_i^{0}\|^2 + 2tN(\eta^2R^2+\eta\delta). \tag{14}
\]
Let \(\tau = \max_i \|C_i^{opt}\|\). Similar to Theorem 2, combining (13) and (14) yields
\[
\tau\sqrt{K}\Big(\sum_{i=1}^{K}\|C_i^{0}\|^2\Big)^{1/2} + \tau\sqrt{2KN(\eta^2R^2+\eta\delta)\,t}
\;\ge\; \sum_{i=1}^{K}C_i^{0}\cdot C_i^{opt} + t\eta\gamma.
\]
Obviously, if the above inequality holds, t must be finite. That is to say, the proposed method converges after finite batch updates. ∎
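To make the convergence behaviour tangible, the following toy simulation (entirely our own construction, not an experiment from the paper) runs the batch update of Section 4.3 with ω = 1 on a small linearly separable two-class problem and reports the number of passes until no errors or small-margin examples remain; by Theorem 3 this terminates for any η > 0:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two linearly separable classes of unit vectors in the positive quadrant.
a = rng.normal([4, 1], 0.3, (50, 2)); b = rng.normal([1, 4], 0.3, (50, 2))
X = np.vstack([a, b]); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.array([0] * 50 + [1] * 50)

C = np.vstack([X[y == 0].mean(0), X[y == 1].mean(0)])  # initial centroids
eta, theta = 1.0, 0.05
for t in range(1, 1000):
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    sims = X @ Cn.T
    margin = sims[np.arange(100), y] - sims[np.arange(100), 1 - y]
    bad = margin < theta                      # errors and small-margin examples
    if not bad.any():
        print(f"converged after {t - 1} batch updates")
        break
    sign = np.where(y == 0, 1.0, -1.0) * bad  # +d pulls C_0, -d pulls C_1
    delta = (sign[:, None] * X).sum(0)
    C[0] += eta * delta                       # formula (9) for class 0
    C[1] -= eta * delta                       # mirrored update for class 1
```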

5. Empirical assessment

In this section, we conduct experiments to verify the effectiveness of the proposed method. First, we empirically compare the proposed method with other classification algorithms; then we investigate whether training error and training margin can enhance the performance of the base classifier effectively and robustly; finally, we tune the performance of the proposed method using its parameters.

5.1. Datasets

In our experiments we use four corpora: Reuter-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/), 20NewsGroup (http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/wwkb), Industry Sector (http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/) and OHSUMED (ftp://medir.ohsu.edu/pub/OHSUMED/).

Reuter-21578. The Reuters-21578 text categorization test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. We used a subset consisting of 92 categories and 10,346 documents in total.

20NewsGroup. The 20Newsgroup (20NG) dataset contains approximately 20,000 articles evenly divided among 20 Usenet newsgroups. We use a subset consisting of all 20 categories and 19,446 documents.

Industry Sector. The Industry Sector dataset is based on the data made available by Market Guide, Inc. (www.marketguide.com). The set consists of company homepages categorized in a hierarchy of industry sectors, but we disregard the hierarchy. There were 9,637 documents in the dataset, divided into 105 classes. We use a subset called Sector-48 consisting of 48 categories and 4,581 documents in all.

OHSUMED. The OHSUMED dataset (Hersh, Buckley, Leone, & Hickam, 1994) is a bibliographical document collection developed by William Hersh and colleagues at the Oregon Health Sciences University; it is a subset of the MEDLINE database. We use a subset from OHSUMED (called ohscal in Shankar & Karypis (2000); available at http://www.cs.umn.edu/han/data/tmdata.tar.gz) that contains 11,162 documents in 10 categories: Antibodies, Carcinoma, DNA, In-Vitro, Molecular-Sequence-Data, Pregnancy, Prognosis, Receptors, Risk-Factors and Tomography.

5.2. Performance measure

To evaluate the classification system, we use the F1 measure introduced by van Rijsbergen (1979). This measure combines recall and precision in the following way:

\[
\text{Recall} = \frac{\text{number of correct positive predictions}}{\text{number of positive examples}}, \qquad
\text{Precision} = \frac{\text{number of correct positive predictions}}{\text{number of positive predictions}},
\]
\[
F_1 = \frac{2\cdot\text{Recall}\cdot\text{Precision}}{\text{Recall} + \text{Precision}}.
\]

For ease of comparison, we summarize the F1 scores over the different categories using the micro- and macro-averages of the F1 scores:

\[
\text{Micro-}F_1 = F_1 \text{ computed over all categories and documents pooled}, \qquad
\text{Macro-}F_1 = \text{average of the within-category } F_1 \text{ values}.
\]

The MicroF1 and MacroF1 emphasize the performance of the system on common and rare categories respectively. Using these averages, we can observe the effect of different kinds of data on a classification system.
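As a minimal sketch of how these averages can be computed from per-category counts (the helper names and the `(tp, fp, fn)` layout are our own assumptions, not tooling from the paper):

```python
def f1(tp, fp, fn):
    # Precision = tp/(tp+fp); Recall = tp/(tp+fn); F1 is their harmonic mean.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    # counts: list of (tp, fp, fn) triples, one per category.
    micro = f1(*map(sum, zip(*counts)))                # pool counts, then score
    macro = sum(f1(*c) for c in counts) / len(counts)  # average per-category F1
    return micro, macro
```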

5.3. Experimental design

We evenly split each dataset into three parts, using two parts for training and the remaining third for testing. We perform this train-test procedure three times and report the average of the three runs as the final result (threefold cross-validation).

In order to remove redundant features and save running time, we employ Information Gain for feature selection, because it consistently performs well in most cases. The algorithms are coded in C++ and run on a Pentium 4 machine with a single 3.0 GHz CPU and 512 MB of memory.

For the SVM classifier, we employed LibSvm and BSvm (www.csie.ntu.edu.tw/cjlin/), which can directly deal with multi-class classification problems; all parameters were left at their defaults. LibSvm is a simple, easy-to-use support vector machine tool for classification, regression, and distribution estimation; we use version 2.84, released in April 2007. BSvm borrows the structure of LibSvm and adopts similar options. For the bound-constrained formulation for classification and regression, BSvm uses a decomposition method with a simple working-set selection, which leads to faster convergence on difficult cases than LibSvm; we use version 2.06, released in April 2006.

In our experiments we run Balanced Winnow, since it consistently yields better performance than Positive Winnow (van Mun, 1999).


Balanced Winnow keeps two weights for each feature l in category C_i, w⁺_il and w⁻_il. The weight values are initialized as w⁺_il = 2.0 and w⁻_il = 1.0, and the threshold is set to 1.0. The promotion parameter α and the demotion parameter β (the learning rates) were fixed at 1.2 and 0.8 respectively.

5.4. Comparison and analysis

5.4.1. Comparison with other methods

Tables 1 and 2 show the performance comparison in MicroF1 and MacroF1. The feature number is set to 10,000; for MA, MaxIteration, Weight, LearnRate and MinMargin are set to 10, 0.2, 0.5 and 0.1 respectively.

According to the two tables, MA improves the performance of Centroid Classifier dramatically, and the improvement is especially significant on Sector-48: MA improves Centroid Classifier by about 9% on Sector-48, about 7% on Reuter, about 5% on NewsGroup, and about 4% on OHSUMED. In a word, Model Adjustment is an effective and robust method to boost the performance of Centroid Classifier.

MA outperforms all the other methods on OHSUMED, Reuter and Sector-48. Notably, on Reuter the MicroF1 of MA is one percent lower than that of LibSvm (or BSvm), but its MacroF1 is 12 percent higher. On the whole, MA performs a little better than LibSvm (or BSvm), so we can say that MA is an efficient and competitive algorithm for text classification.

Table 3 reports the training time of the five methods on the four text collections. Note that the running time does not include the time for loading data from disk; the feature number and MA parameters are as above. As we can observe from this table, the CPU time required by LibSvm is about 40 times that of MA on OHSUMED and about 20 times on NewsGroup, so the time saving of MA over LibSvm is very pronounced. To the best of our knowledge, BSvm is indeed one of the fastest SVM classifiers available; however, under some conditions its speed costs considerable accuracy, e.g., on OHSUMED it reduces both MicroF1 and MacroF1 by 6 percent relative to LibSvm.

Table 1. The MicroF1 of different methods. (Bold values in the original indicate the best results.)

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     0.8049   0.7676     0.7193   0.7906   0.7342
  Reuter      0.8565   0.7820     0.8263   0.8694   0.8643
  Sector-48   0.8970   0.8055     0.8003   0.8732   0.8755
  NewsGroup   0.8892   0.8429     0.8105   0.9040   0.9020

Table 2. The MacroF1 of different methods. (Bold values in the original indicate the best results.)

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     0.7940   0.7600     0.7110   0.7800   0.7252
  Reuter      0.6061   0.5617     0.4891   0.4875   0.4880
  Sector-48   0.9000   0.8152     0.8389   0.8780   0.8791
  NewsGroup   0.8859   0.8389     0.8161   0.9029   0.9008

Table 3. Training time in seconds.

              MA       Centroid   Winnow   LibSvm   BSvm
  OHSUMED     1.39     0.40       1.72     62.28    18.32
  Reuter      18.41    0.40       7.75     80.77    33.33
  Sector-48   11.91    0.50       4.92     38.31    24.43
  NewsGroup   7.56     0.48       4.90     160.11   55.21


Despite its high speed, BSvm still consumes at least twice the CPU time of MA. In summary, these experiments show that MA offers an attractive alternative for text categorization.

5.4.2. Training error, margin and performance vs. MaxIteration

Figs. 9-11 show the training-error, training-margin and prediction-performance curves of MA vs. MaxIteration on the four datasets. Weight, LearnRate and MinMargin are set to 0.2, 0.5 and 0.1 respectively; the feature number is set to 10,000.

The first observation is that the proposed Model Adjustment can decrease training error, enlarge the margin and boost prediction performance. The three figures demonstrate that increasing MaxIteration decreases training error and increases training margin and prediction performance. However, the change in the three measures is not directly proportional to the increase in MaxIteration: as MaxIteration grows, the curves of all three measures start to level off.

The second observation is that the first updating operation achieves the biggest performance improvement. MaxIteration equal to 0 means that no updating operation is used at all, i.e., plain Centroid Classifier. From Fig. 11 we can observe that a wide improvement is achieved by running only one round of the adjustment operation over the training set.

Fig. 9. Training-error-rate curves of MA vs. Iteration.

Fig. 10. Training-margin curves of MA vs. Iteration.

Fig. 11. MicroF1 curves of MA vs. Iteration.

5.4.3. Performance vs. Weight

Fig. 12 illustrates the performance of MA with respect to the varying value of Weight, which balances training error and training margin. MaxIteration, LearnRate and MinMargin are set to 10, 0.5 and 0.1 respectively; the feature number is set to 10,000. As we can observe from this figure, all the curves peak at Weight values larger than zero: the peaks of MA on OHSUMED, Reuter, Sector-48 and NewsGroup are around 0.4, 0.2, 0.2 and 0.8 respectively. Consequently, acceptably performing values for Weight range from 0.2 to 0.8.

Fig. 12. MicroF1 curves of MA vs. Weight.

5.4.4. Performance vs. LearnRate

Fig. 13 shows the performance of MA with respect to the varying value of LearnRate, which controls the step size of the updating operation. The feature number is set to 10,000; MaxIteration, Weight and MinMargin are set to 10, 0.2 and 0.1 respectively. From this figure, the best value for OHSUMED, Reuter and NewsGroup is about 0.5, while for Sector-48 it is about 1.5. As a result, empirical values for LearnRate range from 0.5 to 1.5.

Fig. 13. MicroF1 curves of MA vs. LearnRate.

5.4.5. Performance vs. MinMargin

Fig. 14 displays the performance of MA vs. MinMargin. The feature number is set to 10,000; MaxIteration, Weight and LearnRate are set to 10, 0.2 and 0.5 respectively.

Fig. 14. MicroF1 curves of MA vs. MinMargin.

From this figure we can obtain one observation: when MinMargin = 0, which means that only the training-set-error criterion is employed to update the base classifier, MA performs worse than in any case with a positive MinMargin. This indicates that the incorporation of margin can further improve the performance of Model Adjustment for Centroid Classifier, which is in line with our analysis of margin in Section 4 and provides evidence for the rationality and feasibility of incorporating margin into Model Adjustment.

6. Concluding remarks

In this work, a novel Model Adjustment (MA) algorithm was proposed to deal with the model misfit problem of Centroid Classifier. The basic idea is to pick out certain training examples to adjust the Centroid Classifier model. The main research findings are as follows.

Firstly, in order to avoid the over-training problem, we combine two measures for Model Adjustment: training-set errors and training-set margins. That is, misclassified examples as well as small-margin examples are picked out to update the classifier model.

Secondly, in consideration of the to-and-fro movement problem caused by online update, we employ by-batch update. That is, in each update all training documents are first categorized, and then the misclassified and small-margin examples are used to adjust the corresponding centroids.

Thirdly, from the perspective of mathematics, we proved that for a linearly separable problem the proposed method converges after finite online/batch updates using any learning parameter η (η > 0).

Lastly, extensive experiments were conducted on four benchmark evaluation collections. The results show that Model Adjustment makes a significant difference to the performance of Centroid Classifier, and furthermore that margin can further improve the performance of Model Adjustment for Centroid Classifier.

We believe that this research only scratches the surface of what can be achieved with Model Adjustment. Future work will seek new techniques to enhance the performance of Model Adjustment and to apply it to other classifiers.

Acknowledgments

This work was mainly supported by two funds, i.e., 60933005 and 60803085.

References

Aas, K., & Eikvil, L. (1999). Text categorisation: A survey. Report NR 941, Norwegian Computing Center.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Crammer, K., Gilad-Bachrach, R., Navot, A., & Tishby, N. (2002). Margin analysis of the LVQ algorithm. In NIPS.
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In SIGIR (pp. 192–201).
Ishii, N., Murai, T., Yamada, T., & Bao, Y. (2006). Text classification by combining grouping, LSA and kNN. In The fifth IEEE/ACIS international conference on computer and information science (pp. 148–154).
Lertnattee, V., & Theeramunkong, T. (2002). Combining homogeneous classifiers for centroid-based text classification. In ISCC (pp. 1034–1039).
Li, R., & Hu, Y. (2003). Noise reduction to text categorization based on density for KNN. In ICMLC (pp. 3119–3124).
Liu, L., Sun, X., & Song, H. (2006). Combining fuzzy clustering with Naive Bayes augmented learning in text classification. In The first IEEE international symposium on pervasive computing and applications (pp. 168–171).
Liu, Y., Yang, Y., & Carbonell, J. (2002). Boosting to correct inductive bias in text classification. In CIKM (pp. 348–355).
Lu, M., Hu, K., Wu, Y., Lu, Y., & Zhou, L. (2002). SECTCS: Towards improving VSM and Naive Bayesian classifier. In SMC.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shankar, S., & Karypis, G. (2000). Weight adjustment schemes for a centroid based classifier. Technical report, Dept. of Computer Science, University of Minnesota.
Shin, K., Abraham, A., & Han, S. (2006). Enhanced centroid-based classification technique by filtering outliers. In TSD (pp. 159–163).
Tan, S. (2005). Neighbor-weighted K-nearest neighbor for unbalanced text corpus. Expert Systems with Applications, 28(4), 667–671.
Tan, S. (2006). An effective refinement strategy for KNN text classifier. Expert Systems with Applications, 30(2), 290–298.
Tan, S. (2008). An improved centroid classifier for text categorization. Expert Systems with Applications, 35(1–2), 279–285.
Tan, S., & Cheng, X. (2007a). An effective approach to enhance centroid classifier for text categorization. In PKDD (pp. 581–588).
Tan, S., & Cheng, X. (2007b). Using hypothesis margin to boost centroid text classifier. In SAC (pp. 398–403).
Tan, S., Cheng, X., Ghanem, M. M., Wang, B., & Xu, H. (2005). A novel refinement approach for text categorization. In CIKM (pp. 469–476).
Tan, S., Cheng, X., Wang, Y., & Xu, H. (2009). Adapting Naive Bayes to domain adaptation for sentiment analysis. In ECIR.
Tan, S., Wu, G., Tang, H., & Cheng, X. (2007). A novel scheme for domain-transfer problem in the context of sentiment analysis. In CIKM (pp. 979–982).
Tsay, J., & Wang, J. (2004). Improving linear classifier for Chinese text categorization. Information Processing and Management, 223–237.
van Mun, P. P. T. M. (1999). Text classification in information retrieval using Winnow.
van Rijsbergen, C. (1979). Information retrieval. London: Butterworths.
Wang, B., & Zhang, S. (2005). A novel text classification algorithm based on Naive Bayes and KL-divergence. In PDCAT (pp. 913–915).
Wang, Z., Sun, X., Zhang, D., & Li, X. (2006). An optimal SVM-based text classification algorithm. In ICMLC (pp. 1378–1381).
Wu, H., Phang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In SIGKDD (pp. 207–216).
Yuan, F., Yang, L., & Yu, G. (2005). Improving the k-NN and applying it to Chinese text classification. In ICMLC (pp. 1547–1553).
Zhang, B., Su, J., & Xu, X. (2006). A class-incremental learning method for multi-class support vector machines in text classification. In ICMLC (pp. 2581–2585).