Information Sciences 236 (2013) 109–125
Class-indexing-based term weighting for automatic text classification
Fuji Ren *,1, Mohammad Golam Sohrab 1
Faculty of Engineering, University of Tokushima, 2-1 Minami Josanjima, Tokushima 770-8506, Japan
Article history: Received 21 February 2012; Received in revised form 27 January 2013; Accepted 17 February 2013; Available online 27 February 2013.
Keywords: Text classification; Indexing; Term weighting; Machine learning; Feature selection; Classifier
Abstract
Most previous studies of term weighting emphasize document-indexing-based approaches and approaches based on the four fundamental information elements to address automatic text classification (ATC). In this study, we introduce class-indexing-based term weighting approaches and judge their effects in high-dimensional and comparatively low-dimensional vector spaces against TF.IDF and five other term weighting approaches that serve as baselines. First, we implement a class-indexing-based TF.IDF.ICF observational term weighting approach in which the inverse class frequency (ICF) is incorporated. In the experiment, we investigate the effects of TF.IDF.ICF over the Reuters-21578, 20 Newsgroups, and RCV1-v2 datasets as benchmark collections; it gives positive discrimination to rare terms in the vector space and is biased against frequent terms in the text classification (TC) task. Therefore, we revised the ICF function and implemented a new inverse class space density frequency (ICSdF), generating the TF.IDF.ICSdF method, which provides positive discrimination for both infrequent and frequent terms. We present a detailed evaluation of each category of the three datasets under the term weighting approaches. The experimental results show that the proposed class-indexing-based TF.IDF.ICSdF term weighting approach is promising compared with the well-known baseline term weighting approaches.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction
With the large amount of textual information available digitally, effective retrieval is difficult to accomplish without a good text-to-vector representation to support automatic text classification (ATC) [3,8,23,33]. In the vector space model (VSM) [32], the content of a text is represented as a vector in the term space. The term weight is the degree of importance of term ti in document dj. Term weighting schemes therefore play a significant role in effective text classification (TC), and an effective indexing-based term weighting approach can generate more information-rich terms and assign appropriate weighting values to them. In general, text-to-vector representation can be divided into two tasks: indexing and term weighting [14,19]. Indexing based on documents provides a meta-language for describing the document, but the information about the set of documents in a certain class ck is missing. In the TC task, analyzing document contents by traditional document-based term indexing is not enough to enhance the performance of the classification task. Accurate retrieval depends on the exactness of the document retrieval process and of the class description for a certain term ti = {t1, t2, . . . , tn} in the TC task. Most research does not show the diversity of category information for a certain class in the classification task.
* Corresponding author.
E-mail addresses: [email protected] (F. Ren), [email protected] (M.G. Sohrab).
1 Both authors contributed equally to this work.
© 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ins.2013.02.029
The motivation for exploiting the class-indexing-based term weighting method for TC can be attributed to two main properties. First, a more information-rich term weighting method with effective indexing can generate a more effective classifier. Second, there is a demand for dimensionality reduction through the inverse class space density frequency (ICSdF). A major characteristic and difficulty of TC is the high dimensionality of the feature space [10]. Since the conventional TF.IDF term weighting scheme favors rare terms and the term space consists of hundreds of thousands of rare terms, sparseness frequently occurs in document vectors. Therefore, a novel term weighting method is needed to overcome the problem of high dimensionality and to provide effective class-based indexing that enhances the classification task. Recently, many experiments have been conducted using document-indexing-based term weighting approaches to address the classification task as a statistical method [6,17,24,29–31,37,38]. TF.IDF is considered to be the most popular term weighting method for successfully performing the ATC task with document indexing [30]. Salton and Buckley [30] discussed many term weighting approaches in the information retrieval (IR) field and found that normalized TF.IDF is the best document weighting function. Therefore, this term weighting scheme is used as a standard in this study and as the default term weighting function for TC. However, the classic TF.IDF term weighting scheme has some drawbacks for the TC task. In the training vector space, the category information is constantly omitted by the document-indexing-based TF.IDF term weighting method when computing the weight of a certain term. Moreover, the inverse document frequency (IDF) function assigns the lowest score to those terms that appear in multiple documents; because of this, the TF.IDF score gives positive discrimination to rare terms and is biased against frequent terms. At present, because TF.IDF uses a default term weighting parameter in the classification task, a variety of feature selection techniques, such as information gain [18], the chi-square test, and document frequency [49], have been used to reduce the dimension of the vectors. Flora and Agus [4] performed experiments to recommend the best term weighting function for both document- and sentence-level novelty mining; the results show that the TF.IDF weighting function performs best on the TREC2003 and TREC2004 datasets. In the indexing process [29,38], two properties are of main concern: semantic term weighting [5,13,25,39] and statistical term weighting [12], where term weighting is related to a term's meaning and to the discriminative supremacy of a term that appears in a document or a group of documents, respectively. Leo et al. [25] proposed a semantic term weighting that exploits the semantics of categories and indexes terms using WordNet. In this approach, they replaced the IDF function with a semantic weight (SW), and the experimental results, based on overall system performance, showed that the proposed TF.SW scheme outperformed TF.IDF. However, the amount of training data was small, and they were not able to show an improvement in the performance of a certain category in the dataset. Nevertheless, while it is possible to address a limited number of terms in a term index by semantic term weighting, with a large number of terms in the term index it is difficult to provide the appropriate semantic knowledge of a term based on categories.
Therefore, statistical term weighting approaches are more promising for computing the score of a term. Lartnatte and Theeramunkong [42] investigated various combinations of inter-class standard deviation, class standard deviation, and standard deviation with TF.IDF, and the average results showed that TF.IDF was superior to nine of the ten term weighting methods; only one method performed better than TF.IDF, by 0.38% on average. Recently, Youngjoong and Jungyun [15] proposed a new classification algorithm with feature projection techniques based on unsupervised learning, in which TF.IDF is used as the feature weighting criterion. The classification approach is based on a voting score incorporating three different modules: the voting ratio, a modified weight, and an adopted χ2 feature selection method. In this approach, the elements are projected using a binary scheme: if the category of an element equals a certain category, the output value is 1, otherwise 0. In the second part, the category frequency, that is, the number of categories in which any two features co-occur, is computed. They then adopted the χ2 feature selection method to obtain the final voting score. The experimental results show that this approach is promising when the number of labeled training documents is up to approximately 3000; the system failed to improve performance with comparatively large numbers of labeled training documents in three different datasets. In the last few years, researchers have attempted to improve the performance of TC by exploiting statistical classification approaches [46,47] and machine learning techniques, including probabilistic Bayesian models [20,28,43], support vector machines (SVMs) [7,11,26,44], decision trees [20], Rocchio classifiers [21], and multivariate regression models [35,48]. However, a novel term weighting scheme with a good indexing technique is needed in addition to these statistical methods to truly enhance the classification task. Therefore, in this study, we have developed a novel class-indexing-based term weighting scheme that enhances the indexing process to generate more informative terms. This study makes the following contributions:
1. We propose an automatic indexing method using the combination of document-based and class (category)-based approaches.
2. The proposed class-indexing-based term weighting approach has two aspects:
   a. The TF.IDF.ICF term weighting, which favors rare terms and is biased against frequent terms.
   b. The TF.IDF.ICSdF term weighting, which gives positive discrimination to both rare and frequent terms.
3. The proposed class-indexing-based TF.IDF.ICSdF term weighting approach outperforms existing term weighting approaches.
4. The proposed TF.IDF.ICSdF approach is effective in high-dimensional and comparatively low-dimensional vector spaces.
5. The proposed class-indexing method expands the existing document-indexing method in term indexing and generates more informative terms based on a certain category through use of the inverse class frequency (ICF) and ICSdF functions.
The remainder of this study is organized as follows. Section 2 presents the proposed class-indexing-based term weighting methods. In Section 3, we elaborate on the centroid, Naïve Bayes, and SVM classifiers. Section 4 presents the performance evaluation results. Finally, we conclude the study in Section 5.
2. Indexing: combining document and class oriented indexing
In information retrieval and automatic text processing, the construction of effective indexing vocabularies has always been considered the single most important step [34]. Therefore, an information-rich term weighting method with effective indexing can generate a more effective classifier.
2.1. Conventional approach: document-indexing-based term weighting scheme
The most widely used traditional document-indexing-based TF.IDF term weighting approach for ATC has two subtasks. First is the term-frequency-based analysis of the text contents, and second is the determination of terms that are in the document space dj. In addition, measuring term specificity was first proposed by Sparck Jones [37], and it later became known as the IDF. The reasoning is that a term with a high document frequency (DF) count may not be a good discriminator, and it should be given less weight. Therefore, IDF means that a term that occurs in many documents is not a good document discriminator. The IDF function assigns the lowest score to those terms that appear in multiple documents in a document space D = d1, d2, . . . , dn. To quantitatively judge the similarity between a pair of documents, a weighting method is needed to determine the significance of each term in differentiating one document from another. The common term weighting method is TF.IDF [30], as defined in Eq. (1):
W_{TF.IDF}(t_i, d_j) = tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right).    (1)

Its cosine normalization is denoted as

W^{norm}_{TF.IDF}(t_i, d_j) = \frac{tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right)}{\sqrt{\sum_{t_i \in d_j} \left[ tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \right]^2}},    (2)

where D denotes the total number of documents in the collection, tf_{(t_i,d_j)} is the number of occurrences of term t_i in document d_j, d(t_i) is the number of documents in the collection in which term t_i occurs at least once, d(t_i)/D is referred to as the DF, and D/d(t_i) is the IDF of term t_i.
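For concreteness, a minimal Python sketch of Eqs. (1) and (2) follows; the toy corpus and variable names are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def tf_idf_vector(doc_terms, doc_freq, num_docs):
    """Cosine-normalized TF.IDF vector of one document, following Eqs. (1)-(2)."""
    tf = Counter(doc_terms)                                                   # tf(t_i, d_j)
    raw = {t: tf[t] * (1.0 + math.log(num_docs / doc_freq[t])) for t in tf}   # Eq. (1)
    norm = math.sqrt(sum(w * w for w in raw.values()))                        # denominator of Eq. (2)
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw

# Toy usage: d(t_i) is counted once per document over the whole collection.
docs = [["oil", "price", "rise"], ["oil", "export"], ["election", "vote"]]
df = Counter(t for d in docs for t in set(d))
print(tf_idf_vector(docs[0], df, len(docs)))
```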
2.2. Proposed approach: class-indexing-based term weighting scheme
Of greater interest is that term and document frequencies, which depend only on the occurrence characteristics of terms in documents, are not enough to enhance the classification task. In terms of class-oriented indexing, a subset of documents from the document space D = d1, d2, . . . , dn is allocated to a certain class in order to create a VSM in the training procedure. In the traditional TF.IDF method, a list of identical documents is linked with a single term, and the IDF function provides the lowest score for that term. However, it is not obvious that in the classification task the IDF function should provide the lowest score for such terms: the identical documents linked with a certain term ti might be a sub-part of a certain category ck. Therefore, it is important to explore the occurrence characteristics of terms in both the document space D = {d1, d2, . . . , dn} and the class space. In class-oriented indexing, a subset of documents from the global document space is allocated to a certain class ck (k = 1, 2, . . . , m) according to their topics in order to create a boundary-line vector space in the training procedure. Therefore, the class space is defined as C = {(d11, d12, . . . , d1n) ∈ C1, (d21, d22, . . . , d2n) ∈ C2, . . . , (dm1, dm2, . . . , dmn) ∈ Cm}, where a set of documents with the same topics is assigned to a certain class ck in the training procedure. In the class space, each category has its own domain, and the domain size differs from that of other categories depending on the number of documents the domain contains. For this reason, the knowledge of the set of relevant documents containing a certain term ti in a certain class ck is used to determine the category mapping in the training procedure. A simple overview of the proposed class-indexing is shown in Fig. 1, where the proposed class-based indexing incorporates a term, a document, and a class index. The nodes of the term, document, and class indexes contain two fields: data and link. When the dataset is provided as input to assign scores to lexical terms, the data field of the term index contains a term and three outbound link fields: the first link points to the node containing the next term, and the second and third links point to the class and document indexes to find the relevant classes and relevant documents, respectively, that the term falls into. The data field of the document index contains a document, and the corresponding link field points to the node containing the next document. Therefore, from the combination of the term and document indexes, or of the term, document, and class indexes, we can generate the TF.IDF or the TF.IDF.ICF term weighting scheme, respectively. In contrast, the data field of the class index contains class information and two link fields: the first link points to the node containing the next class, and the second outbound link points to the document index to identify the set of documents in which a certain term falls within a certain class, which we later call the class space density frequency.
Fig. 1. Class-oriented-indexing: the formulation of TF.IDF, TF.IDF.ICF, and TF.IDF.ICSdF.
The effectiveness of this indexing is that the lexical terms obtain prior knowledge from the class index in the training mode to create the vector space model. Using the class-indexing-based term weighting method, more informative terms can be generated by examining not only the occurrence of a term in a certain class but also the occurrence of the set of relevant documents in that class. From Fig. 1 and the above discussion it is noticeable that the proposed class-oriented indexing requires more space to store additional data, and its computational cost in the training formulation is higher than that of document-indexing. In the class-indexing-based term weighting method, terms are examined for frequency to compute their occurrence in a certain class. They are also used to compute the occurrence of the set of relevant documents that are clustered [2] for a certain term ti in a certain class ck. Two factors are considered for category mapping using the class-indexing-based term weighting scheme: (1) class-frequency (CF)-based category mapping of each term and (2) class-space-density-based category mapping of each term.
2.2.1. Class-frequency-based category mapping of each term: TF.IDF.ICF term weighting scheme
In the classic document-indexing-based VSM, the numeric representations of text in document vectors are products of local (term frequency) and global (IDF) parameters, that is, TF.IDF. In the class-frequency-based category mapping term weighting scheme, the ICF is multiplied by TF.IDF, generating TF.IDF.ICF. In this method, the existence of a category is indicated by assigning the numerical value 1 when a term is relevant for a certain category ck, and 0 when it is not. The TF.IDF.ICF term weighting scheme introduces the concept of category mapping using the ICF function. In ICF, a term that occurs in many categories is not a good discriminator. Consequently, the ICF function gives the lowest score to those terms that appear in multiple classes in the class space C = c1, c2, . . . , cm. Therefore, the numeric representation of a term is the product of term frequency (local parameter), IDF (global parameter), and ICF (categorical global parameter), represented as
W_{TF.IDF.ICF}(t_i, d_j, c_k) = tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{c(t_i)}\right).    (3)

Its cosine normalization is denoted as

W^{norm}_{TF.IDF.ICF}(t_i, d_j, c_k) = \frac{tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{c(t_i)}\right)}{\sqrt{\sum_{t_i \in d_j,\, t_i \in c_k} \left[ tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{c(t_i)}\right) \right]^2}},    (4)
where C denotes the total number of predefined categories in the collection, c(t_i) is the number of categories in the collection in which term t_i occurs at least once, c(t_i)/C is referred to as the CF, and C/c(t_i) is the ICF of term t_i. The TF.IDF.ICF approach was our primary motivation to revise the document-indexing-based term weighting approach and move toward a new class-indexing-based term weighting approach. For a valid term, the evidence of clustered relevant documents, where documents are grouped into a certain category ck for a certain term ti, is missing in this method. Such a method gives positive discrimination to rare terms and is biased against frequent terms, because the IDF and ICF functions are both incorporated with TF, and both assign the lowest score to those terms that appear in multiple documents in the document space and in multiple categories in the class space, respectively. Therefore, we redesigned the ICF and implemented a new inverse class space density frequency (ICSdF) method that provides positive discrimination for both rare and frequent terms.
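As a quick illustration of Eq. (3), the sketch below shows how the ICF factor modifies a TF.IDF weight; the data structures (a mapping from each term to the set of classes it occurs in) are illustrative assumptions rather than the paper's implementation.

```python
import math

def icf_factor(term, classes_of_term, num_classes):
    """Categorical global factor of Eq. (3): 1 + log(C / c(t_i))."""
    c_ti = len(classes_of_term[term])       # c(t_i): classes in which the term occurs at least once
    return 1.0 + math.log(num_classes / c_ti)

def tf_idf_icf(tf, term, doc_freq, num_docs, classes_of_term, num_classes):
    """Unnormalized TF.IDF.ICF weight of Eq. (3)."""
    idf = 1.0 + math.log(num_docs / doc_freq[term])
    return tf * idf * icf_factor(term, classes_of_term, num_classes)
```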
2.2.2. Class-space-density-based category mapping of each term: TF.IDF.ICSdF term weighting scheme
Both the TF.IDF and TF.IDF.ICF term weighting schemes emphasize rare terms, favoring terms that appear in only a few documents (through the IDF function) and in only a few classes (through the ICF function), respectively. In the class-space-density-based category mapping of each term, the inverse class space density frequency (ICSdF) is multiplied by TF.IDF to generate TF.IDF.ICSdF. Because the ICF function gives the lowest score to those terms that appear in multiple classes without any prior knowledge of the class space, a new class-space-density-frequency-based category mapping is proposed. It should be possible to improve the classification performance by incorporating the prior knowledge of a certain term in a certain category in the training procedure, in particular by determining the number of outbound document links for a certain term ti through a certain category ck. More specifically, the weight of each term depends not only on its occurrence characteristics in documents and classes but also on its class density (Cd), through which a set of relevant documents is linked with a certain term ti.
2.2.2.1. Determining the category mapping by class density of each term
In a VSM, documents are usually represented as vectors in n-dimensional space by describing each document dj with a numerical feature vector. For a certain term ti, the class weight is first assigned the numerical value 1 or 0 for overlapping and non-overlapping terms in a certain class ck, respectively:
W_{c_k}(t_i) = \begin{cases} 1, & \text{if the term appears in a certain class, i.e., } t_i \in |c_k| \\ 0, & \text{otherwise.} \end{cases}    (5)
If a certain term ti falls into a certain class ck, the next computational step calculates the total number of outbound document links through that class ck. The outbound document links represent a collection of tightly clustered documents for a certain term ti in a certain class. Thus, the class density Cd is defined as the rate of documents in the category ck that include the term ti. In mathematical representation,
C_d(t_i) = \frac{n_{c_k}(t_i)}{N_{c_k}},    (6)
where n_{c_k}(t_i) denotes the number of documents that include the term t_i and are members of the category c_k, and N_{c_k} denotes the total number of documents in a certain category c_k.
2.2.2.2. Determining the class space density of each term
In the computational process, the class space density (CSd) is the sum of the outbound class links for a certain term ti, where each class link is measured by its class density. In mathematical representation,
CS_d(t_i) = \sum_{c_k} C_d(t_i).    (7)
Therefore, the inverse class space density frequency is denoted as
ICS_dF(t_i) = \log \frac{C}{CS_d(t_i)}.    (8)
The numeric representation of a term is the product of term frequency (local parameter), IDF (global parameter), and inverse class space density frequency (ICSdF), which represents the combination of categorical local and global parameters. Therefore, the proposed term weighting scheme TF.IDF.ICSdF for a certain term ti in document dj with respect to category ck is defined as
W_{TF.IDF.ICS_dF}(t_i, d_j, c_k) = tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{CS_d(t_i)}\right).    (9)
Its cosine normalization is denoted as
W^{norm}_{TF.IDF.ICS_dF}(t_i, d_j, c_k) = \frac{tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{CS_d(t_i)}\right)}{\sqrt{\sum_{t_i \in d_j,\, t_i \in c_k} \left[ tf_{(t_i,d_j)} \left(1 + \log \frac{D}{d(t_i)}\right) \left(1 + \log \frac{C}{CS_d(t_i)}\right) \right]^2}},    (10)

where CS_d(t_i)/C is referred to as the class space density frequency (CSdF) and C/CS_d(t_i) is the inverse class space density frequency (ICSdF) of term t_i.
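A compact sketch of Eqs. (6)-(9) is given below; the docs_by_class mapping (each class to its training documents as term lists) is an assumed, illustrative data structure, and smoothing or the base of the logarithm is not prescribed by the text.

```python
import math

def class_space_density(term, docs_by_class):
    """CS_d(t_i) of Eq. (7): sum over classes of the rate of class documents containing the term (Eq. (6))."""
    return sum(
        sum(term in doc for doc in docs) / len(docs)   # C_d(t_i) for one class c_k
        for docs in docs_by_class.values()
    )

def tf_idf_icsdf(tf, term, doc_freq, num_docs, docs_by_class):
    """Unnormalized TF.IDF.ICSdF weight of Eq. (9); assumes the term occurs in at least one class."""
    C = len(docs_by_class)
    idf = 1.0 + math.log(num_docs / doc_freq[term])
    icsdf = 1.0 + math.log(C / class_space_density(term, docs_by_class))
    return tf * idf * icsdf
```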
2.3. An example: term representation in class-space-density-frequency indexing
We use an example to judge the effectiveness of CSdF in the vector space. Assume that a dataset consists of two categories C1 and C2, and that a term t1 appears in both. C1 and C2 each contain four documents in the balanced corpus distribution of Fig. 2a, and five and three documents, respectively, in the unbalanced corpus distribution of Fig. 2b. If term t1 appears in any category, the weight of t1 is assigned a binary value, 0 or 1, from Level1 to Level2:
Tag_{Level1 \to Level2} = \begin{cases} 1, & \text{if the term appears in a certain class, i.e., } t_i \in |c_k| \\ 0, & \text{otherwise.} \end{cases}
In the next computational step, from Level2 to Level3, the weight of t1 is 1/n when it appears in relevant documents in a certain class domain and zero otherwise, where n is the total number of documents in that class:
Tag_{Level2 \to Level3} = \begin{cases} \frac{1}{n}, & \text{if the term appears in a certain class, i.e., } t_i \in |c_k| \text{ and } t_i \in |d_j| \\ 0, & \text{otherwise.} \end{cases}
The dotted rectangular area in Fig. 2 indicates the documents that are relevant for term t1. Considering Fig. 2a, the class density of term t1 in category C1 is C_d(t1) = n_{c1}(t1)/N_{c1} = 3/4 = 0.75, in C2 it is C_d(t1) = n_{c2}(t1)/N_{c2} = 2/4 = 0.50, and CS_d(t1) = \sum_{c_k} C_d(t1) = 0.75 + 0.50 = 1.25. Similarly, in Fig. 2b for the unbalanced corpus distribution, the class density of term t1 in category C1 is C_d(t1) = n_{c1}(t1)/N_{c1} = 3/5 = 0.60, in C2 it is C_d(t1) = n_{c2}(t1)/N_{c2} = 2/3 = 0.67, and CS_d(t1) = \sum_{c_k} C_d(t1) = 0.60 + 0.67 = 1.27. From this computation, it is noticeable that the class density score of category C1 is higher than that of C2 in the balanced corpus distribution, while that of C2 is higher than that of C1 in the unbalanced one. Therefore, in the training vector space, when we compute the average score of a certain term ti in a certain category ck, the class density will produce the greatest possible separation between the categories, because it makes use of the categories to explore the outbound document links through a certain category for a certain term. In contrast, since the term t1 appears in two categories in Fig. 2, the ICF score will be log(2/2) = 0 without exploring the categories C1 and C2, because the ICF only conveys whether a term belongs to a category or not. In other words, assume that t1 appears in almost all documents of category 1 and in only one document of category 2. From this information, it is comprehensible that the term t1 belongs to category 1, but according to the ICF function, because t1 appears in two categories, the ICF score will be zero and fails to provide the greatest possible separation between the categories. Thus, the ICF function provides positive discrimination for rare terms that may not be included in a certain category domain. It is noticeable that the ICSdF function provides a judgment score that is biased against neither frequent nor infrequent terms.
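The example's numbers can be checked with a few lines of Python; the per-class counts below are read off the balanced case of Fig. 2a, and the natural logarithm is an illustrative choice, since the text does not fix the base.

```python
import math

# Balanced case of Fig. 2a: t1 occurs in 3 of 4 documents of C1 and 2 of 4 documents of C2.
n_c = {"C1": 3, "C2": 2}     # n_ck(t1): relevant documents per class
N_c = {"C1": 4, "C2": 4}     # N_ck: total documents per class

Cd = {c: n_c[c] / N_c[c] for c in n_c}     # Eq. (6): {'C1': 0.75, 'C2': 0.5}
CSd = sum(Cd.values())                     # Eq. (7): 1.25
icsdf = math.log(len(n_c) / CSd)           # Eq. (8): log(2 / 1.25) > 0
icf = math.log(len(n_c) / 2)               # ICF: t1 occurs in both classes, so log(2/2) = 0
print(Cd, CSd, icsdf, icf)
```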
Fig. 2. Term representation in class-space-density-frequency indexing: (a) balanced corpus distribution and (b) unbalanced corpus distribution.
Table 1. Contingency table of prediction for a category ck.

              ck      not ck
  ti           A       B
  not ti       C       D

Note: A denotes the # of documents belonging to category ck where the term ti occurs at least once; B denotes the # of documents not belonging to category ck where the term ti occurs at least once; C denotes the # of documents belonging to category ck where the term ti does not occur; and D denotes the # of documents not belonging to category ck where the term ti does not occur.
Table 2. Term weighting functions to be compared.

  Weighting method              Name       Expression
  TF.Correlation coefficient    CC         tf_{(t_i,d_j)} \cdot \sqrt{N}(AD - BC) / \sqrt{(A+C)(B+D)(A+B)(C+D)}
  TF.Mutual information         MI         tf_{(t_i,d_j)} \cdot \log(AN / ((A+B)(A+C)))
  TF.Odds ratio                 OR         tf_{(t_i,d_j)} \cdot \log(AD / BC)
  TF.Probability based          ProbBa.    tf_{(t_i,d_j)} \cdot \log(1 + (A/B)(A/C))
  TF.Relevance frequency        RF         tf_{(t_i,d_j)} \cdot \log(2 + A/C)
  TF.IDF                        TF.IDF     tf_{(t_i,d_j)} \cdot (1 + \log(D/d(t_i)))
Recently, it has been reported that term weighting methods such as relevance frequency [16,24], the probability-based approach [24], mutual information [24,33], odds ratio [24,33], and the correlation coefficient [24] can significantly improve the performance of ATC. Therefore, to compare with other weighting approaches, we implement these term weighting methods based on the four fundamental information elements defined in Table 1; Table 2 shows the mathematical expressions of these weighting schemes.
3. Classifier
In the machine learning workbench, classifiers [46,50] such as the centroid-based classifier, support vector machines (SVMs), and Naïve Bayes (NB) have achieved great success in TC. In order to evaluate the effects of the proposed term weighting approaches over the existing weighting methods, these three classification schemes are used on the three datasets.
3.1. Centroid classifier
In this study, for simplicity and linearity, we implement a centroid classifier [9,40,41], a commonly used method, to judge the effectiveness of the proposed class-indexing-based term weighting approaches and to compare them with the conventional TF.IDF term weighting approach. In the VSM, each document dj is considered to be a vector in the term space. To calculate the centroid of a certain class ck, add the training-set document vectors dj (j = 1, 2, . . . , n) in the class ck (k = 1, 2, . . . , m):
Sum centroid: C_k^{sum} = \sum_{d_j \in c_k} d_j.    (11)
The normalized version of C_k^{sum} is as follows:

Normalized centroid: C_k^{norm} = \frac{C_k^{sum}}{\|C_k^{sum}\|_2},    (12)

where \|C_k^{sum}\|_2 is the 2-norm of the vector C_k^{sum}. Next, calculate the similarity between a query document and each normalized centroid class vector by the inner-product measure:

sim(d_j, c_k) = d_j \cdot C_k^{norm}.    (13)

Consequently, the test vector (or query vector) d_j is assigned to the class label c_k whose class prototype vector is most similar to the query vector in performing the classification task:

L(C_k) = \arg\max_{c_k \in C} \; d_j \cdot C_k^{norm}.    (14)
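A minimal dense-matrix sketch of Eqs. (11)-(14) follows; X is assumed to be a (documents × terms) numpy array of term weights and y a numpy array of class labels, both illustrative names rather than the paper's implementation.

```python
import numpy as np

def train_centroids(X, y):
    """Sum the training vectors of each class and L2-normalize them (Eqs. (11)-(12))."""
    centroids = {}
    for label in np.unique(y):
        c_sum = X[y == label].sum(axis=0)                  # Eq. (11)
        centroids[label] = c_sum / np.linalg.norm(c_sum)   # Eq. (12)
    return centroids

def classify(x, centroids):
    """Assign the query vector to the class with the most similar centroid (Eqs. (13)-(14))."""
    return max(centroids, key=lambda label: float(np.dot(x, centroids[label])))
```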
3.2. Naïve Bayes classifier
The Naïve Bayes (NB) classifier [50] is one of the oldest formal classification algorithms and is widely used in the field of ATC. The Bayesian model is based on prior and posterior probabilities: finding the probability of a certain document type dj ∈ C can only be based on the observation ti. The conditional probability P(C|ti) can be written according to Bayes' rule:
P(C|t_i) = \frac{P(t_i|C)\, P(C)}{P(t_i)}.    (15)
Since the denominator does not depend on the category, we can omit the probability P(ti). The probability P(ti|C) can be estimated as
P(t_i|C_k) = \prod_{i=1}^{m} P(t_i|C_k).    (16)
By assuming that each term follows a normal probability distribution with mean μ and standard deviation σ in each category c, Eq. (16) can be written as

P(t_i|C) = \prod_{i=1}^{m} P(t_i; \mu_{i,c}, \sigma_{i,c}) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma_{i,c}^2}} \, e^{-\frac{(t_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}}.    (17)
Given this probability for a single term, the logarithm of the probability for all m terms in the data is
\ln P(T_i|C_k) = \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma_{i,c}^2}} \, e^{-\frac{(t_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}} = -\frac{m}{2}\ln(2\pi) - \frac{1}{2}\sum_{i}^{m} \ln \sigma_{i,c}^2 - \frac{1}{2}\sum_{i}^{m} \frac{(t_i - \mu_{i,c})^2}{\sigma_{i,c}^2}.    (18)
The prior probability P(C) of a certain class is estimated using a Laplacean prior as

P(C) = \frac{1 + N_{t,c}}{D + N_c},    (19)

where N_{t,c} is the number of documents in a certain class c_k, D is the total number of documents, and N_c is the total number of classes.
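A numpy sketch of the Gaussian NB decision rule built from Eqs. (15)-(19) is shown below; the small variance floor and the dense X, y arrays are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np

def train_gaussian_nb(X, y):
    """Per-class mean and std of each term weight, plus the Laplacean prior of Eq. (19)."""
    D, classes = len(X), np.unique(y)
    params = {}
    for label in classes:
        Xc = X[y == label]
        mu, sigma = Xc.mean(axis=0), Xc.std(axis=0) + 1e-9   # floor avoids division by zero
        prior = (1 + len(Xc)) / (D + len(classes))           # Eq. (19)
        params[label] = (mu, sigma, prior)
    return params

def log_posterior(x, mu, sigma, prior):
    """Log of Eq. (15) with the Gaussian log-likelihood of Eqs. (17)-(18); P(t_i) is dropped."""
    loglik = -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + ((x - mu) / sigma) ** 2)
    return loglik + np.log(prior)

def classify(x, params):
    return max(params, key=lambda c: log_posterior(x, *params[c]))
```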
3.3. Support vector machines
Among machine learning approaches, support vector machines (SVMs) are considered one of the most robust and accurate methods among all well-known algorithms [45]. Therefore, as a third learning classifier, the SVM-based classification toolbox SVM_Light2 (Joachims, 1998, 1999, 2002) is used in this experiment. All parameters were left at their default values; the parameter c was set to 1.0, which is the default setting of this toolbox.
4. Evaluations
In this section, we provide empirical evidence for the effectiveness of the proposed term weighting approaches over six different term weighting approaches. The experimental results show that the proposed term weighting approaches consistently outperform the conventional approaches both in the high-dimensional vector space consisting of rare terms and in the comparatively low-dimensional vector space, where rare terms are reduced using a threshold setting for each category of the training dataset.
4.1. Experimental datasets
To evaluate the performance of the proposed term weighting approaches against the existing baseline approaches, we conducted our experiments using Reuters-21578,3 20 Newsgroups,4 and RCV1-v2/LYRL2004,5 which are widely used benchmark collections in the classification task.
4.1.1. Reuters-21578 dataset
In Reuters-21578, the ten top-sized categories of the Apte split, which divides the data into a test set and a training set, are adopted. Because the system is evaluated with 10-fold cross validation, we merged the training and testing documents together. The Reuters-21578 corpus contains 9976 documents.
4.1.2. 20 Newsgroups dataset
The second dataset used in this experiment is 20 Newsgroups, a popular dataset for machine learning techniques such as TC and text clustering. It contains approximately 18,828 news articles across 20 different newsgroups. For convenience, we call the 20 categories Atheism, CompGraphics, CompOsMsWindowsMisc, CompSysIbmPcHardware, CompSysMacHardware, CompWindowsx, MiscForsale, RecAutos, RecMotorcycles, RecSportBaseBall, RecSportHockey, SciCrypt, SciElectronics, SciMed, SciSpace, SocReligionChristian, TalkPoliticsGuns, TalkPoliticsMideast, TalkPoliticsMisc, and TalkReligionMisc as "Ath," "CGra," "CMWM," "CSIPH," "CSMH," "CWin," "MFor," "RAuto," "RMot," "RSB," "RSH," "SCry," "SEle," "SMed," "SSpa," "SRChr," "TPG," "TPMid," "TPMisc," and "TRMi," respectively. We employ the commonly used 10-fold cross-validation technique, in which the Reuters-21578 and 20 Newsgroups datasets are randomly divided into 10 folds. In each turn, one data fold is used for testing and the remaining folds are used for
2 Available at http://svmlight.joachims.org/svm_multiclass.html.
3 Available at http://disi.unitn.it/moschitti/corpora.htm.
4 Available at http://people.csail.mit.edu/jrennie/20Newsgroups/.
5 Available at http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
Table 3. One sample dataset fold from 10-fold cross validation (# training / # testing documents per category).

Reuters-21578: Acq 2129/236; Corn 214/23; Crude 521/57; Earn 3568/396; Grain 524/58; Interest 431/47; Money 646/71; Ship 258/28; Trade 438/48; Wheat 255/28.

20 Newsgroups: Ath 719/80; CGra 876/97; CWin 882/98; CSIPH 883/99; CSMH 865/96; CMWM 888/98; MFor 875/97; RAuto 891/99; RMot 895/99; RSB 895/99; RSH 900/99; SCry 892/99; SEle 883/98; SMed 891/99; SSpa 888/99; SRChr 898/99; TPG 819/91; TPMid 846/94; TPMi 698/77; TrMi 565/63.
training. Table 3 shows only one of the possible sample dataset folds for Reuters-21578 and 20 Newsgroups.
4.1.3. RCV1-v2 dataset/LYRL2004
From the RCV1 [22] dataset, RCV1-v2/LYRL2004 is adopted; it contains a total of 804,414 documents with 103 categories drawn from four parent topics. As single-label classification is the concern of this study, we extract all documents that are labeled at least once, and we found that only approximately 23,000 documents out of 804,414 are labeled at least once. Besides this, when a document is labeled with two categories, a parent and a child category, we removed the parent category and assigned the child category in order to produce a single-label classification for each document. From RCV1-v2/LYRL2004, a single topic is assigned to a total of 229,221 documents, which fall into 54 different categories. We keep the same split: the first 23,149 documents are used for training and the remaining 206,072 documents for testing. Table 4 shows the description of the training and testing split over the RCV1-v2 dataset.
4.2. Feature selection by threshold setting
This study emphasizes a novel class-indexing-based term weighting scheme to enhance the classification task. Therefore, we do not adopt feature selection techniques such as information gain, mutual information, the chi-square test, or document frequency as feature selection criteria [1,2,26,27,36]. Nevertheless, we first normalize all the documents in the document space using several preprocessing steps, including converting uppercase letters in a token to lowercase, removing punctuation, eliminating words on the smart stop word list of 571 words, and reducing inflected words to their root form using stemming. After normalizing the documents, we perform a local feature (term) selection by term frequency in a certain category (TFC). In this approach, we rank and assess all features based on their appearance in a certain category ck and arrange them in descending order. In this way, we select only some of the features and remove the others by setting common thresholds for each category in the two datasets, as sketched below. The idea behind using a threshold is to create a high-dimensional vector space (including numerous rare terms) and a comparatively low-dimensional vector space (removing rare terms) in order to judge the performance of the proposed TF.IDF.ICF and TF.IDF.ICSdF approaches against the traditional TF.IDF approach in both settings. With these thresholds, we conducted experiments in two different sessions: one in which the high-dimensional vector space is considered and one in which the comparatively low-dimensional vector space is considered. The high-dimensional and low-dimensional vector spaces are generated by setting the thresholds q = 3 and q = 10, respectively, in each category of the Reuters-21578 and 20 Newsgroups datasets. In the high-dimensional vector space, the threshold q = 3 eliminates all the terms that appear no more than two times in a certain class ck of the two datasets; we then investigate the effects of the proposed and traditional term weighting schemes in this vector space with the centroid classifier. In contrast, the threshold q = 10 is assigned to the comparatively low-dimensional vector space, and it eliminates all the terms that appear no more than nine times in a certain category of the datasets.
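The per-category frequency filtering just described can be sketched as follows; docs_by_class (mapping each category to its training documents as token lists) and the union of the per-category keep-sets are illustrative assumptions about the implementation.

```python
from collections import Counter

def select_terms(docs_by_class, rho):
    """Keep terms whose frequency within some category reaches the threshold rho (TFC selection)."""
    kept = set()
    for docs in docs_by_class.values():
        tfc = Counter(t for doc in docs for t in doc)    # term frequency in this category
        kept.update(t for t, freq in tfc.items() if freq >= rho)
    return kept

# rho = 3 yields the high-dimensional space; rho = 10 the comparatively low-dimensional one.
```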
Setting this threshold likewise judges the effects of the proposed and traditional term weighting schemes in this vector space with the centroid classifier. With the thresholds of q = 3 and q = 10 set in the Reuters-21578 and 20 Newsgroups datasets, the total number of unique terms in the term space Ti = {t1, t2, . . . , tn} decreased from 8974 to 8286 and from 19,634 to 12,377, respectively. Thus, by setting the thresholds we manage the vector space so as to eliminate a certain level of infrequent terms, in order to judge the effectiveness of the proposed and traditional term weighting approaches in the high-dimensional and comparatively low-dimensional vector spaces.
4.3. Performance measurement
The standard methods used to judge the performance of a classification task are precision, recall, and the F1 measure [10,47]. These measures are defined based on a contingency table of predictions for a target category ck. The precision P(Ck), recall R(Ck), and F1 measure F1(Ck) are defined as follows:
P(C_k) = \frac{TP(C_k)}{TP(C_k) + FP(C_k)}    (20)

R(C_k) = \frac{TP(C_k)}{TP(C_k) + FN(C_k)}    (21)

F_1(C_k) = \frac{2\, P(C_k)\, R(C_k)}{P(C_k) + R(C_k)} = \frac{2\, TP(C_k)}{2\, TP(C_k) + FP(C_k) + FN(C_k)}    (22)
where TP(Ck) is the set of test documents correctly classified to the category Ck, FP(Ck) is the set of test documents incorrectly classified to the category, FN(Ck) is the set of test documents wrongly rejected, and TN(Ck) is the set of test documents correctly rejected. To compute the average performance, we used the macro-average, the micro-average, and the overall accuracy. The macro-averages of precision (P^M), recall (R^M), and the F1 measure (F_1^M) over the class space are computed as
P^M = \frac{1}{m} \sum_{k=1}^{m} P(C_k)    (23)

R^M = \frac{1}{m} \sum_{k=1}^{m} R(C_k)    (24)

F_1^M = \frac{1}{m} \sum_{k=1}^{m} F_1(C_k)    (25)
The micro-averages of precision (P^\mu), recall (R^\mu), and the F1 measure (F_1^\mu) over the class space are computed as
P^\mu = \frac{\sum_{k=1}^{m} TP(C_k)}{\sum_{k=1}^{m} \left( TP(C_k) + FP(C_k) \right)}    (26)

R^\mu = \frac{\sum_{k=1}^{m} TP(C_k)}{\sum_{k=1}^{m} \left( TP(C_k) + FN(C_k) \right)}    (27)

F_1^\mu = \frac{2\, P^\mu R^\mu}{P^\mu + R^\mu}    (28)
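As a small sketch, the per-category and averaged measures of Eqs. (20)-(28) can be computed as below; the counts argument (a list of (TP, FP, FN) tuples per category) is an illustrative interface rather than the authors' code.

```python
def prf(tp, fp, fn):
    """Per-category precision, recall, and F1 of Eqs. (20)-(22)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return p, r, f1

def macro_micro_f1(counts):
    """counts: list of (TP, FP, FN) per category; returns macro-F1 (Eq. (25)) and micro-F1 (Eq. (28))."""
    macro = sum(prf(*c)[2] for c in counts) / len(counts)
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    micro = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return macro, micro
```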
4.4. Experimental results
In this study, we designed two experimental environments, with high-dimensional and comparatively low-dimensional vector spaces, to evaluate the performance of the proposed TF.IDF.ICSdF and TF.IDF.ICF approaches and compared them with six different weighting approaches. The reason for choosing two environments is to examine the performance of the proposed term weighting schemes under both conditions. Tables 5-10 present the results achieved, where the number in bold indicates the best result for each category as well as the best macro-F1 over all categories.
Table 4. RCV1-v2 dataset training and testing split (category, # of training data, # of testing data).
729 5839 C23
# Of training data # Of testing data
# Of training data # Of testing data
# Of training data # Of testing data
# Of training data # Of testing data
11 68
C13
C14
317 2263 C31
891 6711
1112 10,842
E11
E12
E13
453 3438
313 2409
ECAT
G15
GCAT
GCRIM
173 1113
2449 21,474
836 5697
GHEA
GJOB
GODD
3 56
C32 45 662 E14 3 18
3 71
116 957
4 55
GTOUR
GVIO
GWEA
8 61
1287 9308
103 910
C15
163 1557
C24
42 451
GFAS # Of training data # Of testing data
C12
9 89
120 979 GWELF 10 73
C16 3 31
C33
C17
28 333 C34
772 5808 E21
C41 7 66
E31 5 60
GDEF 63 488 GPOL 1182 10,172
49 467
38 274 GDIP 968 8141 GPRO 132 550
M11
M12
3657 36,521
1382 15,258
9 175 E51
C21 538 4702 C42 5 25 E61
C22 422 2180 CCAT 28 414 E71
30 212
23 162
559 4505
GDIS
GENT
210 1881
155 1024
GREL
GSCI
GSPO
75 320
70 1003
3030 31,732
M13
M14
MCAT
13 140
434 3440
GENV 60 482
12 157
Table 5. Performance on F1 measure in the Reuters-21578 dataset with thresholds q = 3 and q = 10 over the centroid classifier. Each of the eight rows below corresponds to one term weighting scheme, in order: CC, MI, OR, ProbBa., RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat, and macro-F1.
0.707(0.717) 0.076(0.083) 0.389(0.403) 0.799(0.853) 0.033(0.017) 0.439(0.363) 0.445(0.463) 0.305(0.276) 0.289(0.273) 0.193(0.189) 0.367(0.364)
0.825(0.834) 0.453(0.444) 0.793(0.793) 0.891(0.898) 0.307(0.360) 0.659(0.641) 0.701(0.656) 0.675(0.639) 0.716(0.747) 0.495(0.447) 0.651(0.646)
0.799(0.810) 0.245(0.281) 0.670(0.699) 0.852(0.861) 0.263(0.283) 0.527(0.527) 0.509(0.645) 0.625(0.624) 0.507(0.589) 0.293(0.336) 0.532(0.565)
0.784(0.800) 0.205(0.252) 0.639(0.748) 0.765(0.871) 0.163(0.479) 0.546(0.547) 0.428(0.576) 0.587(0.751) 0.462(0.865) 0.189(0.187) 0.477(0.608)
0.814(0.817) 0.455(0.445) 0.763(0.772) 0.872(0.873) 0.320(0.318) 0.617(0.633) 0.633(0.685) 0.667(0.668) 0.758(0.758) 0.525(0.522) 0.642(0.649)
0.897(0.906) 0.432(0.508) 0.820(0.836) 0.936(0.938) 0.375(0.372) 0.702(0.788) 0.695(0.829) 0.732(0.763) 0.819(0.833) 0.520(0.544) 0.693(0.732)
0.917(0.925) 0.468(0.539) 0.829(0.849) 0.946(0.950) 0.408(0.424) 0.712(0.856) 0.704(0.898) 0.758(0.783) 0.836(0.887) 0.540(0.559) 0.707(0.767)
0.930(0.934) 0.391(0.477) 0.826(0.838) 0.958(0.960) 0.377(0.435) 0.713(0.892) 0.682(0.891) 0.760(0.785) 0.828(0.853) 0.511(0.529) 0.697(0.759)
Note: The number in parenthesis denotes the performance with threshold q = 10.
Table 6. Performance on F1 measure in the Reuters-21578 dataset with thresholds q = 3 and q = 10 over the NB classifier. Each of the eight rows below corresponds to one term weighting scheme, in order: CC, MI, OR, ProbBa., RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat, and macro-F1.
0.799(0.858) 0.237(0.496) 0.620(0.736) 0.901(0.932) 0.546(0.673) 0.516(0.664) 0.572(0.818) 0.552(0.794) 0.579(0.758) 0.416(0.620) 0.574(0.735)
0.786(0.859) 0.249(0.448) 0.599(0.730) 0.907(0.937) 0.520(0.668) 0.529(0.729) 0.580(0.792) 0.577(0.790) 0.596(0.728) 0.434(0.599) 0.578(0.728)
0.795(0.880) 0.243(0.392) 0.635(0.781) 0.914(0.946) 0.547(0.660) 0.531(0.679) 0.591(0.778) 0.572(0.794) 0.582(0.715) 0.390(0.558) 0.580(0.718)
0.759(0.863) 0.104(0.117) 0.615(0.712) 0.827(0.892) 0.465(0.577) 0.500(0.605) 0.482(0.638) 0.521(0.637) 0.466(0.689) 0.161(0.211) 0.490(0.594)
0.783(0.867) 0.254(0.491) 0.631(0.739) 0.897(0.931) 0.535(0.655) 0.501(0.714) 0.573(0.782) 0.608(0.783) 0.595(0.712) 0.424(0.579) 0.580(0.725)
0.750(0.797) 0.242(0.420) 0.575(0.578) 0.886(0.876) 0.517(0.618) 0.486(0.630) 0.572(0.711) 0.558(0.720) 0.577(0.563) 0.435(0.573) 0.560(0.649)
0.753(0.867) 0.206(0.448) 0.620(0.738) 0.886(0.932) 0.537(0.679) 0.501(0.668) 0.583(0.749) 0.484(0.727) 0.593(0.684) 0.437(0.587) 0.560(0.708)
0.751(0.858) 0.215(0.454) 0.603(0.748) 0.879(0.931) 0.515(0.671) 0.487(0.708) 0.574(0.751) 0.481(0.735) 0.566(0.692) 0.525(0.580) 0.560(0.713)
Note: The number in parenthesis denotes the performance with threshold q = 10.
Table 7. Performance on F1 measure in the Reuters-21578 dataset with thresholds q = 3 and q = 10 over the SVM classifier. Each of the eight rows below corresponds to one term weighting scheme, in order: CC, MI, OR, ProbBa., RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for acq, corn, crude, earn, grain, interest, money-fx, ship, trade, wheat, and macro-F1.
0.812(0.827) 0.139(0.219) 0.654(0.705) 0.886(0.909) 0.394(0.404) 0.495(0.643) 0.598(0.737) 0.557(0.638) 0.539(0.638) 0.056(0.151) 0.513(0.587)
0.880(0.867) 0.308(0.324) 0.768(0.755) 0.929(0.920) 0.443(0.353) 0.657(0.712) 0.738(0.793) 0.456(0.407) 0.743(0.722) 0.265(0.229) 0.619(0.608)
0.839(0.853) 0.163(0.182) 0.709(0.783) 0.913(0.923) 0.368(0.405) 0.515(0.537) 0.605(0.741) 0.611(0.710) 0.572(0.681) 0.082(0.229) 0.538(0.605)
0.714(0.791) 0.253(0.416) 0.778(0.865) 0.879(0.919) 0.403(0.554) 0.569(0.640) 0.449(0.690) 0.568(0.740) 0.589(0.889) 0.021(0.068) 0.522(0.657)
0.758(0.758) 0.207(0.126) 0.584(0.592) 0.806(0.807) 0.386(0.420) 0.593(0.636) 0.631(0.671) 0.322(0.311) 0.652(0.677) 0.216(0.319) 0.516(0.532)
0.907(0.915) 0.235(0.305) 0.803(0.824) 0.922(0.928) 0.408(0.465) 0.648(0.757) 0.749(0.874) 0.578(0.617) 0.744(0.780) 0.278(0.294) 0.627(0.676)
0.942(0.960) 0.243(0.359) 0.849(0.876) 0.960(0.972) 0.472(0.573) 0.644(0.863) 0.763(0.911) 0.686(0.755) 0.783(0.864) 0.302(0.374) 0.664(0.751)
0.947(0.961) 0.240(0.419) 0.851(0.879) 0.975(0.981) 0.494(0.629) 0.685(0.907) 0.771(0.923) 0.714(0.788) 0.821(0.882) 0.269(0.403) 0.677(0.777)
Note: The number in parenthesis denotes the performance with threshold q = 10.
4.4.1. Results with high-dimensional vector space
The high-dimensional vector space is generated by setting the threshold to q = 3 in the Reuters-21578 and 20 Newsgroups datasets. Tables 5-7 show the F1 measure F1(Ck) of each category in the Reuters-21578 dataset over the centroid, NB, and SVM classifiers, respectively. In these tables, compared with the other weighting methods, TF.IDF.ICF and TF.IDF.ICSdF outperform the other methods with the centroid and SVM classifiers, respectively, while RF shows its superiority with the NB classifier. In contrast, Tables 8-10 show the F1 measure F1(Ck) of each category in the 20 Newsgroups dataset over the centroid, NB, and SVM classifiers, respectively. The results show that TF.IDF.ICSdF is superior with the centroid and SVM classifiers: it achieves the highest accuracy in 19 of 20 categories with the centroid classifier and in 20 of 20 categories with SVM, while TF.IDF shows its superiority with the NB classifier.
Table 8. Performance on F1 measure in the 20 Newsgroups dataset with thresholds q = 3 and q = 10 over the centroid classifier. Each of the seven rows below corresponds to one term weighting scheme, in order: CC, MI, OR, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for Ath, CGra, CMWM, CSIPH, CSMH, CWin, MFor, RAuto, RMot, RSB, RSH, SCry, SEle, SMed, SSpa, SRChr, TPG, TPMid, TPMi, TrMi, and macro-F1.
0.4441(0.6092) 0.3529(0.4385) 0.4085(0.1577) 0.3663(0.4322) 0.3321(0.3989) 0.4323(0.5083) 0.3296(0.6485) 0.3717(0.4511) 0.3148(0.6878) 0.4468(0.6390) 0.5310(0.2452) 0.6687(0.7277) 0.2545(0.5070) 0.2643(0.6794) 0.5303(0.4198) 0.2429(0.7006) 0.4691(0.6471) 0.6186(0.7163) 0.2506(0.5234) 0.1741(0.3690) 0.3902(0.5253)
0.4134(0.6499) 0.5161(0.5950) 0.4025(0.1092) 0.4928(0.5378) 0.6237(0.6634) 0.5364(0.5489) 0.6374(0.6329) 0.7149(0.7768) 0.7552(0.6504) 0.6209(0.6208) 0.3604(0.2378) 0.7259(0.8138) 0.5116(0.4890) 0.7232(0.7981) 0.7279(0.7949) 0.6888(0.7118) 0.6733(0.7558) 0.7732(0.8198) 0.5314(0.5665) 0.3905(0.4712) 0.5910(0.6122)
0.7270(0.6556) 0.5317(0.5305) 0.1104(0.0967) 0.4506(0.5093) 0.5904(0.6093) 0.6211(0.5457) 0.3679(0.5993) 0.7257(0.7730) 0.5865(0.6503) 0.7282(0.6293) 0.4829(0.3458) 0.8007(0.8125) 0.5761(0.4646) 0.7444(0.7859) 0.7736(0.7889) 0.7284(0.7129) 0.7284(0.7443) 0.6760(0.8070) 0.5901(0.5669) 0.5022(0.4657) 0.6021(0.6047)
0.6328(0.6682) 0.5766(0.6290) 0.0902(0.1818) 0.5229(0.5593) 0.6356(0.6720) 0.5491(0.5844) 0.6368(0.6815) 0.7493(0.7961) 0.5288(0.6626) 0.6402(0.6568) 0.3036(0.3429) 0.7860(0.8229) 0.4140(0.5481) 0.7421(0.8270) 0.7855(0.8252) 0.6937(0.7560) 0.7226(0.7863) 0.8055(0.8421) 0.5484(0.6257) 0.3755(0.4781) 0.5870(0.6473)
0.7514(0.7612) 0.7496(0.7445) 0.1390(0.1313) 0.6799(0.6645) 0.7864(0.7781) 0.6763(0.6617) 0.7758(0.7794) 0.8843(0.8889) 0.9292(0.9235) 0.8481(0.8452) 0.7965(0.7894) 0.8775(0.8828) 0.7670(0.7621) 0.9067(0.9134) 0.9047(0.9100) 0.8409(0.8465) 0.8389(0.8366) 0.9009(0.9054) 0.7251(0.7290) 0.5994(0.6199) 0.7689(0.7687)
0.8216(0.6347) 0.7988(0.6063) 0.1430(0.1280) 0.7180(0.5397) 0.8297(0.6743) 0.7253(0.5681) 0.8251(0.5150) 0.9308(0.7315) 0.9623(0.6344) 0.9516(0.6327) 0.9324(0.3706) 0.9227(0.8125) 0.8316(0.4551) 0.9464(0.7991) 0.9296(0.7710) 0.8718(0.6984) 0.8691(0.7437) 0.9382(0.8389) 0.8039(0.5685) 0.6450(0.4504) 0.8199(0.6086)
0.8218(0.8289) 0.8055(0.8108) 0.1747(0.1474) 0.7348(0.7180) 0.8342(0.8286) 0.7558(0.7259) 0.8283(0.8385) 0.9312(0.9387) 0.9627(0.9621) 0.9518(0.9565) 0.9338(0.9359) 0.9254(0.9277) 0.8346(0.8277) 0.9468(0.9536) 0.9297(0.9346) 0.8719(0.8764) 0.8696(0.8689) 0.9382(0.9477) 0.8039(0.8133) 0.6451(0.6615) 0.8250(0.8251)
Note: The number in parenthesis denotes the performance with threshold q = 10.
Table 9. Performance on F1 measure in the 20 Newsgroups dataset with thresholds q = 3 and q = 10 over the NB classifier. Each of the seven rows below corresponds to one term weighting scheme, in order: CC, MI, OR, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for Ath, CGra, CMWM, CSIPH, CSMH, CWin, MFor, RAuto, RMot, RSB, RSH, SCry, SEle, SMed, SSpa, SRChr, TPG, TPMid, TPMi, TrMi, and macro-F1.
0.8151(0.8351) 0.8021(0.8210) 0.7992(0.8192) 0.8133(0.8353) 0.8148(0.8480) 0.8223(0.8438) 0.8244(0.8404) 0.8236(0.8364) 0.8674(0.8847) 0.8558(0.8882) 0.8903(0.9034) 0.8324(0.8518) 0.8023(0.8203) 0.8186(0.8456) 0.7877(0.7975) 0.8337(0.8474) 0.7893(0.7931) 0.7555(0.7751) 0.6918(0.7018) 0.7601(0.7715) 0.8100(0.8280)
0.7906(0.8155) 0.7798(0.8167) 0.8078(0.8310) 0.8194(0.8398) 0.8391(0.8482) 0.8488(0.8454) 0.7998(0.8417) 0.8119(0.8359) 0.8491(0.8714) 0.8633(0.8889) 0.8984(0.8918) 0.8371(0.8512) 0.8148(0.8272) 0.8068(0.8355) 0.7630(0.7811) 0.8475(0.8410) 0.7624(0.7814) 0.7764(0.7568) 0.6524(0.6901) 0.7304(0.7544) 0.8049(0.8223)
0.8027(0.8275) 0.8183(0.8302) 0.7867(0.8167) 0.8132(0.8328) 0.8515(0.8615) 0.8243(0.8437) 0.8389(0.8501) 0.8401(0.8510) 0.8675(0.8835) 0.8890(0.9029) 0.8904(0.9073) 0.8337(0.8471) 0.8026(0.8265) 0.8363(0.8636) 0.7849(0.8046) 0.8086(0.8368) 0.7886(0.7968) 0.7803(0.7929) 0.7046(0.7234) 0.7611(0.7745) 0.8162(0.8337)
0.8095(0.8172) 0.7599(0.8113) 0.7614(0.8216) 0.8119(0.8385) 0.8359(0.8494) 0.8118(0.8524) 0.8024(0.8416) 0.8312(0.8475) 0.8706(0.8799) 0.8825(0.8974) 0.9024(0.9051) 0.8423(0.8443) 0.8041(0.8310) 0.8074(0.8588) 0.7694(0.7941) 0.8334(0.8308) 0.7764(0.7914) 0.7754(0.7596) 0.6763(0.7001) 0.7308(0.7577) 0.8048(0.8265)
0.8034(0.7612) 0.8216(0.7445) 0.8221(0.1313) 0.8388(0.6645) 0.8538(0.7781) 0.8643(0.6617) 0.8380(0.7794) 0.8537(0.8889) 0.8754(0.9235) 0.8905(0.8452) 0.9095(0.7894) 0.8617(0.8828) 0.8315(0.7621) 0.8410(0.9134) 0.8023(0.9100) 0.8301(0.8465) 0.7924(0.8366) 0.7667(0.9054) 0.6920(0.7290) 0.7415(0.6199) 0.8265(0.7687)
0.7865(0.8202) 0.7907(0.8151) 0.7949(0.1474) 0.8086(0.8398) 0.8380(0.8610) 0.8350(0.8631) 0.7867(0.8327) 0.8326(0.8444) 0.8701(0.8774) 0.8862(0.8999) 0.8901(0.9106) 0.8536(0.8420) 0.8130(0.8225) 0.8185(0.8610) 0.7932(0.7942) 0.8097(0.8294) 0.7938(0.7945) 0.7827(0.7610) 0.6990(0.7144) 0.7133(0.7453) 0.8098(0.7938)
0.7900(0.8289) 0.7675(0.8108) 0.7880(0.8312) 0.8046(0.7180) 0.8428(0.8286) 0.8230(0.7259) 0.7891(0.8385) 0.8318(0.9387) 0.8631(0.9621) 0.8706(0.9565) 0.8888(0.9359) 0.8473(0.9277) 0.8043(0.8277) 0.8000(0.9536) 0.7694(0.9346) 0.7972(0.8764) 0.7886(0.8689) 0.7553(0.9477) 0.6831(0.8133) 0.7226(0.6615) 0.8014(0.8593)
Note: The number in parenthesis denotes the performance with threshold q = 10.
The above results show that the class-indexing-based TF.IDF.ICSdF and TF.IDF.ICF term weighting approaches both outperform the conventional approaches. This is especially the case for the TF.IDF.ICF approach, in which both the IDF and ICF functions give positive discrimination to rare terms, while the high-dimensional vector space is considered with numerous rare terms in the term space. Therefore, with the threshold set at q = 3, the TF.IDF.ICF and TF.IDF.ICSdF term weighting approaches show their superiority with the centroid and SVM classifiers, respectively.
4.4.2. Results with comparatively low-dimensional vector space
The low-dimensional vector space is generated by setting the threshold to q = 10 in the two datasets, Reuters-21578 and 20 Newsgroups. In Tables 5-7, the results in parentheses show the F1 measure F1(Ck) of each category in the Reuters-21578 dataset over the centroid, NB, and SVM classifiers, respectively. In these tables, compared with the other weighting methods, TF.IDF.ICF and TF.IDF.ICSdF outperform the other methods with the centroid and SVM classifiers, respectively, while CC shows its superiority with the NB classifier.
Table 10. Performance on F1 measure in the 20 Newsgroups dataset with thresholds q = 3 and q = 10 over the SVM classifier. Each of the seven rows below corresponds to one term weighting scheme, in order: CC, MI, OR, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF. The values in each row are the F1 scores for Ath, CGra, CMWM, CSIPH, CSMH, CWin, MFor, RAuto, RMot, RSB, RSH, SCry, SEle, SMed, SSpa, SRChr, TPG, TPMid, TPMi, TrMi, and macro-F1.
0.7857(0.7853) 0.7085(0.7356) 0.6490(0.7278) 0.6512(0.6707) 0.7133(0.7622) 0.7131(0.7169) 0.6346(0.7772) 0.7696(0.8190) 0.8021(0.8741) 0.8332(0.8551) 0.8354(0.8935) 0.8846(0.8171) 0.6400(0.6918) 0.8220(0.8662) 0.8247(0.8600) 0.7609(0.7771) 0.7641(0.8129) 0.8994(0.8643) 0.6930(0.7362) 0.5342(0.4682) 0.7459(0.7756)
0.6321(0.7123) 0.6058(0.7132) 0.6667(0.7221) 0.6058(0.6461) 0.7128(0.7536) 0.6706(0.7218) 0.7237(0.7422) 0.7657(0.8160) 0.8321(0.8531) 0.7369(0.8209) 0.7987(0.8484) 0.7015(0.7959) 0.5362(0.5912) 0.8080(0.8491) 0.8044(0.8402) 0.7245(0.7529) 0.7554(0.8016) 0.7979(0.8032) 0.6316(0.6696) 0.2234(0.2679) 0.6867(0.7361)
0.7664(0.7355) 0.7080(0.7712) 0.6143(0.7892) 0.6233(0.6218) 0.6983(0.7510) 0.7276(0.7902) 0.5930(0.7543) 0.8055(0.8093) 0.8373(0.8926) 0.8444(0.8213) 0.8300(0.8607) 0.8841(0.8238) 0.6706(0.6083) 0.8396(0.8797) 0.8438(0.8270) 0.7736(0.7668) 0.8125(0.8480) 0.8848(0.8573) 0.7401(0.7135) 0.5320(0.4075) 0.7515(0.7664)
0.4774(0.4283) 0.5845(0.6306) 0.1950(0.5856) 0.4552(0.5373) 0.5988(0.5838) 0.4598(0.6277) 0.5869(0.6305) 0.6082(0.6527) 0.6370(0.5812) 0.6168(0.5106) 0.6800(0.7192) 0.7629(0.8302) 0.2982(0.3435) 0.3965(0.6063) 0.7057(0.6828) 0.6729(0.6133) 0.4650(0.6673) 0.6816(0.8227) 0.3965(0.2780) 0.1681(0.2125) 0.5223(0.5772)
0.7941(0.8036) 0.8266(0.8337) 0.7958(0.8089) 0.7658(0.7561) 0.8411(0.8385) 0.8344(0.8509) 0.8279(0.8341) 0.8860(0.8934) 0.9433(0.9476) 0.9348(0.9464) 0.9198(0.9231) 0.8708(0.8827) 0.7968(0.8134) 0.9255(0.9341) 0.8997(0.9080) 0.8311(0.8316) 0.8596(0.8654) 0.8968(0.9100) 0.8117(0.8184) 0.4925(0.5029) 0.8377(0.8451)
0.8890(0.8347) 0.8914(0.7963) 0.8727(0.7800) 0.8393(0.7677) 0.8988(0.8250) 0.9116(0.8273) 0.8666(0.8151) 0.9464(0.8753) 0.9692(0.9248) 0.9705(0.9447) 0.9682(0.9577) 0.9468(0.9242) 0.8863(0.7949) 0.9595(0.9083) 0.9510(0.9021) 0.9032(0.8655) 0.9235(0.8696) 0.9691(0.9394) 0.9010(0.8321) 0.7179(0.6833) 0.9091(0.8534)
0.9046(0.9105) 0.9003(0.8996) 0.8731(0.8671) 0.8555(0.8493) 0.9154(0.9046) 0.9127(0.9073) 0.8968(0.9046) 0.9495(0.9559) 0.9775(0.9715) 0.9770(0.9814) 0.9786(0.9781) 0.9626(0.9625) 0.9076(0.9059) 0.9621(0.9662) 0.9584(0.9584) 0.9348(0.9435) 0.9464(0.9451) 0.9714(0.9783) 0.9243(0.9340) 0.8088(0.8322) 0.9259(0.9278)
Note: The number in parenthesis denotes the performance with threshold q = 10.
In contrast, Tables 8–10 report the F1 measure F1(Ck) of each category in the 20 Newsgroups dataset over the centroid, NB, and SVM classifiers, respectively, with the values in parentheses obtained at q = 10. The results show that TF.IDF.ICSdF is superior across the three classifiers: it achieves the highest accuracy in 20 of 20 categories with the centroid classifier, 11 of 20 with NB, and 20 of 20 with SVM compared with the other term weighting approaches. The above results show that the class-indexing-based TF.IDF.ICSdF and TF.IDF.ICF approaches outperformed the other term weighting approaches. This is especially the case for the TF.IDF.ICSdF approach, which shows its superiority in almost all categories of the dataset.
4.4.3. Results with RCV1-v2 dataset
In this dataset, we do not introduce any thresholds because of the unbalanced document distribution in the training process. Table 4 shows that the majority of the categories have a small domain; some of them contain only three to thirteen documents. Therefore, the total number of unique terms in the term space Ti = {t1, t2, . . . , tn} is 58,885, which includes many rare terms in the vector space. Fig. 7 shows the performance based on micro-F1 over the centroid classifier. In this figure, TF.IDF.ICF is superior to the other weighting methods, achieving the highest micro-F1 score (79.06). TF.IDF and TF.IDF.ICSdF rank second and third with micro-F1 scores of 78.87 and 78.63, respectively.
4.5. Overall performances and discussions
Figs. 3–6 show the performance comparison with micro-F1 for eight different term weighting methods over the Reuters-21578 and 20 Newsgroups datasets with thresholds q = 3 and q = 10, using three different learning classifiers.
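For reference, the micro-F1 and macro-F1 scores reported here follow the standard definitions: macro-F1 averages the per-category F1 values, while micro-F1 is computed from the pooled true-positive, false-positive, and false-negative counts. The sketch below illustrates these definitions; the function and variable names are ours, and the counts are toy values rather than results from the paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: dict mapping category -> (tp, fp, fn)."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro

# Toy example with two categories.
print(micro_macro_f1({"wheat": (8, 2, 1), "corn": (5, 1, 4)}))
```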
Fig. 3. Performance comparison with micro-F1 in the Reuters-21578 dataset with threshold q = 3 over the Centroid, NB, and SVM classifiers (term weighting methods compared: CC, MI, OR, PB, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF).
Fig. 4. Performance comparison with micro-F1 in the Reuters-21578 dataset with threshold q = 10 over the Centroid, NB, and SVM classifiers (term weighting methods compared: CC, MI, OR, PB, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF).
Fig. 5. Performance comparison with micro-F1 in the 20 Newsgroups dataset with threshold q = 3 over the Centroid, NB, and SVM classifiers (term weighting methods compared: CC, MI, OR, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF).
Fig. 6. Performance comparison with micro-F1 in the 20 Newsgroups dataset with threshold q = 10 over the Centroid, NB, and SVM classifiers (term weighting methods compared: CC, MI, OR, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF).
In Figs. 3 and 4 for Reuters-21578 and Figs. 5 and 6 for 20 Newsgroups, the proposed TF.IDF.ICSdF outperforms the other term weighting approaches with the SVM and centroid classifiers. In Fig. 7 for RCV1-v2, TF.IDF.ICF performs best with the centroid classifier; the vector space of RCV1-v2 includes many rare terms, and the ICF function therefore gives positive discrimination on this dataset. The results of the above experiments show that the proposed class-indexing-based TF.IDF.ICSdF term weighting method consistently outperforms the other methods in both high-dimensional and comparatively low-dimensional vector spaces. Moreover, the TF.IDF.ICSdF approach shows its superiority not only in overall micro-F1 and macro-F1 but also in all categories of the 20 Newsgroups dataset and a majority of the Reuters-21578 categories using the SVM and centroid classifiers. The other proposed class-indexing-based approach, TF.IDF.ICF, performed best on RCV1-v2, whose very high-dimensional vector space includes numerous rare terms in the VSM.
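Since several of the comparisons above rely on the centroid classifier, the following sketch shows the standard centroid-based decision rule over weighted document vectors: each class is represented by the mean of its training vectors, and a test document is assigned to the class whose centroid is most cosine-similar. This is a generic illustration with toy data, not the exact configuration used in the experiments.

```python
import numpy as np

def train_centroids(X, y):
    """X: (n_docs, n_terms) weighted document vectors; y: class labels."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign x to the class whose centroid is most cosine-similar."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return a @ b / denom if denom else 0.0
    return max(centroids, key=lambda c: cos(centroids[c], x))

# Toy training set: three documents, two classes, three weighted terms.
X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.8]])
y = np.array(["wheat", "wheat", "corn"])
cents = train_centroids(X, y)
print(predict(cents, np.array([0.8, 0.0, 0.1])))  # expected: wheat
```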
Fig. 7. Performance comparison with micro-F1 in the RCV1-v2 dataset over the Centroid classifier (term weighting methods compared: CC, MI, OR, PB, RF, TF.IDF, TF.IDF.ICF, TF.IDF.ICSdF).
However, it is important to note from the experimental results that the combination of TF.IDF.ICSdF with the centroid classifier, or of TF.IDF.ICSdF with SVM, can significantly improve system performance. It is also noticeable that the TF.IDF.ICSdF approach is very effective when we reduce the vector space. Therefore, it is important to generate more informative terms in the class-indexing-based term weighting method in order to enhance the automatic indexing task. Our study shows that the TF.IDF.ICSdF approach is a novel term weighting method that demonstrates consistently higher performance than the other term weighting approaches. It is worth mentioning that, in Reuters-21578, the results of some small categories such as wheat and corn are comparatively poor compared with the other categories in this dataset. We therefore examined the Reuters-21578 dataset to explore the possible reason for this poor accuracy. In the high-dimensional vector space, the categories wheat and corn contain 1169 (13.02%) and 1153 (12.8%) terms, respectively, to represent their own domains, which we found very small compared with the other categories; in the comparatively low-dimensional vector space, they contain 515 (6.2%) and 471 (5.7%) terms, respectively. These categories are semantically highly correlated, and most of their terms overlap with each other.
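To make the class-indexing idea concrete, the sketch below combines term frequency, IDF, and one common form of inverse class frequency, ICF = log(|C| / cf(t)), where cf(t) is the number of classes in which term t occurs. This is only a rough sketch under that assumption; the exact TF.IDF.ICF and TF.IDF.ICSdF formulas follow the definitions given earlier in the paper and are not reproduced here.

```python
import math
from collections import Counter

def tf_idf_icf(docs, labels):
    """docs: list of token lists; labels: class label of each document.

    Returns {(doc_index, term): weight} using one common form of
    TF.IDF.ICF:  tf * log(N / df) * log(|C| / cf).
    """
    N = len(docs)
    classes = sorted(set(labels))
    df = Counter()                          # document frequency of each term
    class_terms = {c: set() for c in classes}
    for doc, label in zip(docs, labels):
        for term in set(doc):
            df[term] += 1
            class_terms[label].add(term)
    cf = Counter()                          # number of classes containing each term
    for terms in class_terms.values():
        cf.update(terms)
    weights = {}
    for i, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            idf = math.log(N / df[term])
            # Terms occurring in every class get ICF = 0, i.e. zero weight --
            # the bias against frequent terms that motivates the ICSdF revision.
            icf = math.log(len(classes) / cf[term])
            weights[(i, term)] = tf * idf * icf
    return weights

docs = [["wheat", "price"], ["corn", "price"], ["trade", "tariff"]]
print(tf_idf_icf(docs, ["grain", "grain", "trade"]))
```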
5. Conclusion
In this study, we investigated the effectiveness of the proposed class-indexing-based TF.IDF.ICSdF and TF.IDF.ICF approaches against other term weighting approaches using centroid, Naïve Bayes, and SVM classifiers to address the ATC task. The proposed term weighting approaches are effective in enhancing the classification task. The experiments were conducted using the Reuters-21578, 20 Newsgroups, and RCV1-v2 datasets as benchmark collections. First, we proposed a new class-based indexing method to replace the traditional document indexing task. Then, we implemented a class-indexing-based TF.IDF.ICF term weighting approach, which led to a new class-indexing-based TF.IDF.ICSdF term weighting approach that emphasizes class space density rather than class space. The proposed class-indexing-based term weighting approaches outperformed the six other term weighting approaches. In particular, the TF.IDF.ICSdF approach consistently outperformed the other term weighting approaches with the SVM and centroid classifiers. Since SVM is considered one of the most robust classifiers and has achieved great success in TC, the combination of TF.IDF.ICSdF and SVM can significantly improve the performance of TC and of other applications. From the experimental results, it is clear that the TF.IDF.ICSdF approach is very promising in domain-based applications such as text classification, information retrieval, information extraction, emotion recognition, topic identification, and many other applications in a machine learning workbench. The TF.IDF.ICSdF approach may further be used to compute the weights for classifying the eight basic emotions6 for an intelligent robot. It is also noticeable from the experimental results that the TF.IDF.ICSdF approach is very effective when we reduce the vector space; therefore, the proposed TF.IDF.ICSdF approach can overcome the problem of high dimensionality, giving positive discrimination to both rare and frequent terms. Although the proposed TF.IDF.ICSdF approach consistently performed well in the single-label classification task, in its current form it still needs to be applied to the multi-label classification task to judge whether it can perform equally well there. A future extension of this methodology aims to improve the performance of text classification by introducing hybrid indexing that exploits semantic indexing techniques; the classification performance can be further improved by combining class indexing and semantic indexing. In future work, besides single-label classification, we will conduct our experiments with multi-label classification.
6 Joy, Love, Expectation, Surprise, Anxiety, Sorrow, Anger, and Hate.
Acknowledgement
This research has been partially supported by the Ministry of Education, Science, Sports and Culture of Japan under Grant-in-Aid for Scientific Research (A) No. 22240021.