Expert Systems with Applications 39 (2012) 4760–4768
Comparison of term frequency and document frequency based feature selection metrics in text categorization

Nouman Azam, JingTao Yao
Department of Computer Science, University of Regina, Regina, SK, Canada S4S 0A2
Keywords: Text categorization; Feature selection metrics; Term frequency; Document frequency
Abstract

Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on the term frequency or the document frequency of a word. We focus on the relative importance of these frequencies for feature selection metrics. The document frequency based metrics of discriminative power measure and GINI index were examined with term frequency for this purpose. The metrics were compared and analyzed on the Reuters 21,578 dataset. Experimental results revealed that term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets.
1. Introduction

The volume of digital documents available online is growing exponentially as a result of increased usage of the internet. Finding relevant and timely information in these documents is important for many applications. Automated text categorization is the key technology for this task (Shang et al., 2007). It has been utilized in many application areas such as customer relationship management (Coussement & Van den Poel, 2008), spam email filtering (Sakkis et al., 2003; Zhou, Yao, & Luo, 2010), web page classification (Qi & Davison, 2009), text sentiment classification (Wang, Li, Song, Wei, & Li, 2011) and astronomy (Kou, Napoli, & Toussaint, 2005).

A moderately sized text collection usually contains tens of thousands of features (Genkin, David, & Madigan, 2007; Yang & Pedersen, 1997). The commonly used 'bag-of-words' representation for text documents, where each word is treated as a feature, results in high dimensionality (Sebastiani, 2002). Feature selection is among the possible solutions in such situations for making the learning task efficient. Feature selection is an active research area in many fields such as data mining, machine learning and rough sets (Liang, Wang, & Yao, 2007; Piramuthu, 2004; Yao & Zhao, 2008; Yao, Zhao, & Wang, 2008). Feature selection may be defined as the process of selecting the most important features (Azam & Yao, 2011; Yao & Zhang, 2005).
The process typically involves certain metrics that are used to determine the utility or importance level of features. Feature selection can help in the efficient utilization of resources for large scale problems (Forman, 2003). Existing feature selection methods in text categorization are based either on term frequency (López, Jiménez-Salazar, & Pinto, 2007; Moyotl-Hernández & Jiménez-Salazar, 2005; Tang, Shepherd, Milios, & Heywood, 2005) or on document frequency (Forman, 2003; Lee & Lee, 2006; Mladenic & Grobelnik, 1999; Ogura, Amano, & Kondo, 2009; Yang & Pedersen, 1997). Term frequency is the number of times a particular word appears in a document, while document frequency is the count of documents containing that word. Term frequency may be considered relatively more important, since document frequency is based on the binary value of a term's presence or absence in a document and ignores the actual contribution of a word within a document. For instance, two words having term frequencies of 10 and 100, respectively, in a document will have the same document frequency of 1. This means that we are unable to judge their relative importance for the document. Term frequency, on the other hand, captures such information, which may be useful in the selection of important features.

The above argument leads us to an interesting issue. If term frequency is relatively more important than document frequency, the same may be true for feature selection metrics defined with them. In other words, feature selection metrics defined with term frequency may perform better than those defined with document frequency. We examine this issue in this article. In particular, we compare term frequency and document frequency in a given feature selection metric to find the more useful one. Two approaches may be utilized for this task:
1. Revising term frequency based feature selection metrics with document frequency, or
2. Revising document frequency based feature selection metrics with term frequency.
We adopt the second approach in this research. Furthermore, the research is limited to the recently proposed document frequency based metrics of discriminative power measure (Chen, Lee, & Chang, 2009) and GINI index (Shang et al., 2007). These metrics were revised with term frequency in order to investigate the effectiveness of the two types of frequencies.

2. Feature selection

Feature selection is a process which selects a subset of features that are considered important. Such selection can help in building faster, cost effective and accurate models for data processing (Saeys, Inza, & Larranaga, 2007). A typical feature selection process consists of four basic steps, namely, subset generation, subset evaluation, stopping criterion, and results validation (Liu & Yu, 2005). Subset generation is a searching procedure that generates a candidate feature subset for evaluation. The search for a subset may start at a full, empty or random feature set. The generated subset is then evaluated with an evaluation criterion which determines the goodness of the subset. The subset is compared with the best subset previously generated. The processes of subset generation and evaluation are repeated until a given stopping criterion is reached. Finally, the selected feature subset is validated with different tests using artificial or real world data.

Feature selection algorithms designed with different evaluation criteria fall broadly into three categories, namely, filter, wrapper and hybrid models. A filter model uses general characteristics of the data for evaluating and selecting features and is independent of any mining algorithm. Evaluating features with a wrapper model requires a predetermined mining algorithm, with its performance used as the evaluation criterion. A hybrid model combines the two models in a unified framework. Feature selection methods based on a wrapper model are mostly not suitable for large scale problems like text categorization (Forman, 2003). The majority of feature selection methods are based on a filter model which evaluates each feature independently (Forman, 2003). Feature selection algorithms use certain metrics to assign scores to features in such cases. Feature selection methods in text categorization are therefore also sometimes referred to as feature selection metrics (Forman, 2003).

Feature selection metrics can be defined mathematically by considering data points of the form ((x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_m, y_m)), where the y_i are the class labels associated with each instance x_i. Each x_i is represented by a vector in D dimensions (x_{i1}, x_{i2}, x_{i3}, ..., x_{iD}). The features used to define the data points may be presented as a set F = (f_1, f_2, f_3, ..., f_D). A feature selection metric generates a score for each feature in F. A set F' ⊆ F with |F'| ≤ |F| is generated using these scores. The set F' (i.e. the selected features) is commonly based on a threshold value or some predefined number of top scoring features.
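In code, the filter model described above amounts to scoring every feature independently and keeping the top scoring ones. The following minimal sketch illustrates the idea; the function names, the toy documents and the placeholder document-frequency scorer are illustrative assumptions, not code from the paper.

```python
from collections import Counter

def document_frequency_scorer(documents):
    """Return a scoring function based on document frequency (a placeholder;
    DPM, GINI or any other metric could be plugged in instead)."""
    df = Counter(word for doc in documents for word in set(doc))
    return lambda word: df[word]

def select_top_k(documents, k, score_fn):
    """Filter-style selection: score every feature, keep the k best."""
    vocabulary = {word for doc in documents for word in doc}
    scores = {word: score_fn(word) for word in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [["cheap", "price", "offer"], ["price", "market"], ["hello", "world"]]
print(select_top_k(docs, k=2, score_fn=document_frequency_scorer(docs)))
```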
3. Term frequency in feature selection metrics

We now elaborate on the importance of term frequency for feature selection metrics. The metrics of discriminative power measure (DPM) and GINI index (GINI) are considered for this purpose.

The DPM metric was proposed by Chen et al. (2009). It was reported to be very useful in reducing the feature set, for example, from thousands to hundreds of features with less than a 5 percent decrease in test accuracy. DPM was also reported to have interesting properties in emphasizing classification in parallel and in selecting both positive and negative features. Interested readers may refer to Chen et al. (2009) for more details on these properties. The GINI feature selection metric was proposed by Shang et al. (2007). It is based on the theory of the GINI index, which was previously used in decision trees for splitting attributes (Breiman, Friedman, Stone, & Olshen, 1984). Comparisons of GINI with several other metrics suggest that it is a useful metric involving simpler computations (Shang et al., 2007).

The two metrics may be defined mathematically using the following notation. Let w be any word; its presence and absence in category i are described by:

A_i: the number of documents that contain the word w and belong to category i,
B_i: the number of documents that contain the word w and do not belong to category i,
C_i: the number of documents that do not contain the word w and belong to category i,
D_i: the number of documents that do not contain the word w and do not belong to category i.
The above notation may be used to define the total number of documents as N = A_i + B_i + C_i + D_i and the total number of documents in category i as M_i = A_i + C_i. The DPM and GINI metrics for a word w are defined as follows.
DPM(w) = \sum_{i=1}^{m} \left| \frac{A_i}{M_i} - \frac{B_i}{N - M_i} \right|    (1)

GINI(w) = \sum_{i=1}^{m} \left( \frac{A_i}{M_i} \right)^2 \left( \frac{A_i}{A_i + B_i} \right)^2    (2)
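Eqs. (1) and (2) translate directly into a few lines of code. The sketch below is an illustration rather than the authors' implementation; as a check it uses the document counts of the word Hello from the sample collection introduced below (Tables 1 and 2) and reproduces the scores reported there (DPM = 1.119, GINI = 0.25).

```python
def dpm(A, B, M, N):
    """Eq. (1): sum over categories of |A_i/M_i - B_i/(N - M_i)|."""
    return sum(abs(a / m - b / (N - m)) for a, b, m in zip(A, B, M))

def gini(A, B, M):
    """Eq. (2): sum over categories of (A_i/M_i)^2 * (A_i/(A_i + B_i))^2."""
    return sum((a / m) ** 2 * (a / (a + b)) ** 2 if a + b else 0.0
               for a, b, m in zip(A, B, M))

# Counts for the word "Hello": 90 documents split 40/30/20 over three
# categories; the word occurs in 20 documents of category A and nowhere else.
M = [40, 30, 20]        # M_i: documents per category
A = [20, 0, 0]          # A_i: documents of category i containing the word
B = [0, 20, 20]         # B_i: documents outside category i containing it
print(round(dpm(A, B, M, N=sum(M)), 3))   # 1.119
print(round(gini(A, B, M), 4))            # 0.25
```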
The documents belonging to a particular category i are referred to as positive documents, while those not belonging to category i are referred to as negative documents. The fraction A_i/M_i in the above equations may be understood as a word's probability given the positive documents, i.e. its occurrences in documents of the ith category divided by the total number of documents in the ith category. In the same way, B_i/(N - M_i) may be understood as a word's probability given the negative documents. We may interpret DPM and GINI according to these definitions. The DPM for a word is the absolute difference between the word's probability given positive documents and its probability given negative documents; the per-category values are summed up to get the final DPM score. The GINI may be considered as the square of the word's probability given positive documents weighted by the square of the word's probability given its entire occurrences (i.e. A_i/(A_i + B_i)); the per-category values are summed up to get the GINI score for a word.

The DPM and GINI scores depend on the A_i and B_i values, as N and M_i are independent of a word's frequencies. Since the values of A_i and B_i are document frequencies, we question their suitability. We will consider some demonstrative examples to illustrate shortcomings that may result from using document frequency in these metrics. We make a couple of cases for this purpose, based on the sample collection of documents in Table 1.

Table 1
Dataset of documents for demonstrative examples.

                        Cat. A    Cat. B    Cat. C
Number of documents     40        30        20

Case 1. Word occurrences limited to a single category: the document and term frequencies of three words are shown under the respective categories in Table 2. We first consider the words Hello and Price. These words have the same document frequencies but different normalized term frequencies in category A (Cat. A). The higher normalized term frequency of Price suggests its stronger presence in Cat. A, which means that it is comparatively more important for Cat. A than the word Hello. We therefore expect a higher score for Price than for Hello, i.e. DPM(Hello) < DPM(Price) and GINI(Hello) < GINI(Price). This is unfortunately not the case when we calculate the scores for the two words: DPM(Hello) = DPM(Price) = 1.119 and GINI(Hello) = GINI(Price) = 0.25. This shortcoming of the metrics may be attributed to document frequency. We now highlight a more interesting situation by considering the word Cheap in Table 2. It has a lower normalized term frequency than Price but a higher document frequency. This suggests that the overall importance of Cheap for Cat. A is less than that of Price, so a lower score for Cheap is expected. When the scores are calculated we have 1.119 = DPM(Price) < DPM(Cheap) = 1.231 and 0.25 = GINI(Price) < GINI(Cheap) = 0.3025. This situation is worse than the previous one, as the more important word is assigned the lower score (the scores were merely equal in the previous case).

Table 2
Term occurrences limited to one class.

         Document frequency            Normalized term frequency
         Cat. A   Cat. B   Cat. C      Cat. A   Cat. B   Cat. C
Hello    20       0        0           2.0      0        0
Price    20       0        0           4.0      0        0
Cheap    22       0        0           2.0      0        0

Case 2. Word occurrences in multiple categories: we now consider Table 3, where the word frequencies are scattered across the categories. However, the word Home has a high term frequency in Cat. B, which makes it comparatively more important than the word Win. We therefore expect a higher score for Home in this situation. When the scores are calculated with the respective methods, we have 0.658 = DPM(Home) < DPM(Win) = 0.707 and 0.144 = GINI(Home) < GINI(Win) = 0.164.

Table 3
Term occurrences in many classes.

        Document frequency            Normalized term frequency
        Cat. A   Cat. B   Cat. C      Cat. A   Cat. B   Cat. C
Win     20       22       5           2.0      2.0      0.5
Home    20       20       4           2.0      8.0      0.1
We redefined the values of A_i, B_i, C_i, D_i with normalized term frequencies to overcome these limitations. The revised definitions are as follows:

A'_i: the total normalized term frequency of w in documents from category i,
B'_i: the total normalized term frequency of w in documents not from category i,
C'_i: the total number of documents in category i minus A'_i,
D'_i: the total number of documents not in category i minus B'_i.

The above definitions may be formally expressed by the following equations.

A'_i = \sum_{d \in cat_i} ntf(w, d)    (3)

B'_i = \sum_{d \notin cat_i} ntf(w, d)    (4)

C'_i = M_i - A'_i    (5)

D'_i = N - M_i - B'_i    (6)
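A sketch of the revised counts of Eqs. (3)-(6) is given below, again as an illustration rather than the authors' code. The per-document frequencies, document identifiers and dictionary-based interface are hypothetical; feeding the resulting A'_i and B'_i into the DPM and GINI formulas of Eqs. (1) and (2) yields the DPMNTF and GININTF scores (e.g. 0.2238 and 0.01 for the word Price of Table 2, as listed in Table 4).

```python
def revised_counts(ntf_by_doc, doc_category, M, N):
    """Eqs. (3)-(6) for one word.

    ntf_by_doc:   {doc_id: normalized term frequency of the word in that doc}
    doc_category: {doc_id: category label}
    M:            {category: number of documents in that category}
    N:            total number of documents
    """
    A, B, C, D = {}, {}, {}, {}
    for i in M:
        A[i] = sum(v for d, v in ntf_by_doc.items() if doc_category[d] == i)  # Eq. (3)
        B[i] = sum(v for d, v in ntf_by_doc.items() if doc_category[d] != i)  # Eq. (4)
        C[i] = M[i] - A[i]                                                    # Eq. (5)
        D[i] = N - M[i] - B[i]                                                # Eq. (6)
    return A, B, C, D

# Hypothetical occurrence of the word "Price": all of its normalized term
# frequency (4.0, as in Table 2) comes from a single category A document.
M = {"A": 40, "B": 30, "C": 20}
A, B, C, D = revised_counts({"d1": 4.0}, {"d1": "A"}, M, N=sum(M.values()))
print(A["A"], B["B"], B["C"])   # 4.0 4.0 4.0
```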
where ntf(w, d) is the normalized term frequency of the word w in document d. The revised values A'_i, B'_i, C'_i, D'_i were used in the DPM and GINI metrics presented in Eqs. (1) and (2). The metrics obtained with the revised definitions are referred to as DPMNTF and GININTF. Normalized values of term frequency are used in these metrics so that term frequencies are not influenced by the varying lengths of documents.

The term frequency based metrics overcome the problems highlighted in Cases 1 and 2. Tables 4 and 5 present the scores of the words for the two cases. In Table 4 we notice that the important word of Case 1, i.e. Price, receives a higher score with the revised metrics. In Table 5 we observe that the relatively important word Home also gets a higher score with the revised metrics.

Table 4
Scores of terms considered in Case 1.

         DPM      GINI      DPMNTF    GININTF
Hello    1.119    0.25      0.1119    0.0025
Price    1.119    0.25      0.2238    0.01
Cheap    1.231    0.3025    0.1119    0.0025

Table 5
Scores of terms considered in Case 2.

        DPM      GINI     DPMNTF    GININTF
Win     0.707    0.164    0.057     0.001
Home    0.658    0.144    0.482     0.045

It is worth mentioning the work by Yang, Wu, Deng, Zhang, and Yang (2002). Their study also revised document frequency based metrics with term frequency. In particular, they replaced document frequency by a thresholded value of term frequency. The new values were used in the metrics of information gain, mutual information and document frequency, and the revised versions were found to be superior. Our work is different in that we incorporate the actual term frequency statistics in the considered metrics without using any thresholds.

4. Results with Reuters 21,578

We conducted experiments on Reuters 21,578 (Lewis, 1999) to analyze the effectiveness of the feature selection metrics. The Reuters collections are among the most widely used datasets in the text categorization community (Debole & Sebastiani, 2005). Many splits of Reuters 21,578 into testing and training sets have been proposed. We selected the Modapte split, which contains 3,299 testing and 9,603 training examples.

Two types of preprocessing were used: document preprocessing and word preprocessing. We removed all documents that (1) were empty, (2) had no labels, or (3) did not contain instances in both the testing and training sets. These removals resulted in 2,756 testing and 7,105 training documents with 89 distinct classes. We also removed all words that (1) were alphanumeric, (2) had a length of two or fewer characters, or (3) were stop words (Stopwords, 2010). Porter's stemming algorithm (Porter, 1980) was also applied. The total number of unique words after preprocessing was 15,012.
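The word-level preprocessing described above can be sketched as follows. The tiny stopword set, the regular expression and the sample sentence are placeholders; the paper uses the Stopwords (2010) list and Porter's stemmer, which is omitted here for brevity.

```python
import re

# Hypothetical stand-ins for the actual resources used in the paper.
STOPWORDS = {"the", "and", "for", "that", "with"}

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens
            if t.isalpha()            # drop alphanumeric tokens such as "q3"
            and len(t) > 2            # drop tokens of two characters or fewer
            and t not in STOPWORDS]   # drop stop words
    return kept                       # Porter stemming would be applied next

print(preprocess("The Q3 earnings for ACME rose 10pc, analysts said."))
# ['earnings', 'acme', 'rose', 'analysts', 'said']
```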
We need to represent documents in numeric form for automatic processing. The representation scheme of term frequency–inverse document frequency (Sebastiani, 2002) with normalization was adopted for this purpose. The value of a word w_i in document d with this representation is given as follows.

R(w_i, d) = \frac{ tf(w_i, d) \cdot \log\{ N / df(w_i) \} }{ \sqrt{ \sum_{k=1}^{D} \left[ tf(w_k, d) \cdot \log\{ N / df(w_k) \} \right]^2 } }    (7)

where tf(w_i, d) is the term frequency of w_i in the dth document, df(w_i) the document frequency of w_i, D the total number of features and N the total number of documents.
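A sketch of the representation in Eq. (7) is shown below; summing only over the words present in a document is equivalent to summing over all D features, since absent words contribute zero. The dictionaries and toy counts are illustrative assumptions.

```python
import math

def tfidf_vector(doc_tf, df, N):
    """Eq. (7): cosine-normalized tf-idf weights for one document.

    doc_tf: {word: term frequency in this document}
    df:     {word: document frequency in the collection}
    N:      total number of documents
    """
    raw = {w: tf * math.log(N / df[w]) for w, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}

# Hypothetical counts, not taken from the paper.
print(tfidf_vector({"oil": 3, "price": 1}, {"oil": 5, "price": 50}, N=100))
```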
The representation scheme was used to represent each document as a row vector of words with their respective values.

The k-nearest neighbor (k-NN) classifier was selected for the experiments. The choice is suitable in this case, as k-NN has been reported to be relatively more sensitive to the effectiveness of feature selection metrics (Ogura et al., 2009). For classifying a document, k-NN searches for the k (a positive integer) documents in the training set that are closest to the given document; the majority class among the k nearest documents is the predicted class. Reuters 21,578 is a highly imbalanced dataset with large variance in the number of instances per category. The number of instances in some categories is less than ten, while in others it is above one thousand. In such a highly imbalanced situation, higher values of the parameter k in k-NN are biased toward the more probable categories (Li, Lu, & Yu, 2004). We therefore used smaller values of k, with k ≤ 10.
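A minimal k-NN sketch is given below. The paper does not state which similarity measure was used; cosine similarity over the tf-idf vectors is assumed here, and the toy training data are invented for illustration.

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(query, training, k=5):
    """training: list of (vector, label); return majority label of k nearest."""
    nearest = sorted(training, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [({"oil": 0.9}, "crude"),
         ({"wheat": 0.8}, "grain"),
         ({"oil": 0.7, "opec": 0.3}, "crude")]
print(knn_predict({"oil": 1.0}, train, k=3))   # 'crude'
```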
The micro and macro values of the F1 measure were used to evaluate the classification effectiveness of the feature selection metrics. The F1 measure is based on precision and recall values. For a particular category i, the precision, recall and F1 measure of a classifier are defined as

P_i = \frac{A_i}{C_i}, \quad R_i = \frac{A_i}{M_i}, \quad F1_i = \frac{2 P_i R_i}{P_i + R_i}

where A_i is the number of documents correctly assigned to category i, C_i is the number of documents assigned to category i and M_i is the number of documents in category i. The above category specific values can be averaged in two different ways.

1. Micro averaging: the values are computed globally over all categories,

P_{micro} = \frac{\sum_{i=1}^{m} A_i}{\sum_{i=1}^{m} C_i}, \quad R_{micro} = \frac{\sum_{i=1}^{m} A_i}{\sum_{i=1}^{m} M_i}, \quad F1_{micro} = \frac{2 P_{micro} R_{micro}}{P_{micro} + R_{micro}},

where m is the total number of classes.

2. Macro averaging: the values are computed for each category and then averaged over all categories,

P_{macro} = \frac{\sum_{i=1}^{m} P_i}{m}, \quad R_{macro} = \frac{\sum_{i=1}^{m} R_i}{m}, \quad F1_{macro} = \frac{\sum_{i=1}^{m} F1_i}{m}.
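The two averaging schemes can be sketched as follows, assuming per-category counts A_i, C_i and M_i as defined above; the example counts are hypothetical.

```python
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(A, C, M):
    """A[i], C[i], M[i]: correctly assigned, assigned, and true documents
    for category i (same notation as above)."""
    # Micro averaging: pool the counts over all categories.
    f1_micro = f1(sum(A) / sum(C), sum(A) / sum(M))
    # Macro averaging: average the per-category F1 values.
    per_cat = [f1(a / c if c else 0.0, a / m if m else 0.0)
               for a, c, m in zip(A, C, M)]
    return f1_micro, sum(per_cat) / len(per_cat)

# Hypothetical counts for three categories.
print(micro_macro_f1(A=[80, 8, 2], C=[100, 10, 5], M=[90, 12, 4]))
```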
The chi-square (CHI) metric (Yang & Pedersen, 1997) was also introduced as a baseline for comparisons. The CHI score for a word w can be calculated as

CHI(w) = \sum_{i=1}^{m} P(cat_i) \, CHI(w, cat_i)    (8)

CHI(w, cat_i) = \frac{ N (A_i D_i - B_i C_i)^2 }{ (A_i + C_i)(B_i + D_i)(A_i + B_i)(C_i + D_i) }    (9)

where P(cat_i) is the prior probability of category i. The remaining definitions are the same as discussed in Section 3.
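A sketch of Eqs. (8) and (9) is shown below, with the same A_i, B_i, C_i, D_i notation as in Section 3; the function names are illustrative.

```python
def chi_category(A, B, C, D, N):
    """Eq. (9): chi-square statistic of a word for one category."""
    num = N * (A * D - B * C) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

def chi(word_counts, priors, N):
    """Eq. (8): prior-weighted sum of the per-category statistics.

    word_counts: list of (A_i, B_i, C_i, D_i); priors: list of P(cat_i).
    """
    return sum(p * chi_category(a, b, c, d, N)
               for p, (a, b, c, d) in zip(priors, word_counts))
```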
It was argued by Forman (2003) that if features are selected effectively, then most of the information is contained in the initial features. For this reason we kept the feature sets below 1000 features in all experiments. Smaller feature sets are also helpful in analyzing performance under limited resources.

4.1. Results of F1-micro

Fig. 1(a)-(c) show the performance of the metrics in terms of F1-micro for different values of k. The metrics GINI and GININTF always perform better than the other metrics. The best result for each value of k was recorded with GINI:

0.80493 with k = 1 for 700 features,
0.82822 with k = 5 for 700 features,
0.81757 with k = 10 for 500 features.

It may be noted that GININTF performs comparatively well in all cases, especially for smaller feature sets of 200 or fewer features. Although GINI performs poorly for a feature set of 100 features, it improves substantially as the feature set is increased. DPMNTF remains superior to DPM for smaller feature sets but tends to perform similarly for larger feature sets. The CHI metric seems to be a better choice than DPM and DPMNTF for higher values of the parameter k. Tables 6-8 present the detailed results; the best result for each feature set is always obtained with GINI or GININTF.

[Fig. 1. Results of F1-micro.]

Table 6
Results of F1-micro with k = 1.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.7316    0.7624    0.7636    0.7647    0.7617
DPM                         0.6999    0.7343    0.7575    0.7679    0.7725
DPMNTF                      0.7212    0.7687    0.7755    0.7749    0.7741
GINI                        0.4832    0.7874    0.8025    0.8049    0.8047
GININTF                     0.7394    0.7994    0.7973    0.8045    0.7902

Table 7
Results of F1-micro with k = 5.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.7740    0.8038    0.8097    0.8082    0.8101
DPM                         0.7495    0.7817    0.7940    0.7966    0.7980
DPMNTF                      0.7582    0.7969    0.8037    0.8026    0.8038
GINI                        0.5823    0.8171    0.8271    0.8282    0.8272
GININTF                     0.7892    0.8251    0.8276    0.82675   0.8242

Table 8
Results of F1-micro with k = 10.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.7604    0.7956    0.8014    0.7912    0.7902
DPM                         0.7356    0.7639    0.7821    0.7833    0.7831
DPMNTF                      0.7467    0.7833    0.7882    0.7890    0.7828
GINI                        0.5680    0.8029    0.8175    0.8135    0.8138
GININTF                     0.7759    0.8167    0.8069    0.8144    0.8095

4.2. Results of F1-macro

Fig. 2(a)-(c) present the results of F1-macro. The best performance for k = 1 was recorded with GINI as 0.50251, while for k = 5 and k = 10 the best performances were obtained with GININTF as 0.42295 and 0.33318, respectively. The metrics GININTF and GINI maintained their superiority, and the performance of GININTF was especially notable for smaller feature sets. Although CHI performed poorly for the smaller value of k (Fig. 2(a)), it improves significantly and matches DPM for higher values (Fig. 2(b) and (c)). DPMNTF performs better than CHI between feature sets of 200 and 700 for all values of k. Once more DPMNTF remained superior to DPM, although their performances were comparable for larger feature sets. Tables 9-11 present the detailed results of F1-macro; the best results were again obtained with GINI and GININTF.

[Fig. 2. Results of F1-macro.]

Table 9
Results of F1-macro with k = 1.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.2882    0.3590    0.3717    0.3986    0.3956
DPM                         0.2645    0.3370    0.3893    0.4254    0.4467
DPMNTF                      0.2719    0.3938    0.4290    0.4352    0.4449
GINI                        0.3047    0.4632    0.4716    0.5025    0.4784
GININTF                     0.4093    0.4808    0.4780    0.4962    0.4649

Table 10
Results of F1-macro with k = 5.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.2359    0.2950    0.3180    0.3332    0.3429
DPM                         0.1917    0.2795    0.3012    0.3320    0.3292
DPMNTF                      0.2151    0.3159    0.3437    0.3419    0.3393
GINI                        0.2555    0.4119    0.4013    0.3894    0.3871
GININTF                     0.3850    0.4229    0.3949    0.3805    0.3660

Table 11
Results of F1-macro with k = 10.

Feature selection metric    Number of features
                            100       300       500       700       900
CHI                         0.1685    0.2327    0.2511    0.2633    0.2680
DPM                         0.1421    0.2120    0.2487    0.2462    0.2378
DPMNTF                      0.1581    0.2495    0.2756    0.2671    0.2441
GINI                        0.2168    0.3254    0.3185    0.3042    0.3043
GININTF                     0.3036    0.3331    0.2973    0.3001    0.2840

The above experimental results demonstrate the improvements that may be achieved with term frequency based metrics for smaller feature sets. This is desirable when computational resources are limited. We examine some properties of the considered metrics in the next section.
5. Characteristics of term frequency based feature selection metrics

5.1. Features class scatter

An effective feature selection metric is expected to select features that represent most of the classes. We examine this characteristic of the considered metrics in this section. A visual analysis method called features class scatter is introduced for this purpose. The method takes its inspiration from earlier visual analysis techniques introduced in Forman (2003) and Ogura et al. (2010). It is based on a four step procedure applied to each feature selection metric.

Step 1. Creating the feature-class matrix: a feature-class matrix of size M × D was created, whose columns represent features and whose rows represent the classes:
A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1D} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & a_{M3} & \cdots & a_{MD} \end{pmatrix}

A value a_{ij} of the matrix represents the score of the jth feature in the ith class for a particular feature selection metric.

Step 2. Sorting features and classes: the features in the columns were sorted from right to left in descending order of the scores returned by the feature selection metric. The classes were also sorted from top to bottom in descending order of their prior probabilities. This means that the entry a_{11} of matrix A represents the score of the least significant feature in the most probable class, while a_{MD} represents the score of the most significant feature in the least probable class. In all experiments D was set to 100, i.e. we only considered the top 100 features.

Step 3. Binarizing the matrix values: the maximum value of each feature across all the classes was replaced by 1 and the rest were set to 0.
Step 4. Visualizing the matrix: the matrix obtained with the above operations was converted into a rectangular array of pixels to obtain a binary image. A value of 1 is represented by black and 0 by white.

Figs. 3 and 4 show the resulting images obtained with the above procedure.

[Fig. 3. Visualization of features class scatter for DPM and DPMNTF.]

The following observations can be made from Fig. 3. We may notice a high density of black pixels at the bottom of the DPM image. This suggests that DPM selects more features from low probability categories. On the other hand, DPMNTF has a comparatively lower concentration of black pixels at the bottom of the image; its black pixels are more scattered across the image. This suggests that DPMNTF is relatively better at selecting features across a wider range of categories.

[Fig. 4. Visualization of features class scatter for GINI and GININTF.]

The following observations can be made from Fig. 4. Short horizontal lines may be noticed at the bottom of the GINI image. This means that GINI selects several features from the same low probability classes, and hence the majority of the high probability classes are not represented well. This may account for its degraded performance for smaller feature sets. This shortcoming of GINI is absent in the image of GININTF: a larger scatter may be observed in the image of GININTF compared to GINI. This suggests that GININTF is comparatively better at selecting features from the majority of the classes.

The above discussion highlights the ability of term frequency based metrics to achieve a high scatter of features among the classes. Document frequency based metrics were observed to have a comparatively low scatter of features among the classes.
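The four step procedure above can be sketched as follows, assuming the per-class scores are available as a NumPy array; the exact sort orientation and the pixel rendering of Step 4 are simplified here, and the function name is illustrative.

```python
import numpy as np

def class_scatter_matrix(scores, feature_order, class_order, top=100):
    """Steps 1-3: reorder the feature-class score matrix and binarize it.

    scores:        array of shape (M, D), score of feature j in class i
    feature_order: feature indices sorted by overall metric score
    class_order:   class indices sorted by prior probability (descending)
    Returns a binary matrix in which a 1 marks, for each of the `top`
    features, the class where that feature attains its maximum score.
    """
    A = scores[np.ix_(class_order, feature_order[:top])]   # Steps 1-2
    binary = np.zeros_like(A, dtype=int)                    # Step 3
    binary[A.argmax(axis=0), np.arange(A.shape[1])] = 1
    return binary  # Step 4 renders this as black (1) and white (0) pixels
```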
5.2. Cumulative information rate

We now examine another interesting property of the feature selection metrics by considering the information content of the top scoring features. The most important features returned by a feature selection metric are expected to contain most of the information in the data (Forman, 2003). We introduce a five step process to investigate this property.

Step 1. Sorting and selecting features: the features were sorted based on their scores with the respective feature selection metrics, and the top 1000 features were selected.

Step 2. Adding the information of features: the term frequencies of each feature were added across all documents. The document frequency of each feature was also noted.

Step 3. Representing features as vectors: two types of row vectors were used to represent the term frequencies and document frequencies:
TF = (tf_1, tf_2, tf_3, ..., tf_{1000})
DF = (df_1, df_2, df_3, ..., df_{1000})

A particular entry tf_k (k = 1, 2, ..., 1000) of the vector TF equals \sum_{i=1}^{N} ntf(w, d_i), where ntf(w, d_i) is the normalized term frequency of the word w in the ith document and N is the total number of documents. Similarly, an entry df_k of DF equals the document frequency of a particular word w across all the documents. The top 1000 features returned by DPMNTF and GININTF were represented using a vector of type TF, while those of DPM and GINI were represented using a vector of type DF.

Step 4. Normalizing the frequencies: the values in the respective vectors were normalized by dividing each value by the sum of all entries in that vector. The vectors obtained after normalization are denoted TF' and DF'. The entries tf'_k and df'_k may be expressed as follows.
tf'_k = \frac{ tf_k }{ \sum_{i=1}^{1000} tf_i }    (10)

df'_k = \frac{ df_k }{ \sum_{i=1}^{1000} df_i }    (11)
Each value in TF' or DF' now represents the fraction of the normalized term frequency or document frequency information carried by a particular feature. These entries may be interpreted as probabilities.

Step 5. Representing as a cumulative distribution: finally, each entry in the respective vectors was replaced by the sum of all entries whose indices are less than or equal to its own index. Each entry thus represents the fraction of the total term frequency or document frequency that has been covered up to that feature, and may be interpreted as a cumulative distribution function. The vectors obtained after this step are denoted TF'' and DF''. The entries tf''_k and df''_k may be expressed as follows.
tf''_k = \sum_{i=1}^{k} tf'_i    (12)

df''_k = \sum_{i=1}^{k} df'_i    (13)
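Steps 4 and 5 reduce to a normalization followed by a running sum, as in the sketch below; the input totals are hypothetical.

```python
import numpy as np

def cumulative_information_rate(values):
    """Steps 4-5: normalize the per-feature totals (Eqs. (10)-(11)) and take
    the running sum (Eqs. (12)-(13)).

    values: totals of the top-scoring features in selection order, e.g. summed
    normalized term frequencies for DPMNTF/GININTF or document frequencies
    for DPM/GINI.
    """
    v = np.asarray(values, dtype=float)
    return np.cumsum(v / v.sum())

# Hypothetical totals for five features; the curve rises toward 1.0.
print(cumulative_information_rate([40, 25, 15, 12, 8]))
```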
[Fig. 5. Cumulative information rate of the considered metrics.]

Fig. 5 summarizes the results. We notice that the term frequency based metrics cover the information content at a faster rate than the document frequency based metrics. This is evident for smaller numbers of features. GININTF captures 80% of the information with approximately 500 features, while GINI requires about 800. Similarly, DPMNTF reaches 80% with nearly 400 features, while DPM requires about 550. This suggests that the most important features returned by DPMNTF and GININTF tend to cover most of the available information at a faster rate, i.e. with a smaller number of features. DPM and GINI are comparatively slower in covering the document frequency information. The characteristics elaborated above may be attributed to the superiority of term frequency based metrics for smaller feature sets.
Table 12
Summary of results.

Evaluation method                    Document frequency based metrics    Term frequency based metrics
F1-micro (≤200 features)             Low                                 High
F1-macro (≤200 features)             Low                                 High
F1-micro (200 < features < 900)      GINI (high), DPM (low)              GININTF (low), DPMNTF (high)
F1-macro (200 < features < 900)      GINI (high), DPM (low)              GININTF (low), DPMNTF (high)
Better feature class scatter         No                                  Yes
Cumulative information rate          Slow                                Fast
Table 12 summarizes the results discussed in the article.

6. Conclusion

This article compares term frequency and document frequency for feature selection metrics in text categorization. The document frequency based metrics of discriminative power measure and GINI index were redefined with term frequency for this purpose. The metrics were compared on the Reuters 21,578 dataset for various feature set sizes. The experimental results obtained in this study suggest that the term frequency based metrics were superior for smaller feature sets. Further analysis of the term frequency based metrics revealed their important characteristics: they were observed to have a relatively larger scatter of features among the classes and to accumulate the information in the data at a faster rate. These characteristics may contribute toward their superior performance for smaller feature sets.

Acknowledgements

This work was partially supported by a Discovery Grant from NSERC Canada and the University of Regina FGSR Dean's Scholarship Program.

References

Azam, N., & Yao, J. T. (2011). Incorporating game theory in feature selection for text categorization. In Proceedings of 13th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC'11), Lecture notes in computer science (Vol. 6743, pp. 215–222).
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. (1984). Classification and regression trees. Monterey, CA: Wadsworth International Group.
Chen, C.-M., Lee, H.-M., & Chang, Y.-J. (2009). Two novel feature selection approaches for web page classification. Expert Systems with Applications, 36(1), 260–272.
Coussement, K., & Van den Poel, D. (2008). Integrating the voice of customers through call center emails into a decision support system for churn prediction. Information Management, 45(3), 164–174.
Debole, F., & Sebastiani, F. (2005). An analysis of the relative hardness of Reuters-21,578 subsets. Journal of the American Society for Information Science and Technology, 56(6), 584–596.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.
Genkin, A., David, D. L., & Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 291–304.
Kou, H., Napoli, A., & Toussaint, Y. (2005). Application of text categorization to astronomy field. In Proceedings of 10th international conference on applications of natural language to information systems (NLDB'05), Lecture notes in computer science (Vol. 3513, pp. 32–43).
Lee, C., & Lee, G. G. (2006). Information gain and divergence-based feature selection for machine learning-based text categorization. Information Processing and Management, 42(1), 155–165.
Lewis, D. D. (1999). Reuters-21,578 text categorization collection. Retrieved 06.10.
Liang, H., Wang, J., & Yao, Y. Y. (2007). User-oriented feature selection for machine learning. The Computer Journal, 50, 421–434.
Li, B., Lu, Q., & Yu, S. (2004). An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing, 3(4), 215–226.
Liu, H., & Yu, L. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491–502.
López, F. R., Jiménez-Salazar, H., & Pinto, D. (2007). A competitive term selection method for information retrieval. In Proceedings of 8th international conference on computational linguistics and intelligent text processing (CICLing'07), Lecture notes in computer science (Vol. 4394, pp. 468–475).
Mladenic, D., & Grobelnik, M. (1999). Feature selection for unbalanced class distribution and Naive Bayes. In Proceedings of the 16th international conference on machine learning (ICML'99) (pp. 258–267).
Moyotl-Hernández, E., & Jiménez-Salazar, H. (2005). Enhancement of DTP feature selection method for text categorization. In Proceedings of 6th international conference on computational linguistics and intelligent text processing (CICLing'05), Lecture notes in computer science (Vol. 3406, pp. 719–722).
Ogura, H., Amano, H., & Kondo, M. (2009). Feature selection with a measure of deviations from Poisson in text categorization. Expert Systems with Applications, 36(3), 6826–6832.
Ogura, H., Amano, H., & Kondo, M. (2010). Distinctive characteristics of a metric using deviations from Poisson for feature selection. Expert Systems with Applications, 37(3), 2273–2281.
Piramuthu, S. (2004). Evaluating feature selection methods for learning in data mining applications. European Journal of Operational Research, 156(2), 483–494.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Qi, X., & Davison, B. D. (2009). Web page classification: Features and algorithms. ACM Computing Surveys, 41(2).
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507–2517.
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D., & Stamatopoulos, P. (2003). A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval, 6(1), 49–73.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., & Wang, Z. (2007). A novel feature selection algorithm for text categorization. Expert Systems with Applications, 33(1), 1–5.
Stopwords (2010). Retrieved 06.10.
Tang, B., Shepherd, M., Milios, E., & Heywood, M. I. (2005). Comparing and combining dimension reduction techniques for efficient text clustering. In Proceedings of 2005 international workshop on feature selection for data mining (ICDM'05) (pp. 17–26).
Wang, S., Li, D., Song, X., Wei, Y., & Li, H. (2011). A feature selection method based on improved Fisher's discriminant ratio for text sentiment classification. Expert Systems with Applications, 38(7), 8696–8702.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning (ICML'97) (pp. 412–420).
Yang, S. M., Wu, X.-B., Deng, Z.-H., Zhang, M., & Yang, D.-Q. (2002). Relative term frequency based feature selection for text categorization. In Proceedings of the 1st international conference on machine learning and cybernetics (ICMLC'02) (pp. 1432–1436).
Yao, J. T., & Zhang, M. (2005). Feature selection with adjustable criteria. In Proceedings of 10th international conference on rough sets, fuzzy sets, data mining, and granular computing (RSFDGrC'05), Lecture notes in computer science (Vol. 3641, pp. 204–213).
Yao, Y. Y., & Zhao, Y. (2008). Attribute reduction in decision-theoretic rough set models. Information Science, 178(17), 3356–3373.
Yao, Y. Y., Zhao, Y., & Wang, J. (2008). On reduct construction algorithms. Transactions on Computational Science, 2, 100–117.
Zhou, B., Yao, Y. Y., & Luo, J. (2010). A three-way decision approach to email spam filtering. In Proceedings of 23rd Canadian conference on artificial intelligence (Canadian AI'10), Lecture notes in computer science (Vol. 6085, pp. 28–39).