Selection of the most relevant terms based on a max-min ratio metric for text classification

Selection of the most relevant terms based on a max-min ratio metric for text classification

Accepted Manuscript Selection of the Most Relevant Terms Based on a Max-Min Ratio metric for Text Classification Abdur Rehman, Kashif Javed, Haroon A...

7MB Sizes 1 Downloads 15 Views

Accepted Manuscript

Selection of the Most Relevant Terms Based on a Max-Min Ratio metric for Text Classification Abdur Rehman, Kashif Javed, Haroon A. Babri, Nabeel Asim PII: DOI: Reference:

S0957-4174(18)30445-7 10.1016/j.eswa.2018.07.028 ESWA 12081

To appear in:

Expert Systems With Applications

Received date: Revised date: Accepted date:

8 August 2017 11 July 2018 12 July 2018

Please cite this article as: Abdur Rehman, Kashif Javed, Haroon A. Babri, Nabeel Asim, Selection of the Most Relevant Terms Based on a Max-Min Ratio metric for Text Classification, Expert Systems With Applications (2018), doi: 10.1016/j.eswa.2018.07.028

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights • We Illustrated weaknesses of balanced accuracy and normalized difference measures.

CR IP T

• We proposed a new feature ranking metric called max-min ratio (MMR). • MMR better estimates the true worth of a term in high class skews.

• We tested MMR against 8 well-known metrics on 6 datasets with 2 classifiers.

AC

CE

PT

ED

M

AN US

• MMR statistically outperforms metrics in 76% macro F1 cases and 74% micro F1 cases.

1

ACCEPTED MANUSCRIPT

Selection of the Most Relevant Terms Based on a Max-Min Ratio metric for Text Classification

CR IP T

Abdur Rehmana,∗, Kashif Javedb , Haroon A. Babrib , Nabeel Asimc a

AN US

Department of Computer Science, University of Gujrat, Gujrat, Pakistan b Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan c Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan

Abstract

AC

CE

PT

ED

M

Text classification automatically assigns text documents to one or more predefined categories based on their content. In text classification, data are characterized by a large number of highly sparse terms and highly skewed categories. Working with all the terms in the data has an adverse impact on the accuracy and efficiency of text classification tasks. A feature selection algorithm helps in selecting the most relevant terms. In this paper, we propose a new feature ranking metric called max-min ratio (MMR). It is a product of max-min ratio of the true positives and false positives and their difference, which allows MMR to select smaller subsets of more relevant terms even in the presence of highly skewed classes. This results in performing text classification with higher accuracy and more efficiency. To investigate the effectiveness of our newly proposed metric, we compare its performance against eight metrics (balanced accuracy measure, information gain, chi-squared, Poisson ratio, Gini index, odds ratio, distinguishing feature selector, and normalized difference measure) on six data sets namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), and 20 Newsgroups using the multinomial naive Bayes (MNB) and support vector machines (SVM) classifiers. The statistical significance of MMR has been estimated on 5 different splits of training and ∗

Corresponding author Email addresses: [email protected] (Abdur Rehman), [email protected] (Kashif Javed), [email protected] (Haroon A. Babri), [email protected] (Nabeel Asim) Preprint submitted to Expert Systems with Applications

July 18, 2018

ACCEPTED MANUSCRIPT

CR IP T

test data sets using the one-way analysis of variance (ANOVA) method and a multiple comparisons test based on Tukey-Kramer method. We found that performance of MMR is statistically significant than that of the other 8 metrics in 76.2% cases in terms of macro F1 measure and in 74.4% cases in terms of micro F1 measure. Keywords: Text classification, Feature selection, Feature ranking metrics. 1. Introduction

AC

CE

PT

ED

M

AN US

In today’s life with modern technology, humans are generating large amounts of digital data in textual, audio, and visual form at a fast rate. For example, as of 2014, users posted almost 300,000 tweets on Twitter, generated more than four million queries on the Google search engine, and sent 2,40,000,000 email messages in one minute (James, 2014). Arranging text documents into different categories makes text related tasks such as a search for a certain information more efficient (Chen et al., 1996). Text classification also known as text categorization is a task to assign each document present in a corpus (a collection of documents under consideration) to one or more than one given categories (Sebastiani, 2002). These days, text classification is performed automatically and efficiently using machine learning algorithms and is found in a number of applications in a number of domains, such as information retrieval and text mining (Aggarwal & Zhai, 2012). Some wellknown examples of text classification include placing documents in relevant folders, separating spam emails from ham emails, and finding user interests based on their comments in social media (Forman, 2008). Text classification is also playing a vital role in promoting businesses. Financial news can contribute in changing stock price returns. Prediction of stock price by analyzing financial news is yet another application of text classification (Li et al., 2014). Furthermore, advances in social media has greatly increased customer to business interactions. Marketing departments are now using targeted marketing strategies to outreach their customers. Classifying customers into different categories can make the job of a marketer much easier. User comments on social media about a product contains valuable information about user interests. Text classification is being used to extract customer preferences from volumes of data to enhance sales and developing marking strategies (Rao et al., 2016). The machine learning approach to text classification has three main phases: 3

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

feature extraction, dimensionality reduction and classification (Marin et al., 2014). In the feature extraction or alternatively data representation stage, we represent the raw text documents with a numerical form. The most commonly used representation for text documents is “Bag of Words” (BoW) model, which has been adopted from information retrieval (Lan et al., 2007). BoW ignores the structure and ordering of the text and assumes a document to be a sequence of independent words or terms1 . To make the BoW representation more effective, stemming, removal of stop words and pruning are employed. Stemming converts the inflected (or derived) words to their word stem, base or root form. Stop words such as “is”, “the”, “was” etc. are used for grammatical structure and do not convey any meanings (Joshi et al., 2012). These words are removed with the help of a stop words list. In text documents, there are few very frequently occurring terms, and a number of rarely occurring terms (Grimmer & Stewart, 2013). Pruning allows the removal of topic specific too frequent terms and rarely occurring terms (Aggarwal & Zhai, 2012). It is implemented by removing terms occurring above a certain upper threshold or below a certain lower threshold. Pruning acts as a preprocessing step to feature selection (Srividhya & Anitha, 2011). The outcome of this stage is a document represented in the form of a vector d = (tw1 , tw2 , ..., twi , ..., twv ), where twi is the weight of ith term (denoted by ti ) in a vocabulary containing v number of terms (Aggarwal & Zhai, 2012; Lan et al., 2009). A number of weighting schemes have been proposed in the literature. Most of them are based on the occurrences of a term (Wallach, 2006) called term count or term frequency, which is the term count normalized by the length of a document. The most popular weighting scheme is term frequency-inverse document frequency (TF-IDF) (Zhang et al., 2011). The vector d is inherently high dimensional. Even for a moderate sized data set, it can easily contain tens of thousands of unique words (Joachims, 2002; Wang et al., 2014). High dimensional data degrades the training time and classification accuracy of a classifier (Wang et al., 2016; Wu & Zhang, 2004). Furthermore, the data is highly sparse where most of the entries are zero and contain no information (Su et al., 2011). In the dimensionality reduction stage, the most useful information present in the data related to the category2 is captured and irrelevant dimensions are removed. The 1 2

Features, words and terms are used interchangeably throughout the paper. In text classification, category is used for the class.

4

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

text classification researchers have been focusing on feature ranking metrics to perform feature selection or in other words to reduce the dimensionality. Feature selection can also be performed using feature subset selection algorithms (e.g. filters, wrappers and embedded methods) (Guyon & Elisseeff, 2003). Algorithms of this category evaluate discrimination power of a feature in the presence of other features. On the other hand, a feature ranking algorithm estimates the strength of an individual feature to discriminate instances among different classes and assigns a score to it accordingly (Chen et al., 2009; Van Hulse et al., 2011). The features are then sorted in the decreasing order of importance and the top k ranked features are chosen (Guyon & Elisseeff, 2003). Here, the time complexity grows linearly in terms of the length of vector d or in other words v number of terms present in the vocabulary. For feature subset algorithms, the time complexity is typically O(v 2 ). Since the first feature ranking metric was proposed, researchers have been on the quest for designing better metrics for text classification (Forman, 2003; Uysal & Gunal, 2012; Rehman et al., 2017). Mostly, the metrics are based on the document frequency of a term in the positive class (i.e., the number of true positives), and the document frequency in the negative class (i.e., the number of false positives). Accuracy measure (ACC) is a simple metric that evaluates the worth of a term using the difference between true positives and false positives. An improvement in it resulted in the metric called balanced accuracy measure (ACC2) (Forman, 2003). It ranks features by taking absolute difference of true positive rate (tpr) and false positive rate (f pr) (Forman, 2003). The novelty of this work is to propose a new feature ranking metric called Max-min ratio (MMR). It addresses the issues of ACC2 and a recently proposed metric normalized difference measure (NDM) (Rehman et al., 2017). ACC2 treats two terms having the same difference between tpr and f pr equally. Thus, it can assign the same score to relevant and irrelevant terms. NDM can assign higher ranks to highly irrelevant sparse terms in large and highly skewed data. MMR solves these two issues. The objective of MMR is to select the most relevant terms present in text classification, where the data is highly sparse and its classes are highly skewed. To look into the usefulness of MMR, we compare it against eight well known feature ranking metrics including balanced accuracy (ACC2), information gain (IG), odds ratio (OR), chi-squared (CHI), Poisson ratio (POIS), gini index (GI), distinguishing feature selector (DFS) and NDM on six widely used text data sets 5

ACCEPTED MANUSCRIPT

CR IP T

using the multinomial naive Bayes (MNB) (Stigler, 1983) and support vector machines (SVM) (Cortes & Vapnik, 1995) classifiers. The remainder of this paper is organized in five sections. We present the feature ranking metrics already proposed in the literature in Section 2. Section 3 illustrates problems of ACC2 and NDM. It also explains the working of the newly proposed MMR feature ranking metric. Section 4 discusses the details related to our experiments. We present the results and discuss them in Section 5 and draw conclusions of the work done in this paper in Section 6. 2. Related works

AN US

This section is divided into two subsections. The first section contains the recent works of feature selection algorithms for text classification while the second discusses the feature ranking algorithms.

AC

CE

PT

ED

M

2.1. Overview of feature selection methods for text classification The goal of feature selection is to select a compact feature subset with maximal discriminative capability. This optimal subset should consist of features that together have the highest relevance for the class variable and lowest redundancy among them. One criterion with which feature selection algorithms proposed in the literature can be categorized is whether discrimination power of features is evaluated individually or in a subset (Bolon-Canedo et al., 2015; Guyon et al., 2006). In individual evaluation, an algorithm assesses usefulness of features for the class variable individually and independent of other features. Such algorithms are known as a feature ranking or univariate feature selection algorithms. The output is a list of features that are sorted in decreasing order of relevance. A subset of top ranked features generated from this list contains features highly relevant for the class and can possibly be redundant. Moreover, two or more features which are individually less relevant but together can provide good class discrimination can be missed (Guyon et al., 2006). The solution lies in designing a subset evaluation or multivariate algorithm that assesses candidate subsets of features that together can provide better discrimination. Unlike univariate algorithms, multivariate algorithms can handle feature redundancy with feature relevance while searching for the optimal subset. Another criterion with which feature selection algorithms can be categorized is their relationship with the classification algorithms (Li et al., 2017; Javed et al., 2012). Broadly, there can be three categories. Filter methods depend on the general 6

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

characteristics of the data and carry out the feature selection process independent of the classification algorithm (Lal et al., 2006; Dash & Liu, 1997). Wrappers involve a classification algorithm to assess the relative usefulness of feature subsets (Kohavi & John, 1997). In other words, the classification algorithm guides a wrapper to search for the optimal subset of features. Embedded methods perform feature selection in the process of training the classifier (Guyon et al., 2002). In other words, the search for the optimal subset of features is embedded into the classifier’s training process. Based on the second criterion, feature ranking algorithms and filters belong to the same group and are thus used interchangeably. However, filters can perform feature selection in a multivariate fashion unlike feature ranking algorithms (Peng et al., 2005; Yu & Liu, 2003). Feature ranking algorithms are computationally the least expensive and thus, preferred for performing feature selection of high-dimensional data such as text data (Sebastiani, 2002; Joachims, 2002). Mostly, text classification researchers have been focusing on designing univariate feature ranking algorithms. Some well-known examples are discussed in subsection 2.2. In recent years, we can find that new feature selection frameworks are being designed to better address issues of text classification. One of the challenges is how to design feature selection algorithms that take term dependencies into account while not allowing feature selection process to become computationally intractable. To this end, recently twostage algorithms have been proposed. The first stage selects a subset of highly relevant terms with the help of a feature ranking algorithm while the second stage eliminates redundant features from the reduced space of features using a multivariate method. In (Uguz, 2011), IG is used in the first stage to reduce the original space of features and then, principal component analysis or a genetic algorithm is used in the second stage. The two stage algorithm performed better than IG. Similarly, DFS and latent semantic indexing were combined in two stages to enhance the performance of text classification in (Uysal & Gunal, 2014). In (Javed et al., 2015), IG was used in the first stage with a Markov blanket algorithm in the second stage, which has attained better performance as compared to IG. More recently, a multivariate algorithm was proposed in (Labani et al., 2018), which selects the maximum relevant features in the first step, while the correlation between features is considered in the second step. All these studies have shown that removal of redundant terms from the optimal subset can result in higher classification accuracy as compared to feature ranking algorithms. 7

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

Another challenge faced by the feature selection process is due to the high imbalance or skew among the categories of text data. Feature ranking algorithms become biased towards larger classes and select a surplus of strongly predictive features for them while ignoring features for the discrimination of smaller classes (Forman, 2004). This thus results in poor performance by text classifiers. In (Ogura et al., 2011), the abilities and characteristics of various metrics for performing feature selection with the class imbalance problem were investigated. Feature ranking metrics can be one-sided or twosided. The former metrics preserve the sign of the weight assigned to a term and can therefore generate both positive and negative values. Two-sided metrics do not preserve the sign and thus generate only positive weights to be assigned to terms. Due to this inherent property, one-sided metrics select those features (i.e., positive features) that are good discriminators of the target class while the latter can implicitly combine features most indicative of membership (positive features) and nonmembership (negative features). The class imbalance problem encountered due to the one-versus-rest setting causes negative features to be rarely added to the optimal subset. To look into this issue, we can divide feature ranking algorithms into global and local algorithms. A global algorithm assigns a unique score to each feature, which can be directly used for ranking the features. On the other hand, a local algorithm calculates multiple class-based scores for each feature and requires a globalization policy to convert these multiple local scores into a unique global score. In (Uysal, 2016), an improved global feature selection scheme was proposed that creates an optimal feature subset representing all classes almost equally. However, this scheme suffers from a pitfall (Agnihotri et al., 2017). In imbalanced data sets with a large number of classes, the distribution of features in the optimal subset should not be equal. If equal number of features are selected from each class, it can exclude important features from the class containing a higher number of features. To solve this problem, a scheme to select variable number of features from each class based on the distribution of terms in the classes was proposed in (Agnihotri et al., 2017).

2.2. Ranking methods for feature selection In this subsection, we discuss some of the well-known feature ranking metrics used for text classification. Most of the feature ranking algorithms are based on document frequency (Forman, 2003). For binary text classification, we can define four document

8

ACCEPTED MANUSCRIPT

AN US

CR IP T

frequencies for a term. True positives (tp) are the number of documents belonging to the positive class and containing the term. True negatives (tn) are the number of documents not belonging to the positive class and not containing the term. False positives (f p) are the number of documents belonging to the negative class and containing the term. False negatives (f n) are the number of documents not belonging to the negative class and not containing the term. ACC is the simplest metric to estimate the discrimination power of a term. The score that is assigned to a term is the difference between true positives and false positives. Equation 1 presents the mathematical expression of ACC (Forman, 2003). ACC = tp − f p

(1)

ED

M

ACC ranks frequent terms in the positive class higher than frequent terms in negative class. It will therefore be unable to select a good mix of positive and negative terms for unbalanced data sets. The improvement suggested in ACC is the ACC2 measure. It ranks terms by taking the absolute difference between true positive rate (tpr) and false positive rate (f pr), where tpr and f pr are true positives and false positives normalized by the category size. For tp tn and f pr = tn+f (Flach, 2012). ACC2 uses Equation 2 a term, tpr = tp+f n p to estimate a term’s importance. ACC2 = |tpr − f pr|

(2)

AC

CE

PT

ACC2 solves the problem of class imbalance faced by ACC. It also provides a solution to the problem of ranking positive features higher than negative features by taking the absolute value of the difference. However, it can assign to the relevant and irrelevant terms the same rank when two terms have the same difference between tpr and f pr. Even a highly frequent term can have ACC2 value equal to a less frequent term. Such terms should not be ranked equally. The problem with ACC2 is that it does not take the relative document frequencies of a term in both classes into account. We illustrate these issues in Section 3. IG is a widely used metric to estimate the usefulness of a feature for a classification task in machine learning (Xu et al., 2007). It calculates the amount of information gained by the inclusion or removal of a term for the prediction of a class (Yang & Pedersen, 1997). Equation 3 shows the mathematical definition of IG (Yang & Pedersen, 1997). 9

ACCEPTED MANUSCRIPT

P (ci ) log P (ci ) + P (t)

i=1

m X i=1

P (ci |t) log P (ci |t)+ P (t¯)

m X

CR IP T

IG = −

m X

i=1

P (ci |t¯) log P (ci |t¯) (3)

AN US

where ci is the ith category, m is the total number of categories, P (ci ) is the proportion of the ith class among all the classes and P (ci |t) is the probability of class ci when term t is present and P (ci |t¯) is the probability of the ith class when term t is absent. P (t) is the probability of the term t while P (t¯) is the probability of terms other than t. CHI is a well-known statistical test. It estimates the worth of a term by measuring the divergence from the distribution expected if we assume the occurrence of the term is independent of the class label. In Equation 4, we provide the mathematical expression of CHI (Forman, 2003).

ED

M

CHI = t(tp, (tp + f p)Ppos ) + t(f n, (f n + tn)Ppos )+ t(f p, (tp + f p)Pneg ) + t(tn, (f n + tn)Pneg ) (4) where

t(count, expect) = (count − expect)2 /expect

AC

CE

PT

where Ppos is the probability of the positive class and Pneg is the probability of the negative class. CHI does not perform well for documents having small term counts as shown by (Forman, 2003). OR is the ratio between the probability of occurrence of an event of interest and the probability that the event does not occur (Bland & Altman, 2000). For text classification, the event of interest is the occurrence of a term. OR is given in Equation 5. OR =

tp × tn tpr(1 − f pr) = (1 − tpr)f pr fp × fn

(5)

OR gives highest preference to the rare features present in only the positive class (Brank et al., 2002). GI is used to estimate the distribution of an attribute over different 10

ACCEPTED MANUSCRIPT

CR IP T

classes (Singh et al., 2010). It estimates the worth of a term with the expression given in Equation 6.  2  2 tp fp 2 2 GI = tpr × + f pr × (6) tp + f p tp + f p

DFS is a recently proposed metric (Uysal & Gunal, 2012). According to the criteria described in (Uysal & Gunal, 2012), a term that is present in only one class is of highest importance in discriminating among the classes. DFS estimates the worth of a term using Eqaution 7. m X i=1

P (ci |t) P (t¯|ci ) + P (t|¯ ci ) + 1

(7)

AN US

DF S =

ED

M

where P (t¯|ci ) is the probability of absence of the term t in class ci and P (t|c¯i ) is the probability of term t when other classes are present. A more recently proposed feature ranking metric is NDM (Rehman et al., 2017). It is an improvement over ACC2, which treats two relevant and irrelevant terms equally if they have the same |tpr − f pr| values. NDM takes care of this by assigning a higher rank to the term which has a lower min(tpr, f pr) value where min is the function to find minimum of the two values. Mathematical definition of NDM is given in Equation 8. N DM =

|tpr − f pr| min(tpr, f pr)

(8)

CE

PT

In case min(tpr, f pr) is equal to 0, the denominator is replaced by a small value  to avoid infinite values. However, NDM has some drawbacks in ranking terms that we will discuss in Section 3. 3. Max-Min Ratio: Our Proposed Feature Ranking Measure

AC

In this section, we highlight the issues associated with ACC2 and NDM metrics and describe how our newly proposed MMR feature ranking metric is an improvement over these two metrics. 3.1. Balanced Accuracy Measure A relatively more frequent term in one class than the other class has higher discrimination power than a term having almost equal document frequencies in both classes. For such terms, ACC2 performs well as it considers the 11

(a) Isocline Contours of ACC2

AN US

CR IP T

ACCEPTED MANUSCRIPT

(b) The Seven Terms on ACC2 Contour Lines

Figure 1: ACC2 contour lines.

AC

CE

PT

ED

M

difference between the document frequencies of both the classes. However, the ACC2 measure has some shortcomings in capturing the true relevance of a term as discussed next. Figure 1(a) shows the contour lines for the ACC2 measure. All the terms located on a given contour line (where |tpr−f pr| = constant) will be assigned the same ACC2 score and will have the same rank. However, not all terms on a given contour are of the same relevance (Rehman et al., 2017). The issues with this ranking strategy are highlighted with the help of an example given in Table 1. We have shown a data set, which consists of twenty documents with seven words divided into two categories namely home and wild. The data set is balanced where each class has ten documents. The document frequencies of the terms in each class are also shown in the table. The scores of the terms estimated by ACC2 metric according to Equation 2 are also provided. Before we discuss ranking of ACC2, let us describe each term and its association with the categories. Phone has become a necessity in daily life and is a strong indicator of the home related documents. In our example, it is present in documents of the class home only. We can see that it is a highly relevant term for discriminating between the classes. Lion is a wild animal and it is strongly 12

ACCEPTED MANUSCRIPT

Table 1: An Example Data set.

Wild 0 8 6 0 1 6 8

tpr 0.8 0.0 0.0 0.2 0.4 0.9 0.6

f pr 0.0 0.8 0.6 0.0 0.1 0.6 0.8

ACC2 0.8 0.8 0.6 0.2 0.3 0.3 0.2

NDM 80 80 60 20 3 0.5 0.33

MMR 64 64 36 4 1.2 0.45 0.27

CR IP T

Home 8 0 0 2 4 9 6

AN US

Terms Phone Lion Deer USB Cow Water Cat

AC

CE

PT

ED

M

associated with the class wild. Its document frequencies indicate that it is also a highly important term. Deer is also a wild animal. It appears in 60% documents of the class wild in our given example. The obvious class of a document containing the word deer is wild. USB stands for Universal Serial Bus. It is a commonly used port to connect devices for data transfer. Although present in only two documents of the class home, it is a strong indicator of the class home as it is absent in documents of the wild class. Cow is a pet and is kept in farms for milk and meat. It can be also found in the wild. In the example data set, it is present in four documents of the class home and in only one document of the class wild. It has a relatively stronger association to the class home than the class wild in the example data set. Water is one of the most important components for life and is abundantly found on earth. It is present in documents of both wild and home categories. From the document frequencies, we can establish that its presence is a weak indicator of a document’s class. Cat is commonly found in both homes and wild and is also a weak discriminator of the two classes. From Table 1, we can see that ranked list generated by ACC2 is {Phone, Lion, Deer, Water, Cow, Cat, USB}. ACC2 has been successful in capturing true relevance of the first three terms and thus have correctly positioned them in its ranking. The scores of the next four terms highlight the problems of ACC2 measure. ACC2 assigns terms cow and water the same score (i.e., 0.3) although the two terms have different discrimination power. Cow is a stronger indicator of the class home while water occurs almost equally in documents of both the classes and is a weak discriminator. Similarly, we find that ACC2

13

ACCEPTED MANUSCRIPT

Table 2: Example to show issues with the NDM metric.

+ive class 2 10 10 2 80

-ive class 2,800 1 0 1 2

tpr 0.02 0.10 0.10 0.02 0.8

f pr 0.93 0.0003 0 0.0003 0.00067

NDM 45.5 332.4 100 65 1,193

MMR 42.3 33.3 10 1.31 954.4

CR IP T

Terms t1 t2 t3 t4 t5

M

AN US

assigns both the terms USB and cat the same rank. Looking at the document frequencies of USB, we can say that its presence in any document is a strong indicator that the document belongs to the class home. It is more relevant than the term water but is assigned a lower rank by ACC2. Furthermore, the word cat cannot be used to determine the class of a document as it occurs very frequently in both classes. But ACC2 considers the terms USB and cat to have equal discrimination strength, which is against the normal observation. Figure 1(b) shows the seven terms plotted on the contour lines of ACC2 and visually highlights these issues.

AC

CE

PT

ED

3.2. Normalized difference measure In our earlier work (Rehman et al., 2017), we proposed a normalized difference measure, which is given in Equation 8 to address the above-mentioned issues of ACC2. When NDM is applied on the data set of Table 1, the ranked list generated by NDM is {Phone, Lion, Deer, USB, Cow, Water, Cat} if  = 0.01. The terms cow and USB which have higher discriminating capability than water and cat are now assigned higher scores and ranks. Thus, NDM improves ACC2 measure as the former outperforms the later as shown in (Rehman et al., 2017). NDM however has some issues, which we highlight next. In large and highly skewed data, terms are highly sparse3 in one or both the classes. We observe that NDM assigns high scores to sparse terms. Such terms have f pr ≈ 0, tpr ≈ 0 or both f pr, tpr ≈ 0 and the denominator (i.e., min(f pr, tpr)) in Equation 8 overshadows the numerator (i.e., |tpr − f pr|). 3

A term which is absent in most of the documents

14

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

This becomes a problem for the terms highly sparse in both the classes (i.e., both f pr, tpr ≈ 0) because their discrimination power is really poor but NDM will assign higher positions to them in its ranked list of terms. To show this, let us consider another example shown in Table 2. The data set has five terms and two classes. The positive class contains 100 documents while negative class contains 3,000 documents. There are 3,100 documents in total. Term t1 is present in 2,800 documents of negative class, while it occurs in 2 documents of the positive class. Term t2 is present in 10 documents of the positive class, while it occurs in 1 document of the negative class. Term t3 is present in only the positive class and does not occur in the negative class. Term t4 is present in 2 documents of the positive class and in 1 document of the negative class. For t5 , we see that it occurs in 80 documents of the positive class while it occurs twice in the negative class. Table 2 shows document frequencies, and NDM scores for these terms. Intuitively, t5 discriminates between the classes the best as it is present in 80% documents of the positive class and in only 0.067% documents of the negative class, i.e., almost absent. The next best is t1 . These terms should be ranked towards the top. Term t4 has the highest sparsity and is almost absent in both the classes. Being the poorest discriminator, it should be ranked at the bottom. The ranked list {t5 , t1 , t3 , t2 , t4 } indicates the true relevance of the terms based on their f pr and tpr values. The ranked list generated by NDM (with  = 0.001) is {t5 , t2 , t3 , t4 , t1 }. We can see that NDM has assigned a higher rank to t4 as compared to t1 and term t1 is considered to be of the poorest relevance. Thus, a very small value in denominator can result in a very high NDM score failing to capture the true relevance of terms. Furthermore, the NDM values for terms t2 and t3 should be very close. But the large size of the negative class results into a very small f pr value for the term t2 . As a result, the NDM score for t2 becomes 1.5 times larger than the NDM value for t3 . Hence, in the presence of terms highly sparse in both the classes, NDM can fail to correctly locate terms according to their true relevance in its ranking. On one hand, the denominator in Equation 8 of NDM helps resolving the issues of ACC2. For example, ACC2 treats terms USB (|tpr = 0.2 − f pr = 0| =0.2) and Cat (|tpr = 0.6 − f pr = 0.8| =0.2) of Table 1 the same although both have different discrimination power. On the other hand, it overestimates the true relevance of terms highly sparse in both classes (i.e., poor discriminators) and assigns higher ranks to them.

15

ACCEPTED MANUSCRIPT

M M R = max(tpr, f pr) ×

CR IP T

3.3. Max-min ratio To address the above-mentioned problems NDM faces with highly sparse terms, we propose a new metric named max-min ratio . It is given in Equation 9. |tpr − f pr| min(tpr, f pr)

(9)

CE

PT

ED

M

AN US

As both tpr and f pr are fractions, the factor max(tpr, f pr) will refrain the NDM scores from getting too large. This will be helpful particularly for terms with f pr, tpr ≈ 0 or terms with high sparsity in both the classes. The MMR metric will assign the same score as NDM for those terms that have max(tpr, f pr) = 1. Similarly, terms for which max(tpr, f pr) = min(tpr, f pr) or in other words f pr = tpr , MMR behaves same as ACC2. Therefore, MMR is an improvement in the NDM metric and better captures the true relevance of the terms especially for data sets with high class skews, where most of the terms are sparse. Now, let us compare NDM and MMR scores assigned to the terms in Table 2. The ranked list generated by MMR is {t5 , t1 , t2 , t3 , t4 }. We find that term t4 being the poorest discriminator because it is highly sparse in both the classes is assigned the lowest score by MMR and is located at the bottom of its list. Similarly, t1 which is the second best term in Table 2 and was located at the bottom by NDM is now positioned as the second best term by MMR. Therefore, MMR ranks terms better in the presence of sparse terms as compared to NDM. For Table 1, MMR generates the ranked list {Phone, Lion, Deer, USB, Cow, Water, Cat}, which is the same as generated by NDM. Hence, MMR which is derived from ACC2 and NDM addresses the issues these metrics face and we can expect that it selects more relevant terms as compared to these metrics.

AC

3.4. Strengths and Weaknesses MMR metric is derived from ACC2 and NDM. The strength of ACC2 is that it takes the absolute value of the difference of true positive and false positive rates, which allows it to select relevant terms. However, the weakness of ACC2 is that it assigns the same rank to relevant and irrelevant terms when the two terms have the same difference between true positive and false positive rates. NDM addresses this issue. However, the weakness of NDM is that it can assign higher ranks to highly irrelevant sparse terms in large and highly skewed data. 16

ACCEPTED MANUSCRIPT

CR IP T

MMR is an improvement over ACC2 and NDM. It has the strength of better estimating the true relevance of the terms especially for data sets with high class skews, where most of the terms are highly sparse. Its main weakness is how to determine the value of its denominator (i.e., min(tpr, f pr)) when a term is absent in one of the classes. Currently, we have chosen it to be 1/N, where N is total number of documents in a corpus. This has shown to work well but it needs to be further improved especially when N is too large. 4. Empirical evaluation

AN US

In this section, we describe the experimental settings used for the empirical evaluation of the newly proposed MMR metric.

PT

ED

M

4.1. Description of the Data sets We have used six single labelled data sets that are widely used by the researchers. Summary of these data sets is given in Table 3. For each data set, total number of documents, number of terms, number of classes, sizes of the smallest and largest classes have been shown. These data sets are of different sizes and class skews. Five of the data sets WAP, K1a, K1b, RE0, and RE1 have been taken from Karypis Lab, University of Minnesota 4 . The first three were made available as part of the WebACE project (Han et al., 1998) while RE0 and RE1 are subsets of the Reuters data set. The sixth data set we have used is the 20 Newsgroups (20NG) and is downloaded from a well-known web site 5 . The detailed description of it is given in (CardosoCachopo, 2007).

AC

CE

4.2. Data sets pre-processing The data sets are already stemmed and stop words have also been removed. The data sets are further processed to remove too frequent and rare terms (Forman, 2003). For this purpose, pruning was performed by following the guidelines provided by Forman (Forman, 2003). The lower threshold is an absolute value and was set equal to three. Thus, a word, which is present in three or less than three documents will be removed. For the upper threshold, Forman suggests to employ a percentage value. A word present in 4 5

http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download http://ana.cachopo.org/datasets-for-single-label-text-categorization

17

ACCEPTED MANUSCRIPT

Table 3: Summary of the Data sets

K1b

RE0

AC

CE

PT

RE1

20NG

CR IP T

AN US

K1a

Total Terms

M

WAP

Smallest Largest Category Class Class names size size Culture, Media, Multimedia, Business, Politics, Cable, Online, Review, Health, Sports, Art, Variety, Television, Music, 1,560 6,852 5 341 Entertainment, Stage, Film, People, Industry, Technology (20 classes) E, Ec, B, Ea, H, Ev, Ecu, Er, T, Et, Es, P, Em, S, Ep, Emu, Eo, 2,340 8,589 9 449 Ei, Ef, Emm - (20 classes) Politics, Sports, Health, Tech, Business, Entertainment - (6 2,340 8,589 60 1,389 classes) lei, housing, bop, wpi, retail, ipi, 1,504 2,886 11 608 jobs, reserves, cpi, gnp, interest, trade, money-fx - (13 classes) cotton, zinc, copper, ship, carcass, alum, tin, iron, oilseed, gold, meal, wheat, orange, rub1,657 3,037 10 371 ber, cofee, livestock, gas, veg, dlr, cocoa, pet, grain, crude, nat, sugar - (25 classes) talk.religion.misc,talk.politics.misc, alt.atheism,talk.politics.guns, talk.politics.mideast, comp.os.mswindows.misc, comp.sys.mac.harddware, comp.graphics,misc.forsale, 18,828 17,425 628 998 comp.sys.ibm.pc.hardware, sci.electronics,comp.windows.x sci.space,rec.autos,sci.med,sci.crypt, rec.sport.baseball,rec.motorcycles, soc.religion.christian, rec.sport.hockey - (20 classes) 18 Total Docs

ED

Data set

ACCEPTED MANUSCRIPT

CR IP T

25% or more documents should be removed. To represent terms numerically, we have employed the term frequency-inverse document frequency (TF-IDF) weight schema. All the data sets we have used in this work consist of multiple classes. In order to handle multi-class classification problems, there are two common approaches, namely one-versus-rest and one-versus-one (Flach, 2012). For text classification tasks, the former is the most commonly used strategy (Forman, 2003) and thus employed by us in this work.

AN US

4.3. Feature Ranking Metrics We compare the performance of eight feature ranking metrics namely ACC2, IG, POIS, CHI, DFS, GI, ODDS, and NDM against our newly proposed MMR metric. These metrics were implemented using JAVA programming language.

ED

M

4.4. Classification Algorithms To evaluate the quality of terms selected by the feature ranking metrics, we have used two classification algorithms namely naive Bayes and support vector machines classifiers. For the former, the multinomial implementation of WEKA (Hall et al., 2009) was employed while for SVM, we used the LibSVM library (Chang & Lin, 2011). The TF-IDF transformation applied on a document vector before using MNB classifier has resulted in improved text classification performance (Kibriya et al., 2005). We have adopted the same in our experiments.

AC

CE

PT

4.5. Evaluation Metrics To measure the performance of a classifier over a collection of 2-class classification problems, there are two methods for averaging the F1 measure namely macro averaged F1 and micro averaged F1 (Forman, 2003). F1 is a measure of the accuracy of a binary classifier (Aggarwal & Zhai, 2012) and is based on precision (let’s denote it by p) and recall (let’s denote it by r). Precision is estimated as the number of correct results out of the results tp ) (Forman, 2003). Here, tp marked correct by the classifier (i.e., p = tp+f p denotes the true positives and f p is the false positives. Recall is defined as the number of correct results out of actual number of correct results (i.e., r tp = tp+f ) (Forman, 2003). Here, f n denotes the false negatives. F1 measure n is a harmonic mean of precision and recall (Aggarwal & Zhai, 2012). Higher 19

ACCEPTED MANUSCRIPT

k=1

CR IP T

the value, better the classifier’s performance. Its best value is 1 and worst value is zero. If pk denotes the precision and rk is the recall for k th class, then Equation 10 defines the macro averaged F1 (Uysal & Gunal, 2012). PC 2×pk ×rk pk +rk

(10) C where C is the total number of classes in a data set. Micro averaged F1 is computed globally. Equation 11 gives its expression. M acro Averaged F1 =

2 × precisionµ × recallµ precisionµ + recallµ

(11)

AN US

M icro Averaged F1 =

M

where µ indicates micro averaging. Equations 12 and 13 give the expressions for precision and recall for micro averaging (Sebastiani, 2002). PC tpi µ (12) precision = PC i=1 (tp + f p ) i i i=1 PC tpi recallµ = PC i=1 (13) i=1 (tpi + f ni )

AC

CE

PT

ED

4.6. Evaluation procedure After a data set was preprocessed, we randomly split it into two sets. The training set consists of 70% of the documents while the remaining 30% documents will be used as the test set. We then applied a feature ranking metric on the training set of a data set and thus, generated a list of words in the decreasing order of importance. Eight nested subsets were generated by selecting the top 10, 20, 50, 100, 200, 500, 1,000, and 1,500 ranked words. To investigate the quality of each subset, we trained two classifiers (SVM and MNB) and measured the macro F1 and micro F1 values on the test set. These steps were repeated for all the metrics. Over all, we will test MMR’s performance on 8 subsets of top ranked words against a competitor on a data set using two classifiers. Since there are 6 data sets, two classifiers and eight nested subsets of terms, the total number of comparisons between MMR and a metric is 96. As there are 8 metrics being compared against MMR, we will examine 96 × 8 = 768 subsets to evaluate the performance of MMR metric against its competing metrics.

20

ACCEPTED MANUSCRIPT

CR IP T

4.7. Statistical analysis procedure To statistically estimate the performance of MMR against its competing metrics, we repeated our above-mentioned evaluation procedure 5 times. For each iteration, the instances of the training and test sets were randomly selected while maintaining the ratio of training and test sets instances. For each nested subset with a given classifier,

CE

PT

ED

M

AN US

(a) this results in 5 different macro F1 values for each of the 9 feature ranking metrics. (b) we then apply the one-way analysis of variance (ANOVA) (Witte & Witte, 2010) method for comparing the means of these 9 samples. This tests the null hypothesis (H0 ) wherein all population means are the same, or alternative hypothesis (H1 ) wherein at least one population mean is different (Field, 2013). (c) as we are interested in determining the difference between macro F1 values of MMR and those of all the other feature ranking metrics, a multiple comparisons test based on Tukey-Kramer method (Navidi, 2015) is then applied. (d) a 95% confidence interval (CI) for each difference between pairs of means is calculated. (i) if the confidence interval includes zero, the difference in the mean macro F1 value of MMR and that of a metric is not significant, and we cannot reject the hypothesis. We consider it to be a tie. (ii) In case the confidence interval does not include zero, the difference is significant, and we compare the mean value of MMR and that of a competing metric. If performance of MMR is statistically better than that of its competitor, we call it a win for MMR. In case performance of a metric is statistically significant than that of MMR, we consider it to be a loss for MMR.

AC

In the tables of Section 5, the average of macro F1 values over the 5 splits are presented. We use a • to indicate a win for MMR while a loss for MMR is indicated by a ◦. The absence of a symbol means a tie. Additionally, we provide the lower and upper bounds of the 95% CI associated with each test carried out on a nested subset. The same statistical analysis procedure was repeated on the micro F1 values of the 9 feature ranking metrics.

21

ACCEPTED MANUSCRIPT

5. Results and discussion

CR IP T

In this section, we present and compare the results of our proposed MMR feature ranking metric with eight other well known feature ranking metrics using macro average and micro average F1 measures. In the tables to follow, we use G to denote a subset that contains the top ranked terms of a feature ranking metric and |G| represents its cardinality.

CE

PT

ED

M

AN US

5.1. The WAP Data set The ratio of number of terms to number of documents is greater than 1 and is the highest for WAP among all data sets that we have considered in our work. Additionally, it is a highly skewed data set. Tables 4 and 5 show macro average F1 values attained by MMR and the other eight feature ranking metrics on the WAP data set for both SVM and MNB classifiers. We can observe that the highest macro F1 value with SVM is 0.7 and is obtained with the smallest subset of top 200 terms ranked by MMR. In case of MNB, the highest macro F1 value is 0.78 and is attained by top 500 ranked terms of MMR. We can find that macro F1 values of MMR are statistically significant as compared to the macro F1 values of the other 8 metrics in 114 out of 128 subsets we have evaluated. Similarly, the micro average F1 performances of the feature ranking metrics using SVM and MNB are presented in Tables 6 and 7. The highest micro F1 value with SVM is 0.84, which is obtained by the smallest subset of top 500 ranked terms of MMR. In the MNB case, the highest value of micro F1 is also 0.84 and is achieved by MMR with top 500 ranked terms. It can be observed that MMR outperforms the other metrics in 115 out of 128 subsets. Hence, we can say that the macro F1 and micro F1 performances of MMR averaged on 5 different splits of training and test sets are significantly better than those of the other 8 metrics.

AC

5.2. The K1a Data set The ratio of number of terms to number of documents is significantly high for K1a and K1b. One main difference between them is that the classes of K1a are more skewed as compared to those of K1b. Tables 8 and 9 tabulate the macro F1 performances. The SVM classifier attains the highest macro F1 value of 0.71 with the top 500 MMR ranked terms. For MNB classifier, the highest macro F1 value is 0.76, which is achieved by the smallest subset of

22

CE

23

0.61

1,500

F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI

PT

DFS 0.17 • [0.2 0.29] 0.26• [0.16 0.25] 0.35• [0.18 0.27] 0.41• [0.18 0.3] 0.54• [0.12 0.22] 0.61• [0.05 0.14] 0.67 [-0.07 0.03] 0.64 [-0.07 0.02]

ED

CHI 0.39 [-0.03 0.06] 0.42• [4E-4 0.1] 0.46• [0.06 0.15] 0.46• [0.13 0.25] 0.45 • [0.2 0.3] 0.51• [0.15 0.24] 0.56• [0.05 0.15] 0.57• [0.01 0.09]

M

NDM 0.4 [-0.03 0.06] 0.42• [1E-4 0.1] 0.41• [0.11 0.2] 0.4• [0.19 0.31] 0.4 • [0.25 0.35] 0.46• [0.2 0.29] 0.49• [0.11 0.21] 0.53• [0.04 0.13]

POIS 0.39 [-0.03 0.06] 0.42 [-1E-3 0.09] 0.46• [0.06 0.15] 0.47 • [0.12 0.25] 0.47• [0.19 0.28] 0.52 • [0.14 0.23] 0.55• [0.05 0.15] 0.56• [0.01 0.1]

CR IP T

ACC2 0.31• [0.06 0.14] 0.37• [0.05 0.14] 0.47• [0.05 0.15] 0.55• [0.05 0.17] 0.59• [0.06 0.16] 0.59• [0.07 0.16] 0.58• [0.03 0.12] 0.57• [.0005 0.09]

AN US

ODDS 0.43 [-0.07 0.02] 0.44 [-0.02 0.07] 0.43• [0.09 0.18] 0.41• [0.18 0.31] 0.39• [0.26 0.36] 0.41 • [0.25 0.34] 0.45• [0.15 0.25] 0.47• [0.1 0.18]

GI 0.32• [0.04 0.13] 0.37• [0.05 0.15] 0.44• [0.08 0.17] 0.54• [0.06 0.18] 0.57• [0.09 0.19] 0.61• [0.05 0.14] 0.63 [-0.03 0.07] 0.61 [-0.03 0.05]

of top ranked terms with highest macro F1 values achieved by SVM classifier are underlined.

win for MMR while ◦ denotes a win for the competing metric. Otherwise, there is no significant change and is taken to be a tie. The smallest subset

method to determine which pairs of means are significantly different, and which are not. The 95% CI for the difference is shown. A • represents a

and test sets. F1 shows the average macro F1 values. We have used the ANOVA method with multiple comparisons test based on Tukey-Kramer

G contains the top ranked terms and |G| represents its cardinality. For statistical significance, results have been taken on 5 different splits of training

0.65

1,000

0.65

100

0.7

0.57

50

500

0.47

20

0.7

0.41

10

200

MMR

|G|

Table 4: Macro F1 Performance of MMR against eight well-known metrics on the WAP Data set averaged over 5 trials with SVM classifier

AC

IG 0.37 [-0.01 0.08] 0.42 [-4E-3 0.09] 0.47• [0.05 0.14] 0.51• [0.08 0.2] 0.53• [0.12 0.22] 0.57• [0.09 0.18] 0.59• [0.02 0.11] 0.58 [-0.01 0.07]

ACCEPTED MANUSCRIPT

CE

24

0.7

1,500

F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI

ED

CHI 0.12 • [0.25 0.32] 0.12 • [0.32 0.41] 0.16 • [0.41 0.5] 0.23 • [0.42 0.51] 0.31 • [0.41 0.49] 0.5 • [0.21 0.34] 0.63 • [0.08 0.16] 0.64 • [0.02 0.1]

PT

DFS 0.16 • [0.21 0.28] 0.2 • [0.24 0.34] 0.3 • [0.28 0.37] 0.35 • [0.31 0.4] 0.4 • [0.32 0.4] 0.42 • [0.29 0.42] 0.42 • [0.29 0.37] 0.4 • [0.26 0.34]

M

ACC2 0.02 • [0.34 0.42] 0.13 • [0.31 0.41] 0.33 • [0.24 0.33] 0.46 • [0.2 0.29] 0.58 • [0.14 0.22] 0.67 • [0.05 0.18] 0.68 • [0.02 0.1] 0.65 • [0.01 0.09]

POIS 0.12 • [0.25 0.32] 0.13 • [0.32 0.41] 0.17 • [0.41 0.5] 0.25 • [0.41 0.5] 0.35 • [0.37 0.45] 0.54 • [0.18 0.31] 0.64 • [0.07 0.15] 0.64 • [0.02 0.1]

CR IP T

NDM 0.05 • [0.32 0.39] 0.12 • [0.32 0.42] 0.18 • [0.4 0.49] 0.18 • [0.47 0.56] 0.23 • [0.49 0.57] 0.38 • [0.33 0.46] 0.57 • [0.14 0.22] 0.58 • [0.08 0.16]

AN US

ODDS 0.07 • [0.3 0.37] 0.1 • [0.35 0.44] 0.13 • [0.44 0.53] 0.14 • [0.52 0.61] 0.14 • [0.57 0.65] 0.21 • [0.51 0.64] 0.33 • [0.38 0.46] 0.3 • [0.36 0.44]

GI 0.12 • [0.25 0.32] 0.29 • [0.16 0.25] 0.41 • [0.17 0.26] 0.45 • [0.2 0.29] 0.49 • [0.23 0.31] 0.5 • [0.21 0.34] 0.53 • [0.18 0.26] 0.53 • [0.13 0.21]

of top ranked terms with highest macro F1 values achieved by MNB classifier are underlined.

win for MMR while ◦ denotes a win for the competing metric. Otherwise, there is no significant change and is taken to be a tie. The smallest subset

method to determine which pairs of means are significantly different, and which are not. The 95% CI for the difference is shown. A • represents a

and test sets. F1 shows the average macro F1 values. We have used the ANOVA method with multiple comparisons test based on Tukey-Kramer

G contains the top ranked terms and |G| represents its cardinality. For statistical significance, results have been taken on 5 different splits of training

0.75

1,000

0.7

100

0.78

0.62

50

500

0.49

20

0.76

0.4

10

200

MMR

|G|

Table 5: Macro F1 Performance of MMR against eight well-known metrics on the WAP Data set averaged over 5 trials with MNB classifier

AC IG 0.05 • [0.31 0.39] 0.12 • [0.32 0.42] 0.26 • [0.31 0.4] 0.42 • [0.23 0.32] 0.54 • [0.18 0.26] 0.67 • [0.05 0.18] 0.68 • [0.02 0.1] 0.67 [-0.01 0.07]

ACCEPTED MANUSCRIPT

CE

25

0.83

1,500

F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI

PT

DFS 0.41 • [0.16 0.21] 0.5 • [0.13 0.18] 0.57 • [0.14 0.19] 0.64 • [0.11 0.15] 0.73 • [0.07 0.11] 0.8 • [0.02 0.06] 0.83 [-.003 0.03] 0.83 [-0.01 0.02]

ED

CHI 0.57 [-.003 0.04] 0.62 • [0.01 0.06] 0.65 • [0.06 0.11] 0.67 • [0.08 0.12] 0.7 • [0.1 0.13] 0.76 • [0.06 0.1] 0.79 • [0.03 0.07] 0.8 • [0.02 0.05]

M

NDM 0.61 [-0.04 0.01] 0.63 [-.003 0.05] 0.62 • [0.09 0.14] 0.63 • [0.12 0.17] 0.66 • [0.14 0.18] 0.71 • [0.11 0.15] 0.74 • [0.08 0.11] 0.76 • [0.05 0.08]

POIS 0.57 [-.003 0.04] 0.62 • [0.005 0.05] 0.65 • [0.06 0.11] 0.67 • [0.08 0.12] 0.71 • [0.09 0.13] 0.77 • [0.05 0.09] 0.79 • [0.03 0.06] 0.79 • [0.02 0.05]

CR IP T

ACC2 0.53 • [0.03 0.08] 0.57 • [0.05 0.1] 0.68 • [0.02 0.07] 0.75 [-.002 0.04] 0.79 • [0.01 0.05] 0.81 • [0.01 0.05] 0.81 • [0.01 0.04] 0.81 • [0.01 0.04]

AN US

ODDS 0.62 ◦ [-0.05 -0.01] 0.63 [-0.01 0.04] 0.6 • [0.11 0.16] 0.61 • [0.14 0.18] 0.61 • [0.2 0.23] 0.62 • [0.2 0.24] 0.65 • [0.17 0.2] 0.69 • [0.13 0.16]

GI 0.51 • [0.05 0.1] 0.58 • [0.04 0.09] 0.64 • [0.06 0.11] 0.74 • [0.01 0.05] 0.77 • [0.03 0.07] 0.81 • [0.01 0.05] 0.82 • [0.01 0.04] 0.81 • [0.002 0.03]

of top ranked terms with highest micro F1 values achieved by SVM classifier are underlined.

win for MMR while ◦ denotes a win for the competing metric. Otherwise, there is no significant change and is taken to be a tie. The smallest subset

method to determine which pairs of means are significantly different, and which are not. The 95% CI for the difference is shown. A • represents a

and test sets. F1 shows the average micro F1 values. We have used the ANOVA method with multiple comparisons test based on Tukey-Kramer

G contains the top ranked terms and |G| represents its cardinality. For statistical significance, results have been taken on 5 different splits of training

0.84

1,000

0.77

100

0.84

0.73

50

500

0.65

20

0.82

0.59

10

200

MMR

|G|

Table 6: Micro F1 Performance of MMR against eight well-known metrics on the WAP Data set averaged over 5 trials with SVM classifier

AC

IG 0.59 [-0.02 0.03] 0.62 • [7E-4 0.05] 0.67 • [0.04 0.09] 0.72 • [0.03 0.07] 0.77 • [0.04 0.07] 0.8 • [0.02 0.06] 0.81 • [0.01 0.04] 0.81 • [8E-4 0.03]

ACCEPTED MANUSCRIPT

CE

26

0.84

1,500

F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI F1 CI

ED

CHI 0.12 • [0.37 0.5] 0.25 • [0.36 0.44] 0.36 • [0.35 0.42] 0.45 • [0.28 0.35] 0.57 • [0.22 0.28] 0.73 • [0.08 0.14] 0.79 • [0.03 0.07] 0.81 • [0.01 0.06]

PT

DFS 0.38 • [0.11 0.24] 0.46 • [0.14 0.23] 0.52 • [0.19 0.26] 0.57 • [0.16 0.23] 0.62 • [0.17 0.22] 0.67 • [0.14 0.2] 0.68 • [0.14 0.18] 0.67 • [0.15 0.19]

M

ACC2 0.1 • [0.39 0.53] 0.32 • [0.29 0.37] 0.56 • [0.15 0.22] 0.68 • [0.05 0.12] 0.76 • [0.03 0.09] 0.81 • [0.01 0.06] 0.82 [-0.002 0.04] 0.82 • [2E-05 0.05]

NDM 0.08 • [0.41 0.54] 0.39 • [0.21 0.29] 0.45 • [0.26 0.33] 0.47 • [0.26 0.33] 0.51 • [0.28 0.34] 0.63 • [0.18 0.24] 0.74 • [0.08 0.12] 0.76 • [0.06 0.1 ]

POIS 0.12 • [0.37 0.5] 0.25 • [0.36 0.44] 0.37 • [0.34 0.41] 0.48 • [0.25 0.32] 0.61 • [0.18 0.24] 0.74 • [0.07 0.13] 0.8 • [0.02 0.06] 0.81 • [0.01 0.05]

CR IP T

AN US

ODDS 0.09 • [0.41 0.54] 0.13 • [0.47 0.55] 0.19 • [0.52 0.58] 0.23 • [0.5 0.57] 0.27 • [0.52 0.58] 0.34 • [0.47 0.52] 0.49 • [0.33 0.37] 0.55 • [0.27 0.31]

GI 0.1 • [0.4 0.53] 0.44 • [0.17 0.25] 0.6 • [0.11 0.17] 0.66 • [0.07 0.14] 0.72 • [0.07 0.13] 0.75 • [0.06 0.12] 0.78 • [0.04 0.08] 0.77 • [0.04 0.09]

IG 0.13 • [0.37 0.5] 0.23 • [0.37 0.45] 0.5 • [0.21 0.28] 0.65 • [0.08 0.15] 0.74 • [0.05 0.11] 0.8 • [0.01 0.07] 0.83 [-0.003 0.04] 0.83 [-0.02 0.03]

Notation and the statistical testing procedure are as described in the note to Table 6.


top 500 MMR ranked terms. Analyzing the overall macro F1 performance, we find that MMR outperforms the 8 metrics in 108 out of 128 subsets. As far as the micro F1 performance is concerned, Tables 10 and 11 show the micro F1 results of the metrics. We can observe that the highest value with SVM is 0.86, achieved by the subset containing the top 1,000 terms of MMR. For the MNB classifier, the highest micro F1 is 0.85, which is generated by the top 1,500 terms of MMR. It can also be seen that, out of 128 subsets, MMR's performance is statistically significantly better than that of the other 8 metrics in 100 subsets.


5.3. The K1b Data set
Now, we look at the results obtained on the K1b data set. Tables 12 and 13 show the performance of MMR along with the other 8 metrics in terms of macro F1. For SVM, the highest macro F1 value is 0.96, obtained by the smallest subset of the top 100 ranked MMR terms. In the case of MNB, the smallest subset of the top 50 ranked MMR terms has resulted in the highest value, which is 0.94. The overall performance on 128 subsets indicates that MMR is statistically better than the other 8 feature ranking metrics in 103 subsets. From Tables 14 and 15, we can observe that the highest micro F1 value using SVM is 0.98, obtained by the smallest subset of MMR with the top 200 terms, while DFS has been able to generate the smallest subset of top ranked terms with the highest micro F1 value in the MNB case: its top 100 ranked terms have resulted in a micro F1 of 0.96. In 91 out of 128 subsets, MMR has attained higher micro F1 values than its counterparts.


5.4. The RE0 Data set
The RE0 data set has the smallest number of documents among the data sets we have considered in this work. Its classes are also highly skewed. From Tables 16 and 17, we find that the highest macro F1 value using SVM is 0.78, attained by the top 100 ranked MMR terms. For the MNB case, 0.64 is the highest macro F1 value and is achieved by the top 100 words ranked by MMR. Also, it can be seen that MMR shows an improvement over the macro F1 values of the other 8 metrics in 91 out of 128 subsets. Next, we look at the micro F1 performances presented in Tables 18 and 19. For SVM, 0.84 is the highest micro F1 value, which is attained by the smallest subset of the top 500 ranked MMR and DFS terms. With MNB, the top 500 ranked MMR terms result in the highest micro F1, equal to 0.76.

Table 8: Macro F1 Performance of MMR against eight well-known metrics on the K1a Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.37  0.48  0.59  0.63  0.67  0.71  0.69   0.67

DFS 0.14 • [0.19 0.27] 0.24 • [0.2 0.29] 0.32 • [0.23 0.3] 0.51 • [0.09 0.16] 0.61 • [0.01 0.1] 0.68 [-0.01 0.06] 0.7 [-0.04 0.01] 0.7 ◦ [-0.06 -0.002]


NDM 0.38 [-0.05 0.03] 0.42 • [0.02 0.1] 0.43 • [0.12 0.2] 0.42 • [0.18 0.25] 0.43 • [0.19 0.28] 0.49 • [0.18 0.25] 0.58 • [0.09 0.14] 0.6 • [0.05 0.1]

POIS 0• [0.33 0.41] 0• [0.44 0.53] 0• [0.55 0.62] 0• [0.59 0.66] 0.01 • [0.61 0.71] 0.02 • [0.65 0.72] 0.02 • [0.64 0.7] 0.02 • [0.62 0.68]


ACC2 0.34 [-0.01 0.07] 0.47 [-0.03 0.06] 0.57 [-0.02 0.06] 0.63 [-0.03 0.04] 0.65 [-0.03 0.06] 0.66 • [0.01 0.08] 0.67 [-0.01 0.05] 0.67 [-0.02 0.03]


ODDS 0.38 [-0.05 0.03] 0.43 • [0.01 0.1] 0.43 • [0.12 0.2] 0.43 • [0.17 0.24] 0.42 • [0.2 0.29] 0.46 • [0.21 0.28] 0.49 • [0.17 0.23] 0.54 • [0.11 0.16]


CHI 0• [0.33 0.41] 0• [0.44 0.53] 0• [0.55 0.62] 0• [0.59 0.66] 0.01 • [0.61 0.71] 0.02 • [0.65 0.72] 0.02 • [0.64 0.7] 0.01 • [0.63 0.69]

GI 0.38 [-0.05 0.03] 0.42 • [0.02 0.1] 0.43 • [0.12 0.2] 0.42 • [0.18 0.25] 0.43 • [0.19 0.28] 0.49 • [0.18 0.25] 0.58 • [0.09 0.14] 0.6 • [0.05 0.1]

IG 0.47 ◦ [-0.13 -0.05] 0.51 [-0.07 0.01] 0.55 [-8E-4 0.07] 0.57 • [0.02 0.1] 0.6 • [0.02 0.11] 0.63 • [0.04 0.11] 0.65 • [0.01 0.07] 0.65 [-0.01 0.05]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 9: Macro F1 Performance of MMR against eight well-known metrics on the K1a Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.33  0.48  0.58  0.68  0.72  0.76  0.76   0.74

CHI 0• [0.3 0.36] 0• [0.44 0.51] 0• [0.54 0.62] 0.01 • [0.64 0.71] 0.01 • [0.67 0.75] 0.02 • [0.7 0.79] 0.01 • [0.71 0.81] 0• [0.69 0.78]


DFS 0.13 • [0.17 0.24] 0.19 • [0.25 0.32] 0.25 • [0.29 0.36] 0.35 • [0.3 0.37] 0.4 • [0.28 0.36] 0.45 • [0.27 0.36] 0.45 • [0.27 0.37] 0.44 • [0.26 0.34]


ACC2 0.08 • [0.22 0.28] 0.19 • [0.25 0.32] 0.39 • [0.15 0.23] 0.52 • [0.13 0.2] 0.6 • [0.08 0.15] 0.66 • [0.06 0.15] 0.72 [-0.01 0.09] 0.71 [-0.01 0.08]

NDM 0.08 • [0.22 0.28] 0.12 • [0.32 0.39] 0.2 • [0.34 0.42] 0.18 • [0.47 0.54] 0.23 • [0.45 0.53] 0.38 • [0.34 0.43] 0.55 • [0.17 0.26] 0.61 • [0.09 0.18]

POIS 0• [0.3 0.36] 0• [0.44 0.51] 0• [0.54 0.62] 0.01 • [0.64 0.71] 0.01 • [0.67 0.75] 0.03 • [0.7 0.78] 0.03 • [0.69 0.78] 0.03 • [0.66 0.75]


ODDS 0.05 • [0.26 0.32] 0.09 • [0.35 0.42] 0.13 • [0.41 0.49] 0.13 • [0.52 0.59] 0.16 • [0.52 0.6] 0.17 • [0.55 0.64] 0.31 • [0.4 0.5] 0.35 • [0.35 0.43]

GI 0• [0.3 0.36] 0• [0.44 0.51] 0• [0.54 0.62] 0.01 • [0.64 0.71] 0.01 • [0.67 0.75] 0.03 • [0.7 0.78] 0.03 • [0.69 0.78] 0.03 • [0.66 0.75]

IG 0.14 • [0.16 0.22] 0.21 • [0.23 0.3] 0.38 • [0.17 0.24] 0.45 • [0.2 0.27] 0.51 • [0.17 0.24] 0.66 • [0.06 0.15] 0.7 • [0.01 0.11] 0.7 [-0.01 0.08]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 10: Micro F1 Performance of MMR against eight well-known metrics on the K1a Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.55  0.66  0.76  0.79  0.82  0.85  0.86   0.85

DFS 0.41 • [0.1 0.16] 0.52 • [0.11 0.17] 0.58 • [0.16 0.21] 0.72 • [0.04 0.09] 0.79 • [3E-3 0.07] 0.83 [-4E-3 0.05] 0.85 [-0.01 0.04] 0.85 [-0.02 0.03]


CHI 0• [0.52 0.58] 0• [0.63 0.69] 0• [0.74 0.79] 0.01 • [0.76 0.81] 0.01 • [0.78 0.85] 0.03 • [0.8 0.85] 0.02 • [0.81 0.86] 0.02 • [0.81 0.86]


NDM 0.56 [-0.05 0.01] 0.63 • [8E-4 0.06] 0.62 • [0.11 0.17] 0.64 • [0.13 0.18] 0.65 • [0.14 0.21] 0.7 • [0.13 0.18] 0.75 • [0.08 0.13] 0.77 • [0.05 0.11]

POIS 0• [0.52 0.58] 0• [0.63 0.69] 0• [0.74 0.79] 0.01 • [0.76 0.81] 0.01 • [0.78 0.85] 0.03 • [0.8 0.85] 0.03 • [0.8 0.85] 0.04 • [0.79 0.84]


ACC2 0.58 ◦ [-0.07 -0.004] 0.67 [-0.04 0.02] 0.75 [-0.01 0.04] 0.81 [-0.04 0.01] 0.82 [-0.03 0.04] 0.84 [-0.02 0.04] 0.85 [-0.01 0.03] 0.85 [-0.02 0.03]


ODDS 0.56 [-0.04 0.02] 0.63 [-0.002 0.06] 0.62 • [0.12 0.17] 0.62 • [0.14 0.19] 0.61 • [0.18 0.25] 0.63 • [0.19 0.25] 0.65 • [0.18 0.23] 0.7 • [0.13 0.18]

GI 0.56 [-0.05 0.01] 0.63 • [8E-4 0.06] 0.62 • [0.11 0.17] 0.64 • [0.13 0.18] 0.65 • [0.14 0.21] 0.7 • [0.13 0.18] 0.75 • [0.08 0.13] 0.77 • [0.05 0.11]

IG 0.65 ◦ [-0.13 -0.07] 0.7 ◦ [-0.08 -0.02] 0.74 [-0.004 0.05] 0.77 [-0.002 0.05] 0.8 [-0.01 0.06] 0.83 [-0.01 0.05] 0.84 [-4E-3 0.04] 0.84 [-0.01 0.04]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 11: Micro F1 Performance of MMR against eight well-known metrics on the K1a Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.51  0.65  0.74  0.77  0.8   0.83  0.84   0.85

CHI 0• [0.48 0.53] 0• [0.62 0.68] 0• [0.71 0.76] 0.01 • [0.74 0.79] 0.01 • [0.76 0.82] 0.02 • [0.78 0.83] 0.01 • [0.81 0.86] 0.01 • [0.81 0.86]


DFS 0.39 • [0.08 0.14] 0.47 • [0.15 0.21] 0.52 • [0.19 0.24] 0.59 • [0.15 0.21] 0.63 • [0.14 0.2] 0.68 • [0.13 0.18] 0.7 • [0.12 0.17] 0.69 • [0.13 0.18]


ACC2 0.25 • [0.22 0.28] 0.38 • [0.24 0.3] 0.59 • [0.12 0.17] 0.71 • [0.04 0.1 ] 0.77 • [0.003 0.06] 0.81 • [0.001 0.05] 0.82 [-0.004 0.04] 0.83 [-0.003 0.04]

NDM 0.28 • [0.19 0.25] 0.36 • [0.26 0.32] 0.46 • [0.25 0.31] 0.45 • [0.29 0.35] 0.48 • [0.3 0.36] 0.59 • [0.22 0.27] 0.69 • [0.13 0.17] 0.74 • [0.08 0.13]

POIS 0• [0.48 0.53] 0• [0.62 0.68] 0• [0.71 0.76] 0.01 • [0.74 0.79] 0.01 • [0.76 0.82] 0.04 • [0.77 0.82] 0.04 • [0.78 0.83] 0.05 • [0.77 0.82]


ODDS 0.06 • [0.41 0.47] 0.11 • [0.51 0.57] 0.19 • [0.52 0.58] 0.21 • [0.54 0.59] 0.25 • [0.53 0.59] 0.28 • [0.53 0.58] 0.38 • [0.44 0.49] 0.49 • [0.33 0.38]

GI 0• [0.48 0.53] 0• [0.62 0.68] 0• [0.71 0.76] 0.01 • [0.74 0.79] 0.01 • [0.76 0.82] 0.04 • [0.77 0.82] 0.04 • [0.78 0.83] 0.05 • [0.77 0.82]

IG 0.31 • [0.16 0.22] 0.44 • [0.18 0.24] 0.6 • [0.11 0.17] 0.69 • [0.06 0.11] 0.75 • [0.02 0.08] 0.81 [-0.003 0.05] 0.83 [-0.01 0.04] 0.83 [-0.01 0.04]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 12: Macro F1 Performance of MMR against eight well-known metrics on the K1b Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.87  0.9   0.95  0.96  0.96  0.96  0.95   0.94

DFS 0.74 • [0.1 0.15] 0.9 [-0.02 0.04] 0.94 [-0.02 0.03] 0.95 [-0.02 0.04] 0.95 [-0.01 0.04] 0.95 [-0.02 0.04] 0.94 [-0.02 0.04] 0.92 [-0.01 0.05]


CHI 0.12 • [0.72 0.77] 0.12 • [0.75 0.81] 0.12 • [0.8 0.84] 0.13 • [0.8 0.86] 0.13 • [0.81 0.86] 0.15 • [0.78 0.84] 0.16 • [0.76 0.82] 0.15 • [0.75 0.81]


ACC2 0.89 [-0.05 0.003] 0.91 [-0.03 0.03] 0.92 • [.002 0.05] 0.94 [-.005 0.05] 0.94 • [5E-4 0.05] 0.93 [-0.01 0.06] 0.91 • [0.01 0.07] 0.9 • [0.01 0.07]

NDM 0.8 • [0.04 0.09] 0.82 • [0.06 0.11] 0.82 • [0.1 0.15] 0.86 • [0.07 0.13] 0.83 • [0.1 0.16] 0.84 • [0.09 0.15] 0.87 • [0.04 0.1] 0.87 • [0.03 0.09]

POIS 0.12 • [0.72 0.77] 0.12 • [0.75 0.81] 0.12 • [0.8 0.84] 0.13 • [0.8 0.86] 0.13 • [0.81 0.86] 0.15 • [0.77 0.84] 0.16 • [0.76 0.82] 0.17 • [0.74 0.8]


ODDS 0.81 • [0.03 0.08] 0.84 • [0.04 0.1] 0.83 • [0.1 0.14] 0.85 • [0.08 0.14] 0.85 • [0.09 0.15] 0.82 • [0.11 0.17] 0.84 • [0.08 0.14] 0.84 • [0.07 0.13]

GI 0.12 • [0.72 0.77] 0.12 • [0.75 0.81] 0.12 • [0.8 0.84] 0.13 • [0.8 0.86] 0.13 • [0.81 0.86] 0.15 • [0.77 0.84] 0.16 • [0.76 0.82] 0.17 • [0.74 0.8]

IG 0.9 ◦ [-0.05 -0.01] 0.9 [-0.03 0.03] 0.91 • [0.01 0.06] 0.92 • [0.01 0.07] 0.93 • [0.01 0.06] 0.92 • [.002 0.07] 0.92 [-.0007 0.06] 0.91 [-0.002 0.06]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 13: Macro F1 Performance of MMR against eight well-known metrics on the K1b Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.83  0.88  0.94  0.94  0.94  0.92  0.92   0.92

DFS 0.72 • [0.06 0.17] 0.76 • [0.06 0.18] 0.9 • [0.01 0.07] 0.92 [-0.02 0.05] 0.92 [-0.02 0.05] 0.91 [-0.01 0.05] 0.88 • [0.01 0.06] 0.87 • [0.02 0.08]


CHI 0.12 • [0.65 0.77] 0.12 • [0.7 0.81] 0.12 • [0.78 0.84] 0.13 • [0.77 0.84] 0.13 • [0.77 0.84] 0.15 • [0.74 0.8] 0.14 • [0.75 0.79] 0.14 • [0.76 0.81]


ACC2 0.4 • [0.38 0.49] 0.63 • [0.19 0.31] 0.82 • [0.08 0.15] 0.86 • [0.04 0.11] 0.88 • [0.03 0.09] 0.9 [-0.01 0.05] 0.9 [-0.01 0.03] 0.92 [-0.03 0.02]

NDM 0.19 • [0.59 0.7] 0.37 • [0.45 0.56] 0.48 • [0.43 0.49] 0.51 • [0.39 0.47] 0.62 • [0.29 0.35] 0.83 • [0.07 0.12] 0.88 • [0.01 0.05] 0.9 [-0.01 0.05]

POIS 0.12 • [0.65 0.77] 0.12 • [0.7 0.81] 0.12 • [0.78 0.84] 0.13 • [0.77 0.84] 0.13 • [0.77 0.84] 0.15 • [0.74 0.8] 0.17 • [0.73 0.77] 0.18 • [0.71 0.77]


ODDS 0.15 • [0.63 0.74] 0.29 • [0.53 0.65] 0.37 • [0.54 0.6] 0.39 • [0.51 0.59] 0.38 • [0.52 0.59] 0.46 • [0.44 0.49] 0.64 • [0.25 0.3] 0.75 • [0.14 0.19]

GI 0.12 • [0.65 0.77] 0.12 • [0.7 0.81] 0.12 • [0.78 0.84] 0.13 • [0.77 0.84] 0.13 • [0.77 0.84] 0.15 • [0.74 0.8] 0.17 • [0.73 0.77] 0.18 • [0.71 0.77]

IG 0.44 • [0.34 0.45] 0.66 • [0.16 0.28] 0.79 • [0.11 0.18] 0.88 • [0.02 0.09] 0.89 • [0.01 0.07] 0.91 [-0.01 0.04] 0.92 [-0.02 0.02] 0.92 [-0.03 0.02]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 14: Micro F1 Performance of MMR against eight well-known metrics on the K1b Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.88  0.92  0.95  0.97  0.98  0.98  0.98   0.98

DFS 0.86 • [0.01 0.03] 0.93 [-0.02 1E4] 0.96 ◦ [-0.02 -0.002] 0.97 [-0.01 0.01] 0.97 [-0.005 0.01] 0.98 [-0.01 0.01] 0.98 [-0.01 0.01] 0.97 [-0.003 0.01]


NDM 0.83 • [0.04 0.06] 0.86 • [0.05 0.07] 0.9 • [0.04 0.06] 0.93 • [0.03 0.05] 0.94 • [0.03 0.04] 0.95 • [0.02 0.04] 0.96 • [0.01 0.03] 0.96 • [0.01 0.03]

POIS 0.59 • [0.27 0.29] 0.59 • [0.32 0.34] 0.59 • [0.35 0.37] 0.6 • [0.36 0.39] 0.6 • [0.37 0.39] 0.6 • [0.37 0.39] 0.61 • [0.37 0.38] 0.61 • [0.36 0.38]


ACC2 0.91 ◦ [-0.05 -0.02] 0.94 ◦ [-0.02 -0.01] 0.95 [-0.01 0.01] 0.97 [-0.01 0.02] 0.97 • [0.001 0.02] 0.98 [-0.003 0.01] 0.97 [-.0009 0.02] 0.97 [-0.001 0.02]


ODDS 0.84 • [0.03 0.05] 0.88 • [0.03 0.05] 0.9 • [0.05 0.06] 0.91 • [0.05 0.07] 0.92 • [0.05 0.07] 0.91 • [0.06 0.07] 0.92 • [0.05 0.07] 0.92 • [0.05 0.07]


CHI 0.59 • [0.27 0.29] 0.59 • [0.32 0.34] 0.59 • [0.35 0.37] 0.6 • [0.36 0.39] 0.6 • [0.37 0.39] 0.6 • [0.37 0.39] 0.6 • [0.37 0.38] 0.6 • [0.37 0.39]

GI 0.59 • [0.27 0.29] 0.59 • [0.32 0.34] 0.59 • [0.35 0.37] 0.6 • [0.36 0.39] 0.6 • [0.37 0.39] 0.6 • [0.37 0.39] 0.61 • [0.37 0.38] 0.61 • [0.36 0.38]

IG 0.93 ◦ [-0.06 -0.04] 0.94 ◦ [-0.03 -0.01] 0.95 [-0.01 0.01] 0.96 [-0.01 0.02] 0.97 [-0.002 0.02] 0.97 [-.0007 0.01] 0.97 [-0.003 0.01] 0.97 [-0.003 0.01]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 15: Micro F1 Performance of MMR against eight well-known metrics on the K1b Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.82  0.86  0.93  0.93  0.96  0.97  0.97   0.97

DFS 0.82 [-0.02 0.01] 0.88 ◦ [-0.03 -0.01] 0.93 [-0.01 0.004] 0.96 ◦ [-0.03 -0.01] 0.96 [-0.02 0.01] 0.96 [-0.004 0.02] 0.96 • [0.01 0.02] 0.95 • [0.01 0.03]


NDM 0.63 • [0.17 0.2] 0.76 • [0.09 0.12] 0.83 • [0.09 0.1] 0.86 • [0.06 0.08] 0.89 • [0.05 0.08] 0.94 • [0.02 0.04] 0.96 • [0.004 0.02] 0.96 • [0.002 0.02]

POIS 0.59 • [0.21 0.24] 0.59 • [0.25 0.28] 0.59 • [0.33 0.34] 0.6 • [0.33 0.35] 0.6 • [0.35 0.37] 0.6 • [0.36 0.38] 0.6 • [0.36 0.37] 0.61 • [0.35 0.37]


ACC2 0.75 • [0.05 0.08] 0.81 • [0.04 0.07] 0.9 • [0.02 0.04] 0.93 [-0.01 0.01] 0.94 • [0.003 0.03] 0.96 [-0.0006 0.02] 0.96 [-0.002 0.01] 0.97 [-0.01 0.01]


ODDS 0.56 • [0.24 0.27] 0.58 • [0.27 0.3] 0.59 • [0.33 0.34] 0.61 • [0.31 0.33] 0.64 • [0.31 0.33] 0.69 • [0.27 0.29] 0.77 • [0.19 0.21] 0.82 • [0.14 0.16]


CHI 0.59 • [0.21 0.24] 0.59 • [0.25 0.28] 0.59 • [0.33 0.34] 0.6 • [0.33 0.35] 0.6 • [0.35 0.37] 0.6 • [0.36 0.38] 0.6 • [0.36 0.37] 0.6 • [0.36 0.38]

GI 0.59 • [0.21 0.24] 0.59 • [0.25 0.28] 0.59 • [0.33 0.34] 0.6 • [0.33 0.35] 0.6 • [0.35 0.37] 0.6 • [0.36 0.38] 0.6 • [0.36 0.37] 0.61 • [0.35 0.37]

IG 0.77 • [0.03 0.06] 0.82 • [0.03 0.06] 0.89 • [0.03 0.05] 0.94 [-0.02 0.002] 0.95 [-0.01 0.02] 0.96 [-0.01 0.01] 0.97 [-0.01 0.01] 0.97 [-0.01 0.01]

Notation and the statistical testing procedure are as described in the note to Table 6.


When we consider the overall performance, we see that MMR statistically outperforms the other 8 metrics in 77 out of 128 cases.


5.5. The RE1 Data set
The RE1 data set is highly skewed, and its classes are more skewed than those of RE0. Tables 20 and 21 show the macro F1 performances of the feature ranking metrics using SVM and MNB. The highest macro F1 value obtained using SVM is 0.73 and is achieved by the subset containing the top 200 ranked terms of MMR. In the MNB case, the subset with the top 500 terms of MMR has resulted in the highest macro F1 value, i.e., 0.72. Analyzing the overall performance, we observe that MMR has performed better in 85 of 128 subsets. Similarly, the micro F1 values of the metrics obtained on the RE1 data set are shown in Tables 22 and 23. Using the SVM classifier for evaluation, the highest micro F1 with the smallest subset is 0.94, generated by the subset of DFS consisting of the top 500 terms. In the case of MNB, subsets of MMR and IG containing the top 500 ranked terms have resulted in the highest micro F1 value, equal to 0.92. The overall performance indicates that MMR has generated statistically significantly better micro F1 values in 89 of 128 cases.


5.6. The 20 Newsgroups Data set
Finally, we consider the 20 Newsgroups data set, whose ratio of the number of terms to the number of documents is less than 1. In Tables 24 and 25, the macro F1 values of the 9 metrics are tabulated. The highest macro F1 value in the case of SVM evaluation is 0.84, which has been achieved by the subsets of the top 1,500 ranked terms of MMR, ACC2 and IG. On the other hand, evaluation with the MNB classifier indicates that the subsets containing the top 1,500 ranked terms of CHI, POIS, and IG have resulted in the highest macro F1 value, which is equal to 0.8. When we analyze the overall performance, the macro F1 values of MMR are statistically significantly better than those of the other 8 metrics in 84 out of the 128 subsets we have evaluated. Tables 26 and 27 list the micro F1 values obtained on this data set. The highest micro F1 value obtained by SVM is 0.85, which was attained by the subsets of the top 1,500 terms of MMR, ACC2 and IG. With MNB, the subsets consisting of the top 1,500 ranked terms of CHI, POIS and IG generate a micro F1 value of 0.8, which is the highest in this case. We can also observe that MMR statistically outperforms the other metrics in 99 out of 128 subsets.

Table 16: Macro F1 Performance of MMR against eight well-known metrics on the RE0 Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.48  0.61  0.72  0.78  0.73  0.73  0.73   0.73

DFS 0.36 • [0.06 0.18] 0.54 • [0.01 0.12] 0.6 • [0.05 0.2] 0.68 • [0.03 0.17] 0.7 [-0.05 0.11] 0.73 [-0.06 0.06] 0.73 [-0.06 0.05] 0.72 [-0.04 0.07]


CHI 0.01 • [0.41 0.53] 0.02 • [0.54 0.65] 0.37 • [0.28 0.43] 0.46 • [0.25 0.39] 0.56 • [0.09 0.25] 0.69 [-0.01 0.1] 0.72 [-0.05 0.06] 0.73 [-0.06 0.05]


NDM 0.37 • [0.05 0.17] 0.35 • [0.2 0.31] 0.38 • [0.26 0.41] 0.46 • [0.26 0.4] 0.57 • [0.09 0.24] 0.65 • [0.02 0.14] 0.7 [-0.03 0.07] 0.71 [-0.04 0.07]

POIS 0.01 • [0.41 0.53] 0.02 • [0.54 0.65] 0.43 • [0.21 0.37] 0.51 • [0.2 0.34] 0.62 • [0.04 0.19] 0.68 [-0.01 0.11] 0.71 [-0.04 0.07] 0.72 [-0.05 0.06]


ACC2 0.27 • [0.16 0.28] 0.4 • [0.15 0.26] 0.55 • [0.09 0.25] 0.64 • [0.07 0.21] 0.69 [-0.03 0.12] 0.7 [-0.03 0.09] 0.71 [-0.04 0.07] 0.72 [-0.05 0.06]


ODDS 0.37 • [0.05 0.17] 0.36 • [0.19 0.3] 0.35 • [0.29 0.44] 0.37 • [0.34 0.48] 0.43 • [0.23 0.38] 0.55 • [0.12 0.24] 0.64 • [0.04 0.14] 0.67 [-0.002 0.11]

GI 0.01 • [0.41 0.53] 0.02 • [0.54 0.65] 0.5 • [0.14 0.29] 0.67 • [0.04 0.18] 0.67 [-0.02 0.14] 0.69 [-0.02 0.1] 0.71 [-0.04 0.06] 0.71 [-0.04 0.07]

IG 0.32 • [0.11 0.23] 0.41 • [0.15 0.26] 0.51 • [0.14 0.29] 0.59 • [0.13 0.27] 0.66 • [8E-4 0.15] 0.71 [-0.04 0.08] 0.72 [-0.05 0.06] 0.73 [-0.05 0.06]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 17: Macro F1 Performance of MMR against eight well-known metrics on the RE0 Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.4   0.6   0.63  0.64  0.6   0.36  0.24   0.2

DFS 0.14 • [0.22 0.3] 0.36 • [0.2 0.28] 0.28 • [0.3 0.39] 0.28 • [0.3 0.42] 0.29 • [0.22 0.4] 0.24 • [0.05 0.19] 0.21 • [0.0005 0.08] 0.16 • [0.01 0.06]


NDM 0.08 • [0.29 0.36] 0.15 • [0.41 0.49] 0.23 • [0.36 0.45] 0.25 • [0.33 0.45] 0.34 • [0.17 0.35] 0.34 [-0.05 0.09] 0.24 [-0.03 0.05] 0.17 • [0.01 0.05]

POIS 0.01 • [0.35 0.43] 0.02 • [0.54 0.62] 0.18 • [0.41 0.5] 0.22 • [0.36 0.48] 0.29 • [0.22 0.39] 0.31 [-0.02 0.11] 0.25 [-0.04 0.04] 0.18 • [4E-4 0.05]


ACC2 0.11 • [0.25 0.32] 0.17 • [0.39 0.47] 0.3 • [0.28 0.38] 0.35 • [0.23 0.35] 0.44 • [0.07 0.24] 0.36 [-0.07 0.07] 0.26 [-0.05 0.02] 0.18 • [4E-4 0.05]


ODDS 0.07 • [0.3 0.37] 0.09 • [0.47 0.55] 0.14 • [0.44 0.54] 0.13 • [0.45 0.57] 0.18 • [0.33 0.51] 0.16 • [0.13 0.27] 0.12 • [0.09 0.16] 0.13 • [0.04 0.09]


CHI 0.01 • [0.35 0.43] 0.02 • [0.54 0.62] 0.17 • [0.42 0.51] 0.2 • [0.39 0.5] 0.29 • [0.23 0.4] 0.33 [-0.05 0.09] 0.26 [-0.05 0.03] 0.2 [-0.02 0.02]

GI 0.01 • [0.35 0.43] 0.02 • [0.54 0.62] 0.19 • [0.39 0.48] 0.25 • [0.34 0.45] 0.27 • [0.24 0.42] 0.27 • [0.02 0.16] 0.23 [-0.03 0.05] 0.17 • [0.002 0.05]

IG 0.12 • [0.25 0.32] 0.19 • [0.37 0.44] 0.27 • [0.31 0.41] 0.36 • [0.23 0.34] 0.44 • [0.07 0.25] 0.37 [-0.08 0.06] 0.26 [-0.05 0.03] 0.2 [-0.02 0.02]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 18: Micro F1 Performance of MMR against eight well-known metrics on the RE0 Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.54  0.66  0.76  0.81  0.83  0.84  0.84   0.84

DFS 0.6 ◦ [-0.09 -0.04] 0.64 [-5E-3 0.04] 0.73 • [3E-3 0.06] 0.8 [-0.02 0.03] 0.82 [-0.01 0.02] 0.84 [-0.02 0.01] 0.84 [-0.01 0.01] 0.84 [-0.01 0.01]


NDM 0.43 • [0.08 0.14] 0.47 • [0.17 0.21] 0.57 • [0.17 0.23] 0.64 • [0.14 0.2] 0.7 • [0.11 0.15] 0.8 • [0.02 0.05] 0.83 • [4E-3 0.02] 0.83 [-2E-4 0.02]

POIS 0.02 • [0.5 0.55] 0.02 • [0.61 0.66] 0.71 • [0.02 0.08] 0.78 • [0.01 0.06] 0.81 • [0.002 0.04] 0.83 [-1E-4 0.02] 0.84 [-0.01 0.01] 0.84 [-0.02 0.005]


ACC2 0.62 ◦ [-0.11 -0.05] 0.68 ◦ [-0.04 -1E-4] 0.79 [-0.06 3E-4] 0.81 [-0.03 0.02] 0.82 [-0.01 0.02] 0.83 [-0.01 0.02] 0.84 [-3E-3 0.02] 0.84 [-0.02 0.01]


ODDS 0.45 • [0.06 0.11] 0.54 • [0.1 0.14] 0.54 • [0.19 0.25] 0.58 • [0.2 0.25] 0.59 • [0.22 0.25] 0.67 • [0.16 0.18] 0.74 • [0.09 0.11] 0.79 • [0.03 0.05]


CHI 0.02 • [0.5 0.55] 0.02 • [0.61 0.66] 0.7 • [0.03 0.09] 0.76 • [0.02 0.08] 0.8 • [0.01 0.04] 0.82 • [3E-3 0.03] 0.84 [-5E-3 0.02] 0.84 [-0.02 0.01]

GI 0.02 • [0.5 0.55] 0.02 • [0.61 0.66] 0.76 [-0.03 0.04] 0.81 [-0.02 0.03] 0.81 [-5E-3 0.03] 0.83 [-0.01 0.02] 0.84 [-0.01 0.02] 0.84 [-0.01 0.01]

IG 0.64 ◦ [-0.12 -0.07] 0.67 [-0.03 0.01] 0.76 [-0.03 0.03] 0.8 [-0.02 0.03] 0.82 [-0.01 0.03] 0.83 [-0.01 0.02] 0.84 [-0.01 0.01] 0.84 [-0.01 0.01]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 19: Micro F1 Performance of MMR against eight well-known metrics on the RE0 Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.52  0.56  0.7   0.73  0.76  0.73  0.69   0.66

DFS 0.07 • [0.41 0.48] 0.4 • [0.13 0.2] 0.59 • [0.09 0.15] 0.62 • [0.09 0.13] 0.67 • [0.08 0.11] 0.66 • [0.05 0.08] 0.65 • [0.01 0.05] 0.64 [-0.01 0.04]


CHI 0.01 • [0.47 0.54] 0.02 • [0.5 0.58] 0.57 • [0.1 0.16] 0.64 • [0.08 0.12] 0.68 • [0.06 0.09] 0.72 • [6E-4 0.03] 0.69 [-0.02 0.02] 0.66 [-0.03 0.02]


ACC2 0.43 • [0.06 0.13] 0.52 • [0.01 0.08] 0.66 • [0.01 0.07] 0.69 • [0.02 0.06] 0.71 • [0.03 0.06] 0.71 • [0.004 0.03] 0.69 [-0.02 0.02] 0.65 [-0.02 0.03]

NDM 0.24 • [0.24 0.31] 0.39 • [0.14 0.21] 0.52 • [0.15 0.21] 0.59 • [0.12 0.16] 0.66 • [0.08 0.11] 0.71 • [0.004 0.03] 0.67 [-0.004 0.03] 0.64 [-0.01 0.04]

POIS 0.01 • [0.47 0.54] 0.02 • [0.5 0.58] 0.59 • [0.09 0.15] 0.66 • [0.06 0.1] 0.68 • [0.06 0.09] 0.71 • [0.01 0.04] 0.68 [-0.02 0.02] 0.66 [-0.02 0.02]


ODDS 0.19 • [0.29 0.36] 0.25 • [0.28 0.35] 0.34 • [0.34 0.4] 0.35 • [0.36 0.41] 0.41 • [0.33 0.36] 0.45 • [0.26 0.29] 0.5 • [0.16 0.2] 0.54 • [0.09 0.14]

GI 0.01 • [0.47 0.54] 0.02 • [0.5 0.58] 0.54 • [0.14 0.2] 0.62 • [0.09 0.13] 0.66 • [0.08 0.11] 0.67 • [0.05 0.08] 0.68 [-0.01 0.03] 0.65 [-0.01 0.03]

IG 0.43 • [0.05 0.12] 0.55 [-0.02 0.05] 0.66 • [0.02 0.08] 0.71 [-7E-4 0.04] 0.73 • [0.01 0.05] 0.73 [-0.01 0.02] 0.7 [-0.03 0.01] 0.66 [-0.03 0.02]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 20: Macro F1 Performance of MMR against eight well-known metrics on the RE1 Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.68  0.69  0.71  0.72  0.73  0.71  0.7    0.69

DFS 0.39 • [0.25 0.32] 0.53 • [0.12 0.19] 0.67 • [0.01 0.08] 0.69 [-0.01 0.07] 0.71 [-0.02 0.06] 0.71 [-0.03 0.03] 0.7 [-0.03 0.04] 0.69 [-0.04 0.04]


NDM 0.61 • [0.03 0.1] 0.61 • [0.05 0.12] 0.61 • [0.06 0.13] 0.63 • [0.05 0.13] 0.63 • [0.06 0.15] 0.66 • [0.01 0.07] 0.67 [-0.01 0.07] 0.67 [-0.03 0.06]

POIS 0.64 • [0.01 0.07] 0.64 • [0.01 0.08] 0.64 • [0.03 0.1] 0.65 • [0.03 0.11] 0.66 • [0.02 0.11] 0.67 • [2E-3 0.06] 0.68 [-0.02 0.05] 0.68 [-0.03 0.05]


ACC2 0.64 • [0.01 0.07] 0.66 [-4E-3 0.06] 0.67 • [0.01 0.08] 0.68 • [0.01 0.09] 0.67 • [0.01 0.1] 0.68 [-2E-3 0.06] 0.68 [-0.02 0.05] 0.68 [-0.03 0.05]


ODDS 0.63• [0.02 0.08] 0.63 • [0.03 0.09] 0.63 • [0.04 0.11] 0.64 • [0.04 0.12] 0.63 • [0.05 0.14] 0.65 • [0.03 0.09] 0.66 • [2E-4 0.07] 0.67 [-0.02 0.06]


CHI 0.64• [0.01 0.07] 0.64 • [0.01 0.08] 0.64 • [0.03 0.1] 0.65 • [0.04 0.12] 0.65 • [0.03 0.12] 0.68 [-2E-3 0.06] 0.69 [-0.03 0.05] 0.68 [-0.04 0.05]

GI 0.66 [-0.01 0.05] 0.66 • [4E-5 0.06] 0.67 [-0.001 0.07] 0.68 • [3E-3 0.08] 0.68 • [2E-3 0.09] 0.68 [-5E-3 0.06] 0.68 [-0.02 0.05] 0.68 [-0.03 0.05]

IG 0.64 • [0.01 0.07] 0.65 • [0.01 0.07] 0.65 • [0.02 0.09] 0.66 • [0.02 0.1] 0.68 • [0.01 0.09] 0.69 [-0.01 0.05] 0.69 [-0.03 0.04] 0.69 [-0.04 0.04]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 21: Macro F1 Performance of MMR against eight well-known metrics on the RE1 Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.61  0.67  0.7   0.7   0.71  0.72  0.67   0.6

CHI 0.28 • [0.28 0.37] 0.44 • [0.18 0.29] 0.49 • [0.17 0.24] 0.55 • [0.11 0.19] 0.64 • [0.02 0.11] 0.7 [-0.03 0.07] 0.63 [-0.02 0.09] 0.58 [-0.04 0.07]


DFS 0.35 • [0.21 0.3] 0.53 • [0.09 0.2] 0.64 • [0.02 0.1] 0.63 • [0.03 0.11] 0.63 • [0.03 0.13] 0.61 • [0.06 0.16] 0.57 • [0.05 0.15] 0.52 • [0.02 0.13]


ACC2 0.48 • [0.08 0.17] 0.59 • [0.02 0.14] 0.65 • [0.01 0.08] 0.66 [-0.01 0.08] 0.68 [-0.02 0.08] 0.69 [-0.02 0.08] 0.62 [-0.01 0.1] 0.57 [-0.03 0.08]

NDM 0.09 • [0.48 0.57] 0.2 • [0.42 0.53] 0.39 • [0.27 0.35] 0.46 • [0.19 0.28] 0.53 • [0.13 0.23] 0.65 • [0.02 0.12] 0.6 • [0.01 0.12] 0.57 [-0.02 0.09]

POIS 0.29 • [0.28 0.37] 0.44 • [0.18 0.29] 0.51 • [0.15 0.23] 0.58 • [0.07 0.16] 0.65 • [0.01 0.11] 0.69 [-0.02 0.08] 0.63 [-0.01 0.09] 0.57 [-0.03 0.08]


ODDS 0.15 • [0.42 0.51] 0.24 • [0.37 0.48] 0.38 • [0.28 0.35] 0.44 • [0.21 0.3] 0.49 • [0.18 0.27] 0.52 • [0.15 0.25] 0.5 • [0.12 0.22] 0.48 • [0.06 0.17]

GI 0.57 • [0.002 0.09] 0.63 [-0.01 0.1] 0.65 • [0.01 0.09] 0.63 • [0.02 0.11] 0.65 • [0.02 0.11] 0.62 • [0.05 0.15] 0.58 • [0.03 0.14] 0.56 [-0.02 0.09]

IG 0.37 • [0.2 0.29] 0.53 • [0.09 0.2] 0.61 • [0.05 0.13] 0.64 • [0.02 0.11] 0.68 [-0.02 0.08] 0.7 [-0.03 0.07] 0.64 [-0.03 0.08] 0.6 [-0.05 0.06]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 22: Micro F1 Performance of MMR against eight well-known metrics on the RE1 Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.88  0.89  0.91  0.93  0.93  0.93  0.94   0.94

DFS 0.5 • [0.36 0.4] 0.74 • [0.13 0.17] 0.89 • [0.01 0.04] 0.91 • [6E-4 0.03] 0.93 [-0.02 0.02] 0.94 [-0.02 0.01] 0.93 [-0.02 0.03] 0.94 [-0.02 0.02]


CHI 0.86 • [0.01 0.04] 0.86 • [0.01 0.05] 0.87 • [0.03 0.06] 0.88 • [0.03 0.07] 0.9 • [0.01 0.05] 0.92 [-0.01 0.03] 0.94 [-0.02 0.03] 0.93 [-0.02 0.03]


ACC2 0.86 • [0.01 0.04] 0.87 • [9E-4 0.05] 0.9 [-0.01 0.03] 0.9 • [0.004 0.04] 0.91 [-7E-5 0.04] 0.93 [-0.01 0.02] 0.93 [-0.01 0.03] 0.93 [-0.01 0.03]

NDM 0.85 • [0.02 0.05] 0.85 • [0.02 0.07] 0.85 • [0.04 0.08] 0.87 • [0.04 0.07] 0.88 • [0.03 0.07] 0.91 • [0.01 0.04] 0.92 • [0.005 0.05] 0.92 [-0.01 0.04]

POIS 0.86 • [0.01 0.04] 0.86 • [0.01 0.05] 0.87 • [0.03 0.06] 0.88 • [0.03 0.06] 0.9 • [0.01 0.05] 0.92 [-0.01 0.03] 0.94 [-0.02 0.03] 0.93 [-0.02 0.03]


ODDS 0.85 • [0.01 0.05] 0.86 • [0.01 0.06] 0.85 • [0.04 0.08] 0.86 • [0.05 0.08] 0.86 • [0.06 0.1] 0.87 • [0.04 0.08] 0.9 • [0.02 0.06] 0.91 • [0.01 0.05]

GI 0.86 • [0.005 0.04] 0.87 [-0.004 0.04] 0.88 • [0.01 0.05] 0.9 • [0.01 0.04] 0.91 [-6E-4 0.04] 0.92 [-0.01 0.03] 0.93 [-0.01 0.03] 0.93 [-0.01 0.03]

IG 0.86 • [0.01 0.04] 0.87 [-3E-4 0.04] 0.88 • [0.01 0.05] 0.9 • [0.01 0.05] 0.92 [-0.005 0.04] 0.93 [-0.01 0.02] 0.94 [-0.02 0.03] 0.94 [-0.02 0.02]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 23: Micro F1 Performance of MMR against eight well-known metrics on the RE1 Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.82  0.87  0.9   0.9   0.91  0.92  0.92   0.89

CHI 0.5 • [0.26 0.39] 0.62 • [0.19 0.3] 0.71 • [0.16 0.21] 0.79 • [0.09 0.14] 0.86 • [0.02 0.08] 0.91 [-0.01 0.03] 0.91 [-0.01 0.03] 0.89 [-0.02 0.02]


DFS 0.46 • [0.3 0.43] 0.72 • [0.1 0.21] 0.86 • [0.01 0.06] 0.85 • [0.03 0.08] 0.87 • [0.01 0.07] 0.87 • [0.03 0.07] 0.86 • [0.04 0.07] 0.84 • [0.03 0.07]


ACC2 0.63 • [0.12 0.25] 0.79 • [0.03 0.13] 0.86 • [0.01 0.06] 0.88 • [0.004 0.05] 0.89 [-0.01 0.05] 0.91 [-0.01 0.03] 0.9 • [0.002 0.04] 0.88 [-0.01 0.03]

NDM 0.17 • [0.59 0.72] 0.39 • [0.42 0.53] 0.66 • [0.21 0.26] 0.74 • [0.14 0.19] 0.79 • [0.09 0.14] 0.87 • [0.03 0.08] 0.86 • [0.04 0.07] 0.86 • [0.01 0.05]

POIS 0.5 • [0.26 0.38] 0.62 • [0.19 0.3] 0.73 • [0.15 0.2] 0.8 • [0.08 0.12] 0.87 • [0.02 0.07] 0.91 [-0.01 0.04] 0.9 [-0.01 0.03] 0.88 [-0.01 0.03]


ODDS 0.2 • [0.56 0.68] 0.3 • [0.52 0.63] 0.53 • [0.35 0.4] 0.59 • [0.29 0.34] 0.65 • [0.24 0.29] 0.73 • [0.17 0.21] 0.75 • [0.14 0.18] 0.79 • [0.09 0.13]

GI 0.74 • [0.01 0.14] 0.81 • [0.003 0.11] 0.84 • [0.03 0.08] 0.85 • [0.03 0.07] 0.88 • [0.01 0.06] 0.88 • [0.02 0.06] 0.88 • [0.02 0.06] 0.87 • [0.01 0.05]

IG 0.56 • [0.19 0.32] 0.75 • [0.07 0.18] 0.84 • [0.03 0.09] 0.86 • [0.02 0.06] 0.9 [-0.01 0.04] 0.92 [-0.02 0.02] 0.91 [-0.01 0.02] 0.9 [-0.02 0.02]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 24: Macro F1 Performance of MMR against eight well-known metrics on the 20NG Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.53  0.61  0.66  0.7   0.74  0.79  0.82   0.84

NDM 0.47 • [6E-4 0.13] 0.56 [-4E-3 0.09] 0.64 [-0.01 0.07] 0.67 • [2E-4 0.06] 0.67 • [0.05 0.11] 0.73 • [0.04 0.07] 0.77 • [0.04 0.06] 0.79 • [0.04 0.05]

POIS 0.53 [-0.06 0.07] 0.57 [-0.01 0.09] 0.64 [-0.01 0.07] 0.68 [-0.01 0.05] 0.72 [-0.01 0.05] 0.78 [-0.001 0.03] 0.82 [-0.003 0.01] 0.83 [-0.004 0.01]


ACC2 0.43 • [0.04 0.16] 0.5 • [0.06 0.15] 0.58 • [0.04 0.12] 0.65 • [0.02 0.08] 0.71 • [0.01 0.06] 0.78 [-0.01 0.02] 0.82 [-0.01 0.01] 0.84 [-0.01 0.01]


ODDS 0.46 • [0.01 0.13] 0.56 [-3E-3 0.1] 0.63 [-0.01 0.07] 0.66 • [0.01 0.07] 0.65 • [0.06 0.12] 0.69 • [0.09 0.11] 0.71 • [0.1 0.12] 0.73 • [0.1 0.12]


CHI 0.56 [-0.09 0.04] 0.59 [-0.03 0.07] 0.65 [-0.03 0.05] 0.69 [-0.02 0.04] 0.73 [-0.02 0.04] 0.78 [-0.002 0.03] 0.82 [-0.003 0.01] 0.83 [-0.004 0.01]


DFS 0.22 • [0.25 0.37] 0.29 • [0.26 0.36] 0.44 • [0.19 0.27] 0.55 • [0.12 0.18] 0.65 • [0.06 0.12] 0.73 • [0.05 0.07] 0.79 • [0.03 0.04] 0.81 • [0.02 0.04]

GI 0.44 • [0.02 0.15] 0.5 • [0.05 0.15] 0.58 • [0.05 0.13] 0.63 • [0.03 0.1] 0.68 • [0.04 0.1] 0.76 • [0.02 0.05] 0.81 • [0.01 0.02] 0.83 [-0.002 0.01]

IG 0.54 [-0.07 0.05] 0.57 [-0.01 0.09] 0.64 [-0.01 0.07] 0.69 [-0.02 0.04] 0.74 [-0.03 0.03] 0.8 [-0.02 0.01] 0.83 [-0.02 3E-4] 0.84 [-0.01 4E-3]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 25: Macro F1 Performance of MMR against eight well-known metrics on the 20NG Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.28  0.46  0.66  0.71  0.74  0.76  0.77   0.79

NDM 0.05 • [0.2 0.25] 0.1 • [0.34 0.38] 0.22 • [0.42 0.45] 0.38 • [0.32 0.34] 0.55 • [0.18 0.2] 0.66 • [0.09 0.1] 0.7 • [0.07 0.09] 0.72 • [0.06 0.07]

POIS 0.24 • [0.01 0.06] 0.36 • [0.08 0.12] 0.49 • [0.15 0.18] 0.61 • [0.09 0.11] 0.67 • [0.06 0.08] 0.75 • [0.002 0.02] 0.79 ◦ [-0.02 -0.01] 0.8 ◦ [-0.03 -0.01]


ACC2 0.25 • [0.004 0.05] 0.44 [-0.01 0.03] 0.59 • [0.05 0.08] 0.65 • [0.05 0.07] 0.7 • [0.04 0.05] 0.74 • [0.01 0.02] 0.77 [-0.01 0.01] 0.79 [-0.01 0.002]


ODDS 0.05 • [0.2 0.25] 0.1 • [0.34 0.38] 0.22 • [0.42 0.45] 0.31 • [0.39 0.41] 0.39 • [0.34 0.35] 0.48 • [0.26 0.28] 0.54 • [0.23 0.24] 0.57 • [0.21 0.22]


CHI 0.24 • [0.01 0.06] 0.35 • [0.08 0.13] 0.49 • [0.15 0.18] 0.59 • [0.11 0.13] 0.66 • [0.07 0.09] 0.73 • [0.02 0.03] 0.78 [-0.02 0.0008] 0.8 ◦ [-0.02 -0.005]


DFS 0.19 • [0.07 0.11] 0.3 • [0.14 0.18] 0.48 • [0.16 0.19] 0.57 • [0.12 0.14] 0.66 • [0.07 0.09] 0.71 • [0.04 0.06] 0.72 • [0.04 0.06] 0.73 • [0.04 0.06]

GI 0.37 ◦ [-0.11 -0.07] 0.41 • [0.03 0.07] 0.52 • [0.12 0.15] 0.58 • [0.12 0.14] 0.62 • [0.11 0.12] 0.7 • [0.05 0.06] 0.76 • [0.01 0.02] 0.78 [-0.002 0.01]

IG 0.26 [-0.004 0.04] 0.38 • [0.06 0.1] 0.57 • [0.07 0.1] 0.67 • [0.03 0.05] 0.73 • [0.004 0.02] 0.77 ◦ [-0.02 -0.005] 0.79 ◦ [-0.02 -0.01] 0.8 ◦ [-0.02 -0.01]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 26: Micro F1 Performance of MMR against eight well-known metrics on the 20NG Data set averaged over 5 trials with SVM classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.58  0.64  0.7   0.72  0.76  0.8   0.83   0.85

NDM 0.49 • [0.07 0.1] 0.58 • [0.05 0.07] 0.65 • [0.03 0.06] 0.68 • [0.04 0.06] 0.68 • [0.07 0.09] 0.74 • [0.05 0.07] 0.78 • [0.04 0.05] 0.8 • [0.04 0.05]

POIS 0.59 [-0.02 0.005] 0.62 • [0.01 0.04] 0.67 • [0.01 0.04] 0.71 • [0.002 0.02] 0.75 • [0.01 0.02] 0.79 • [0.002 0.02] 0.83 [-0.003 0.01] 0.84 [-0.004 0.01]


ACC2 0.49 • [0.08 0.1] 0.55 • [0.09 0.11] 0.62 • [0.06 0.09] 0.68 • [0.03 0.05] 0.73 • [0.02 0.04] 0.8 [-0.002 0.01] 0.83 [-0.01 0.01] 0.85 [-0.01 0.01]


ODDS 0.49 • [0.08 0.1] 0.58 • [0.05 0.07] 0.65 • [0.03 0.06] 0.67 • [0.04 0.06] 0.67 • [0.09 0.1] 0.7 • [0.09 0.11] 0.72 • [0.1 0.11] 0.74 • [0.1 0.12]


CHI 0.59 [-0.02 0.005] 0.62 • [0.02 0.04] 0.67 • [0.01 0.04] 0.71 • [0.004 0.03] 0.74 • [0.01 0.02] 0.79 • [0.01 0.02] 0.82 [-5E-4 0.01] 0.84 [-0.003 0.01]


DFS 0.31 • [0.25 0.28] 0.38 • [0.25 0.28] 0.52 • [0.17 0.2] 0.6 • [0.11 0.13] 0.69 • [0.06 0.08] 0.75 • [0.04 0.06] 0.8 • [0.02 0.04] 0.82 • [0.02 0.03]

GI 0.49 • [0.08 0.11] 0.54 • [0.09 0.12] 0.61 • [0.07 0.1] 0.66 • [0.05 0.08] 0.7 • [0.05 0.07] 0.77 • [0.02 0.04] 0.82 • [0.01 0.02] 0.84 [-5E-4 0.02]

IG 0.57 [-0.01 0.02] 0.6 • [0.03 0.06] 0.66 • [0.02 0.05] 0.71 • [0.004 0.03] 0.76 [-0.01 0.01] 0.81 [-0.02 6E-4] 0.84 [-0.01 2E-4] 0.85 [-0.01 0.01]

Notation and the statistical testing procedure are as described in the note to Table 6.

Table 27: Micro F1 Performance of MMR against eight well-known metrics on the 20NG Data set averaged over 5 trials with MNB classifier

|G|  10    20    50    100   200   500   1,000  1,500
MMR  0.3   0.5   0.68  0.73  0.75  0.76  0.77   0.78

NDM 0.05 • [0.23 0.27] 0.1 • [0.38 0.42] 0.22 • [0.44 0.47] 0.4 • [0.31 0.34] 0.58 • [0.16 0.18] 0.67 • [0.08 0.1] 0.7 • [0.07 0.08] 0.72 • [0.05 0.07]

POIS 0.26 • [0.02 0.07] 0.37 • [0.11 0.15] 0.51 • [0.16 0.19] 0.62 • [0.1 0.12] 0.68 • [0.06 0.08] 0.75 [-0.003 0.02] 0.79 ◦ [-0.03 -0.01] 0.8 ◦ [-0.03 -0.01]


ACC2 0.28 [-0.002 0.04] 0.48 • [5E-4 0.04] 0.61 • [0.06 0.09] 0.67 • [0.04 0.07] 0.71 • [0.03 0.05] 0.75 • [0.004 0.02] 0.77 [-0.01 0.01] 0.79 [-0.02 0.004]


ODDS 0.05 • [0.23 0.27] 0.1 • [0.38 0.42] 0.22 • [0.44 0.47] 0.32 • [0.4 0.42] 0.41 • [0.33 0.35] 0.5 • [0.25 0.27] 0.55 • [0.21 0.23] 0.58 • [0.19 0.21]


CHI 0.26 • [0.02 0.07] 0.37 • [0.11 0.15] 0.5 • [0.16 0.2] 0.6 • [0.11 0.13] 0.67 • [0.07 0.09] 0.74 • [0.01 0.03] 0.78 ◦ [-0.02 -3E-4] 0.8 ◦ [-0.02 -0.005]


DFS 0.27 • [0.01 0.06] 0.38 • [0.1 0.14] 0.54 • [0.12 0.15] 0.61 • [0.1 0.12] 0.68 • [0.06 0.08] 0.71 • [0.04 0.06] 0.73 • [0.04 0.06] 0.74 • [0.04 0.06]

GI 0.42 ◦ [-0.14 -0.1] 0.48 • [0.002 0.05] 0.57 • [0.1 0.13] 0.62 • [0.1 0.12] 0.66 • [0.08 0.1] 0.72 • [0.03 0.05] 0.76 [-6E-4 0.02] 0.78 [-0.01 0.01]

IG 0.27 • [0.01 0.05] 0.39 • [0.09 0.13] 0.58 • [0.08 0.12] 0.68 • [0.04 0.06] 0.74 • [9E-4 0.02] 0.77 ◦ [-0.02 -0.002] 0.79 ◦ [-0.02 -0.01] 0.8 ◦ [-0.02 -0.005]

Notation and the statistical testing procedure are as described in the note to Table 6.


5.7. Discussion of Results
In this work, we have used data sets with different characteristics, such as class skew and different ratios of the number of terms to the number of documents, to investigate the performance of our newly proposed metric MMR. Statistical analysis of the macro and micro F1 values of the 9 feature ranking metrics, obtained on 5 different splits of training and test sets using the SVM and MNB classifiers, was conducted with the help of ANOVA and a multiple comparisons test based on the Tukey-Kramer method. Table 28 provides a summary of the percentage wins, losses and ties of MMR on the 6 data sets. We can see that the wins of MMR are significantly higher than the losses. The MMR metric exhibits up to 98.5% wins on the WAP data set, while the highest loss was observed on the 20NG data set. Table 29 summarizes the percentage wins, losses and ties of MMR against each metric. We find that IG is the closest competitor of MMR, as it has the highest percentage of losses among the 8 metrics. It was also the second best feature ranking metric after MMR in most of the cases. We can also observe that MMR has shown the highest percentage of wins against odds ratio. Next, the highest macro and micro F1 values attained by the metrics with the smallest subsets of top ranked terms on the data sets are summarized in Table 30. It can be seen that a subset of top ranked terms as small as 50 was selected by MMR to attain the highest macro F1 value on the K1b data set. Similarly, MMR has shown the highest micro F1 value on a subset of the top 100 ranked terms on the K1b data set. On the 20NG data set, MMR could not obtain the highest values.
Attaining high classification performance on text data is a big challenge. High class skew and sparsity are considered to be the reasons behind this difficulty. The concept behind the design of MMR is to enhance the rank of terms that are almost absent in one class and frequent in the other class. As text data is multi-class and skewed by nature, the negative class becomes the larger class due to the one-versus-rest setting. Therefore, frequent terms in the positive class may get higher tpr values as compared to the fpr values of frequent terms in the negative class. As MMR does not depend only on the absolute difference of tpr and fpr, it is biased towards frequent terms in the positive class. As a result, frequent terms in the positive class will be ranked higher than frequent terms in the negative class. In this way, MMR is able to place the best features at the beginning of its ranked list. However, selecting the denominator value in the MMR metric for the terms that are completely absent in one class is a challenge. A wrong


choice can negatively affect the feature ranking. For our experiments, we have selected the value of the denominator to be ε = 1/N when min(tpr, fpr) → 0, where N is the total number of documents in a corpus. For highly skewed and very large data sets, like the 20 Newsgroups data, fpr can attain values as low as 1/18,828 = 0.00005 (where 18,828 is the total number of documents in the 20NG data set), which can drastically increase the rank of relatively weak terms. Although the numerator was added in MMR, as compared to NDM, to handle such cases, it may still skip some terms. The situation becomes worse for infrequent terms present in only one class. Such terms may unjustifiably be assigned higher ranks, since the small value of the numerator cannot compensate for the very small denominator. This can be observed in large data sets, which contain a large number of terms and high sparsity. We expect such data sets to contain more terms which are present in only one class and are infrequent in the other class. The 20NG data set is an example of such a data set, where MMR performs comparably to other metrics.
For each feature ranking metric, eight nested subsets S1 ⊆ S2 ⊆ . . . ⊆ S8 were generated from its ranked list by progressively adding terms of decreasing importance, in order to find an optimal smaller subset of terms. The sizes of these subsets were 10, 20, 50, 100, 200, 500, 1,000 and 1,500 terms. The first subset that we evaluate with a classifier consists of the top 10 ranked terms. As the size of the nested subset increases, or in other words, as we add more and more terms to the subset, a point comes where the feature ranking metric exhibits its best performance. Beyond that point, more and more less relevant terms are added, which causes the accuracy to decrease. The same is exhibited by MMR. As the top ranked terms selected by MMR are highly discriminating (they are almost absent in one class) compared to those of the other metrics, we observe a better performance for MMR for nested subsets with 500 or fewer terms. This means that MMR finds relatively more relevant terms and places them at the beginning of its ranked list compared to its competitors. Those terms that come after the top 500 ranked terms are less relevant and cause the accuracy to decrease. In other words, as the number of terms increases, terms present in both classes start to appear in the selected set of terms. At this stage, the sets of terms selected by the other feature ranking metrics begin to overlap with the set of terms selected by MMR. Therefore, MMR's performance tends to decrease and resembles the performance of other good feature ranking metrics such as IG and DFS.
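To make the scoring and subset-generation procedure described above concrete, the following is a minimal sketch in Python. It assumes a binary document-term matrix and one-versus-rest class splits, uses ε = 1/N as the denominator floor discussed above, and aggregates the per-class scores by taking the maximum over classes; the function names and the aggregation choice are illustrative assumptions, and the exact definition of MMR is the one given earlier in the paper.

import numpy as np

def mmr_scores(X, y, eps=None):
    # Sketch of an MMR-style score: for each term and each one-vs-rest split,
    # score = (max(tpr, fpr) / max(min(tpr, fpr), eps)) * |tpr - fpr|,
    # with eps = 1/N used when the smaller rate approaches zero.
    X = (np.asarray(X) > 0)
    y = np.asarray(y)
    n_docs, n_terms = X.shape
    if eps is None:
        eps = 1.0 / n_docs                      # denominator floor, as discussed above
    scores = np.zeros(n_terms)
    for c in np.unique(y):
        pos = (y == c)
        tpr = X[pos].sum(axis=0) / max(pos.sum(), 1)       # term rate in class c
        fpr = X[~pos].sum(axis=0) / max((~pos).sum(), 1)   # term rate outside class c
        hi = np.maximum(tpr, fpr)
        lo = np.maximum(np.minimum(tpr, fpr), eps)
        scores = np.maximum(scores, (hi / lo) * np.abs(tpr - fpr))
    return scores

def nested_subsets(scores, sizes=(10, 20, 50, 100, 200, 500, 1000, 1500)):
    # Top-ranked term indices for each nested subset size S1 ⊆ ... ⊆ S8.
    order = np.argsort(scores)[::-1]
    return {k: order[:k] for k in sizes if k <= len(order)}

In a typical run, mmr_scores would be computed on the training split only, and each nested subset would then be used to restrict the term vocabulary before training the SVM or MNB classifier.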

50

CR IP T

ACCEPTED MANUSCRIPT

Table 28: Percentage Wins, Losses and Ties of MMR on Data sets

In terms of macro F1
            MNB                          SVM
Data set    Wins     Losses   Ties       Wins     Losses   Ties
WAP         98.5%    0        1.5%       79.7%    0        20.3%
K1a         95.3%    0        4.7%       73.4%    3.2%     23.4%
K1b         84.4%    0        15.6%      76.6%    1.6%     21.8%
RE0         79.7%    0        20.3%      62.5%    0        37.5%
RE1         71.8%    0        28.2%      60.9%    0        39.1%
20NG        81.3%    9.4%     9.4%       50%      0        50%

In terms of micro F1
            MNB                          SVM
Data set    Wins     Losses   Ties       Wins     Losses   Ties
WAP         95.3%    0        4.7%       84.4%    1.6%     14%
K1a         92.2%    0        7.8%       64.1%    4.7%     31.2%
K1b         76.6%    3.2%     20.2%      65.6%    7.8%     26.6%
RE0         75%      0        25%        45.3%    6.3%     48.4%
RE1         79.7%    0        20.3%      59.4%    0        40.6%
20NG        78.2%    12.5%    9.3%       76.6%    0        23.4%

For each data set, we evaluate 8 subsets of top ranked features of a metric against MMR using a classifier. There are 8 competing metrics, so the performance of MMR is evaluated on 64 subsets for each data set and classifier. If the results of MMR are statistically significantly better than those of its competitor, it is counted as a win for MMR; if the performance of MMR is statistically significantly poorer than that of its competitor, it is counted as a loss. Otherwise, it is taken to be a tie.
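The sketch below makes this decision rule concrete for a single comparison: it takes the five per-split F1 values of MMR and of one competing metric for one subset size and applies one-way ANOVA followed by a Tukey-Kramer style multiple comparison from statsmodels. It is a simplified two-group illustration of the procedure described in the text (our actual analysis considers all nine metrics together), and the significance level α = 0.05 is an illustrative assumption.

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def win_loss_tie(f1_mmr, f1_other, alpha=0.05):
        # f1_mmr, f1_other: F1 values of the two rankings over the 5 train/test splits.
        f1_mmr = np.asarray(f1_mmr, dtype=float)
        f1_other = np.asarray(f1_other, dtype=float)
        _, p = f_oneway(f1_mmr, f1_other)                        # one-way ANOVA across the splits
        if p >= alpha:
            return "tie"
        scores = np.concatenate([f1_mmr, f1_other])
        groups = ["MMR"] * len(f1_mmr) + ["other"] * len(f1_other)
        tukey = pairwise_tukeyhsd(scores, groups, alpha=alpha)   # Tukey-Kramer comparison
        if not tukey.reject[0]:                                  # pairwise difference not significant
            return "tie"
        return "win" if f1_mmr.mean() > f1_other.mean() else "loss"

    # Example with invented F1 values:
    # win_loss_tie([0.78, 0.79, 0.80, 0.77, 0.81], [0.74, 0.75, 0.73, 0.76, 0.74])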


Table 29: Percentage Wins, Losses and Ties of MMR against metrics

In terms of macro F1
            MNB                          SVM
Metric      Wins     Losses   Ties       Wins     Losses   Ties
DFS         93.8%    0        6.2%       56.3%    2%       41.7%
CHI         83.4%    2%       14.6%      68.8%    0        31.2%
ODDS        100%     0        0          85.4%    0        14.6%
ACC2        68.8%    0        31.2%      54.2%    0        45.8%
NDM         91.7%    0        8.3%       83.3%    0        16.7%
POIS        85.4%    4.2%     10.4%      68.8%    0        31.2%
GI          89.6%    2%       8.4%       72.9%    0        27.1%
IG          66.7%    6.2%     27.1%      47.9%    4.2%     47.9%

In terms of micro F1
            MNB                          SVM
Metric      Wins     Losses   Ties       Wins     Losses   Ties
DFS         85.4%    4.2%     10.4%      52.1%    4.2%     43.8%
CHI         85.4%    4.2%     10.4%      81.3%    0        18.7%
ODDS        100%     0        0          91.7%    2.1%     6.2%
ACC2        68.8%    0        31.2%      33.4%    10.4%    56.2%
NDM         95.8%    0        4.2%       93.8%    0        6.2%
POIS        83.4%    4.2%     12.4%      79.2%    0        20.8%
GI          89.6%    2.1%     8.3%       72.9%    0        27.1%
IG          54.2%    6.3%     39.5%      27.1%    10.4%    62.5%

For each metric, we evaluate 8 subsets of top ranked features of that metric against MMR on a data set using a classifier. There are 6 data sets, so the performances of MMR and a metric are compared on 48 subsets for each classifier. If the results of MMR are statistically significantly better than those of its competitor, it is counted as a win for MMR; if the performance of MMR is statistically significantly poorer than that of its competitor, it is counted as a loss. Otherwise, it is taken to be a tie.


Table 30: The highest F1 values obtained by the smallest subset of top ranked terms

In terms of macro F1
            MNB                                    SVM
Data set    value   metric           subset size   value   metric           subset size
WAP         0.78    MMR              200           0.7     MMR              500
K1a         0.76    MMR              500           0.71    MMR              500
K1b         0.94    MMR              100           0.96    MMR              50
RE0         0.64    MMR              100           0.78    MMR              100
RE1         0.72    MMR              200           0.73    MMR              500
20NG        0.8     CHI, POIS, IG    1,500         0.84    MMR, ACC2, IG    1,500

In terms of micro F1
            MNB                                    SVM
Data set    value   metric           subset size   value   metric           subset size
WAP         0.84    MMR              500           0.84    MMR              500
K1a         0.85    MMR              1,000         0.86    MMR              1,500
K1b         0.96    DFS              200           0.98    MMR              100
RE0         0.76    MMR              500           0.84    MMR, DFS         200
RE1         0.92    MMR, IG          500           0.94    DFS              500
20NG        0.8     CHI, POIS, IG    1,500         0.85    MMR, ACC2, IG    1,500


6. Conclusions and Future Work

This paper proposes a new feature ranking metric called max-min ratio (MMR) for selecting the most relevant terms in text data. MMR is the product of the ratio between the maximum and the minimum of the false positive rate and the true positive rate, and their absolute difference. This allows it to assign higher scores to the more crucial terms and to filter out sparse and less discriminative terms. We have tested the performance of MMR against eight other metrics, namely the balanced accuracy measure, information gain, chi-squared, Poisson ratio, Gini index, odds ratio, distinguishing feature selector, and normalized difference measure, on six data sets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), and 20 Newsgroups. The quality of the subsets of top ranked terms is evaluated using the multinomial naive Bayes and support vector machines classifiers. To determine whether the results of MMR are statistically significant, we repeated our experiments 5 times with different splits of training and test sets and used the one-way analysis of variance method together with a multiple comparisons test based on the Tukey-Kramer method. We found that MMR outperforms the other 8 metrics in 76.6% of the cases in terms of the macro F1 measure and in 74.4% of the cases in terms of the micro F1 measure. Furthermore, MMR obtained the highest macro F1 and micro F1 values with the smallest subsets of terms on most of the data sets. MMR can thus enhance the performance of text classification tasks in terms of both efficiency and effectiveness.

Many applications of expert and intelligent information systems involve information management tasks such as text filtering, indexing of text, and searching for relevant text, to name a few. Such tasks can be solved efficiently and effectively via text classification. To date, the performance of text classification still faces several challenges that stem from the unique characteristics of its data, such as high class skewness and high sparsity of terms: for example, which model is most suitable for representing the data, which weighting scheme is best for representing terms, and how to select the most relevant terms. In this work, we have addressed the issue of selecting the most relevant terms by proposing a new feature ranking metric. However, such metrics do not capture the interactions among terms, and selecting terms in the context of other terms can make text classification more accurate. There is therefore a need to design efficient feature selection algorithms that find an optimal subset of the most relevant and least redundant terms; this can enhance not only the effectiveness of text classification but also its efficiency.

Another issue when designing feature selection algorithms for text data is that all terms of a document can get eliminated. A similar issue is the removal of all terms belonging to small categories. There is a need to develop effective local, or class-wise, feature selection algorithms to overcome these problems. As part of our future work, we are working toward designing algorithms that can address both the local and the global issues of text documents in a corpus.

The performance of text classification is also affected by the weighting scheme used for terms. Term frequency-inverse document frequency (TF-IDF) is one well-known scheme. In its attempt to reduce the bias towards longer documents, TF-IDF can distort a term's true importance: for example, the normalized frequency of a term in a related document may become equal to, or even smaller than, its normalized frequency in an unrelated document. We are therefore also working on a new term weighting scheme that does not require normalization by document length.
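As a toy numerical illustration of this effect (the counts below are invented for the example), consider a term that occurs several times in a long, related document but only once in a short, unrelated one; after division by document length, the unrelated document receives the larger weight.

    # Invented counts, for illustration only.
    tf_related, len_related = 5, 1000      # term occurs 5 times in a 1,000-term related document
    tf_unrelated, len_unrelated = 1, 50    # term occurs once in a 50-term unrelated document

    # Length-normalized term frequency (the tf component of a length-normalized weighting):
    print(tf_related / len_related)        # 0.005
    print(tf_unrelated / len_unrelated)    # 0.02 -> the unrelated document gets the larger weight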
