A three-stage unsupervised dimension reduction method for text clustering


Kusum Kumari Bharti ∗, P.K. Singh
Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management Gwalior, Morena Link Road, Gwalior, MP, India
∗ Corresponding author. Tel.: +91 9669934184. E-mail addresses: [email protected], kkusum [email protected] (K.K. Bharti), [email protected] (P.K. Singh).
Journal of Computational Science 5 (2014) 156–169. http://dx.doi.org/10.1016/j.jocs.2013.11.007


Article history: Received 28 February 2013; Received in revised form 25 October 2013; Accepted 25 November 2013; Available online 4 December 2013.
Keywords: Feature selection; Feature extraction; Dimension reduction; Sparsity; Three-stage model; Text clustering.

Abstract: Dimension reduction is a well-known pre-processing step in text clustering to remove irrelevant, redundant and noisy features without sacrificing performance of the underlying algorithm. Dimension reduction methods are primarily classified as feature selection (FS) methods and feature extraction (FE) methods. Though FS methods are robust against irrelevant features, they occasionally fail to retain important information present in the original feature space. On the other hand, though FE methods reduce dimensions in the feature space without losing much information, they are significantly affected by the irrelevant features. The one-stage models (FS/FE methods) and the two-stage models (a combination of FS and FE methods) proposed in the literature are not sufficient to fulfil all the above mentioned requirements of dimension reduction. Therefore, we propose three-stage dimension reduction models to remove irrelevant, redundant and noisy features from the original feature space without losing much valuable information. These models incorporate the advantages of the FS and the FE methods to create a low dimension feature subspace. The experiments over three well-known benchmark text datasets of different characteristics show that the proposed three-stage models significantly improve performance of the clustering algorithm as measured by micro F-score, macro F-score, and total execution time. © 2013 Elsevier B.V. All rights reserved.

1. Introduction

The continuous, phenomenal growth of Internet technology is resulting in an exponential increase in the amount of digital information. This exponential growth makes it difficult to process documents manually and to extract relevant information from the vast corpus in time. It creates a vital need for methods that organize documents automatically and facilitate retrieving the relevant information in time. Automatic text clustering is a key step to resolve this issue. Text clustering is one of the efficient ways to organize digital documents in a well structured format to facilitate quick and efficient retrieval of relevant information. It is also known as an unsupervised method as it creates clusters based on the intrinsic characteristics of the documents without using any prior class label information. Various machine learning approaches, e.g., k-means [21], fuzzy c-means [3], expectation-maximization clustering [7], quality threshold clustering [12], kernel k-means clustering [29],


density based clustering [17], and minimum spanning tree based clustering [41], have been proposed over the years for clustering. For text clustering, each document is represented as a vector using the Feature Vector Model (FVM), also known as the Vector Space Model (VSM) [28]. A widely used approach for document representation is the bag of words, where each term present in the document is considered as a separate feature/dimension. The features present in the documents, and hence in the FVM/VSM, are classified as relevant (provide useful information to discriminate different documents), irrelevant (provide no useful information in any context), redundant (information that is implicit in the remainder), and noisy (mislead the algorithm). The presence of irrelevant, redundant and noisy features not only increases the computational complexity of the underlying algorithm but also decreases accuracy and sometimes misleads it. In such cases, dimension reduction plays an important role; it not only speeds up the computation but also improves the performance. The primary aim of dimension reduction methods is to reduce dimensions in the feature space without sacrificing performance of the underlying algorithm. Traditionally, dimension reduction methods are classified as feature selection methods [30,4,39,25] and feature extraction methods [24,6,18].


Feature selection methods are broadly classified as filter, wrapper, and embedded methods. Filter methods assess relevance of feature subsets based on the intrinsic characteristics of the documents [10]. Wrapper and embedded methods evaluate quality of feature subsets using specific learning algorithm [15,27]. Therefore, they are computationally expensive and are specific to the utilized learning model. On the other hand, filter methods have better generalization characteristics as they ignore interaction with the learning algorithm. Therefore, filter feature selection methods are widely used for dimension reduction specially in the text clustering because of their simplicity, scalability, less computational complexity, and considerable accuracy [8,11,23]. Information gain (IG) [25], mutual information (MI) [43], chi-square (CHI) [19], gini index (GINI) [30], odds ratio (OR) [23], reliefF [16], relief [14], document frequency (DF) [20], term variance (TV) [20], term strength (TS) [39], and laplacian score (LS) [11], mean-median (MM), modified arithmetic mean geometric mean (AMGM), absolute cosine (AC) method [8] etc. are few examples of filter feature selection. The feature extraction methods create a distinct low dimensional space by a combination of the original high dimensional space. They reduce dimensions in the feature space without loss of much information. Principal component analysis (PCA) [24], latent semantic indexing (LSI) [6], multidimensional scaling (MDS) [18] etc. are few examples of the feature extraction methods. However, PCA has received much attention [36,35]. It is a standard statistical method to reduce dimensions in the feature space with an aim to minimize the loss of information. As it is also able to reduce the noise efficiently, it increases performance of the underlying algorithm [32]. A good dimension reduction method should: (1) identify relevant features and remove irrelevant features, (2) remove redundant features, (3) remove noisy features, (4) preserve the valuable information present in original feature space, and (5) significantly reduce dimension of the feature space without compromising performance of the algorithm. In order to achieve all these essential characteristics, in this work, we propose a novel three-stage model. This model systematically and significantly reduce dimensions in the original feature space by removing irrelevant, redundant, and noisy features without losing much valuable information. In the first stage, a FS method is applied to obtain a candidate feature set, which removes irrelevant features from features in the original feature space. In the second stage, again a FS method is applied to remove the redundant features from the relevant feature space, i.e., the obtained candidate set at stage one. Lastly, in the third stage, we apply a FE method to further reduce dimensions without losing much information in the relevant and non-redundant feature space obtained at stage two. This step also removes noisy features, thereby, positively contribute in improving performance of the underlying algorithm. We perform comprehensive experiments to compare performance of the proposed three-stage dimension reduction models to their variants of one-stage and two-stage models using k-means on three well-known benchmark text datasets 20 Newsgroups (N), Reuters-21,578 (R), and WebKB (W) of varying characteristics. 
The results are very promising and show that the proposed three-stage models improve performance of the clustering algorithm significantly. The rest of the paper is organized as follows. Section 2 presents a brief review and the related literature of dimension reduction. A brief introduction to mean absolute difference, meanmedian, absolute cosine measure, principal component analysis, and term weighting scheme are presented in Section 3. Section 4 explains our proposed methodology. The results of the proposed methods are analyzed and discussed in Section 5. Finally, conclusions and recommendations for future work are summarized in Section 6.


2. Review of existing techniques The irrelevant, redundant and noisy features of a document are unwanted as they do not help in identifying the basic context of the document. Rather, they unnecessary increase computation complexity of the underlying algorithm. Moreover, they significantly affect performance of the algorithm negatively. Identification and removal of these unwanted features is a challenge in the area of text clustering. Various studies have been conducted in literature [30,39,11,23,19,20,33,36,35,22,13,1,37,2,42] to remove such features without compromising performance of the underlying algorithm. In this section, we present a brief overview of recent development in the area of dimension reduction. DF [20], TV [20], TS [39], OR [23], CHI [19], GINI [30], and LS [11] are examples of one-stage dimension reduction methods. As none of the one-stage dimension reduction method is able to deal with all types of unwanted features altogether, a significant amount of empirical research has been conducted by various researchers [33,36,35,22,13,1,37,2,42] in hybrid models to deal with different types of unwanted features altogether. Some of the researchers name it as two-stage model [22,1,2] whereas some of others call it as the hybrid model [35,13] for dimension reduction. Various combinations of dimension reduction methods have been proposed in literature like FS-FE methods [36,22], filter and wrapper methods [35,1], filter-filter methods [42] to create reduced feature subspace. Information gain-genetic algorithm (IG-GA) [36], feature contribution degree-latent semantic indexing (FCD-LSI) [22], maximum relevance minimum redundancy-genetic algorithm (MRMR-GA) [1] are few examples of two-stage dimension reduction methods. Information gain-principal component analysis (IG-PCA) [35], maximum relevance minimum redundancy particle swarm optimization (mr2PSO) [37] are few examples of hybrid models. Uguz [36] uses hybrid methods to create informative reduced dimensional feature subspace. He proposes a filter-wrapper (IGGA) and a FS-FE method (IG-PCA) to convert high dimensional feature space into low dimensional subspace. First, each feature in the document is ranked based on its discriminative power for classification using the IG. In the second stage, a FS method (GA) and a FE method (PCA) is applied individually to reduce dimensions in the feature space. To evaluate the effectiveness of his proposed models, he uses k-nearest neighbour (KNN) and C4.5 decision tree algorithm on Reuters-21,578 and Classic3 datasets. The experimental results show that the proposed models are effective in terms of precision, recall and F-measure. Among the two proposed models the filter-wrapper (IG-GA) performs comparatively better. Though this method yields good performance in terms of precision, recall, and F-measure, it generates classifier specific feature subsets, hence leads to overfitting problem. Further, as the algorithm considers an interaction with classifier to select discriminative feature subset, it makes it computationally expensive. Uguz [35] also uses the FS-FE method (IG-GA) for Doppler signal selection. Usually, a strict term matching is used to represent a document in the VSM. It does not take into account the semantic correlation between features, hence, leads to a poor classification/clustering accuracy. To overcome this issue, Menga et al. [22] use LSI to replace the individual features with statistically derived conceptual indices. 
First, they use a feature selection method FCD to select discriminative features set and then construct a new semantic space using LSI method. They show the effectiveness of their model on a spam database categorization. Song and Park [33] also use LSI to reduce dimensions in the feature space. They demonstrate superiority of their approach genetic algorithm based on latent semantic (GAL) over conventional GA applied in VSM model on clustering of Reuters-21,578 documents dataset. Though, LSI


Table 1
Summarization of recent developments in dimension reduction.

Authors | Key concept | Measures | Methods | Algorithm used | Remarks
Uguz [35] | Removes irrelevant features through a two-stage process | IG, PCA, GA | Filter-wrapper and FS-FE | KNN, C4.5 | Generates classifier specific feature subsets
Menga et al. [22] | Quantify the relevance by considering the positivity of features with respect to the class label information | FCD, LSI | FS-FE | SVM | Uses statistically derived conceptual indices to replace the individual features
Song and Park [33] | Reduce dimension using LSI and then use GA to create clusters of documents | LSI | FE | GA | Transformation of the feature space is performed in a redundant feature space
Hsu et al. [13] | Utilize computational efficiency and accuracy of a classifier to select a discriminative set of features | IG, F-score, SBS, SFS | Filter-wrapper | SVM | Computationally expensive
Akadi et al. [1] | Combine MRMR and GA for gene selection | MRMR, GA | Filter-wrapper | SVM, NB | Does not deal with noisy features
Zhang et al. [42] | Propose a two-step procedure for dimension reduction | ReliefF, MRMR | Filter-filter | SVM, NB | Fails to retain the valuable information present in the original space
Unler et al. [37] | Use the MI available from the filter model to weigh the bit selection probabilities in the discrete PSO | MRMR, PSO | Filter-wrapper | SVM | Fails to remove noisy features
Uguz [36] | Integrate FS and FE methods for dimension reduction | IG, PCA | FS-FE | SVM | Automatically selects the number of features
Ferreira and Figueired [8] | Use dispersion based measures to quantify the relevancy of features | MM, MAD, AMGM, AC | Filter | SVM | Fails to remove noisy features

reduces dimensions in the feature space significantly, the reduced feature space still suffers from the irrelevant features. Hsu et al. [13] introduce a hybrid feature selection method, which amalgamates filter methods (F-score, IG) and a wrapper method (inverse sequential floating search method). First, they choose candidate feature subsets from the original feature space using F-score and IG separately and then refine the obtained feature subspaces using the wrapper method. They conduct experiments with two bioinformatics problems, namely protein disordered region prediction and gene selection in microarray cancer data, and show that equal or better prediction accuracy can be achieved with a smaller feature set. Foithong et al. [9] propose the concept of a hybrid filter-wrapper method for dimension reduction. They select the feature subspace using mutual information without requiring a user-defined factor for the selection of the candidate feature subset. They also reduce the computational cost of the wrapper method. They illustrate the effectiveness of their model using a classifier (multilayer perceptron) on 10 UCI datasets. Here, the authors focus only on the removal of irrelevant and redundant features; however, the presence of noisy features negatively affects performance of the underlying algorithm. Akadi et al. [1] introduce a two-stage selection process for gene selection. They combine MRMR with GA to create an informative gene subset. Initially, they use MRMR to filter out noisy and redundant genes from the high dimensional space and then use GA to select a subset of relevant discriminative features. In their study, the authors use support vector machine (SVM) and naive Bayes (NB) classifiers to estimate the fitness of the selected features. Their experimental results show that their model is able to select the smallest gene subset that obtains the best classification accuracy in leave-one-out cross-validation (LOOCV). Zhang et al. [42] and Unler et al. [37] also use MRMR to select discriminative features. Zhang et al. [42] integrate MRMR with ReliefF [16], which is an extension

of Relief [14]. Unler et al. [37] combine MRMR with discrete PSO to bring together the efficiency and accuracy of filters and wrappers to select a discriminative subset of features. The main problem with these methods is that, although they are effective in removing irrelevant and redundant features, they fail to remove noisy features. Various studies have been conducted by researchers [1,37,2] to remove irrelevant and redundant features from the original feature space. Continuing with this concept, Ferreira and Figueired [8] propose a FS-FS model to create a relevant and non-redundant feature space. At first, dispersion based methods (AMGM, MM) are used to assess the relevance of a feature. Then, the AC method is applied to remove redundant features. Their findings show that their proposed model significantly improves performance of the underlying algorithm. A summary of the above discussed papers is presented in Table 1.

3. Theoretical background

In this section, first we present the motivation for the proposed three-stage dimension reduction model in Section 3.1. In Section 3.2, we present a brief introduction to the underlying concepts and terminologies used. In Section 3.3, we present a brief introduction to the term weighting scheme.

3.1. Theoretical motivation

In the case of a one-stage model, dimension reduction can be performed either with a FS method [30,11,19] or with a FE method [38,31]. Removal of irrelevant, redundant and noisy features, which affect performance of the underlying algorithm negatively, is not possible with one-stage dimension reduction methods [30,39,16,14,20]. As the FE methods preserve most of the


information of the original feature space while reducing the dimension, their performance is significantly affected by the irrelevant and redundant features present in the original space. For this reason they are not a good choice in the presence of a high degree of irrelevant and redundant features in the feature space. On the other hand, FS methods are a better choice to remove irrelevant and redundant features. Though FS methods efficiently identify relevant features in the original feature space without being much affected by the irrelevant features, they fail to retain the information in the original space. This suggests that a hybrid model is a better way to utilize the advantages of one method to lessen the drawbacks of the other and to achieve a (near) globally optimal solution. Therefore, two-stage models have recently been proposed by various researchers [36,35,22,13,1]. In this case, the selection of discriminative features is performed with two different feature scoring methods having different characteristics. This methodology provides an additional advantage in dealing with features having different types of problem characteristics together, and hence constructs a more informative space compared to the one-stage model. There are four possible ways, FS-FS, FS-FE, FE-FS, and FE-FE, to conduct two-stage dimension reduction. The FE-FS and FE-FE combinations are not suitable choices for dimension reduction. In the case of the FE-FS method, the FE method usually transforms a high dimensional feature space into a non-interpretable low dimensional feature space, which makes it almost impossible to apply a FS method at a later stage to further reduce dimensions in the feature space. In the case of FE-FE methods, the transformation of the high dimensional feature space into a low dimensional space is performed with FE methods. Though these methods reduce the dimensions of the feature space significantly, the reduced feature space is highly affected by the irrelevant and redundant features. As a result, they rather deteriorate performance of the underlying algorithm. The remaining options, FS-FS and FS-FE, are used for dimension reduction. These methods efficiently remove irrelevant and redundant features, and irrelevant and noisy features, respectively, from the original feature space. Among the two, the FS-FE methods are relatively more popular than the FS-FS methods as they efficiently remove irrelevant features from the original feature space without losing much information. However, despite various advantages, two-stage dimension reduction methods suffer from some inherent drawbacks. Two-stage FS-FS methods remove irrelevant and redundant features only. Though FS-FS methods reduce dimensions in the feature space significantly, they fail to remove noisy features while preserving the valuable information present in the feature space. On the other hand, FS-FE methods first use a FS method to remove irrelevant features and then use a FE method to transform the high dimensional feature space into a low dimensional feature space, which retains as much of the information in the original feature space as possible. However, the transformation of the high dimensional feature space into the low dimensional feature space is performed in a redundant feature space, which affects the transformed feature space significantly. Therefore, to address these issues, we propose three-stage models, which systematically remove irrelevant, redundant and noisy features while preserving valuable information.
In view of the above discussion, it is clear that the three-stage dimension reduction can be performed in four ways: FS-FS-FE, FS-FS-FS, FS-FE-FE, and FS-FE-FS. In case of FS-FE-FS, FE methods usually transform a high dimensional feature space into non-understandable low dimensional feature space, which makes it almost impossible to apply FS methods at later stage. Therefore, it is also not a good choice for dimension reduction. In case of FS-FS-FS methods, it is possible that it removes some of the valuable information (features) also because of strict removal characteristics, which will deteriorate performance of the clustering algorithm significantly. In case


of FS-FE-FE methods, the FE-FE part only reduces dimensions in the feature space, which provides no additional advantage beyond reducing the computational complexity of the underlying algorithm. The FS-FS-FE combination works systematically: first, it removes irrelevant features from the original feature space using a FS method; then it removes redundant features from the relevant feature space using another FS method; and finally it transforms the high dimensional relevant and non-redundant feature space into a noiseless low dimension feature space, preserving valuable information, using a FE method. Therefore, in this work, we propose three-stage FS-FS-FE methods for dimension reduction. A detailed description of the proposed methodology is presented in Section 4.

In this work, we use MAD [8] and MM [8] to identify relevant features as they perform well for sparse as well as dense data, and we use AC [8] to remove redundant features from the relevant feature space. We use the feature extraction method principal component analysis [24] to transform the high dimensional feature space into a low dimensional feature space as it minimizes the loss of information and is also able to reduce the noise efficiently. Finally, we use the k-means clustering algorithm [21] to create clusters of documents. A brief introduction to these terminologies is presented below.

3.2. A brief of feature selection and feature extraction methods

3.2.1. Mean absolute difference (MAD)
This method [8] is a simplified form of term variance [20]. It assigns a relevance score to each feature by computing the average absolute difference of the samples from the mean value. The formulation of MAD is presented in Eq. (1):

MAD_i = (1/n) \sum_{j=1}^{n} |X_{ij} - \bar{X}_i|   (1)

where X_{ij} is the value of feature i with respect to document j and \bar{X}_i is the mean of feature i, which is computed as shown in Eq. (2):

\bar{X}_i = (1/n) \sum_{j=1}^{n} X_{ij}   (2)

3.2.2. Mean-median (MM)
This method [8] is a simplified form of skewness. It assigns a relevance score to each feature based on the absolute difference between the mean and the median of X_i, as shown in Eq. (3):

MM_i = |\bar{X}_i - median(X_i)|   (3)
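As a brief illustration (a sketch of ours, not the authors' code), the MAD and MM relevance scores of Eqs. (1)-(3) could be computed as follows, assuming a dense documents-by-features NumPy array; all names are illustrative.

```python
import numpy as np

def mad_scores(X: np.ndarray) -> np.ndarray:
    """Eq. (1): mean absolute deviation of each feature from its mean."""
    return np.abs(X - X.mean(axis=0)).mean(axis=0)

def mm_scores(X: np.ndarray) -> np.ndarray:
    """Eq. (3): absolute difference between each feature's mean and median."""
    return np.abs(X.mean(axis=0) - np.median(X, axis=0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.poisson(0.3, size=(100, 500)).astype(float)  # toy term-count matrix
    print(mad_scores(X)[:5], mm_scores(X)[:5])
```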

Both methods are a good choice for dense as well as sparse datasets [8]. The characteristics of these methods prompted us to use them for feature selection.

3.2.3. Absolute cosine (AC)
Instead of relying solely on a subspace of relevant features, various researchers [8,5,40] proposed removing redundant features as well to improve performance of the underlying algorithm. In [8], the authors introduce a new measure, absolute cosine (AC), for the removal of redundant features. They perform a comparative analysis of AC with the correlation coefficient (CC) [5], maximal information compression index (MICI) [5], and fast correlation based filter (FCBF) [40] to show its effectiveness empirically. A brief introduction is as follows. AC quantifies the similarity between two feature vectors by measuring the cosine of the angle between them. Its formulation is presented in Eq. (4):

cos(w_i, w_t) = |w_i \cdot w_t| / (\|w_i\| \|w_t\|)   (4)


Here w_i and w_t are the observed features, \cdot denotes the dot product and \|\cdot\| is the Euclidean norm. The measure reflects the angle between the given vectors. Its value varies between 0 and 1; 0 means the two features are orthogonal and 1 means they are collinear.

3.2.4. Principal component analysis (PCA)
Principal component analysis (PCA) [24] was introduced by Karl Pearson in 1901. It is also known as the Karhunen–Loeve or K–L method. It is a statistical method that uses an orthogonal linear transformation to transform a high dimensional feature space into a new low dimensional feature subspace. The number of transformed principal components may be less than or equal to the number of original variables: let m be the number of original variables and p the number of transformed variables, then p ≤ m. Transformation of the data to the new space is carried out in such a way that the greatest variance of the data under any projection lies on the first component (called the first principal component), the second greatest variance lies on the second component, and so on. In other words, each component has higher variance than its succeeding component and lower variance than its preceding component (refer to [32] for a detailed description of the mathematical process of PCA). PCA has several advantages: it is computationally inexpensive, it can handle sparse and skewed data, and it has the ability to remove noisy features. The most important step in PCA is the determination of the number of principal components, i.e., p, up to which the original datasets have to be transformed. In our study, we use the cumulative variance criterion for component (dimension) selection, as used by Uguz [36]. The sensible range for the cumulative variance (CV) is 70–90% [36]. In our study, we set it to 70% through experimental analysis.
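To make these two building blocks concrete, the following sketch (our illustration under stated assumptions, not the paper's implementation) drops one feature of every consecutive pair whose absolute cosine of Eq. (4) exceeds a threshold, and then chooses the number of principal components by a cumulative-variance target. The 0.4 and 0.70 values follow the settings reported in Section 5.3; the function and variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def absolute_cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Eq. (4): |u . v| / (||u|| ||v||), with a zero-norm guard."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0 else abs(np.dot(u, v)) / denom

def drop_redundant(X: np.ndarray, max_sim: float = 0.4) -> list[int]:
    """Keep feature j unless it is too similar to its successor (cf. Algorithm 4.1, Stage 2)."""
    kept = [j for j in range(X.shape[1] - 1)
            if absolute_cosine(X[:, j], X[:, j + 1]) <= max_sim]
    kept.append(X.shape[1] - 1)          # the last feature is always kept
    return kept

def pca_by_cumulative_variance(X: np.ndarray, cv: float = 0.70) -> np.ndarray:
    """scikit-learn picks the smallest number of components reaching the CV target."""
    return PCA(n_components=cv).fit_transform(X)
```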

3.3. Term weighting
The vectors representing the documents are constructed based on the frequency of occurrence of each feature in the respective documents. The formulation of the term weighting scheme [26] used in this work is presented in Eq. (5):

TF-IDF(i, d) = wf_{id} \cdot \ln(N / df_i) if wf_{id} ≥ 1; 0 otherwise   (5)
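A small sketch of the weighting in Eq. (5) as reconstructed above (the symbols are defined in the paragraph that follows); it assumes a dense matrix of raw term counts and is illustrative only.

```python
import numpy as np

def tf_idf(counts: np.ndarray) -> np.ndarray:
    """counts: documents-by-terms matrix of raw word frequencies wf_id."""
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                 # document frequency df_i
    idf = np.log(n_docs / np.maximum(df, 1))      # ln(N / df_i), guarding df_i = 0
    weights = counts * idf                        # wf_id * ln(N / df_i)
    weights[counts < 1] = 0.0                     # the "0 otherwise" branch
    return weights
```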

Here TF-IDF(i, d) is the TF-IDF weight of the ith word (feature) in the dth document, wf_{id} is the word frequency (i.e., the frequency of the ith word in the dth document), N is the total number of documents in the corpus, and df_i is the document frequency of the ith word (i.e., the number of documents in the corpus that include the ith word). In this study, we use k-means clustering to create clusters of documents. Details of the k-means clustering can be found in [21].

4. Proposed three-stage unsupervised dimension reduction methods

Here, we present a detailed description of the proposed three-stage dimension reduction model. Algorithm 4.1 presents a pseudo-code of the proposed model. The algorithm consists of three stages: (i) relevant feature selection (lines 1–14) to select the top m′ ranked features based on a relevance score function, (ii) redundant feature removal (lines 15–23) to remove redundant features from the relevant feature subspace and create a relevant and non-redundant feature subspace with m′′ features, followed by creation of the VSM with the m′′ features (line 24), and (iii) final vector generation (line 25) after transformation of the high dimensional feature space having m′′ dimensions into a low dimensional space with m′′′ dimensions.

Algorithm 4.1. Three-stage unsupervised procedure for dimension reduction

INPUT:
I: an n × m matrix, where n is the number of documents and m is the number of features.
Const1: threshold to automatically select an adequate number of features.
Const2: maximum similarity allowed between an observed pair of features.
{SF1} and {SF2}: empty sets to store selected features for further processing.
OUTPUT:
O: an n × m′′′ matrix, where m′′′ is the dimension of the reduced feature subspace, with dimensions sorted in decreasing order of their variance.
Algorithm:
// Stage 1: Identify relevant features.
1:  Compute the relevance rel_i of each feature I_i using a feature scoring function.
2:  Sort the features in decreasing order of relevance, i.e., rel_1 ≥ rel_2 ≥ ... ≥ rel_m
3:  CR = 0; sum = 0   /* Initialize cumulative relevance (CR) and sum with zero */
4:  for j = 1 to m do
5:      sum = sum + rel_j   /* Compute the sum of the relevance scores of all features */
6:  end for
7:  for j = 1 to m do
8:      CR = CR + rel_j / sum   /* Compute cumulative relevance (CR) */
9:      if (CR ≤ Const1) then
10:         {SF1} = {SF1} ∪ F_j   /* Keep the current feature for further processing */
11:     else
12:         break
13:     end if
14: end for
// Now the number of selected features is m′ such that m′ ≤ m.
// Stage 2: Remove redundant features from the relevant feature space.
15: for j = 1 to m′ − 1 do
16:     s = Sim(F_j, F_{j+1})   /* Compute the similarity between two consecutive features */
17:     if (s ≤ Const2) then
18:         {SF2} = {SF2} ∪ F_j   /* Keep the current feature for further processing */
19:     end if
20:     if ((j + 1) = m′) then
21:         {SF2} = {SF2} ∪ F_{j+1}   /* Always keep the last feature */
22:     end if
23: end for
// Now the number of selected features is m′′ such that m′′ ≤ m′.
24: Create an n × m′′ matrix using the TF-IDF measure (refer to Eq. (5)).
// Stage 3: Create the final reduced dimensional matrix.
25: Transform the n × m′′ matrix into an n × m′′′ matrix using the FE method PCA.
    /* Use the cumulative variance (CV) criterion to select the number of principal components. */
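The following compact sketch (ours, not the authors' released code) strings the three stages together in Python, using MAD as the stage-one scorer (MM is analogous). The CR, ST and CV thresholds follow Section 5.3; everything else, including the toy data, is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def three_stage_reduce(counts: np.ndarray, cr: float = 0.90,
                       st: float = 0.4, cv: float = 0.70) -> np.ndarray:
    # Stage 1: rank features by MAD relevance; keep the top ones covering CR of the total score.
    rel = np.abs(counts - counts.mean(axis=0)).mean(axis=0)
    order = np.argsort(rel)[::-1]
    cum = np.cumsum(rel[order]) / max(rel.sum(), 1e-12)
    stage1 = order[: max(1, int((cum <= cr).sum()))]

    # Stage 2: drop a feature whose absolute cosine with its successor exceeds ST.
    X = counts[:, stage1]
    keep = [j for j in range(X.shape[1] - 1)
            if abs(X[:, j] @ X[:, j + 1]) <=
               st * np.linalg.norm(X[:, j]) * np.linalg.norm(X[:, j + 1])]
    keep.append(X.shape[1] - 1)
    X = X[:, keep]

    # TF-IDF weighting (Eq. (5)) on the relevant, non-redundant features.
    idf = np.log(X.shape[0] / np.maximum((X > 0).sum(axis=0), 1))
    X = X * idf

    # Stage 3: PCA keeping the smallest number of components with >= CV cumulative variance.
    return PCA(n_components=cv).fit_transform(X)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts = rng.poisson(0.2, size=(300, 2000)).astype(float)   # toy term-document counts
    reduced = three_stage_reduce(counts)
    labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(reduced)
    print(reduced.shape, np.bincount(labels))
```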


In this work, at the first stage we use MM and MAD individually to calculate the relevance score of each feature. These methods assign a relevance score to each feature based on the deviation of the feature from its mean value. The mean value shows the average distribution of a feature among the given set of documents. A high deviation of a feature from its mean value indicates that the feature is non-uniformly distributed among the given set of documents; on the other hand, a small value shows that the feature is uniformly distributed among the given set of documents. Uniformly distributed features are not able to efficiently discriminate the different documents. Thus, we select the top m′ features which deviate most from their mean value. To automatically select the best discriminative feature subset, we use the cumulative relevance (CR) criterion as reported in [8]. In the second stage, we remove redundant features from the relevant feature space using the AC measure, which reduces dimensions in the feature space from m′ to m′′. The MAD AC and MM AC combinations were proposed by Ferreira and Figueired [8]. In this paper, we also perform experiments with FS-FE methods; in this case, we use the cumulative variance (CV) criterion to create the informative feature subset. Finally, in the third stage, we create two different models with MM and MAD, i.e., MM AC PCA and MAD AC PCA. In this stage, we transform the m′′ dimensional space into m′′′ dimensions using the feature extraction method PCA. The main reason for using PCA is that it transforms a high dimensional feature space into a noiseless low dimensional subspace without losing much information. Thus, we reduce the n × m′′ feature space to an n × m′′′ feature space. Next, we use the k-means clustering algorithm to create clusters of documents with the transformed reduced feature subspace.


5. Experimental setup

This section presents empirical results of the proposed three-stage model in comparison to the results of the one-stage and two-stage models to illustrate the efficiency and effectiveness of the proposed models. We use two well-known performance measures, micro F-score and macro F-score, for this purpose on three well-known benchmark datasets, 20 Newsgroups (N), Reuters-21,578 (R), and WebKB (W). These datasets differ considerably in the number of features, skewness, and sparsity. All experiments are performed in a Windows 7 environment on a machine with a Core i7 processor and 2 GB RAM. All feature selection algorithms are coded in Java, and the final clusters of the documents using k-means have been obtained using MATLAB.

5.1. Evaluation criteria

The micro F-score and macro F-score values are used to evaluate the effectiveness of the clustering algorithm with different dimension reduction methods. F-score is the harmonic combination of precision and recall employed in information retrieval systems. Precision is the fraction of a cluster that consists of documents of a specified class, and recall measures the extent to which a cluster comprises all documents of a particular class. For a particular category i, the precision (P), recall (R), and F-score (F) are defined as follows:

P_i = TP_i / (TP_i + FP_i)   (6)
R_i = TP_i / (TP_i + FN_i)   (7)
F_i = (2 × P_i × R_i) / (P_i + R_i)   (8)

The micro and macro F-scores are computed by

F_micro = (2 × P × R) / (P + R)   (9)
F_macro = (\sum_{i=1}^{k} F_i) / k   (10)

Here P and R are the precision and recall values across all classes, respectively. The micro F-score and macro F-score emphasize performance of the system on common and rare categories, respectively. As a result, for text datasets of skewed categories, micro F-score and macro F-score may give quite different evaluations. Thus, using these averages, we can carefully examine the effect of different kinds of data on the clustering algorithm.
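A minimal sketch of Eqs. (6)-(10), assuming the per-category TP, FP and FN counts have already been derived from the clustering result (e.g., by mapping each cluster to its dominant class); names are illustrative and this is not the authors' evaluation code.

```python
import numpy as np

def micro_macro_f(tp: np.ndarray, fp: np.ndarray, fn: np.ndarray) -> tuple[float, float]:
    p_i = tp / np.maximum(tp + fp, 1)
    r_i = tp / np.maximum(tp + fn, 1)
    f_i = 2 * p_i * r_i / np.maximum(p_i + r_i, 1e-12)   # Eq. (8), per category
    p = tp.sum() / max((tp + fp).sum(), 1)               # pooled precision across classes
    r = tp.sum() / max((tp + fn).sum(), 1)               # pooled recall across classes
    f_micro = 2 * p * r / max(p + r, 1e-12)              # Eq. (9)
    f_macro = float(f_i.mean())                          # Eq. (10)
    return f_micro, f_macro
```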

5.2. Datasets and parameter setting

20 Newsgroups (N): The 20 Newsgroups dataset was collected by Ken Lang in 1995. It consists of approximately 20,000 articles distributed almost evenly among 20 different UseNet discussion groups. This dataset is only slightly skewed in nature compared to the R and W datasets. For example, the largest category is assigned 595 documents, whereas the smallest category is assigned 480 documents. In our experiment, we select 3383 documents belonging to six categories.
Reuters-21,578 (R): The Reuters-21,578 collection is a set of news published by the Reuters newswire in 1987. It consists of 21,578 documents distributed among 135 thematic categories, mostly concerning business and economy. It is highly skewed in nature. For example, the largest category is assigned 2840 documents, whereas the smallest category is assigned only 41 documents. In our experiment, we randomly select 5485 documents belonging to eight categories.
WebKB (W): The WebKB (World Wide Knowledge Base) dataset was collected by Craven in 1998. It contains 8282 web pages gathered from four academic domains. The original dataset has seven categories, but only four of them, course, faculty, project and student, are used.

Table 2
Descriptive statistics of the datasets used in experimentation.

Dataset | 20 Newsgroups | Reuters-21,578 | WebKB
Abbreviation | N | R | W
No. of doc. | 3383 | 5485 | 2803
No. of classes | 6 | 8 | 4
No. of features | 24,718 | 14,534 | 7,178
Class size min. | 480 | 41 | 336
Class size max. | 595 | 2840 | 1097
Skewness | Slightly skewed | Highly skewed | Comparatively less skewed than R
Sparsity | Less sparse than R | Highly sparse | Less sparse than R and NG
Name | Alt.Atheism (480), Comp.Graphics (584), Misc.Forsale (585), Rec.Autos (594), Sci.Crypt (595), Talk.Politics.Guns (545) | Acq (1596), Crude (253), Earn (2840), Grain (41), Interest (190), Money-Fx (206), Ship (108), Trade (251) | Project (336), Course (620), Faculty (750), Student (1097)

Note: the above datasets can be downloaded from http://web.ist.utl.pt/acardoso/datasets/.


Fig. 1. Sparsity analysis of (a) N, (b) R, and (c) W datasets. (For interpretation of references to colour in the text, the reader is referred to the web version of this article.)

Table 3
Summary of experimental models.

Models | Variation | Description | Implementation
One-stage | MM | Mean-median | MM is used for relevant feature identification
One-stage | MAD | Mean absolute difference | MAD is used for relevant feature identification
Two-stage | MM AC | Mean-median, absolute cosine | Initially MM is used for relevant feature identification, then AC is used to remove redundant features
Two-stage | MAD AC | Mean absolute difference, absolute cosine | Initially MAD is used for relevant feature identification, then AC is used to remove redundant features
Two-stage | MM PCA | Mean-median, principal component analysis | Initially MM is used for relevant feature identification, then PCA is used to transform the high dimensional relevant feature space into a low dimensional subspace
Two-stage | MAD PCA | Mean absolute difference, principal component analysis | Initially MAD is used for relevant feature identification, then PCA is used to transform the high dimensional relevant feature space into a low dimensional subspace
Three-stage | MM AC PCA | Mean-median, absolute cosine, principal component analysis | Transforms the relevant and non-redundant high dimensional feature space obtained through MM AC to a low dimensional subspace
Three-stage | MAD AC PCA | Mean absolute difference, absolute cosine, principal component analysis | Transforms the relevant and non-redundant high dimensional feature space obtained through MAD AC to a low dimensional subspace

It is skewed in nature. For example, the largest category is assigned 1097 documents, whereas the smallest category is assigned 336 documents. In our experiment, we select 2803 documents belonging to four categories. The characteristics of these datasets are summarized in Table 2.

If the majority of the elements of a matrix are zero, it is referred to as a sparse matrix; otherwise it is a dense matrix [34]. Here, we propose an effective measure to quantify the sparsity of the datasets. This measure computes the ratio of the number of nonzero entries to the total number of entries in the matrix.

Table 4
Number of features and total reduction percentage for the three datasets with and without dimension reduction.

Dataset | N | R | W
Features
NDR: Total | 24718 | 14534 | 7178
One-stage: MM | 7863 | 3972 | 3506
One-stage: MAD | 8278 | 4234 | 3670
Two-stage: MM AC | 7623 | 3891 | 3386
Two-stage: MAD AC | 7987 | 4133 | 3550
Two-stage: MM PCA | 420 | 409 | 330
Two-stage: MAD PCA | 425 | 419 | 335
Three-stage: MM AC PCA | 420 | 417 | 336
Three-stage: MAD AC PCA | 438 | 433 | 343
Reduction rate
One-stage: MM | 0.681892 | 0.72671 | 0.511563
One-stage: MAD | 0.665102 | 0.708683 | 0.488716
Two-stage: MM AC | 0.691601 | 0.732283 | 0.528281
Two-stage: MAD AC | 0.676875 | 0.715632 | 0.505433
Two-stage: MM PCA | 0.983008 | 0.971859 | 0.954026
Two-stage: MAD PCA | 0.982806 | 0.971171 | 0.95333
Three-stage: MM AC PCA | 0.983008 | 0.971309 | 0.95319
Three-stage: MAD AC PCA | 0.98228 | 0.970208 | 0.952215

Fig. 2. Comparison of all MM based models in terms of the micro F-score and macro F-score on the N, R and W datasets. (Panels (a), (b) and (c) correspond to the N, R and W datasets; each panel plots micro F-score and macro F-score against the number of selected features for the MM, MM AC, MM PCA and MM AC PCA models.)

Let n be the number of documents and m the total number of features, so that the size of the matrix is n × m. Further, let nz be the number of nonzero entries in the matrix. The sparsity of the dataset is then defined as

S = nz / (n × m)   (11)

The value of S varies in the range 0–1, where 0 indicates a highly sparse dataset (i.e., there is no nonzero element) and 1 refers to a highly dense dataset (i.e., all elements are nonzero). Its value is 0.013537 for the N dataset, 0.008447 for the R dataset, and 0.027038 for the W dataset. This indicates that the R dataset is comparatively sparse and the W dataset comparatively dense with respect to the other two datasets. The same is visualized pictorially in Fig. 1, where the blue dots show nonzero entries and the white dots show zero entries. The white dots are comparatively frequent in the R dataset and comparatively rare in the W dataset.
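For completeness, Eq. (11) amounts to a one-line computation on the term-document matrix; the sketch below assumes a dense NumPy array and is illustrative only.

```python
import numpy as np

def sparsity(X: np.ndarray) -> float:
    """Eq. (11): fraction of nonzero entries in the n x m matrix."""
    return np.count_nonzero(X) / X.size
```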


Fig. 3. Comparison of all MAD based models in terms of the micro F-score and macro F-score on the N, R and W datasets. (Panels (a), (b) and (c) correspond to the N, R and W datasets; each panel plots micro F-score and macro F-score against the number of selected features for the MAD, MAD AC, MAD PCA and MAD AC PCA models.)

5.3. User specified parameters It is required to specify values of the user-specified parameters cumulative relevance (CR), similarity threshold (ST), and cumulative variance (CV) to obtain a discriminative subset of features automatically. We run the models with different sets of values for CR, ST and CV to set the parameters and select the ones, which provide comparatively better results. With empirical experiments, we fixed the value of CR for one-stage models at 0.90, the ST for redundant feature identification is fixed at 0.4, and CV for three-stage models is fixed at 0.70.

In order to judge the performance (efficiency and efficacy) of the proposed three-stage models, we conduct experiments on the datasets with eight different variations, as shown in Table 3. The second column in the table defines an abbreviation for each variation, the third column describes it, and the last column illustrates details of the implementation. Table 4 illustrates the dimension reduction rate in the feature space compared to the original feature space. The first row in the table, labelled NDR (no dimension reduction), indicates the number of features in the original feature space of the datasets.


Fig. 4. Total execution time (seconds) for the MM based models with the N, R, and W datasets. (Panels: (a) 20 Newsgroups, (b) Reuters-21,578, (c) WebKB; each panel shows TET in seconds for the MM, MM AC, MM PCA and MM AC PCA models, broken down into the MM, AC, PCA and k-means phases.)

The reduction rate of the number of features is defined as

RR = 1 − (number of features obtained after FS) / (total number of features)   (12)

Table 4 shows that all dimension reduction models drastically reduce the dimensions in the feature space. The one-stage models reduce dimensions by between 49% and 73%, the two-stage models by between 51% and 98%, and the proposed three-stage models by between 95% and 98% of the original feature space.

5.4. Experimental evaluation This section presents the results obtained using several experiments performed to test the behavior of the proposed three-stage models vis-a-vis their performance in comparison with the one-stage and the two-stage models. Section 5.4.1 presents performance of the models with varying number of features (50%-100%) of the features selected using the model. To evaluate performance of the k-means clustering algorithm, we employ two performance measures, micro F-score and macro F-score. Section 5.4.2 illustrates performance of the competing models in terms of total execution time (TET) - time taken by each model from dimension reduction to cluster creation.

5.4.1. Experimental results and analysis

Figs. 2 and 3 illustrate the results of the empirical analysis applied on the three benchmark text datasets for k-means clustering using the different MM based and MAD based dimension reduction models, respectively. In Fig. 2, the vertical axis corresponds to the micro F-score and macro F-score recorded with the different dimension reduction models, and the horizontal axis corresponds to the varying number of feature dimensions. Various experiments have been conducted by Ferreira and Figueired [8], Uguz [36], and Menga et al. [22] with two-stage (hybrid) models, and they conclude that two-stage models improve performance of the underlying algorithm. We observe the same behaviour (refer to Fig. 2) through our experimental analysis. Although the feature space obtained with two-stage or hybrid models removes most of the irrelevant information, some non-informative features may still be present in the created feature subspace. To further refine the created feature subspace, we propose the three-stage dimension reduction model. The experimental results observed in this study are shown in Fig. 2. It can be clearly seen that the proposed three-stage models improve performance of the underlying algorithm significantly. Next, we perform the same analysis with the MAD based dimension reduction models. Fig. 3 illustrates the obtained plots of micro F-score and macro F-score (on the vertical axis) versus the varying number of feature dimensions (on the horizontal axis) on the N, R and W datasets.


Fig. 5. Total execution time (seconds) for the MAD based models with the N, R, and W datasets. (Panels: (a) 20 Newsgroups, (b) Reuters-21,578, (c) WebKB; each panel shows TET in seconds for the MAD, MAD AC, MAD PCA and MAD AC PCA models, broken down into the MAD, AC, PCA and k-means phases.)

It can be clearly seen that the proposed three-stage models perform better than the other dimension reduction models on all datasets except in the case of the W dataset. This analysis reveals that an informative feature space improves performance of the underlying algorithm significantly and that our proposed models are able to create a more informative feature space (relevant, non-redundant, and noiseless) compared to the other models. Hence, they help the underlying algorithm to perform significantly better in comparison to the other competing models.

5.4.2. Evaluation on efficiency

In this section, we perform a comparative study of the total execution time (TET) taken by each of the competing dimension reduction models to create clusters of documents. A pictorial representation of this study for the MM based models and the MAD based models is given in Figs. 4 and 5, respectively. In Figs. 4 and 5, the vertical axis corresponds to the time taken by each phase of the dimension reduction models to create clusters and the horizontal axis corresponds to the different dimension reduction models. Figs. 4 and 5 show that the total execution time for feature selection increases linearly from the one-stage to the three-stage dimension reduction models. However, the execution time of k-means clustering decreases linearly from the one-stage to the three-stage models. This leads to an important conclusion: though the multi-stage

models increase the total TET from dimension reduction to cluster creation, they reduce the execution time of the clustering algorithm significantly. The execution time of k-means clustering for the three-stage models is comparatively lower than for the one-stage and two-stage models in all cases.

5.5. Summary

Table 5 illustrates a summary of all the presented experiments. Here, '+' indicates that the proposed three-stage model is superior to the competing model for the considered measure; similarly, '−' indicates that the proposed model is inferior to the competing model. The first column in the table indicates the dataset and, in parentheses, the measure for which the computed values are shown in the following columns of that row. For example, N (NOD) in the first column of the first row indicates the number of dimensions (NOD) for the N dataset. In the second column, the first entry corresponds to the computed value for the MM based model and the second entry corresponds to the computed value for the MAD based model. For example, the first entry in the second column of the first row (7863-430) indicates that MM reduces the dimensions to 7863 whereas the proposed model MM AC PCA reduces the number of dimensions in the N dataset to 430. Similarly, the second entry in the second column of the first row (8278-438) indicates that MAD reduces the dimensions to 8278 whereas the proposed model MAD AC PCA reduces the dimensions to 438.


Table 5
Summary of experimental results ('+': the proposed three-stage model is superior to the competing model; '−': inferior; for each measure, the first line gives the MM based comparison and the second line the MAD based comparison).

Datasets | One-stage versus three-stage | Two-stage versus three-stage
N (NOD) | (7863-430) + | (7623-430) +, (420-430) +
        | (8278-438) + | (7987-438) +, (425-438) −
R (NOD) | (3972-417) + | (3891-417) +, (409-417) −
        | (4234-433) + | (4133-433) +, (419-433) −
W (NOD) | (3506-336) + | (3386-336) +, (330-336) −
        | (3670-343) + | (3550-343) +, (335-343) +
N (Micro F-score) | (20.52-31.17) + | (24.23-31.17) +, (25.38-31.17) +
        | (22.89-30.89) + | (24.33-30.89) +, (25.51-30.89) +
R (Micro F-score) | (29.38-38.80) + | (35.30-38.80) +, (32.62-38.80) +
        | (28.94-37.93) + | (33.51-37.93) +, (33.39-37.93) +
W (Micro F-score) | (35.01-38.86) + | (35.42-38.86) +, (35.93-38.86) +
        | (34.92-35.13) + | (35.10-35.13) +, (35.06-35.13) +
N (Macro F-score) | (7.31-17.31) + | (8.35-17.31) +, (14.57-17.31) +
        | (8.05-18.01) + | (10.09-18.01) +, (15.16-18.01) +
R (Macro F-score) | (26.80-36.05) + | (32.48-36.05) +, (30.19-36.05) +
        | (26.65-35.16) + | (30.27-35.16) +, (30.78-35.16) +
W (Macro F-score) | (28.23-29.78) + | (28.71-29.78) +, (29.24-29.78) +
        | (28.14-28.30) + | (28.23-28.30) +, (28.34-28.30) −
N (TET) | (1288.179-1377.586) − | (1265.552-1377.586) −, (1385.761-1377.586) +
        | (1485.475-1569.518) − | (1469.59-1569.518) −, (1570.881-1569.518) +
R (TET) | (1838.674-1920.706) − | (1826.485-1920.706) −, (1926.509-1920.706) +
        | (1792.233-1830.719) − | (1783.497-1830.719) −, (1850.581-1830.719) +
W (TET) | (202.703-262.9407) − | (201.204-262.9407) −, (268.5463-262.9407) +
        | (219.2171-290.0365) − | (214.4832-290.0365) −, (293.6416-290.0365) +
Total | 18/24 | 37/48

In a similar way, in the third column, the first entry corresponds to the computed value of the measure for MM AC, the second entry to MM PCA versus MM AC PCA, the third entry to MAD AC, and the fourth entry to MAD PCA. The last row summarizes the results of this analysis, i.e., how many times the proposed model obtains better results in comparison to the competing model.

Remarks on the measures in Table 5: NOD denotes the reduction in the number of dimensions, for which smaller values are preferable; for the micro F-score and macro F-score over the varying number of dimensions, larger values are preferable; TET covers the time from dimension reduction to clustering, for which smaller values are preferable.

Table 5 clearly illustrates that the proposed three-stage models perform better than the one-stage models and the two-stage models in most of the cases on all the three benchmark text datasets. It clearly demonstrates that the proposed models reduce dimensions in the feature space, improve performance of the clustering algorithm effectively, and considerably reduce the execution time of clustering algorithm efficiently in comparison to the one-stage models and the two-stage models.


6. Conclusion & future direction

Dimension reduction is one of the prominent pre-processing steps in text clustering. Various FS/FE methods and two-stage models based on FS and/or FE methods have been proposed in the literature for dimension reduction. In this paper, we propose a framework for three-stage dimension reduction models that combine the strengths of two FS methods and one FE method to create a relevant, non-redundant and noiseless feature subspace while preserving valuable information. The performance of the proposed models is compared with the one-stage and two-stage models with respect to dimension reduction rate, clustering accuracy, and execution time. We use three well-known benchmark text datasets, 20 Newsgroups, Reuters-21,578, and WebKB, having varying characteristics in terms of number of features, skewness and sparsity, to illustrate the efficacy and efficiency of the proposed models. The results and analysis of a thorough experimental study clearly indicate that the proposed three-stage models (a) reduce dimensions in the feature space effectively, (b) improve performance of the clustering algorithm with fewer features, and (c) reduce the total execution time (from dimension reduction to cluster creation). This is due to the fact that the first FS method in the proposed three-stage FS-FS-FE model effectively obtains the relevant features, the second FS method removes the redundant features, and finally the FE method efficiently reduces the dimensions (features) and noise while preserving valuable information. Therefore, the proposed three-stage models are effective and efficient in removing irrelevant, redundant and noisy features while preserving valuable information in high dimensional datasets. In this work, we restricted our experiments to limited feature-feature interaction to identify and remove irrelevant, redundant and noisy features from the original feature space while preserving valuable information; other feature-feature interactions remain to be explored. Therefore, we plan to study different dimension reduction methods and explore their interaction with each other in order to suggest other good combinations of FS and FE methods for the proposed three-stage model. Moreover, as the selection of feature subsets and the performance of the clustering algorithm greatly depend on the parameter values, developing a more efficient approach to determine the optimal parameter values should also be examined in future work. Most importantly, the initial cluster centroid selection affects performance of the clustering algorithm significantly. Therefore, we also plan to conduct an empirical analysis in order to select appropriate initial cluster centroids for k-means clustering.

References

[1] A. Akadi, A. Amine, A. Ouardighi, D. Aboutajdine, A two-stage gene selection scheme utilizing MRMR filter and GA wrapper, Knowledge and Information Systems 26 (2011) 487–500.
[2] R. Bala, R. Agrawal, A two-stage approach for relevant gene selection for cancer classification, in: IADIS European Conference Data Mining, 2009, pp. 127–132.
[3] J. Bezdec, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[4] K. Church, P. Hanks, Word association norms, mutual information and lexicography, Computational Linguistics 16 (1990) 22–29.
[5] P. Mitra, C.A. Murthy, S.K. Pal, Unsupervised feature selection using feature similarity, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 301–312.
[6] S. Deerwester, Improving information retrieval with latent semantic indexing, in: Proceedings of the 51st Annual Meeting of the American Society for Information Science 25, 1988, pp. 36–40.
[7] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B (Methodological) 39 (1977) 1–38.
[8] A. Ferreira, M. Figueiredo, Efficient feature selection filters for high-dimensional data, Pattern Recognition Letters 33 (2012) 1794–1804.
[9] S. Foithong, O. Pinngern, B. Attachoo, Feature subset selection wrapper based on mutual information and rough sets, Expert Systems with Applications 39 (2012) 574–584.
[10] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[11] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Proceedings of Neural Information Processing Systems, 2005, pp. 505–512.
[12] L. Heyer, S. Kruglyak, S. Yooseph, Exploring expression data: identification and analysis of coexpressed genes, Genome Research 9 (1999) 1106–1115.
[13] H. Hsu, C. Hsieh, M. Lu, Hybrid feature selection by combining filters and wrappers, Expert Systems with Applications 38 (2011) 8144–8150.
[14] K. Kira, L. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Association for the Advancement of Artificial Intelligence, AAAI Press and MIT Press, 1992, pp. 129–134.
[15] R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1997) 273–324.
[16] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in: Proceedings of the European Conference on Machine Learning, Springer-Verlag, 1994, pp. 171–182.
[17] H. Kriegel, P. Kröger, J. Sander, A. Zimek, Density-based clustering, WIREs Data Mining and Knowledge Discovery 1 (2011) 231–240.
[18] J.M.W. Kruskal, Multidimensional Scaling, Sage University Paper Series on Quantitative Applications in the Social Sciences, 1978.
[19] Y. Li, C. Luo, S. Chung, Text clustering with feature selection by using statistical data, IEEE Transactions on Knowledge and Data Engineering 20 (2008) 641–652.
[20] L. Liu, J. Kang, J. Yu, Z. Wang, A comparative study on unsupervised feature selection methods for text clustering, in: IEEE International Conference on Natural Language Processing and Knowledge Engineering, 2005, pp. 597–601.
[21] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 167–172.
[22] J. Menga, H. Lin, Y. Yu, A two-stage feature selection method for text categorization, Knowledge-Based Systems 62 (2011) 2793–2800.
[23] D. Mladenic, M. Grobelnik, Feature selection on hierarchy of web documents, Decision Support Systems 35 (2003) 45–87.
[24] K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine 1 (1901) 559–572.
[25] J. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[26] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24 (1988) 513–523.
[27] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007) 2507–2517.
[28] G. Salton, A. Wong, C. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (1975) 613–620.
[29] M. Girolami, Mercer kernel-based clustering in feature space, IEEE Transactions on Neural Networks 13 (2002) 780–784.
[30] W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, Z. Wang, A novel feature selection algorithm for text categorization, Expert Systems with Applications 33 (2007) 1–5.
[31] L. Shi, J. Zhang, E. Liu, P. He, Text classification based on nonlinear dimensionality reduction techniques and support vector machines, in: Proceedings of the Third International Conference on Natural Computation, 2007, pp. 674–677.
[32] J. Shlens, A Tutorial on Principal Component Analysis, Systems Neurobiology Laboratory, University of California at San Diego, 2005.
[33] W. Song, S. Park, Genetic algorithm for text clustering based on latent semantic indexing, Computers and Mathematics with Applications 57 (2009) 1901–1907.
[34] J. Stoer, R. Bulirsch, Introduction to Numerical Analysis, 3rd ed., Springer-Verlag, 2002.
[35] H. Uguz, A hybrid system based on information gain and principal component analysis for the classification of transcranial Doppler signals, Computer Methods and Programs in Biomedicine 107 (2011) 598–609.
[36] H. Uguz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems 24 (2012) 1024–1032.
[37] A. Unler, A. Murat, R. Chinnam, mr2PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification, Information Sciences 181 (2011) 4625–4641.
[38] X. Wang, K. Paliwal, Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition, Pattern Recognition 36 (2003) 2429–2439.
[39] Y. Yang, Noise reduction in a statistical approach to text categorization, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), 1995, pp. 256–263.
[40] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in: Proceedings of the International Conference on Machine Learning (ICML'03), 2003, pp. 856–863.
[41] C. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers 20 (1971) 68–86.
[42] Y. Zhang, C. Ding, T. Li, Gene selection algorithm by combining ReliefF and MRMR, in: IEEE 7th International Conference on Bioinformatics and Bioengineering, 2008, pp. 127–132.

[43] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5 (1994) 537–550.

Kusum Kumari Bharti is currently pursuing her Ph.D. in Information Technology at Atal Bihari Vajpayee-Indian Institute of Information Technology and Management Gwalior, Gwalior. She received her M.Tech degree from Maulana Azad National Institute of Technology, Bhopal. Her research interests include data mining and text mining.


Dr. Pramod Kumar Singh is currently working as Associate Professor at Atal Bihari Vajpayee-Indian Institute of Information Technology & Management Gwalior, where he has been for the last five years (2008 onwards). Prior to this, he was associated with the Indian Institute of Technology Kharagpur as Senior Networking Engineer for approximately seven years (2001–2008), with Sant Longowal Institute of Engineering and Technology Longowal (a Government of India enterprise) as Assistant Professor for more than two years (1999–2001), and with the National Institute of Technology Jalandhar (formerly Regional Engineering College, Jalandhar) as Lecturer/Senior Lecturer for more than nine years (1990–1999). His research interests include Soft Computing, Multi/Many-Objective Optimization, Data Mining, Computer Networks, and Wireless and Sensor Networks.